# LDPC Decoder for Photon-Starved Optical Communication

**Target:** Efabless chipIgnite (SkyWater 130nm, Caravel harness)

**Authors:** CAH + partner

**Date:** February 2026

---

## 1. What We're Building

A low-density parity-check (LDPC) decoder ASIC for photon-starved optical links -- deep space optical, underwater optical, or any scenario where photons are scarce. The receiver uses single-photon detectors, meaning received signals are soft probabilities, not clean 0/1 bits. LDPC decoding extracts reliable data from these noisy probabilistic measurements.

The decoder will be fabricated on the Efabless chipIgnite shuttle using SkyWater 130nm, fitting inside the ~10 mm^2 Caravel user area.

### System Overview

![System Architecture](../data/plots/system_architecture.png)

The full signal chain:

1. **Photon detector** -- single-photon avalanche diodes (SPADs) or superconducting nanowire detectors count individual photons per time slot
2. **LLR computation** (PicoRV32 firmware) -- converts photon counts to 6-bit log-likelihood ratios (LLRs) representing how likely each bit is to be 0 vs 1
3. **LDPC decoder** (our ASIC) -- iterative belief-propagation decoder that uses the redundancy in the code to correct errors
4. **Decoded data** -- 32 information bits per codeword, delivered via Wishbone bus

---

## 2. The Channel: Photon-Counting Optical

This is NOT a typical AWGN (Gaussian noise) channel. The noise model is Poisson:

- **Bit = 0 transmitted:** detector sees background photons only, Poisson(lambda_b)
- **Bit = 1 transmitted:** detector sees signal + background, Poisson(lambda_s + lambda_b)

![Channel Model](../data/plots/channel_model.png)

The LLR for each received photon count y, defined as ln[P(y | bit=0) / P(y | bit=1)] so that positive values favor bit 0, is:

```
LLR(y) = lambda_s - y * ln((lambda_s + lambda_b) / lambda_b)
```

At lambda_s = 3 photons/slot (our operating region), there's significant overlap between the bit=0 and bit=1 distributions. Without coding, BER would be unacceptable.
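The LLR formula above maps directly to a few lines of Python. The sketch below is illustrative only: the function names, the background rate `lambda_b = 0.1`, and the fixed-point `scale` are assumptions made for this example and are not taken from `ldpc_sim.py` or the firmware.

```python
import math

def poisson_llr(y, lambda_s, lambda_b):
    """LLR = ln P(y | bit=0) / P(y | bit=1) for the Poisson photon-counting channel."""
    return lambda_s - y * math.log((lambda_s + lambda_b) / lambda_b)

def quantize_llr(llr, scale=4.0, bits=6):
    """Scale, round, and saturate to a signed range ([-32, +31] for 6 bits).

    `scale` is a hypothetical fixed-point gain chosen for illustration.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, int(round(llr * scale))))

# lambda_s = 3 comes from the text; lambda_b = 0.1 is an assumed background rate.
for y in range(6):
    llr = poisson_llr(y, lambda_s=3.0, lambda_b=0.1)
    print(f"y={y}  LLR={llr:+.2f}  q={quantize_llr(llr):+d}")
```

A count of zero gives LLR = lambda_s (evidence for bit 0), and each detected photon subtracts a fixed amount, so only a handful of counts are needed before the 6-bit value saturates.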
LDPC coding provides the gain needed to achieve reliable communication at these photon-starved levels.

**Key insight:** The whole point of soft-decision LDPC decoding is that we pass the *probability* of each bit to the decoder, not a hard 0/1 decision. Hard-decision decoding would lose 2-3 dB of coding gain -- unacceptable when every dB matters.

---

## 3. Code Parameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Code type | QC-LDPC | Quasi-cyclic for hardware efficiency |
| Rate | 1/8 (R = 0.125) | k=32 info bits, n=256 coded bits |
| Base matrix | 7 x 8 | 7 check node groups, 8 variable node groups |
| Lifting factor Z | 32 | n = 8 * 32 = 256 |
| Quantization | 6-bit signed | Range [-32, +31], 1 sign + 5 magnitude |
| Max iterations | 30 | With early termination on syndrome check |
| Algorithm | Offset min-sum | ~0.2 dB from sum-product, no multipliers |
| Scheduling | Layered (row-serial) | ~2x faster convergence vs flooding |

### Why Rate 1/8?

Extreme redundancy for very low SNR. The Shannon limit at rate 1/8 is ~0.47 photons/slot. Practical LDPC codes can approach within 1-3 dB of this limit depending on code design and decoder quality.

### Why QC-LDPC?

Quasi-cyclic structure means the entire decoder is parameterized by a small 7x8 base matrix of circulant shift values. This enables:

- **Z=32 parallel processing** -- 32 VN/CN processors work in parallel
- **Barrel shifter routing** -- circulant permutations implemented as a single barrel shifter
- **Compact storage** -- only the base matrix (56 entries) needs to be stored

---

## 4. Decoder Architecture

The decoder processes one base-matrix row per cycle in a layered schedule:

```
For each iteration (up to 30):
  For each row r = 0..6:            ← 7 layers
    For each z = 0..31:             ← 32 in parallel
      1. Read beliefs for connected VNs
      2. Apply barrel shift (circulant permutation)
      3. Subtract old CN->VN messages (get VN->CN extrinsic)
      4. Min-sum CN update (find min1, min2, XOR signs)
      5. Compute new CN->VN messages
      6. Add new messages to extrinsic (get updated beliefs)
      7. Write back updated beliefs
      8. Store new CN->VN messages
  Check syndrome: if all zero, stop early
```

### Hardware Blocks

| Block | Size | Notes |
|-------|------|-------|
| **LLR RAM** | 256 x 6-bit | Channel LLR storage, written by Wishbone |
| **Message RAM** | 1792 x 6-bit | CN->VN messages: 7 rows x 8 cols x 32 = 1792 |
| **VN Update** | 32 parallel | Saturating add/subtract, 6-bit |
| **CN Update** | 32 parallel | Min-find + offset, no multipliers |
| **Barrel Shifter** | 32-wide, 6-bit | Log2(32) = 5 stages of muxing |
| **Syndrome Checker** | 32-wide XOR | Parity accumulator per row |
| **Iteration Controller** | FSM | Row sequencer + early termination |
| **Wishbone Interface** | Slave, 32-bit | Register map for CPU control |

### Why Offset Min-Sum (Not Sum-Product)?

Sum-product decoding requires tanh/atanh LUTs or multipliers. At Sky130:

- A 6-bit multiplier costs ~200 gates
- We need 32 of them = 6,400 gates just for CN update
- Plus the LUT storage for tanh approximation

Offset min-sum uses only **comparators and adders**:

- Find minimum magnitude among inputs (~5 comparators per CN)
- XOR the signs
- Subtract a small fixed offset (1 in our case)
- Total: ~100 gates per CN unit

The penalty is only ~0.2-0.5 dB vs optimal sum-product. A worthwhile trade at Sky130.

---

## 5. Code Optimization Journey

We built a complete Python model + optimization pipeline to find the best LDPC code for our channel. Here's how we progressively improved from 10.5 dB above the Shannon limit down to 3.4 dB.
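Every threshold quoted in the subsections that follow depends on the check-node rule being simulated, so it helps to have the offset min-sum update from Section 4 in concrete form first. The sketch below handles a single check node; the function name and list-based interface are illustrative assumptions rather than the API of `ldpc_sim.py`, and 6-bit saturation is left out for brevity.

```python
def cn_update_offset_min_sum(vn_to_cn, offset=1):
    """Offset min-sum update for one check node.

    vn_to_cn: signed integer VN->CN messages (at least two entries).
    Returns one CN->VN message per connected variable node.
    """
    mags = [abs(m) for m in vn_to_cn]
    # Smallest magnitude, its position, and the second-smallest magnitude.
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min1_idx)
    # Sign parity: 1 when an odd number of inputs are negative.
    parity = 0
    for m in vn_to_cn:
        parity ^= (m < 0)
    out = []
    for i, m in enumerate(vn_to_cn):
        mag = min2 if i == min1_idx else min1   # minimum over the *other* inputs
        mag = max(0, mag - offset)              # fixed offset correction
        sign = parity ^ (m < 0)                 # sign product of the *other* inputs
        out.append(-mag if sign else mag)
    return out

print(cn_update_offset_min_sum([5, -3, 7, 2]))  # [-1, 1, -1, -2]
```

Tracking only `min1`, `min2`, and one parity bit is what keeps the hardware to comparators and XORs: every output is one of two magnitudes with a one-bit sign fix-up.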
### 5.1 Starting Point: Original Staircase

The initial design used a simple IRA (Irregular Repeat-Accumulate) staircase:

- **VN degrees:** [7, 2, 2, 2, 2, 2, 2, 1] -- info column connects to all rows, parity columns have low connectivity
- **DE threshold:** 5.23 photons/slot
- **Gap to Shannon:** 10*log10(5.23/0.47) = **10.5 dB**

The weak column (degree 1) and low average connectivity limit performance.

### 5.2 Density Evolution Optimization

We used Monte Carlo density evolution to exhaustively search all 3^7 = 2,187 possible VN degree distributions (parity columns with degrees in {2, 3, 4}):

1. Enumerate all candidates
2. Filter by row-degree constraints (dc in [3, 6])
3. Group into 36 unique distributions (DE cares about distribution, not ordering)
4. Coarse screen at lambda_s=2.0 (quick convergence test)
5. Fine binary-search threshold for survivors

**Winner:** [7, 4, 4, 4, 4, 3, 3, 3]

![Degree Distributions](../data/plots/degree_distributions.png)

The optimized distribution has much higher average parity degree (3.6 vs 1.9), meaning more connections per check node, better error correction, and lower threshold.

![DE Threshold Comparison](../data/plots/threshold_comparison.png)

### 5.3 Base Matrix Construction

Having the optimal degree distribution, we used a PEG (Progressive Edge Growth) inspired algorithm to assign actual circulant shift values:

1. Fix the staircase backbone (guarantees encodability)
2. Randomly place extra connections for target degrees
3. Optimize shifts for maximum girth (shortest cycle length)
4. Verify full rank and encodability

![Constructed Base Matrix](../data/plots/base_matrix_heatmap.png)

The constructed matrix achieves girth = 6 at Z=32, with full rank (224/224) and verified encoding.
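A girth of at least 6 means the lifted graph has no 4-cycles, and that can be spot-checked on the shift values alone. The sketch below assumes each non-null base entry is a single circulant permutation and uses -1 to mark an all-zero block; both are conventions assumed for this example and may differ from what `h_matrix.json` stores. The matrices shown are toy inputs, not the constructed base matrix.

```python
from itertools import combinations

def has_4_cycle(shifts, Z):
    """Return True if the QC-LDPC code lifted from `shifts` contains a 4-cycle.

    shifts[i][j] is the circulant shift of block (i, j), or -1 for a null block.
    A 4-cycle exists exactly when some 2x2 sub-block of non-null entries
    satisfies s11 - s12 + s22 - s21 == 0 (mod Z).
    """
    rows, cols = len(shifts), len(shifts[0])
    for r1, r2 in combinations(range(rows), 2):
        for c1, c2 in combinations(range(cols), 2):
            s11, s12 = shifts[r1][c1], shifts[r1][c2]
            s21, s22 = shifts[r2][c1], shifts[r2][c2]
            if -1 in (s11, s12, s21, s22):
                continue  # a null block cannot take part in this cycle
            if (s11 - s12 + s22 - s21) % Z == 0:
                return True
    return False

# Toy examples with Z = 32.
print(has_4_cycle([[0, 0], [0, 0]], 32))   # True: identical shifts close a 4-cycle
print(has_4_cycle([[0, 1], [0, 2]], 32))   # False
```

The same alternating-sum idea extends to 6-cycles over three rows and three columns, which is the check a girth-8 search (as for Z=128) would need on top of this one.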
### 5.4 FER Validation

Monte Carlo frame error rate simulation confirms the DE optimization translates to real FER gains:

![FER Comparison](../data/plots/fer_comparison.png)

At lambda_s = 5 photons/slot, the optimized code has **13x fewer frame errors** than the original staircase.

---

## 6. Advanced Optimizations (Closing the Shannon Gap)

Beyond the base code optimization, we implemented three techniques to push closer to Shannon:

### 6.1 Normalized Min-Sum

Instead of subtracting a fixed offset from the CN update magnitude:

```
offset mode:     mag = max(0, mag - 1)
normalized mode: mag = floor(mag * alpha)   # alpha = 0.875
```

Normalized min-sum is better for low-rate codes because the correction scales with the magnitude rather than being a fixed offset. At alpha=0.875 (hardware-friendly: `mag - (mag >> 3)`, which matches `floor(mag * alpha)` to within one LSB), the threshold improves from 3.05 to ~2.9 photons/slot.

**RTL impact:** Replace the subtractor with a shift-and-subtract: `mag - (mag >> 3)`. Same gate count.

### 6.2 Larger Block Length (Z=128)

Increasing the lifting factor from Z=32 to Z=128:

- Codeword length: 256 -> 1024 bits
- Information bits: 32 -> 128 per codeword
- Same base matrix structure, different shift values
- Girth improves from 6 to 8 (larger Z allows better shift optimization)

![Z=128 FER Comparison](../data/plots/z128_fer_comparison.png)

Longer codes have better finite-length performance (closer to DE threshold). The tradeoff is more hardware (4x the RAMs and processing units).

### 6.3 Spatially-Coupled LDPC (SC-LDPC)

The most powerful technique: threshold saturation. SC-LDPC codes replicate the base matrix along a chain of L positions with coupling width w, creating a convolutional-like structure.

**How it works:**

```
Position:    0     1     2     3    ...   L-1
             |     |     |     |           |
CN grp 0:  [B0]-[B1]
CN grp 1:        [B0]-[B1]
CN grp 2:              [B0]-[B1]
CN grp 3:                    [B0]-[B1]
  ...                              ...
```

Each CN position connects to two adjacent VN positions via component matrices B0 and B1 (split from the base matrix).
The boundary positions have fewer connections, making them easier to decode. Once boundaries converge, a "wave" of correct decoding propagates inward.

**Threshold saturation:** The BP (belief propagation) threshold approaches the MAP (maximum a posteriori) threshold. This is a fundamental information-theoretic result -- spatial coupling unlocks performance that regular LDPC codes cannot achieve.

![SC-LDPC Threshold Saturation](../data/plots/sc_threshold_comparison.png)

### 6.4 Results Summary

![Threshold Progression](../data/plots/threshold_progression.png)

| Stage | Threshold (photons/slot) | Gap to Shannon |
|-------|-------------------------|----------------|
| Original staircase (offset MS) | 5.23 | 10.5 dB |
| DE-optimized degrees (offset MS) | 3.05 | 8.1 dB |
| + Normalized min-sum (alpha=0.875) | ~2.9 | 7.9 dB |
| + Spatially-coupled LDPC | **1.03** | **3.4 dB** |
| Shannon limit (rate 1/8) | 0.47 | 0 dB |

**We closed 7.1 dB of the original 10.5 dB gap.**

---

## 7. What Needs to Be Built (RTL)

### 7.1 Phase 1: Basic Decoder (chipIgnite Target)

This is what goes on the ASIC.
Conservative, proven architecture:

| Module | Description | Estimated Area |
|--------|-------------|----------------|
| `ldpc_decoder_top` | Top-level with Wishbone slave | -- |
| `wishbone_interface` | Register map, control/status | ~0.05 mm^2 |
| `llr_ram` | 256 x 6-bit dual-port SRAM | ~0.1 mm^2 |
| `msg_ram` | 1792 x 6-bit (CN->VN messages) | ~0.5 mm^2 |
| `vn_update_array` | 32x saturating add/sub units | ~0.1 mm^2 |
| `cn_update_array` | 32x min-find + offset/normalize | ~0.2 mm^2 |
| `barrel_shifter_z32` | 32-wide, 6-bit, 5-stage mux | ~0.15 mm^2 |
| `iteration_controller` | Row FSM + early termination | ~0.05 mm^2 |
| `syndrome_checker` | 32-wide XOR tree | ~0.05 mm^2 |
| `hard_decision_out` | Sign-bit extraction | ~0.01 mm^2 |
| **Total** | | **~1.2 mm^2** |

**Timing target:** 150 MHz (aggressive for Sky130, may relax to 100 MHz)

**Critical path:** CN update array (min-find across variable-degree check nodes)

### 7.2 Phase 2: Enhanced Decoder (Future / FPGA Prototype)

For an FPGA prototype, we have more flexibility:

- **Normalized min-sum:** Replace `mag - offset` with `mag - (mag >> 3)` -- trivial RTL change, ~0.2 dB better by the thresholds in Section 6.4 (3.05 -> ~2.9 photons/slot)
- **Configurable Z:** Support Z=32 and Z=128 via parameterized barrel shifter
- **Multiple H-matrices:** Store several base matrices in a small ROM, selectable at runtime
- **SC-LDPC windowed decoder:** Requires more memory (L positions x message storage) but same CN/VN units

### 7.3 Key RTL Design Decisions

**Memory architecture:**

- LLR RAM: single-port is fine (write during load, read during decode)
- Message RAM: must be dual-port or time-multiplexed (read old + write new in same cycle)
- At Sky130, use OpenRAM-generated SRAMs or register files

**Datapath:**

- All 6-bit signed arithmetic, no multipliers
- CN update: comparator tree for min1/min2, XOR chain for signs
- Barrel shifter: standard log-shifter, 5 stages for Z=32

**Control:**

- Simple row-serial FSM: load -> iterate(row 0..6) -> check syndrome -> repeat or done
- Wishbone interface: start bit, busy flag, convergence flag, iteration count

---

## 8. File Map

```
ldpc_optical/
  rtl/                          # SystemVerilog RTL (to be implemented)
  tb/                           # Verilator testbenches (to be implemented)
  model/
    ldpc_sim.py                 # Bit-exact behavioral model (reference for RTL)
    ldpc_analysis.py            # Code analysis tools (rate sweep, matrix compare)
    density_evolution.py        # DE optimizer + matrix construction
    sc_ldpc.py                  # SC-LDPC construction + windowed decoder
    test_density_evolution.py   # 24 tests for DE/optimization
    test_sc_ldpc.py             # 9 tests for SC-LDPC
    test_ldpc.py                # 19 tests for base decoder model
    test_ldpc_analysis.py       # 18 tests for analysis tools
    plot_de_results.py          # Generate all plots
  data/
    h_matrix.json               # Base matrix definition
    de_results.json             # DE optimization results (Z=32)
    de_results_z128.json        # Z=128 results
    sc_ldpc_results.json        # SC-LDPC threshold results
    plots/                      # All generated figures
  docs/
    project_report.md           # This document
```

---

## 9. Running the Python Model

```bash
# Quick demo: encode, channel, decode at several SNR points
python3 model/ldpc_sim.py

# Generate RTL test vectors
python3 model/ldpc_sim.py --gen-vectors

# Run density evolution optimizer
python3 model/density_evolution.py full --seed 42

# Find best normalized min-sum alpha
python3 model/density_evolution.py alpha-sweep

# SC-LDPC threshold + FER comparison
python3 model/sc_ldpc.py full

# Generate all plots
python3 model/plot_de_results.py

# Run all 70 tests
python3 -m pytest model/ -v
```

---

## 10. Next Steps

1. **RTL implementation** -- Start with `cn_update_array` and `vn_update_array` (most critical blocks), validate against Python bit-exact model
2. **Verilator testbench** -- Use `ldpc_sim.py --gen-vectors` to create golden test vectors
3. **OpenLane synthesis** -- Target Sky130, measure area and timing
4. **FPGA prototype** -- Validate on an FPGA board before tapeout
5. **SC-LDPC exploration** -- If area permits, consider adding windowed SC-LDPC support for future versions

---

## Appendix: Shannon Limit Calculation

For the binary-input Poisson channel at rate R = 1/8:

```
Channel capacity:  C(lambda_s) = max_p [H(Y) - H(Y|X)]
Shannon limit:     minimum lambda_s where C(lambda_s) >= R = 0.125
Result:            lambda_s* = 0.47 photons/slot
```

This means with a perfect (infinite-length, optimal-decoded) code at rate 1/8, you need at least 0.47 photons per slot to communicate reliably. Our SC-LDPC code achieves 1.03 photons/slot with a practical decoder -- within 3.4 dB of this fundamental limit.
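The calculation above can be sketched numerically as follows. This is a minimal illustration, not the script that produced the 0.47 figure: the function names, the grid search over the input prior, the truncation `y_max`, and the background rate passed in by the caller are all assumptions of this example. The limit depends on `lambda_b`, which is not stated in this report, so no attempt is made here to reproduce 0.47.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Poisson(lam), lam > 0, evaluated in the log domain."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def mutual_information(lambda_s, lambda_b, p1, y_max=60):
    """I(X;Y) in bits when P(bit=1) = p1 on the binary-input Poisson channel."""
    info = 0.0
    for y in range(y_max + 1):
        q0 = poisson_pmf(y, lambda_b)              # P(y | bit = 0)
        q1 = poisson_pmf(y, lambda_s + lambda_b)   # P(y | bit = 1)
        py = (1 - p1) * q0 + p1 * q1
        for prior, q in ((1 - p1, q0), (p1, q1)):
            if q > 0:
                info += prior * q * math.log2(q / py)
    return info

def capacity(lambda_s, lambda_b, grid=50):
    """max over the input prior, by grid search (an assumed, simple optimizer)."""
    return max(mutual_information(lambda_s, lambda_b, k / grid) for k in range(1, grid))

def shannon_limit(rate, lambda_b, lo=1e-3, hi=20.0, iters=30):
    """Bisection for the smallest lambda_s with capacity >= rate.

    Assumes capacity increases with lambda_s on [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if capacity(mid, lambda_b) >= rate:
            hi = mid
        else:
            lo = mid
    return hi

def gap_db(threshold, limit):
    """Gap between a decoding threshold and the Shannon limit, in dB."""
    return 10 * math.log10(threshold / limit)

print(round(gap_db(1.03, 0.47), 1))   # 3.4, matching the SC-LDPC row of Section 6.4
```

Calling `shannon_limit(0.125, lambda_b)` with the project's actual background rate would be the way to cross-check the quoted limit; `gap_db` is the same formula used for every "Gap to Shannon" entry in the results table.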