ldpc_optical/docs/project_report.md

# LDPC Decoder for Photon-Starved Optical Communication

**Target:** Efabless chipIgnite (SkyWater 130nm, Caravel harness)

**Authors:** CAH + partner

**Date:** February 2026

---

## 1. What We're Building

A low-density parity-check (LDPC) decoder ASIC for photon-starved optical links -- deep space optical, underwater optical, or any scenario where photons are scarce. The receiver uses single-photon detectors, meaning received signals are soft probabilities, not clean 0/1 bits. LDPC decoding extracts reliable data from these noisy probabilistic measurements.

The decoder will be fabricated on the Efabless chipIgnite shuttle using SkyWater 130nm, fitting inside the ~10 mm^2 Caravel user area.

### System Overview

![System Architecture](../data/plots/system_architecture.png)

The full signal chain:
1. **Photon detector** -- single-photon avalanche diodes (SPADs) or superconducting nanowire detectors count individual photons per time slot
2. **LLR computation** (PicoRV32 firmware) -- converts photon counts to 6-bit log-likelihood ratios (LLRs), each the log of the ratio between the probabilities that a bit is 0 vs 1
3. **LDPC decoder** (our ASIC) -- iterative belief-propagation decoder that uses the redundancy in the code to correct errors
4. **Decoded data** -- 32 information bits per codeword, delivered via Wishbone bus

---

## 2. The Channel: Photon-Counting Optical

This is NOT a typical AWGN (Gaussian noise) channel. The noise model is Poisson:

- **Bit = 0 transmitted:** detector sees background photons only, Poisson(lambda_b)
- **Bit = 1 transmitted:** detector sees signal + background, Poisson(lambda_s + lambda_b)

![Channel Model](../data/plots/channel_model.png)

The LLR for each received photon count y is:

```
LLR(y) = lambda_s - y * ln((lambda_s + lambda_b) / lambda_b)
```

At lambda_s = 3 photons/slot (our operating region), there's significant overlap between the bit=0 and bit=1 distributions. Without coding, BER would be unacceptable. LDPC coding provides the gain needed to achieve reliable communication at these photon-starved levels.
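A minimal Python sketch of the LLR step, using the formula above. The background rate `LAM_B = 0.1` and the quantizer `scale = 4.0` are assumptions chosen for illustration (this report does not specify them), and the function names are hypothetical rather than taken from `ldpc_sim.py`.

```python
import math

def poisson_llr(y, lam_s, lam_b):
    """LLR(y) = ln[P(y | bit=0) / P(y | bit=1)] for the Poisson channel."""
    return lam_s - y * math.log((lam_s + lam_b) / lam_b)

def quantize_llr(llr, scale=4.0, q_min=-32, q_max=31):
    """Scale, round, and saturate to the 6-bit signed range [-32, +31].

    The scale factor is an ASSUMPTION, not a value from the report.
    """
    return max(q_min, min(q_max, round(llr * scale)))

LAM_S = 3.0   # operating point from the text
LAM_B = 0.1   # ASSUMED background rate, for illustration only

for y in range(6):
    llr = poisson_llr(y, LAM_S, LAM_B)
    print(f"y={y}  LLR={llr:+.2f}  quantized={quantize_llr(llr):+d}")
```

Under these assumed values, a slot with zero photons gives LLR = lambda_s = 3.0 (quantized to 12), and counts of 4 or more saturate at -32. A positive LLR favors bit = 0, matching the hard-decision convention used for frame sync.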

**Key insight:** The whole point of soft-decision LDPC decoding is that we pass the *probability* of each bit to the decoder, not a hard 0/1 decision. Hard-decision decoding would lose 2-3 dB of coding gain -- unacceptable when every dB matters.

---

## 3. Code Parameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Code type | QC-LDPC | Quasi-cyclic for hardware efficiency |
| Rate | 1/8 (R = 0.125) | k=32 info bits, n=256 coded bits |
| Base matrix | 7 x 8 | 7 check node groups, 8 variable node groups |
| Lifting factor Z | 32 | n = 8 * 32 = 256 |
| Quantization | 6-bit signed | Range [-32, +31], 1 sign + 5 magnitude |
| Max iterations | 30 | With early termination on syndrome check |
| Algorithm | Offset min-sum | ~0.2-0.5 dB from sum-product, no multipliers |
| Scheduling | Layered (row-serial) | ~2x faster convergence vs flooding |

### Why Rate 1/8?

Extreme redundancy for very low SNR. The Shannon limit at rate 1/8 is ~0.47 photons/slot. Practical LDPC codes can approach within 1-3 dB of this limit depending on code design and decoder quality.

### Why QC-LDPC?

Quasi-cyclic structure means the entire decoder is parameterized by a small 7x8 base matrix of circulant shift values. This enables:
- **Z=32 parallel processing** -- 32 VN/CN processors work in parallel
- **Barrel shifter routing** -- circulant permutations implemented as a single barrel shifter
- **Compact storage** -- only the base matrix (56 entries) needs to be stored
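To make the lifting concrete, here is a small sketch of how a base matrix of shift values expands into the full binary parity-check matrix. The 2 x 3 base matrix and Z = 4 are hypothetical toy values, and using -1 to mark an empty block is an assumed convention; the project's real 7 x 8 matrix lives in `data/h_matrix.json`.

```python
def circulant(shift, z):
    """Z x Z identity matrix with each row's 1 moved right by `shift` (mod Z)."""
    return [[1 if (r + shift) % z == c else 0 for c in range(z)]
            for r in range(z)]

def lift(base, z):
    """Expand a base matrix of shifts (-1 = no connection) into binary H."""
    h = []
    for base_row in base:
        blocks = [circulant(s, z) if s >= 0 else [[0] * z for _ in range(z)]
                  for s in base_row]
        for r in range(z):
            # Concatenate row r of every block in this block-row.
            h.append([bit for block in blocks for bit in block[r]])
    return h

# HYPOTHETICAL toy base matrix, for illustration only.
BASE = [[0, 1, -1],
        [2, -1, 3]]
H = lift(BASE, 4)
print(len(H), "x", len(H[0]))   # 8 x 12
```

The same expansion applied to a 7 x 8 base matrix with Z = 32 yields the 224 x 256 matrix used in this design.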

---

## 4. Decoder Architecture

The decoder processes one base-matrix row per cycle in a layered schedule:

```
For each iteration (up to 30):
    For each row r = 0..6:       ← 7 layers
        For each z = 0..31:      ← 32 in parallel
            1. Read beliefs for connected VNs
            2. Apply barrel shift (circulant permutation)
            3. Subtract old CN->VN messages (get VN->CN extrinsic)
            4. Min-sum CN update (find min1, min2, XOR signs)
            5. Compute new CN->VN messages
            6. Add new messages to extrinsic (get updated beliefs)
            7. Write back updated beliefs
            8. Store new CN->VN messages
    Check syndrome: if all zero, stop early
```
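The schedule above can be sketched in Python on a toy code. This is an illustrative model, not the bit-exact `ldpc_sim.py`: it uses explicit VN index lists instead of barrel shifts (so step 2 disappears), processes one check per layer rather than Z = 32 in parallel, and the (7,4) Hamming-style check set is a stand-in for the real matrix.

```python
def sat6(x):
    """Saturate to the 6-bit signed range [-32, +31]."""
    return max(-32, min(31, x))

def layered_min_sum(llr, checks, max_iter=30, offset=1):
    """Layered offset min-sum. `checks` lists the VN indices of each check."""
    belief = list(llr)                          # running posterior per VN
    msg = [[0] * len(vns) for vns in checks]    # stored CN->VN messages
    for it in range(1, max_iter + 1):
        for c, vns in enumerate(checks):        # one layer per check
            # Steps 1 and 3: read beliefs, subtract old CN->VN messages.
            ext = [sat6(belief[v] - msg[c][k]) for k, v in enumerate(vns)]
            for k, v in enumerate(vns):
                # Steps 4-5: min magnitude and sign parity of the other inputs.
                others = ext[:k] + ext[k + 1:]
                mag = max(0, min(abs(x) for x in others) - offset)
                neg = sum(x < 0 for x in others) % 2
                msg[c][k] = -mag if neg else mag       # step 8
                belief[v] = sat6(ext[k] + msg[c][k])   # steps 6-7
        hard = [1 if b < 0 else 0 for b in belief]
        if all(sum(hard[v] for v in vns) % 2 == 0 for vns in checks):
            return hard, it, True               # syndrome all zero: stop early
    return hard, max_iter, False

# Toy example: all-zero codeword with one weakly wrong LLR at bit 2.
CHECKS = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
LLRS = [10, 10, -3, 10, 10, 10, 10]
decoded, iterations, converged = layered_min_sum(LLRS, CHECKS)
print(decoded, iterations, converged)
```

In this toy case the first layer already flips bit 2 back to a positive belief, so the decoder converges after one iteration.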

### Hardware Blocks

| Block | Size | Notes |
|-------|------|-------|
| **LLR RAM** | 256 x 6-bit | Channel LLR storage, written by Wishbone |
| **Message RAM** | 1792 x 6-bit | CN->VN messages: 7 rows x 8 cols x 32 = 1792 |
| **VN Update** | 32 parallel | Saturating add/subtract, 6-bit |
| **CN Update** | 32 parallel | Min-find + offset, no multipliers |
| **Barrel Shifter** | 32-wide, 6-bit | Log2(32) = 5 stages of muxing |
| **Syndrome Checker** | 32-wide XOR | Parity accumulator per row |
| **Iteration Controller** | FSM | Row sequencer + early termination |
| **Wishbone Interface** | Slave, 32-bit | Register map for CPU control |

### Why Offset Min-Sum (Not Sum-Product)?

Sum-product decoding requires tanh/atanh LUTs or multipliers. At Sky130:
- A 6-bit multiplier costs ~200 gates
- We need 32 of them = 6,400 gates just for CN update
- Plus the LUT storage for tanh approximation

Offset min-sum uses only **comparators and adders**:
- Find minimum magnitude among inputs (~5 comparators per CN)
- XOR the signs
- Subtract a small fixed offset (1 in our case)
- Total: ~100 gates per CN unit

The penalty is only ~0.2-0.5 dB vs optimal sum-product. A worthwhile trade at Sky130.
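A sketch of the offset min-sum check-node update using the min1/min2 approach: each output takes min1 unless that edge supplied min1 itself, in which case it takes min2. The function name and the convention that zero counts as positive are assumptions for illustration, not details taken from the RTL or `ldpc_sim.py`.

```python
def cn_update(inputs, offset=1):
    """Offset min-sum check-node update.

    `inputs` are signed VN->CN messages; returns one CN->VN message per edge.
    """
    mags = [abs(x) for x in inputs]
    min1 = min(mags)                        # smallest magnitude
    idx1 = mags.index(min1)                 # edge that supplied min1
    min2 = min(m for i, m in enumerate(mags) if i != idx1)  # second smallest
    sign_all = sum(x < 0 for x in inputs) % 2               # XOR of all signs
    out = []
    for i, x in enumerate(inputs):
        mag = min2 if i == idx1 else min1   # exclude the edge's own input
        mag = max(0, mag - offset)          # offset correction, clamped at 0
        neg = sign_all ^ (x < 0)            # remove this edge's own sign
        out.append(-mag if neg else mag)
    return out

print(cn_update([5, -3, 7, 2]))   # [-1, 1, -1, -2]
```

Only two magnitudes, one index, and one sign bit need to be kept per check, which is why the unit stays small.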

---

## 5. Code Optimization Journey

We built a complete Python model + optimization pipeline to find the best LDPC code for our channel. Here's how we progressively improved from 10.5 dB above the Shannon limit down to 3.4 dB.

### 5.1 Starting Point: Original Staircase

The initial design used a simple IRA (Irregular Repeat-Accumulate) staircase:

- **VN degrees:** [7, 2, 2, 2, 2, 2, 2, 1] -- info column connects to all rows, parity columns have low connectivity
- **DE threshold:** 5.23 photons/slot
- **Gap to Shannon:** 10*log10(5.23/0.47) = **10.5 dB**

The weak column (degree 1) and low average connectivity limit performance.

### 5.2 Density Evolution Optimization

We used Monte Carlo density evolution to exhaustively search all 3^7 = 2,187 possible VN degree distributions (parity columns with degrees in {2, 3, 4}):

1. Enumerate all candidates
2. Filter by row-degree constraints (dc in [3, 6])
3. Group into 36 unique distributions (DE cares about distribution, not ordering)
4. Coarse screen at lambda_s=2.0 (quick convergence test)
5. Fine binary-search threshold for survivors

**Winner:** [7, 4, 4, 4, 4, 3, 3, 3]
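The candidate counts in steps 1 and 3 can be reproduced with a few lines of Python. This sketch covers only the enumeration and grouping; the row-degree filter (step 2) is omitted because it depends on how edges are placed in the matrix, and the variable names are illustrative rather than taken from `density_evolution.py`.

```python
from itertools import product

PARITY_DEGREES = (2, 3, 4)
NUM_PARITY_COLS = 7

# Every ordered assignment of a degree to each of the 7 parity columns.
candidates = list(product(PARITY_DEGREES, repeat=NUM_PARITY_COLS))

# DE only sees the multiset of degrees, so sort each candidate to group them.
unique = {tuple(sorted(c, reverse=True)) for c in candidates}

print(len(candidates))   # 2187
print(len(unique))       # 36

winner_parity = (4, 4, 4, 4, 3, 3, 3)   # parity part of the winning distribution
print(winner_parity in unique)          # True
```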

![Degree Distributions](../data/plots/degree_distributions.png)

The optimized distribution has much higher average parity degree (3.6 vs 1.9), meaning more connections per check node, better error correction, and lower threshold.

![DE Threshold Comparison](../data/plots/threshold_comparison.png)

### 5.3 Base Matrix Construction

Having the optimal degree distribution, we used a PEG (Progressive Edge Growth) inspired algorithm to assign actual circulant shift values:

1. Fix the staircase backbone (guarantees encodability)
2. Randomly place extra connections for target degrees
3. Optimize shifts for maximum girth (shortest cycle length)
4. Verify full rank and encodability

![Constructed Base Matrix](../data/plots/base_matrix_heatmap.png)

The constructed matrix achieves girth = 6 at Z=32, with full rank (224/224) and verified encoding.
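Girth = 6 implies the lifted matrix has no length-4 cycles. For a QC code that can be checked directly on the base matrix: two rows and two columns that are all connected form a 4-cycle exactly when their shift differences cancel mod Z. The sketch below shows that check only (not a full girth computation), and assumes -1 marks an empty block.

```python
from itertools import combinations

def has_4_cycle(base, z):
    """True if the lifted QC matrix contains a length-4 cycle.

    `base` holds circulant shifts; -1 marks an empty block (ASSUMED convention).
    """
    for i, j in combinations(range(len(base)), 2):
        for a, b in combinations(range(len(base[0])), 2):
            s_ia, s_ib = base[i][a], base[i][b]
            s_ja, s_jb = base[j][a], base[j][b]
            if min(s_ia, s_ib, s_ja, s_jb) < 0:
                continue                    # at least one block is empty
            if (s_ia - s_ib + s_jb - s_ja) % z == 0:
                return True
    return False

print(has_4_cycle([[0, 0], [0, 0]], 32))   # True: all-zero shifts always collide
print(has_4_cycle([[0, 0], [0, 1]], 32))   # False
```

A shift optimizer can reject any candidate for which this returns `True` before spending time on longer cycles.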

### 5.4 FER Validation

Monte Carlo frame error rate simulation confirms the DE optimization translates to real FER gains:

![FER Comparison](../data/plots/fer_comparison.png)

At lambda_s = 5 photons/slot, the optimized code has **13x fewer frame errors** than the original staircase.

---

## 6. Advanced Optimizations (Closing the Shannon Gap)

Beyond the base code optimization, we implemented three techniques to push closer to Shannon:

### 6.1 Normalized Min-Sum

Instead of subtracting a fixed offset from the CN update magnitude:
```
offset mode:     mag = max(0, mag - 1)
normalized mode: mag = floor(mag * alpha)    # alpha = 0.875
```

Normalized min-sum is better for low-rate codes because the correction scales with the magnitude rather than being a fixed offset. At alpha=0.875 (hardware-friendly: `mag - (mag >> 3)`, which rounds up rather than down when `mag` is not a multiple of 8), the threshold improves from 3.05 to ~2.9 photons/slot.

**RTL impact:** Replace the subtractor with a shift-and-subtract: `mag - (mag >> 3)`. Same gate count.
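One detail worth checking before writing RTL: `floor(mag * 0.875)` and `mag - (mag >> 3)` are not identical for every input. The sketch below compares them over the 5-bit magnitude range. The third function, `(mag * 7) >> 3`, is an alternative suggested here (not something from the report) that matches the floor form exactly.

```python
import math

def normalize_floor(mag, alpha=0.875):
    """Reference form from the text: floor(mag * alpha)."""
    return math.floor(mag * alpha)

def normalize_shift(mag):
    """Hardware form from the text: mag - (mag >> 3)."""
    return mag - (mag >> 3)

def normalize_shift_floor(mag):
    """ALTERNATIVE (assumption): (7 * mag) >> 3, which equals the floor form."""
    return (mag * 7) >> 3

mismatches = [m for m in range(32) if normalize_floor(m) != normalize_shift(m)]
print(mismatches)
```

The two forms agree only when `mag` is a multiple of 8; otherwise `mag - (mag >> 3)` is one larger, i.e. it computes `ceil(0.875 * mag)`. Either rounding may be fine for decoding performance, but the RTL and the bit-exact Python model need to use the same one.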

### 6.2 Larger Block Length (Z=128)

Increasing the lifting factor from Z=32 to Z=128:
- Codeword length: 256 -> 1024 bits
- Information bits: 32 -> 128 per codeword
- Same base matrix structure, different shift values
- Girth improves from 6 to 8 (larger Z allows better shift optimization)

![Z=128 FER Comparison](../data/plots/z128_fer_comparison.png)

Longer codes have better finite-length performance (closer to DE threshold). The tradeoff is more hardware (4x the RAMs and processing units).

### 6.3 Spatially-Coupled LDPC (SC-LDPC)

The most powerful technique: threshold saturation. SC-LDPC codes replicate the base matrix along a chain of L positions with coupling width w, creating a convolutional-like structure.

**How it works:**
```
Position:  0    1    2    3    ...   L-1
           |    |    |    |          |
CN grp 0: [B0]-[B1]
CN grp 1:      [B0]-[B1]
CN grp 2:           [B0]-[B1]
CN grp 3:                [B0]-[B1]
  ...                          ...
```

Each CN position connects to two adjacent VN positions via component matrices B0 and B1 (split from the base matrix). The boundary positions have fewer connections, making them easier to decode. Once boundaries converge, a "wave" of correct decoding propagates inward.
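A rough sketch of the coupled block layout, following the diagram above: CN group i gets B0 at VN position i and B1 at position i+1. Both the edge-splitting rule (alternating columns) and the number of CN groups (L - 1) are assumptions made for illustration; the construction in `sc_ldpc.py` may split edges and terminate the chain differently.

```python
def split_base(base):
    """Split edge counts B into B0 + B1 (ASSUMED rule: alternate by column)."""
    b0 = [[e if c % 2 == 0 else 0 for c, e in enumerate(row)] for row in base]
    b1 = [[e - b0[r][c] for c, e in enumerate(row)]
          for r, row in enumerate(base)]
    return b0, b1

def couple(b0, b1, length):
    """Coupled protograph: CN group i has B0 at position i and B1 at i + 1."""
    rows, cols = len(b0), len(b0[0])
    zero = [[0] * cols for _ in range(rows)]
    coupled = []
    for i in range(length - 1):
        blocks = [b0 if p == i else b1 if p == i + 1 else zero
                  for p in range(length)]
        for r in range(rows):
            coupled.append([e for blk in blocks for e in blk[r]])
    return coupled

# HYPOTHETICAL 2 x 2 toy protograph (1 = one edge), chain length L = 3.
B = [[1, 1], [1, 1]]
B0, B1 = split_base(B)
SC = couple(B0, B1, 3)
for row in SC:
    print(row)
```

In this layout the interior VN position sees both B0 and B1 (its full degree), while the two boundary positions see only one component each, which is the reduced connectivity that lets decoding start at the edges.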

**Threshold saturation:** The BP (belief propagation) threshold of the coupled code approaches the MAP (maximum a posteriori) threshold of the underlying uncoupled code. This is a fundamental information-theoretic result -- spatial coupling unlocks performance that uncoupled LDPC codes cannot achieve under BP decoding.

![SC-LDPC Threshold Saturation](../data/plots/sc_threshold_comparison.png)

### 6.4 Results Summary

![Threshold Progression](../data/plots/threshold_progression.png)

| Stage | Threshold (photons/slot) | Gap to Shannon |
|-------|-------------------------|----------------|
| Original staircase (offset MS) | 5.23 | 10.5 dB |
| DE-optimized degrees (offset MS) | 3.05 | 8.1 dB |
| + Normalized min-sum (alpha=0.875) | ~2.9 | 7.9 dB |
| + Spatially-coupled LDPC | **1.03** | **3.4 dB** |
| Shannon limit (rate 1/8) | 0.47 | 0 dB |

**We closed 7.1 dB of the original 10.5 dB gap.**
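The gap column is just 10*log10(threshold / Shannon limit). A short check that recomputes every row of the table from the thresholds quoted above (the dictionary keys are shorthand labels, not identifiers from the codebase):

```python
import math

SHANNON_LIMIT = 0.47   # photons/slot at rate 1/8, from this report

def gap_db(threshold, limit=SHANNON_LIMIT):
    """Gap to the Shannon limit in dB for a threshold in photons/slot."""
    return 10 * math.log10(threshold / limit)

stages = {
    "original staircase": 5.23,
    "DE-optimized": 3.05,
    "normalized min-sum": 2.9,
    "SC-LDPC": 1.03,
}
for name, thr in stages.items():
    print(f"{name:20s} {gap_db(thr):5.1f} dB")

closed = gap_db(stages["original staircase"]) - gap_db(stages["SC-LDPC"])
print(f"gap closed: {closed:.1f} dB")
```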

---

## 7. Frame Synchronization (No Preamble)

A critical receiver problem: the stream of photon counts is continuous. There are no headers, preambles, or synchronization markers. The receiver must find where each 256-bit codeword starts before it can decode anything.

Traditional approaches insert a known preamble sequence before each block. This wastes precious photons at low SNR. Instead, we exploit the LDPC code structure itself for synchronization.

### 7.1 The Insight: Syndrome as a Lock Detector

A valid codeword satisfies all parity checks (syndrome weight = 0). A random 256-bit window from the wrong alignment fails most checks (expected syndrome weight ~M/2 = 112 out of 224). This huge gap between "correct alignment" and "wrong alignment" is a free synchronization signal -- no preamble overhead needed.

### 7.2 Acquisition Algorithm

```
1. SYNDROME SCREENING (cheap)
   For each of 256 candidate offsets:
     - Extract 256-sample window from stream
     - Hard-decision: positive LLR → 0, negative → 1
     - Compute syndrome weight (just XOR with H matrix)
     - Cost: ~1/30th of a full decode per candidate

2. FULL DECODE (expensive, but rarely needed)
   For candidates with syndrome weight < 50:
     - Run full iterative min-sum decoding (up to 30 iterations)
     - If converged: candidate is promising

3. CONFIRMATION
   - Decode the next 2 consecutive frames at that offset
   - If both converge: LOCK ACQUIRED
   - Total cost: ~11-12 equivalent decodes (256/30 = ~8.5 for screening + 3 full decodes)
```
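Step 1 (syndrome screening) can be sketched as follows. The example uses a toy 7-bit code with three checks as a stand-in for the real 256-bit code, noise-free LLRs, and hypothetical function names; with only three checks the separation between right and wrong alignments is far smaller than the 0 vs ~112 gap described above.

```python
def syndrome_weight(bits, checks):
    """Number of unsatisfied parity checks for a hard-decision window."""
    return sum(sum(bits[v] for v in vns) % 2 for vns in checks)

def screen_offsets(llr_stream, n, checks, threshold):
    """Return (offset, weight) for each alignment with weight below threshold."""
    hits = []
    for offset in range(n):
        window = llr_stream[offset:offset + n]
        hard = [0 if llr >= 0 else 1 for llr in window]   # positive LLR -> 0
        weight = syndrome_weight(hard, checks)
        if weight < threshold:
            hits.append((offset, weight))
    return hits

# Toy stand-in code and three valid codewords (illustration only).
CHECKS = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
CODEWORDS = [[1, 0, 1, 1, 0, 0, 1], [0, 1, 0, 0, 1, 1, 0], [1, 1, 0, 1, 0, 0, 0]]

# The stream starts 3 samples before the first codeword boundary.
bits = [0, 1, 1] + [b for cw in CODEWORDS for b in cw]
stream = [8 if b == 0 else -8 for b in bits]   # noise-free LLRs

print(screen_offsets(stream, 7, CHECKS, threshold=1))   # [(3, 0)]
```

Only the true offset (3) passes the screen here; in the real receiver the surviving candidates would then go on to the full decode in step 2.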

### 7.3 Re-Synchronization

If the offset drifts (e.g., clock slip), the receiver first searches locally within +/-16 positions of the last known offset. Only if that fails does it fall back to full acquisition. Local re-sync is nearly instant since it screens 33 candidates (2 x 16 + 1) instead of 256.

### 7.4 RTL Implications

The frame sync logic is simple hardware:
- **Syndrome screening:** reuses the same syndrome checker already in the decoder. Just feed it hard decisions from different offsets.
- **State machine:** ACQUIRE -> LOCKED -> RESYNC (on decode failure) -> local search -> fallback to ACQUIRE
- **No extra memory:** operates on the incoming LLR stream, one window at a time
- **Firmware option:** could also run entirely in PicoRV32 firmware if area is tight, since it's not time-critical (only runs once at startup or after link loss)
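The state machine above is small enough to sketch in a few lines. The state names follow the list; the function signature and the two boolean inputs are hypothetical, and `decode_ok` in `ACQUIRE` is assumed to mean that the full three-frame confirmation succeeded.

```python
from enum import Enum, auto

class SyncState(Enum):
    ACQUIRE = auto()
    LOCKED = auto()
    RESYNC = auto()

def next_state(state, decode_ok, local_search_ok=False):
    """One transition of the frame-sync controller (illustrative only)."""
    if state is SyncState.ACQUIRE:
        # Stay in acquisition until a candidate offset is confirmed.
        return SyncState.LOCKED if decode_ok else SyncState.ACQUIRE
    if state is SyncState.LOCKED:
        # A decode failure triggers the local search.
        return SyncState.LOCKED if decode_ok else SyncState.RESYNC
    # RESYNC: local search succeeded -> LOCKED, else fall back to ACQUIRE.
    return SyncState.LOCKED if local_search_ok else SyncState.ACQUIRE

state = SyncState.ACQUIRE
for ok in (False, True, True, False):
    state = next_state(state, decode_ok=ok)
print(state)   # SyncState.RESYNC
```

Whether a single failed decode should trigger `RESYNC`, or several in a row, is a design choice this sketch does not settle.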

---

## 8. What Needs to Be Built (RTL)

### 8.1 Phase 1: Basic Decoder (chipIgnite Target)

This is what goes on the ASIC. Conservative, proven architecture:

| Module | Description | Estimated Area |
|--------|-------------|----------------|
| `ldpc_decoder_top` | Top-level with Wishbone slave | -- |
| `wishbone_interface` | Register map, control/status | ~0.05 mm^2 |
| `llr_ram` | 256 x 6-bit SRAM (single-port, per Section 8.3) | ~0.1 mm^2 |
| `msg_ram` | 1792 x 6-bit (CN->VN messages) | ~0.5 mm^2 |
| `vn_update_array` | 32x saturating add/sub units | ~0.1 mm^2 |
| `cn_update_array` | 32x min-find + offset/normalize | ~0.2 mm^2 |
| `barrel_shifter_z32` | 32-wide, 6-bit, 5-stage mux | ~0.15 mm^2 |
| `iteration_controller` | Row FSM + early termination | ~0.05 mm^2 |
| `syndrome_checker` | 32-wide XOR tree | ~0.05 mm^2 |
| `hard_decision_out` | Sign-bit extraction | ~0.01 mm^2 |
| **Total** | | **~1.2 mm^2** |

**Timing target:** 150 MHz (aggressive for Sky130, may relax to 100 MHz)

**Critical path:** CN update array (min-find across variable-degree check nodes)

### 8.2 Phase 2: Enhanced Decoder (Future / FPGA Prototype)

For an FPGA prototype, we have more flexibility:

- **Normalized min-sum:** Replace `mag - offset` with `mag - (mag >> 3)` -- trivial RTL change, ~0.2 dB better threshold (3.05 -> ~2.9 photons/slot, see Section 6.4)
- **Configurable Z:** Support Z=32 and Z=128 via parameterized barrel shifter
- **Multiple H-matrices:** Store several base matrices in a small ROM, selectable at runtime
- **SC-LDPC windowed decoder:** Requires more memory (L positions x message storage) but same CN/VN units

### 8.3 Key RTL Design Decisions

**Memory architecture:**
- LLR RAM: single-port is fine (write during load, read during decode)
- Message RAM: must be dual-port or time-multiplexed (read old + write new in same cycle)
- At Sky130, use OpenRAM-generated SRAMs or register files

**Datapath:**
- All 6-bit signed arithmetic, no multipliers
- CN update: comparator tree for min1/min2, XOR chain for signs
- Barrel shifter: standard log-shifter, 5 stages for Z=32

**Control:**
- Simple row-serial FSM: load -> iterate(row 0..6) -> check syndrome -> repeat or done
- Wishbone interface: start bit, busy flag, convergence flag, iteration count
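The log-shifter listed under Datapath can be modeled behaviorally as follows: stage k rotates by 2^k when bit k of the shift amount is set, so Z = 32 needs 5 stages. The rotation direction (left) is an assumption; the RTL must use whichever direction matches the circulant convention of the H-matrix.

```python
def barrel_shift(data, shift):
    """Rotate `data` left by `shift` using log2(len(data)) conditional stages.

    Assumes len(data) is a power of two.
    """
    z = len(data)
    stages = z.bit_length() - 1          # 5 stages for Z = 32
    for k in range(stages):
        if (shift >> k) & 1:             # stage k is a mux: rotate by 2**k or pass
            step = 1 << k
            data = data[step:] + data[:step]
    return data

lanes = list(range(32))
print(barrel_shift(lanes, 5)[:4])   # [5, 6, 7, 8]
```

Each stage corresponds to one row of 2:1 muxes per lane in hardware, replicated across the 6 bits of each message.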

---

## 9. File Map

```
ldpc_optical/
  rtl/                          # SystemVerilog RTL (to be implemented)
  tb/                           # Verilator testbenches (to be implemented)
  model/
    ldpc_sim.py                 # Bit-exact behavioral model (reference for RTL)
    ldpc_analysis.py            # Code analysis tools (rate sweep, matrix compare)
    density_evolution.py        # DE optimizer + matrix construction
    sc_ldpc.py                  # SC-LDPC construction + windowed decoder
    frame_sync.py               # Preamble-less frame synchronization
    test_density_evolution.py   # 24 tests for DE/optimization
    test_frame_sync.py          # Frame sync tests
    test_sc_ldpc.py             # 9 tests for SC-LDPC
    test_ldpc.py                # 19 tests for base decoder model
    test_ldpc_analysis.py       # 18 tests for analysis tools
    plot_de_results.py          # Generate all plots
  data/
    h_matrix.json               # Base matrix definition
    de_results.json             # DE optimization results (Z=32)
    de_results_z128.json        # Z=128 results
    sc_ldpc_results.json        # SC-LDPC threshold results
    plots/                      # All generated figures
  docs/
    project_report.md           # This document
```

---

## 10. Running the Python Model

```bash
# Quick demo: encode, channel, decode at several SNR points
python3 model/ldpc_sim.py

# Generate RTL test vectors
python3 model/ldpc_sim.py --gen-vectors

# Run density evolution optimizer
python3 model/density_evolution.py full --seed 42

# Find best normalized min-sum alpha
python3 model/density_evolution.py alpha-sweep

# SC-LDPC threshold + FER comparison
python3 model/sc_ldpc.py full

# Frame synchronization demo
python3 model/frame_sync.py                  # Quick demo at lam_s=5
python3 model/frame_sync.py --sweep          # Acquisition sweep over SNR
python3 model/frame_sync.py --resync-test    # Re-sync robustness test

# Generate all plots
python3 model/plot_de_results.py

# Run the full test suite (the 70 tests counted in the file map, plus the frame sync tests)
python3 -m pytest model/ -v
```

---

## 11. Next Steps

1. **RTL implementation** -- Start with `cn_update_array` and `vn_update_array` (most critical blocks), validate against Python bit-exact model
2. **Verilator testbench** -- Use `ldpc_sim.py --gen-vectors` to create golden test vectors
3. **OpenLane synthesis** -- Target Sky130, measure area and timing
4. **FPGA prototype** -- Validate on an FPGA board before tapeout
5. **SC-LDPC exploration** -- If area permits, consider adding windowed SC-LDPC support for future versions

---

## Appendix: Shannon Limit Calculation

For the binary-input Poisson channel at rate R = 1/8:

```
Channel capacity: C(lambda_s) = max_p [H(Y) - H(Y|X)]
Shannon limit: minimum lambda_s where C(lambda_s) >= R = 0.125
Result: lambda_s* = 0.47 photons/slot
```

This means with a perfect (infinite-length, optimally decoded) code at rate 1/8, you need at least 0.47 photons per slot to communicate reliably. Our SC-LDPC code achieves 1.03 photons/slot with a practical decoder -- within 3.4 dB of this fundamental limit.
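A numerical sketch of this calculation: mutual information of the binary-input Poisson channel, maximized over the input probability on a grid, with bisection on lambda_s to find where capacity reaches R. The background rate `LAM_B = 0.1`, the grid resolution, and the count truncation `y_max` are all assumptions, so the printed value should not be expected to reproduce 0.47 unless the background rate matches the one used for this report.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for Poisson(lam), computed in the log domain."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def mutual_info(p1, lam_s, lam_b, y_max=60):
    """I(X;Y) in bits when P(X=1) = p1; counts truncated at y_max."""
    total = 0.0
    for y in range(y_max + 1):
        q0 = poisson_pmf(y, lam_b)            # P(y | bit = 0)
        q1 = poisson_pmf(y, lam_s + lam_b)    # P(y | bit = 1)
        qy = (1 - p1) * q0 + p1 * q1          # P(y)
        if q0 > 0:
            total += (1 - p1) * q0 * math.log2(q0 / qy)
        if q1 > 0:
            total += p1 * q1 * math.log2(q1 / qy)
    return total

def capacity(lam_s, lam_b, grid=50):
    """max over p1 of I(X;Y), searched on a uniform grid."""
    return max(mutual_info(i / grid, lam_s, lam_b) for i in range(1, grid))

def shannon_limit(rate, lam_b, lo=1e-6, hi=20.0, iters=30):
    """Smallest lambda_s (by bisection) with capacity >= rate."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if capacity(mid, lam_b) >= rate:
            hi = mid
        else:
            lo = mid
    return hi

LAM_B = 0.1   # ASSUMED background rate, for illustration only
limit = shannon_limit(0.125, LAM_B)
print(f"lambda_s* = {limit:.3f} photons/slot (for assumed lambda_b = {LAM_B})")
```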