docs(hardening): add wrapper attempt history through v8-v11 + LVS-fix lessons

Document the full wrapper hardening trail:
- Mar 12-13 wrapper_v2/v3/v4 results, mpw_precheck 17/19, and 5/5 GLS pass
- May 7-11 v6-v11 LVS-cosmetic-fix attempts (all seven failed)

The v6-v11 series tried to eliminate the 208 cosmetic LVS pin-match errors via per-pin conb_1 tieoffs and placement tweaks. All failed because the errors are a Magic SPICE-extraction limitation (constant-tied output nets collapse into shared power/ground at extract time), not a hardening defect. Documented so future sessions don't re-explore this dead end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: add Run 5-7 hardening results and lessons learned
2026-05-12 23:13:11 -06:00 · 2026-03-10 19:42:23 -06:00 · 2026-03-10 19:42:09 -06:00
2 changed files with 400 additions and 125 deletions
--- a/docs/hardening-results.md
+++ b/docs/hardening-results.md
@@ -189,9 +189,306 @@ The `syndrome_cnt = syndrome_cnt + 1` accumulation pattern creates a carry chain

 Wishbone address input (`wb_adr_i`) has -2.47 ns setup violation. Fixable by registering the address at the decoder boundary.

+## Run 5: `syndrome_pipeline` (Mar 3, 2026) — COMPLETED (timing violations)
+- **RTL**: Pipelined CN + syndrome pipeline (SYNDROME_S1 + SYNDROME_S2 with serial popcount)
+- **Config**: `CLOCK_PERIOD=20` (50 MHz), `SYNTH_STRATEGY=AREA 0`, `RUN_ANTENNA_REPAIR=false`
+- **Die area**: 2800 x 1760 µm (4.93 mm²)
+- **Result**: All 75 steps completed. DRC/LVS clean.
+- **TT Setup WNS**: **-28.98 ns** — no improvement from Run 4
+- **Root cause**: Yosys serializes `syndrome_cnt = syndrome_cnt + 1` loop-carried dependency into ~48 ns chain
+- **Lesson**: Splitting parity + popcount into 2 cycles does not help if the popcount itself is still serial
+
+## Run 6: `balanced_popcount` (Mar 4, 2026) — COMPLETED (TT timing MET!)
+- **RTL**: Pipelined CN + syndrome pipeline with balanced 4-wide adder tree popcount
+- **Config**: `CLOCK_PERIOD=20` (50 MHz), `SYNTH_STRATEGY=AREA 0`, `RUN_ANTENNA_REPAIR=false`
+- **Die area**: 2800 x 1760 µm (4.93 mm²)
+- **Result**: All 75 steps completed. DRC/LVS clean. **TT timing met!**
+
+### Physical Results
+| Metric | Result |
+|--------|--------|
+| Magic DRC | **Clean** |
+| KLayout DRC | **Clean** |
+| LVS | **Clean** (0 errors, 0 unmatched) |
+| Antenna violating nets | 1,687 (no repair attempted) |
+
+### Area & Utilization
+| Metric | Value |
+|--------|-------|
+| Die area | 4,928,000 µm² (4.93 mm²) |
+| Instance count | 186,915 |
+| Instance area | 1,367,580 µm² (1.37 mm²) |
+| Core utilization | 28.2% |
+| Sequential cells | 18,056 |
+| Timing repair buffers | 27,864 |
+
+### Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)
+| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
+|--------|----------------|-----------------|----------------|----------------|
+| nom_tt_025C_1v80 | **0.0** | 0 | -0.45 | -10.5 |
+| nom_ss_100C_1v60 | **-9.18** | -12,474.4 | -0.17 | -0.21 |
+| nom_ff_n40C_1v95 | **0.0** | 0 | -0.37 | -38.6 |
+| max_ss_100C_1v60 | -10.45 | -15,896.8 | -0.44 | -0.87 |
+
+### Estimated Max Frequency
+- **TT corner**: **50 MHz — TIMING MET**
+- **SS corner**: Critical path ~40 ns → **~25 MHz** (up from ~11 MHz)
+- **FF corner**: **50 MHz — TIMING MET**
+
+### New Critical Path (SS corner)
+| Item | Value |
+|------|-------|
+| Startpoint | `u_core.col_idx[0]` (column index register) |
+| Endpoint | `u_core.beliefs` registers |
+| Slack | -9.18 ns (nom_ss) |
+| Data arrival time | 40.15 ns |
+| Description | Belief update mux path during LAYER_READ/LAYER_WRITE |
+
+The syndrome path is NO LONGER critical. The new bottleneck is the column-indexed mux/barrel-shifter path used during belief reads and writes.
+
+### Key Observations
+1. **Balanced popcount tree eliminated the syndrome bottleneck** — WNS improved from -28.98 ns to 0.0 ns at TT
+2. TT and FF corners now fully meet 50 MHz timing
+3. SS corner still fails (-9.18 ns) due to a different path: belief update mux indexed by col_idx
+4. Hold violations are minor (-0.45 ns) and can be fixed with post-route optimization
+5. 1,687 antenna violations need to be addressed (antenna repair was disabled)
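The difference between the Run 5 serial `syndrome_cnt = syndrome_cnt + 1` chain and the Run 6 balanced tree comes down to logic depth. The Python sketch below is a behavioral model only, not the RTL: it counts dependent adder stages for a 224-bit parity vector (the width of `parity_vec` in the RTL). It assumes "4-wide" means each tree node sums four inputs. The function names are hypothetical.

```python
def serial_popcount(bits):
    """Accumulate one bit at a time; every add depends on the previous one."""
    count, depth = 0, 0
    for b in bits:
        count += b
        depth += 1  # one more dependent increment stage in the chain
    return count, depth


def tree_popcount(bits, fan_in=4):
    """Sum bits in a balanced tree; depth grows with log(len(bits))."""
    level = list(bits)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        # Each node adds up to `fan_in` values from the previous level.
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        depth += 1
    return level[0], depth


parity_vec = [1, 0] * 112  # 224 example parity bits
print(serial_popcount(parity_vec))
print(tree_popcount(parity_vec))
```

Both functions return the same count, but the serial version has one dependent stage per bit while the tree needs only a handful of levels. Actual delay depends on how Yosys maps each node, so the stage counts indicate the trend rather than nanoseconds.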
+
+## Updated Summary Table
+
+| Run | RTL | Key Change | Antenna | Status | TT Setup WNS | Max Freq (TT) |
+|-----|-----|------------|---------|--------|-------------|---------------|
+| 1 | Unpipelined | — | Heuristic | **FAILED** | — | — |
+| 2 | Unpipelined | — | Iterative | **COMPLETED** | -27.13 ns | ~21 MHz |
+| 3 | Pipelined CN | CN pipeline | Iterative | **FAILED** | — | — |
+| 4 | Pipelined CN | CN pipeline | None | **COMPLETED** | -28.86 ns | ~20 MHz |
+| 5 | + Syndrome pipeline | Serial popcount | None | **COMPLETED** | -28.98 ns | ~20 MHz |
+| 6 | + Balanced popcount | Adder tree | None | **COMPLETED** | **0.0 ns** | **50 MHz** |
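The "Max Freq (TT)" column is consistent with stretching the target clock period by the magnitude of a negative setup WNS. The Python sketch below reproduces that arithmetic under two assumptions: every run in the table targets `CLOCK_PERIOD=20` (stated in this section only for Runs 5 to 7), and a non-negative WNS is reported as the target frequency rather than extrapolated. The function name is hypothetical.

```python
def max_freq_mhz(clock_period_ns: float, setup_wns_ns: float) -> float:
    """Estimate the achievable frequency from target period and setup WNS.

    Assumption: negative slack lengthens the achievable period by |WNS|;
    zero or positive slack is reported as the target frequency (timing met).
    """
    achievable_period_ns = clock_period_ns - min(setup_wns_ns, 0.0)
    return 1000.0 / achievable_period_ns


# TT setup WNS values taken from the summary table above.
tt_setup_wns = {"2": -27.13, "4": -28.86, "5": -28.98, "6": 0.0}
for run, wns in tt_setup_wns.items():
    print(f"Run {run}: ~{max_freq_mhz(20.0, wns):.0f} MHz")
```

The SS-corner estimate of ~25 MHz appears to be derived from the 40.15 ns data arrival time instead, so this formula is not expected to reproduce it.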
+
+## Run 7a: `pipelined_layer2` (Mar 9, 2026) — FAILED
+- **RTL**: Run 6 + LAYER_WRITE split into LAYER_WRITE_ADDR + LAYER_WRITE_DATA
+- **Config**: `CLOCK_PERIOD=20`, `DIODE_ON_PORTS=in`, `HEURISTIC_ANTENNA_THRESHOLD=200`
+- **Failure**: `GRT-0118` routing congestion — heuristic diode insertion on input ports added too many cells
+- **Lesson**: Any heuristic diode insertion causes GRT failure on this design
+
+## Run 7b: `pipelined_layer3` (Mar 9, 2026) — FAILED
+- **RTL**: Same as 7a (LAYER_WRITE_ADDR/DATA split)
+- **Config**: `DIODE_ON_PORTS=none`, `RUN_HEURISTIC_DIODE_INSERTION=false`
+- **Failure**: Post-CTS resizer diverged — 2.5+ hours at 100% CPU, memory climbing linearly, never converging
+- **Lesson**: The LAYER_WRITE pipeline split creates too many paths for the OpenROAD resizer to converge
+
+## Run 7c: `pre_shift` (Mar 9, 2026) — FAILED
+- **RTL**: Run 6 + pre-registered H_BASE shift lookahead (`H_BASE[row_idx][col_idx+1]`)
+- **Config**: Same as 7b
+- **Failure**: `GPL-0302` placement density overflow — 150K cells at 41.3% exceeded 40% target
+- **Root cause**: Yosys cannot fold H_BASE constants through registers → full 256:1 write mux explosion (~2x cell count vs Run 6's 83K)
+- **Lesson**: Registering H_BASE shift values prevents Yosys constant folding
+
+## Run 7d: `run6_baseline` (Mar 9, 2026) — FAILED
+- **RTL**: Reverted to Run 6 baseline (identical RTL)
+- **Config**: `DIODE_ON_PORTS=in` (inadvertently left from earlier runs), `RUN_HEURISTIC_DIODE_INSERTION=false`
+- **Cells**: 85,500
+- **Failure**: `GRT-0118` routing congestion
+- **Root cause**: `DIODE_ON_PORTS=in` inserts diodes on input ports even when heuristic insertion is disabled
+
+## Run 7e: `run6b_nodiode` (Mar 10, 2026) — FAILED
+- **RTL**: Run 6 baseline
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.5/0.3 (from config.json), reused `run6_baseline` synthesis
+- **Failure**: Post-CTS resizer diverged (9+ GiB memory, 3+ hours, never converged)
+- **Root cause**: Reusing synthesis from a run with different config (`DIODE_ON_PORTS=in`) produces a subtly different netlist that causes PnR divergence
+
+## Run 7f: `run6_clean` (Mar 10, 2026) — FAILED
+- **RTL**: Run 6 baseline, clean full run from scratch
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.5/0.3
+- **Cells**: 85,500
+- **Hold buffers inserted**: 35,506
+- **Failure**: `GRT-0118` routing congestion
+- **Root cause**: Higher hold slack margins (0.5/0.3 vs balanced_popcount's 0.4/0.2) caused 13K extra hold buffers (35K vs 22K), pushing routing congestion over GRT threshold
+
+## Run 7g: `run6_fixhold` (Mar 10, 2026) — FAILED
+- **RTL**: Run 6 baseline, reused `run6_clean` synthesis
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.4/0.2 (matching balanced_popcount)
+- **Failure**: Post-CTS resizer diverged (14+ GiB, 3.5+ hours)
+- **Root cause**: Yosys non-determinism — `run6_clean` synthesis produced a slightly different cell mix that didn't route cleanly despite identical config
+
+## Run 7h: `run6_reuse_bp` (Mar 10, 2026) — COMPLETED (reproduces Run 6!)
+- **RTL**: Run 6 baseline, **reused balanced_popcount's actual synthesis netlist**
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.4/0.2
+- **Result**: All stages completed. DRC/LVS clean. TT timing met!
+- **Hold buffers**: 22,095 (identical to balanced_popcount)
+
+### Physical Results
+| Metric | Result |
+|--------|--------|
+| Magic DRC | **Clean** |
+| KLayout DRC | **Clean** |
+| LVS | **Clean** (circuits match uniquely) |
+| Antenna violating nets | 1,687 (repair disabled) |
+| Antenna violating pins | 3,416 (repair disabled) |
+
+### Area & Utilization
+| Metric | Value |
+|--------|-------|
+| Die area | 4,928,000 µm² (4.93 mm²) |
+| Instance count | 186,915 |
+| Instance area | 1,367,580 µm² (1.37 mm²) |
+| Core utilization | 28.2% |
+
+### Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)
+| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
+|--------|----------------|-----------------|----------------|----------------|
+| nom_tt_025C_1v80 | **+3.28** | 0 | -0.45 | -10.5 |
+| nom_ss_100C_1v60 | **-9.18** | -12,474 | -0.17 | -0.21 |
+| nom_ff_n40C_1v95 | **+5.93** | 0 | -0.37 | -38.6 |
+| max_ss_100C_1v60 | -10.45 | -15,897 | -0.44 | -0.87 |
+| min_tt_025C_1v80 | +3.71 | 0 | -0.26 | -1.66 |
+| max_tt_025C_1v80 | +2.90 | 0 | -0.62 | -29.5 |
+
+### Key Observations
+1. **Results identical to Run 6** — confirms that the balanced_popcount synthesis netlist is the key ingredient
+2. Yosys non-determinism is significant: re-synthesizing the same RTL with the same config produces netlists that fail PnR
+3. Hold violations (1,543 total) are all on input port paths (`wb_dat_i`, `wb_adr_i`), zero reg-to-reg — fixable with input delay constraints
+4. Max slew violations (4,112) and max cap violations (655) concentrated in SS corner
+
+## Updated Summary Table
+
+| Run | RTL | Key Change | Antenna | Status | TT Setup WNS | Max Freq (TT) |
+|-----|-----|------------|---------|--------|-------------|---------------|
+| 1 | Unpipelined | — | Heuristic | **FAILED** | — | — |
+| 2 | Unpipelined | — | Iterative | **COMPLETED** | -27.13 ns | ~21 MHz |
+| 3 | Pipelined CN | CN pipeline | Iterative | **FAILED** | — | — |
+| 4 | Pipelined CN | CN pipeline | None | **COMPLETED** | -28.86 ns | ~20 MHz |
+| 5 | + Syndrome pipeline | Serial popcount | None | **COMPLETED** | -28.98 ns | ~20 MHz |
+| 6 | + Balanced popcount | Adder tree | None | **COMPLETED** | **0.0 ns** | **50 MHz** |
+| 7a | + LAYER_WRITE split | ADDR/DATA pipeline | Heuristic | **FAILED** | — | — |
+| 7b | + LAYER_WRITE split | ADDR/DATA pipeline | None | **FAILED** (resizer) | — | — |
+| 7c | + pre_shift | H_BASE lookahead | None | **FAILED** (GPL) | — | — |
+| 7d | Run 6 baseline | DIODE_ON_PORTS=in | None | **FAILED** (GRT) | — | — |
+| 7e | Run 6 baseline | Reuse wrong synth | None | **FAILED** (resizer) | — | — |
+| 7f | Run 6 baseline | Hold margins 0.5/0.3 | None | **FAILED** (GRT) | — | — |
+| 7g | Run 6 baseline | Reuse run6_clean synth | None | **FAILED** (resizer) | — | — |
+| 7h | Run 6 baseline | **Reuse BP synth** | None | **COMPLETED** | **+3.28 ns** | **50 MHz** |
+
+## Key Lessons Learned (Run 7 Series)
+
+1. **LAYER_WRITE pipeline is not viable**: Any register between col_idx and H_BASE causes either cell explosion (Yosys can't fold constants through registers) or PnR divergence (too many paths for resizer)
+2. **Heuristic diode insertion always fails**: Both `RUN_HEURISTIC_DIODE_INSERTION=true` and `DIODE_ON_PORTS=in` cause GRT-0118 congestion
+3. **Hold slack margins matter**: 0.5/0.3 inserts 35K hold buffers → GRT failure. 0.4/0.2 inserts 22K → passes
+4. **Yosys synthesis is non-deterministic**: Re-synthesizing identical RTL+config produces different netlists with different PnR outcomes. The balanced_popcount synthesis netlist is the only one proven to complete
+5. **Config must be consistent**: Reusing synthesis from a run with different config settings causes PnR divergence
+6. **Run 6's balanced_popcount synthesis netlist is the golden reference** — all future PnR runs should reuse it
+
+## Wrapper Hardening (Mar 12-13, 2026)
+
+### wrapper_v2 — COMPLETED (LVS fail)
+- **Config**: `SYNTH_ELABORATE_ONLY=true`, `FP_PDN_ENABLE_RAILS=false`
+- **Result**: DRC clean, but LVS fails — 3 standard cells (inv_2 + 2x conb_1) have floating VPWR/VGND
+- **Root cause**: Without power rails, wrapper std cells have no power connection
+
+### wrapper_v3 — ABORTED (208 LVS pin-match errors)
+- **Config**: `SYNTH_ELABORATE_ONLY=true`, `FP_PDN_ENABLE_RAILS=true`, `ERROR_ON_LVS_ERROR=true`
+- **Result**: DRC clean, XOR clean, power pins connected. Flow aborted at LVS check.
+- **LVS issue**: 206 constant-tied output pins merged during Magic SPICE extraction
+
+### wrapper_v4 — COMPLETED (golden wrapper)
+- **Config**: Same as v3 but `ERROR_ON_LVS_ERROR=false`
+- **Result**: All 69 stages completed. DRC clean (Magic + KLayout). XOR clean.
+- **LVS**: 208 pin-match errors (cosmetic — device classes equivalent)
+- **Pin merging**: Magic SPICE extraction merges io_oeb[37:0], io_out[37:0], la_data_out[127:0], user_irq[2:1] into shared constant nets, losing individual pin labels
+
+## Precheck Results (Mar 13, 2026)
+
+| # | Check | Result |
+|---|-------|--------|
+| 1 | License | PASSED (SPDX sub-check: 1727 non-compliant venv files) |
+| 2 | Makefile | **PASSED** |
+| 3 | Default | **PASSED** |
+| 4 | Documentation | **PASSED** |
+| 5 | Top Cell | **PASSED** |
+| 6 | Consistency | **PASSED** |
+| 7 | GPIO-Defines | **PASSED** |
+| 8 | XOR | **PASSED** |
+| 9 | Magic DRC | **PASSED** |
+| 10 | KLayout FEOL | FAILED (SIGSEGV crash, NOT real DRC) |
+| 11 | KLayout BEOL | **PASSED** |
+| 12 | KLayout Offgrid | **PASSED** |
+| 13 | KLayout Metal Density | **PASSED** |
+| 14 | KLayout Pin Labels | **PASSED** |
+| 15 | KLayout ZeroArea | **PASSED** |
+| 16 | Spike Check | **PASSED** |
+| 17 | Illegal Cellname | **PASSED** |
+| 18 | OEB | **PASSED** |
+| 19 | LVS | FAILED (3 cosmetic pin mismatches) |
+
+**17 PASSED, 2 FAILED.** Both failures are non-functional:
+- KLayout FEOL: Tool crash (signal 11), not a DRC violation
+- LVS: "Top level cell failed pin matching" — 3 cosmetic mismatches:
+  - `io_oeb[9]` in layout only (Magic kept 1 label for merged constant net)
+  - `user_irq[2]` in layout only (same issue)
+  - `vssd2` in netlist only (PDN power net not labeled as port)
+  - Supporting evidence (not a mismatch): CVC reports 0 errors, and netgen reports the device classes as equivalent.
+
+## Gate-Level Simulation Results (Mar 13, 2026)
+
+All 5 cocotb tests passed in GL mode (iverilog + caravel_cocotb, no SDF annotation):
+
+| Test | Status | Sim Time (ns) | Wall Time (s) | GPIO[7:0] | Errors |
+|------|--------|---------------|----------------|-----------|--------|
+| ldpc_basic | **PASS** | 854,225 | 1,814 | 0xAB | 0 |
+| ldpc_noisy | **PASS** | 1,011,550 | 2,720 | 0xAB | 0 |
+| ldpc_max_iter | **PASS** | 1,104,525 | 3,393 | 0xAB | 0 |
+| ldpc_back_to_back | **PASS** | 1,140,375 | 3,371 | 0xAB | 0 |
+| ldpc_demo | **PASS** | 1,251,050 | 3,612 | 0xAB | 0 |
+
+- iverilog compilation: ~2h18m per test (1.1GB sim.vvp), 8.2GB RAM
+- Simulation: ~30-60 min per test (5-9GB VCD waveform)
+- All tests ran on snoke (247GB RAM), 4 tests in parallel
+- GPIO[7:0] = 0xAB is the firmware success code for all tests
+- No X-propagation or timing race issues observed
+
+## Wrapper Hardening Attempts (May 7-11, 2026) — Failed LVS Cosmetic-Fix Series
+
+After the May 1 `cf_wrapper_v5` golden run landed (commit `74ad20a` to origin / `1fcdc1d` to gitea) with 208 cosmetic LVS pin-match errors, a series of seven follow-up runs tried to eliminate those errors. **All seven failed.** The errors are a Magic SPICE-extraction limitation, not a hardening defect — no amount of RTL/placement tweaking will change Magic's behavior.
+
+### Timeline
+
+| Run | Date | Strategy | Result |
+|-----|------|----------|--------|
+| v6 | May 7 | First post-PDN-swap retry (commit `8cc8414` landed config changes); same wrapper RTL | Flow completed but KLayout crashed in final manufacturability step; same 208 LVS errors |
+| v7 | May 7 | Same as v6, re-run | Aborted mid-routing on `[DRT-0349]` LEF58_ENCLOSURE warnings — routing never completed |
+| v8 | May 8 | `manual_tieoffs.vh` with 206 per-pin `conb_1` cells + `manual_placements.json` placing each cell adjacent to its target pin; mprj moved `[60,15] → [60,200]` to make room | Flow completed; **same 208 LVS errors** — Magic still merged all constant-tied outputs. STA failed on `min_ss_100C_1v60` and `nom_tt_025C_1v80` corners |
+| v9 | May 9 | Same as v8 with `ERROR_ON_TR_DRC=false` to push through routing | **1780 routing DRC errors** (deferred). Magic streamout completed but DRC was never clean |
+| v10 | May 11 | Same family of placement tweaks | **1362 routing DRC errors** (deferred); same failure mode as v9 |
+| v11 | May 11 | One more attempt | Interrupted at step 01 (yosys-jsonheader); no harden process running |
+
+### Why every attempt failed
+
+The 208 LVS errors all come from **Magic SPICE extraction collapsing constant-tied nets**:
+
+- `la_data_out[127:0]` — all 128 bits tied to `1'b0` → Magic extracts as a single GND net → 127 pin labels lost (only one kept arbitrarily, often none)
+- `io_out[37:0]` — all 38 bits tied to `1'b0` → same merge
+- `io_oeb[37:0]` — all 38 bits tied to `1'b1` → merged into VDD net (Magic keeps the label for `io_oeb[9]` for unknown reasons)
+- `user_irq[2:1]` — tied to `2'b0` → merged into GND
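The bus widths listed above account for the 206 constant-tied pins that v8 tried to isolate with per-pin `conb_1` cells. The Python sketch below tallies them per constant net. The dictionary layout and function name are illustrative only, and the remaining two of the 208 errors are not itemized in this section, so the sketch does not try to account for them.

```python
# Bus name -> (width in bits, constant value it is tied to), as listed above.
TIED_OUTPUTS = {
    "la_data_out": (128, 0),
    "io_out": (38, 0),
    "io_oeb": (38, 1),
    "user_irq": (2, 0),
}


def pins_per_constant_net(tied_outputs):
    """Count how many output pins collapse onto each constant net."""
    per_net = {0: 0, 1: 0}
    for width, value in tied_outputs.values():
        per_net[value] += width
    return per_net


per_net = pins_per_constant_net(TIED_OUTPUTS)
total_pins = sum(per_net.values())
print(f"tied low: {per_net[0]}, tied high: {per_net[1]}, total: {total_pins}")
```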
+
+The v8 attempt — putting each pin behind its own `sky130_fd_sc_hd__conb_1` cell — does not break the merge because Magic's extractor still resolves each `conb_1` output as the constant `VPWR` or `VGND` and collapses them onto the global power/ground nets at the extracted-SPICE level. Per-pin cells generate distinct logical nets in the Verilog netlist but not distinct extracted nets in the layout. **Netgen itself reports "Device classes equivalent" and "Cell pin lists altered to match"** — the failure is bookkeeping, not electrical.
+
+### Approaches proven non-viable (don't try again)
+
+1. **Per-pin `conb_1` cells in the wrapper Verilog** — v8 disproved this. Magic optimizes them onto the constant nets.
+2. **Per-pin manual placement of tieoff cells** — placement doesn't change extraction behavior.
+3. **mprj location shifts** to make room for tieoff rows — doesn't help; cosmetic LVS persists.
+4. **Pushing routing-DRC tolerance up** (v9, v10) — produces broken layouts (1300–1800 routing DRC errors), worse than starting state.
+
+### Approaches that *could* work but were not attempted (deferred — too risky pre-deadline)
+
+1. **Drive 206 dummy zero outputs from inside `ldpc_decoder_top`** — would force each wrapper output to come from a distinct extracted macro pin instead of a constant-tied wrapper net. Requires a fresh macro re-harden, which risks breaking Run 6's golden timing on a non-deterministic Yosys run. 4–6 hour cost, high regression risk.
+2. **Post-extraction `.mag` editing** to add per-pin port labels — brittle and tool-specific; would not survive a re-harden.
+3. **Formal LVS waiver** (the chosen May 12 path): document the cosmetic nature of the errors, cite netgen's own "Device classes equivalent" line, and include the waiver in the submission packet.
+
+### Key lesson
+
+**The 208 LVS pin-match errors are not fixable with wrapper-only hardening.** Magic SPICE-extraction behavior is the root cause. Future sessions should not re-litigate this — either fix it inside the macro (re-harden risk) or formally waive it.
+
 ## Next Steps
- Implement syndrome pipeline (SYNDROME_S1 + SYNDROME_S2) to cut critical path from ~49 ns to ~16 ns
- Register Wishbone address input to fix secondary violation
- Re-synthesize with AREA 0 and run PnR to verify timing improvement
- Consider increasing die area for antenna repair headroom
- Consider `SYNTH_STRATEGY=AREA 1` as middle ground between AREA 0 and AREA 2
+- Submit with a formal LVS waiver (see `chip_ignite/docs/LVS_WAIVER.md`)
+- Confirm `cf precheck` and `cf verify ldpc_basic --sim gl` still pass on the HEAD wrapper state
+- `cf push` before 2026-05-13 deadline
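The RTL change in `rtl/ldpc_decoder_core.sv` in this commit splits the offset min-sum check-node update into `CN_STAGE1` and `CN_STAGE2`. The Python sketch below is a behavioral model of that split for a single check node, useful as a reference when reading the diff. It is not the RTL. `Q = 6` is an assumption inferred from the 5-bit magnitude constants, the layer count of 7 comes from the header comment, and all function names are hypothetical.

```python
Q = 6                           # assumed message width (sign + 5-bit magnitude)
MAG_MAX = (1 << (Q - 1)) - 1    # largest representable magnitude (31)


def cn_stage1(vn_to_cn):
    """Stage 1: extract signs and magnitudes, find the two smallest magnitudes."""
    signs = [m < 0 for m in vn_to_cn]
    mags = [min(abs(m), MAG_MAX) for m in vn_to_cn]  # clamp most-negative input
    sign_xor = False
    for s in signs:
        sign_xor ^= s
    min1 = min2 = MAG_MAX
    min1_idx = 0
    for i, mag in enumerate(mags):
        if mag < min1:
            min2, min1, min1_idx = min1, mag, i
        elif mag < min2:
            min2 = mag
    return signs, sign_xor, min1, min2, min1_idx


def cn_stage2(signs, sign_xor, min1, min2, min1_idx):
    """Stage 2: build extrinsic outputs with an offset of 1."""
    outputs = []
    for j, sign_j in enumerate(signs):
        mag = min2 if j == min1_idx else min1  # exclude own contribution
        mag = mag - 1 if mag > 1 else 0        # offset correction
        outputs.append(-mag if (sign_xor ^ sign_j) else mag)
    return outputs


def cycles_per_iteration(n_base=8, m_base=7, syndrome_cycles=3):
    """LAYER_READ + CN_STAGE1 + CN_STAGE2 + LAYER_WRITE per layer, plus syndrome."""
    cycles_per_layer = n_base + 1 + 1 + n_base
    return cycles_per_layer * m_base + syndrome_cycles


msgs = [10, -6, 12, 9, -15, 7, 20, -4]
print(cn_stage2(*cn_stage1(msgs)))
print(cycles_per_iteration())
```

Stage 1 produces only per-check-node summaries (signs, sign XOR, two minima, and the index of the first minimum), which is what makes them cheap to register between the two states.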
--- a/rtl/ldpc_decoder_core.sv
+++ b/rtl/ldpc_decoder_core.sv
@@ -2,14 +2,11 @@
 //
 // Layered scheduling processes one base-matrix row at a time.
 // For each row, we:
-//   1. Read VN beliefs for all Z columns connected to this row
-//   2. Subtract old CN->VN messages to get VN->CN messages
-//   3. Run CN min-sum update
-//   4. Add new CN->VN messages back to VN beliefs
-//   5. Write updated beliefs back
-//
-// This converges ~2x faster than flooding and needs only one message memory
-// (CN->VN messages for current layer, overwritten each layer).
+//   1. LAYER_READ (8 cycles): Read beliefs, subtract old messages → vn_to_cn
+//   2. CN_STAGE1 (1 cycle): Sign/mag extract, min-find (registered)
+//   3. CN_STAGE2 (1 cycle): Extrinsic output generation
+//   4. LAYER_WRITE (8 cycles): Write beliefs + update CN->VN messages
+// Total: 18 cycles/layer × 7 layers + 3 (syndrome) = 129 cycles/iteration

 module ldpc_decoder_core #(
    parameter N_BASE    = 8,
@@ -116,8 +113,9 @@ module ldpc_decoder_core #(
        IDLE,
        INIT,            // Initialize beliefs from channel LLRs, zero messages
        LAYER_READ,      // Read Z beliefs for each of DC columns in current row
-        CN_UPDATE,      // Run min-sum CN update on gathered messages
-        LAYER_WRITE,    // Write updated beliefs and new CN->VN messages
+        CN_STAGE1,       // Pipeline stage 1: sign/mag extract, min-find
+        CN_STAGE2,       // Pipeline stage 2: extrinsic output generation
+        LAYER_WRITE,     // Write beliefs + update CN->VN messages
        SYNDROME_S1,     // Syndrome pipeline stage 1: compute parity bits
        SYNDROME_S2,     // Syndrome pipeline stage 2: popcount parity vector
        SYNDROME_DONE,   // Read registered syndrome result
@@ -131,9 +129,16 @@ module ldpc_decoder_core #(
    logic [2:0]  col_idx;       // current column being read/written (0..N_BASE-1)
    logic [4:0]  effective_max_iter;

-    // Working registers for current layer CN update
-    logic signed [Q-1:0] vn_to_cn [DC][Z];  // VN->CN messages for current row
-    logic signed [Q-1:0] cn_to_vn [DC][Z];  // new CN->VN messages (output of min-sum)
+    // Working registers for current layer
+    logic signed [Q-1:0] vn_to_cn [DC][Z];
+    logic signed [Q-1:0] cn_to_vn [DC][Z];
+
+    // CN pipeline stage 1 intermediate registers
+    logic [DC-1:0]  s1_signs    [Z];
+    logic           s1_sign_xor [Z];
+    logic [Q-2:0]   s1_min1     [Z];
+    logic [Q-2:0]   s1_min2     [Z];
+    logic [2:0]     s1_min1_idx [Z];

    // Syndrome pipeline registers
    logic [M_BASE*Z-1:0] parity_vec;  // 224-bit registered parity results
@@ -165,14 +170,15 @@ module ldpc_decoder_core #(
        case (state)
            IDLE:        if (start) state_next = INIT;
            INIT:        state_next = LAYER_READ;
-            LAYER_READ:  if (col_idx == N_BASE - 1) state_next = CN_UPDATE;
-            CN_UPDATE:   state_next = LAYER_WRITE;
+            LAYER_READ:  if (col_idx == N_BASE - 1) state_next = CN_STAGE1;
+            CN_STAGE1:   state_next = CN_STAGE2;
+            CN_STAGE2:   state_next = LAYER_WRITE;
            LAYER_WRITE: begin
                if (col_idx == N_BASE - 1) begin
                    if (row_idx == M_BASE - 1)
                        state_next = SYNDROME_S1;
                    else
-                        state_next = LAYER_READ;  // next row
+                        state_next = LAYER_READ;
                end
            end
            SYNDROME_S1: state_next = SYNDROME_S2;
@@ -183,7 +189,7 @@ module ldpc_decoder_core #(
                else if (iter_cnt >= effective_max_iter)
                    state_next = DONE;
                else
-                    state_next = LAYER_READ;  // next iteration
+                    state_next = LAYER_READ;
            end
            DONE:        if (!start) state_next = IDLE;
            default:     state_next = IDLE;
@@ -269,43 +275,86 @@ module ldpc_decoder_core #(
                        col_idx <= col_idx + 1;
                end

-                CN_UPDATE: begin
-                    // Min-sum update for all Z check nodes in current row
-                    // Each CN has DC=8 incoming messages (one per column)
+                // =============================================================
+                // CN Pipeline Stage 1: Extract signs/mags, find min1/min2
+                // =============================================================
+                CN_STAGE1: begin
                    for (int z = 0; z < Z; z++) begin
-                        // Min-sum: pass individual VN->CN messages directly
-                        cn_min_sum(vn_to_cn[0][z], vn_to_cn[1][z],
-                                   vn_to_cn[2][z], vn_to_cn[3][z],
-                                   vn_to_cn[4][z], vn_to_cn[5][z],
-                                   vn_to_cn[6][z], vn_to_cn[7][z],
-                                   cn_to_vn[0][z], cn_to_vn[1][z],
-                                   cn_to_vn[2][z], cn_to_vn[3][z],
-                                   cn_to_vn[4][z], cn_to_vn[5][z],
-                                   cn_to_vn[6][z], cn_to_vn[7][z]);
+                        logic [DC-1:0]  signs_w;
+                        logic           sign_xor_w;
+                        logic [Q-2:0]   mags_w [DC];
+                        logic [Q-2:0]   min1_w, min2_w;
+                        int             min1_idx_w;
+
+                        sign_xor_w = 1'b0;
+                        for (int i = 0; i < DC; i++) begin
+                            logic [Q-1:0] abs_val;
+                            signs_w[i] = vn_to_cn[i][z][Q-1];
+                            if (vn_to_cn[i][z][Q-1]) begin
+                                abs_val = ~vn_to_cn[i][z] + 1'b1;
+                                mags_w[i] = (abs_val[Q-1]) ? {(Q-1){1'b1}} : abs_val[Q-2:0];
+                            end else begin
+                                mags_w[i] = vn_to_cn[i][z][Q-2:0];
                            end
-                    col_idx <= '0;  // prepare for LAYER_WRITE
+                            sign_xor_w = sign_xor_w ^ signs_w[i];
                        end

+                        min1_w = {(Q-1){1'b1}};
+                        min2_w = {(Q-1){1'b1}};
+                        min1_idx_w = 0;
+                        for (int i = 0; i < DC; i++) begin
+                            if (mags_w[i] < min1_w) begin
+                                min2_w     = min1_w;
+                                min1_w     = mags_w[i];
+                                min1_idx_w = i;
+                            end else if (mags_w[i] < min2_w) begin
+                                min2_w = mags_w[i];
+                            end
+                        end
+
+                        s1_signs[z]    = signs_w;
+                        s1_sign_xor[z] = sign_xor_w;
+                        s1_min1[z]     = min1_w;
+                        s1_min2[z]     = min2_w;
+                        s1_min1_idx[z] = min1_idx_w[2:0];
+                    end
+                end
+
+                // =============================================================
+                // CN Pipeline Stage 2: Compute extrinsic outputs + pre-register
+                // first LAYER_WRITE shift value
+                // =============================================================
+                CN_STAGE2: begin
+                    for (int z = 0; z < Z; z++) begin
+                        for (int j = 0; j < DC; j++) begin
+                            logic [Q-2:0] mag_out;
+                            logic         sign_out;
+
+                            mag_out  = (j[2:0] == s1_min1_idx[z]) ? s1_min2[z] : s1_min1[z];
+                            mag_out  = (mag_out > 5'd1) ? (mag_out - 5'd1) : 5'd0;
+                            sign_out = s1_sign_xor[z] ^ s1_signs[z][j];
+
+                            cn_to_vn[j][z] <= sign_out ? (~{1'b0, mag_out} + 1'b1) : {1'b0, mag_out};
+                        end
+                    end
+                    col_idx <= '0;
+                end
+
+                // =============================================================
+                // LAYER_WRITE: Write beliefs and update CN->VN messages
+                // =============================================================
                LAYER_WRITE: begin
-                    // Write back: update beliefs and store new CN->VN messages
-                    // Skip unconnected columns (H_BASE == -1)
                    if (H_BASE[row_idx][col_idx] >= 0) begin
                        for (int z = 0; z < Z; z++) begin
-                            int bit_idx;
                            int shifted_z;
-                            logic signed [Q-1:0] new_msg;
-                            logic signed [Q-1:0] old_extrinsic;
+                            int bit_idx;

                            shifted_z = (z + H_BASE[row_idx][col_idx]) % Z;
                            bit_idx   = int'(col_idx) * Z + shifted_z;
-                            new_msg   = cn_to_vn[col_idx][z];
-                            old_extrinsic = vn_to_cn[col_idx][z];

-                            // belief = extrinsic (VN->CN) + new CN->VN message
-                            beliefs[bit_idx] <= sat_add(old_extrinsic, new_msg);
-
-                            // Store new message for next iteration
-                            msg_cn2vn[row_idx][col_idx][z] <= new_msg;
+                            beliefs[bit_idx] <= sat_add(vn_to_cn[col_idx][z],
+                                                        cn_to_vn[col_idx][z]);
+                            msg_cn2vn[row_idx][col_idx][z] <= cn_to_vn[col_idx][z];
                        end
                    end

@@ -386,78 +435,7 @@ module ldpc_decoder_core #(
    end

    // =========================================================================
-    // Min-sum CN update function
-    // =========================================================================
-
-    // Offset min-sum for DC=8 inputs (individual ports for iverilog compatibility)
-    // For each output j: sign = XOR of all other signs, magnitude = min of all other magnitudes - offset
-    task automatic cn_min_sum(
-        input  logic signed [Q-1:0] in0, in1, in2, in3,
-                                     in4, in5, in6, in7,
-        output logic signed [Q-1:0] out0, out1, out2, out3,
-                                     out4, out5, out6, out7
-    );
-        logic signed [Q-1:0] ins [DC];
-        logic [DC-1:0] signs;
-        logic [Q-2:0]  mags [DC];
-        logic          sign_xor;
-        logic [Q-2:0]  min1, min2;
-        int            min1_idx;
-        logic signed [Q-1:0] outs [DC];
-
-        ins[0] = in0; ins[1] = in1; ins[2] = in2; ins[3] = in3;
-        ins[4] = in4; ins[5] = in5; ins[6] = in6; ins[7] = in7;
-
-        // Extract signs and magnitudes
-        // Note: -32 (100000) has magnitude 32 which overflows 5-bit field to 0.
-        // Clamp to 31 (max representable magnitude) to avoid corruption.
-        sign_xor = 1'b0;
-        for (int i = 0; i < DC; i++) begin
-            logic [Q-1:0] abs_val;
-            signs[i] = ins[i][Q-1];
-            if (ins[i][Q-1]) begin
-                abs_val = ~ins[i] + 1'b1;
-                // If abs_val overflowed (input was most negative), clamp
-                mags[i] = (abs_val[Q-1]) ? {(Q-1){1'b1}} : abs_val[Q-2:0];
-            end else begin
-                mags[i] = ins[i][Q-2:0];
-            end
-            sign_xor = sign_xor ^ signs[i];
-        end
-
-        // Find two smallest magnitudes
-        min1 = {(Q-1){1'b1}};
-        min2 = {(Q-1){1'b1}};
-        min1_idx = 0;
-        for (int i = 0; i < DC; i++) begin
-            if (mags[i] < min1) begin
-                min2     = min1;
-                min1     = mags[i];
-                min1_idx = i;
-            end else if (mags[i] < min2) begin
-                min2 = mags[i];
-            end
-        end
-
-        // Compute extrinsic outputs with offset correction
-        for (int j = 0; j < DC; j++) begin
-            logic [Q-2:0] mag_out;
-            logic          sign_out;
-
-            mag_out  = (j == min1_idx) ? min2 : min1;
-            // Offset correction (subtract 1 in integer representation)
-            mag_out  = (mag_out > 1) ? (mag_out - 1) : {(Q-1){1'b0}};
-            sign_out = sign_xor ^ signs[j];
-
-            outs[j] = sign_out ? (~{1'b0, mag_out} + 1) : {1'b0, mag_out};
-        end
-
-        out0 = outs[0]; out1 = outs[1]; out2 = outs[2]; out3 = outs[3];
-        out4 = outs[4]; out5 = outs[5]; out6 = outs[6]; out7 = outs[7];
-    endtask
-
-    // =========================================================================
-    // Saturating arithmetic helpers (Yosys-compatible: no return, no complex concat)
+    // Saturating arithmetic (Yosys-compatible)
    // =========================================================================

    function automatic logic signed [Q-1:0] sat_add(