ldpc_optical/docs/hardening-results.md
cah 1f4b62454f docs: add Run 5-7 hardening results and lessons learned
- Run 5: syndrome pipeline with serial popcount (no improvement)
- Run 6: balanced popcount adder tree — TT timing MET at 50 MHz
- Run 7 series (8 attempts): LAYER_WRITE pipeline exploration
  - LAYER_WRITE split not viable (cell explosion / PnR divergence)
  - Yosys synthesis non-determinism documented
  - Hold margin sensitivity (0.4/0.2 vs 0.5/0.3) identified
  - Run 7h reproduces Run 6 by reusing golden synthesis netlist

Key finding: balanced_popcount synthesis netlist is the golden reference
for all future PnR iterations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 19:42:23 -06:00

LDPC Decoder Hardening Results

Run 1: 26_02_25_21_11 (Feb 25, 2026) — FAILED

  • RTL: Original (unpipelined CN update)
  • Config: CLOCK_PERIOD=20 (50 MHz), RUN_HEURISTIC_DIODE_INSERTION=true, HEURISTIC_ANTENNA_THRESHOLD=110
  • Die area: 2800 x 1760 µm (4.93 mm²)
  • Failure: GRT-0118 routing congestion after heuristic diode insertion (66,016 diodes added)
  • Notes: Initial global routing passed (0 overflow, 39% routing utilization). Diode insertion nearly doubled cell count, causing re-routing congestion failure.

Run 2: reuse_synth (Feb 27, 2026) — COMPLETED (timing violations)

  • RTL: Original (unpipelined CN update) — reused synthesis netlist from Run 1
  • Config: CLOCK_PERIOD=20 (50 MHz), RUN_HEURISTIC_DIODE_INSERTION=false, RUN_ANTENNA_REPAIR=true
  • Die area: 2800 x 1760 µm (4.93 mm²)
  • Result: All 70 steps completed. GDS generated. Deferred timing errors.

Physical Results

| Metric | Result |
|---|---|
| Magic DRC | Clean |
| KLayout DRC | Clean |
| LVS | Clean (0 errors, 0 unmatched) |
| XOR (Magic vs KLayout) | Clean |
| Illegal overlap | Clean |
| Power grid violations | 0 |
| Antenna violating nets | 658 |
| Antenna violating pins | 905 |

Area & Utilization

| Metric | Value |
|---|---|
| Die area | 4,928,000 µm² (4.93 mm²) |
| Core area | 4,846,670 µm² |
| Instance count | 184,663 |
| Instance area | 1,303,260 µm² (1.30 mm²) |
| Core utilization | 26.9% |
| Sequential cells | 16,967 |
| Combinational cells | 61,366 |
| Timing repair buffers | 23,709 |
| Fill cells | 415,149 |
| Tap cells | 69,228 |
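
As a quick sanity check, the core utilization figure follows from instance area divided by core area. The sketch below reproduces the Run 2 value; the function name is illustrative and not part of the flow.

```python
def core_utilization(instance_area_um2: float, core_area_um2: float) -> float:
    """Return core utilization as a percentage of the core area."""
    return 100.0 * instance_area_um2 / core_area_um2


# Run 2 figures taken from the table above.
run2_util = core_utilization(1_303_260, 4_846_670)
print(f"Run 2 core utilization: {run2_util:.1f}%")  # 26.9%
```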

Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)

| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) | Setup Violations |
|---|---|---|---|---|---|
| nom_tt_025C_1v80 | -27.13 | -234.9 | -0.32 | -3.76 | 9 |
| nom_ss_100C_1v60 | -70.58 | -29,946.3 | 0.06 | 0 | 5,463 |
| nom_ff_n40C_1v95 | -10.18 | -86.3 | -0.26 | -12.4 | |
| Worst across all | -71.40 | -34,329.1 | -0.47 | -26.4 | |

Estimated Max Frequency

  • TT corner: Critical path ~47 ns → ~21 MHz
  • SS corner: Critical path ~91 ns → ~11 MHz
  • FF corner: Critical path ~30 ns → ~33 MHz
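
The frequency estimates above are consistent with taking the critical path as the clock period minus the setup WNS. A minimal sketch of that arithmetic, assuming clock skew and uncertainty are ignored (the helper name is hypothetical):

```python
def fmax_mhz(clock_period_ns: float, setup_wns_ns: float) -> tuple[float, float]:
    """Estimate (critical path in ns, max frequency in MHz) from setup WNS.

    A negative WNS lengthens the path beyond the clock period;
    a positive WNS shortens it.
    """
    critical_path_ns = clock_period_ns - setup_wns_ns
    return critical_path_ns, 1000.0 / critical_path_ns


# Run 2 post-route setup WNS per corner, 20 ns target period.
for corner, wns in [("TT", -27.13), ("SS", -70.58), ("FF", -10.18)]:
    path_ns, freq = fmax_mhz(20.0, wns)
    print(f"{corner}: ~{path_ns:.0f} ns -> ~{freq:.0f} MHz")
```

The same helper applies to later runs by substituting their WNS values.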

Power (TT corner)

| Component | Power (W) |
|---|---|
| Internal | 0.0554 |
| Switching | 0.0273 |
| Leakage | ~0.000002 (~0.002 mW) |
| Total | 0.0827 |

Key Observations

  1. Disabling heuristic diode insertion fixed the routing congestion failure from Run 1
  2. 658 antenna violations remain — iterative antenna repair was not sufficient. May need to re-enable heuristic insertion with a higher threshold or use DIODE_ON_PORTS
  3. Setup timing is severely violated — critical path is ~47 ns at TT, far from 20 ns target
  4. This run used the unpipelined RTL (synthesis reused from Run 1 which predated the CN pipeline split)
  5. Next run should re-synthesize with pipelined CN update RTL to see if timing improves

Run 3: pipelined_pnr (Mar 1, 2026) — FAILED

  • RTL: Pipelined CN update (CN_STAGE1 + CN_STAGE2)
  • Config: CLOCK_PERIOD=20 (50 MHz), SYNTH_STRATEGY=AREA 0, RUN_HEURISTIC_DIODE_INSERTION=false, RUN_ANTENNA_REPAIR=true
  • Die area: 2800 x 1760 µm (4.93 mm²)
  • Failure: GRT-0118 routing congestion during iterative antenna repair (step 36), after 13+ hours of repair loops
  • Notes: Iterative antenna repair kept inserting diodes and re-routing until congestion became too high. Same root cause as Run 1, but via a different mechanism.

Run 3b: pipelined_synth (Feb 28, 2026) — STILL RUNNING

  • RTL: Pipelined CN update
  • Config: SYNTH_STRATEGY=AREA 2 — synthesis only
  • Status: ABC pass 2 (tech mapping) running 20+ hours. AREA 2 is far too aggressive for this design size. Do not use AREA 2 for this design.

Run 4: pipelined_noantenna (Mar 2, 2026) — COMPLETED (timing violations)

  • RTL: Pipelined CN update (CN_STAGE1 + CN_STAGE2)
  • Config: CLOCK_PERIOD=20 (50 MHz), SYNTH_STRATEGY=AREA 0, RUN_HEURISTIC_DIODE_INSERTION=false, RUN_ANTENNA_REPAIR=false
  • Die area: 2800 x 1760 µm (4.93 mm²)
  • Result: All 69 steps completed. GDS generated. Deferred timing errors. No antenna repair attempted.

Physical Results

| Metric | Result |
|---|---|
| Magic DRC | Clean |
| KLayout DRC | Clean |
| LVS | Clean (0 errors, 0 unmatched) |
| XOR (Magic vs KLayout) | Clean |
| Illegal overlap | Clean |
| Antenna violating nets | 1,707 (no repair attempted) |
| Antenna violating pins | 3,319 (no repair attempted) |

Area & Utilization

| Metric | Value |
|---|---|
| Die area | 4,928,000 µm² (4.93 mm²) |
| Instance count | 183,774 |
| Instance area | 1,351,790 µm² (1.35 mm²) |
| Core utilization | 27.9% |

Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)

| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
|---|---|---|---|---|
| nom_tt_025C_1v80 | -28.86 | -348.0 | -0.08 | -0.15 |
| nom_ss_100C_1v60 | -74.22 | -20,536.0 | -0.07 | -0.07 |
| nom_ff_n40C_1v95 | -11.04 | -93.8 | -0.12 | -2.15 |
| min_tt_025C_1v80 | -28.39 | -251.0 | 0 | 0 |
| max_tt_025C_1v80 | -29.36 | -725.1 | -0.24 | -2.15 |

Estimated Max Frequency

  • TT corner: Critical path ~49 ns → ~20 MHz
  • SS corner: Critical path ~94 ns → ~11 MHz
  • FF corner: Critical path ~31 ns → ~32 MHz

Power (TT corner)

Metric Value
Total 0.0858 W

Key Observations

  1. Pipelined CN update did NOT improve timing: TT WNS is -28.86 ns vs -27.13 ns in unpipelined Run 2 (1.73 ns worse), possibly due to the AREA 0 vs AREA 2 synth strategy difference.
  2. Hold violations are much smaller than Run 2 (-0.08 vs -0.32 ns), nearly clean.
  3. Antenna violations increased to 1,707 nets (vs 658 in Run 2) without any repair — AREA 0 produces a less antenna-friendly netlist.
  4. The critical path is still ~47-49 ns, suggesting the bottleneck is NOT the CN update pipeline stage but something else (likely the large mux/barrel shifter or belief update logic).
  5. SYNTH_STRATEGY=AREA 2 takes 20+ hours for ABC tech mapping on this design — never use it. AREA 0 completed in reasonable time.

Summary Table

| Run | RTL | Synth | Antenna | Status | TT Setup WNS | Max Freq (TT) |
|---|---|---|---|---|---|---|
| 1 | Unpipelined | AREA 2 | Heuristic 110µm | FAILED (congestion) | | |
| 2 | Unpipelined | AREA 2 | Iterative | COMPLETED | -27.13 ns | ~21 MHz |
| 3 | Pipelined | AREA 0 | Iterative | FAILED (congestion) | | |
| 3b | Pipelined | AREA 2 | n/a (synth only) | Still running (20+ hrs) | | |
| 4 | Pipelined | AREA 0 | None | COMPLETED | -28.86 ns | ~20 MHz |

Critical Path Analysis (from Run 4, pipelined_noantenna)

Path Summary

| Item | Value |
|---|---|
| Startpoint | u_core.beliefs[0][5] (beliefs register, bit 5 of element 0) |
| Endpoint | syndrome_weight[7] (MSB of syndrome weight counter) |
| RTL location | SYNDROME state in ldpc_decoder_core.sv, lines 363-385 |
| Slack | -28.859 ns (VIOLATED) |
| Total combinational delay | 47.67 ns |
| Logic levels | 222 (171 XOR/XNOR + 51 adder/mux) |
| Logic vs wire delay | 99.7% logic / 0.3% wire |

All 8 worst setup violators fan out from beliefs[0][5] to syndrome_weight[7:0].

What the Critical Path Computes

The SYNDROME state computes the full syndrome check in a single clock cycle:

  1. Parity computation (171 XOR levels, 33.9 ns): XOR the sign bits of all beliefs connected to each check node. With 7 rows x 32 z-elements there are 224 parity bits, each combining up to 3 columns, and together they read from 256 belief sign bits.
  2. Population count (51 adder levels, 13.6 ns): Sum all 224 parity results into an 8-bit syndrome_cnt.

The syndrome_cnt = syndrome_cnt + 1 accumulation pattern creates a carry chain dependency that serializes everything.

Delay Breakdown

| Segment | Delay (ns) | Cells | Description |
|---|---|---|---|
| Source CLK-to-Q | 0.795 | 1 (dfxtp_4) | beliefs[0][5] register output |
| Parity XOR chain | 33.888 | 171 (xor2/xnor2) | XOR reduction across belief sign bits |
| Popcount adder tree | 13.634 | 51 (and/or/aoi/oai) | 224-bit popcount to 8-bit count |
| State MUX | 0.148 | 1 (mux2_1) | FSM output mux |
| Wire (interconnect) | 0.149 | | 0.3% of total, negligible |
| Total | 48.614 | 222 levels | |
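
To make the breakdown easier to read, the sketch below recomputes the total, each segment's share of it, and the average delay per cell from the figures in the table. The dictionary keys and function name are illustrative only.

```python
# (delay in ns, cell count) per segment, copied from the delay breakdown table.
SEGMENTS = {
    "clk_to_q": (0.795, 1),
    "parity_xor_chain": (33.888, 171),
    "popcount_adders": (13.634, 51),
    "state_mux": (0.148, 1),
    "wire": (0.149, 0),  # interconnect has no cells
}


def summarize(segments):
    """Return (total_ns, {name: (share_percent, ns_per_cell or None)})."""
    total = sum(delay for delay, _ in segments.values())
    rows = {}
    for name, (delay, cells) in segments.items():
        per_cell = delay / cells if cells else None
        rows[name] = (100.0 * delay / total, per_cell)
    return total, rows


total_ns, rows = summarize(SEGMENTS)
print(f"total: {total_ns:.3f} ns")
for name, (share, per_cell) in rows.items():
    per = f"{per_cell:.3f} ns/cell" if per_cell is not None else "n/a"
    print(f"{name:18s} {share:5.1f}%  {per}")
```

The XOR chain accounts for roughly 70% of the path at about 0.2 ns per level, so the depth of the chain, not any single slow cell, is what dominates.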

Proposed Fix: 2-3 Stage Syndrome Pipeline

SYNDROME_S1 (cycle 1, ~16 ns): Compute all 224 parity bits in parallel. Each parity is only 2-3 XOR operations deep (one per connected column). Register the 224-bit parity_vec.

SYNDROME_S2 (cycle 2, ~14 ns): Popcount the 224-bit parity vector via balanced adder tree. Register the 8-bit syndrome_weight and syndrome_ok flag.

SYNDROME_DONE (cycle 3): Already exists — reads syndrome_ok.

Estimated post-fix critical path: ~14-16 ns (comfortably under 20 ns / 50 MHz). Latency impact: +1-2 cycles per iteration (negligible at 30 iterations).
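
The staging can be illustrated with a small behavioral model. This is only a sketch of the proposed split, not the RTL: the check-to-column connectivity below is a toy example rather than the real H matrix, and the class and function names are hypothetical.

```python
from typing import List, Optional, Sequence


def parity_stage(signs: Sequence[int], checks: Sequence[Sequence[int]]) -> List[int]:
    """SYNDROME_S1: one parity bit per check, XOR of its connected sign bits."""
    parity_vec = []
    for connected in checks:
        p = 0
        for idx in connected:
            p ^= signs[idx]
        parity_vec.append(p)
    return parity_vec


def popcount_stage(parity_vec: Sequence[int]):
    """SYNDROME_S2: count unsatisfied checks and derive the ok flag."""
    weight = sum(parity_vec)
    return weight, weight == 0


class SyndromePipeline:
    """Two registered stages; results appear two clock edges after the inputs."""

    def __init__(self, checks):
        self.checks = checks
        self.parity_reg: Optional[List[int]] = None
        self.weight_reg: Optional[int] = None
        self.ok_reg: Optional[bool] = None

    def clock(self, signs):
        # Stage 2 consumes what stage 1 registered on the previous edge.
        if self.parity_reg is not None:
            self.weight_reg, self.ok_reg = popcount_stage(self.parity_reg)
        self.parity_reg = parity_stage(signs, self.checks)


# Toy connectivity: 3 checks over 5 belief sign bits (assumption for illustration).
pipe = SyndromePipeline(checks=[(0, 1, 2), (1, 3), (2, 3, 4)])
signs = [1, 0, 1, 1, 0]
pipe.clock(signs)  # cycle 1: parity_vec registered
pipe.clock(signs)  # cycle 2: weight and ok flag registered
print(pipe.weight_reg, pipe.ok_reg)  # 1 False
```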

Secondary Violations

Wishbone address input (wb_adr_i) has a -2.47 ns setup violation. Fixable by registering the address at the decoder boundary.

Run 5: syndrome_pipeline (Mar 3, 2026) — COMPLETED (timing violations)

  • RTL: Pipelined CN + syndrome pipeline (SYNDROME_S1 + SYNDROME_S2 with serial popcount)
  • Config: CLOCK_PERIOD=20 (50 MHz), SYNTH_STRATEGY=AREA 0, RUN_ANTENNA_REPAIR=false
  • Die area: 2800 x 1760 µm (4.93 mm²)
  • Result: All 75 steps completed. DRC/LVS clean.
  • TT Setup WNS: -28.98 ns — no improvement from Run 4
  • Root cause: Yosys synthesizes the loop-carried syndrome_cnt = syndrome_cnt + 1 dependency as a serial ~48 ns chain
  • Lesson: Splitting parity and popcount into 2 cycles does not help if the popcount itself is still serial

Run 6: balanced_popcount (Mar 4, 2026) — COMPLETED (TT timing MET!)

  • RTL: Pipelined CN + syndrome pipeline with balanced 4-wide adder tree popcount
  • Config: CLOCK_PERIOD=20 (50 MHz), SYNTH_STRATEGY=AREA 0, RUN_ANTENNA_REPAIR=false
  • Die area: 2800 x 1760 µm (4.93 mm²)
  • Result: All 75 steps completed. DRC/LVS clean. TT timing met!

Physical Results

| Metric | Result |
|---|---|
| Magic DRC | Clean |
| KLayout DRC | Clean |
| LVS | Clean (0 errors, 0 unmatched) |
| Antenna violating nets | 1,687 (no repair attempted) |

Area & Utilization

| Metric | Value |
|---|---|
| Die area | 4,928,000 µm² (4.93 mm²) |
| Instance count | 186,915 |
| Instance area | 1,367,580 µm² (1.37 mm²) |
| Core utilization | 28.2% |
| Sequential cells | 18,056 |
| Timing repair buffers | 27,864 |

Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)

| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
|---|---|---|---|---|
| nom_tt_025C_1v80 | 0.0 | 0 | -0.45 | -10.5 |
| nom_ss_100C_1v60 | -9.18 | -12,474.4 | -0.17 | -0.21 |
| nom_ff_n40C_1v95 | 0.0 | 0 | -0.37 | -38.6 |
| max_ss_100C_1v60 | -10.45 | -15,896.8 | -0.44 | -0.87 |

Estimated Max Frequency

  • TT corner: 50 MHz — TIMING MET
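  • SS corner: Data arrival time ~40 ns → ~25 MHz (up from ~11 MHz). Note that the period-minus-WNS estimate used for Runs 2 and 4 would give 20 + 9.18 ≈ 29 ns → ~34 MHz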
  • FF corner: 50 MHz — TIMING MET

New Critical Path (SS corner)

| Item | Value |
|---|---|
| Startpoint | u_core.col_idx[0] (column index register) |
| Endpoint | u_core.beliefs registers |
| Slack | -9.18 ns (nom_ss) |
| Data arrival time | 40.15 ns |
| Description | Belief update mux path during LAYER_READ/LAYER_WRITE |

The syndrome path is NO LONGER critical. The new bottleneck is the column-indexed mux/barrel-shifter path used during belief reads and writes.

Key Observations

  1. Balanced popcount tree eliminated the syndrome bottleneck — WNS improved from -28.98 ns to 0.0 ns at TT
  2. TT and FF corners now fully meet 50 MHz timing
  3. SS corner still fails (-9.18 ns) due to a different path: belief update mux indexed by col_idx
  4. Hold violations are minor (-0.45 ns) and can be fixed with post-route optimization
  5. 1,687 antenna violations need to be addressed (antenna repair was disabled)
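
The difference between the serial accumulation of Run 5 and the balanced tree of Run 6 can be sketched by counting dependent addition steps. This is a behavioral model with illustrative names, assuming a fan-in of 4 per tree node as in the Run 6 RTL description; it counts adder-node levels, not gate levels, since each node is itself several gates deep.

```python
def popcount_serial(bits):
    """Serial accumulation: every addition depends on the previous one."""
    count = 0
    dependent_steps = 0
    for b in bits:
        count += b
        dependent_steps += 1
    return count, dependent_steps


def popcount_tree(bits, fan_in=4):
    """Balanced reduction: sum groups of `fan_in` values level by level."""
    level = list(bits)
    levels = 0
    while len(level) > 1:
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        levels += 1
    return (level[0] if level else 0), levels


parity_vec = [1] * 224  # worst case: every check unsatisfied
print(popcount_serial(parity_vec))  # (224, 224)
print(popcount_tree(parity_vec))    # (224, 4)
```

For 224 inputs the tree shrinks as 224, 56, 14, 4, 1, so the chain of dependent additions drops from 224 to 4.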

Updated Summary Table

| Run | RTL | Key Change | Antenna | Status | TT Setup WNS | Max Freq (TT) |
|---|---|---|---|---|---|---|
| 1 | Unpipelined | | Heuristic | FAILED | | |
| 2 | Unpipelined | | Iterative | COMPLETED | -27.13 ns | ~21 MHz |
| 3 | Pipelined CN | CN pipeline | Iterative | FAILED | | |
| 4 | Pipelined CN | CN pipeline | None | COMPLETED | -28.86 ns | ~20 MHz |
| 5 | + Syndrome pipeline | Serial popcount | None | COMPLETED | -28.98 ns | ~20 MHz |
| 6 | + Balanced popcount | Adder tree | None | COMPLETED | 0.0 ns | 50 MHz |

Run 7a: pipelined_layer2 (Mar 9, 2026) — FAILED

  • RTL: Run 6 + LAYER_WRITE split into LAYER_WRITE_ADDR + LAYER_WRITE_DATA
  • Config: CLOCK_PERIOD=20, DIODE_ON_PORTS=in, HEURISTIC_ANTENNA_THRESHOLD=200
  • Failure: GRT-0118 routing congestion — heuristic diode insertion on input ports added too many cells
  • Lesson: Any heuristic diode insertion causes GRT failure on this design

Run 7b: pipelined_layer3 (Mar 9, 2026) — FAILED

  • RTL: Same as 7a (LAYER_WRITE_ADDR/DATA split)
  • Config: DIODE_ON_PORTS=none, RUN_HEURISTIC_DIODE_INSERTION=false
  • Failure: Post-CTS resizer diverged — 2.5+ hours at 100% CPU, memory climbing linearly, never converging
  • Lesson: LAYER_WRITE pipeline split creates too many paths for OpenROAD resizer

Run 7c: pre_shift (Mar 9, 2026) — FAILED

  • RTL: Run 6 + pre-registered H_BASE shift lookahead (H_BASE[row_idx][col_idx+1])
  • Config: Same as 7b
  • Failure: GPL-0302 placement density overflow — 150K cells at 41.3% exceeded 40% target
  • Root cause: Yosys cannot fold H_BASE constants through registers → full 256:1 write mux explosion (~2x cell count vs Run 6's 83K)
  • Lesson: Registering H_BASE shift values prevents Yosys constant folding

Run 7d: run6_baseline (Mar 9, 2026) — FAILED

  • RTL: Reverted to Run 6 baseline (identical RTL)
  • Config: DIODE_ON_PORTS=in (inadvertently left from earlier runs), RUN_HEURISTIC_DIODE_INSERTION=false
  • Cells: 85,500
  • Failure: GRT-0118 routing congestion
  • Root cause: DIODE_ON_PORTS=in inserts diodes on input ports even when heuristic insertion is disabled

Run 7e: run6b_nodiode (Mar 10, 2026) — FAILED

  • RTL: Run 6 baseline
  • Config: DIODE_ON_PORTS=none, hold margins 0.5/0.3 (from config.json), reused run6_baseline synthesis
  • Failure: Post-CTS resizer diverged (9+ GiB memory, 3+ hours, never converged)
  • Root cause: Reusing synthesis from a run with different config (DIODE_ON_PORTS=in) produces a subtly different netlist that causes PnR divergence

Run 7f: run6_clean (Mar 10, 2026) — FAILED

  • RTL: Run 6 baseline, clean full run from scratch
  • Config: DIODE_ON_PORTS=none, hold margins 0.5/0.3
  • Cells: 85,500
  • Hold buffers inserted: 35,506
  • Failure: GRT-0118 routing congestion
  • Root cause: Higher hold slack margins (0.5/0.3 vs balanced_popcount's 0.4/0.2) caused 13K extra hold buffers (35K vs 22K), pushing routing congestion over GRT threshold

Run 7g: run6_fixhold (Mar 10, 2026) — FAILED

  • RTL: Run 6 baseline, reused run6_clean synthesis
  • Config: DIODE_ON_PORTS=none, hold margins 0.4/0.2 (matching balanced_popcount)
  • Failure: Post-CTS resizer diverged (14+ GiB, 3.5+ hours)
  • Root cause: Yosys non-determinism — run6_clean synthesis produced a slightly different cell mix that didn't route cleanly despite identical config

Run 7h: run6_reuse_bp (Mar 10, 2026) — COMPLETED (reproduces Run 6!)

  • RTL: Run 6 baseline, reused balanced_popcount's actual synthesis netlist
  • Config: DIODE_ON_PORTS=none, hold margins 0.4/0.2
  • Result: All stages completed. DRC/LVS clean. TT timing met!
  • Hold buffers: 22,095 (identical to balanced_popcount)

Physical Results

| Metric | Result |
|---|---|
| Magic DRC | Clean |
| KLayout DRC | Clean |
| LVS | Clean (circuits match uniquely) |
| Antenna violating nets | 1,687 (repair disabled) |
| Antenna violating pins | 3,416 (repair disabled) |

Area & Utilization

| Metric | Value |
|---|---|
| Die area | 4,928,000 µm² (4.93 mm²) |
| Instance count | 186,915 |
| Instance area | 1,367,580 µm² (1.37 mm²) |
| Core utilization | 28.2% |

Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)

| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
|---|---|---|---|---|
| nom_tt_025C_1v80 | +3.28 | 0 | -0.45 | -10.5 |
| nom_ss_100C_1v60 | -9.18 | -12,474 | -0.17 | -0.21 |
| nom_ff_n40C_1v95 | +5.93 | 0 | -0.37 | -38.6 |
| max_ss_100C_1v60 | -10.45 | -15,897 | -0.44 | -0.87 |
| min_tt_025C_1v80 | +3.71 | 0 | -0.26 | -1.66 |
| max_tt_025C_1v80 | +2.90 | 0 | -0.62 | -29.5 |

Key Observations

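  1. Results match Run 6 (same instance count, instance area, and SS/hold slack), confirming that the balanced_popcount synthesis netlist is the key ingredient. Run 6 lists TT and FF setup WNS as 0.0, which is consistent with a WNS figure clamped at zero when there are no violations, whereas this table shows the positive slack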
  2. Yosys non-determinism is significant: re-synthesizing the same RTL with same config produces netlists that fail PnR
  3. Hold violations (1,543 total) are all on input port paths (wb_dat_i, wb_adr_i), zero reg-to-reg — fixable with input delay constraints
  4. Max slew violations (4,112) and max cap violations (655) concentrated in SS corner

Updated Summary Table

| Run | RTL | Key Change | Antenna | Status | TT Setup WNS | Max Freq (TT) |
|---|---|---|---|---|---|---|
| 1 | Unpipelined | | Heuristic | FAILED | | |
| 2 | Unpipelined | | Iterative | COMPLETED | -27.13 ns | ~21 MHz |
| 3 | Pipelined CN | CN pipeline | Iterative | FAILED | | |
| 4 | Pipelined CN | CN pipeline | None | COMPLETED | -28.86 ns | ~20 MHz |
| 5 | + Syndrome pipeline | Serial popcount | None | COMPLETED | -28.98 ns | ~20 MHz |
| 6 | + Balanced popcount | Adder tree | None | COMPLETED | 0.0 ns | 50 MHz |
| 7a | + LAYER_WRITE split | ADDR/DATA pipeline | Heuristic | FAILED | | |
| 7b | + LAYER_WRITE split | ADDR/DATA pipeline | None | FAILED (resizer) | | |
| 7c | + pre_shift | H_BASE lookahead | None | FAILED (GPL) | | |
| 7d | Run 6 baseline | DIODE_ON_PORTS=in | None | FAILED (GRT) | | |
| 7e | Run 6 baseline | Reuse wrong synth | None | FAILED (resizer) | | |
| 7f | Run 6 baseline | Hold margins 0.5/0.3 | None | FAILED (GRT) | | |
| 7g | Run 6 baseline | Reuse run6_clean synth | None | FAILED (resizer) | | |
| 7h | Run 6 baseline | Reuse BP synth | None | COMPLETED | +3.28 ns | 50 MHz |

Key Lessons Learned (Run 7 Series)

  1. LAYER_WRITE pipeline is not viable: Any register between col_idx and H_BASE causes either cell explosion (Yosys can't fold constants through registers) or PnR divergence (too many paths for resizer)
  2. Heuristic diode insertion always fails: Both RUN_HEURISTIC_DIODE_INSERTION=true and DIODE_ON_PORTS=in cause GRT-0118 congestion
  3. Hold slack margins matter: 0.5/0.3 inserts 35K hold buffers → GRT failure. 0.4/0.2 inserts 22K → passes
  4. Yosys synthesis is non-deterministic: Re-synthesizing identical RTL+config produces different netlists with different PnR outcomes. The balanced_popcount synthesis netlist is the only one proven to complete
  5. Config must be consistent: Reusing synthesis from a run with different config settings causes PnR divergence
  6. Run 6's balanced_popcount synthesis netlist is the golden reference — all future PnR runs should reuse it
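
The configuration lessons above can be captured as a small pre-flight check. The sketch below is an assumption-laden illustration: it maps the "0.4/0.2" hold margins onto the OpenLane variables PL_RESIZER_HOLD_SLACK_MARGIN and GRT_RESIZER_HOLD_SLACK_MARGIN, which this document does not name explicitly, and the lint_config helper is hypothetical rather than part of the flow.

```python
import json

# Settings matching the Run 6 / Run 7h configuration described in this document.
# The two *_HOLD_SLACK_MARGIN names are an assumed mapping of "0.4/0.2".
GOLDEN_OVERRIDES = {
    "CLOCK_PERIOD": 20,
    "SYNTH_STRATEGY": "AREA 0",
    "RUN_HEURISTIC_DIODE_INSERTION": False,
    "DIODE_ON_PORTS": "none",
    "RUN_ANTENNA_REPAIR": False,
    "PL_RESIZER_HOLD_SLACK_MARGIN": 0.4,
    "GRT_RESIZER_HOLD_SLACK_MARGIN": 0.2,
}


def lint_config(cfg: dict) -> list[str]:
    """Return warnings for settings that failed in the Run 1-7 experiments."""
    problems = []
    if cfg.get("RUN_HEURISTIC_DIODE_INSERTION", False):
        problems.append("heuristic diode insertion led to GRT-0118 (Runs 1, 7a)")
    if cfg.get("DIODE_ON_PORTS", "none") != "none":
        problems.append("DIODE_ON_PORTS other than none led to GRT-0118 (Run 7d)")
    if cfg.get("SYNTH_STRATEGY") == "AREA 2":
        problems.append("AREA 2 synthesis ran 20+ hours (Run 3b)")
    if (cfg.get("PL_RESIZER_HOLD_SLACK_MARGIN", 0.0) > 0.4
            or cfg.get("GRT_RESIZER_HOLD_SLACK_MARGIN", 0.0) > 0.2):
        problems.append("hold margins above 0.4/0.2 led to GRT-0118 (Run 7f)")
    return problems


print(json.dumps(GOLDEN_OVERRIDES, indent=2))
print(lint_config({**GOLDEN_OVERRIDES, "DIODE_ON_PORTS": "in"}))
```

A check like this would have caught the DIODE_ON_PORTS=in setting that was inadvertently left in place for Run 7d.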

Next Steps

  • Address antenna violations (1,687 nets) for tapeout — try GRT_ANTENNA_ITERS with reused BP synthesis
  • Fix hold violations via input delay constraints (all are input port paths)
  • Consider relaxing SS target or adding pipeline stage to belief update mux for SS corner improvement
  • Investigate making Yosys synthesis deterministic (fixed random seed, etc.) for reproducible builds
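
As a starting point for the determinism investigation, two synthesis outputs can be compared by file hash and by cell mix. This is a minimal sketch under stated assumptions: the netlists are gate-level Verilog with one instance per line, cell names start with sky130_ (inferred from the cell names in the critical path report), and all function names are hypothetical.

```python
import hashlib
import re
from collections import Counter
from pathlib import Path

# Matches lines such as: "  sky130_fd_sc_hd__xor2_1 _123_ (" (assumed format).
CELL_RE = re.compile(r"^\s*(sky130_\w+)\s+\S+\s*\(", re.MULTILINE)


def netlist_digest(path) -> str:
    """SHA-256 of the raw netlist file; equal digests mean identical files."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def cell_mix(path) -> Counter:
    """Count instances per standard-cell type in a gate-level netlist."""
    return Counter(CELL_RE.findall(Path(path).read_text()))


def compare_netlists(path_a, path_b):
    """Return (files_identical, {cell_type: count_b - count_a}) for differing types."""
    identical = netlist_digest(path_a) == netlist_digest(path_b)
    mix_a, mix_b = cell_mix(path_a), cell_mix(path_b)
    diff = {
        cell: mix_b[cell] - mix_a[cell]
        for cell in sorted(set(mix_a) | set(mix_b))
        if mix_a[cell] != mix_b[cell]
    }
    return identical, diff
```

Calling compare_netlists on the golden netlist and a fresh re-synthesis would show whether the outputs differ at all and, if so, which cell types shifted.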