From 1f4b62454fbebd0860ced739cad98dec466e7c53 Mon Sep 17 00:00:00 2001
From: cah
Date: Tue, 10 Mar 2026 19:42:23 -0600
Subject: [PATCH] docs: add Run 5-7 hardening results and lessons learned
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Run 5: syndrome pipeline with serial popcount (no improvement)
- Run 6: balanced popcount adder tree — TT timing MET at 50 MHz
- Run 7 series (8 attempts): LAYER_WRITE pipeline exploration
  - LAYER_WRITE split not viable (cell explosion / PnR divergence)
  - Yosys synthesis non-determinism documented
  - Hold margin sensitivity (0.4/0.2 vs 0.5/0.3) identified
  - Run 7h reproduces Run 6 by reusing golden synthesis netlist

Key finding: balanced_popcount synthesis netlist is the golden reference for all future PnR iterations.

Co-Authored-By: Claude Opus 4.6
---
 docs/hardening-results.md | 197 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 192 insertions(+), 5 deletions(-)

diff --git a/docs/hardening-results.md b/docs/hardening-results.md
index cb950c5..f104e69 100644
--- a/docs/hardening-results.md
+++ b/docs/hardening-results.md
@@ -189,9 +189,196 @@ The `syndrome_cnt = syndrome_cnt + 1` accumulation pattern creates a carry chain
 
 Wishbone address input (`wb_adr_i`) has -2.47 ns setup violation. Fixable by registering the address at the decoder boundary.
 
+## Run 5: `syndrome_pipeline` (Mar 3, 2026) — COMPLETED (timing violations)
+- **RTL**: Pipelined CN + syndrome pipeline (SYNDROME_S1 + SYNDROME_S2 with serial popcount)
+- **Config**: `CLOCK_PERIOD=20` (50 MHz), `SYNTH_STRATEGY=AREA 0`, `RUN_ANTENNA_REPAIR=false`
+- **Die area**: 2800 x 1760 µm (4.93 mm²)
+- **Result**: All 75 steps completed. DRC/LVS clean.
+- **TT Setup WNS**: **-28.98 ns** — no improvement from Run 4
+- **Root cause**: Yosys serializes `syndrome_cnt = syndrome_cnt + 1` loop-carried dependency into ~48 ns chain
+- **Lesson**: Splitting parity + popcount into 2 cycles helps nothing if the popcount itself is still serial
+
+## Run 6: `balanced_popcount` (Mar 4, 2026) — COMPLETED (TT timing MET!)
+- **RTL**: Pipelined CN + syndrome pipeline with balanced 4-wide adder tree popcount
+- **Config**: `CLOCK_PERIOD=20` (50 MHz), `SYNTH_STRATEGY=AREA 0`, `RUN_ANTENNA_REPAIR=false`
+- **Die area**: 2800 x 1760 µm (4.93 mm²)
+- **Result**: All 75 steps completed. DRC/LVS clean. **TT timing met!**
+
+### Physical Results
+| Metric | Result |
+|--------|--------|
+| Magic DRC | **Clean** |
+| KLayout DRC | **Clean** |
+| LVS | **Clean** (0 errors, 0 unmatched) |
+| Antenna violating nets | 1,687 (no repair attempted) |
+
+### Area & Utilization
+| Metric | Value |
+|--------|-------|
+| Die area | 4,928,000 µm² (4.93 mm²) |
+| Instance count | 186,915 |
+| Instance area | 1,367,580 µm² (1.37 mm²) |
+| Core utilization | 28.2% |
+| Sequential cells | 18,056 |
+| Timing repair buffers | 27,864 |
+
+### Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)
+| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
+|--------|----------------|-----------------|----------------|----------------|
+| nom_tt_025C_1v80 | **0.0** | 0 | -0.45 | -10.5 |
+| nom_ss_100C_1v60 | **-9.18** | -12,474.4 | -0.17 | -0.21 |
+| nom_ff_n40C_1v95 | **0.0** | 0 | -0.37 | -38.6 |
+| max_ss_100C_1v60 | -10.45 | -15,896.8 | -0.44 | -0.87 |
+
+### Estimated Max Frequency
+- **TT corner**: **50 MHz — TIMING MET**
+- **SS corner**: Critical path ~40 ns → **~25 MHz** (up from ~11 MHz)
+- **FF corner**: **50 MHz — TIMING MET**
+
+### New Critical Path (SS corner)
+| Item | Value |
+|------|-------|
+| Startpoint | `u_core.col_idx[0]` (column index register) |
+| Endpoint | `u_core.beliefs` registers |
+| Slack | -9.18 ns (nom_ss) |
+| Data arrival time | 40.15 ns |
+| Description | Belief update mux path during LAYER_READ/LAYER_WRITE |
+
+The syndrome path is NO LONGER critical. The new bottleneck is the column-indexed mux/barrel-shifter path used during belief reads and writes.
+
+### Key Observations
+1. **Balanced popcount tree eliminated the syndrome bottleneck** — WNS improved from -28.98 ns to 0.0 ns at TT
+2. TT and FF corners now fully meet 50 MHz timing
+3. SS corner still fails (-9.18 ns) due to a different path: belief update mux indexed by col_idx
+4. Hold violations are minor (-0.45 ns) and can be fixed with post-route optimization
+5. 1,687 antenna violations need to be addressed (antenna repair was disabled)
+
+## Updated Summary Table
+
+| Run | RTL | Key Change | Antenna | Status | TT Setup WNS | Max Freq (TT) |
+|-----|-----|------------|---------|--------|-------------|---------------|
+| 1 | Unpipelined | — | Heuristic | **FAILED** | — | — |
+| 2 | Unpipelined | — | Iterative | **COMPLETED** | -27.13 ns | ~21 MHz |
+| 3 | Pipelined CN | CN pipeline | Iterative | **FAILED** | — | — |
+| 4 | Pipelined CN | CN pipeline | None | **COMPLETED** | -28.86 ns | ~20 MHz |
+| 5 | + Syndrome pipeline | Serial popcount | None | **COMPLETED** | -28.98 ns | ~20 MHz |
+| 6 | + Balanced popcount | Adder tree | None | **COMPLETED** | **0.0 ns** | **50 MHz** |
+
+## Run 7a: `pipelined_layer2` (Mar 9, 2026) — FAILED
+- **RTL**: Run 6 + LAYER_WRITE split into LAYER_WRITE_ADDR + LAYER_WRITE_DATA
+- **Config**: `CLOCK_PERIOD=20`, `DIODE_ON_PORTS=in`, `HEURISTIC_ANTENNA_THRESHOLD=200`
+- **Failure**: `GRT-0118` routing congestion — heuristic diode insertion on input ports added too many cells
+- **Lesson**: Any heuristic diode insertion causes GRT failure on this design
+
+## Run 7b: `pipelined_layer3` (Mar 9, 2026) — FAILED
+- **RTL**: Same as 7a (LAYER_WRITE_ADDR/DATA split)
+- **Config**: `DIODE_ON_PORTS=none`, `RUN_HEURISTIC_DIODE_INSERTION=false`
+- **Failure**: Post-CTS resizer diverged — 2.5+ hours at 100% CPU, memory climbing linearly, never converging
+- **Lesson**: LAYER_WRITE pipeline split creates too many paths for OpenROAD resizer
+
+## Run 7c: `pre_shift` (Mar 9, 2026) — FAILED
+- **RTL**: Run 6 + pre-registered H_BASE shift lookahead (`H_BASE[row_idx][col_idx+1]`)
+- **Config**: Same as 7b
+- **Failure**: `GPL-0302` placement density overflow — 150K cells at 41.3% exceeded 40% target
+- **Root cause**: Yosys cannot fold H_BASE constants through registers → full 256:1 write mux explosion (~2x cell count vs Run 6's 83K)
+- **Lesson**: Registering H_BASE shift values prevents Yosys constant folding
+
+## Run 7d: `run6_baseline` (Mar 9, 2026) — FAILED
+- **RTL**: Reverted to Run 6 baseline (identical RTL)
+- **Config**: `DIODE_ON_PORTS=in` (inadvertently left from earlier runs), `RUN_HEURISTIC_DIODE_INSERTION=false`
+- **Cells**: 85,500
+- **Failure**: `GRT-0118` routing congestion
+- **Root cause**: `DIODE_ON_PORTS=in` inserts diodes on input ports even when heuristic insertion is disabled
+
+## Run 7e: `run6b_nodiode` (Mar 10, 2026) — FAILED
+- **RTL**: Run 6 baseline
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.5/0.3 (from config.json), reused `run6_baseline` synthesis
+- **Failure**: Post-CTS resizer diverged (9+ GiB memory, 3+ hours, never converged)
+- **Root cause**: Reusing synthesis from a run with different config (`DIODE_ON_PORTS=in`) produces a subtly different netlist that causes PnR divergence
+
+## Run 7f: `run6_clean` (Mar 10, 2026) — FAILED
+- **RTL**: Run 6 baseline, clean full run from scratch
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.5/0.3
+- **Cells**: 85,500
+- **Hold buffers inserted**: 35,506
+- **Failure**: `GRT-0118` routing congestion
+- **Root cause**: Higher hold slack margins (0.5/0.3 vs balanced_popcount's 0.4/0.2) caused 13K extra hold buffers (35K vs 22K), pushing routing congestion over GRT threshold
+
+## Run 7g: `run6_fixhold` (Mar 10, 2026) — FAILED
+- **RTL**: Run 6 baseline, reused `run6_clean` synthesis
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.4/0.2 (matching balanced_popcount)
+- **Failure**: Post-CTS resizer diverged (14+ GiB, 3.5+ hours)
+- **Root cause**: Yosys non-determinism — `run6_clean` synthesis produced a slightly different cell mix that didn't route cleanly despite identical config
+
+## Run 7h: `run6_reuse_bp` (Mar 10, 2026) — COMPLETED (reproduces Run 6!)
+- **RTL**: Run 6 baseline, **reused balanced_popcount's actual synthesis netlist**
+- **Config**: `DIODE_ON_PORTS=none`, hold margins 0.4/0.2
+- **Result**: All stages completed. DRC/LVS clean. TT timing met!
+- **Hold buffers**: 22,095 (identical to balanced_popcount)
+
+### Physical Results
+| Metric | Result |
+|--------|--------|
+| Magic DRC | **Clean** |
+| KLayout DRC | **Clean** |
+| LVS | **Clean** (circuits match uniquely) |
+| Antenna violating nets | 1,687 (repair disabled) |
+| Antenna violating pins | 3,416 (repair disabled) |
+
+### Area & Utilization
+| Metric | Value |
+|--------|-------|
+| Die area | 4,928,000 µm² (4.93 mm²) |
+| Instance count | 186,915 |
+| Instance area | 1,367,580 µm² (1.37 mm²) |
+| Core utilization | 28.2% |
+
+### Timing (post-route, CLOCK_PERIOD = 20 ns / 50 MHz target)
+| Corner | Setup WNS (ns) | Setup TNS (ns) | Hold WNS (ns) | Hold TNS (ns) |
+|--------|----------------|-----------------|----------------|----------------|
+| nom_tt_025C_1v80 | **+3.28** | 0 | -0.45 | -10.5 |
+| nom_ss_100C_1v60 | **-9.18** | -12,474 | -0.17 | -0.21 |
+| nom_ff_n40C_1v95 | **+5.93** | 0 | -0.37 | -38.6 |
+| max_ss_100C_1v60 | -10.45 | -15,897 | -0.44 | -0.87 |
+| min_tt_025C_1v80 | +3.71 | 0 | -0.26 | -1.66 |
+| max_tt_025C_1v80 | +2.90 | 0 | -0.62 | -29.5 |
+
+### Key Observations
+1. **Results identical to Run 6** — confirms that the balanced_popcount synthesis netlist is the key ingredient
+2. Yosys non-determinism is significant: re-synthesizing the same RTL with same config produces netlists that fail PnR
+3. Hold violations (1,543 total) are all on input port paths (`wb_dat_i`, `wb_adr_i`), zero reg-to-reg — fixable with input delay constraints
+4. Max slew violations (4,112) and max cap violations (655) concentrated in SS corner
+
+## Updated Summary Table
+
+| Run | RTL | Key Change | Antenna | Status | TT Setup WNS | Max Freq (TT) |
+|-----|-----|------------|---------|--------|-------------|---------------|
+| 1 | Unpipelined | — | Heuristic | **FAILED** | — | — |
+| 2 | Unpipelined | — | Iterative | **COMPLETED** | -27.13 ns | ~21 MHz |
+| 3 | Pipelined CN | CN pipeline | Iterative | **FAILED** | — | — |
+| 4 | Pipelined CN | CN pipeline | None | **COMPLETED** | -28.86 ns | ~20 MHz |
+| 5 | + Syndrome pipeline | Serial popcount | None | **COMPLETED** | -28.98 ns | ~20 MHz |
+| 6 | + Balanced popcount | Adder tree | None | **COMPLETED** | **0.0 ns** | **50 MHz** |
+| 7a | + LAYER_WRITE split | ADDR/DATA pipeline | Heuristic | **FAILED** | — | — |
+| 7b | + LAYER_WRITE split | ADDR/DATA pipeline | None | **FAILED** (resizer) | — | — |
+| 7c | + pre_shift | H_BASE lookahead | None | **FAILED** (GPL) | — | — |
+| 7d | Run 6 baseline | DIODE_ON_PORTS=in | None | **FAILED** (GRT) | — | — |
+| 7e | Run 6 baseline | Reuse wrong synth | None | **FAILED** (resizer) | — | — |
+| 7f | Run 6 baseline | Hold margins 0.5/0.3 | None | **FAILED** (GRT) | — | — |
+| 7g | Run 6 baseline | Reuse run6_clean synth | None | **FAILED** (resizer) | — | — |
+| 7h | Run 6 baseline | **Reuse BP synth** | None | **COMPLETED** | **+3.28 ns** | **50 MHz** |
+
+## Key Lessons Learned (Run 7 Series)
+
+1. **LAYER_WRITE pipeline is not viable**: Any register between col_idx and H_BASE causes either cell explosion (Yosys can't fold constants through registers) or PnR divergence (too many paths for resizer)
+2. **Heuristic diode insertion always fails**: Both `RUN_HEURISTIC_DIODE_INSERTION=true` and `DIODE_ON_PORTS=in` cause GRT-0118 congestion
+3. **Hold slack margins matter**: 0.5/0.3 inserts 35K hold buffers → GRT failure. 0.4/0.2 inserts 22K → passes
+4. **Yosys synthesis is non-deterministic**: Re-synthesizing identical RTL+config produces different netlists with different PnR outcomes. The balanced_popcount synthesis netlist is the only one proven to complete
+5. **Config must be consistent**: Reusing synthesis from a run with different config settings causes PnR divergence
+6. **Run 6's balanced_popcount synthesis netlist is the golden reference** — all future PnR runs should reuse it
+
 ## Next Steps
-- Implement syndrome pipeline (SYNDROME_S1 + SYNDROME_S2) to cut critical path from ~49 ns to ~16 ns
-- Register Wishbone address input to fix secondary violation
-- Re-synthesize with AREA 0 and run PnR to verify timing improvement
-- Consider increasing die area for antenna repair headroom
-- Consider `SYNTH_STRATEGY=AREA 1` as middle ground between AREA 0 and AREA 2
+- Address antenna violations (1,687 nets) for tapeout — try `GRT_ANTENNA_ITERS` with reused BP synthesis
+- Fix hold violations via input delay constraints (all are input port paths)
+- Consider relaxing SS target or adding pipeline stage to belief update mux for SS corner improvement
+- Investigate making Yosys synthesis deterministic (fixed random seed, etc.) for reproducible builds
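
Reviewer note on the "Max Freq (TT)" column in the summary table: the patch does not say how those figures were derived, but they appear consistent with the usual estimate f_max ≈ 1 / (CLOCK_PERIOD − setup WNS), capped at the 50 MHz target once WNS is non-negative. The sketch below is only an illustration under that assumption. The function name `fmax_mhz` and the dictionary layout are hypothetical, and the WNS values are copied from the table.

```python
# Sketch: estimate max clock frequency from post-route setup WNS.
# ASSUMPTION (not stated in the patch): f_max ~= 1 / (clock_period - WNS),
# and a non-negative WNS is reported as the target frequency itself.

CLOCK_PERIOD_NS = 20.0  # CLOCK_PERIOD=20 (50 MHz target), from the run configs


def fmax_mhz(clock_period_ns: float, setup_wns_ns: float) -> float:
    """Return the estimated maximum frequency in MHz (hypothetical helper)."""
    if setup_wns_ns >= 0:
        # Timing met: report the target frequency, not the extra headroom.
        return 1000.0 / clock_period_ns
    # Timing violated: the achievable period grows by the size of the violation.
    return 1000.0 / (clock_period_ns - setup_wns_ns)


# TT setup WNS per completed run, taken from the summary table.
tt_setup_wns_ns = {"2": -27.13, "4": -28.86, "5": -28.98, "6": 0.0, "7h": 3.28}

estimates = {
    run: round(fmax_mhz(CLOCK_PERIOD_NS, wns), 1)
    for run, wns in tt_setup_wns_ns.items()
}

for run, mhz in estimates.items():
    print(f"Run {run}: ~{mhz} MHz")
```

Under this assumption, Runs 2, 4, and 5 come out at roughly 21, 20, and 20 MHz, matching the table. The SS-corner figure in the Run 6 section (~25 MHz) is derived differently, from the ~40 ns data arrival time, so this formula is not expected to reproduce it.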