# Reproducing RQ2 — RepoAudit C/C++ Benchmark

A benchmark applying SPECA to the ICML 2025 RepoAudit benchmark: 15 OSS C/C++ projects averaging 251K LoC, with a ground truth of 35 + 5 adjudicated bugs. This reproduces RQ2 of the paper.

## Results (paper)

| Metric | Value |
| --- | --- |
| Precision (Sonnet 4.5) | 88.9% — matches the highest published baseline |
| Beyond-GT candidates | 12 (author-validated), of which 2 confirmed by upstream maintainers |
| Average wall-time (Phase 03) | 4.4 minutes per project |

## File layout

```
benchmarks/rq2a/
├── visualize.py             # Generates baseline + SPECA comparison figures
├── evaluate.py              # Evaluates SPECA outputs
├── analyze_deep.py          # Deep-dive analysis of FPs
├── ground_truth_bugs.yaml   # 35 + 5 adjudicated bugs
├── published_baselines.yaml # Baselines from the RepoAudit paper
└── README.md
```

Locations of generated artifacts (all of these must be restored from their Release tags):

```
benchmarks/results/rq2a/
├── speca/              ← Sonnet 4.5 (main results)
├── speca_sonnet4/      ← Sonnet 4 (model-comparison control)
├── speca_deepseek_r1/  ← DeepSeek R1 (matched-backbone control)
└── figures/            ← rq2a_*.png (regenerated by visualize.py)
```

Restore commands:

```bash
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-speca
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-sonnet4
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-deepseek_r1
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-figures
```

Find the latest `<date>` with `gh release list --repo NyxFoundation/speca | grep '^bench-rq2a-'`.
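
To script the restore, here is a minimal sketch, assuming all four variants share the same `<date>` and that the date component sorts lexicographically (e.g. `YYYYMMDD`); neither assumption is stated on this page:

```bash
# Pick the newest bench-rq2a-* tag, strip the variant suffix, restore all four.
tag=$(gh release list --repo NyxFoundation/speca \
      | grep -o 'bench-rq2a-[^[:space:]]*' | sort | tail -n 1)
date=${tag#bench-rq2a-}   # drop the prefix...
date=${date%-*}           # ...and the variant suffix, leaving <date>
for variant in speca sonnet4 deepseek_r1 figures; do
  bash benchmarks/scripts/restore-results.sh "bench-rq2a-${date}-${variant}"
done
```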

## How to reproduce

### A. Regenerate figures only from existing SPECA outputs (no API cost)

```bash
# Baselines only
uv run python3 benchmarks/rq2a/visualize.py

# Overlay Sonnet 4.5 (assuming speca/ is already restored)
uv run python3 benchmarks/rq2a/visualize.py \
    --speca-results benchmarks/results/rq2a/speca/speca_summary.json

# Cross-model comparison (for the symmetric-comparison + adherence figures)
uv run python3 benchmarks/rq2a/visualize.py \
    --speca-multi \
    "Sonnet 4.5=benchmarks/results/rq2a/speca/speca_summary.json" \
    "Sonnet 4=benchmarks/results/rq2a/speca_sonnet4/speca_summary.json" \
    "DeepSeek R1=benchmarks/results/rq2a/speca_deepseek_r1/speca_summary.json"
```

Outputs: 8 PNGs + 1 LaTeX table → `benchmarks/results/rq2a/figures/`.

### B. Run SPECA end-to-end on the 15 projects

```bash
# 1. Clone the RepoAudit dataset into target_workspace/
gh workflow run rq2a-01-setup-dataset.yml

# 2. Run SPECA (workflows are split per model)
gh workflow run rq2a-03-audit-map-sonnet4.yml -f projects=all
gh workflow run rq2a-03-audit-map-deepseek-r1.yml -f projects=all

# 3. Evaluation + visualization
gh workflow run rq2a-04-evaluate-sonnet4.yml -f projects=all
```
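
These runs are asynchronous (Phase 03 averages 4.4 minutes per project; see the results table above). To follow a dispatched run from the CLI, the standard `gh run` commands work; the workflow file name is the one used above, and `<run-id>` is taken from the list output:

```bash
# Check recent runs of the Sonnet 4 audit workflow, then tail one of them.
gh run list --workflow rq2a-03-audit-map-sonnet4.yml --limit 5
gh run watch <run-id>   # substitute an ID from the list output
```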

### C. Beyond-GT candidate review

The 18 author-validated beyond-GT candidates from the Sonnet 4 run are listed in `benchmarks/results/rq2a/REVIEW_GUIDE.md`, along with provenance and cross-model confirmation information.

## Applying to other codebases

The harness is benchmark-agnostic. To evaluate a new C/C++ (or other-language) project:

1. Create an `outputs/TARGET_INFO.json` per target codebase (`target_repo` + `target_commit`).
2. Define the scope in `outputs/BUG_BOUNTY_SCOPE.json` (target modules / excluded paths).
3. Run SPECA with `scripts/run_phase.py --target 04` (see the sketch after this list).
4. If you prepare a comparison reference in the `benchmarks/rq2a/published_baselines.yaml` format, the existing visualization scripts pick it up directly.
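
For steps 1–3, a minimal sketch of the two descriptor files plus the run command. Only `target_repo` and `target_commit` are named above; the scope keys (`target_modules`, `excluded_paths`) and every value below are illustrative placeholders, so check the harness's actual schema before relying on them:

```bash
mkdir -p outputs

# Hypothetical target descriptor: repo URL and commit are placeholders.
cat > outputs/TARGET_INFO.json <<'EOF'
{
  "target_repo": "https://github.com/example/some-c-project",
  "target_commit": "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef"
}
EOF

# Hypothetical scope file: key names are guesses from the description above.
cat > outputs/BUG_BOUNTY_SCOPE.json <<'EOF'
{
  "target_modules": ["src/"],
  "excluded_paths": ["third_party/", "tests/"]
}
EOF

# Step 3: run SPECA (invocation given in the list above).
uv run python3 scripts/run_phase.py --target 04
```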

See also Project structure.