# Reproducing RQ2 — RepoAudit C/C++ Benchmark

This benchmark applies SPECA to the ICML 2025 RepoAudit benchmark (15 OSS C/C++ projects, 251K LoC on average, 35 + 5 ground-truth bugs). It corresponds to RQ2 of the paper.
## Results (paper)
| Metric | Value |
|---|---|
| Precision (Sonnet 4.5) | 88.9% — matches the highest published baseline |
| Beyond-GT candidates | 12 (author-validated), of which 2 confirmed by upstream maintainers |
| Average wall-time (Phase 03) | 4.4 minutes per project |
## File layout
```
benchmarks/rq2a/
├── visualize.py              # Generates baseline + SPECA comparison figures
├── evaluate.py               # Evaluates SPECA outputs
├── analyze_deep.py           # Deep-dive analysis of FPs
├── ground_truth_bugs.yaml    # 35 + 5 adjudicated bugs
├── published_baselines.yaml  # Baselines from the RepoAudit paper
└── README.md
```
Generated artifacts are located as follows (all must be restored from the release tag):
```
benchmarks/results/rq2a/
├── speca/              ← Sonnet 4.5 (main results)
├── speca_sonnet4/      ← Sonnet 4 (model-comparison control)
├── speca_deepseek_r1/  ← DeepSeek R1 (matched-backbone control)
└── figures/            ← rq2a_*.png (regenerated by visualize.py)
```
Restore commands:
```shell
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-speca
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-sonnet4
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-deepseek_r1
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-figures
```

Find the latest `<date>` with `gh release list --repo NyxFoundation/speca | grep '^bench-rq2a-'`.
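If you want to pick the latest `<date>` programmatically rather than by eye, a minimal sketch is shown below. It assumes the tag format `bench-rq2a-<date>-<suffix>` with an ISO-style date stamp (so dates sort lexicographically); the helper name and date layout are assumptions of this sketch, not part of the harness.

```python
import re

def latest_rq2a_date(release_tags):
    """Return the most recent <date> among bench-rq2a-<date>-<suffix> tags.

    Assumes ISO-style date stamps (YYYY-MM-DD or YYYYMMDD), which sort
    correctly as strings. Returns None if no matching tag is found.
    """
    dates = [
        m.group(1)
        for tag in release_tags
        if (m := re.match(r"bench-rq2a-(\d{4}-?\d{2}-?\d{2})-", tag))
    ]
    return max(dates) if dates else None

# Feed it the tag column of `gh release list` output:
tags = [
    "bench-rq2a-2025-01-10-speca",
    "bench-rq2a-2025-02-03-figures",
    "bench-rq2a-2025-02-03-speca",
]
print(latest_rq2a_date(tags))  # → 2025-02-03
```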
## How to reproduce
### A. Regenerate figures only from existing SPECA outputs (no API cost required)
```shell
# Baselines only
uv run python3 benchmarks/rq2a/visualize.py

# Overlay Sonnet 4.5 (assuming speca/ is already restored)
uv run python3 benchmarks/rq2a/visualize.py \
    --speca-results benchmarks/results/rq2a/speca/speca_summary.json

# Cross-model comparison (for symmetric-comparison + adherence figures)
uv run python3 benchmarks/rq2a/visualize.py \
    --speca-multi \
    "Sonnet 4.5=benchmarks/results/rq2a/speca/speca_summary.json" \
    "Sonnet 4=benchmarks/results/rq2a/speca_sonnet4/speca_summary.json" \
    "DeepSeek R1=benchmarks/results/rq2a/speca_deepseek_r1/speca_summary.json"
```
Outputs: 8 PNGs + 1 LaTeX table → `benchmarks/results/rq2a/figures/`.
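If you want to sanity-check a restored `speca_summary.json` against the headline precision before regenerating figures, a minimal sketch follows. The field names `true_positives`/`false_positives` are hypothetical (check the actual summary schema), and the sample counts are illustrative only.

```python
import json  # summary = json.load(open("benchmarks/results/rq2a/speca/speca_summary.json"))

def precision(summary: dict) -> float:
    """Precision = TP / (TP + FP).

    The field names are assumptions of this sketch — verify them against
    the real speca_summary.json before relying on this helper.
    """
    tp = summary["true_positives"]
    fp = summary["false_positives"]
    return tp / (tp + fp) if (tp + fp) else 0.0

# Illustrative counts only (32/36 ≈ 88.9%, the headline figure).
sample = {"true_positives": 32, "false_positives": 4}
print(f"{precision(sample):.1%}")  # → 88.9%
```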
### B. Run SPECA end-to-end on the 15 projects
```shell
# 1. Clone the RepoAudit dataset into target_workspace/
gh workflow run rq2a-01-setup-dataset.yml

# 2. Run SPECA (workflows are split per model)
gh workflow run rq2a-03-audit-map-sonnet4.yml -f projects=all
gh workflow run rq2a-03-audit-map-deepseek-r1.yml -f projects=all

# 3. Evaluation + visualization
gh workflow run rq2a-04-evaluate-sonnet4.yml -f projects=all
```
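The three stages above can be chained per model. The sketch below defaults to a dry run (`RUN=echo` prints the commands instead of dispatching them); the `RUN` indirection and the argument handling are conveniences of this sketch, not part of the harness.

```shell
#!/usr/bin/env bash
# Chain the RQ2 stages for one model. RUN defaults to "echo" (dry run);
# set RUN= (empty) to actually dispatch the GitHub Actions workflows.
set -euo pipefail
RUN="${RUN:-echo}"
MODEL="${1:-sonnet4}"   # sonnet4 or deepseek-r1

$RUN gh workflow run rq2a-01-setup-dataset.yml
$RUN gh workflow run "rq2a-03-audit-map-${MODEL}.yml" -f projects=all
$RUN gh workflow run "rq2a-04-evaluate-${MODEL}.yml" -f projects=all
```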
### C. Beyond-GT candidate review
The 18 author-validated beyond-GT candidates from the Sonnet 4 run are listed in `benchmarks/results/rq2a/REVIEW_GUIDE.md`, along with provenance and cross-model confirmation details.
## Applying to other codebases
The harness is benchmark-agnostic. To evaluate a new C/C++ (or other-language) project:
- Create an `outputs/TARGET_INFO.json` per target codebase (`target_repo` + `target_commit`)
- Define the scope in `outputs/BUG_BOUNTY_SCOPE.json` (target modules / excluded paths)
- Run SPECA with `scripts/run_phase.py --target 04`
- If you prepare a comparison reference in the `benchmarks/rq2a/published_baselines.yaml` format, the existing visualization scripts pick it up directly
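As a sketch, the two configuration files might look like the following. Every field name and value here is an assumption inferred from the bullet list above, not a verified schema — check the harness documentation before copying.

```
# outputs/TARGET_INFO.json (hypothetical shape)
{
  "target_repo": "https://github.com/example/project",
  "target_commit": "abc1234"
}

# outputs/BUG_BOUNTY_SCOPE.json (hypothetical shape)
{
  "target_modules": ["src/parser/"],
  "excluded_paths": ["third_party/", "tests/"]
}
```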
See also Project structure.