# Reproducing RQ2 — RepoAudit C/C++ Benchmark

A benchmark applying SPECA to the ICML 2025 RepoAudit benchmark: 15 OSS C/C++ projects averaging 251K LoC, with a ground truth of 35 + 5 adjudicated bugs. This reproduces RQ2 of the paper.

## Results (paper)

| Metric | Value |
| --- | --- |
| Precision (Sonnet 4.5) | 88.9% — matches the highest published baseline |
| Beyond-GT candidates | 12 (author-validated), of which 2 confirmed by upstream maintainers |
| Average wall-time (Phase 03) | 4.4 minutes per project |

## File layout

```
benchmarks/rq2a/
├── visualize.py             # Generates baseline + SPECA comparison figures
├── evaluate.py              # Evaluates SPECA outputs
├── analyze_deep.py          # Deep-dive analysis of FPs
├── ground_truth_bugs.yaml   # 35 + 5 adjudicated bugs
├── published_baselines.yaml # Baselines from the RepoAudit paper
└── README.md
```

Locations of generated artifacts (all of these must be restored from their Release tags):

```
benchmarks/results/rq2a/
├── speca/              ← Sonnet 4.5 (main results)
├── speca_sonnet4/      ← Sonnet 4 (model-comparison control)
├── speca_deepseek_r1/  ← DeepSeek R1 (matched-backbone control)
└── figures/            ← rq2a_*.png (regenerated by visualize.py)
```

Restore commands:

```bash
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-speca
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-sonnet4
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-deepseek_r1
bash benchmarks/scripts/restore-results.sh bench-rq2a-<date>-figures
```

Find the latest `<date>` with `gh release list --repo NyxFoundation/speca | grep '^bench-rq2a-'`.
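
To script the restore, here is a minimal sketch, assuming all four variants share the same `<date>` and that the date component sorts lexicographically (e.g. `YYYYMMDD`); neither assumption is stated on this page:

```bash
# Pick the newest bench-rq2a-* tag, strip the variant suffix, restore all four.
tag=$(gh release list --repo NyxFoundation/speca \
      | grep -o 'bench-rq2a-[^[:space:]]*' | sort | tail -n 1)
date=${tag#bench-rq2a-}   # drop the prefix...
date=${date%-*}           # ...and the variant suffix, leaving <date>
for variant in speca sonnet4 deepseek_r1 figures; do
  bash benchmarks/scripts/restore-results.sh "bench-rq2a-${date}-${variant}"
done
```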

## How to reproduce

### A. Regenerate figures only from existing SPECA outputs (no API cost)

```bash
# Baselines only
uv run python3 benchmarks/rq2a/visualize.py

# Overlay Sonnet 4.5 (assuming speca/ is already restored)
uv run python3 benchmarks/rq2a/visualize.py \
    --speca-results benchmarks/results/rq2a/speca/speca_summary.json

# Cross-model comparison (for the symmetric-comparison + adherence figures)
uv run python3 benchmarks/rq2a/visualize.py \
    --speca-multi \
    "Sonnet 4.5=benchmarks/results/rq2a/speca/speca_summary.json" \
    "Sonnet 4=benchmarks/results/rq2a/speca_sonnet4/speca_summary.json" \
    "DeepSeek R1=benchmarks/results/rq2a/speca_deepseek_r1/speca_summary.json"
```

Outputs: 8 PNGs + 1 LaTeX table → `benchmarks/results/rq2a/figures/`.

### B. Run SPECA end-to-end on the 15 projects

```bash
# 1. Clone the RepoAudit dataset into target_workspace/
gh workflow run rq2a-01-setup-dataset.yml

# 2. Run SPECA (workflows are split per model)
gh workflow run rq2a-03-audit-map-sonnet4.yml -f projects=all
gh workflow run rq2a-03-audit-map-deepseek-r1.yml -f projects=all

# 3. Evaluation + visualization
gh workflow run rq2a-04-evaluate-sonnet4.yml -f projects=all
```
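
These runs are asynchronous (Phase 03 averages 4.4 minutes per project; see the results table above). To follow a dispatched run from the CLI, the standard `gh run` commands work; the workflow file name is the one used above, and `<run-id>` is taken from the list output:

```bash
# Check recent runs of the Sonnet 4 audit workflow, then tail one of them.
gh run list --workflow rq2a-03-audit-map-sonnet4.yml --limit 5
gh run watch <run-id>   # substitute an ID from the list output
```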

### C. Beyond-GT candidate review

The 18 author-validated beyond-GT candidates from the Sonnet 4 run are listed in `benchmarks/results/rq2a/REVIEW_GUIDE.md`, along with provenance and cross-model confirmation information.

## Applying to other codebases

The harness is benchmark-agnostic. To evaluate a new C/C++ (or other-language) project:

1. Create an `outputs/TARGET_INFO.json` per target codebase (`target_repo` + `target_commit`).
2. Define the scope in `outputs/BUG_BOUNTY_SCOPE.json` (target modules / excluded paths).
3. Run SPECA with `scripts/run_phase.py --target 04` (see the sketch after this list).
4. If you prepare a comparison reference in the `benchmarks/rq2a/published_baselines.yaml` format, the existing visualization scripts pick it up directly.
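
For steps 1–3, a minimal sketch of the two descriptor files plus the run command. Only `target_repo` and `target_commit` are named above; the scope keys (`target_modules`, `excluded_paths`) and every value below are illustrative placeholders, so check the harness's actual schema before relying on them:

```bash
mkdir -p outputs

# Hypothetical target descriptor: repo URL and commit are placeholders.
cat > outputs/TARGET_INFO.json <<'EOF'
{
  "target_repo": "https://github.com/example/some-c-project",
  "target_commit": "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef"
}
EOF

# Hypothetical scope file: key names are guesses from the description above.
cat > outputs/BUG_BOUNTY_SCOPE.json <<'EOF'
{
  "target_modules": ["src/"],
  "excluded_paths": ["third_party/", "tests/"]
}
EOF

# Step 3: run SPECA (invocation given in the list above).
uv run python3 scripts/run_phase.py --target 04
```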

See also Project structure.