Reproducing RQ1 — Sherlock Ethereum Fusaka Audit Contest

This benchmark runs SPECA against 10 Ethereum implementations of EIP-7594 (PeerDAS) / EIP-7691 and measures how many of the 15 adjudicated H/M/L issues from the Sherlock contest it recovers. This page describes how to reproduce the numbers reported in RQ1 of the paper.

Dataset

  • Contest: Sherlock Ethereum Fusaka Audit Contest #787
  • Targets: 10 implementations across 6 languages (Go / Rust / Nim / TypeScript / C / C#): alloy_evm_fusaka, c_kzg_4844_fusaka, grandine_fusaka, lighthouse_fusaka, lodestar_fusaka, nethermind_fusaka, nimbus_fusaka, prysm_fusaka, reth_fusaka, rust_eth_kzg_fusaka
  • Ground truth: out of 366 submissions, 15 valid H/M/L issues (5 High, 2 Medium, 8 Low)
  • Spec: EIP-7594 / EIP-7691 + the consensus-spec PRs they reference (auto-discovered by Phase 01a)

Reported numbers (paper, post-Phase 6)

| Metric | Value |
| --- | --- |
| Phase 5 findings (pre-review) | 102 |
| Phase 6 findings (post-review) | 72 |
| H/M/L recall (expert-augmented) | 15 / 15 (100%) |
| H/M/L recall (automated-only) | 8 / 15 (53%) |
| Strict precision | 26.4% (19/72) |
| Confirmed-useful precision | 59.7% (43/72) |
| Broad precision | 66.7% (48/72) |
| F1 (broad precision) | 0.800 |
| Novel bugs confirmed by developer fix commits | 4 |
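
The reported F1 is consistent with pairing broad precision with the expert-augmented recall:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot \frac{48}{72} \cdot \frac{15}{15}}{\frac{48}{72} + \frac{15}{15}} = \frac{4/3}{5/3} = 0.800
$$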

How to reproduce

A. Re-evaluate already-committed audit branches (no API cost required)

The audit outputs for each implementation are committed to dedicated branches.

```bash
# 1. Fetch the audit branches
git fetch origin \
  alloy_evm_fusaka c_kzg_4844_fusaka grandine_fusaka lighthouse_fusaka \
  lodestar_fusaka nethermind_fusaka nimbus_fusaka prysm_fusaka \
  reth_fusaka rust_eth_kzg_fusaka

# 2. Run the recall + precision evaluator
uv run python3 -m benchmarks.rq1 \
  --branches "alloy_evm_fusaka,c_kzg_4844_fusaka,grandine_fusaka,lighthouse_fusaka,lodestar_fusaka,nethermind_fusaka,nimbus_fusaka,prysm_fusaka,reth_fusaka,rust_eth_kzg_fusaka" \
  --use-llm

# 3. Generate the report and charts
uv run python3 benchmarks/rq1/generate_report.py
```

B. Run SPECA end-to-end (full reproduction)

```bash
git checkout alloy_evm_fusaka   # repeat for each target branch
uv run python3 scripts/run_phase.py --target 04 --workers 4 --max-concurrent 64
```

The paper reports a wall-clock time of 2.4–4.4 minutes per target for Phase 03 (Sonnet 4.5, $30–60 per target).

Required files (restore)

The raw traces under `benchmarks/results/rq1/` must be restored from the Release tag:

```bash
bash benchmarks/scripts/restore-results.sh bench-rq1-<date>-sherlock_ethereum_audit_contest
```

The restored content includes the matcher's LLM cache (`llm_cache.jsonl` + `llm_cache_fp.jsonl`), so re-evaluation requires zero API calls.

Matching method

Issue matching uses LLM-assisted semantic matching plus manual verification by two authors (initial agreement 93%; disagreements resolved by consensus). A finding counts as a TP only when both the root cause and the security impact match.

The matcher runs in three stages (sketched after the list):

  1. Text similarity — title + summary embedding cosine
  2. Token overlap — Jaccard over normalized identifiers
  3. Keyword candidate selection + LLM judgment — only for ambiguous cases
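
The sketch below illustrates the staging logic only. The real matcher lives in `benchmarks/rq1` and uses title + summary embeddings plus an LLM judge; the bag-of-words stand-in, the thresholds, and the `llm_judge` hook here are assumptions for illustration, not the shipped code.

```python
# Minimal sketch of the three-stage matching logic (NOT the shipped matcher).
# Embeddings are approximated with a bag-of-words vector; thresholds and the
# llm_judge hook are assumed values for illustration.
import re
from collections import Counter
from math import sqrt

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def identifiers(text: str) -> set[str]:
    """Normalized (lowercased) identifier tokens."""
    return {t.lower() for t in IDENT.findall(text)}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def match(finding: str, issue: str, llm_judge=None) -> bool:
    # Stage 1: text similarity over title + summary (embedding stand-in).
    sim = cosine(Counter(identifiers(finding)), Counter(identifiers(issue)))
    # Stage 2: token overlap, Jaccard over normalized identifiers.
    overlap = jaccard(identifiers(finding), identifiers(issue))
    if sim > 0.85 and overlap > 0.5:   # confident match (assumed thresholds)
        return True
    if sim < 0.20 and overlap < 0.1:   # confident non-match
        return False
    # Stage 3: ambiguous pair -> defer to an LLM judgment that checks both
    # root cause and security impact (stubbed out here).
    return llm_judge(finding, issue) if llm_judge else False
```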

Finding labels (7 categories, paper Table 3)

| Label | Counts as | Definition |
| --- | --- | --- |
| `tp` | TP | Matches a valid H/M/L contest issue |
| `tp_info` | TP | True informational/low-severity detection |
| `fixed` | TP | Independently fixed in the target repo (developer fix commit) |
| `partially_fixed` | TP | Partially addressed |
| `potential-info` | TP* | Possible but unconfirmed |
| `fp_invalid` | FP | False positive due to inference error |
| `fp_review` | FP | False positive blocked by the Phase 6 review gate |

Three precision tiers: strict (`tp` only), confirmed-useful (`tp` + `tp_info` + `fixed` + `partially_fixed`), broad (everything except `fp_*`).
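
The tier arithmetic can be checked directly from label counts. In the sketch below, only `tp = 19` and the tier totals (43 and 48 out of 72) come from the table above; the split across the remaining labels is a hypothetical example consistent with those totals.

```python
# Recompute the three precision tiers from per-label counts. Only tp=19 and
# the tier totals 43/72 and 48/72 are from the paper; the split across the
# other labels is hypothetical but consistent with those totals.
CONFIRMED_USEFUL = {"tp", "tp_info", "fixed", "partially_fixed"}

def tiers(counts: dict[str, int]) -> dict[str, float]:
    total = sum(counts.values())
    return {
        "strict": counts.get("tp", 0) / total,
        "confirmed_useful": sum(v for k, v in counts.items() if k in CONFIRMED_USEFUL) / total,
        "broad": sum(v for k, v in counts.items() if not k.startswith("fp_")) / total,
    }

counts = {
    "tp": 19, "tp_info": 12, "fixed": 8, "partially_fixed": 4,  # sums to 43
    "potential-info": 5,                                        # broad total: 48
    "fp_invalid": 14, "fp_review": 10,                          # grand total: 72
}
print(tiers(counts))
# {'strict': 0.2638..., 'confirmed_useful': 0.5972..., 'broad': 0.6666...}
# i.e. 26.4% / 59.7% / 66.7%, matching the reported numbers.
```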

Applying to other audit contests

The harness is contest-agnostic. To apply it to another contest such as Code4rena or Cantina, prepare an `outputs/BUG_BOUNTY_SCOPE.json` and `outputs/TARGET_INFO.json` for each implementation, run `scripts/run_phase.py --target 04`, and run the recall/precision evaluator (`benchmarks/rq1/cli.py`) against the corresponding branch. The ground-truth CSV can be swapped by preparing `data/<contest>/<contest>.csv` in the same format; a minimal scaffold is sketched below.
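
A minimal Python scaffold for the new-contest layout, assuming nothing about the real JSON schemas (every field below is a placeholder; consult the repo for the actual formats):

```python
# Hypothetical scaffolding for a new contest. The real schemas for
# BUG_BOUNTY_SCOPE.json and TARGET_INFO.json live in the SPECA repo;
# every field below is a placeholder, not the documented format.
import json
from pathlib import Path

outputs = Path("outputs")            # run inside the checked-out target branch
outputs.mkdir(exist_ok=True)

(outputs / "BUG_BOUNTY_SCOPE.json").write_text(json.dumps({
    "contest": "my_contest",         # placeholder field names and values
    "in_scope": ["src/"],
}, indent=2))

(outputs / "TARGET_INFO.json").write_text(json.dumps({
    "name": "my_target",             # placeholder
    "language": "rust",              # placeholder
}, indent=2))

# Ground truth goes in data/<contest>/<contest>.csv (same columns as the
# Sherlock CSV), then run, as in section B:
#   uv run python3 scripts/run_phase.py --target 04
#   uv run python3 -m benchmarks.rq1 --branches "<branch>" --use-llm
```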