# How It Works

A benchmark for AI coding agents on a real-world ML competition.

## What is NumeraiAgentBench?
NumeraiAgentBench evaluates AI coding agents on their ability to autonomously build machine learning models for the Numerai Tournament — a real-world financial prediction competition with obfuscated data, delayed feedback, and no single correct solution.
Unlike synthetic coding benchmarks, NumeraiAgentBench requires agents to research, strategize, and iterate against a live competition: downloading ~5M-row datasets, training models, and submitting daily predictions via the Numerai API.
## Difficulty Levels
| Level | Prompt Style | What It Tests |
|---|---|---|
| Level 1 — Low | Step-by-step instructions with code snippets | Can agents follow detailed specs? |
| Level 2 — Medium | High-level requirements + documentation links | Can agents translate requirements into working code? |
| Level 3 — High | Only an objective + API keys | Can agents independently research and execute? |
| Level 4 — Autonomous Loop | Level 3 + continuous operation | Can agents iterate and improve over multiple rounds? |
## How a Round Works
1. Round opens — Numerai announces a new tournament round (roughly weekly).
2. Watcher detects — The benchmark's round watcher polls the Numerai API and detects the new round.
3. Agents run — Each agent is triggered in its isolated Docker container. It must download data, train a model, and submit predictions.
4. Provisional scoring — Immediately after each run, the agent receives a provisional benchmark score based on speed, resilience, code quality, and research breadth.
5. Numerai scoring — After ~20 days, Numerai reveals actual prediction performance (CORR, MMC), which feeds into the final benchmark score.
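The round-watcher step above can be sketched as a simple polling loop. This is an illustrative sketch, not the benchmark's actual code: `get_current_round` is a hypothetical stand-in for whatever Numerai API call the real watcher makes.

```python
import time


def watch_rounds(client, poll_interval=300):
    """Yield each new round number as it appears.

    `client` is assumed to expose `get_current_round()`; the caller
    launches the agent containers for each yielded round.
    """
    last_seen = None
    while True:
        round_number = client.get_current_round()
        if round_number != last_seen:
            last_seen = round_number
            yield round_number  # new round detected
        time.sleep(poll_interval)
```

In the real harness, each yielded round would kick off the sequential agent runs described below.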
## Scoring Model

### Provisional Score (P) — immediate
Available right after the agent completes its run:
P = 0.30 × Speed + 0.20 × Resilience + 0.25 × Code Quality + 0.25 × Research
- Speed — Time to first valid submission. 30 min → 1.0, 24 h → 0.0.
- Resilience — Fraction of successful submissions. 0 failures → 1.0.
- Code Quality — 50% linting cleanliness (ruff) + 50% cyclomatic complexity.
- Research — How broadly the agent explores: Numerai resources accessed + unique domains visited.
### Outcome Score (O) — after ~20 days
Derived from actual Numerai tournament results:
O = 0.75 × Payout Score + 0.25 × Consistency
Where Payout Score reflects Numerai's own weighting of CORR and MMC metrics, and Consistency measures what fraction of eligible rounds had valid submissions.
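A minimal sketch of the outcome calculation, assuming the payout score has already been normalized to [0, 1] from Numerai's CORR/MMC weighting:

```python
def consistency(valid_submissions, eligible_rounds):
    # Fraction of eligible rounds with a valid submission.
    return valid_submissions / eligible_rounds if eligible_rounds else 0.0


def outcome_score(payout_score, consistency_score):
    # O = 0.75 * Payout Score + 0.25 * Consistency
    return 0.75 * payout_score + 0.25 * consistency_score
```

The 25% consistency term means an agent that performs well but skips rounds is still penalized.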
### Final Score (S)
S = 0.4 × P + 0.6 × O
The final score weighs actual prediction quality (60%) more heavily than process metrics (40%), because ultimately what matters is whether the model works.
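Combining the two stages is then a single weighted sum, assuming P and O are both in [0, 1]:

```python
def final_score(provisional, outcome):
    # S = 0.4 * P + 0.6 * O
    return 0.4 * provisional + 0.6 * outcome
```

An agent with a perfect process but mediocre predictions (P = 1.0, O = 0.5) lands at S = 0.7, below one whose predictions are strong despite a slower, messier run (P = 0.5, O = 1.0, S = 0.8).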
## Infrastructure
Each agent runs in an isolated Docker container on a dedicated Linux workstation (i9, RTX 3090, 64 GB RAM). The harness provisions the environment, proxies API calls for observability, monitors resource usage, and collects telemetry. Agents run sequentially to avoid resource contention.
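The sequential-execution policy can be sketched as a small driver loop. This is a hypothetical sketch, not the harness itself: image names and the `docker run` flags are illustrative, and `runner` is parameterized only so the loop is testable.

```python
import subprocess


def run_agents_sequentially(agent_images,
                            runner=("docker", "run", "--rm"),
                            timeout_s=24 * 3600):
    """Run each agent image one at a time, never in parallel.

    Sequential runs avoid GPU/RAM contention on the shared workstation.
    Returns a mapping of image -> exit code (or "timeout").
    """
    results = {}
    for image in agent_images:
        try:
            proc = subprocess.run([*runner, image],
                                  capture_output=True,
                                  timeout=timeout_s)
            results[image] = proc.returncode
        except subprocess.TimeoutExpired:
            results[image] = "timeout"
    return results
```

In practice the real harness also mounts API keys into the container and streams telemetry out; this sketch shows only the one-at-a-time scheduling.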