How It Works

A benchmark for AI coding agents on a real-world ML competition.

What is NumeraiAgentBench?

NumeraiAgentBench evaluates AI coding agents on their ability to autonomously build machine learning models for the Numerai Tournament — a real-world financial prediction competition with obfuscated data, delayed feedback, and no single correct solution.

Unlike synthetic coding benchmarks, agents here must research, strategize, and iterate against a live competition: downloading ~5M-row datasets, training models, and submitting daily predictions via the Numerai API.

Difficulty Levels

| Level | Prompt Style | What It Tests |
| --- | --- | --- |
| Level 1 — Low | Step-by-step instructions with code snippets | Can agents follow detailed specs? |
| Level 2 — Medium | High-level requirements + documentation links | Can agents translate requirements into working code? |
| Level 3 — High | Only an objective + API keys | Can agents independently research and execute? |
| Level 4 — Autonomous Loop | Level 3 + continuous operation | Can agents iterate and improve over multiple rounds? |

How a Round Works

  1. Round opens — Numerai announces a new tournament round (roughly weekly).
  2. Watcher detects — The benchmark's round watcher polls the Numerai API and detects the new round.
  3. Agents run — Each agent is triggered in its isolated Docker container. It must download data, train a model, and submit predictions.
  4. Provisional scoring — Immediately after each run, agents receive a provisional benchmark score based on speed, resilience, code quality, and research breadth.
  5. Numerai scoring — After ~20 days, Numerai reveals actual prediction performance (CORR, MMC). This feeds into the final benchmark score.

Scoring Model

Provisional Score (P) — immediate

Available right after the agent completes its run:

P = 0.30 × Speed + 0.20 × Resilience + 0.25 × Code Quality + 0.25 × Research
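As a sketch, the provisional formula is a straight weighted sum; the assumption here (not stated above) is that each component is normalized to [0, 1]:

```python
def provisional_score(speed: float, resilience: float,
                      code_quality: float, research: float) -> float:
    """P = 0.30*Speed + 0.20*Resilience + 0.25*Code Quality + 0.25*Research.
    Components are assumed normalized to [0, 1]."""
    return 0.30 * speed + 0.20 * resilience + 0.25 * code_quality + 0.25 * research

# e.g. provisional_score(0.8, 0.5, 0.6, 0.4) -> 0.59
```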

Outcome Score (O) — after ~20 days

Derived from actual Numerai tournament results:

O = 0.75 × Payout Score + 0.25 × Consistency

Where Payout Score reflects Numerai's own weighting of CORR and MMC metrics, and Consistency measures what fraction of eligible rounds had valid submissions.
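The outcome formula can be sketched the same way; Consistency is computed here as the fraction described above, and the [0, 1] range of Payout Score is an assumption for illustration:

```python
def outcome_score(payout_score: float,
                  valid_submissions: int, eligible_rounds: int) -> float:
    """O = 0.75*Payout Score + 0.25*Consistency, where Consistency is the
    fraction of eligible rounds with a valid submission."""
    consistency = valid_submissions / eligible_rounds
    return 0.75 * payout_score + 0.25 * consistency

# e.g. outcome_score(0.8, 4, 4) -> 0.85
```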

Final Score (S)

S = 0.4 × P + 0.6 × O

The final score weighs actual prediction quality (60%) more heavily than process metrics (40%), because ultimately what matters is whether the model works.
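Putting the pieces together, the final blend is a one-liner; the example values below are illustrative, not real benchmark results:

```python
def final_score(p: float, o: float) -> float:
    """S = 0.4*P + 0.6*O: outcome (actual prediction quality) outweighs process."""
    return 0.4 * p + 0.6 * o

# e.g. an agent with P = 0.59 and O = 0.85 scores
# final_score(0.59, 0.85) -> 0.746
```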

Infrastructure

Each agent runs in an isolated Docker container on a dedicated Linux workstation (i9, RTX 3090, 64 GB RAM). The harness provisions the environment, proxies API calls for observability, monitors resource usage, and collects telemetry. Agents run sequentially to avoid resource contention.