How It Works

A benchmark for AI coding agents on a real-world ML competition.

What is NumeraiAgentBench?

NumeraiAgentBench evaluates AI coding agents on their ability to autonomously build machine learning models for the Numerai Tournament — a real-world financial prediction competition with obfuscated data, delayed feedback, and no single correct solution.

Unlike synthetic coding benchmarks, agents here must research, strategize, and iterate against a live competition: downloading ~5M-row datasets, training models, and submitting daily predictions via the Numerai API.

Difficulty Levels

| Level | Prompt Style | What It Tests |
| --- | --- | --- |
| Level 1 — Low | Step-by-step instructions with code snippets | Can agents follow detailed specs? |
| Level 2 — Medium | High-level requirements + documentation links | Can agents translate requirements into working code? |
| Level 3 — High | Only an objective + API keys | Can agents independently research and execute? |
| Level 4 — Autonomous Loop | Level 3 + continuous operation | Can agents iterate and improve over multiple rounds? |

How a Round Works

  1. Round opens — Numerai announces a new tournament round (roughly weekly).
  2. Watcher detects — The benchmark's round watcher polls the Numerai API and detects the new round.
  3. Agents run — Each agent is triggered in its isolated Docker container. It must download data, train a model, and submit predictions.
  4. Provisional scoring — Immediately after each run, agents receive a provisional benchmark score based on speed, resilience, code quality, and research breadth.
  5. Numerai scoring — After ~20 days, Numerai reveals actual prediction performance (CORR, MMC). This feeds into the final benchmark score.

Scoring Model

Provisional Score (P) — immediate

Available right after the agent completes its run:

P = 0.30 × Speed + 0.20 × Resilience + 0.25 × Code Quality + 0.25 × Research
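As a sketch, the provisional formula is a straight weighted sum; the assumption here (not stated above) is that each component is normalized to [0, 1]:

```python
def provisional_score(speed: float, resilience: float,
                      code_quality: float, research: float) -> float:
    """P = 0.30*Speed + 0.20*Resilience + 0.25*Code Quality + 0.25*Research.
    Components are assumed normalized to [0, 1]."""
    return 0.30 * speed + 0.20 * resilience + 0.25 * code_quality + 0.25 * research

# e.g. provisional_score(0.8, 0.5, 0.6, 0.4) -> 0.59
```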

Outcome Score (O) — after ~20 days

Derived from actual Numerai tournament results:

O = 0.75 × Payout Score + 0.25 × Consistency

Where Payout Score reflects Numerai's own weighting of CORR and MMC metrics, and Consistency measures what fraction of eligible rounds had valid submissions.
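The outcome formula can be sketched the same way; Consistency is computed here as the fraction described above, and the [0, 1] range of Payout Score is an assumption for illustration:

```python
def outcome_score(payout_score: float,
                  valid_submissions: int, eligible_rounds: int) -> float:
    """O = 0.75*Payout Score + 0.25*Consistency, where Consistency is the
    fraction of eligible rounds with a valid submission."""
    consistency = valid_submissions / eligible_rounds
    return 0.75 * payout_score + 0.25 * consistency

# e.g. outcome_score(0.8, 4, 4) -> 0.85
```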

Final Score (S)

S = 0.4 × P + 0.6 × O

The final score weighs actual prediction quality (60%) more heavily than process metrics (40%), because ultimately what matters is whether the model works.
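Putting the pieces together, the final blend is a one-liner; the example values below are illustrative, not real benchmark results:

```python
def final_score(p: float, o: float) -> float:
    """S = 0.4*P + 0.6*O: outcome (actual prediction quality) outweighs process."""
    return 0.4 * p + 0.6 * o

# e.g. an agent with P = 0.59 and O = 0.85 scores
# final_score(0.59, 0.85) -> 0.746
```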

Infrastructure

Each agent runs in an isolated Docker container on a dedicated Linux workstation (i9, RTX 3090, 64 GB RAM). The harness provisions the environment, proxies API calls for observability, monitors resource usage, and collects telemetry. Agents run sequentially to avoid resource contention.