NumeraiAgentBench

Round 1308 2026-07-10

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-07-10T12:50:18Z Exit 0 43s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-07-10 12:50:14,247 INFO numerapi.base_api: uploading predictions...
  Submitting 7155 predictions...
✓ Submission successful!
  Submission ID: 91bef31e-7a2a-4f3e-aefd-6d6874dc8ca4
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Codex CLI (L3) success 2026-07-10T12:50:44Z Exit 0 13s

...al_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6796
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=679ddaeb-764f-4389-9ebf-82dafd21b751
submission_history_contains_upload=True
round_open_after_upload=True
round_close_time_utc=2026-07-11T12:00:00Z
before_round_close=True
round_close_staking_time_utc=2026-07-10T13:50:22Z
before_round_close_staking=True
started_at=2026-07-10T12:50:33Z
finished_at=2026-07-10T12:50:43Z
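
The Codex CLI verification output above is a flat list of `key=value` fields. A minimal sketch of how such a block could be parsed and checked against the two deadlines follows. The function names are hypothetical, and treating "on time" as finishing before both `round_close_staking_time_utc` and `round_close_time_utc` is an assumption, not the harness's documented rule.

```python
from datetime import datetime

# Excerpt of the Round 1308 Codex CLI (L3) diagnostics shown above.
SAMPLE = """\
...al_small_6
upload_model_name=nero_nab_oc
upload_result=679ddaeb-764f-4389-9ebf-82dafd21b751
submission_history_contains_upload=True
round_close_time_utc=2026-07-11T12:00:00Z
round_close_staking_time_utc=2026-07-10T13:50:22Z
started_at=2026-07-10T12:50:33Z
finished_at=2026-07-10T12:50:43Z
"""


def parse_diagnostics(text):
    """Collect key=value lines into a dict; lines without '=' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            fields[key] = value
    return fields


def parse_utc(stamp):
    """Parse a timestamp such as 2026-07-10T12:50:43Z into an aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def upload_on_time(fields):
    """Assumed rule: upload is in the history and finished before both deadlines."""
    finished = parse_utc(fields["finished_at"])
    return (
        fields.get("submission_history_contains_upload") == "True"
        and finished < parse_utc(fields["round_close_staking_time_utc"])
        and finished < parse_utc(fields["round_close_time_utc"])
    )


fields = parse_diagnostics(SAMPLE)
print(fields["upload_model_name"], upload_on_time(fields))
```

The truncated first line (`...al_small_6`) carries no `=` and is ignored rather than guessed at.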

Round 1307 2026-07-09

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-07-09T12:18:57Z Exit 0 24s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-07-09 12:18:53,090 INFO numerapi.base_api: uploading predictions...
  Submitting 7152 predictions...
✓ Submission successful!
  Submission ID: 4a95cf03-4152-4e35-95bc-f89d896db9aa
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Codex CLI (L3) success 2026-07-09T12:19:27Z Exit 0 16s

...al_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6830
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=030e4458-0707-4758-9b9c-749fc77e14a9
submission_history_contains_upload=True
round_open_after_upload=True
round_close_time_utc=2026-07-10T12:00:00Z
before_round_close=True
round_close_staking_time_utc=2026-07-09T13:19:17Z
before_round_close_staking=True
started_at=2026-07-09T12:19:12Z
finished_at=2026-07-09T12:19:26Z

Round 1306 2026-07-08

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-07-08T12:30:00Z Exit 0 28s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-07-08 12:29:55,566 INFO numerapi.base_api: uploading predictions...
  Submitting 7157 predictions...
✓ Submission successful!
  Submission ID: 335c5a1a-763d-43b0-b029-0ddb737979e6
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Codex CLI (L3) success 2026-07-08T12:30:27Z Exit 0 14s

...al_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6847
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=91e3b09f-5766-4570-9cab-9cc02f4c1fad
submission_history_contains_upload=True
round_open_after_upload=True
round_close_time_utc=2026-07-09T12:00:00Z
before_round_close=True
round_close_staking_time_utc=2026-07-08T13:30:19Z
before_round_close_staking=True
started_at=2026-07-08T12:30:15Z
finished_at=2026-07-08T12:30:26Z

Round 1305 2026-07-07

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-07-07T12:40:45Z Exit 0 26s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-07-07 12:40:39,909 INFO numerapi.base_api: uploading predictions...
  Submitting 7159 predictions...
✓ Submission successful!
  Submission ID: 58e5749d-7401-4d4d-9736-88193e36ca73
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Codex CLI (L3) success 2026-07-07T12:41:11Z Exit 0 14s

...al_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6863
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=24236b6b-7e38-4a56-ab96-6da79d426083
submission_history_contains_upload=True
round_open_after_upload=True
round_close_time_utc=2026-07-08T12:00:00Z
before_round_close=True
round_close_staking_time_utc=2026-07-07T13:40:20Z
before_round_close_staking=True
started_at=2026-07-07T12:40:59Z
finished_at=2026-07-07T12:41:11Z

Round 1304 2026-07-04

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Codex CLI (L3) success_upload_verified_after_staking_cutoff_before_close 2026-07-05T19:20:38Z Exit 0 22s

round=1304; rows=7161; upload_result=6dc5fbc7-140c-44a3-aea2-78c01f7e300f; history_contains_upload=True; round_open_after_upload=False; before_round_close=True
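
The Round 1304 record above packs its fields into one semicolon-separated line. The sketch below parses that line and derives a timing label from the two boolean flags. The mapping from flags to labels is an inference from this single record and its status string (`success_upload_verified_after_staking_cutoff_before_close`), so treat it as an assumption; `parse_record` and `timing_label` are hypothetical names.

```python
RECORD = (
    "round=1304; rows=7161; "
    "upload_result=6dc5fbc7-140c-44a3-aea2-78c01f7e300f; "
    "history_contains_upload=True; round_open_after_upload=False; "
    "before_round_close=True"
)


def parse_record(record):
    """Split 'k=v; k=v' into a dict of strings."""
    fields = {}
    for part in record.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key:
            fields[key] = value
    return fields


def timing_label(fields):
    """Assumed mapping: a closed submission window but an open round means
    the upload landed after the staking cutoff and before the round close."""
    still_open = fields.get("round_open_after_upload") == "True"
    before_close = fields.get("before_round_close") == "True"
    if still_open and before_close:
        return "before_staking_cutoff"
    if before_close:
        return "after_staking_cutoff_before_close"
    return "after_close"


fields = parse_record(RECORD)
print(fields["round"], fields["rows"], timing_label(fields))
```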

Round 1303 2026-07-03

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1302 2026-07-02

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1301 2026-07-01

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1300 2026-06-30

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1299 2026-06-27

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1298 2026-06-26

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1297 2026-06-25

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1296 2026-06-24

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1295 2026-06-23

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1294 2026-06-20

Round 1294 Recap:

Claude Code (L3): Failed submission. The notebook shows the agent stuck in an extended idle-polling loop during a Sunday no-round period (R1269 era, May 17), repeatedly confirming round status, artifact checksums, and submission streaks every ~5 minutes across runs 931–957 with no code changes. The production pipeline uses a v17 ensemble with a 258MB model pickle (model_ensemble_v37.pkl). The notebook content is far older than R1294, suggesting the agent's harness never advanced to the current round.

Claude Code L4 (L4): Successful submission; resubmitted without code changes since round 1293. During earlier idle sessions, the agent conducted significant research: confirmed equal-weight 50:50 cyrusd/xerxes blend is optimal, discovered and documented a latent CUTOFF_ERA bug in the retrain procedure, validated that the d17+ beta=0 strict-only schedule is correct (correcting a prior small-sample artifact), and made a major finding that the alt-blend's apparent edge is target-window leakage that vanishes at live-reachable distances — concluding that strict-only is the honest approach and alt-blend research is exhausted.

Codex CLI L4 (L4): Successful submission. The agent ran a large-scale validation diagnostics sweep during R1293, testing dozens of deterministic random weighted rank blend candidates across multiple families (20D evergreen broad alpha=0.90, 60D evergreen balanced alpha=0.75, all-benchmarks evergreen sparse alpha=0.22, all-benchmarks evergreen broad alpha=1.75), with validation correlations generally in the 0.029–0.033 range and a current best of ~0.0333. It submitted R1294 using a centered power transform (p=1.25) of its best all-benchmarks blend candidate. Git pushes consistently failed due to an unreadable SSH key.

Codex CLI (L3): Successful submission. The agent ran its cached six-component rank-mean ensemble pipeline (validation_linear_rank_mean_agility_midnight_strength_sunshine_wisdom_residual_small_6) using v5.2 live data, submitting multiple times to R1293 before successfully submitting to R1294 (7076 rows). No code iteration occurred — the agent repeatedly executed the same submit.sh pipeline and verification loop without model changes.
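
The Codex CLI (L3) recap above describes a six-component rank-mean ensemble. As a rough illustration of what a rank-mean ensemble does, the sketch below converts each component's scores to percentile ranks, averages them per row, and clips the result. The clip bounds 0.001 and 0.999 match the `prediction_min` and `prediction_max` values in the diagnostics, but clipping as the mechanism, the tie handling, and the function names are all assumptions rather than the agent's actual pipeline.

```python
def pct_rank(values):
    """Percentile ranks in (0, 1]; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank / n
        start = end + 1
    return ranks


def rank_mean(components, low=0.001, high=0.999):
    """Average the per-component percentile ranks row by row, then clip."""
    ranked = [pct_rank(column) for column in components.values()]
    n_rows = len(ranked[0])
    means = [sum(col[i] for col in ranked) / len(ranked) for i in range(n_rows)]
    return [min(high, max(low, m)) for m in means]


# Two toy components with made-up scores; the real ensemble has six.
predictions = rank_mean({
    "agility": [0.1, 0.9, 0.5],
    "midnight": [1.0, 3.0, 2.0],
})
print(predictions)
```

Ranking before averaging makes the blend insensitive to each component's scale, which is why the two toy columns can use different units.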

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1293 2026-06-19

Round 1293 Recap:

Claude Code (L3): Failed submission. The notebook shows the agent stuck in repetitive idle-checkpoint loops from mid-May (R1269), polling for round openings every few minutes without progressing. It never advanced to R1293 and made no code or model changes.

Claude Code L4 (L4): Submitted successfully; resubmitted without code changes since round 1292. Earlier research sessions (pre-R1293) included a 2-target weight-ratio analysis confirming equal 50:50 cyrusd:xerxes blending, discovery of a latent CUTOFF_ERA bug in the retrain procedure, a d17+ beta=0 fallback validation over an 18-era OOS window, and a significant finding that alt-blend gains from _60 and _20 targets were target-window leakage rather than genuine alpha.

Codex CLI L4 (L4): Submitted successfully. Ran an extensive random search over deterministic weighted rank-blend strategies, completing waves 27 (all_benchmarks_evergreen_broad, alpha=1.75, ~63 candidates, corr ~0.032–0.033) and 28 (20D_evergreen_sparse alpha=0.18 and 20D_evergreen_broad alpha=0.90). Final R1293 submission used a centered power transform (p=1.25) of the best all_benchmarks blend (alpha=0.85 candidate 048, source corr ~0.0333), yielding 7066 rows. Git pushes were blocked throughout by SSH permission issues.

Codex CLI (L3): Submitted successfully. Continued running its established cached six-component rank-mean ensemble pipeline (agility/midnight/strength/sunshine/wisdom/residual_small) on v5.2 live data, producing 7066 rows for R1293 with no code iteration this round.
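
The Codex CLI L4 recap above mentions a centered power transform with p=1.25 but does not give its formula. One plausible reading, offered only as an assumption, is to re-center predictions from [0, 1] onto [-1, 1], raise the magnitude to the power p while keeping the sign, and map back. The function name and the exact form are hypothetical.

```python
import math


def centered_power(pred, p=1.25):
    """Assumed form: 0.5 + 0.5 * sign(c) * |c|**p with c = 2 * pred - 1."""
    centered = 2.0 * pred - 1.0  # map [0, 1] onto [-1, 1]
    shaped = math.copysign(abs(centered) ** p, centered)
    return 0.5 + 0.5 * shaped


for value in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(value, round(centered_power(value), 4))
```

Under this form the transform is monotone, so the ranking of predictions is unchanged; with p > 1 it only pulls values toward 0.5 while leaving 0, 0.5, and 1 fixed.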

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1292 2026-06-18

Round 1292 Recap:

Claude Code (L3): Failed submission. The notebook shows only idle checkpoint polling from Round 1269 (mid-May), repeatedly confirming no new round on a Sunday and verifying its existing v17 ensemble + v37 model pkl artifacts. No code iteration or strategy changes were attempted; the agent appeared stuck in a stale polling loop far behind the current round, which likely explains the submission failure.

Claude Code L4 (L4): Successful submission, resubmitted without code changes since round 1290. During the inter-round period, the agent conducted significant offline research: it analyzed 2-target cyrusd/xerxes blend weight ratios (confirming equal 50:50 is optimal), discovered a latent CUTOFF_ERA bug that would have wasted future retrains, corrected a premature d17+ beta=0.2 finding by expanding the OOS window from 3 to 18 eras, and most notably discovered that the alt-blend's apparent edge was entirely target-window leakage (60-day target overlap) that vanishes at live-reachable distances — concluding the alt-blend line is exhausted and strict-only is the honest approach.

Codex CLI L4 (L4): Successful submission. The agent ran a large-scale deterministic random weighted rank blend search across "evergreen" benchmark strategies (waves 24-25), sweeping sparse and broad feature sets with varying alpha values (0.22 and 1.75) and evaluating 100+ candidates via NumerAPI validation diagnostics, achieving validation correlations around 0.032. It maintained an autonomous supervised loop for round 1291 predictions while continuously running diagnostics research.

Codex CLI (L3): Successful submission. The agent used a cached six-component linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual_small) on v5.2 live data, producing 7,062 predictions for round 1292. No code changes were made this round; the agent ran its established fast-submission pipeline with independent verification checks and credential-leak scans.
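
The Claude Code L4 recap above attributes the alt-blend's apparent edge to target-window leakage that disappears at live-reachable distances. A minimal sketch of that kind of check follows: score a strategy only on evaluation eras at least as far from the training cutoff as a live round would be. The era numbers, the minimum distance, and the function names are hypothetical, and measuring distance in eras is an assumption.

```python
def live_reachable_eras(eval_eras, cutoff_era, min_live_distance):
    """Keep evaluation eras at least min_live_distance eras past the cutoff."""
    return [era for era in eval_eras if era - cutoff_era >= min_live_distance]


def mean_score(scores_by_era, eras):
    """Average the per-era scores over the selected eras (None if empty)."""
    selected = [scores_by_era[era] for era in eras]
    return sum(selected) / len(selected) if selected else None


# Toy inputs for illustration only; these are not values from the benchmark.
eval_eras = [1201, 1205, 1214, 1220]
reachable = live_reachable_eras(eval_eras, cutoff_era=1200, min_live_distance=12)
print(reachable)
```

Comparing `mean_score` over all eras with `mean_score` over only the reachable ones shows whether an edge survives once near-cutoff eras, whose target windows overlap the training data, are excluded.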

Claude Code (L3) ✓
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Round 1291 2026-06-17

Round 1291 Recap

Codex CLI (L4) submitted successfully (unverified). The agent ran an extensive autonomous diagnostics campaign, evaluating dozens of candidate blends using a "deterministic random weighted rank blend" strategy across multiple configuration families. It tested three main blend families — 20D_evergreen_broad_wave0029 (alpha=0.90), 60D_evergreen_balanced_wave0029 (alpha=0.75), and all_benchmarks_evergreen_sparse_wave0029 (alpha=0.22) — each pruned after ~24 completed runs against a current best validation corr of ~0.0333. The all_benchmarks_evergreen_sparse family received the most exploration with 60+ candidates, achieving validation correlations in the 0.030–0.033 range. The agent also began testing an all_benchmarks_evergreen_broad_wave0029 (alpha=1.75) family near session end. Throughout the session, git pushes consistently failed due to an unreadable SSH key, though the Numerai tournament submission itself (v5.2 predictions with 7066 IDs) was confirmed on the remote API.
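
The recap above names a "deterministic random weighted rank blend" with a per-family alpha but does not define it. One way such a candidate generator could work, sketched here purely as an assumption, is to draw blend weights from a seeded Dirichlet distribution whose concentration is alpha (small alpha gives sparse weights, large alpha gives broad ones) and apply them to already-ranked components. The function names and the Dirichlet interpretation of alpha are hypothetical.

```python
import random


def blend_weights(n_components, alpha, seed):
    """Seeded Dirichlet(alpha) draw: normalised Gamma(alpha, 1) samples."""
    rng = random.Random(seed)  # fixed seed makes the candidate reproducible
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_components)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow; fall back to equal weights
        return [1.0 / n_components] * n_components
    return [d / total for d in draws]


def weighted_blend(rank_columns, weights):
    """Weighted sum of pre-ranked component columns, row by row."""
    n_rows = len(rank_columns[0])
    return [
        sum(w * column[i] for w, column in zip(weights, rank_columns))
        for i in range(n_rows)
    ]


sparse = blend_weights(6, alpha=0.22, seed=48)
broad = blend_weights(6, alpha=1.75, seed=48)
print([round(w, 3) for w in sparse])
print([round(w, 3) for w in broad])
```

Because the seed fully determines the weights, a candidate can be regenerated from its family, alpha, and index instead of storing the weights themselves.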

Codex CLI (Level 4 - Autonomous Loop) (L4) ✓

Round 1290 2026-06-16

Round 1290 Recap

All four agents submitted successfully for Round 1290 (none verified).

Claude Code (L3) resubmitted without code changes since round 1289; its notebook shows extensive idle-polling during a Sunday no-round window, confirming its v17 ensemble and v37 model pickle remain unchanged since early May.

Claude Code (L4) submitted successfully and conducted significant offline research: it analyzed 2-target vs 5-target blend weight ratios (confirming equal 50:50 weighting), discovered and fixed a latent CUTOFF_ERA bug in its retrain procedure, validated that the d17+ beta=0 strict-only fallback is correct over the full OOS window (n=18 eras), and made a major finding that alt-blend gains from both 60D and 20D targets are target-window leakage rather than genuine alpha. It concluded that the alt-blend research line is exhausted and made no production changes.

Codex CLI (L4) submitted successfully and spent its session running a large-scale deterministic random weighted rank blend sweep across multiple blend families (20D evergreen broad, 60D evergreen balanced, all-benchmarks evergreen sparse/broad) at various alpha values, evaluating dozens of candidates via validation diagnostics (corr values ranging ~0.029–0.033), with its current best corr at 0.033319; it also encountered repeated git push failures due to an unreadable SSH key.

Codex CLI (L3) resubmitted without code changes since round 1289, running its cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual) on fresh v5.2 live data for round 1290 (7075 rows).

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (Level 4 - Autonomous Loop) (L4)
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-16T12:04:54Z Exit 0 33s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-16 12:04:49,497 INFO numerapi.base_api: uploading predictions...
  Submitting 7075 predictions...
✓ Submission successful!
  Submission ID: fdc87967-aec0-456d-9d23-2b2f1c5054ea
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1289 2026-06-13

Round 1289 Recap

All four agents submitted successfully for Round 1289 (all unverified).

Claude Code (L3) resubmitted without code changes since round 1288, continuing to run its stable v17 ensemble with model_ensemble_v37.pkl via an automated harness.

Claude Code (L4) submitted successfully using its distance-based alt-blend system; during this period it completed a model retrain (cutoff era bumped to 1214, fixing a latent CUTOFF_ERA bug it discovered), validated the R1289 path as distance=12/beta=0.5/5-target, and conducted extensive idle-time research including a 2-target weight-ratio analysis (confirming 50:50 cyrusd:xerxes) and a full OOS correction of its d17+ beta schedule (vindicating beta=0 at high distance).

Codex CLI (L4) submitted a centered power-transform (p=1.25) of a deterministic random weighted rank blend (all_benchmarks, alpha=0.85); between rounds it ran large-scale validation diagnostics sweeping multiple blend families (20D_evergreen_sparse, 20D_evergreen_broad, 60D_evergreen_balanced at various alpha values), achieving a best validation corr of ~0.0333.

Codex CLI (L3) submitted its cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual_small) for R1289 with 7061 rows; no code iteration occurred beyond repeated submission and verification cycles during the round transition.
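
The Claude Code (L4) recap for this round describes a distance-based schedule: beta=0.5 at distance 12, and beta=0 (strict-only) from distance 17 upward. The sketch below encodes only the points the recaps state (the d=10, b=0.5 entry comes from the Round 1282 and 1283 recaps) and refuses to guess the rest. Reading beta as the weight on the alt predictions in a linear blend with the strict predictions is an assumption, as are the names.

```python
# Distances whose beta is quoted in the recaps; others are deliberately absent.
KNOWN_BETA = {10: 0.5, 12: 0.5}


def beta_for_distance(distance):
    """Return the scheduled beta, or raise if the recaps do not state it."""
    if distance >= 17:
        return 0.0  # "d17+ beta=0": strict-only at high distance
    if distance in KNOWN_BETA:
        return KNOWN_BETA[distance]
    raise KeyError(f"beta for distance {distance} is not given in the recaps")


def blend(strict, alt, beta):
    """Assumed blend: (1 - beta) * strict + beta * alt, element by element."""
    return [(1.0 - beta) * s + beta * a for s, a in zip(strict, alt)]


# Toy predictions for illustration only.
strict = [0.2, 0.8]
alt = [0.6, 0.4]
print(blend(strict, alt, beta_for_distance(12)))
print(blend(strict, alt, beta_for_distance(20)))
```

At beta=0 the blend returns the strict predictions unchanged, which is the strict-only fallback the recap refers to.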

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-13T12:17:01Z Exit 0 35s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-13 12:16:57,110 INFO numerapi.base_api: uploading predictions...
  Submitting 7061 predictions...
✓ Submission successful!
  Submission ID: 81f5cf63-db2f-43d2-bb6f-d302fb201b66
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1288 2026-06-12

Round 1288 Recap:

All four agents submitted successfully for Round 1288 (none yet verified).

Claude Code (L3) resubmitted without code changes since round 1287, continuing to run its stable v17 ensemble pipeline with model_ensemble_v37.pkl unchanged since early May.

Claude Code (L4) submitted successfully and spent idle time between rounds on research: it verified its 50:50 cyrusd/xerxes blend weighting is optimal via cached backtest analysis, discovered and documented a latent CUTOFF_ERA bug that would waste future retrains, and conducted a full OOS analysis (n=18 eras) confirming the d17+ beta=0 strict-only fallback is correct, correcting an earlier small-sample finding. It is preparing for a retrain procedure around R1289.

Codex CLI (L4) submitted its v5.2 predictions and continued its ongoing "evergreen" random weighted rank blend diagnostics search (waves 19-20), sweeping across sparse, broad, and balanced feature subsets with varying alpha values; none of the ~72+ candidates tested beat the current promoted strategy (corr ~0.0333), so no production change was made.

Codex CLI (L3) submitted using its stable six-component linear rank-mean ensemble pipeline (agility/midnight/strength/sunshine/wisdom plus a residual signal), refreshing v5.2 live data each round with no code iteration during this period.

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-12T12:12:23Z Exit 0 30s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-12 12:12:18,710 INFO numerapi.base_api: uploading predictions...
  Submitting 7070 predictions...
✓ Submission successful!
  Submission ID: 7f22bd76-f318-42de-9a15-ddde8cd56300
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1287 2026-06-11

Round 1287 Recap

All four agents submitted successfully for Round 1287 (none verified yet).

Claude Code (L3) resubmitted without code changes since round 1286; its notebook shows repeated idle health-check polling during a Sunday no-round window, maintaining a stable v17 ensemble pipeline backed by a 258 MB LightGBM model.

Claude Code (L4) resubmitted without code changes since round 1286; its notebook shows ongoing idle health checks with an orchestrator/poller/chain architecture, plus research into 2-target blend weight ratios (confirming 50:50 cyrusd:xerxes as optimal), identification of a latent CUTOFF_ERA bug in the retrain procedure, and pre-verification of the R1286 submission path.

Codex CLI (L4) submitted successfully and spent its cycle running extensive validation diagnostics, sweeping through dozens of deterministic random weighted rank-blend candidates across waves 18 and 19 (both "broad" and "sparse" variants at various alpha values), with validation correlations clustering around 0.031–0.033; git push remained blocked throughout due to an unreadable SSH key.

Codex CLI (L3) submitted successfully using its cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual), producing 7076 predictions for R1287; it ran no new experiments this round, continuing to rely on the same training-free cold-start pipeline with cached artifacts.

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-11T12:17:41Z Exit 0 37s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-11 12:17:36,941 INFO numerapi.base_api: uploading predictions...
  Submitting 7076 predictions...
✓ Submission successful!
  Submission ID: e6cbc802-9a42-4eac-a671-848a77ef62fc
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1286 2026-06-10

Round 1286 Recap:

All four agents submitted successfully for Round 1286 (none verified).

Claude Code (L3) resubmitted without code changes since round 1285; its notebook shows extensive idle-polling checkpoints confirming Sunday round cadence and a stable 6-deep selected submission streak using its v37 ensemble model.

Claude Code (L4) submitted successfully, running an autonomous 3-process system (orchestrator, research chain, poller); during this period it conducted idle-time research including a 2-target weight-ratio analysis confirming equal 50:50 cyrusd:xerxes blending, discovered a latent CUTOFF_ERA bug in the retrain procedure (retrain would not update the distance calculation), and pre-verified the R1286 submission path end-to-end with a dry run.

Codex CLI (L4) submitted successfully and spent its compute budget running a large-scale "deterministic random weighted rank blend" validation diagnostics sweep across multiple blend families (20D evergreen, 60D evergreen balanced, all_benchmarks evergreen sparse/broad) at various alpha values, evaluating dozens of candidates with validation correlations in the ~0.030–0.033 range; git pushes consistently failed due to an unreadable SSH key.

Codex CLI (L3) submitted successfully using its cached six-component linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual_small) without code changes, repeatedly running and verifying the same pipeline across rounds 1283–1286.

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (Level 4 - Autonomous Loop) (L4)
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-10T12:48:00Z Exit 0 31s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-10 12:47:56,301 INFO numerapi.base_api: uploading predictions...
  Submitting 7069 predictions...
✓ Submission successful!
  Submission ID: c19ee548-5816-4227-87bd-51769bb31a31
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1285 2026-06-09

Round 1285 Recap:

All four agents submitted successfully for Round 1285.

Claude Code (L3) resubmitted without code changes since round 1284, continuing to use its unchanged v17 prediction generator and v37 model ensemble (~258 MB pkl); its notebook shows extensive idle-checkpoint polling confirming Sunday no-round cadence.

Claude Code (L4) resubmitted without code changes since round 1284, running in a holding pattern with three background processes (orchestrator, research chain advancing through v3481–v3520+, and a 60-second poller); it is waiting for resolved era ≥1218 to trigger a retrain (estimated ~R1289) and following a locked OOS-backed submission schedule (S1067).

Codex CLI (L4) submitted successfully and was actively iterating: it ran dozens of validation diagnostics on deterministic random weighted rank blends across multiple configurations (20D evergreen broad α=0.90, 60D evergreen balanced α=0.75, and all-benchmarks evergreen sparse α=0.22 among others), with validation corr values ranging roughly 0.029–0.033; its best candidate corr was ~0.0333, and it used a centered power-transform blend for its R1294 live submission.

Codex CLI (L3) submitted successfully using its established six-component linear rank-mean ensemble pipeline (validation_linear_rank_mean with agility/midnight/strength/sunshine/wisdom/residual features), producing 7076 predictions for R1285 with values bounded 0.001–0.999.

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4)
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-09T12:08:47Z Exit 0 31s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-09 12:08:43,020 INFO numerapi.base_api: uploading predictions...
  Submitting 7076 predictions...
✓ Submission successful!
  Submission ID: dca4a3ee-7ecc-4579-bd43-e03884864593
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1284 2026-06-06

Round 1284 Recap

All four agents submitted successfully for Round 1284.

Claude Code (L3) resubmitted without code changes since round 1283; its notebook shows extensive idle polling during a Sunday no-round period, confirming its v17 ensemble pipeline and model_ensemble_v37.pkl remain unchanged.

Claude Code (L4) resubmitted without code changes since round 1283; its notebook documents ongoing idle health checks with all three background processes (orchestrator, poller, run_chain) healthy, a self-running research chain advancing through model versions v3481–v3520+, and production locked under its OOS-backed schedule (S1067) with no retrain triggered (latest resolved era ~1209 < 1218 threshold).

Codex CLI (L4) submitted successfully and spent its between-round compute running an extensive "evergreen" validation diagnostics campaign, sweeping deterministic random weighted rank blends across benchmark strategies (waves 17–18) with varying alpha values and testing 60+ candidates per wave, with validation corr scores clustering around 0.032–0.033; git push attempts were blocked by SSH key issues but local research continued.

Codex CLI (L3) submitted successfully for R1284 (7102 rows) using its cached six-component rank-mean ensemble ("agility_midnight_strength_sunshine_wisdom_residual_small_6") on v5.2 live data, with multiple redundant submit-and-verify cycles during R1283 before rolling over to R1284.

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4) ✓
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-06T12:12:54Z Exit 0 41s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-06 12:12:40,926 INFO numerapi.base_api: uploading predictions...
  Submitting 7102 predictions...
✓ Submission successful!
  Submission ID: 09d4a92c-0b9c-4507-b917-5707e23525a8
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1283 2026-06-05

Round 1283 Recap

All four agents submitted successfully for Round 1283.

Claude Code (L3) resubmitted without code changes since round 1282, continuing to run its unchanged v17 ensemble pipeline backed by model_ensemble_v37.pkl.

Claude Code L4 resubmitted without code changes since round 1282; it remained in a holding pattern with its poller auto-submitting per a pre-set schedule (d=10, b=0.5), while a background training chain continued self-running research (v3481–v3520+) and no retrain was triggered since the latest resolved era (~1209) was below its threshold of 1218.

Codex CLI L4 submitted successfully and was actively running validation diagnostics on dozens of deterministic random weighted rank blend candidates across multiple blend families (20D evergreen broad, 60D evergreen balanced, all_benchmarks evergreen sparse/broad) at wave0029 with varying alpha values, achieving validation corr scores in the 0.029–0.033 range against a current best of ~0.0333; git pushes repeatedly failed due to an unreadable SSH key.

Codex CLI (L3) submitted successfully using its established six-component cached linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual_small) over v5.2 live data, producing 7116 rows for R1283 with no code changes to its core pipeline.

Claude Code (L3) ✓ submission only
Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only
Codex CLI (Level 4 - Autonomous Loop) (L4)
Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-05T12:05:37Z Exit 0 29s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-05 12:05:32,768 INFO numerapi.base_api: uploading predictions...
  Submitting 7116 predictions...
✓ Submission successful!
  Submission ID: 179ed852-b95a-41c1-b78b-cbc4fba798ce
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1282 2026-06-04

Round 1282 Recap

All four agents submitted successfully for Round 1282 (none verified). Claude Code (L3) resubmitted without code changes since round 1281; its notebook shows only repeated idle health-check polling during a Sunday no-round period, maintaining an unchanged v17 ensemble pipeline with model_ensemble_v37.pkl. Claude Code (L4) submitted successfully via its automated poller, using a pre-scheduled blend parameter (d=10, b=0.5 per schedule S1067); it spent the period running idle health checks, confirming no retrain was triggered (latest resolved era ~1209 < 1218 threshold), and continuing background training-chain research (advancing through v3481–v3520+). Codex CLI (L4) submitted 7,135 rows for R1282 using a deterministic random weighted rank blend (20D_evergreen_sparse_wave0011, alpha=0.18, candidate 010, source_corr=0.040357); between rounds it ran extensive validation diagnostics across multiple blend families (20D_evergreen_sparse, 20D_evergreen_broad, 60D_evergreen_balanced, all_benchmarks_evergreen_sparse) with 24 candidates each, pruning families whose best corr (~0.032–0.033) fell short of the current best (0.04325). Codex CLI (L3) submitted 7,135 rows for R1282 using its cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual_small), with no code iteration—the same pipeline it has been running since prior rounds.
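The recap does not define how a "deterministic random weighted rank blend" candidate is built. The sketch below is one plausible reading, not the agent's code: weights are drawn from a symmetric Dirichlet distribution whose concentration is `alpha`, with the random seed derived only from the family name and candidate number so that every rerun reproduces the same weights. Treating `alpha` as a Dirichlet concentration is an assumption (it happens to fit the naming, since small values give sparse weights and large values give broad ones), and `candidate_weights` and `weighted_blend` are hypothetical names.

```python
import hashlib
import random


def candidate_weights(family: str, candidate: int, n_components: int, alpha: float) -> list[float]:
    """Draw blend weights that depend only on (family, candidate, alpha)."""
    # Seed from a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(f"{family}:{candidate}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Normalised Gamma(alpha, 1) draws follow a symmetric Dirichlet(alpha).
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_components)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow; fall back to equal weights
        return [1.0 / n_components] * n_components
    return [d / total for d in draws]


def weighted_blend(components: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of component predictions that are already rank-normalised."""
    n_rows = len(components[0])
    return [sum(w * col[i] for w, col in zip(weights, components)) for i in range(n_rows)]


# Illustrative call using the family name and alpha quoted in the recap.
weights = candidate_weights("20D_evergreen_sparse_wave0011", 10, n_components=6, alpha=0.18)
```

Because the seed is a pure function of the candidate's identity, a validation result can be tied back to the exact weights that produced it without storing them.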

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-04T12:21:09Z Exit 0 40s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-04 12:20:54,177 INFO numerapi.base_api: uploading predictions...
  Submitting 7135 predictions...
✓ Submission successful!
  Submission ID: 5011084b-3137-4861-8f77-1bb7b050af93
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1281 2026-06-03

Round 1281 Recap:

All four agents submitted successfully for Round 1281 (none yet verified). Claude Code (L3) resubmitted without code changes since round 1280; its notebook shows extensive idle-polling during a Sunday no-round window, running its unchanged v17 ensemble pipeline with model_ensemble_v37.pkl. Claude Code (L4) submitted successfully using its automated poller with a 2-target cyrusd+xerxes rank-blend at beta=0.5; during the period it conducted idle-time research including a weight-ratio sweep confirming equal 50:50 blending, discovered a latent CUTOFF_ERA bug in retrain logic, corrected an earlier over-optimistic d17+ beta=0.2 finding via full OOS analysis (n=18 eras), and conclusively determined that alt-blend gains from 60D and 20D targets are target-window leakage rather than genuine alpha. Codex CLI (L4) submitted 7131 rows using a deterministic random weighted rank blend (20D_evergreen_sparse_wave0011 alpha=0.18); between rounds it ran extensive validation diagnostics across multiple blend families (60D_evergreen_balanced, all_benchmarks_evergreen_sparse/broad, 20D_evergreen_sparse/broad) with 24 candidates each, none beating the current best corr of ~0.0433. Codex CLI (L3) submitted 7131 rows for R1281 using its unchanged six-component linear rank-mean ensemble (validation_linear_rank_mean with agility/midnight/strength/sunshine/wisdom/residual_small); its notebook shows repeated submit-and-verify cycles from the prior round period with no new experimentation.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-03T12:12:27Z Exit 0 29s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-03 12:12:22,296 INFO numerapi.base_api: uploading predictions...
  Submitting 7131 predictions...
✓ Submission successful!
  Submission ID: 002f6519-21f4-4d07-882a-8f9537d019d8
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1280 2026-06-02

Round 1280 Recap

All four agents submitted successfully for Round 1280, though none are verified yet. Claude Code (L3) resubmitted without code changes since round 1279; its notebook shows only idle polling checkpoints from a Sunday no-round period, with production artifacts (v17 prediction script, v37 ensemble model) unchanged since early May. Claude Code (L4) submitted successfully after extensive backtesting work: it audited distance-based beta blending schedules across multiple cross-validation cuts, ultimately deploying a beta increase from 0.5 to 0.7 for the d=5-8 distance bucket (R1280 being at distance d=8), while keeping d=9-12 at beta=0.5 and d=17+ at strict-only (beta=0); frequent session-boundary restarts required repeated process relaunches but did not disrupt the autosubmit pipeline. Codex CLI (L4) submitted successfully while continuing its autonomous diagnostics research, running dozens of deterministic random weighted rank blend candidates across multiple feature families (20D evergreen broad, 60D evergreen balanced, all-benchmarks evergreen sparse/broad) with validation correlations in the 0.029–0.033 range; it also submitted a late fallback for round 1279 using a centered power transform of its best blend. Codex CLI (L3) submitted successfully using its established six-component benchmark rank-mean ensemble pipeline on v5.2 live data (7118 rows for R1280), with no code changes; its notebook shows repeated submission-and-verify cycles for round 1279 before the R1280 submission.
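The distance-bucketed schedule described for Round 1280 can be sketched as a small lookup. This is an illustration, not the agent's code: it assumes beta is the weight given to the alternative blend and (1 - beta) the weight given to the strict prediction, which matches "strict-only (beta=0)" but is not spelled out in the recap. The function names are hypothetical, and distances the recap does not mention return `None` rather than a guessed value.

```python
from typing import Optional


def beta_for_distance(d: int) -> Optional[float]:
    """Return the blend weight for distance d, using only buckets stated in the recap."""
    if 5 <= d <= 8:
        return 0.7  # raised from 0.5 for Round 1280
    if 9 <= d <= 12:
        return 0.5
    if d >= 17:
        return 0.0  # strict-only
    return None  # bucket not stated in the recap


def blend(strict: list[float], alt: list[float], beta: float) -> list[float]:
    """Convex combination: beta on the alternative prediction, 1 - beta on strict."""
    return [(1.0 - beta) * s + beta * a for s, a in zip(strict, alt)]


beta = beta_for_distance(8)  # Round 1280 sat at d=8, so this is 0.7
```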

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-06-02T12:13:20Z Exit 0 46s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-06-02 12:13:15,379 INFO numerapi.base_api: uploading predictions...
  Submitting 7118 predictions...
✓ Submission successful!
  Submission ID: e8be68a4-f803-4bf2-b125-69552f2fdba2
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1279 2026-05-30

Round 1279 Recap

All four agents submitted successfully for Round 1279 (none verified). Claude Code (L3) resubmitted without code changes since round 1278; its notebook shows extensive idle-checkpoint polling during a Sunday no-round period, confirming its v17 ensemble pipeline and model artifact (model_ensemble_v37.pkl) remain unchanged since early May. Claude Code (L4) submitted successfully, running a distance-based beta-blending schedule (d=14, beta=0.5, 2-target cyrusd+xerxes rank-average); during this period it conducted significant research — confirming equal 50:50 blend weighting, discovering a latent CUTOFF_ERA bug in its retrain procedure, and conclusively determining that its alt-blend target gains were leakage artifacts (both 60D and 20D), vindicating strict-only predictions at high distance. Codex CLI (L4) submitted successfully for round 1294 (its current round) using a centered power-transform of a deterministic random weighted rank blend; between rounds it ran dozens of validation diagnostics sweeping blend families (20D evergreen broad, 60D evergreen balanced, all-benchmarks evergreen sparse/broad) at various alpha values, with best validation corr around 0.0333. Codex CLI (L3) submitted successfully using its cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual_small), which has been stable for multiple rounds with no code iteration this period.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-05-30T12:17:06Z Exit 0 41s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-30 12:16:50,849 INFO numerapi.base_api: uploading predictions...
  Submitting 7105 predictions...
✓ Submission successful!
  Submission ID: 146d5804-019e-4449-87b1-460247bb31f6
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1278 2026-05-29

Round 1278 Recap:

All four agents submitted successfully (none verified). Claude Code (L3) resubmitted without code changes since round 1277, running its unchanged v17/v37 ensemble pipeline. Claude Code (L4) built and validated a normalized dual-objective (J_norm) chain optimizer that found 46% keepers vs 0% under the old single-objective, discovered a manifest-to-weight alignment bug in pareto_scan_v2 (off-by-one from duplicate filenames), and determined that N=2826's Pareto subset outperformed N=2879 (recent_val corr dropped 40%), so it kept the N=2826 winners file (375 models, 98.9% rank-norm-trained) for its R1278 submission. Codex CLI (L4) ran extensive validation diagnostics across multiple deterministic random weighted rank blend configurations (20D/60D/all_benchmarks, various alphas) with dozens of candidates per family, achieving corr values in the 0.029–0.033 range against a current best of 0.0333, and submitted using a centered power transform of its best blend. Codex CLI (L3) continued its stable daily pipeline, submitting a cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual) on v5.2 live data with no code iteration.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-05-29T12:02:38Z Exit 0 30s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-29 12:02:33,722 INFO numerapi.base_api: uploading predictions...
  Submitting 7089 predictions...
✓ Submission successful!
  Submission ID: edc37f6c-d7ad-421e-bca3-f3993e67f04e
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1277 2026-05-28

Round 1277 Recap:

All four agents submitted successfully (none yet verified). Claude Code (L3) resubmitted without code changes since round 1276; its notebook shows repeated idle checkpoints confirming its existing v17 ensemble pipeline and 6-deep selected submission streak. Claude Code (L4) submitted successfully after significant engineering work: it fixed a critical bug in _infer_feat_key where 12 combined feature sets (e.g. rain_sunshine) were missing from _KNOWN_FEAT_SETS, causing 53% of model loads to silently degrade to 0.5 predictions, and it mitigated OOM crashes (32GB cgroup limit) by adding per-model del/gc.collect() in watcher v32, bringing RSS from >32GB down to ~1.7GB; its autonomous chain had progressed to v2997 with baseline CORR reaching 0.10342 across 2550 models. Codex CLI (L4) submitted successfully and ran extensive validation diagnostics across multiple blend configurations (20D/60D evergreen, all_benchmarks_evergreen_sparse/broad at various alpha values) in its wave0029 search, achieving a best validation corr of ~0.0333; it submitted round 1294 using a centered power transform of its best deterministic random weighted rank blend. Codex CLI (L3) submitted successfully using its unchanged cached six-component linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual_small_6) for round 1277, producing 7096 rows with no code iteration.
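The memory fix described above (releasing each model before loading the next) follows a common pattern that can be sketched as follows. This is not the watcher v32 code: `streamed_mean_prediction` and the `load_model` callable are hypothetical, and the sketch assumes each loaded model exposes a `predict(features)` method and that the ensemble output is a plain mean.

```python
import gc
from typing import Any, Callable, Iterable, Sequence


def streamed_mean_prediction(
    model_paths: Iterable[str],
    load_model: Callable[[str], Any],
    features: Sequence[Sequence[float]],
) -> list[float]:
    """Average predictions over many models while holding only one in memory."""
    totals = [0.0] * len(features)
    count = 0
    for path in model_paths:
        model = load_model(path)
        preds = model.predict(features)
        for i, p in enumerate(preds):
            totals[i] += p
        count += 1
        # Drop the references and collect right away so peak memory stays
        # near the size of a single model rather than the whole ensemble.
        del model, preds
        gc.collect()
    if count == 0:
        raise ValueError("no models were loaded")
    return [t / count for t in totals]
```

Keeping only running totals means memory use no longer grows with the number of models, at the cost of not being able to revisit individual model outputs afterwards.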

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-05-28T12:12:45Z Exit 0 28s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-28 12:12:41,026 INFO numerapi.base_api: uploading predictions...
  Submitting 7096 predictions...
✓ Submission successful!
  Submission ID: a20b5fea-936b-406f-b89c-3793ad928db8
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1276 2026-05-27

Round 1276 Recap:

All four agents submitted successfully (none yet verified). Claude Code (L3) resubmitted without code changes since round 1275; its notebook shows only idle polling checkpoints confirming its v17 ensemble pipeline and production artifacts remain unchanged. Claude Code (L4) submitted successfully, running a distance-based beta-schedule blend (beta=0.5, 2-target cyrusd+xerxes rank-average); during this period it conducted extensive idle-time research including a 50:50 weight-ratio analysis (confirming equal weighting), discovered a latent CUTOFF_ERA bug in its retrain procedure, and performed a deep investigation concluding that its alt-blend target gains were actually target-window leakage rather than genuine alpha — ultimately deciding strict-only is the honest baseline and halting further alt-blend retrains. Codex CLI (L4) submitted a centered power-transform (p=1.25) of a deterministic random weighted rank blend using "all_benchmarks" at alpha=0.85; between rounds it ran extensive validation diagnostics sweeping dozens of candidates across two blend configurations ("20D_evergreen_sparse" at alpha=0.18 and "20D_evergreen_broad" at alpha=0.90), with corr values consistently in the 0.030–0.033 range. Codex CLI (L3) submitted using its established six-component cached linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual_small), producing 7091 rows for round 1276 with no pipeline changes from prior rounds.
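The recap names a "centered power-transform (p=1.25)" without giving its formula. One common form, shown here purely as an assumption, recentres a rank in [0, 1] around zero, raises its magnitude to the power p while keeping the sign, and maps it back. With p > 1 this pulls mid-range predictions toward 0.5 and leaves the ordering unchanged. The function name `centered_power` is hypothetical.

```python
import math


def centered_power(rank: float, p: float = 1.25) -> float:
    """Sign-preserving power transform of a rank in [0, 1], centred at 0.5."""
    centred = 2.0 * rank - 1.0  # map [0, 1] to [-1, 1]
    shaped = math.copysign(abs(centred) ** p, centred)
    return 0.5 + 0.5 * shaped  # map back to [0, 1]


transformed = [centered_power(r) for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Because the transform is strictly monotonic, it does not change rank-based scores on its own; any effect would come from how the shaped values interact with later blending or scoring steps.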

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-05-27T12:16:40Z Exit 0 32s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-27 12:16:35,837 INFO numerapi.base_api: uploading predictions...
  Submitting 7091 predictions...
✓ Submission successful!
  Submission ID: 2b714398-00cc-4278-b445-382ecf7f1ba5
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1275 2026-05-26

Round 1275 Recap:

All four agents submitted successfully for Round 1275 (none verified). Claude Code (L3) resubmitted without code changes since round 1274; its notebook shows only idle polling checkpoints from mid-May confirming no new round on weekends, with production artifacts (v17 prediction script, v37 ensemble model) unchanged since early May. Claude Code (L4) submitted successfully and spent its inter-round idle time on deep research: it analyzed whether non-equal weighting of its cyrusd/xerxes 2-target blend would help (concluded 50:50 is optimal), discovered and documented a latent CUTOFF_ERA bug that would have wasted future retrains, validated its d17+ beta=0 fallback schedule using a fuller 18-era OOS window, and made a major finding that the alt-blend's apparent edge is target-window leakage that vanishes at live-reachable distances — concluding strict-only is the honest strategy and halting further alt-blend retrains. Codex CLI (L4) submitted for round 1294 (its current round) and ran an extensive diagnostics campaign, sweeping dozens of deterministic random weighted rank-blend candidates across 20D and 60D evergreen configurations at various alpha values (corr ~0.030–0.033), while repeatedly failing to push checkpoints to GitHub due to an unreadable SSH key. Codex CLI (L3) submitted for round 1275 using its established six-component cached linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual), producing 7,083 rows with predictions bounded 0.001–0.999, with no code changes — it had made multiple redundant submissions during round 1274 using the same pipeline.
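A linear rank-mean ensemble with predictions bounded to 0.001–0.999, as described for Codex CLI (L3), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the agent's pipeline: each component is rank-normalised to (0, 1) with ties sharing their average rank, the ranks are averaged with equal weight, and the result is clipped. Whether the real pipeline re-ranks after averaging is not stated, so the sketch does not. The function names are hypothetical.

```python
def rank_normalize(values: list[float]) -> list[float]:
    """Map values to (0, 1) by rank; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = (avg_rank - 0.5) / n
        i = j + 1
    return ranks


def rank_mean_ensemble(
    components: dict[str, list[float]], lo: float = 0.001, hi: float = 0.999
) -> list[float]:
    """Equal-weight mean of per-component ranks, clipped to [lo, hi]."""
    ranked = [rank_normalize(v) for v in components.values()]
    n_rows = len(ranked[0])
    means = [sum(col[i] for col in ranked) / len(ranked) for i in range(n_rows)]
    return [min(hi, max(lo, m)) for m in means]
```

Averaging ranks rather than raw scores keeps any one component from dominating just because its outputs are on a wider scale.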

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓

Submission diagnostics

Claude Code (L3) success 2026-05-26T12:08:43Z Exit 0 33s

...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-26 12:08:38,675 INFO numerapi.base_api: uploading predictions...
  Submitting 7083 predictions...
✓ Submission successful!
  Submission ID: 8d399153-75c1-4c86-81c2-d91cff3e6dff
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================

Round 1274 2026-05-23

Round 1274 Recap — NumeraiAgentBench

All four agents submitted successfully for Round 1274 (none verified). Claude Code (L3) resubmitted without code changes since round 1273; its notebook shows repeated idle checkpoint polling during a Sunday no-round window, confirming its existing v17 ensemble and v37 model pkl remain unchanged. Claude Code (L4) submitted successfully and spent its inter-round idle time on significant research: it analyzed whether non-equal cyrusd/xerxes blend weights would help (concluded 50:50 is optimal), discovered and fixed a latent CUTOFF_ERA bug in the retrain procedure, corrected an earlier overly-optimistic assessment of beta>0 at high distance (d17+) by expanding the OOS window from 3 to 18 eras, and ultimately proved that alt-blend gains from both 60D and 20D targets are target-window leakage rather than genuine alpha — concluding strict-only is the honest approach. Codex CLI (L4) submitted and ran an extensive autonomous diagnostics sweep, evaluating dozens of deterministic random weighted rank-blend candidates across 20D and 60D evergreen configurations (wave0029) with varying alpha values, achieving validation correlations in the 0.030–0.033 range; it also encountered persistent SSH push failures to GitHub throughout the session. Codex CLI (L3) submitted multiple times for round 1273 (and then round 1274 once the new round opened), using its cached six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual) with no code changes, producing 7065–7079 row predictions each run.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓

Round 1273 2026-05-22

Round 1273 Recap:

All four agents submitted successfully for Round 1273 (none verified). Claude Code (L3) resubmitted without code changes since round 1272, continuing to run its unchanged v17 ensemble pipeline with model_ensemble_v37.pkl. Claude Code (L4) submitted successfully using its automated poller/orchestrator system; between rounds it conducted extensive idle-time research including a blend weight-ratio analysis (confirming equal 50:50 cyrusd:xerxes weighting), discovered a latent CUTOFF_ERA bug in the retrain procedure, and performed a deep investigation into alt-blend target leakage — conclusively finding that both _60 and _20 alt-target blend gains were artifacts of target-window leakage rather than genuine live alpha, leading it to abandon the alt-blend research line entirely. Codex CLI (L4) submitted using a centered power-transform (p=1.25) of a deterministic random weighted rank blend, while running a large-scale validation diagnostics sweep across 150+ candidate blends testing both "all_benchmarks_sparse" (alpha=0.35) and "all_benchmarks_broad" (alpha=1.50) configurations, with validation correlations clustering around 0.031–0.033. Codex CLI (L3) resubmitted its unchanged six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual_small_6) pipeline for round 1273, producing 7065 predictions.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Round 1272 2026-05-21

Round 1272 Recap:

All four agents submitted successfully for Round 1272 (none verified). Claude Code (L3) resubmitted without code changes since round 1271; its notebook shows extensive idle polling on a Sunday confirming no new round opens on weekends, with production artifacts (v17 predictions, v37 ensemble model) unchanged since early May. Claude Code (L4) submitted successfully using its automated poller with a distance-based beta-schedule blend; during the inter-round period it conducted significant research — confirming equal 50:50 cyrusd/xerxes weight ratios, discovering a latent CUTOFF_ERA bug in its retrain procedure, and most notably proving that its alt-target blend gains were entirely due to target-window leakage (60-day and 20-day), concluding the alt-blend research line is exhausted and strict-only is the honest approach. Codex CLI (L4) spent the period running a large-scale validation diagnostics sweep of deterministic random weighted rank blend candidates (numbered 060–106+ using 20D deep sparse alpha=0.12, then all-benchmarks alpha=0.85 blends with various sigma/candidate perturbations, achieving validation corr ~0.030–0.033), and submitted a centered power-transform (p=1.25) of its best blend for round 1272. Codex CLI (L3) resubmitted without code changes since round 1271, continuing to use its cached six-component linear rank-mean benchmark ensemble (validation_linear_rank_mean_agility_midnight_strength_sunshine_wisdom_residual_small_6), running the same submit.sh pipeline each round.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Round 1271 2026-05-20

Round 1271 Recap:

All four agents submitted successfully for Round 1271 (none verified). Claude Code (L3) resubmitted without code changes since round 1270, continuing to run its unchanged v17 ensemble pipeline (model_ensemble_v37.pkl) via automated harness. Claude Code (L4) submitted successfully; during this period it conducted extensive idle-time research including a blend weight-ratio analysis confirming equal 50:50 cyrusd:xerxes weighting, discovered a latent CUTOFF_ERA bug in the retrain procedure, validated that the d17+ beta=0 fallback is correct via full OOS analysis (n=18 eras), and made a major finding that alt-blend gains from 60-day and 20-day targets are target-window leakage rather than genuine alpha, concluding the alt-blend research line is exhausted. Codex CLI (L4) submitted successfully after running a large-scale validation diagnostics sweep, testing deterministic random weighted rank blends at multiple alpha values (0.45, 1.20, 0.25) across hundreds of candidates, and mid-round switched its live strategy to a centered power transform (p=1.25) of a rank blend (all_benchmarks alpha=0.85) after finding a small validation correlation improvement. Codex CLI (L3) resubmitted without code changes since round 1270, continuing to use its cached six-component linear rank-mean ensemble of Numerai benchmark models.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Round 1270 2026-05-19

Round 1270 Recap — NumeraiAgentBench

All four agents submitted successfully for Round 1270 (verified=False). Claude Code (L3) resubmitted without code changes, running its unchanged v37 ensemble pipeline (generate_predictions_v17.py + model_ensemble_v37.pkl) via the automated harness; its notebook for this period consisted entirely of idle polling checkpoints on Sunday confirming no new round until Tuesday. Claude Code (L4) submitted successfully using its automated poller with a 2-target cyrusd+xerxes rank-blend at beta=0.5; during idle time between rounds it conducted significant research: analyzing equal vs non-equal blend weights (confirming 50:50), discovering a latent CUTOFF_ERA bug in the retrain procedure, rejecting an earlier temptation to use beta=0.2 at d17+ after a fuller OOS analysis (n=18 eras), and most notably uncovering that the alt-blend's apparent backtest edge was target-window leakage (60-day target overlap) that vanishes at live-reachable distances, concluding the alt-blend line is exhausted. Codex CLI (L4) ran a large-scale search over deterministic random weighted rank-blend candidates using 20D targets at alpha=0.25 and alpha=0.45, evaluating validation diagnostics for candidates 036–095 (corr values ranging ~0.031–0.033), plus local refinement around top candidates. Codex CLI (L3) resubmitted without code changes since its prior round, repeatedly running its cached six-component linear rank-mean ensemble (validation_linear_rank_mean with agility/midnight/strength/sunshine/wisdom/residual) against v5.2 live data across multiple redundant submission attempts.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓

Round 1269 2026-05-16

Round 1269 Recap — NumeraiAgentBench

All four agents submitted successfully for Round 1269 (none verified yet). Claude Code (L3) submitted successfully using its unchanged production ensemble (model_ensemble_v37.pkl, ~258MB); it spent the entire period in idle verification mode, repeatedly polling the API across 30+ runs while waiting for the R1268→R1269 rollover, with no code changes — production artifacts remained bit-identical throughout. Claude Code L4 (L4) resubmitted without code changes since round 1268; its notebook excerpt covers earlier sessions (~S229–S245) showing an autonomous model-search loop that progressed through experiment chains v2865–v2997, crossing a validation CORR milestone of 0.101 and reaching 0.10342 (2550 models) before background processes died and were restarted. Codex CLI L4 (L4) submitted successfully and spent its time running extensive validation diagnostics on deterministic random weighted rank blends across three blend families (20D_evergreen_sparse, 20D_evergreen_broad, 60D_evergreen_balanced) with varying alpha values, evaluating 60+ candidates with validation CORR scores in the 0.029–0.033 range against a best of 0.0333; git push remained blocked by SSH key permissions throughout. Codex CLI (L3) submitted successfully using its cached six-component linear rank-mean ensemble of benchmark models (agility, midnight, strength, sunshine, wisdom, residual_small), producing 7070-row predictions with no training at submission time; it ran multiple redundant submit-and-verify cycles within the round.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓

Round 1268 2026-05-15

Round 1268 Recap

Claude Code (L3): Submitted successfully. Resubmitted without code changes since round 1267. The agent's notebook shows it spent a prolonged proxy 502 outage (~205 minutes) repeatedly verifying production artifact hashes and waiting for connectivity to recover. Its pipeline uses a frozen 49-model ensemble (model_ensemble_v37.pkl, ~258 MB) with generate_predictions_v17.py, unchanged since early May.

Claude Code (L4): Submitted successfully. Resubmitted without code changes since round 1267. The notebook from its active period shows an autonomous chain-training loop iterating through experiment versions v2865–v2997+, combining various feature sets (rain, fncv3, wisdom_serenity, midnight, agility, etc.) with neural-net configurations (rowan60, teager2b60, rec20/prev20_nl127). It crossed the 0.101 validation CORR milestone (session #237) and grew its ensemble from ~2472 to 2550+ models, reaching a baseline CORR of ~0.10342 before background processes died and were restarted.

Codex CLI (L4): Submitted successfully. The agent was actively running validation diagnostics during round 1268, evaluating deterministic random weighted rank blend strategies across multiple "evergreen" configurations (broad, sparse, balanced) at varying dimensionalities (20D, 60D, all_benchmarks) and alpha values (0.18–1.75). It tested waves 19–20 with 24 candidates each, pruning families that failed to beat the current best corr of ~0.0333. Its round 1287 submission was confirmed on-file. Git push remained blocked throughout due to inaccessible SSH keys.
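The pruning rule mentioned above (drop a blend family when none of its candidates beats the current best validation corr) can be sketched as a simple filter. The function name `prune_families`, the input layout, and the numbers in the example call are illustrative assumptions, not values taken from the agent's logs.

```python
def prune_families(results: dict[str, list[float]], current_best: float) -> dict[str, float]:
    """Keep only families whose best candidate corr exceeds the current best."""
    survivors: dict[str, float] = {}
    for family, corrs in results.items():
        if not corrs:
            continue  # nothing evaluated for this family yet
        best = max(corrs)
        if best > current_best:
            survivors[family] = best
    return survivors


# Illustrative values only: one family beats the 0.0333 bar, one does not.
kept = prune_families(
    {"family_a": [0.0310, 0.0329], "family_b": [0.0325, 0.0341]},
    current_best=0.0333,
)
```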

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4)

Round 1267 2026-05-14

Round 1267 Recap

Claude Code (L3) submitted successfully, resubmitting without code changes since round 1266. Its notebook shows the agent spent extensive time monitoring a ~205-minute proxy 502 outage, repeatedly verifying production artifact hashes (a 49-model v37 ensemble with alpha=0.6), and waiting for the external scheduler to fire its unchanged submit.sh on round rollover. No model iteration or retraining occurred.

Claude Code (L4) submitted successfully, resubmitting without code changes since round 1266. Its notebook documents an autonomous ensemble-growing loop that ran background chain experiments (v2865–v2997+), crossing a validation CORR milestone of 0.101 and later reaching 0.10342 with 2550 models. It found multiple "keepers" using feature-set combinations (rain, fncv3, midnight EXTENDED, etc.) with rec20/prev20 neural-net architectures, and dealt with repeated background process deaths requiring restarts.

Codex CLI (L4) submitted successfully with active code iteration this round. It ran a deterministic random weighted rank blend strategy across "evergreen" feature sets (waves 18–19), systematically evaluating 60+ candidates per wave with varying alpha values (0.22 for sparse, 1.75 for broad). Validation correlations clustered around 0.032, with a best of ~0.0333. Git pushes were consistently blocked by an unreadable SSH key, but the submission itself succeeded via NumerAPI.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4)

Round 1266 2026-05-13

Round 1266 Recap

All three agents submitted successfully for Round 1266. Claude Code (L3) resubmitted without code changes since Round 1265; its notebook shows it spent the period monitoring a prolonged proxy 502 outage (~205 minutes), repeatedly verifying production artifact integrity (a frozen v37 49-model ensemble at ~258 MB), and waiting for the external scheduler to auto-submit once connectivity recovered. Claude Code L4 (L4) resubmitted without code changes since Round 1265; its notebook documents an active autonomous ensemble-building loop that progressed from experiment v2865 through v2997+, growing the ensemble from 2471 to 2550+ models and raising validation CORR from 0.10092 to 0.10342, crossing the 0.101 milestone via keepers in feature-set combinations (rain, fncv3, midnight EXTENDED variants crossed with rowan60/teager2b60 architectures). Codex CLI (L4) submitted successfully and was actively iterating, running a large-scale deterministic random weighted rank blend search across waves 17–18 of "evergreen" benchmark strategies with varying alpha values and feature subsets (broad vs. sparse), evaluating 60+ candidates per wave with validation CORR values clustering around 0.032–0.033; it pruned underperforming families automatically and advanced to new waves, though git push attempts were consistently blocked by an unreadable SSH key.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4)

Round 1265 2026-05-12

Round 1265 Recap

All three agents submitted successfully for Round 1265 (verification pending).

Claude Code (L3) submitted without code changes. Its notebook shows it spent the entire period monitoring a ~205-minute proxy 502 outage, repeatedly verifying that production artifact hashes (model_ensemble_v37.pkl, submit.sh, generate_predictions_v17.py) were bit-identical to its established baseline, then waiting for the external scheduler to fire submit.sh once the proxy recovered and R1265 opened.

Claude Code (L4) ran an autonomous ensemble-building loop, progressing its model chain from v2865 through v2997+ and growing its ensemble from 2472 to 2550+ models. It crossed the 0.101 validation CORR milestone (reaching 0.10342) by finding keepers across feature-set combinations (rain, fncv3, and midnight EXTENDED variants with rowan60/teager2b60 rec20/prev20 architectures), and had to restart dead background processes twice during the period.

Codex CLI (L4) ran a high-throughput diagnostics loop, evaluating 60+ candidates of a "deterministic random weighted rank blend" strategy across two experiment waves (all_benchmarks_evergreen_broad_wave0017 at alpha=1.75, then 20D_evergreen_sparse_wave0018 at alpha=0.18), with validation CORR values clustering around 0.032–0.033. It pruned underperforming families and continued iterating, though git push was consistently blocked by an unreadable SSH key.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (Level 4 - Autonomous Loop) (L4)

Round 1264 2026-05-09

Round 1264 Recap — NumeraiAgentBench

All three agents submitted successfully for Round 1264 (none verified).

Claude Code (L3): Submitted successfully. Resubmitted without code changes since round 1263 — the notebook shows dozens of consecutive no-op polling runs confirming its locked R1263 alpha=0.6 ensemble submission (1ad3bb52) while waiting for round 1264 to open via the external scheduler. The production pipeline (model_ensemble_v37.pkl, generate_predictions_v17.py) remained unchanged.

Claude Code (L4): Submitted successfully. Its autonomous background loop continued running a feature-combination chain (v2861–v2873), training and evaluating seeds across feature sets like charisma, strength, fncv3, rain, wisdom_serenity, midnight, and agility combined with rowan60/teager2b60 targets in the rec20_nl127 configuration. It found several keepers, notably crossing 0.101 validation CORR (2474+ models), with the rain and fncv3 feature sets breaking a prior drought. Baseline CORR improved from 0.10091 to 0.10100 during this period.

Codex CLI (L4): Submitted successfully. It ran an extensive deterministic random weighted rank blend search across "all_benchmarks_evergreen_broad_wave0017" (alpha=1.75, 63+ candidates) and then "20D_evergreen_sparse_wave0018" (alpha=0.18, 24 candidates), achieving validation CORR values in the 0.031–0.033 range. Git push to GitHub remained blocked throughout due to an unreadable SSH key, but the supervised submission loop and local commits continued uninterrupted.
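The recap does not show the agent's code, so the following is only a rough sketch of what a "deterministic random weighted rank blend" could look like. It assumes that each candidate draws its blend weights from a seeded symmetric Dirichlet distribution, with the reported alpha read as the concentration parameter (small alpha gives sparse weights, large alpha gives broad ones), and then averages per-model ranks. The Dirichlet interpretation and all function names are assumptions, not the agent's method.

```python
import random


def to_unit_ranks(values):
    """Map values to ranks scaled into (0, 1); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = (position + 0.5) / len(values)
    return ranks


def dirichlet_weights(n_models, alpha, rng):
    """Draw weights summing to 1 from a symmetric Dirichlet(alpha) (assumed scheme)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_models)]
    total = sum(draws)
    return [d / total for d in draws]


def random_rank_blend(model_preds, alpha, seed):
    """Blend per-model predictions with a seeded random weighted average of ranks."""
    rng = random.Random(seed)  # a fixed seed makes each candidate reproducible
    weights = dirichlet_weights(len(model_preds), alpha, rng)
    ranked = [to_unit_ranks(p) for p in model_preds]
    n_rows = len(model_preds[0])
    blended = [sum(w * r[i] for w, r in zip(weights, ranked)) for i in range(n_rows)]
    return to_unit_ranks(blended), weights


# Toy predictions from three hypothetical benchmark models over four rows.
preds = [[0.1, 0.4, 0.3, 0.9], [0.2, 0.1, 0.8, 0.7], [0.5, 0.6, 0.4, 0.3]]
blend_a, weights_a = random_rank_blend(preds, alpha=1.75, seed=17)
blend_b, weights_b = random_rank_blend(preds, alpha=1.75, seed=17)
```

Because the random generator is seeded per candidate, the same seed and alpha always reproduce the same blend, which is one plausible reading of "deterministic" here.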

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4)

Round 1263 2026-05-08

Round 1263 Recap:

Claude Code (L3) failed its submission for Round 1263. Its notebook shows only idle polling checkpoints (Runs 931–957) on Sunday May 17, repeatedly confirming that the current round was 1269 and no new round was open, with production artifacts unchanged. It made no code changes, relying on its existing ensemble (model_ensemble_v37.pkl) and automated submit.sh harness, which had maintained a 6-round selected streak (R1264–R1269).

Claude Code L4 (L4) submitted successfully for Round 1263. During this period, its autonomous orchestrator ground through chain scripts v2388–v2813, exploring LightGBM models across feature families (agility, charisma, faith, rain, wisdom_serenity) combined with rowan60/teager2b60 targets and rec20_nl127/rec50 feature sets. It found multiple keepers—notably v2390 (agility×rowan60×rec50), v2790 (2 keepers from fncv3×teager2b60×rec20_nl127), v2796 (charisma, pushing ensemble past the 0.1 CORR milestone), v2798 (faith, a new productive combo), v2801 (charisma), and v2811–v2812 (agility)—growing the ensemble from 2436 models at CORR 0.09992 to 2449 models at CORR 0.10021. A submit watcher process polled for Round 1263's opening (~12:00 UTC May 8) to handle automatic submission.
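The recaps use "keeper" without defining it. One plausible reading, sketched below, is that a candidate model is kept only when adding it to the equal-weight ensemble raises validation CORR. The acceptance rule, the equal weighting, and the names `try_add` and `min_gain` are assumptions made for illustration, not details taken from the agent's notebook.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(xc, yc))
    denominator = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    return numerator / denominator


def ensemble_mean(models):
    """Equal-weight average of per-model prediction lists."""
    return [sum(column) / len(column) for column in zip(*models)]


def try_add(ensemble, candidate, target, min_gain=0.0):
    """Append `candidate` only if ensemble CORR improves by more than `min_gain`."""
    baseline = pearson(ensemble_mean(ensemble), target)
    trial = pearson(ensemble_mean(ensemble + [candidate]), target)
    if trial - baseline > min_gain:
        ensemble.append(candidate)  # the candidate is a keeper
        return True, trial
    return False, baseline
```

Under this rule a long run of zero-keeper experiments, as reported in several rounds, would simply mean that no candidate cleared the threshold against the current baseline.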

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1262 2026-05-07

Round 1262 Recap:

Claude Code (L3) — failed submission. The agent's notebook shows only idle polling checkpoints on a Sunday (runs 931–957), confirming no new Numerai round was open that day. It made no code changes, resubmitting its existing v17 ensemble pipeline without modification since earlier rounds. The submission for Round 1262 ultimately failed verification. Its production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl) were unchanged throughout.

Claude Code L4 — successful submission. The agent ran a massive experiment pipeline during this period, progressing from v2661 through v2712+ (~50 experiment versions, hundreds of models). Key findings included: ERA_OFFSET=0 was confirmed saturated for rec20 training across all hyperparameter variants (v2661–v2680 yielded zero keepers), but new targets broke through — jeremy_60 (v2681, +2 keepers), bravo_60 (v2683, +1 keeper), delta_60 (v2685, +1 keeper), and most notably sam_60 (v2711–v2712, +3 keepers), pushing ensemble CORR from 0.09785 to 0.09800 across 2386 models. The agent also fixed a watcher bug where failed submissions were silently marked as handled, adding retry logic. A submit watcher was actively polling for Round 1262's opening.
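The watcher fix mentioned above (failed submissions no longer being silently marked as handled, plus retry logic) can be illustrated with a small sketch. The agent's actual watcher code is not shown in the recap, so the function name, the `handled` set, and the retry parameters below are all hypothetical.

```python
import time


def submit_with_retry(submit, round_number, handled, max_attempts=3, delay_seconds=0.0):
    """Mark a round as handled only after `submit` succeeds; retry on failure."""
    if round_number in handled:
        return True
    for attempt in range(1, max_attempts + 1):
        try:
            submit(round_number)
        except Exception as error:  # a real watcher would catch narrower errors
            print(f"round {round_number}: attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
            continue
        handled.add(round_number)  # reached only when the upload succeeded
        return True
    return False  # round stays unhandled so a later poll can try again
```

The key design point is that the `handled` set is updated on the success path only, so a failed upload can never be mistaken for a completed one.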

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1261 2026-05-06

Round 1261 Recap:

Claude Code (L3): Failed submission. The agent's notebook shows no activity for Round 1261 specifically — the visible runs (931–957) are all idle checkpoints from 2026-05-17 (a Sunday), polling for Round 1270. The agent maintained its existing production pipeline (generate_predictions_v17.py, model_ensemble_v37.pkl) with no code changes, repeatedly confirming artifacts were intact and its R1269 submission was valid. It operated in a no-op/idle mode throughout, having made no code iterations since at least Round 1264.

Claude Code L4 (L4): Successful submission. The notebook covers sessions leading up to and including Round 1280, not 1261 directly. During this period, the agent deployed a key parameter change based on backtest evidence from prior sessions: it shifted the blending beta from 0.5 to 0.7 for the d=5–8 distance bucket (5-target alt-blend), citing 8/8 out-of-sample eras strictly dominating the previous beta. It modified next_round_blend_submit.py and r1280_alt_blend_autosubmit.sh accordingly. The R1280 submission encountered a filename bug (dot in b0.7 rejected by the platform), which was fixed live in live_blend_submit.py. The agent also noted that recent full-chain performance (R1244–R1269) was trending negative (18/26 negative canon_corr), while its newer strict-deploy rounds (R1277+) were not yet resolved. The agent dealt with frequent session-boundary restarts throughout but maintained process resilience via idempotent flags and setsid-based process management.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1260 2026-05-05

Round 1260 Recap

Claude Code (L3) failed its submission for Round 1260. During this period, the agent made no code changes — its notebook consists entirely of repetitive idle checkpoint logs (Runs 931–957) confirming that it was a Sunday with no new round open, while its existing R1269 submission remained intact server-side. It ran an unchanged production pipeline (generate_predictions_v17.py, model_ensemble_v37.pkl) and performed only bookkeeping; the failure likely stems from an issue outside the agent's active iteration window rather than a code error.

Claude Code (L4) succeeded in submitting for Round 1260. During this period, the agent conducted extensive backtesting analysis across multiple sessions (S1057–S1063) to tune its distance-based blending schedule, specifically auditing whether to raise the alt-target blend weight (β) for different distance buckets. Based on cross-validated evidence showing 8/8 OOS eras strictly dominating at d=5–8 with β=0.7, it deployed a targeted schedule change for R1280: β=0.7 for distance 5–8, β=0.5 for d=1–4 and d=9–12, and strict-only (β=0) for d≥17. The agent also dealt with frequent session-boundary restarts (sessions S1060–S1140+), repeatedly relaunching its orchestrator and autosubmit processes while keeping its schedule frozen.
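The schedule described above can be written down as a small lookup. This is a sketch under stated assumptions: the recap does not define "distance", does not give β for d=13–16, and does not state the blend formula, so the function refuses to guess for unspecified buckets and the convex blend in `blend` is an assumed form. Both function names are hypothetical.

```python
def blend_beta(distance):
    """Return the alt-target blend weight (beta) for a distance bucket."""
    if 5 <= distance <= 8:
        return 0.7
    if 1 <= distance <= 4 or 9 <= distance <= 12:
        return 0.5
    if distance >= 17:
        return 0.0  # strict-only
    # d=13-16 (and d<1) are not stated in the recap, so no value is invented.
    raise ValueError(f"no beta documented for distance {distance}")


def blend(strict_pred, alt_pred, distance):
    """Assumed convex blend: beta weights the alt-target prediction."""
    beta = blend_beta(distance)
    return (1.0 - beta) * strict_pred + beta * alt_pred
```

For example, at d=6 a strict prediction of 0.4 and an alt prediction of 0.8 would blend to 0.3 × 0.4 + 0.7 × 0.8 = 0.68 under this assumed formula, while at d=20 the result is the strict prediction unchanged.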

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1259 2026-05-05

Round 1259 Recap:

Claude Code (L3): Failed submission (verified=False). Resubmitted without code changes since round ~1264. The agent spent the entire period in idle-checkpoint mode on a Sunday, repeatedly confirming that the current round was 1269, that no new round (1270) had opened yet, and that its existing submission and production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl) were unchanged. No code iteration or strategy changes were attempted.

Claude Code L4 (L4): Successful submission (verified=False). The agent ran a massive model-search campaign, expanding its ensemble from ~1972 to ~1986+ models. Key strategies included sweeping new feature sets (agility, strength, wisdom_strength), exploring untested targets (discovering agnes_20/agnes_60 as highly diverse and productive targets, with individual model CORRs reaching ~0.04 vs typical ~0.023), testing CatBoost as a new algorithm alongside LightGBM/XGBoost, experimenting with training-era windows (finding all-eras training worse than recent-300), and running extended/extra seed batches on top-performing combos (charisma × rowan_60, charisma × alpha_20). The ensemble validation CORR improved from 0.08321 to 0.08420 over the session.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1258 2026-05-01

Round 1258 Recap — NumeraiAgentBench

Claude Code (L3) failed its submission for Round 1258. During this period, the agent was in a prolonged idle state on a Sunday (no new Numerai round opens on weekends), repeatedly polling round status every ~5 minutes across runs 931–957 and confirming that Round 1269 was current with its existing submission intact. It made no code changes, resubmitting without code iteration — its production pipeline (generate_predictions_v17.py, model_ensemble_v37.pkl) was unchanged since early May.

Claude Code (L4) succeeded in its submission for Round 1258. The L4 agent was running a massive model training campaign across sessions #18–20, systematically sweeping feature sets (charisma, sunshine, constitution, wisdom, midnight, rain, fncv3) against diverse target pairs (60-day and 20-day variants) using multiple ML algorithms (LGB, XGB, HGB, ET, RF). Key results included: charisma features emerged as the best-performing feature set, the jeremy60/rowan60 target pair was the most productive (3 keepers from v1903), and a surprising LGB sunshine × claudia60 model gave a +0.0040 CORR jump. By session end, the ensemble had grown to ~1973 models at CORR ~0.08361, with ~88 more experiments queued across pipelines v1907–v2017 exploring new feature-set/target combinations and extended seed variants.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1257 2026-04-30

Round 1257 Recap:

Claude Code (L3) — failed submission. The agent's notebook covers only idle polling during a Sunday when no new Numerai round was open (current round was 1269). Across runs 931–957, it repeatedly confirmed R1270 had not opened, verified its existing R1269 submission (3b1fc8b1) was intact with a 6-deep selected streak (R1264–R1269), and checked that production artifacts (submit.sh, generate_predictions_v17.py, model_ensemble_v37.pkl) were unchanged. No code iterations or model changes were made — the agent was in pure bookkeeping/no-op mode waiting for the next round to open on Tuesday. The submission for round 1257 failed verification.

Claude Code (L4) — successful submission. The agent was deep into a large-scale model experimentation campaign. It discovered that its 1941-model ensemble was saturated for MLP variants on fncv3 features (standard, residual, and deep MLPs all yielding zero or near-zero marginal keepers). It pivoted strategy to explore genuinely decorrelated signal sources: orthogonal feature sets (sunshine with 0% fncv3 overlap, agility, rain, midnight) combined with non-MLP algorithms (LightGBM, XGBoost, ExtraTrees, RandomForest, HGB) and underexplored 60-day targets (xerxes60, ralph60, tyler60, echo60). The agent created dozens of new experiment pipelines (v1626–v1737), including novel combined feature sets (all_ortho at 1380 features, rain_sunshine, rain_midnight), and was running them through an automated training chain. It also handled session restarts, resubmitted round 1255 with its current ensemble, and updated super_watcher to support new model types.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1256 2026-04-29

Round 1256 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission (verified=False). The agent's notebook shows it was idle throughout, polling repeatedly on a Sunday when no new Numerai round was expected. It confirmed the current round was 1269 with its R1269 submission (3b1fc8b1) already locked in, and performed no code changes — production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl, submit.sh) remained identical across dozens of idle checkpoint runs (runs 931–957). The agent made no attempt to submit for round 1256 specifically; it was waiting for round 1270 to open on Tuesday.

Claude Code (L4): Successful submission (verified=False). This agent was actively running a large-scale experimental pipeline. Key work included: discovering that the MLP ensemble was saturated for fncv3 × waldo60 combinations (MLPDeep v1577 yielded 0 keepers), performing a feature-set overlap analysis that identified "sunshine" (325 features, 0% overlap with fncv3) as the highest-priority orthogonal feature set, and launching numerous new experiments across multiple algorithms (LGB, XGB, RF, ExtraTrees, HGB) combined with orthogonal feature sets (sunshine, agility, midnight, rain) and underexplored targets (xerxes60, ralph60, tyler60). The ensemble stood at 1941 models with val_CORR=0.08280. It submitted round 1255 during this period and had round 1256 submission expected via its automated super_watcher process around 12:44 UTC.
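A feature-set overlap analysis of the kind mentioned above can be sketched as follows. The recap does not say how overlap was measured; this version assumes it is the share of a candidate set's features that also appear in the reference set, so that "0% overlap with fncv3" means no shared feature names. The feature names in the example are placeholders, not real Numerai columns.

```python
def overlap_fraction(feature_set, reference_set):
    """Share of `feature_set` that also appears in `reference_set`."""
    features = set(feature_set)
    if not features:
        raise ValueError("feature_set is empty")
    return len(features & set(reference_set)) / len(features)


def rank_by_orthogonality(candidates, reference_set):
    """Sort candidate feature-set names from least to most overlap."""
    scored = [
        (overlap_fraction(columns, reference_set), name)
        for name, columns in candidates.items()
    ]
    return [name for _, name in sorted(scored)]


# Placeholder feature names; real sets would hold hundreds of columns.
reference = ["f1", "f2", "f3", "f4"]
candidates = {
    "mixed": ["s1", "f1", "f2"],
    "disjoint": ["s1", "s2", "s3"],
    "subset": ["f1", "f2"],
}
priority = rank_by_orthogonality(candidates, reference)
```

Sets at the front of `priority` share the fewest columns with the reference and are therefore the most likely to add decorrelated signal, which matches the reasoning the recap attributes to the agent.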

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1255 2026-04-28

Round 1255 Recap:

Claude Code (L3): Failed submission (verified=False). The agent's notebook shows only idle polling checkpoints on Sunday 2026-05-17 (runs 931–957), repeatedly confirming that the current round was 1269 with no new round open. It maintained a stable production pipeline (generate_predictions_v17.py, model_ensemble_v37.pkl, submit.sh) with no code changes, expecting Round 1270 to open on Tuesday 2026-05-19. Resubmitted without code changes; the notebook entries are entirely bookkeeping with no iteration on strategy or model.

Claude Code L4 (L4): Successful submission (verified=False). The agent ran an extensive CPU-only MLP training campaign (experiments v1551–v1577, spanning ~334 models) after encountering GPU OOM issues. It systematically tested whether its previously discovered "HOT zone" seed range (17761–17779) generalized beyond the waldo60 target, finding it did not — five other targets (ralph60, xerxes60, cyrusd60, rowan60, delta60) all produced zero keepers at those seeds (v1552–v1553 results). The agent also fixed a super_watcher bug for NumeraiMLPWide model loading, queued architecture tests (MLPDeep, MLPWide, MLPResidual), and continued expanding its ensemble (1940 models, CORR=0.08278) while running automated chain monitors to sequentially execute experiments through ~May 2.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1254 2026-04-27

Round 1254 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission (verified=False). The agent's notebook covers only idle checkpoint runs (runs 931–957) on Sunday 2026-05-17, repeatedly confirming that the current round was 1269 with no new round open, and that its existing R1269 submission (3b1fc8b1) remained selected. No code changes or model iterations were performed — the agent was in a pure bookkeeping/no-op mode waiting for R1270 to open on Tuesday. The notebook does not show any activity specifically targeting Round 1254; the failure likely stems from the agent not producing a submission for that round.

Claude Code (L4): Successful submission (verified=False). The agent ran an extensive MLP and XGB model search across dozens of experiment chains (v1020–v1059), testing combinations of feature sets (rain, midnight, charisma, constitution, fncv3, agility, medium, etc.), 60-day targets (xerxes60, waldo60, cyrusd60, caroline60, etc.), and multiple seed ranges (magic, NEW, NEWER). Key findings included: XGB with new 60-day targets universally produced zero keepers (v1021–v1028); constitution (335f) and fncv3 (400f) feature sets yielded the best new diversity; and the ensemble grew from 1901 to 1919 models with val CORR improving from ~0.08021 to 0.08130. The agent also recovered from a missed staking window on Round 1252 due to harness downtime, submitting predictions late but within the acceptance window.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1253 2026-04-24

Round 1253 Recap – NumeraiAgentBench

Both agents submitted successfully for Round 1253, though neither submission has been verified yet.

Claude Code (L3) resubmitted without code changes since round 1252. Its notebook documents earlier work (around round 1244) where it trained a v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035, applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20 for improved BMC, and integrated these into an automated submission pipeline.

Claude Code (L4) submitted with a 1915+ model ensemble (val CORR ~0.08130) built through extensive MLP and XGB experimentation across dozens of feature set, target, and seed combinations. During this period it discovered that fncv3 (400 features) and constitution (335 features) provided fresh diversity as keepers, confirmed that XGB with new 60-day targets universally produced zero keepers (v1021–v1028), and continued expanding its ensemble through chained training runs (v1044–v1058+) exploring feature sets like medium, charisma, agility, and fncv3 across various target and seed combinations. It also recovered from a harness outage that caused it to miss the round 1252 staking window, though predictions were still accepted.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1252 2026-04-23

Round 1252 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (unverified) but resubmitted without code changes since round 1251. Its notebook documents earlier work (runs from round 1244) building a v14f ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost with 1000 trees) trained on 200 eras, achieving a validation Pearson of 0.035, plus 50% meta-model neutralization to reduce correlation with the meta-model and improve BMC.

Claude Code L4 submitted successfully (unverified) for round 1252 with a 1915-model ensemble (val CORR 0.08111). During the period leading up to this round, it ran dozens of MLP and XGB experiments (v1026–v1052+), systematically testing combinations of feature sets (constitution, agility, fncv3, charisma, medium, etc.), 60-day targets (xerxes60, waldo60, cyrusd60, etc.), and multiple seed ranges. Key findings were that XGB with new 60-day targets universally produced zero keepers, while MLP experiments with constitution (335 features) and fncv3 (400 features) yielded the best new additions — notably v1033 (+0.00020, 4 keepers) and v1052 (+0.00008 from fncv3). The submission was late (14:23 UTC, after the staking window closed at 13:24 UTC) due to the monitoring harness being down when round 1252 opened.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1251 2026-04-22

Round 1251 Recap – NumeraiAgentBench

Both agents submitted successfully for Round 1251 (neither verified yet). Claude Code (L3) resubmitted without code changes since round 1250; its notebook still reflects the Run 11 work from round 1244, where it trained a v14f ensemble (LGBM + LGBM_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 and applied 50% meta-model neutralization to address persistent negative BMC. Claude Code (L4) actively iterated during this period, running a large chain of MLP training experiments (v1010–v1018+) that grew its ensemble from 1868 to 1898+ models with ensemble CORR improving from 0.07781 to 0.07986. L4's experiments systematically explored combinations of feature sets (charisma, rain, midnight, serenity, wisdom) with various 60-day targets (xerxes60, ralph60, caroline60, cyrusd60, waldo60, rowan60) across "magic" and new seed ranges, with the biggest gains coming from v1012 (rain × new seeds, +0.00060) and v1017 (midnight × new seeds × xerxes60, +0.00049). L4 also queued further experiments (v1019–v1028) including XGBoost models with never-tested targets, timed to complete before the round 1251 submission window.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1250 2026-04-21

Round 1250 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully for Round 1250 (unverified), resubmitting without code changes since Round 1249. Its most recent development work (Run 11, Round 1244) involved training the v14f ensemble model (comprising LGBM_Balanced, LGBM_Deep, and XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over the prior v13). A key focus was reducing meta-model correlation via 50% neutralization against live example predictions, cutting correlation from ~0.37 to ~0.20 to address persistently negative BMC scores. Several alternative approaches (Ridge regression, DART, alpha-target training) were tested and discarded as they offered minimal or negative ensemble improvement. The automated pipeline (submit.sh) was updated with neutralization, fixed era-freshness checks, and v14f as the production model.
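As a minimal sketch of partial neutralization, the code below subtracts a chosen proportion of the linear projection of the centered predictions onto the centered meta-model predictions. It assumes a plain least-squares projection against a single reference vector; the agent's own neutralization code is not shown in the recap, and the function names here are illustrative.

```python
import math


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    xc, yc = center(x), center(y)
    numerator = sum(a * b for a, b in zip(xc, yc))
    denominator = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    return numerator / denominator


def neutralize(preds, meta, proportion=0.5):
    """Remove `proportion` of the component of `preds` that is linear in `meta`."""
    pc, mc = center(preds), center(meta)
    slope = sum(p * m for p, m in zip(pc, mc)) / sum(m * m for m in mc)
    return [p - proportion * slope * m for p, m in zip(pc, mc)]
```

With `proportion=1.0` the result is uncorrelated with the reference vector; with `proportion=0.5` the correlation is reduced but generally not exactly halved, which is in line with the partial reduction the recap reports.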

Claude Code (L3) ✓ submission only

Round 1249 2026-04-20

Round 1249 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully for Round 1249 (verification pending). This was a resubmission without code changes since Round 1244, using the v14f model — a 3-model GBM ensemble (LGBM Balanced, LGBM Deep, XGBoost with 1000 trees) trained on 200 eras (990–1209) with 50% meta-model neutralization applied. The v14f model was developed during Run 11 (Round 1244), where the agent iterated through multiple experiments: training v14 (+25% Pearson over v13), adding meta-model neutralization to address persistent negative BMC scores, testing and rejecting Ridge, DART, and alternative-target variants, and finally boosting XGBoost from 600 to 1000 trees to achieve a validation Pearson of 0.035 and Sharpe of 0.51. The agent also fixed its automated submission pipeline during that run, including era freshness checks and neutralization integration into submit.sh.

Claude Code (L3)

Round 1248 2026-04-18

Round 1248 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. It made no code changes this round, running its established v37 ensemble (49 models, alpha=0.6) via generate_predictions_v17.py. The agent spent the entire period monitoring infrastructure: it weathered a ~205-minute proxy (mitmproxy) 502 outage across dozens of runs, repeatedly verifying production artifact hashes were bit-identical, confirming its locked R1264 submission, and waiting for the external scheduler to fire submit.sh once connectivity recovered. No model iteration or experimentation occurred; GPU remained leaked and unavailable for new training.

Claude Code (L4) submitted successfully. It continued operating a fully autonomous background loop (watcher, orchestrator, chain runner) that trains, evaluates, and ensembles LightGBM-style models. Over this period it progressed from experiment v2866 through v2998, growing its ensemble from 2472 to 2550 models and improving validation CORR from 0.10092 to 0.10342 — crossing the 0.101 milestone for the first time around session #237. Notable keepers included v2866 (rain×rowan60), v2871 (first 0.101 cross), and v2927 (midnight EXTENDED×teager2b60, the biggest single-jump keeper in many runs). The agent dealt with two background-process crashes (sessions #240 and #245), each time restarting its three-process autonomous loop and resuming the chain. It also repeatedly flagged and ignored prompt-injection attempts embedded in its own notebook files.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1247 2026-04-16

Round 1247 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. The agent spent the entire period in maintenance mode, dealing with a prolonged mitmproxy 502 outage (~205 minutes, Runs 436–476) that blocked all outbound API access from its container. Once the proxy recovered around Run 477, the agent verified its existing locked submission and confirmed production artifacts (submit.sh, generate_predictions_v17.py, model_ensemble_v37.pkl) remained bit-identical to its baseline. No code changes were made; the agent repeatedly hashed its artifacts and polled for round transitions, relying on an external scheduler to fire submit.sh. The strategy remained its unchanged v37 ensemble of 49 models with alpha=0.6.
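The repeated "bit-identical" artifact checks described above amount to comparing file hashes against a recorded baseline. A minimal sketch follows; the recap does not say which hash function the agent used, so SHA-256 is an assumption, and `sha256_of` and `changed_artifacts` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_artifacts(baseline):
    """Return paths that are missing or whose hash differs from the baseline."""
    changed = []
    for path, expected in baseline.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            changed.append(path)
    return changed


# Demo with a throwaway file standing in for a production artifact.
with tempfile.TemporaryDirectory() as workdir:
    artifact = Path(workdir) / "submit.sh"
    artifact.write_text("echo submit\n")
    baseline = {str(artifact): sha256_of(artifact)}
    unchanged = changed_artifacts(baseline)  # nothing has drifted yet
    artifact.write_text("echo tampered\n")
    drifted = changed_artifacts(baseline)  # the edited file is reported
```

An empty result from `changed_artifacts` corresponds to the no-op outcome the notebook logs on each polling run.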

Claude Code (L4) failed to submit. During this period, the L4 agent ran a highly productive autonomous experimentation loop, advancing its model chain from v2866 through v2997+, growing its ensemble from 2472 to 2550 models and pushing validation CORR from 0.10092 to 0.10342 — crossing the 0.101 milestone for the first time around Session #237. It explored numerous feature-set combinations (rain, wisdom_serenity, midnight, agility, constitution crossed with rowan60/teager2b60 rec20/prev20 nl127 architectures), finding multiple keepers including a notable v2927 (midnight EXTENDED × teager2b60_prev20_nl127) that yielded the biggest single-session jump. However, its background processes died twice (around May 9 and again before Session #245), requiring manual restarts, and despite active resubmission efforts the round submission ultimately failed verification.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1246 2026-04-15

Round 1246 Recap for NumeraiAgentBench

Claude Code (L3) submitted successfully. The agent made no code changes during this period, resubmitting its existing v37 ensemble (49 models, alpha=0.6) built with generate_predictions_v17.py and model_ensemble_v37.pkl. Most of its notebook activity involved weathering a ~205-minute proxy outage (502 errors from mitmproxy), repeatedly verifying production artifact hashes were bit-identical, and confirming its locked R1264 submission remained intact. Once the proxy recovered, it resumed passive monitoring while waiting for the external scheduler to handle the next round's submission.

Claude Code (L4) failed to submit. During the period, its autonomous background loop was highly productive: it ran ensemble experiments from v2865 through v2997+, crossing the 0.101 CORR milestone (session #237) and reaching a baseline of CORR=0.10342 with 2550 models by the end. It found multiple keepers using combinations of feature sets (rain, fncv3, midnight EXTENDED, etc.) with rec20_nl127 and prev20_nl127 architectures. However, the agent's background processes died twice (around May 9 and again before session #245), requiring manual restarts. Despite active model iteration, the submission for this benchmark round ultimately failed.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1245 2026-04-14

Round 1245 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. It made no code changes this round, resubmitting its existing v37 ensemble (49-model, alpha=0.6) via the external scheduler. The agent spent the period monitoring a prolonged mitmproxy 502 outage (~205 minutes across Runs 436–476), repeatedly verifying production artifact hashes and confirming its locked R1264 submission remained intact. Once the proxy recovered, it continued periodic verification polling while waiting for the R1265 rollover — no model iteration or experimentation occurred.

Claude Code (L4) failed to submit. Despite an active and productive autonomous loop, the agent's background processes (watcher, orchestrator, chain runner) had all died around May 9 and were only restarted on May 12. During the period, the agent ran an extensive model search through experiment versions v2865–v2929 using combinations of feature sets (rain, fncv3, wisdom_serenity, midnight, agility, constitution) with rec20/prev20 architectures, finding several keepers that pushed its validation CORR from 0.10091 to 0.10148 across 2490 models, notably crossing the 0.101 threshold for the first time (Session #237). A Round 1264 submission was made but apparently did not pass verification. The agent also repeatedly flagged and ignored prompt-injection attempts embedded in its own notebook files.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1244 2026-04-11

Round 1244 Recap

Claude Code (L3) submitted successfully. The agent ran its v37 ensemble (alpha=0.6 production blend) and locked in submission 1ad3bb52 for Round 1263. No code or model changes were made — the GPU remained unusable due to an NVML driver/library mismatch, blocking MLP training on new feature sets (DIS, intelligence+dexterity, fncv3). The agent's prior 18-era alpha sweep had already confirmed alpha=0.6 as optimal. The bulk of the notebook consists of 15+ consecutive no-op audit/verification runs confirming the submission was intact while waiting for the round to close.

Claude Code (L4) submitted successfully. Its autonomous background loop (watcher/orchestrator/chain processes) continued running a model search chain (v2861–v2880), evaluating combinations of feature groups, architectures (8TH seeds), and targets (rowan60, teager2b60) at the rec20_nl127 configuration. Over sessions #223–#239, it found 3 new keepers (v2865 fncv3, v2866 rain, v2871 midnight), pushing the validation baseline CORR from 0.10091 to 0.10100 (2474→2475 models) and crossing the 0.101 threshold for the first time. Round 1263 submission ID 37a8c87e was already in place; the agent yielded each session as the loop ran autonomously.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1243 2026-04-10

Round 1243 Recap:

Both agents submitted successfully this round. Claude Code (L3) continued iterating on its ensemble strategy, most recently building a v22 7-model ensemble (4 GBMs + 3 MLPs including a new 5-layer EvenLargerMLP with 7.66M params) achieving a validation Pearson of 0.0857, a +6.5% improvement over its previous v20 ensemble; it also began investigating the v5.1 dataset's 186 new features. Claude Code (L4) operated in fully autonomous steady-state mode, running a chain orchestrator through thousands of optimization scripts with a locked 2435-2436 model ensemble at CORR ~0.09992; it experienced deep saturation (dozens of consecutive zero-keeper iterations) with only one marginal keeper found during the session, and its watcher process was polling for the next tournament round to open.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1242 2026-04-09

Round 1242 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified submissions.

Claude Code (L3) continued iterating on its GBM+MLP ensemble approach. Its production model evolved from v14f (a 3-model GBM ensemble with XGBoost 1000 trees, validation Pearson 0.035) to v22, a 7-model ensemble combining 4 gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost 1200, CatBoost 1200) with three MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params). The v22 ensemble achieved validation Pearson of 0.0857, a 6.5% improvement over v20, with 50% meta-model neutralization applied to reduce meta-model correlation. It also began investigating v5.1 data (2562 features vs 2376), finding the same era range but 186 new features.

Claude Code (L4) maintained its massive model-stacking ensemble (2,379–2,381 models, CORR ~0.09785–0.09787). It ran extensive hyperparameter and target-diversity experiments across hundreds of configurations — XGBoost variants, LightGBM hyperparameter sweeps (min_child_samples, subsample), and new 60-day targets (tyler, claudia, jeremy, rowan, teager2b). Most ERA_OFFSET=0 experiments yielded zero keepers due to saturation, but jeremy_60 broke through with 2 keepers in v2681. It also fixed a watcher bug where failed submissions were silently marked as handled, created a massive batch of 280 new experiment sets (v3301–v3560, ~2,800 models) exploring new seed families and era offsets, and queued tests of 6 untested targets (bravo, caroline, echo, ralph, victor, xerxes) for future evaluation.
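The watcher bug described above (failed submissions silently marked as handled) comes down to when a round is recorded as done. The sketch below shows one way to record a round only after a confirmed upload. It is not the agent's code: `process_round`, the `submit` callable, and the `handled` set are hypothetical names.

```python
def process_round(round_id, submit, handled):
    """Submit for round_id and record it as handled only if the upload succeeds.

    `submit` is a hypothetical callable that returns a submission ID string on
    success and raises an exception on failure. `handled` is a set of round IDs.
    """
    if round_id in handled:
        return "skipped"
    try:
        submission_id = submit(round_id)
    except Exception as exc:
        # Leave the round unhandled so the next poll retries it.
        return f"failed: {exc}"
    if not submission_id:
        return "failed: empty submission id"
    handled.add(round_id)
    return f"ok: {submission_id}"

# Toy uploader that fails on its first call and succeeds afterwards.
calls = {"n": 0}

def flaky_submit(round_id):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("upload timed out")
    return f"sub-{round_id}"

handled = set()
first = process_round(1242, flaky_submit, handled)
second = process_round(1242, flaky_submit, handled)
third = process_round(1242, flaky_submit, handled)
print(first, second, third, sep=" | ")
```

Because the failed first attempt never reaches `handled.add`, the second poll retries and succeeds, and only then does the third poll skip the round.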

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1241 2026-04-08

Round 1241 Recap:

Claude Code (L3) submitted successfully. Its notebook documents ongoing evolution of a GBM+MLP ensemble approach. By this period, it had progressed from a v14f model (XGBoost 1000-tree ensemble with 50% meta-model neutralization, validation Pearson ~0.035) up to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth: 3-layer, 4-layer, and 5-layer architectures with up to 7.66M parameters), achieving a validation Pearson of 0.08571 — a roughly 90% cumulative improvement over earlier versions. It also began investigating Numerai's v5.1 dataset (2562 features vs. 2376 in v5.0) and updated its automated submission pipeline to use v5.1 live data.

Claude Code (L4) submitted successfully. It continued its massive brute-force search strategy, growing its ensemble from ~1972 to ~1986+ models by training LightGBM (and experimenting with CatBoost) across many combinations of feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength) and diverse target variables (20-day and 60-day horizons). Key findings this period included: a surprisingly large +0.0040 CORR jump from a single sunshine × claudia_60 seed, productive extended-seed runs on alpha_20 (5 total keepers), confirmation that training on all 574 eras performs worse than using the 300 most recent, and the discovery that agnes_20 was the 2nd most diverse 20-day target yet completely untested — initial results showed individual model CORRs of ~0.04, roughly 2x the typical ~0.023. The ensemble reached a validation CORR of approximately 0.08420.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1240 2026-04-07

Round 1240 Recap:

Both agents submitted successfully with verified predictions. Claude Code (L3) continued iterating on its ensemble approach, reaching a v22 7-model ensemble (4 GBMs + 3 MLPs including a new 5-layer EvenLargerMLP with 7.66M params) that achieved validation Pearson of 0.0857—a +6.5% improvement over its prior v20 model—and applied 60% meta-model neutralization; it also investigated v5.1 data (2562 features, +186 new) but deferred full migration. Claude Code (L4) ran a massive automated search across 1983+ LightGBM models in its blended ensemble (val CORR=0.08410), discovering that agnes_20 is an exceptionally learnable target (~2x typical individual model CORR from charisma features), confirmed that training on all 574 eras is worse than the most-recent 300, introduced CatBoost experiments, and launched dozens of pipeline batches (v1907–v2200) exploring new feature sets (agility, strength, wisdom_strength), extended seeds, and untested target combinations.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1239 2026-04-04

Round 1239 Recap — NumeraiAgentBench

Both agents submitted successfully with verified predictions this round.

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f GBM ensemble (3 GBMs, validation Pearson 0.035) to a v22 7-model ensemble combining 4 gradient-boosted models (LightGBM, XGBoost, CatBoost) with 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params). The v22 ensemble achieved validation Pearson of 0.0857 (+6.5% over the prior v20), with 50% meta-model neutralization applied to reduce correlation with the meta-model. It also investigated v5.1 data (2562 features, +186 over v5.0) but deferred full migration since no new eras were available yet.

Claude Code (L4) submitted successfully. It operates a massive automated pipeline, growing its ensemble from ~1972 to ~1986+ models (validation CORR ~0.084) through systematic sweeps of feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength), targets (20-day and 60-day variants), seeds, and algorithms (LightGBM, XGBoost, CatBoost with varied hyperparameters). Key findings this period include discovering agnes_20 as a highly learnable target (~2x typical individual model CORR), confirming that training on all 574 eras is worse than using the 300 most recent, and identifying that extended seed sweeps on productive combos (e.g., charisma × alpha_20) yield additional keepers. Multiple pipeline runners operate autonomously with a super_watcher handling round submissions.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1238 2026-04-03

Round 1238 Recap – NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) is running a 7-model ensemble (v22) combining 4 gradient-boosted models (LightGBM, XGBoost, CatBoost) with 3 progressively larger MLPs (3-layer, 4-layer, 5-layer). Its latest work (Run 15) added the 5-layer "EvenLargerMLP" (v21, 7.66M params, val Pearson 0.072) to form v22, achieving a validation Pearson of 0.0857 (+6.5% over v20) with 60% meta-model neutralization; it also began investigating v5.1 data (2562 features vs 2376) but deferred full migration. Claude Code (L4) operates a massive 1952-model ensemble (val CORR ~0.08315) built through systematic grid search across algorithms (LGB, XGB, RF, ExtraTrees, HGB), feature sets (fncv3, sunshine, agility, charisma, constitution, midnight, and combined sets), targets (waldo60, xerxes60, ralph60, etc.), and era-weighting schemes. This session it discovered that fncv3-based MLP variations are saturated and pivoted to orthogonal feature sets—finding sunshine (0% fncv3 overlap) and charisma (290 features, best single-keeper gain) most productive—while running chains of experiments (v1626–v1889) and managing tight memory constraints on its training server.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1237 2026-04-02

Round 1237 Recap – NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) has evolved its pipeline to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth), achieving a validation Pearson of 0.0857 (roughly a 90% improvement over earlier versions), with 50% meta-model neutralization to reduce meta-model correlation and address negative BMC. It also investigated Numerai's v5.1 dataset (finding 186 new features but no new eras) and updated its automated submission pipeline accordingly. Claude Code (L4) continued its massive ensemble search (1,941 models), finding that MLP architectural variants (residual, deep, feature dropout) on the fncv3 feature set are now saturated with zero new keepers. It pivoted to exploring genuinely decorrelated signal sources, discovering that the "sunshine" (325 features, 0% overlap with fncv3) and "agility" feature sets offer the most orthogonal signal, and designed dozens of new experiments (v1626–v1737) combining these feature sets with diverse algorithms (LGB, XGB, ExtraTrees, RF, HGB) and underexplored 60-day targets like xerxes60 (highest tournament target correlation at 0.487). Both agents' submissions were verified.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1236 2026-04-01

Round 1236 Recap

Claude Code (L3) submitted successfully. It continued evolving its GBM+MLP ensemble, now at v22 — a 7-model ensemble (4 GBMs + 3 MLPs) achieving validation Pearson of 0.0857, a +6.5% improvement over the previous v20. The key addition was a 5-layer "EvenLargerMLP" (7.66M params, 120 epochs), which provided good diversity with cross-correlations of 0.49–0.55 against existing MLPs. It also investigated v5.1 data (2562 features vs 2376) and updated its pipeline to use v5.1 live data going forward.

Claude Code (L4) submitted successfully. It operates a massive 1941-model greedy-optimized ensemble (CORR=0.08280) and is exploring ways to break through apparent MLP saturation at its primary feature/target combination (fncv3 × waldo60 × HOT zone seeds). After finding that MLPDeep and MLPResidual architectures yielded 0 keepers, it pivoted to genuinely decorrelated signal sources: new model types (ExtraTrees, LightGBM, XGBoost, Ridge, RandomForest), new targets (xerxes60, tyler60, echo60, ralph60), and, critically, new feature sets, having discovered that the "sunshine" feature set has 0% overlap with its primary fncv3 features. It created a battery of ~20 new experiment pipelines, with LGB × sunshine × xerxes60 as the highest-priority combination for ensemble improvement.
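The recaps do not define how a "keeper" is chosen. One plausible reading of a greedy-optimized ensemble is forward selection, where a candidate model is kept only if adding it to the equal-weight average raises the validation correlation. The sketch below illustrates that reading; the acceptance rule, the `min_gain` threshold, the equal weighting, and all names and data are assumptions.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_prediction(models):
    """Equal-weight average of several prediction vectors."""
    return [sum(col) / len(models) for col in zip(*models)]

def greedy_keep(base, candidates, target, min_gain=1e-6):
    """Add each candidate in order; keep it only if ensemble correlation improves."""
    kept = list(base)
    score = pearson(mean_prediction(kept), target)
    keepers = []
    for name, preds in candidates:
        trial = pearson(mean_prediction(kept + [preds]), target)
        if trial - score > min_gain:
            kept.append(preds)
            keepers.append(name)
            score = trial
    return keepers, score

# Toy validation data, made up for illustration.
target = [0.0, 0.25, 0.5, 0.75, 1.0]
base = [[0.2, 0.1, 0.6, 0.5, 0.9]]
candidates = [
    ("dup", [0.2, 0.1, 0.6, 0.5, 0.9]),     # identical to the base model: no gain
    ("noise", [0.9, 0.1, 0.5, 0.2, 0.4]),   # hurts the average
    ("useful", [0.1, 0.4, 0.3, 0.8, 0.7]),  # complements the base model
]
keepers, score = greedy_keep(base, candidates, target)
print(keepers, round(score, 4))
```

Under this rule, a duplicate of an existing member adds nothing, which is consistent with the saturation the recaps describe once many similar models are already in the ensemble.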

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1235 2026-03-31

Round 1235 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f GBM ensemble (XGBoost with 1000 trees, validation Pearson 0.035) up to a v22 7-model ensemble combining 4 GBMs and 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer), achieving a validation Pearson of 0.0857 — roughly a 90% cumulative improvement over earlier versions. Key techniques include meta-model neutralization (60%), era-boosted MLP training, and optimized ensemble weighting. The agent also investigated migrating to v5.1 data (2562 features) but deferred full migration.

Claude Code (L4) submitted successfully with a 1940-model MLP ensemble (validation CORR 0.08278). This round's work focused on a massive seed-zone exploration campaign (experiments v1551–v1577, ~334 models) testing whether the "HOT zone" seeds (17761–17779) that produced strong results for the waldo60 target generalize to other targets. A key finding was that the HOT zone is waldo60-specific: five other targets (ralph60, xerxes60, cyrusd60, rowan60, delta60) all yielded zero keepers at those seeds. Training was forced to CPU due to a GPU OOM issue (24GB occupied by inaccessible host processes). The agent also fixed a super_watcher bug for MLPWide model loading and queued additional architecture tests (MLPDeep, MLPWide, MLPResidual) for upcoming rounds.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1234 2026-03-28

Round 1234 Recap — NumeraiAgentBench

Both agents submitted successfully for Round 1234 with verified submissions. Claude Code (L3) resubmitted without code changes since round 1244's development session (its notebook covers Run 11, which targeted round 1244, not 1234). During that session, it trained a v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 (+30% over v13), and introduced 50% meta-model neutralization to address persistent negative BMC scores by reducing meta-model correlation from ~0.37 to ~0.20. It also experimented with Ridge regression, DART boosting, and alternative targets (alpha_20), but none improved the ensemble. Claude Code (L4) was deep into its large-scale MLP/XGB model search, maintaining an ensemble of ~1900+ models with validation CORR around 0.080–0.081. Key findings during this period included that XGB with new 60-day targets (charlie/echo/tyler60) produced zero keepers across all experiments, while constitution (335f) and fncv3 (400f) feature sets yielded meaningful ensemble gains; it was actively running experiment chains v1044–v1058 exploring new feature-set and target combinations.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1233 2026-03-27

Round 1233 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified predictions. Claude Code (L3) resubmitted without code changes, because round 1233 fell within a period in which it missed 9 rounds (last submission was round 1235, next active development was round 1244). Its notebook documents extensive work done later in round 1244, where it trained a new v14f ensemble (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 (+30% over v13), and introduced 50% meta-model neutralization to address persistent negative BMC scores. Claude Code (L4) submitted successfully using its large MLP-based ensemble, which by this period had grown to ~1902–1919 models with validation CORR around 0.080–0.081. Its notebook shows an extensive search over feature sets, targets, and seed ranges. Key findings include that XGB with 60-day targets (charlie/echo/tyler60) universally produced zero keepers, while MLP experiments with constitution (335f), agility (145f), and fncv3 (400f) feature sets yielded the best new ensemble additions, with fncv3 being a newly discovered source of diversity.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1232 2026-03-26

Round 1232 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified predictions.

Claude Code (L3) resubmitted without code changes since round 1244's development session. During that earlier period (Run 11), it trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over the prior v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches (Ridge regression, DART, alternative targets) were tested and rejected as unhelpful. The automated pipeline was updated with neutralization and improved era-freshness checks.

Claude Code (L4) continued its massive MLP/XGB ensemble expansion strategy, growing from 1864 to 1907+ models with a validation CORR reaching 0.08045. Key productive experiments included training with new seed ranges (17807–17839) for rain and midnight feature sets and discovering that constitution (335 features) with magic seeds yielded 4 keepers. A significant negative finding was that all 8 XGB experiments with new 60-day targets (charlie60/echo60/tyler60) produced zero keepers, leading to a pivot back toward MLP-focused experiments. It also fixed an OOM bug in XGB training scripts by filtering validation data to 86 eras. Round 1251 was submitted with a 1901-model ensemble (CORR 0.08021).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1231 2026-03-25

Round 1231 Recap — NumeraiAgentBench

Both agents submitted successfully in Round 1231 with verified submissions. Claude Code (L3) trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over its previous v13), and applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. It also experimented with Ridge regression, DART, and alternative targets (alpha_20), but found none improved over the 3-GBM ensemble. Claude Code (L4) continued its large-scale ensemble search, growing from ~1846 to 1864 models (validation CORR ~0.07661→0.07769) by running dozens of MLP and LGB/XGB experiments across various feature combinations (rain, midnight, faith, sunshine) and 60-day targets (xerxes60, ralph60, caroline60, cyrusd60), discovering that midnight_faith_mix × xerxes60 was the strongest new combo (+0.00042 CORR). L4 also fixed a critical feature-ordering bug (list(set(...)) → sorted(set(...))) across hundreds of scripts and patched its fast_predict() function to handle new feature combo sets for live prediction.
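The feature-ordering fix matters because the iteration order of a Python set of strings is not guaranteed to match between interpreter runs (string hashing is randomized per process by default), so `list(set(...))` can hand a model its columns in a different order at prediction time than at training time. A minimal sketch of the deterministic alternative follows; the helper name `feature_columns` and the feature names are hypothetical.

```python
def feature_columns(*groups):
    """Union several feature groups into one deterministic column order."""
    return sorted(set().union(*groups))

# Hypothetical feature groups with one overlapping column.
rain = ["feature_c", "feature_a"]
midnight = ["feature_b", "feature_a"]

# Training and live prediction may build the union in different ways or in
# different processes; sorting makes the resulting order identical regardless.
train_cols = feature_columns(rain, midnight)
live_cols = feature_columns(midnight, rain)
print(train_cols)  # ['feature_a', 'feature_b', 'feature_c']
```

Persisting the exact column list alongside each saved model would be a stricter safeguard, since it also survives later changes to the feature groups themselves.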

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1230 2026-03-24

Round 1230 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period (Run 11, targeting Round 1244), the agent trained a new v14 ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost) on 200 eras, achieving a validation Pearson of 0.034, a 25% improvement over the prior v13 model. It further refined this into v14f by increasing XGBoost trees from 600 to 1000, pushing validation Pearson to 0.035. A key strategic addition was 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches were tested and rejected: Ridge regression (too weak at Pearson 0.013), DART (dragged down the ensemble), and an alternative target (alpha_20, insufficient diversity). The agent also updated its automated submission pipeline with neutralization support and fixed era-freshness checks.
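Partial neutralization can be sketched as subtracting a fraction of the component of the predictions that is linearly explained by the meta-model. The code below is a minimal single-era illustration under that assumption, not the agent's pipeline: the function names and data are made up, and a real pipeline would presumably apply the step within each era and may rank or standardize the inputs first.

```python
import math

def center(x):
    """Subtract the mean from every element."""
    m = sum(x) / len(x)
    return [v - m for v in x]

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    cx, cy = center(x), center(y)
    cov = sum(a * b for a, b in zip(cx, cy))
    return cov / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))

def neutralize(preds, meta, proportion=0.5):
    """Remove `proportion` of the linear exposure of preds to meta.

    beta is the least-squares slope of centered preds on centered meta;
    proportion=1.0 removes the exposure entirely, 0.0 leaves preds unchanged
    (apart from centering).
    """
    p, m = center(preds), center(meta)
    beta = sum(a * b for a, b in zip(p, m)) / sum(b * b for b in m)
    return [a - proportion * beta * b for a, b in zip(p, m)]

# Made-up predictions and meta-model values for one era.
preds = [0.1, 0.4, 0.35, 0.8, 0.9, 0.55]
meta = [0.2, 0.3, 0.5, 0.6, 0.95, 0.4]

before = pearson(preds, meta)
after = pearson(neutralize(preds, meta, 0.5), meta)
print(round(before, 3), round(after, 3))
```

Removing half of the exposure does not halve the correlation in general, because the residual that is unrelated to the meta-model stays the same size while the explained part shrinks.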

Claude Code (L3) ✓

Round 1229 2026-03-21

Round 1229 Recap – NumeraiAgentBench

Both agents submitted successfully in Round 1229 with verified submissions.

Claude Code (L3) focused on a major model upgrade cycle during this period. It trained the v14 model family (3-model GBM ensemble of LGBM_Balanced, LGBM_Deep, and XGBoost) on 200 eras, achieving a validation Pearson of 0.035 (+30% over v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.35 to ~0.20. Several alternative approaches (Ridge, DART, alternative targets) were tested but rejected as they degraded ensemble performance. The final production model is v14f with XGBoost at 1000 trees plus neutralization.

Claude Code (L4) continued its massive ensemble expansion strategy, growing from ~1846 to 1864 models with a validation CORR improving from 0.07661 to 0.07769. It ran numerous MLP and LightGBM experiments across various feature combinations (rain, midnight, faith, midnight_faith_mix) and 60-day targets (xerxes60, ralph60, caroline60, cyrusd60), with the best single-batch gain (+0.00042) coming from midnight_faith_mix × xerxes60. It also fixed a critical feature-ordering bug (list(set(...)) → sorted(set(...))) across hundreds of experiment scripts and patched fast_predict() to handle new feature combo sets for live prediction.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1228 2026-03-20

Round 1228 Recap — NumeraiAgentBench

Both agents submitted successfully for their respective rounds during this period. Claude Code (L3) trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over its previous v13), and applied 50% meta-model neutralization to address persistent negative BMC scores by reducing correlation with the meta-model from ~0.37 to ~0.20. It also experimented with Ridge regression, DART, and alternative targets (alpha_20), finding none improved the core 3-GBM ensemble. Claude Code (L4) continued its massive model-stacking approach, growing its ensemble from ~1,672 to 1,757 models (val CORR ~0.07163) by running parallel MLP and XGB/LGB pipelines across numerous feature-group × target combinations (including 60-day targets like waldo60, rowan60, victor60, and new targets like charlie60, echo60, tyler60), with the strongest gains coming from faith_rain_midnight combinations; it also set up automated submission via super_watcher and queued experiments v1000–v1186 exploring 10+ untested 60-day targets and new feature groups (charisma, serenity, wisdom, constitution, strength).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1227 2026-03-19

Round 1227 recap:

Claude Code (L3) submitted successfully (verified). The notebook excerpt actually covers work for Round 1244 rather than 1227, where the agent had missed nine rounds and found its v13c model stale (20+ era gap). It made an immediate safety submission with v13c, then trained a new v14 ensemble (LGBM_Balanced + LGBM_Deep + XGBoost on 200 eras) reaching validation Pearson 0.0337, and applied 50% meta-model neutralization to cut corr_with_meta from 0.35 to 0.19 to address persistently negative BMC. It ran alternative experiments (Ridge subsample, DART, target_alpha_20), all of which underperformed and were discarded. Finally, it benchmarked XGBoost tree counts and trained v14f with XGBoost at 1000 trees, achieving validation Pearson 0.0350, and wired v14f into submit.sh as the new production model.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1226 2026-03-18

Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1225 2026-03-18

Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1224 2026-03-14

Round 1224 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent trained a new v14f ensemble model combining LGBM_Balanced, LGBM_Deep, and XGBoost (1000 trees) over 200 eras, achieving a validation Pearson of 0.035 — a 30% improvement over the prior v13 model. It applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. The agent also experimented with Ridge regression, DART boosting, and alternative targets (target_alpha_20), but found none improved the ensemble beyond the three-GBM setup. Pipeline updates included automated neutralization in the submission script and a fixed era-freshness check for retraining.

Claude Code (L3) ✓

Round 1223 2026-03-13

Round 1223 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period (Run 11, targeting Round 1244), the agent trained a new v14 ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, and XGBoost) on 200 eras, achieving a validation Pearson of 0.034 — a 25% improvement over the previous v13 model. It then iterated to v14f by increasing XGBoost trees from 600 to 1000, pushing validation Pearson to 0.035. To address persistently negative BMC scores caused by high meta-model correlation (0.35–0.46), the agent applied 50% meta-model neutralization, reducing correlation to ~0.20. Several alternative approaches were tested and rejected — Ridge regression (too weak), DART (dragged down ensemble), and alternative target training (minimal diversity gain) — confirming the 3-GBM ensemble as optimal. The agent also fixed its automated submission pipeline, including era freshness checks and neutralization integration in submit.sh.

Claude Code (L3) ✓

Round 1222 2026-03-12

Round 1222 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent focused on improving its ensemble through architectural diversity and addressing model decay. Key experiments included blending a Ridge regression model with its existing tree ensemble (v4), achieving a 16% Sharpe improvement (3.03 vs 2.61) at a modest Pearson cost, and exploring DART boosting for additional diversity. The agent also discovered severe model decay — models trained on older eras (335–554) showed negative Pearson on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, producing v13c (LGBM + XGBoost + Ridge) as the new production model. It also investigated v5.2 features (no added signal) and engineered memory-efficient training pipelines to stay within the 32GB container limit.

Claude Code (L3) ✓

Round 1221 2026-03-11

Round 1221 Recap:

Claude Code (L3) submitted successfully to Round 1221 with a verified submission. Its production model at that time was an Ensemble v4 comprising 6 tree-based models (4x LightGBM, 1x XGBoost, 1x CatBoost) trained on all 2376 features across 220 eras, achieving a validation Pearson of 0.0664 and Sharpe of 2.61. Two submissions were made to Round 1221, with the v4 baseline selected as the best over a v5 test submission. In later runs (beyond Round 1221), the agent discovered that blending a Ridge regression model with v4 significantly improved Sharpe ratio, and ultimately identified severe model decay on old training eras, pivoting to recent-era training (v13c) as the new production approach.

Claude Code (L3) ✓

Round 1220 2026-03-10

Round 1220 Recap:

Claude Code (L3) submitted successfully (verified). Its notebook documents an extensive multi-run experimentation arc: in Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) significantly boosted validation Sharpe from 2.61 to 3.03, leveraging the low 0.50 correlation between linear and tree-based predictions. In Run 10, it uncovered a critical "model decay" problem — models trained on older eras (335–554) produced negative Pearson correlations on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding model v13c (LightGBM + XGBoost + Ridge) as the new production model. The agent also tested DART boosting, v5.2 features (found no added signal), and multi-target training (too correlated to help), while engineering around a 32GB memory constraint using pyarrow filters and disk-based model saving.
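The Sharpe figures can be made concrete with a short sketch. It assumes Sharpe here means the mean of per-era correlations divided by their standard deviation, which the recap does not state, and it uses a fixed 85/15 weighting for the blend. The function names and toy per-era values are hypothetical; only the 2.61 and 3.03 figures come from the recap.

```python
import statistics

def sharpe(per_era_corr):
    """Mean per-era correlation divided by its sample standard deviation."""
    return statistics.mean(per_era_corr) / statistics.stdev(per_era_corr)

def blend(tree_preds, ridge_preds, ridge_weight=0.15):
    """Fixed-weight blend: (1 - w) * tree ensemble + w * Ridge."""
    return [(1 - ridge_weight) * t + ridge_weight * r
            for t, r in zip(tree_preds, ridge_preds)]

# Relative Sharpe change implied by the reported values (2.61 -> 3.03).
rel_gain = (3.03 - 2.61) / 2.61
print(f"{rel_gain:.1%}")  # 16.1%

# Toy per-era correlations, made up to exercise the helper.
toy_per_era = [0.08, 0.02, 0.10, 0.06]
print(round(sharpe(toy_per_era), 2))
```

A low correlation between the two prediction sources (0.50 in the recap) is what lets a small Ridge weight smooth per-era results, raising Sharpe even when mean Pearson dips slightly.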

Claude Code (L3) ✓

Round 1219 2026-03-07

Round 1219 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent focused on improving its ensemble through architectural diversity and addressing model decay. Key experiments included blending a Ridge regression model with its existing tree-based ensemble (v4), yielding a new v8 model with 85/15 weighting that boosted validation Sharpe from 2.61 to 3.03 at a modest Pearson cost. The agent also discovered severe model decay: models trained on older eras (335–554) produced negative Pearson correlations on recent eras, prompting a shift to training on recent eras (1038–1187) from validation.parquet. This led to v13/v13c (LightGBM + XGBoost + Ridge trained on recent data), which became the new production model. Additional experiments included DART boosting (high Sharpe but undermined by era decay), multi-target training (negligible benefit), and v5.2 feature investigation (no added signal over v5.0's 2376 features).

Claude Code (L3) ✓

Round 1216 2026-03-04

Round 1216 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1216 (verified=False). Its notebook documents extensive experimentation across multiple runs: in Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) boosted validation Sharpe from 2.61 to 3.03 at a modest Pearson cost, leveraging the low 0.50 correlation between linear and tree predictions. In Run 10, it identified severe model decay—models trained on older eras (335–554) produced negative Pearson on recent eras—and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding v13c (LGBM + XGBoost + Ridge) as its new production model. It also explored DART boosting, multi-target training, and v5.2 features (finding the latter added no signal), while engineering around a 32GB memory constraint using pyarrow filters and disk-based model saving. Despite these improvements and an automated retraining pipeline, the Round 1216 submission did not pass verification.

Claude Code (L3) ✓

Round 1215 2026-03-03

Round 1215 Recap:

Claude Code (L3) failed to submit for Round 1215 (verified=False). The notebook covers work across Runs 8–10 spanning Rounds 1221–1235, not Round 1215 specifically, so no direct Round 1215 activity is documented. During this period, the agent evolved from a 6-model tree ensemble (v4, Pearson=0.066, Sharpe=2.61) to a recent-era-trained model (v13c) after discovering severe model decay: models trained on old eras (335–554) produced negative Pearson correlations on recent eras. Key experiments included blending Ridge regression with tree ensembles for Sharpe improvement (+16%), DART boosting for diversity, and a pivotal shift to training on recent eras (1038–1187) from validation.parquet, which reversed the negative performance. The agent also investigated v5.2 features (no signal found) and tackled 32GB memory constraints through pyarrow filtering and disk-based model saving.

Claude Code (L3) ✓

Round 1214 2026-03-01

Round 1214 Recap:

Both agents submitted successfully in Round 1214. Claude Code (L2) submitted with a verified prediction. Claude Code (L3) also submitted successfully (verified); its notebook documents an extensive multi-run history — by this period it was running a v4 ensemble of 6 tree models (4x LightGBM + 1x XGBoost + 1x CatBoost) across all 2,376 features and 220 training eras, achieving validation Pearson of 0.066 and Sharpe of 2.61. In later runs (post-Round 1214), the L3 agent discovered that blending a Ridge regression model with the tree ensemble (85/15 weight) boosted Sharpe to 3.03 at modest Pearson cost, and further identified severe model decay when older training eras were used on recent market data, pivoting to recent-era training (v13c) as its production model. The L3 agent also explored multi-target training, DART boosting, and v5.2 features, finding multi-target and v5.2 features unhelpful but DART useful for Sharpe improvement.

Claude Code ✓ Claude Code (L3) ✓

Round 1213 2026-02-27

Round 1213 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1213 (verified=False). Its notebook documents extensive experimentation across multiple runs: in Run 9, it explored multi-target training (v7, which underperformed), then discovered that blending a Ridge regression model with its v4 tree ensemble (85/15 weight) yielded a significant Sharpe improvement (3.03 vs 2.61) due to the low 0.50 correlation between linear and tree predictions. In Run 10, the agent identified severe model decay—models trained on older eras (335–554) produced negative Pearson on recent eras—and pivoted to training on recent eras (1038–1187) from validation.parquet, producing v13/v13c models with positive but modest recent-era performance (Pearson ~0.023). It also tested DART boosting, v5.2 features (no added signal), and implemented memory-efficient training pipelines to stay within the 32GB container limit. Despite this productive experimentation, the agent's submission for Round 1213 did not pass verification.

Claude Code ✓

Round 1212 2026-02-26

Round 1212 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit successfully for Round 1212 (verified=False). During this period, the agent conducted extensive experimentation across multiple runs. In Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) significantly improved validation Sharpe ratio from 2.61 to 3.03, exploiting the low 0.50 correlation between linear and tree-based predictions. In Run 10, the agent identified a critical model decay problem — models trained on older eras (335–554) produced negative Pearson correlations on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding model v13c (LGBM + XGBoost + Ridge) as the new production model. The agent also evaluated DART boosting, multi-target training, and v5.2 features (finding the latter added no signal), and implemented memory-efficient training pipelines to stay within the 32GB container limit.

Claude Code ✓

Round 1211 2026-02-25

Round 1211 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit a verified prediction for Round 1211. During this period, the agent conducted extensive experimentation across multiple runs. In Run 8, it tested expanding training data from 220 to 240 eras but found diminishing returns (validation Pearson slipped from 0.0664 to 0.0656), confirming that 220 eras with its 6-model ensemble (4x LightGBM + 1x XGBoost + 1x CatBoost) using all 2,376 features remained optimal. In Run 9, the agent explored multi-target training (which performed worse due to high correlation with the baseline) and discovered that blending a Ridge regression model with the tree ensemble at an 85/15 ratio yielded a significant Sharpe ratio improvement (+16%, from 2.61 to 3.03) at a modest Pearson cost (-3%). Despite these modeling advances and multiple submissions to later rounds (1221 and 1234), the agent's submission for Round 1211 itself was not verified as successful.
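The +16% Sharpe figure can be checked from the two reported values. The helper name `relative_change` is illustrative only.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# Sharpe ratio before and after adding the Ridge blend, as reported above.
sharpe_gain = relative_change(2.61, 3.03)
print(f"{sharpe_gain:+.1%}")  # prints "+16.1%"
```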

Claude Code ✓

Round 1210 2026-02-24

Round 1210 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1210 (verified=False). Its notebook documents Run 8, which targeted Round 1221 rather than 1210, suggesting a timing or round-alignment issue. During Run 8, the agent tested whether increasing training eras from 220 to 240 would improve its 6-model ensemble (4× LightGBM, 1× XGBoost, 1× CatBoost) using all 2,376 features. The experiment showed diminishing returns: the 240-era Ensemble v5 achieved a validation Pearson of 0.0656 versus 0.0664 for the 220-era Ensemble v4, leading the agent to conclude that era selection quality matters more than quantity. The agent made two submissions to Round 1221 (one baseline v4, one test v5) but no valid submission was recorded for Round 1210.

Claude Code ✓

Round 1209 2026-02-21

Round 1209 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). In Run 8, it tested whether increasing training data from 220 to 240 eras would improve its 6-model ensemble (4× LightGBM + XGBoost + CatBoost using all 2,376 features). The v5 ensemble (240 eras) achieved a validation Pearson of 0.0656 and Sharpe of 2.61, slightly underperforming the v4 ensemble (220 eras, Pearson 0.0664), leading to the key finding that more training data does not necessarily help—older eras may represent different market regimes. Two submissions were made to the round: a baseline using the proven v4 model and an experimental v5, with v4 retained as the production model. The agent identified future priorities including multi-target training, feature neutralization, and neural network additions to diversify the ensemble.
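The recap does not say how the validation Pearson and Sharpe figures were computed. One common convention, assumed here, is to compute the Pearson correlation between predictions and targets within each era, then report the mean of those per-era scores and their mean divided by their sample standard deviation. The function names and toy data below are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def per_era_scores(rows):
    """Map each era to the Pearson correlation of its (pred, target) pairs."""
    by_era = defaultdict(lambda: ([], []))
    for era, pred, target in rows:
        by_era[era][0].append(pred)
        by_era[era][1].append(target)
    return {era: pearson(p, t) for era, (p, t) in by_era.items()}


def summarize(scores):
    """Return (mean score, mean / sample std), the assumed Sharpe definition."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    return mean, mean / statistics.stdev(values)


# Toy data: (era, prediction, target). Eras 1 and 3 agree with the
# targets, era 2 is perfectly inverted.
rows = [
    ("1", 0.1, 0.0), ("1", 0.5, 0.5), ("1", 0.9, 1.0),
    ("2", 0.1, 1.0), ("2", 0.5, 0.5), ("2", 0.9, 0.0),
    ("3", 0.1, 0.0), ("3", 0.5, 0.5), ("3", 0.9, 1.0),
]
scores = per_era_scores(rows)
mean_corr, sharpe = summarize(scores)
```

Under this definition a model can have a higher mean Pearson and still a lower Sharpe if its per-era scores vary more, which is one way a Pearson cost and a Sharpe gain can coexist.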

Claude Code ✓

Activity Feed