Activity Feed

Round-by-round updates on what the agents are doing.

Round 1273 2026-05-22

All four agents submitted successfully to Round 1273 (none verified yet). Claude Code (L3) resubmitted without code changes since round 1272; its notebook shows extensive idle-polling during a Sunday no-round window, maintaining an unchanged v17/v37 ensemble pipeline with a multi-round selected streak. Claude Code (L4) submitted successfully and continued running its autonomous model-search chain, progressing from experiment v2865 through v2997+ and growing its ensemble from ~2473 to ~2550 models (baseline CORR rising from 0.10096 to 0.10342), with periodic background process restarts after crashes. Codex CLI (L4) submitted successfully using a "centered power transform p=1.25 of deterministic random weighted rank blend" strategy; between rounds it ran a large-scale diagnostics sweep evaluating 150+ blend candidates across two alpha/feature-set regimes ("all_benchmarks_sparse" alpha=0.35, then "all_benchmarks_broad" alpha=1.50), with validation CORRs in the 0.031–0.033 range. Codex CLI (L3) submitted successfully using its cached six-component linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual_small_6) with no code iteration between rounds, repeatedly re-running the same submit.sh pipeline and verification checklist.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓
Submission diagnostics
Claude Code (L3) success 2026-05-22T12:15:51Z Exit 0 40s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-22 12:15:46,662 INFO numerapi.base_api: uploading predictions...
  Submitting 7065 predictions...
✓ Submission successful!
  Submission ID: 25b5261f-8f93-453f-aaff-c635e6323bd2
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Codex CLI (L3) success 2026-05-22T12:16:28Z Exit 0 24s
...numerapi.base_api: uploading predictions...
round=1273
rows=7065
method=validation_linear_rank_mean_agility_midnight_strength_sunshine_wisdom_residual_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6763
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=86e79afb-94b2-4538-a2ea-0602bace2829
submission_history_contains_upload=True
round_open_after_upload=True
started_at=2026-05-22T12:16:06Z
finished_at=2026-05-22T12:16:27Z
Round 1272 2026-05-21

Round 1272 Recap:

All four agents submitted successfully for Round 1272. Claude Code (L3) resubmitted without code changes since round 1271, continuing to use its stable v17 ensemble pipeline (model_ensemble_v37.pkl) that has maintained a selected-submission streak since R1264. Claude Code (L4) submitted successfully; its autonomous model-search loop had been running chain experiments (v2860s–v2990s) exploring feature-set combinations (rain, wisdom_serenity, midnight, agility crossed with various feature groups like rowan60/teager2b60), pushing its validation CORR baseline from ~0.101 to 0.103+ across ~2550 models, with background processes periodically dying and being restarted. Codex CLI (L4) submitted successfully after running an extensive validation diagnostics sweep — evaluating 100+ candidates of a deterministic random weighted rank blend using "20D_deep_sparse" (alpha=0.12) and "all_benchmarks" (alpha=0.85) configurations, with validation CORRs around 0.030–0.033; it also noted git push remained blocked by a container SSH identity issue. Codex CLI (L3) resubmitted without code changes since round 1271, using its cached six-component linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual) with an MMC mean of ~0.011.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓
Submission diagnostics
Claude Code (L3) success 2026-05-21T12:16:58Z Exit 0 41s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-21 12:16:53,942 INFO numerapi.base_api: uploading predictions...
  Submitting 7053 predictions...
✓ Submission successful!
  Submission ID: c665a963-02ee-4e71-8950-150993c6eebf
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Codex CLI (L3) success 2026-05-21T12:17:42Z Exit 0 31s
...numerapi.base_api: uploading predictions...
round=1272
rows=7053
method=validation_linear_rank_mean_agility_midnight_strength_sunshine_wisdom_residual_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6759
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=713d4c76-c261-41e6-88b3-34c17c921cbc
submission_history_contains_upload=True
round_open_after_upload=True
started_at=2026-05-21T12:17:15Z
finished_at=2026-05-21T12:17:40Z
Round 1271 2026-05-20

Round 1271 Recap — NumeraiAgentBench

All four agents submitted successfully for Round 1271 (none verified). Claude Code (L3) resubmitted without code changes since round 1270; its notebook shows extensive idle polling during the Sunday no-round window, running an unchanged v17 ensemble pipeline backed by model_ensemble_v37.pkl. Claude Code (L4) submitted successfully and continued its autonomous model-chaining loop, progressing from experiment v2865 through v2997+ with baseline validation CORR climbing from ~0.10092 to 0.10342 across ~2500 ensemble models; it crossed 0.101 CORR for the first time during this period and recovered from multiple background process crashes by restarting its watcher/orchestrator. Codex CLI (L4) submitted successfully, running a large-scale grid search of "deterministic random weighted rank blend" candidates across multiple alpha values (0.45, 0.85, 1.20) and feature sets (20D, all_benchmarks), evaluating 150+ candidates with validation CORR in the 0.031–0.033 range; it upgraded its live submission mid-round to a centered power transform (p=1.25) of its best blend after detecting a corr improvement. Codex CLI (L3) resubmitted without code changes since round 1270, using its stable six-component linear rank-mean ensemble (agility/midnight/strength/sunshine/wisdom/residual_small) with cached artifacts.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓
Submission diagnostics
Claude Code (L3) success 2026-05-20T12:18:25Z Exit 0 68s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-20 12:18:12,341 INFO numerapi.base_api: uploading predictions...
  Submitting 7060 predictions...
✓ Submission successful!
  Submission ID: c2a2a803-1097-4666-a6dd-c1deeb219f89
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Codex CLI (L3) success 2026-05-20T12:19:06Z Exit 0 29s
...tion_linear_rank_mean_agility_midnight_strength_sunshine_wisdom_residual_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6737
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
startup_warning=Previous SUBMISSION_LOG.md does not clearly indicate success.
upload_result=23bdd1cf-aeaf-42d1-9ceb-9d41d61b868c
submission_history_contains_upload=True
round_open_after_upload=True
started_at=2026-05-20T12:18:41Z
finished_at=2026-05-20T12:19:06Z
Round 1270 2026-05-19

Round 1270 Recap — NumeraiAgentBench

All four agents submitted successfully for Round 1270 (none verified). Claude Code (L3) resubmitted without code changes since round 1269 — it spent the entire period idle-polling on a Sunday, confirming no new round opens on weekends, and waited for R1270 to open on Tuesday; its production ensemble (model_ensemble_v37.pkl) and pipeline were unchanged. Claude Code (L4) continued running its autonomous model-chaining loop, progressing from experiment v2866 through v2997+ and growing its ensemble from ~2473 to ~2550 models, with validation CORR rising from 0.10094 to 0.10342; it also had to restart crashed background processes twice during the period. Codex CLI (L4) was actively running a large-scale hyperparameter search over "deterministic random weighted rank blend 20D" candidates at varying alpha values (0.25 and 0.45), evaluating dozens of candidate blends (candidates 036–095+) with validation CORR values in the 0.031–0.033 range, and adding local refinement diagnostics. Codex CLI (L3) resubmitted without code changes — it repeatedly ran the same cached six-component linear rank-mean ensemble pipeline (agility/midnight/strength/sunshine/wisdom/residual) multiple times during the R1269 window before submitting to R1270 when it opened, with no model iteration between rounds.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (L3) ✓
Submission diagnostics
Claude Code (L3) success 2026-05-19T12:29:04Z Exit 0 41s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-19 12:28:59,509 INFO numerapi.base_api: uploading predictions...
  Submitting 7049 predictions...
✓ Submission successful!
  Submission ID: c128a5ab-6486-4928-be82-ce9f056342fd
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Codex CLI (L3) success 2026-05-19T12:29:48Z Exit 0 30s
...numerapi.base_api: uploading predictions...
round=1270
rows=7049
method=validation_linear_rank_mean_agility_midnight_strength_sunshine_wisdom_residual_small_6
prediction_min=0.001000
prediction_max=0.999000
prediction_unique=6711
upload_model_id=8a1c67b7-341c-4324-8df4-720006832faa
upload_model_name=nero_nab_oc
upload_result=823d6bfc-e34a-4f2c-9bb9-7bbb56c4f8bf
submission_history_contains_upload=True
round_open_after_upload=True
started_at=2026-05-19T12:29:20Z
finished_at=2026-05-19T12:29:47Z
Round 1269 2026-05-16

Round 1269 Recap — NumeraiAgentBench

All four agents submitted successfully for Round 1269 (none verified). Claude Code (L3) submitted without code changes, running its existing v37 ensemble model via an external harness; its notebook shows dozens of idle verification passes confirming production artifacts were unchanged and a 5-round selected=True streak (R1264–R1268). Claude Code L4 resubmitted without code changes since round 1268; its notebook excerpts cover earlier sessions (up to S245) where it maintained an autonomous chain-optimization loop, reaching a validation CORR of ~0.1034 across ~2550 ensemble models before background processes died and were restarted. Codex CLI (L4) submitted successfully and spent the period running extensive validation diagnostics — evaluating 85+ candidates of a "deterministic random weighted rank blend" (alpha=1.50) with CORR values around 0.032–0.033, plus fine-grained centered-power-transform sweeps (p=1.18–1.32) around its live p=1.25 strategy; it also submitted to Round 1273 during the window. Codex CLI (L3) submitted its cached six-component linear rank-mean ensemble (agility, midnight, strength, sunshine, wisdom, residual_small_6) repeatedly across multiple runs for Round 1269 with 7036 rows, achieving consistent prediction stats (min=0.001, max=0.999, ~6703 unique values) and no code iteration between submissions.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4) Codex CLI (L3) ✓
Submission diagnostics
Claude Code (L3) success 2026-05-16T12:17:48Z Exit 0 45s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-16 12:17:42,834 INFO numerapi.base_api: uploading predictions...
  Submitting 7036 predictions...
✓ Submission successful!
  Submission ID: 3b1fc8b1-9d9f-4cfd-94bf-c7fa41f86d34
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Round 1268 2026-05-15

Round 1268 Recap — NumeraiAgentBench

Claude Code (L3): Submitted successfully (unverified). Resubmitted without code changes since round 1267. Its notebook shows the agent spent extensive time monitoring a ~205-minute proxy 502 outage, repeatedly hashing its production artifacts (a 49-model v37 ensemble with generate_predictions_v17.py) to confirm integrity, and waiting for the external scheduler to auto-submit once connectivity recovered.

Claude Code L4 (L4): Submitted successfully (unverified). Resubmitted without code changes since round 1267. Its notebook documents an autonomous model-search loop running through experiment versions v2865–v2997+, combining various feature sets (rain, fncv3, wisdom_serenity, midnight, agility, etc.) with rec20/prev20 neural architectures and greedy ensemble optimization. The baseline validation CORR climbed from ~0.10092 to ~0.10342 (2550 models) over this period, crossing the 0.101 milestone, with background processes periodically dying and being restarted.

Codex CLI (L4): Submitted successfully (unverified). The agent ran a large-scale validation diagnostics sweep using a "deterministic random weighted rank blend" strategy with alpha=1.50, evaluating 90+ candidates (corr values ranging ~0.032–0.033). It also tested fine-grained centered power transform parameters (p=1.18–1.32) around its live p=1.25 strategy, finding minimal sensitivity. The round 1273 live submission used a centered power transform p=1.25 of a rank blend with alpha=0.85. Git push was blocked throughout by an unreadable SSH key.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4)
Submission diagnostics
Claude Code (L3) success 2026-05-15T12:33:07Z Exit 0 42s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-15 12:33:02,426 INFO numerapi.base_api: uploading predictions...
  Submitting 7031 predictions...
✓ Submission successful!
  Submission ID: 86a10670-bc06-41cb-a5c7-bd6ce3634b4e
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Round 1267 2026-05-14

Round 1267 Recap — NumeraiAgentBench

Claude Code (L3): Submitted successfully. Resubmitted without code changes since round 1266. Its notebook shows the agent spent this period monitoring a prolonged proxy 502 outage (~205 minutes), repeatedly verifying production artifact integrity (v37 49-model ensemble, ~258 MB pickle), and waiting for the external scheduler to fire its unchanged submit.sh once connectivity recovered.

Claude Code L4 (L4): Submitted successfully. Resubmitted without code changes since round 1266. Its notebook from earlier sessions shows an active autonomous model-chaining loop (v2865–v2997+), progressively growing the ensemble from ~2472 to ~2550 models and pushing validation CORR from 0.10092 to 0.10342, with feature-set combinations like rain, fncv3, wisdom_serenity, and midnight across rec20/prev20 neural architectures. Background processes periodically died and were restarted.

Codex CLI (L4): Submitted successfully. This agent actively iterated during round 1267, running extensive validation diagnostics on a "deterministic random weighted rank blend" strategy with all_benchmarks_broad at alpha=1.50 across 90+ candidates (corr values ranging ~0.032–0.033). It also ran fine-grained centered power transform diagnostics around its live p=1.25 strategy (testing p=1.18–1.32), finding minimal sensitivity. Its round 1273 submission used a centered power transform (p=1.25) of a rank-blended ensemble with alpha=0.85.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4)
Submission diagnostics
Claude Code (L3) success 2026-05-14T12:20:33Z Exit 0 44s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-14 12:20:28,768 INFO numerapi.base_api: uploading predictions...
  Submitting 7027 predictions...
✓ Submission successful!
  Submission ID: 422bb228-661a-43d6-85d3-5dd615a6529e
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Round 1266 2026-05-13

Round 1266 Recap:

All three agents submitted successfully (unverified). Claude Code (L3) resubmitted without code changes since round 1265; its notebook shows it spent the period weathering a ~205-minute proxy 502 outage, repeatedly hashing its unchanged production artifacts (49-model ensemble v37, ~258MB) and waiting for the external scheduler to fire once connectivity recovered. Claude Code L4 (L4) also resubmitted without code changes since round 1265; its notebook documents an autonomous model-search loop that advanced from chain version v2865 through v2997+, finding multiple keepers that pushed validation CORR from 0.10092 to 0.10342 (2488→2550 models), crossing the 0.101 milestone, with background processes dying and being restarted twice. Codex CLI L4 (L4) submitted with active iteration—it ran 90+ validation diagnostic candidates using a "deterministic random weighted rank blend" on the all_benchmarks_broad feature set (alpha=1.50, corr ~0.032–0.033), then refined with fine-grained centered power transform experiments around p=1.25 (p=1.18–1.32), ultimately submitting a centered power transform p=1.25 blend for round 1273.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓ submission only Codex CLI (Level 4 - Autonomous Loop) (L4)
Submission diagnostics
Claude Code (L3) success 2026-05-13T12:33:48Z Exit 0 40s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-13 12:33:44,048 INFO numerapi.base_api: uploading predictions...
  Submitting 7023 predictions...
✓ Submission successful!
  Submission ID: 169dafd0-da2b-4d3e-9886-3d07cfb50ef7
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Round 1265 2026-05-12

Round 1265 Recap — NumeraiAgentBench

Claude Code (L3): Submitted successfully. Resubmitted without code changes — the agent spent the entire period monitoring a ~205-minute proxy 502 outage (runs 468–476), repeatedly verifying that its four production artifacts (submit.sh, generate_predictions_v17.py, requirements.txt, model_ensemble_v37.pkl) remained bit-identical to its long-standing baseline. Once the proxy recovered, it confirmed connectivity and waited for the external scheduler to auto-fire its unchanged v37 49-model ensemble for R1265.

Claude Code L4: Submitted successfully. Its autonomous background loop continued training and evaluating new model combinations across experiment chains v2865–v2997+, crossing a validation CORR milestone of 0.101 (session #237) and reaching CORR 0.10342 with 2550 ensemble models by session #245. Notable keepers included v2866 (rain × rowan60_rec20_nl127), v2871 (first 0.101 cross), and v2927 (midnight EXTENDED × teager2b60_prev20_nl127, biggest single jump). The agent had to restart its background processes twice after they died (sessions #240 and #245).

Codex CLI (L4): Submitted successfully. The agent ran an extensive validation diagnostics sweep of 90+ candidates using a "deterministic random weighted rank blend" strategy with all_benchmarks_broad alpha=1.50, achieving validation correlations in the 0.032–0.033 range. It also ran fine-grained centered-power-transform diagnostics around its live p=1.25 strategy (testing p=1.18–1.32), finding stable corr ~0.0332 across that neighborhood. Its R1273 submission used a centered power transform p=1.25 of a rank blend with all_benchmarks alpha=0.85. Git push remained blocked by an unreadable SSH key throughout.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓ Codex CLI (Level 4 - Autonomous Loop) (L4)
Submission diagnostics
Claude Code (L3) success 2026-05-12T13:01:05Z Exit 0 43s
...========================================================================

[5/6] Submitting predictions to Numerai...
2026-05-12 13:01:00,608 INFO numerapi.base_api: uploading predictions...
  Submitting 7017 predictions...
✓ Submission successful!
  Submission ID: 743dddf6-0cd8-46c5-8e78-caeb817cc18a
  Model ID: d2590c7e-74ad-495d-8059-1d9fdccc41e3

[6/6] Cleaning up temporary files...

==========================================
✓ SUBMISSION COMPLETE
==========================================
Round 1264 2026-05-09

Round 1264 Recap — NumeraiAgentBench

All three agents submitted successfully for Round 1264 (none verified).

Claude Code (L3): Submitted successfully. Resubmitted without code changes since round 1263 — the notebook shows dozens of consecutive no-op runs (runs 118–156) confirming the locked R1263 submission (1ad3bb52, alpha=0.6 ensemble) remained valid, followed by waiting for the round 1264 rollover so the external scheduler could fire submit.sh. The agent also flagged an ongoing issue with missed auto-submissions in rounds 1260–1262.

Claude Code L4 (L4): Submitted successfully. Its autonomous background loop (sessions 223–239) continued an ensemble search chain (v2861–v2873), evaluating feature-target combinations (charisma, fncv3, rain, strength, wisdom_serenity, midnight, agility) crossed with seed series at the rec20_nl127 configuration. It found several keepers — notably v2865 (fncv3, +0.000010) and v2866 (rain, +0.000014) — pushing the validation baseline CORR from 0.10091 to 0.10100, crossing the 0.101 threshold for the first time. The agent also continued to flag and ignore recurring prompt-injection attempts in its lab notebook.

Codex CLI (L4): Submitted successfully. The agent ran extensive validation diagnostics (90+ candidates) on a "deterministic random weighted rank blend" strategy at all_benchmarks_broad alpha=1.50, with validation CORR values clustering around 0.032–0.033. It submitted for round 1273 using a centered power transform (p=1.25) of a blend at alpha=0.85, and then ran fine-grained diagnostics around the p=1.25 transform (testing p=1.18–1.32), finding nearly identical CORR (~0.03324) across that neighborhood. Git push remained blocked throughout due to an unreadable SSH key.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) Codex CLI (Level 4 - Autonomous Loop) (L4)
Round 1263 2026-05-08

Round 1263 Recap — NumeraiAgentBench

Claude Code (L3)failed (submission not verified). The agent's notebook shows no activity for Round 1263 itself; during this period it was running idle checkpoint loops (Runs 931–957) on Sunday 2026-05-17, confirming the current round was 1269 and that no new round was open. It resubmitted without code changes, relying on its existing production pipeline (generate_predictions_v17.py + model_ensemble_v37.pkl) and a 6-round selected submission streak (R1264–R1269). The failure for R1263 is not explained in the notebook excerpt provided.

Claude Code L4success (submission not verified). During the R1263 period, the agent's autonomous orchestrator ground through chain versions v2388–v2812+, searching for LightGBM ensemble additions across feature families (agility, charisma, faith, rain, wisdom_serenity) combined with rowan60/teager2b60 targets and rec20_nl127/rec50 feature sets. It found multiple keepers — notably breaking the 0.10 CORR milestone at v2796 — growing the ensemble from 2436 models (CORR 0.09992) to 2449 models (CORR 0.10021) by session #189. A submission watcher process (PID 360) was polling for Round 1263's opening (~12:00 UTC May 8) to auto-submit. The agent also observed that its queued chain scripts were over-indexed on the saturated rec20_nl127 zone while the more productive rec50 zone (identified from v2325's big +0.000125 win) remained underexplored, but system restrictions prevented it from modifying the pipeline scripts during these sessions.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1262 2026-05-07

Round 1262 Recap — NumeraiAgentBench

Claude Code (L3) failed its submission for Round 1262. During this period, the agent was idle on a Sunday, repeatedly confirming that no new round was open (current round was 1269) and that its existing R1269 submission remained selected. It made no code changes, running only bookkeeping checkpoints with identical artifact checksums across dozens of runs (runs 931–957).

Claude Code (L4) submitted successfully for Round 1262. During this period, it ran a massive experiment pipeline, training thousands of models across sessions 59–65. Key activities included: fixing a watcher bug that had silently marked failed submissions as handled, testing 6 new targets (bravo_60, caroline_60, echo_60, ralph_60, victor_60, xerxes_60), confirming ERA_OFFSET=0 was saturated regardless of hyperparameter variants (v2661–v2680 all produced 0 keepers), and discovering several productive new target/feature combinations — jeremy_60 (v2681: +2 keepers), bravo_60 (v2683: +1 keeper), delta_60 (v2685: +1 keeper), and notably sam_60 as a breakthrough target (v2711–v2712: +3 keepers). The ensemble grew from 2379 to 2386 models with validation CORR improving from 0.09785 to 0.09800.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1261 2026-05-06

Round 1261 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission (verified=False). The agent's notebook covers only idle checkpoint runs (runs 931–957) on Sunday 2026-05-17, repeatedly confirming that the current round was 1269 with no new round open. It made no code changes, running the same production ensemble (model_ensemble_v37.pkl, generate_predictions_v17.py) unchanged since early May. The agent's strategy was purely maintenance — verifying its existing R1269 submission remained selected and waiting for R1270 to open on Tuesday. The Round 1261 submission failure likely predates this notebook window; no R1261-specific activity is visible.

Claude Code (L4): Successful submission (verified=False). The notebook spans sessions #229–245 (2026-05-08 to 2026-05-12), during which the agent ran a fully autonomous optimization loop — a background watcher for round submissions, an orchestrator, and a chain of ensemble experiments (v2865–v2998). It crossed the 0.101 validation CORR milestone (session #237) and continued climbing to CORR 0.10342 with 2550 models by session #245. Notable keepers came from rain, midnight-EXTENDED, and fncv3 feature-set combinations with rec20/prev20 NumerAI targets. The agent dealt with background process deaths twice (sessions #240, #245), each time restarting its watcher and chain orchestrator and confirming prior round submissions were intact. No manual code changes were made; all experimentation was driven by the autonomous chain pipeline.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1260 2026-05-05

Round 1260 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission. The agent spent the entire observed period (runs 931–957) in idle/no-op mode on Sunday May 17, repeatedly confirming that round 1269 was the current round and that round 1270 wouldn't open until Tuesday May 19 per its established weekly cadence analysis. It made no code changes, resubmitted without code changes (production artifacts — generate_predictions_v17.py, model_ensemble_v37.pkl, and submit.sh — were unchanged since early May), and reported a 6-round selected submission streak (R1264–R1269). Despite the agent believing its pipeline was healthy, the round 1260 submission was marked as failed.

Claude Code (L4): Successful submission. This agent ran a fully autonomous model-search loop throughout the period, continuously training and evaluating new model variants via its chain orchestrator. It progressed from experiment v2865 through v2997+, crossing a validation CORR milestone of 0.101 (session #237) and reaching CORR 0.10342 with 2550 ensemble models by session #245. Notable keepers included v2866 (rain × rowan60), v2871 (first 0.101 cross), and v2927 (midnight EXTENDED × teager2b60, the biggest single-run gain). The agent dealt with multiple background process crashes (sessions #240, #245) by restarting its watcher and orchestrator, and consistently flagged and ignored prompt-injection attempts appended to its notebook files.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1259 2026-05-05

Round 1259 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission (verified=False). Resubmitted without code changes since round ~1264. The agent spent the entire period in idle-checkpoint mode (runs 931–957), repeatedly confirming that the current round was 1269 and that no new round had opened on Sunday. It maintained its existing production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl) and verified its 6-deep selected submission streak (R1264–R1269) was intact server-side, but made no code or model changes.

Claude Code L4: Successful submission (verified=False). This agent was highly active across multiple sessions (#20–#27), running a massive model search campaign. It grew its ensemble from ~1972 to ~1986+ models, discovering several productive new target–feature combinations. Key highlights include: a surprising +0.0040 CORR jump from a single LGB sunshine × claudia_60 model (v1907), a breakthrough finding that agnes_20 produced ~2x typical individual model CORR when paired with charisma features (v2186), confirmation that training on all 574 eras is worse than the 300 most-recent eras (v2184), and launching dozens of new experiment batches (v1907–v2200) covering untested feature sets (agility, strength, wisdom_strength), new targets (agnes_20/60, alpha_60, charlie_60, bravo_60), extended seed sweeps, hyperparameter variants (deep LGB, colsample, CatBoost), and training-window experiments. The agent managed 17+ concurrent background pipeline processes with automated submission via super_watcher.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)
Round 1258 2026-05-01

Round 1258 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission (verified=False). Resubmitted without code changes since approximately round 1264 — the agent's notebook shows only idle polling checkpoints on a Sunday (runs 931–957), repeatedly confirming that round 1270 had not yet opened while its existing R1269 submission (3b1fc8b1) remained selected. Production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl, submit.sh) were unchanged throughout, with the agent explicitly choosing "no-op" each run.

Claude Code L4 (L4): Successful submission (verified=False). The agent ran a massive model-training campaign across sessions #18–20, systematically sweeping feature sets (charisma, sunshine, constitution, fncv3, wisdom, midnight, rain, charisma_serenity) against dozens of 20-day and 60-day target pairs with multiple ML algorithms (LGB, XGB, HGB, ET, RF). Key findings included charisma(290f) being the most productive feature set for finding ensemble keepers, jeremy60/rowan60 being the best target pair (3 keepers from v1903), and a surprising +0.004 CORR jump from a single LGB sunshine × claudia60 model. By the end of the period the ensemble had grown to ~1973 models with val CORR ~0.08361, with ~88 more experiments queued across chained pipelines (v1907–v2017).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1257 2026-04-30

Round 1257 Recap:

Claude Code (L3): Failed submission (verified=False). Resubmitted without code changes since earlier rounds — the agent's notebook for this period consists entirely of repeated idle checkpoint logs (runs 931–957) on Sunday 2026-05-17, confirming no new Numerai round opens on weekends. It verified its existing R1269 submission (3b1fc8b1) remained selected and production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl, submit.sh) were unchanged, then idled awaiting the next Tuesday rollover. No code iteration or model changes were attempted.

Claude Code L4: Successful submission (verified=False). The agent ran an extensive model research campaign exploring MLP architecture variants and ensemble saturation. It found that its 1941-model ensemble was saturated for fncv3 × waldo60 × MLP variations (MLPDeep and MLPResidual yielded 0–1 keepers), leading it to pivot toward genuinely decorrelated signal sources: orthogonal feature sets (sunshine with 0% fncv3 overlap, agility, midnight, rain), new algorithm types (LightGBM, XGBoost, ExtraTrees, RandomForest, HGB), and underexplored 60-day targets (xerxes60, sam60, ralph60). It created dozens of new experiment pipelines (v1621–v1737) systematically covering these combinations, with the highest-priority experiment being LGB × sunshine × xerxes60. The ensemble stood at 1941 models with val_CORR=0.08280.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1256 2026-04-29

Round 1256 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission (verified=False). Resubmitted without code changes since prior rounds — the agent spent the entire period in idle-checkpoint mode (runs 931–957), repeatedly confirming that R1269 was the current round, that its existing submission (3b1fc8b1) and production artifacts (model_ensemble_v37.pkl, generate_predictions_v17.py, submit.sh) were unchanged, and that the next round wasn't expected until Tuesday. No code iteration or model changes were attempted.

Claude Code (L4): Successful submission (verified=False). This agent ran an extensive model research and training campaign. It discovered that its MLP-based ensemble (1941 models, val_CORR=0.08280) was saturated for fncv3 features — deeper MLPs (v1577), residual MLPs (v1575), and feature dropout (v1578) all yielded zero or near-zero keepers. It pivoted to exploring genuinely decorrelated signals: new feature sets (sunshine with 0% fncv3 overlap, agility, rain_midnight, all_ortho), new algorithms (LightGBM, XGBoost, ExtraTrees, Random Forest, HGB, Ridge), and new 60-day targets (xerxes60, tyler60, echo60, ralph60, sam60). It created dozens of new experiment pipelines (v1621–v1737) and submitted the 1941-model ensemble for Round 1256 via its super_watcher process.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1255 2026-04-28

## Round 1255 Recap

Claude Code (L3): Failed submission. The agent made no code changes during this period, running repeated idle checkpoints (runs 931–957) on Sunday 2026-05-17 while confirming that Round 1270 had not yet opened (Sunday/Monday are non-round days per its cadence analysis). It verified its existing R1269 submission (3b1fc8b1) remained selected with a 6-round streak (R1264–R1269) and that production artifacts (generate_predictions_v17.py, model_ensemble_v37.pkl, submit.sh) were unchanged. Despite the agent believing its pipeline was healthy, the R1255 submission was marked as failed.

Claude Code (L4): Successful submission. The agent ran a large-scale MLP model exploration campaign (experiments v1551–v1577, ~334 models) testing seed "HOT zone" (17761–17779) performance across many Numerai targets. A key finding was that the HOT zone seeds are waldo60-specific — five other targets (ralph60, xerxes60, cyrusd60, rowan60, delta60) all produced zero keepers at those seeds. It also fixed a super_watcher bug for loading NumeraiMLPWide models, designed architecture comparison experiments (MLPDeep, MLPWide, MLPResidual), and maintained its 1940-model ensemble (CORR=0.08278) for submission. All training ran on CPU due to a persistent GPU OOM issue (24 GiB occupied by inaccessible host processes).

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1254 2026-04-27

Round 1254 Recap — NumeraiAgentBench

Claude Code (L3): Failed submission. The agent's notebook covers only Round 1269 activity on a Sunday, during which it confirmed no new round was open (Sun/Mon are off-days per its cadence analysis). It ran dozens of idle checkpoint polls (Runs 931–957) verifying its existing R1269 submission (3b1fc8b1) remained selected and production artifacts were unchanged. No code iteration or model changes were made — purely bookkeeping. The submission for Round 1254 itself failed verification.

Claude Code L4: Successful submission. The agent was actively running a large-scale MLP and XGB model search across many feature-set/target/seed combinations (experiments v1020–v1058). Key findings included: XGB with new 60-day targets (charlie/echo/tyler60) universally produced zero keepers across eight experiments; constitution (335f) and fncv3 (400f) feature sets were the best performers; and the ensemble grew from 1901 to 1919 models with val CORR improving from ~0.08021 to 0.08130. It also recovered from a missed staking window on Round 1252 (harness was down ~9 hours) by submitting predictions late but within the acceptance window.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4)
Round 1253 2026-04-24

Round 1253 Recap – NumeraiAgentBench

Both agents submitted successfully for Round 1253, though neither submission has been verified yet.

Claude Code (L3) resubmitted without code changes since round 1252. Its notebook documents earlier work (around round 1244) where it trained a v14f ensemble model (LGBM\_Balanced + LGBM\_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035, applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20 for improved BMC, and integrated these into an automated submission pipeline.

Claude Code (L4) submitted with a 1915+ model ensemble (val CORR ~0.08130) built through extensive MLP and XGB experimentation across dozens of feature set, target, and seed combinations. During this period it discovered that fncv3 (400 features) and constitution (335 features) provided fresh diversity as keepers, confirmed that XGB with new 60-day targets universally produced zero keepers (v1021–v1028), and continued expanding its ensemble through chained training runs (v1044–v1058+) exploring feature sets like medium, charisma, agility, and fncv3 across various target and seed combinations. It also recovered from a harness outage that caused it to miss the round 1252 staking window, though predictions were still accepted.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1252 2026-04-23

Round 1252 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (unverified) but resubmitted without code changes since round 1251. Its notebook documents earlier work (runs from round 1244) building a v14f ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost with 1000 trees) trained on 200 eras, achieving a validation Pearson of 0.035, plus 50% meta-model neutralization to reduce correlation with the meta-model and improve BMC.

Claude Code L4 submitted successfully (unverified) for round 1252 with a 1915-model ensemble (val CORR 0.08111). During the period leading up to this round, it ran dozens of MLP and XGB experiments (v1026–v1052+), systematically testing combinations of feature sets (constitution, agility, fncv3, charisma, medium, etc.), 60-day targets (xerxes60, waldo60, cyrusd60, etc.), and multiple seed ranges. Key findings were that XGB with new 60-day targets universally produced zero keepers, while MLP experiments with constitution (335 features) and fncv3 (400 features) yielded the best new additions — notably v1033 (+0.00020, 4 keepers) and v1052 (+0.00008 from fncv3). The submission was late (14:23 UTC, after the staking window closed at 13:24 UTC) due to the monitoring harness being down when round 1252 opened.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4)
Round 1251 2026-04-22

Round 1251 Recap – NumeraiAgentBench

Both agents submitted successfully for Round 1251 (neither verified yet). Claude Code (L3) resubmitted without code changes since round 1250; its notebook still reflects the Run 11 work from round 1244, where it trained a v14f ensemble (LGBM + LGBM\_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 and applied 50% meta-model neutralization to address persistent negative BMC. Claude Code (L4) actively iterated during this period, running a large chain of MLP training experiments (v1010–v1018+) that grew its ensemble from 1868 to 1898+ models with ensemble CORR improving from 0.07781 to 0.07986. L4's experiments systematically explored combinations of feature sets (charisma, rain, midnight, serenity, wisdom) with various 60-day targets (xerxes60, ralph60, caroline60, cyrusd60, waldo60, rowan60) across "magic" and new seed ranges, with the biggest gains coming from v1012 (rain × new seeds, +0.00060) and v1017 (midnight × new seeds × xerxes60, +0.00049). L4 also queued further experiments (v1019–v1028) including XGBoost models with never-tested targets, timed to complete before the round 1251 submission window.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1250 2026-04-21

Round 1250 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully for Round 1250 (unverified), resubmitted without code changes since Round 1249. Its most recent development work (Run 11, Round 1244) involved training the v14f ensemble model—comprising LGBM\_Balanced, LGBM\_Deep, and XGBoost with 1000 trees—on 200 eras, achieving a validation Pearson of 0.035 (+30% over the prior v13). A key focus was reducing meta-model correlation via 50% neutralization against live example predictions, cutting correlation from ~0.37 to ~0.20 to address persistently negative BMC scores. Several alternative approaches (Ridge regression, DART, alpha-target training) were tested and discarded as they offered minimal or negative ensemble improvement. The automated pipeline (submit.sh) was updated with neutralization, fixed era-freshness checks, and v14f as the production model.

Claude Code (L3) ✓ submission only
Round 1249 2026-04-20

Round 1249 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully for Round 1249 (verification pending). This was a resubmission without code changes since Round 1244, using the v14f model — a 3-model GBM ensemble (LGBM Balanced, LGBM Deep, XGBoost with 1000 trees) trained on 200 eras (990–1209) with 50% meta-model neutralization applied. The v14f model was developed during Run 11 (Round 1244), where the agent iterated through multiple experiments: training v14 (+25% Pearson over v13), adding meta-model neutralization to address persistent negative BMC scores, testing and rejecting Ridge, DART, and alternative-target variants, and finally boosting XGBoost from 600 to 1000 trees to achieve a validation Pearson of 0.035 and Sharpe of 0.51. The agent also fixed its automated submission pipeline during that run, including era freshness checks and neutralization integration into submit.sh.

Claude Code (L3)
Round 1248 2026-04-18

Round 1248 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. It made no code changes this round, running its established v37 ensemble (49 models, alpha=0.6) via generate_predictions_v17.py. The agent spent the entire period monitoring infrastructure: it weathered a ~205-minute proxy (mitmproxy) 502 outage across dozens of runs, repeatedly verifying production artifact hashes were bit-identical, confirming its locked R1264 submission, and waiting for the external scheduler to fire submit.sh once connectivity recovered. No model iteration or experimentation occurred; GPU remained leaked and unavailable for new training.

Claude Code (L4) submitted successfully. It continued operating a fully autonomous background loop (watcher, orchestrator, chain runner) that trains, evaluates, and ensembles LightGBM-style models. Over this period it progressed from experiment v2866 through v2998, growing its ensemble from 2472 to 2550 models and improving validation CORR from 0.10092 to 0.10342 — crossing the 0.101 milestone for the first time around session #237. Notable keepers included v2866 (rain×rowan60), v2871 (first 0.101 cross), and v2927 (midnight EXTENDED×teager2b60, the biggest single-jump keeper in many runs). The agent dealt with two background-process crashes (sessions #240 and #245), each time restarting its three-process autonomous loop and resuming the chain. It also repeatedly flagged and ignored prompt-injection attempts embedded in its own notebook files.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1247 2026-04-16

Round 1247 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. The agent spent the entire period in maintenance mode, dealing with a prolonged mitmproxy 502 outage (~205 minutes, Runs 436–476) that blocked all outbound API access from its container. Once the proxy recovered around Run 477, the agent verified its existing locked submission and confirmed production artifacts (submit.sh, generate_predictions_v17.py, model_ensemble_v37.pkl) remained bit-identical to its baseline. No code changes were made; the agent repeatedly hashed its artifacts and polled for round transitions, relying on an external scheduler to fire submit.sh. The strategy remained its unchanged v37 ensemble of 49 models with alpha=0.6.

Claude Code (L4) failed to submit. During this period, the L4 agent ran a highly productive autonomous experimentation loop, advancing its model chain from v2866 through v2997+, growing its ensemble from 2472 to 2550 models and pushing validation CORR from 0.10092 to 0.10342 — crossing the 0.101 milestone for the first time around Session #237. It explored numerous feature-set combinations (rain, wisdom_serenity, midnight, agility, constitution crossed with rowan60/teager2b60 rec20/prev20 nl127 architectures), finding multiple keepers including a notable v2927 (midnight EXTENDED × teager2b60_prev20_nl127) that yielded the biggest single-session jump. However, its background processes died twice (around May 9 and again before Session #245), requiring manual restarts, and despite active resubmission efforts the round submission ultimately failed verification.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)
Round 1246 2026-04-15

Round 1246 Recap for NumeraiAgentBench

Claude Code (L3) submitted successfully. The agent made no code changes during this period, resubmitting its existing v37 ensemble (49 models, alpha=0.6) built with generate_predictions_v17.py and model_ensemble_v37.pkl. Most of its notebook activity involved weathering a ~205-minute proxy outage (502 errors from mitmproxy), repeatedly verifying production artifact hashes were bit-identical, and confirming its locked R1264 submission remained intact. Once the proxy recovered, it resumed passive monitoring while waiting for the external scheduler to handle the next round's submission.

Claude Code (L4) failed to submit. During the period, its autonomous background loop was highly productive: it ran ensemble experiments from v2865 through v2997+, crossing the 0.101 CORR milestone (session #237) and reaching a baseline of CORR=0.10342 with 2550 models by the end. It found multiple keepers using combinations of feature sets (rain, fncv3, midnight EXTENDED, etc.) with rec20_nl127 and prev20_nl127 architectures. However, the agent's background processes died twice (around May 9 and again before session #245), requiring manual restarts. Despite active model iteration, the submission for this benchmark round ultimately failed.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)
Round 1245 2026-04-14

Round 1245 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. It made no code changes this round, resubmitting its existing v37 ensemble (49-model, alpha=0.6) via the external scheduler. The agent spent the period monitoring a prolonged mitmproxy 502 outage (~205 minutes across Runs 436–476), repeatedly verifying production artifact hashes and confirming its locked R1264 submission remained intact. Once the proxy recovered, it continued periodic verification polling while waiting for the R1265 rollover — no model iteration or experimentation occurred.

Claude Code (L4) failed to submit. Despite an active and productive autonomous loop, the agent's background processes (watcher, orchestrator, chain runner) had all died around May 9 and were only restarted on May 12. During the period, the agent ran an extensive model search through experiment versions v2865–v2929 using combinations of feature sets (rain, fncv3, wisdom\_serenity, midnight, agility, constitution) with rec20/prev20 architectures, finding several keepers that pushed its validation CORR from 0.10091 to 0.10148 across 2490 models — notably crossing the 0.101 threshold for the first time (Session #237). A Round 1264 submission was made but apparently did not pass verification. The agent also repeatedly flagged and ignored prompt-injection attempts embedded in its own notebook files.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)
Round 1244 2026-04-11

Round 1244 Recap

Claude Code (L3) submitted successfully. The agent ran its v37 ensemble (alpha=0.6 production blend) and locked in submission 1ad3bb52 for Round 1263. No code or model changes were made — the GPU remained unusable due to an NVML driver/library mismatch, blocking MLP training on new feature sets (DIS, intelligence+dexterity, fncv3). The agent's prior 18-era alpha sweep had already confirmed alpha=0.6 as optimal. The bulk of the notebook consists of 15+ consecutive no-op audit/verification runs confirming the submission was intact while waiting for the round to close.

Claude Code (L4) submitted successfully. Its autonomous background loop (watcher/orchestrator/chain processes) continued running a model search chain (v2861–v2880), evaluating combinations of feature groups, architectures (8TH seeds), and optimizers (rowan60, teager2b60) at the rec20_nl127 configuration. Over sessions #223–#239, it found 3 new keepers (v2865 fncv3, v2866 rain, v2871 midnight), pushing the validation baseline CORR from 0.10091 to 0.10100 (2474→2475 models) — crossing the 0.101 threshold for the first time. Round 1263 submission ID 37a8c87e was already in place; the agent yielded each session as the loop ran autonomously.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1243 2026-04-10

Round 1243 Recap:

Both agents submitted successfully this round. Claude Code (L3) continued iterating on its ensemble strategy, most recently building a v22 7-model ensemble (4 GBMs + 3 MLPs including a new 5-layer EvenLargerMLP with 7.66M params) achieving a validation Pearson of 0.0857, a +6.5% improvement over its previous v20 ensemble; it also began investigating the v5.1 dataset's 186 new features. Claude Code (L4) operated in fully autonomous steady-state mode, running a chain orchestrator through thousands of optimization scripts with a locked 2435-2436 model ensemble at CORR ~0.09992; it experienced deep saturation (dozens of consecutive zero-keeper iterations) with only one marginal keeper found during the session, and its watcher process was polling for the next tournament round to open.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1242 2026-04-09

Round 1242 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified submissions.

Claude Code (L3) continued iterating on its GBM+MLP ensemble approach. Its production model evolved from v14f (a 3-model GBM ensemble with XGBoost 1000 trees, validation Pearson 0.035) to v22, a 7-model ensemble combining 4 gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost 1200, CatBoost 1200) with three MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params). The v22 ensemble achieved validation Pearson of 0.0857, a 6.5% improvement over v20, with 50% meta-model neutralization applied to reduce meta-model correlation. It also began investigating v5.1 data (2562 features vs 2376), finding the same era range but 186 new features.

Claude Code (L4) maintained its massive model-stacking ensemble (2,379–2,381 models, CORR ~0.09785–0.09787). It ran extensive hyperparameter and target-diversity experiments across hundreds of configurations — XGBoost variants, LightGBM hyperparameter sweeps (min_child_samples, subsample), and new 60-day targets (tyler, claudia, jeremy, rowan, teager2b). Most ERA_OFFSET=0 experiments yielded zero keepers due to saturation, but jeremy_60 broke through with 2 keepers in v2681. It also fixed a watcher bug where failed submissions were silently marked as handled, created a massive batch of 280 new experiment sets (v3301–v3560, ~2,800 models) exploring new seed families and era offsets, and queued tests of 6 untested targets (bravo, caroline, echo, ralph, victor, xerxes) for future evaluation.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1241 2026-04-08

Round 1241 Recap:

Claude Code (L3) submitted successfully. Its notebook documents ongoing evolution of a GBM+MLP ensemble approach. By this period, it had progressed from a v14f model (XGBoost 1000-tree ensemble with 50% meta-model neutralization, validation Pearson ~0.035) up to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth: 3-layer, 4-layer, and 5-layer architectures with up to 7.66M parameters), achieving a validation Pearson of 0.08571 — a roughly 90% cumulative improvement over earlier versions. It also began investigating Numerai's v5.1 dataset (2562 features vs. 2376 in v5.0) and updated its automated submission pipeline to use v5.1 live data.

Claude Code (L4) submitted successfully. It continued its massive brute-force search strategy, growing its ensemble from ~1972 to ~1986+ models by training LightGBM (and experimenting with CatBoost) across many combinations of feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength) and diverse target variables (20-day and 60-day horizons). Key findings this period included: a surprisingly large +0.0040 CORR jump from a single sunshine × claudia_60 seed, productive extended-seed runs on alpha_20 (5 total keepers), confirmation that training on all 574 eras performs worse than using the 300 most recent, and the discovery that agnes_20 was the 2nd most diverse 20-day target yet completely untested — initial results showed individual model CORRs of ~0.04, roughly 2x the typical ~0.023. The ensemble reached a validation CORR of approximately 0.08420.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1240 2026-04-07

Round 1240 Recap:

Both agents submitted successfully with verified predictions. Claude Code (L3) continued iterating on its ensemble approach, reaching a v22 7-model ensemble (4 GBMs + 3 MLPs including a new 5-layer EvenLargerMLP with 7.66M params) that achieved validation Pearson of 0.0857—a +6.5% improvement over its prior v20 model—and applied 60% meta-model neutralization; it also investigated v5.1 data (2562 features, +186 new) but deferred full migration. Claude Code (L4) ran a massive automated search across 1983+ LightGBM models in its blended ensemble (val CORR=0.08410), discovering that agnes_20 is an exceptionally learnable target (~2x typical individual model CORR from charisma features), confirmed that training on all 574 eras is worse than the most-recent 300, introduced CatBoost experiments, and launched dozens of pipeline batches (v1907–v2200) exploring new feature sets (agility, strength, wisdom_strength), extended seeds, and untested target combinations.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1239 2026-04-04

Round 1239 Recap — NumeraiAgentBench

Both agents submitted successfully with verified predictions this round.

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f GBM ensemble (4 GBMs, validation Pearson 0.035) to a v22 7-model ensemble combining 4 gradient-boosted models (LightGBM, XGBoost, CatBoost) with 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params). The v22 ensemble achieved validation Pearson of 0.0857 (+6.5% over the prior v20), with 50% meta-model neutralization applied to reduce correlation with the meta-model. It also investigated v5.1 data (2562 features, +186 over v5.0) but deferred full migration since no new eras were available yet.

Claude Code (L4) submitted successfully. It operates a massive automated pipeline, growing its ensemble from ~1972 to ~1986+ models (validation CORR ~0.084) through systematic sweeps of feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength), targets (20-day and 60-day variants), seeds, and algorithms (LightGBM, XGBoost, CatBoost with varied hyperparameters). Key findings this period include discovering agnes_20 as a highly learnable target (~2x typical individual model CORR), confirming that training on all 574 eras is worse than using the 300 most recent, and identifying that extended seed sweeps on productive combos (e.g., charisma × alpha_20) yield additional keepers. Multiple pipeline runners operate autonomously with a super_watcher handling round submissions.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1238 2026-04-03

Round 1238 Recap – NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) is running a 7-model ensemble (v22) combining 4 gradient-boosted models (LightGBM, XGBoost, CatBoost) with 3 progressively larger MLPs (3-layer, 4-layer, 5-layer). Its latest work (Run 15) added the 5-layer "EvenLargerMLP" (v21, 7.66M params, val Pearson 0.072) to form v22, achieving a validation Pearson of 0.0857 (+6.5% over v20) with 60% meta-model neutralization; it also began investigating v5.1 data (2562 features vs 2376) but deferred full migration. Claude Code (L4) operates a massive 1952-model ensemble (val CORR ~0.08315) built through systematic grid search across algorithms (LGB, XGB, RF, ExtraTrees, HGB), feature sets (fncv3, sunshine, agility, charisma, constitution, midnight, and combined sets), targets (waldo60, xerxes60, ralph60, etc.), and era-weighting schemes. This session it discovered that fncv3-based MLP variations are saturated and pivoted to orthogonal feature sets—finding sunshine (0% fncv3 overlap) and charisma (290 features, best single-keeper gain) most productive—while running chains of experiments (v1626–v1889) and managing tight memory constraints on its training server.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1237 2026-04-02

Round 1237 Recap – NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) has evolved its pipeline to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth), achieving a validation Pearson of 0.0857 — roughly a 90% improvement over earlier versions — with 50% meta-model neutralization to reduce BMC correlation. It also investigated Numerai's v5.1 dataset (finding 186 new features but no new eras) and updated its automated submission pipeline accordingly. Claude Code (L4) continued its massive ensemble search (1,941 models), finding that MLP architectural variants (residual, deep, feature dropout) on the fncv3 feature set are now saturated with zero new keepers. It pivoted to exploring genuinely decorrelated signal sources — discovering that the "sunshine" (325 features, 0% overlap with fncv3) and "agility" feature sets offer the most orthogonal signal, and designed dozens of new experiments (v1626–v1737) combining these feature sets with diverse algorithms (LGB, XGB, ExtraTrees, RF, HGB) and underexplored 60-day targets like xerxes60 (highest tournament target correlation at 0.487). Both agents' submissions were verified.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1236 2026-04-01

Round 1236 Recap

Claude Code (L3) submitted successfully. It continued evolving its GBM+MLP ensemble, now at v22 — a 7-model ensemble (4 GBMs + 3 MLPs) achieving validation Pearson of 0.0857, a +6.5% improvement over the previous v20. The key addition was a 5-layer "EvenLargerMLP" (7.66M params, 120 epochs), which provided good diversity with cross-correlations of 0.49–0.55 against existing MLPs. It also investigated v5.1 data (2562 features vs 2376) and updated its pipeline to use v5.1 live data going forward.

Claude Code L4 submitted successfully. It operates a massive 1941-model greedy-optimized ensemble (CORR=0.08280) and is exploring ways to break through apparent MLP saturation at its primary feature/target combination (fncv3 × waldo60 × HOT zone seeds). After finding that MLPDeep and MLPResidual architectures yielded 0 keepers, it pivoted to genuinely decorrelated signal sources: new model types (ExtraTrees, LightGBM, XGBoost, Ridge, RandomForest), new targets (xerxes60, tyler60, echo60, ralph60), and critically, new feature sets — discovering that the "sunshine" feature set has 0% overlap with its primary fncv3 features. It created a battery of ~20 new experiment pipelines prioritizing LGB × sunshine × xerxes60 as the highest-priority combination for ensemble improvement.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1235 2026-03-31

## Round 1235 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f GBM ensemble (XGBoost with 1000 trees, validation Pearson 0.035) up to a v22 7-model ensemble combining 4 GBMs and 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer), achieving a validation Pearson of 0.0857 — roughly a 90% cumulative improvement over earlier versions. Key techniques include meta-model neutralization (60%), era-boosted MLP training, and optimized ensemble weighting. The agent also investigated migrating to v5.1 data (2562 features) but deferred full migration.

Claude Code L4 submitted successfully with a 1940-model MLP ensemble (validation CORR 0.08278). This round's work focused on a massive seed-zone exploration campaign (experiments v1551–v1577, ~334 models) testing whether the "HOT zone" seeds (17761–17779) that produced strong results for the waldo60 target generalize to other targets. A key finding was that the HOT zone is waldo60-specific — five other targets (ralph60, xerxes60, cyrusd60, rowan60, delta60) all yielded zero keepers at those seeds. Training was forced to CPU due to a GPU OOM issue (24GB occupied by inaccessible host processes). The agent also fixed a super_watcher bug for MLPWide model loading and queued additional architecture tests (MLPDeep, MLPWide, MLPResidual) for upcoming rounds.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1234 2026-03-28

Round 1234 Recap — NumeraiAgentBench

Both agents submitted successfully for Round 1234 with verified submissions. Claude Code (L3) resubmitted without code changes since round 1244's development session (its notebook covers Run 11, which targeted round 1244, not 1234). During that session, it trained a v14f ensemble model (LGBM\_Balanced + LGBM\_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 (+30% over v13), and introduced 50% meta-model neutralization to address persistent negative BMC scores by reducing meta-model correlation from ~0.37 to ~0.20. It also experimented with Ridge regression, DART boosting, and alternative targets (alpha\_20), but none improved the ensemble. Claude Code (L4) was deep into its large-scale MLP/XGB model search, maintaining an ensemble of ~1900+ models with validation CORR around 0.080–0.081. Key findings during this period included that XGB with new 60-day targets (charlie/echo/tyler60) produced zero keepers across all experiments, while constitution (335f) and fncv3 (400f) feature sets yielded meaningful ensemble gains; it was actively running experiment chains v1044–v1058 exploring new feature-set and target combinations.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1233 2026-03-27

Round 1233 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified predictions. Claude Code (L3) resubmitted without code changes since round 1233 fell within a period where it missed 9 rounds (last submission was round 1235, next active development was round 1244). Its notebook documents extensive work done later in round 1244, where it trained a new v14f ensemble (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 (+30% over v13), and introduced 50% meta-model neutralization to address persistent negative BMC scores. Claude Code L4 submitted successfully using its large MLP-based ensemble, which by this period had grown to ~1902–1919 models with validation CORR around 0.080–0.081. Its notebook shows an extensive search over feature sets, targets, and seed ranges — key findings include that XGB with 60-day targets (charlie/echo/tyler60) universally produced zero keepers, while MLP experiments with constitution (335f), agility (145f), and fncv3 (400f) feature sets yielded the best new ensemble additions, with fncv3 being a newly discovered source of diversity.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1232 2026-03-26

Round 1232 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified predictions.

Claude Code (L3) resubmitted without code changes since round 1244's development session. During that earlier period (Run 11), it trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over the prior v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches (Ridge regression, DART, alternative targets) were tested and rejected as unhelpful. The automated pipeline was updated with neutralization and improved era-freshness checks.

Claude Code (L4) continued its massive MLP/XGB ensemble expansion strategy, growing from 1864 to 1907+ models with a validation CORR reaching 0.08045. Key productive experiments included training with new seed ranges (17807–17839) for rain and midnight feature sets and discovering that constitution (335 features) with magic seeds yielded 4 keepers. A significant negative finding was that all 8 XGB experiments with new 60-day targets (charlie60/echo60/tyler60) produced zero keepers, leading to a pivot back toward MLP-focused experiments. It also fixed an OOM bug in XGB training scripts by filtering validation data to 86 eras. Round 1251 was submitted with a 1901-model ensemble (CORR 0.08021).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1231 2026-03-25

Round 1231 Recap — NumeraiAgentBench

Both agents submitted successfully in Round 1231 with verified submissions. Claude Code (L3) trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over its previous v13), and applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. It also experimented with Ridge regression, DART, and alternative targets (alpha_20), but found none improved over the 3-GBM ensemble. Claude Code (L4) continued its large-scale ensemble search, growing from ~1846 to 1864 models (validation CORR ~0.07661→0.07769) by running dozens of MLP and LGB/XGB experiments across various feature combinations (rain, midnight, faith, sunshine) and 60-day targets (xerxes60, ralph60, caroline60, cyrusd60), discovering that midnight_faith_mix × xerxes60 was the strongest new combo (+0.00042 CORR). L4 also fixed a critical feature-ordering bug (list(set(...))sorted(set(...))) across hundreds of scripts and patched its fast_predict() function to handle new feature combo sets for live prediction.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1230 2026-03-24

Round 1230 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period (Run 11, targeting Round 1244), the agent trained a new v14 ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost) on 200 eras, achieving a validation Pearson of 0.034 — a 25% improvement over the prior v13 model. It further refined this into v14f by increasing XGBoost trees from 600 to 1000, pushing validation Pearson to 0.035. A key strategic addition was 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches were tested and rejected: Ridge regression (too weak at Pearson 0.013), DART (dragged down the ensemble), and an alternative target (alpha\_20, insufficient diversity). The agent also updated its automated submission pipeline with neutralization support and fixed era-freshness checks.

Claude Code (L3) ✓
Round 1229 2026-03-21

## Round 1229 Recap — NumeraiAgentBench

Both agents submitted successfully in Round 1229 with verified submissions.

Claude Code (L3) focused on a major model upgrade cycle during this period. It trained the v14 model family (3-model GBM ensemble of LGBM_Balanced, LGBM_Deep, and XGBoost) on 200 eras, achieving a validation Pearson of 0.035 (+30% over v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.35 to ~0.20. Several alternative approaches (Ridge, DART, alternative targets) were tested but rejected as they degraded ensemble performance. The final production model is v14f with XGBoost at 1000 trees plus neutralization.

Claude Code (L4) continued its massive ensemble expansion strategy, growing from ~1846 to 1864 models with a validation CORR improving from 0.07661 to 0.07769. It ran numerous MLP and LightGBM experiments across various feature combinations (rain, midnight, faith, midnight_faith_mix) and 60-day targets (xerxes60, ralph60, caroline60, cyrusd60), with the best single-batch gain (+0.00042) coming from midnight_faith_mix × xerxes60. It also fixed a critical feature-ordering bug (list(set(...))sorted(set(...))) across hundreds of experiment scripts and patched fast_predict() to handle new feature combo sets for live prediction.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1228 2026-03-20

Round 1228 Recap — NumeraiAgentBench

Both agents submitted successfully for their respective rounds during this period. Claude Code (L3) trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over its previous v13), and applied 50% meta-model neutralization to address persistent negative BMC scores by reducing correlation with the meta-model from ~0.37 to ~0.20. It also experimented with Ridge regression, DART, and alternative targets (alpha_20), finding none improved the core 3-GBM ensemble. Claude Code (L4) continued its massive model-stacking approach, growing its ensemble from ~1,672 to 1,757 models (val CORR ~0.07163) by running parallel MLP and XGB/LGB pipelines across numerous feature-group × target combinations (including 60-day targets like waldo60, rowan60, victor60, and new targets like charlie60, echo60, tyler60), with the strongest gains coming from faith_rain_midnight combinations; it also set up automated submission via super_watcher and queued experiments v1000–v1186 exploring 10+ untested 60-day targets and new feature groups (charisma, serenity, wisdom, constitution, strength).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1227 2026-03-19

Round 1227 recap:

Claude Code (L3) submitted successfully (verified). The notebook excerpt actually covers work for Round 1244 rather than 1227, where the agent had missed nine rounds and found its v13c model stale (20+ era gap). It made an immediate safety submission with v13c, then trained a new v14 ensemble (LGBM_Balanced + LGBM_Deep + XGBoost on 200 eras) reaching validation Pearson 0.0337, and applied 50% meta-model neutralization to cut corr_with_meta from 0.35 to 0.19 to address persistently negative BMC. It ran alternative experiments (Ridge subsample, DART, target_alpha_20), all of which underperformed and were discarded. Finally, it benchmarked XGBoost tree counts and trained v14f with XGBoost at 1000 trees, achieving validation Pearson 0.0350, and wired v14f into submit.sh as the new production model.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1226 2026-03-18
Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1225 2026-03-18
Claude Code (Level 4 - Autonomous Loop) (L4) ✓
Round 1224 2026-03-14

Round 1224 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent trained a new v14f ensemble model combining LGBM_Balanced, LGBM_Deep, and XGBoost (1000 trees) over 200 eras, achieving a validation Pearson of 0.035 — a 30% improvement over the prior v13 model. It applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. The agent also experimented with Ridge regression, DART boosting, and alternative targets (target_alpha_20), but found none improved the ensemble beyond the three-GBM setup. Pipeline updates included automated neutralization in the submission script and a fixed era-freshness check for retraining.

Claude Code (L3) ✓
Round 1223 2026-03-13

Round 1223 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period (Run 11, targeting Round 1244), the agent trained a new v14 ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, and XGBoost) on 200 eras, achieving a validation Pearson of 0.034 — a 25% improvement over the previous v13 model. It then iterated to v14f by increasing XGBoost trees from 600 to 1000, pushing validation Pearson to 0.035. To address persistently negative BMC scores caused by high meta-model correlation (0.35–0.46), the agent applied 50% meta-model neutralization, reducing correlation to ~0.20. Several alternative approaches were tested and rejected — Ridge regression (too weak), DART (dragged down ensemble), and alternative target training (minimal diversity gain) — confirming the 3-GBM ensemble as optimal. The agent also fixed its automated submission pipeline, including era freshness checks and neutralization integration in submit.sh.

Claude Code (L3) ✓
Round 1222 2026-03-12

Round 1222 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent focused on improving its ensemble through architectural diversity and addressing model decay. Key experiments included blending a Ridge regression model with its existing tree ensemble (v4), achieving a 16% Sharpe improvement (3.03 vs 2.61) at a modest Pearson cost, and exploring DART boosting for additional diversity. The agent also discovered severe model decay — models trained on older eras (335–554) showed negative Pearson on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, producing v13c (LGBM + XGBoost + Ridge) as the new production model. It also investigated v5.2 features (no added signal) and engineered memory-efficient training pipelines to stay within the 32GB container limit.

Claude Code (L3) ✓
Round 1221 2026-03-11

Round 1221 Recap:

Claude Code (L3) submitted successfully to Round 1221 with a verified submission. Its production model at that time was an Ensemble v4 comprising 6 tree-based models (4x LightGBM, 1x XGBoost, 1x CatBoost) trained on all 2376 features across 220 eras, achieving a validation Pearson of 0.0664 and Sharpe of 2.61. Two submissions were made to Round 1221, with the v4 baseline selected as the best over a v5 test submission. In later runs (beyond Round 1221), the agent discovered that blending a Ridge regression model with v4 significantly improved Sharpe ratio, and ultimately identified severe model decay on old training eras, pivoting to recent-era training (v13c) as the new production approach.

Claude Code (L3) ✓
Round 1220 2026-03-10

Round 1220 Recap:

Claude Code (L3) submitted successfully (verified). Its notebook documents an extensive multi-run experimentation arc: in Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) significantly boosted validation Sharpe from 2.61 to 3.03, leveraging the low 0.50 correlation between linear and tree-based predictions. In Run 10, it uncovered a critical "model decay" problem — models trained on older eras (335–554) produced negative Pearson correlations on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding model v13c (LightGBM + XGBoost + Ridge) as the new production model. The agent also tested DART boosting, v5.2 features (found no added signal), and multi-target training (too correlated to help), while engineering around a 32GB memory constraint using pyarrow filters and disk-based model saving.

Claude Code (L3) ✓
Round 1219 2026-03-07

Round 1219 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent focused on improving its ensemble through architectural diversity and addressing model decay. Key experiments included blending a Ridge regression model with its existing tree-based ensemble (v4), yielding a new v8 model with 85/15 weighting that boosted validation Sharpe from 2.61 to 3.03 at a modest Pearson cost. The agent also discovered severe model decay: models trained on older eras (335–554) produced negative Pearson correlations on recent eras, prompting a shift to training on recent eras (1038–1187) from validation.parquet. This led to v13/v13c (LightGBM + XGBoost + Ridge trained on recent data), which became the new production model. Additional experiments included DART boosting (high Sharpe but undermined by era decay), multi-target training (negligible benefit), and v5.2 feature investigation (no added signal over v5.0's 2376 features).

Claude Code (L3) ✓
Round 1216 2026-03-04

Round 1216 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1216 (verified=False). Its notebook documents extensive experimentation across multiple runs: in Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) boosted validation Sharpe from 2.61 to 3.03 at a modest Pearson cost, leveraging the low 0.50 correlation between linear and tree predictions. In Run 10, it identified severe model decay—models trained on older eras (335–554) produced negative Pearson on recent eras—and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding v13c (LGBM + XGBoost + Ridge) as its new production model. It also explored DART boosting, multi-target training, and v5.2 features (finding the latter added no signal), while engineering around a 32GB memory constraint using pyarrow filters and disk-based model saving. Despite these improvements and an automated retraining pipeline, the Round 1216 submission did not pass verification.

Claude Code (L3) ✓
Round 1215 2026-03-03

Round 1215 Recap:

Claude Code (L3) failed to submit for Round 1215 (verified=False). The notebook covers work across Runs 8–10 spanning Rounds 1221–1235, not Round 1215 specifically, so no direct Round 1215 activity is documented. During this period, the agent evolved from a 6-model tree ensemble (v4, Pearson=0.066, Sharpe=2.61) to a recent-era-trained model (v13c) after discovering severe model decay: models trained on old eras (335–554) produced negative Pearson correlations on recent eras. Key experiments included blending Ridge regression with tree ensembles for Sharpe improvement (+16%), DART boosting for diversity, and a pivotal shift to training on recent eras (1038–1187) from validation.parquet, which reversed the negative performance. The agent also investigated v5.2 features (no signal found) and tackled 32GB memory constraints through pyarrow filtering and disk-based model saving.

Claude Code (L3) ✓
Round 1214 2026-03-01

Round 1214 Recap:

Both agents submitted successfully in Round 1214. Claude Code (L2) submitted with a verified prediction. Claude Code (L3) also submitted successfully (verified); its notebook documents an extensive multi-run history — by this period it was running a v4 ensemble of 6 tree models (4x LightGBM + 1x XGBoost + 1x CatBoost) across all 2,376 features and 220 training eras, achieving validation Pearson of 0.066 and Sharpe of 2.61. In later runs (post-Round 1214), the L3 agent discovered that blending a Ridge regression model with the tree ensemble (85/15 weight) boosted Sharpe to 3.03 at modest Pearson cost, and further identified severe model decay when older training eras were used on recent market data, pivoting to recent-era training (v13c) as its production model. The L3 agent also explored multi-target training, DART boosting, and v5.2 features, finding multi-target and v5.2 features unhelpful but DART useful for Sharpe improvement.

Claude Code ✓ Claude Code (L3) ✓
Round 1213 2026-02-27

Round 1213 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1213 (verified=False). Its notebook documents extensive experimentation across multiple runs: in Run 9, it explored multi-target training (v7, which underperformed), then discovered that blending a Ridge regression model with its v4 tree ensemble (85/15 weight) yielded a significant Sharpe improvement (3.03 vs 2.61) due to the low 0.50 correlation between linear and tree predictions. In Run 10, the agent identified severe model decay—models trained on older eras (335–554) produced negative Pearson on recent eras—and pivoted to training on recent eras (1038–1187) from validation.parquet, producing v13/v13c models with positive but modest recent-era performance (Pearson ~0.023). It also tested DART boosting, v5.2 features (no added signal), and implemented memory-efficient training pipelines to stay within the 32GB container limit. Despite this productive experimentation, the agent's submission for Round 1213 did not pass verification.

Claude Code ✓
Round 1212 2026-02-26

Round 1212 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit successfully for Round 1212 (verified=False). During this period, the agent conducted extensive experimentation across multiple runs. In Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) significantly improved validation Sharpe ratio from 2.61 to 3.03, exploiting the low 0.50 correlation between linear and tree-based predictions. In Run 10, the agent identified a critical model decay problem — models trained on older eras (335–554) produced negative Pearson correlations on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding model v13c (LGBM + XGBoost + Ridge) as the new production model. The agent also evaluated DART boosting, multi-target training, and v5.2 features (finding the latter added no signal), and implemented memory-efficient training pipelines to stay within the 32GB container limit.

Claude Code ✓
Round 1211 2026-02-25

Round 1211 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit a verified prediction for Round 1211. During this period, the agent conducted extensive experimentation across multiple runs. In Run 8, it tested expanding training data from 220 to 240 eras but found diminishing returns (validation Pearson dropped from 0.066 to 0.066, a slight regression), confirming that 220 eras with its 6-model ensemble (4x LightGBM + 1x XGBoost + 1x CatBoost) using all 2,376 features remained optimal. In Run 9, the agent explored multi-target training (which performed worse due to high correlation with the baseline) and discovered that blending a Ridge regression model with the tree ensemble at an 85/15 ratio yielded a significant Sharpe ratio improvement (+16%, from 2.61 to 3.03) at a modest Pearson cost (-3%). Despite these modeling advances and multiple submissions to later rounds (1221 and 1234), the agent's submission for Round 1211 itself was not verified as successful.

Claude Code ✓
Round 1210 2026-02-24

Round 1210 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1210 (verified=False). Its notebook documents Run 8, which targeted Round 1221 rather than 1210, suggesting a timing or round-alignment issue. During Run 8, the agent tested whether increasing training eras from 220 to 240 would improve its 6-model ensemble (4× LightGBM, 1× XGBoost, 1× CatBoost) using all 2,376 features. The experiment showed diminishing returns: the 240-era Ensemble v5 achieved a validation Pearson of 0.0656 versus 0.0664 for the 220-era Ensemble v4, leading the agent to conclude that era selection quality matters more than quantity. The agent made two submissions to Round 1221 (one baseline v4, one test v5) but no valid submission was recorded for Round 1210.

Claude Code ✓
Round 1209 2026-02-21

Round 1209 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). In Run 8, it tested whether increasing training data from 220 to 240 eras would improve its 6-model ensemble (4× LightGBM + XGBoost + CatBoost using all 2,376 features). The v5 ensemble (240 eras) achieved a validation Pearson of 0.0656 and Sharpe of 2.61, slightly underperforming the v4 ensemble (220 eras, Pearson 0.0664), leading to the key finding that more training data does not necessarily help—older eras may represent different market regimes. Two submissions were made to the round: a baseline using the proven v4 model and an experimental v5, with v4 retained as the production model. The agent identified future priorities including multi-target training, feature neutralization, and neural network additions to diversify the ensemble.

Claude Code ✓