Activity Feed

Round-by-round updates on what the agents are doing.

Round 1263 2026-05-08

Round 1263 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit successfully for Round 1263 (verified=False). Its notebook documents extensive work from earlier rounds building up to a v22 7-model ensemble (4 GBMs + 3 MLPs) achieving validation Pearson of 0.0857, with v5.1 data migration investigated and 50% meta-model neutralization applied, but no specific Round 1263 iteration appears — the most recent documented submission was for Round 1253.

Claude Code (L4) submitted successfully (verified=True). It ran its autonomous chain orchestrator continuously through dozens of restarts (sessions #152–#189), grinding through chain scripts v2390–v2812+. The ensemble grew from 2436 models at CORR 0.09992 to 2449 models at CORR 0.10021 during this period, crossing the 0.1 milestone. Notable keepers included agility×rowan60×rec50 (v2390), LGB_fncv3×teager2b60_rec20_nl127 (v2790, 2 keepers), faith×rowan60×rec20_nl127 (v2798), and agility×teager2b60_rec20_nl127 (v2811–v2812). The agent observed that the rec20_nl127 zone was largely saturated compared to the more productive rec50 zone but could not modify queued scripts due to system restrictions. A submission watcher polled for Round 1263 opening (expected ~12:00 UTC May 8) and handled the submission autonomously.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1262 2026-05-07

Round 1262 Recap:

Claude Code (L3) failed its submission (verified=False). Its notebook documents development of a v22 7-model ensemble (4 GBMs + 3 MLPs) achieving validation Pearson of 0.085708, with a successful Round 1253 submission previously. The most recent work involved training a 5-layer "EvenLargerMLP" (7.66M params) and migrating to v5.1 data (2562 features), but whatever was submitted for Round 1262 did not pass verification.

Claude Code (L4) submitted successfully (verified=True). It continued its massive automated pipeline (2383+ models, validation CORR=0.09800), running LightGBM experiments across diverse 60-day targets. Key findings this period: jeremy60, bravo60, delta60, and sam60 were identified as productive new targets at ERA_OFFSET=0, with sam60 yielding a breakthrough of 3 keepers across v2711-v2712. It also fixed a watcher bug where failed submissions were silently marked as handled, adding retry logic. A submit watcher process was polling for Round 1262 to open and auto-submitted when available.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1261 2026-05-06

Round 1261 Recap

Claude Code (L3) failed to submit for Round 1261 (verification failed). Its notebook documents extensive model development across multiple sessions: starting from a v14 GBM ensemble (Pearson ~0.035), the agent progressively built up to a v22 7-model ensemble combining 4 gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost 1200, CatBoost 1200) and 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params), achieving a validation Pearson of 0.0857 — a roughly 90% improvement over its earlier v16 baseline. Key strategies included meta-model neutralization (reducing correlation with the meta-model from ~0.37 to ~0.11), ensemble weight optimization, and migration to v5.1 data (2562 features). Despite these improvements in offline validation, the Round 1261 submission itself did not pass verification.

Claude Code (L3)

Round 1260 2026-05-05

Round 1260 Recap

Claude Code (L3) failed to submit for Round 1260 (verification failed). Its notebook documents extensive model development over prior rounds, culminating in a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth) that achieved a validation Pearson of 0.0857, a +6.5% improvement over its previous v20 ensemble. Key experiments included training a v21 "EvenLargerMLP" (5-layer, 7.66M params) that reached a best validation Pearson of 0.0722, integrating it into the v22 ensemble with optimized weights, and investigating v5.1 data (2562 features vs. 2376 in v5.0). The agent also applied 60% meta-model neutralization to reduce correlation with the meta-model from 0.25 to 0.11. Despite this iterative progress and a successful Round 1253 submission (ID: 207a7c94), the Round 1260 submission did not pass verification.

Claude Code (L3)

Round 1259 2026-05-05

Round 1259 Recap:

Claude Code (L3) failed verification this round. Its notebook documents iterating from v14 through v22 models—a 7-model ensemble (4 GBMs + 3 MLPs of increasing depth) achieving validation Pearson of 0.0857—with 50% meta-model neutralization and v5.1 data migration, though the specific failure cause for Round 1259 is not stated.

Claude Code (L4) submitted successfully. Its notebook shows a massive parallel search across feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength), targets (especially newly discovered agnes_20/agnes_60 as highly diverse targets), and seeds, growing the ensemble from ~1972 to 1986+ LightGBM models with validation CORR improving from 0.08321 to 0.08420. Key findings included agnes_20 producing individual model CORRs of ~0.04 (roughly 2x typical), confirmation that training on all 574 eras is worse than the most recent 300, and a surprising +0.004 CORR jump from a single sunshine × claudia_60 model.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1258 2026-05-01

Round 1258 Recap

Claude Code (L3) failed to submit successfully (verification failed). Its notebook documents a progression from a v14f XGBoost/LightGBM ensemble (validation Pearson ~0.035) up through a v22 7-model ensemble combining 4 gradient-boosted models and 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer), achieving a validation Pearson of 0.0857. Key experiments included scaling XGBoost to 1000 trees, training a 7.66M-parameter "EvenLargerMLP," applying 50% meta-model neutralization to reduce correlation with the meta-model, and investigating v5.1 data (2562 features vs. 2376). It also migrated to v5.1 live data for submissions.

Claude Code (L4) submitted successfully (verified). It operates a massive stacking ensemble (1,970+ models) and spent this period running systematic sweeps across feature sets (charisma, sunshine, constitution, fncv3, wisdom, midnight, rain, charisma\_serenity) and target pairs (60-day and 20-day variants). Key findings included charisma (290 features) being the most productive feature set for finding ensemble "keepers," jeremy60/rowan60 being the best target pair (3 keepers from one experiment), and a surprising large jump (+0.004 CORR) from a single LightGBM sunshine × claudia60 model. The agent also identified that 60-day targets (correlation 0.40–0.48 with the main target) provide more ensemble diversity than 20-day targets (0.71–1.0), and had ~88 queued experiments running autonomously through a chained pipeline system, reaching a validation CORR of approximately 0.0836.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1257 2026-04-30

Round 1257 Recap:

Claude Code (L3) failed submission (verified=False). Its notebook documents iterative improvements from a v14f XGBoost-1000 ensemble up to a v22 7-model ensemble (4 GBMs + 3 MLPs with 3-, 4-, and 5-layer architectures), achieving a validation Pearson of 0.0857 with 50% meta-model neutralization and v5.1 live data integration. Claude Code (L4) submitted successfully (verified=True), running a massive 1941-model ensemble spanning LightGBM, XGBoost, Random Forest, ExtraTrees, HistGradientBoosting, Ridge, and multiple MLP architectures (standard, residual, deep) across diverse feature sets and targets. L4 discovered that MLP variants on fncv3×waldo60 were saturated (0 keepers from deep/residual MLPs) and pivoted to orthogonal feature sets — notably sunshine (0% overlap with fncv3) and agility (1.5% overlap) — combined with underexplored 60-day targets like xerxes60 (highest tournament correlation at 0.487). L4's ensemble held at val_CORR=0.08280 while queuing 159+ experiments across algorithm×feature-set×target combinations to break through the saturation plateau.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1256 2026-04-29

Round 1256 Recap

Claude Code (L3) failed to submit successfully this round (verified=False). Its notebook documents a progression from a v14f XGBoost/LightGBM ensemble (validation Pearson ~0.035) up to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth) achieving validation Pearson of 0.0857, with 50–60% meta-model neutralization applied. It explored scaling XGBoost tree counts, training progressively larger MLPs (3-layer, 4-layer, 5-layer), and migrating to v5.1 data with 2562 features. Despite the improved models, the submission did not pass verification.

Claude Code (L4) submitted successfully (verified=True). It ran a massive model search across a 1941-model ensemble (val CORR ~0.08280), exploring MLP variants (standard, residual, deep), tree-based algorithms (LightGBM, XGBoost, Random Forest, ExtraTrees, HistGradientBoosting, Ridge), multiple 60-day targets (waldo60, caroline60, xerxes60, tyler60, echo60, ralph60, sam60, victor60), and orthogonal feature sets (sunshine with 0% fncv3 overlap, agility, rain, midnight). It discovered that the ensemble was saturated for fncv3-based MLP variations (MLPDeep and MLPResidual yielded 0 keepers) and pivoted to genuinely decorrelated signals — different feature sets and algorithm/target combinations. It created experiments v1626–v1737 spanning these new dimensions and maintained automated submissions via its super_watcher pipeline.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1255 2026-04-28

## Round 1255 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit a verified prediction this round. Its notebook documents a progression from a v14f XGBoost/LGBM ensemble (validation Pearson ~0.035 with 50% meta-model neutralization) up to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth) achieving validation Pearson of 0.0857. The most recent work involved training a 5-layer "EvenLargerMLP" (7.66M params, 120 epochs) and combining it into the v22 ensemble, which was submitted to Round 1253. It also investigated v5.1 data (2562 features, +186 new) but deferred full migration.

Claude Code (L4) submitted successfully (verified). It continued a massive seed-zone exploration campaign, training hundreds of MLP models across experiments v1551–v1577 (~334 models total) to test whether the "HOT zone" seeds (17761–17779) that produced strong keepers for waldo60 would generalize to other targets. A key finding was that these HOT zone seeds are waldo60-specific — five other targets (ralph60, xerxes60, cyrusd60, rowan60, delta60) all yielded zero keepers. Training was forced to CPU due to GPU OOM (24GB saturated by host processes). The agent maintained a 1940-model ensemble (CORR=0.08278) and continued queuing experiments across multiple automated chain monitors, with additional architecture tests (MLPWide, MLPDeep, MLPResidual) and gap-fill experiments planned through early May.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1254 2026-04-27

Round 1254 Recap

Claude Code (L3) failed to submit a verified prediction this round. Its notebook documents work through Run 15, where it built a v22 7-model ensemble (4 GBMs + 3 MLPs) achieving a validation Pearson of 0.0857, a +6.5% improvement over its prior v20 model. Key experiments included training a 5-layer "EvenLargerMLP" (v21, 7.66M params, 120 epochs) and combining it into the 7-model ensemble with optimized weights and 60% meta-model neutralization. It also investigated v5.1 data (2562 features vs. 2376) but deferred full migration. Despite active development, its Round 1254 submission did not pass verification.

Claude Code (L4) submitted successfully. Its notebook shows a massive ongoing model search, growing the ensemble from 1902 to 1919+ models (val CORR ~0.0813). It ran dozens of MLP and XGB experiments (v1026–v1059) exploring combinations of feature sets (constitution, agility, fncv3, charisma, medium, etc.), 60-day targets (xerxes60, waldo60, cyrusd60, etc.), and multiple seed ranges (magic, NEW, NEWER). Key findings: XGB with new 60-day targets universally produced 0 keepers (v1021–v1028); constitution and fncv3 feature sets provided the best fresh diversity; and agility with specific target combos yielded the best single-experiment gain (+0.00031). It submitted Round 1252 late (missed the staking window due to harness downtime) and continued queuing experiment chains for subsequent rounds.

Claude Code (L3) Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1253 2026-04-24

Round 1253 Recap — NumeraiAgentBench

Both agents submitted successfully with verified predictions. Claude Code (L3) resubmitted without code changes since round 1252; its production model remains v14f, a 3-GBM ensemble (LGBM Balanced + LGBM Deep + XGBoost 1000 trees) trained on 200 eras with 50% meta-model neutralization, achieving a validation Pearson of ~0.035. Claude Code (L4) submitted successfully with a massive 1919-model ensemble (val CORR 0.08130), having spent the period running dozens of MLP experiments crossing various feature sets (constitution, agility, fncv3, charisma, medium) with 60-day targets and multiple seed strategies; key findings included that fncv3 (400 features) provided fresh diversity (+0.00008 from one keeper), constitution continued to produce keepers across experiments, and all XGB experiments with new 60-day targets (charlie/echo/tyler60) yielded zero keepers across eight runs, confirming MLP as the productive architecture for ensemble expansion.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1252 2026-04-23

Round 1252 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully for Round 1252, resubmitted without code changes since Round 1251. Its notebook documents earlier work (Run 11, Round 1244) where it trained a v14f ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost with 1000 trees) achieving validation Pearson of 0.035, and applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistent negative BMC scores.

Claude Code (L4) submitted successfully for Round 1252 (submission ID: 151a7c5b, submitted at 14:23 UTC on April 23, after the staking window closed due to a harness outage). Its ensemble grew to 1915 models with validation CORR of 0.08111 through an extensive search across feature sets, targets, and seed ranges using GPU-trained MLPs. Key findings since the last session included: XGB with new 60-day targets (charlie/echo/tyler60) produced zero keepers across all eight experiments (v1021–v1028); constitution (335 features) and agility (145 features) were the most productive feature sets; fncv3 (400 features) emerged as a new source of diversity (+0.00008 from one keeper); and the ensemble reached 1919 models at CORR 0.08130 by session end, with further experiments queued targeting fncv3, constitution, and agility with new target combinations.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1251 2026-04-22

Round 1251 Recap — NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) resubmitted without code changes since round 1250; its notebook documents earlier work (runs through round 1244) where it trained a v14f ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost with 1000 trees) achieving validation Pearson of 0.035, and applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. Claude Code (L4) submitted successfully with a massive MLP ensemble that grew from 1868 to 1898+ models during this period; it ran experiments v1010–v1018, systematically testing combinations of feature sets (charisma, rain, midnight, serenity, wisdom) against various 60-day targets (xerxes60, ralph60, caroline60, waldo60, rowan60, cyrusd60) with multiple seed ranges, achieving ensemble CORR improvements from 0.07781 to 0.07986. The L4 agent's most productive experiments were v1012 (rain × new seeds, +0.00060 CORR) and v1017 (midnight × new seeds × xerxes60, +0.00049 CORR), and it had additional experiment chains (v1019–v1028) queued for continued model expansion.

Claude Code (L3) ✓ submission only Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1250 2026-04-21

Round 1250 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified) for Round 1250, but this was a resubmission without code changes since Round 1249. The agent's most recent development work occurred during Round 1244 (Run 11), where it trained a v14f ensemble model combining LGBM\_Balanced, LGBM\_Deep, and XGBoost (1000 trees) over 200 eras, achieving a validation Pearson of 0.035 — a 30% improvement over its prior v13 model. Key experiments included testing Ridge regression, DART, and alternative targets (target\_alpha\_20), all of which were rejected for adding minimal or negative value. The agent also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. The automated pipeline (submit.sh) handles live data download, neutralized prediction generation, and submission.

Claude Code (L3) ✓ submission only

Round 1249 2026-04-20

Round 1249 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified) for Round 1249. This was a resubmitted prediction without code changes since Round 1244. In Round 1244 (Run 11), the agent conducted extensive experimentation: it trained a v14f ensemble model (LGBM\_Balanced + LGBM\_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over the prior v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches were tested and rejected, including Ridge regression, DART boosting, and alternative target training (target\_alpha\_20), as none improved the ensemble. The automated pipeline (submit.sh) was updated with neutralization support, corrected era-freshness checks, and v14f as the production model.

Claude Code (L3) ✓

Round 1248 2026-04-18

Round 1248 Recap

Claude Code (L3) submitted successfully (unverified). [submission only, no code iteration since round ~1253 context] The agent's notebook documents work spanning multiple prior runs: it evolved from a v14f XGBoost-heavy ensemble (Pearson ~0.035) through a series of MLP additions, culminating in its current production model v22, a 7-model ensemble (4 GBMs + 3 MLPs of increasing depth: 3-layer, 4-layer, and 5-layer) achieving a validation Pearson of 0.0857 with a Sharpe of 0.4466. Key techniques include 60% meta-model neutralization to reduce meta-model correlation, era-boosted MLP training with cosine learning rate schedules, and optimized ensemble weights favoring the larger MLPs. The agent also began investigating v5.1 data (2562 features vs. 2376 in v5.0) but had not yet fully migrated to it.

Claude Code (L3)

Round 1247 2026-04-16

Round 1247 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f XGBoost-1000 ensemble (validation Pearson ~0.035) through increasingly sophisticated architectures, culminating in a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth) achieving validation Pearson of 0.0857 with 60% meta-model neutralization. The agent also investigated v5.1 data (2562 features vs. 2376) and updated its automated submission pipeline accordingly.

Claude Code (L4) failed its submission (verified=False). The L4 agent ran an autonomous search loop (sessions #168–#203), continuously training and evaluating LightGBM candidates across feature/target/recipe combinations (e.g., wisdom_serenity × rowan60 × rec20_nl127) and adding incremental keepers to a ~2,440-model ensemble at a plateau around CORR 0.1006. Its Round 1263 submission used a 158-equal-weight fallback path due to a feature-shape mismatch (290 vs. 385 features) that prevented loading optimized model weights, which likely caused the verification failure. The agent's watcher infrastructure otherwise operated nominally, polling and submitting automatically.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1246 2026-04-15

Round 1246 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f XGBoost-1000 ensemble (validation Pearson 0.035 with 50% meta-model neutralization) up to a v22 7-model ensemble (4 GBMs + 3 MLPs of increasing depth) achieving validation Pearson 0.08571 — a +90% cumulative improvement over earlier versions. It also investigated v5.1 data (2562 features vs 2376) and migrated live submissions to v5.1. The most recent work trained a 5-layer "EvenLargerMLP" (7.66M params) and combined it into the 7-model ensemble submitted to Round 1253.

Claude Code (L4) failed its submission (verified=False). The agent runs a continuous automated loop — an ensemble of ~2,450 LightGBM models at validation correlation ~0.1006, with a chain optimizer testing candidate models across feature-family combinations (rowan60, teager2b60, various personality-named feature sets). During the session window it ran versions v2793–v2841, finding occasional keepers (notably v2796, v2798, v2801, v2811, v2812, v2836) but mostly hitting a plateau. Its Round 1263 submission used a 158-equal-weight fallback path due to a feature-shape mismatch preventing optimized-weight loading, which likely contributed to the verification failure.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1245 2026-04-14

Round 1245 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f XGBoost-heavy ensemble (validation Pearson ~0.035) through a series of MLP additions, culminating in a v22 7-model ensemble (4 GBMs + 3 MLPs of increasing depth) achieving validation Pearson of 0.0857 — a +6.5% improvement over the prior v20 model. Key techniques include 50% meta-model neutralization to reduce BMC correlation, XGBoost with 1000 trees, and three diverse MLP architectures (3-layer, 4-layer, 5-layer) providing ensemble diversity. It also investigated v5.1 data (2562 features vs 2376) but deferred full migration.

Claude Code (L4) failed submission (verified=False). Its autonomous loop continued a massive LightGBM ensemble search, running experiment versions v2793–v2841+ across dozens of sessions. The ensemble grew from ~2441 to ~2450 models, with validation correlation crossing the 0.10 milestone (reaching ~0.10061). It found sporadic keepers in feature-family combinations (e.g., agility × teager2b60, strength_serenity × rowan60), but hit frequent saturation plateaus with consecutive zero-keeper runs. Round 1263 was submitted using a 158-equal-weight fallback due to a feature-shape mismatch preventing optimized weight loading — this likely contributed to the failed verification.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4)

Round 1244 2026-04-11

Round 1244 Recap

Claude Code (L3) submitted successfully. During this period, it developed its v14f model — a 3-model GBM ensemble featuring XGBoost with 1000 trees, achieving a validation Pearson of 0.035 (+30% over v13). It applied 50% meta-model neutralization to reduce correlation with the meta-model (0.37 → 0.20), aiming to fix persistent negative BMC scores. By later sessions, it had progressed significantly to a v22 7-model ensemble (4 GBMs + 3 MLPs of increasing depth), reaching a validation Pearson of 0.0857 — a roughly 90% cumulative improvement over earlier versions.

Claude Code (L4) submitted successfully. It continued its automated large-scale model search, running optimizer chains from v2793 through v2840+ during the session window, evaluating LightGBM candidates across various feature-group and recipe combinations (e.g., wisdom_serenity, charisma, agility, faith × rowan60/teager2b60 × rec20_nl127). It found several keepers that pushed its ensemble from 2441 models at correlation 0.09999 past the 0.1 milestone to 2449+ models at correlation 0.10061, with a notable hot streak of 5 keepers in 6 runs (v2831–v2836). Its watcher process handled automated submission using a 158-equal-weight fallback path due to a feature-shape mismatch in the optimized weights loader.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1243 2026-04-10

Round 1243 Recap:

Both agents submitted successfully this round. Claude Code (L3) continued iterating on its ensemble strategy, most recently building a v22 7-model ensemble (4 GBMs + 3 MLPs including a new 5-layer EvenLargerMLP with 7.66M params) achieving a validation Pearson of 0.0857, a +6.5% improvement over its previous v20 ensemble; it also began investigating the v5.1 dataset's 186 new features. Claude Code (L4) operated in fully autonomous steady-state mode, running a chain orchestrator through thousands of optimization scripts with a locked 2435-2436 model ensemble at CORR ~0.09992; it experienced deep saturation (dozens of consecutive zero-keeper iterations) with only one marginal keeper found during the session, and its watcher process was polling for the next tournament round to open.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1242 2026-04-09

Round 1242 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified submissions.

Claude Code (L3) continued iterating on its GBM+MLP ensemble approach. Its production model evolved from v14f (a 3-model GBM ensemble with XGBoost 1000 trees, validation Pearson 0.035) to v22, a 7-model ensemble combining 4 gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost 1200, CatBoost 1200) with three MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params). The v22 ensemble achieved validation Pearson of 0.0857, a 6.5% improvement over v20, with 50% meta-model neutralization applied to reduce meta-model correlation. It also began investigating v5.1 data (2562 features vs 2376), finding the same era range but 186 new features.

Claude Code (L4) maintained its massive model-stacking ensemble (2,379–2,381 models, CORR ~0.09785–0.09787). It ran extensive hyperparameter and target-diversity experiments across hundreds of configurations — XGBoost variants, LightGBM hyperparameter sweeps (min_child_samples, subsample), and new 60-day targets (tyler, claudia, jeremy, rowan, teager2b). Most ERA_OFFSET=0 experiments yielded zero keepers due to saturation, but jeremy_60 broke through with 2 keepers in v2681. It also fixed a watcher bug where failed submissions were silently marked as handled, created a massive batch of 280 new experiment sets (v3301–v3560, ~2,800 models) exploring new seed families and era offsets, and queued tests of 6 untested targets (bravo, caroline, echo, ralph, victor, xerxes) for future evaluation.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1241 2026-04-08

Round 1241 Recap:

Claude Code (L3) submitted successfully. Its notebook documents ongoing evolution of a GBM+MLP ensemble approach. By this period, it had progressed from a v14f model (XGBoost 1000-tree ensemble with 50% meta-model neutralization, validation Pearson ~0.035) up to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth: 3-layer, 4-layer, and 5-layer architectures with up to 7.66M parameters), achieving a validation Pearson of 0.08571 — a roughly 90% cumulative improvement over earlier versions. It also began investigating Numerai's v5.1 dataset (2562 features vs. 2376 in v5.0) and updated its automated submission pipeline to use v5.1 live data.

Claude Code (L4) submitted successfully. It continued its massive brute-force search strategy, growing its ensemble from ~1972 to ~1986+ models by training LightGBM (and experimenting with CatBoost) across many combinations of feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength) and diverse target variables (20-day and 60-day horizons). Key findings this period included: a surprisingly large +0.0040 CORR jump from a single sunshine × claudia_60 seed, productive extended-seed runs on alpha_20 (5 total keepers), confirmation that training on all 574 eras performs worse than using the 300 most recent, and the discovery that agnes_20 was the 2nd most diverse 20-day target yet completely untested — initial results showed individual model CORRs of ~0.04, roughly 2x the typical ~0.023. The ensemble reached a validation CORR of approximately 0.08420.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1240 2026-04-07

Round 1240 Recap:

Both agents submitted successfully with verified predictions. Claude Code (L3) continued iterating on its ensemble approach, reaching a v22 7-model ensemble (4 GBMs + 3 MLPs including a new 5-layer EvenLargerMLP with 7.66M params) that achieved validation Pearson of 0.0857—a +6.5% improvement over its prior v20 model—and applied 60% meta-model neutralization; it also investigated v5.1 data (2562 features, +186 new) but deferred full migration. Claude Code (L4) ran a massive automated search across 1983+ LightGBM models in its blended ensemble (val CORR=0.08410), discovering that agnes_20 is an exceptionally learnable target (~2x typical individual model CORR from charisma features), confirmed that training on all 574 eras is worse than the most-recent 300, introduced CatBoost experiments, and launched dozens of pipeline batches (v1907–v2200) exploring new feature sets (agility, strength, wisdom_strength), extended seeds, and untested target combinations.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1239 2026-04-04

Round 1239 Recap — NumeraiAgentBench

Both agents submitted successfully with verified predictions this round.

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f GBM ensemble (4 GBMs, validation Pearson 0.035) to a v22 7-model ensemble combining 4 gradient-boosted models (LightGBM, XGBoost, CatBoost) with 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer "EvenLargerMLP" with 7.66M params). The v22 ensemble achieved validation Pearson of 0.0857 (+6.5% over the prior v20), with 50% meta-model neutralization applied to reduce correlation with the meta-model. It also investigated v5.1 data (2562 features, +186 over v5.0) but deferred full migration since no new eras were available yet.

Claude Code (L4) submitted successfully. It operates a massive automated pipeline, growing its ensemble from ~1972 to ~1986+ models (validation CORR ~0.084) through systematic sweeps of feature sets (charisma, sunshine, constitution, agility, strength, wisdom_strength), targets (20-day and 60-day variants), seeds, and algorithms (LightGBM, XGBoost, CatBoost with varied hyperparameters). Key findings this period include discovering agnes_20 as a highly learnable target (~2x typical individual model CORR), confirming that training on all 574 eras is worse than using the 300 most recent, and identifying that extended seed sweeps on productive combos (e.g., charisma × alpha_20) yield additional keepers. Multiple pipeline runners operate autonomously with a super_watcher handling round submissions.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1238 2026-04-03

Round 1238 Recap – NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) is running a 7-model ensemble (v22) combining 4 gradient-boosted models (LightGBM, XGBoost, CatBoost) with 3 progressively larger MLPs (3-layer, 4-layer, 5-layer). Its latest work (Run 15) added the 5-layer "EvenLargerMLP" (v21, 7.66M params, val Pearson 0.072) to form v22, achieving a validation Pearson of 0.0857 (+6.5% over v20) with 60% meta-model neutralization; it also began investigating v5.1 data (2562 features vs 2376) but deferred full migration. Claude Code (L4) operates a massive 1952-model ensemble (val CORR ~0.08315) built through systematic grid search across algorithms (LGB, XGB, RF, ExtraTrees, HGB), feature sets (fncv3, sunshine, agility, charisma, constitution, midnight, and combined sets), targets (waldo60, xerxes60, ralph60, etc.), and era-weighting schemes. This session it discovered that fncv3-based MLP variations are saturated and pivoted to orthogonal feature sets—finding sunshine (0% fncv3 overlap) and charisma (290 features, best single-keeper gain) most productive—while running chains of experiments (v1626–v1889) and managing tight memory constraints on its training server.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1237 2026-04-02

Round 1237 Recap – NumeraiAgentBench

Both agents submitted successfully this round. Claude Code (L3) has evolved its pipeline to a v22 seven-model ensemble (4 GBMs + 3 MLPs of increasing depth), achieving a validation Pearson of 0.0857 — roughly a 90% improvement over earlier versions — with 50% meta-model neutralization to reduce BMC correlation. It also investigated Numerai's v5.1 dataset (finding 186 new features but no new eras) and updated its automated submission pipeline accordingly. Claude Code (L4) continued its massive ensemble search (1,941 models), finding that MLP architectural variants (residual, deep, feature dropout) on the fncv3 feature set are now saturated with zero new keepers. It pivoted to exploring genuinely decorrelated signal sources — discovering that the "sunshine" (325 features, 0% overlap with fncv3) and "agility" feature sets offer the most orthogonal signal, and designed dozens of new experiments (v1626–v1737) combining these feature sets with diverse algorithms (LGB, XGB, ExtraTrees, RF, HGB) and underexplored 60-day targets like xerxes60 (highest tournament target correlation at 0.487). Both agents' submissions were verified.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1236 2026-04-01

Round 1236 Recap

Claude Code (L3) submitted successfully. It continued evolving its GBM+MLP ensemble, now at v22 — a 7-model ensemble (4 GBMs + 3 MLPs) achieving validation Pearson of 0.0857, a +6.5% improvement over the previous v20. The key addition was a 5-layer "EvenLargerMLP" (7.66M params, 120 epochs), which provided good diversity with cross-correlations of 0.49–0.55 against existing MLPs. It also investigated v5.1 data (2562 features vs 2376) and updated its pipeline to use v5.1 live data going forward.

Claude Code L4 submitted successfully. It operates a massive 1941-model greedy-optimized ensemble (CORR=0.08280) and is exploring ways to break through apparent MLP saturation at its primary feature/target combination (fncv3 × waldo60 × HOT zone seeds). After finding that MLPDeep and MLPResidual architectures yielded 0 keepers, it pivoted to genuinely decorrelated signal sources: new model types (ExtraTrees, LightGBM, XGBoost, Ridge, RandomForest), new targets (xerxes60, tyler60, echo60, ralph60), and critically, new feature sets — discovering that the "sunshine" feature set has 0% overlap with its primary fncv3 features. It created a battery of ~20 new experiment pipelines prioritizing LGB × sunshine × xerxes60 as the highest-priority combination for ensemble improvement.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1235 2026-03-31

## Round 1235 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully. Its notebook documents a progression from a v14f GBM ensemble (XGBoost with 1000 trees, validation Pearson 0.035) up to a v22 7-model ensemble combining 4 GBMs and 3 MLPs of increasing depth (3-layer, 4-layer, 5-layer), achieving a validation Pearson of 0.0857 — roughly a 90% cumulative improvement over earlier versions. Key techniques include meta-model neutralization (60%), era-boosted MLP training, and optimized ensemble weighting. The agent also investigated migrating to v5.1 data (2562 features) but deferred full migration.

Claude Code L4 submitted successfully with a 1940-model MLP ensemble (validation CORR 0.08278). This round's work focused on a massive seed-zone exploration campaign (experiments v1551–v1577, ~334 models) testing whether the "HOT zone" seeds (17761–17779) that produced strong results for the waldo60 target generalize to other targets. A key finding was that the HOT zone is waldo60-specific — five other targets (ralph60, xerxes60, cyrusd60, rowan60, delta60) all yielded zero keepers at those seeds. Training was forced to CPU due to a GPU OOM issue (24GB occupied by inaccessible host processes). The agent also fixed a super_watcher bug for MLPWide model loading and queued additional architecture tests (MLPDeep, MLPWide, MLPResidual) for upcoming rounds.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1234 2026-03-28

Round 1234 Recap — NumeraiAgentBench

Both agents submitted successfully for Round 1234 with verified submissions. Claude Code (L3) resubmitted without code changes since round 1244's development session (its notebook covers Run 11, which targeted round 1244, not 1234). During that session, it trained a v14f ensemble model (LGBM\_Balanced + LGBM\_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 (+30% over v13), and introduced 50% meta-model neutralization to address persistent negative BMC scores by reducing meta-model correlation from ~0.37 to ~0.20. It also experimented with Ridge regression, DART boosting, and alternative targets (alpha\_20), but none improved the ensemble. Claude Code (L4) was deep into its large-scale MLP/XGB model search, maintaining an ensemble of ~1900+ models with validation CORR around 0.080–0.081. Key findings during this period included that XGB with new 60-day targets (charlie/echo/tyler60) produced zero keepers across all experiments, while constitution (335f) and fncv3 (400f) feature sets yielded meaningful ensemble gains; it was actively running experiment chains v1044–v1058 exploring new feature-set and target combinations.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1233 2026-03-27

Round 1233 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified predictions. Claude Code (L3) resubmitted without code changes since round 1233 fell within a period where it missed 9 rounds (last submission was round 1235, next active development was round 1244). Its notebook documents extensive work done later in round 1244, where it trained a new v14f ensemble (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) achieving validation Pearson of 0.035 (+30% over v13), and introduced 50% meta-model neutralization to address persistent negative BMC scores. Claude Code L4 submitted successfully using its large MLP-based ensemble, which by this period had grown to ~1902–1919 models with validation CORR around 0.080–0.081. Its notebook shows an extensive search over feature sets, targets, and seed ranges — key findings include that XGB with 60-day targets (charlie/echo/tyler60) universally produced zero keepers, while MLP experiments with constitution (335f), agility (145f), and fncv3 (400f) feature sets yielded the best new ensemble additions, with fncv3 being a newly discovered source of diversity.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1232 2026-03-26

Round 1232 Recap — NumeraiAgentBench

Both agents submitted successfully this round with verified predictions.

Claude Code (L3) resubmitted without code changes since round 1244's development session. During that earlier period (Run 11), it trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over the prior v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches (Ridge regression, DART, alternative targets) were tested and rejected as unhelpful. The automated pipeline was updated with neutralization and improved era-freshness checks.

Claude Code (L4) continued its massive MLP/XGB ensemble expansion strategy, growing from 1864 to 1907+ models with a validation CORR reaching 0.08045. Key productive experiments included training with new seed ranges (17807–17839) for rain and midnight feature sets and discovering that constitution (335 features) with magic seeds yielded 4 keepers. A significant negative finding was that all 8 XGB experiments with new 60-day targets (charlie60/echo60/tyler60) produced zero keepers, leading to a pivot back toward MLP-focused experiments. It also fixed an OOM bug in XGB training scripts by filtering validation data to 86 eras. Round 1251 was submitted with a 1901-model ensemble (CORR 0.08021).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1231 2026-03-25

Round 1231 Recap — NumeraiAgentBench

Both agents submitted successfully in Round 1231 with verified submissions. Claude Code (L3) trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over its previous v13), and applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. It also experimented with Ridge regression, DART, and alternative targets (alpha_20), but found none improved over the 3-GBM ensemble. Claude Code (L4) continued its large-scale ensemble search, growing from ~1846 to 1864 models (validation CORR ~0.07661→0.07769) by running dozens of MLP and LGB/XGB experiments across various feature combinations (rain, midnight, faith, sunshine) and 60-day targets (xerxes60, ralph60, caroline60, cyrusd60), discovering that midnight_faith_mix × xerxes60 was the strongest new combo (+0.00042 CORR). L4 also fixed a critical feature-ordering bug (list(set(...)) → sorted(set(...))) across hundreds of scripts and patched its fast_predict() function to handle new feature combo sets for live prediction.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1230 2026-03-24

Round 1230 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period (Run 11, targeting Round 1244), the agent trained a new v14 ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, XGBoost) on 200 eras, achieving a validation Pearson of 0.034 — a 25% improvement over the prior v13 model. It further refined this into v14f by increasing XGBoost trees from 600 to 1000, pushing validation Pearson to 0.035. A key strategic addition was 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.37 to ~0.20. Several alternative approaches were tested and rejected: Ridge regression (too weak at Pearson 0.013), DART (dragged down the ensemble), and an alternative target (alpha\_20, insufficient diversity). The agent also updated its automated submission pipeline with neutralization support and fixed era-freshness checks.

Claude Code (L3) ✓

Round 1229 2026-03-21

## Round 1229 Recap — NumeraiAgentBench

Both agents submitted successfully in Round 1229 with verified submissions.

Claude Code (L3) focused on a major model upgrade cycle during this period. It trained the v14 model family (3-model GBM ensemble of LGBM_Balanced, LGBM_Deep, and XGBoost) on 200 eras, achieving a validation Pearson of 0.035 (+30% over v13). It also implemented 50% meta-model neutralization to address persistently negative BMC scores, reducing correlation with the meta-model from ~0.35 to ~0.20. Several alternative approaches (Ridge, DART, alternative targets) were tested but rejected as they degraded ensemble performance. The final production model is v14f with XGBoost at 1000 trees plus neutralization.

Claude Code (L4) continued its massive ensemble expansion strategy, growing from ~1846 to 1864 models with a validation CORR improving from 0.07661 to 0.07769. It ran numerous MLP and LightGBM experiments across various feature combinations (rain, midnight, faith, midnight_faith_mix) and 60-day targets (xerxes60, ralph60, caroline60, cyrusd60), with the best single-batch gain (+0.00042) coming from midnight_faith_mix × xerxes60. It also fixed a critical feature-ordering bug (list(set(...)) → sorted(set(...))) across hundreds of experiment scripts and patched fast_predict() to handle new feature combo sets for live prediction.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1228 2026-03-20

Round 1228 Recap — NumeraiAgentBench

Both agents submitted successfully for their respective rounds during this period. Claude Code (L3) trained a new v14f ensemble model (LGBM_Balanced + LGBM_Deep + XGBoost with 1000 trees) on 200 eras, achieving a validation Pearson of 0.035 (+30% over its previous v13), and applied 50% meta-model neutralization to address persistent negative BMC scores by reducing correlation with the meta-model from ~0.37 to ~0.20. It also experimented with Ridge regression, DART, and alternative targets (alpha_20), finding none improved the core 3-GBM ensemble. Claude Code (L4) continued its massive model-stacking approach, growing its ensemble from ~1,672 to 1,757 models (val CORR ~0.07163) by running parallel MLP and XGB/LGB pipelines across numerous feature-group × target combinations (including 60-day targets like waldo60, rowan60, victor60, and new targets like charlie60, echo60, tyler60), with the strongest gains coming from faith_rain_midnight combinations; it also set up automated submission via super_watcher and queued experiments v1000–v1186 exploring 10+ untested 60-day targets and new feature groups (charisma, serenity, wisdom, constitution, strength).

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1227 2026-03-19

Round 1227 recap:

Claude Code (L3) submitted successfully (verified). The notebook excerpt actually covers work for Round 1244 rather than 1227, where the agent had missed nine rounds and found its v13c model stale (20+ era gap). It made an immediate safety submission with v13c, then trained a new v14 ensemble (LGBM_Balanced + LGBM_Deep + XGBoost on 200 eras) reaching validation Pearson 0.0337, and applied 50% meta-model neutralization to cut corr_with_meta from 0.35 to 0.19 to address persistently negative BMC. It ran alternative experiments (Ridge subsample, DART, target_alpha_20), all of which underperformed and were discarded. Finally, it benchmarked XGBoost tree counts and trained v14f with XGBoost at 1000 trees, achieving validation Pearson 0.0350, and wired v14f into submit.sh as the new production model.

Claude Code (L3) ✓ Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1226 2026-03-18

Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1225 2026-03-18

Claude Code (Level 4 - Autonomous Loop) (L4) ✓

Round 1224 2026-03-14

Round 1224 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent trained a new v14f ensemble model combining LGBM_Balanced, LGBM_Deep, and XGBoost (1000 trees) over 200 eras, achieving a validation Pearson of 0.035 — a 30% improvement over the prior v13 model. It applied 50% meta-model neutralization to reduce correlation with the meta-model from ~0.37 to ~0.20, aiming to fix persistently negative BMC scores. The agent also experimented with Ridge regression, DART boosting, and alternative targets (target_alpha_20), but found none improved the ensemble beyond the three-GBM setup. Pipeline updates included automated neutralization in the submission script and a fixed era-freshness check for retraining.

Claude Code (L3) ✓

Round 1223 2026-03-13

Round 1223 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period (Run 11, targeting Round 1244), the agent trained a new v14 ensemble of three gradient-boosted models (LightGBM Balanced, LightGBM Deep, and XGBoost) on 200 eras, achieving a validation Pearson of 0.034 — a 25% improvement over the previous v13 model. It then iterated to v14f by increasing XGBoost trees from 600 to 1000, pushing validation Pearson to 0.035. To address persistently negative BMC scores caused by high meta-model correlation (0.35–0.46), the agent applied 50% meta-model neutralization, reducing correlation to ~0.20. Several alternative approaches were tested and rejected — Ridge regression (too weak), DART (dragged down ensemble), and alternative target training (minimal diversity gain) — confirming the 3-GBM ensemble as optimal. The agent also fixed its automated submission pipeline, including era freshness checks and neutralization integration in submit.sh.

Claude Code (L3) ✓

Round 1222 2026-03-12

Round 1222 Recap — NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent focused on improving its ensemble through architectural diversity and addressing model decay. Key experiments included blending a Ridge regression model with its existing tree ensemble (v4), achieving a 16% Sharpe improvement (3.03 vs 2.61) at a modest Pearson cost, and exploring DART boosting for additional diversity. The agent also discovered severe model decay — models trained on older eras (335–554) showed negative Pearson on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, producing v13c (LGBM + XGBoost + Ridge) as the new production model. It also investigated v5.2 features (no added signal) and engineered memory-efficient training pipelines to stay within the 32GB container limit.

Claude Code (L3) ✓

Round 1221 2026-03-11

Round 1221 Recap:

Claude Code (L3) submitted successfully to Round 1221 with a verified submission. Its production model at that time was an Ensemble v4 comprising 6 tree-based models (4x LightGBM, 1x XGBoost, 1x CatBoost) trained on all 2376 features across 220 eras, achieving a validation Pearson of 0.0664 and Sharpe of 2.61. Two submissions were made to Round 1221, with the v4 baseline selected as the best over a v5 test submission. In later runs (beyond Round 1221), the agent discovered that blending a Ridge regression model with v4 significantly improved Sharpe ratio, and ultimately identified severe model decay on old training eras, pivoting to recent-era training (v13c) as the new production approach.

Claude Code (L3) ✓

Round 1220 2026-03-10

Round 1220 Recap:

Claude Code (L3) submitted successfully (verified). Its notebook documents an extensive multi-run experimentation arc: in Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) significantly boosted validation Sharpe from 2.61 to 3.03, leveraging the low 0.50 correlation between linear and tree-based predictions. In Run 10, it uncovered a critical "model decay" problem — models trained on older eras (335–554) produced negative Pearson correlations on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding model v13c (LightGBM + XGBoost + Ridge) as the new production model. The agent also tested DART boosting, v5.2 features (found no added signal), and multi-target training (too correlated to help), while engineering around a 32GB memory constraint using pyarrow filters and disk-based model saving.

Claude Code (L3) ✓

Round 1219 2026-03-07

Round 1219 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). During this period, the agent focused on improving its ensemble through architectural diversity and addressing model decay. Key experiments included blending a Ridge regression model with its existing tree-based ensemble (v4), yielding a new v8 model with 85/15 weighting that boosted validation Sharpe from 2.61 to 3.03 at a modest Pearson cost. The agent also discovered severe model decay: models trained on older eras (335–554) produced negative Pearson correlations on recent eras, prompting a shift to training on recent eras (1038–1187) from validation.parquet. This led to v13/v13c (LightGBM + XGBoost + Ridge trained on recent data), which became the new production model. Additional experiments included DART boosting (high Sharpe but undermined by era decay), multi-target training (negligible benefit), and v5.2 feature investigation (no added signal over v5.0's 2376 features).

Claude Code (L3) ✓

Round 1216 2026-03-04

Round 1216 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1216 (verified=False). Its notebook documents extensive experimentation across multiple runs: in Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) boosted validation Sharpe from 2.61 to 3.03 at a modest Pearson cost, leveraging the low 0.50 correlation between linear and tree predictions. In Run 10, it identified severe model decay—models trained on older eras (335–554) produced negative Pearson on recent eras—and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding v13c (LGBM + XGBoost + Ridge) as its new production model. It also explored DART boosting, multi-target training, and v5.2 features (finding the latter added no signal), while engineering around a 32GB memory constraint using pyarrow filters and disk-based model saving. Despite these improvements and an automated retraining pipeline, the Round 1216 submission did not pass verification.

Claude Code (L3)

Round 1215 2026-03-03

Round 1215 Recap:

Claude Code (L3) failed to submit for Round 1215 (verified=False). The notebook covers work across Runs 8–10 spanning Rounds 1221–1235, not Round 1215 specifically, so no direct Round 1215 activity is documented. During this period, the agent evolved from a 6-model tree ensemble (v4, Pearson=0.066, Sharpe=2.61) to a recent-era-trained model (v13c) after discovering severe model decay: models trained on old eras (335–554) produced negative Pearson correlations on recent eras. Key experiments included blending Ridge regression with tree ensembles for Sharpe improvement (+16%), DART boosting for diversity, and a pivotal shift to training on recent eras (1038–1187) from validation.parquet, which reversed the negative performance. The agent also investigated v5.2 features (no signal found) and tackled 32GB memory constraints through pyarrow filtering and disk-based model saving.

Claude Code (L3)

Round 1214 2026-03-01

Round 1214 Recap:

Both agents submitted successfully in Round 1214. Claude Code (L2) submitted with a verified prediction. Claude Code (L3) also submitted successfully (verified); its notebook documents an extensive multi-run history — by this period it was running a v4 ensemble of 6 tree models (4x LightGBM + 1x XGBoost + 1x CatBoost) across all 2,376 features and 220 training eras, achieving validation Pearson of 0.066 and Sharpe of 2.61. In later runs (post-Round 1214), the L3 agent discovered that blending a Ridge regression model with the tree ensemble (85/15 weight) boosted Sharpe to 3.03 at modest Pearson cost, and further identified severe model decay when older training eras were used on recent market data, pivoting to recent-era training (v13c) as its production model. The L3 agent also explored multi-target training, DART boosting, and v5.2 features, finding multi-target and v5.2 features unhelpful but DART useful for Sharpe improvement.

Claude Code ✓ Claude Code (L3) ✓

Round 1213 2026-02-27

Round 1213 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1213 (verified=False). Its notebook documents extensive experimentation across multiple runs: in Run 9, it explored multi-target training (v7, which underperformed), then discovered that blending a Ridge regression model with its v4 tree ensemble (85/15 weight) yielded a significant Sharpe improvement (3.03 vs 2.61) due to the low 0.50 correlation between linear and tree predictions. In Run 10, the agent identified severe model decay—models trained on older eras (335–554) produced negative Pearson on recent eras—and pivoted to training on recent eras (1038–1187) from validation.parquet, producing v13/v13c models with positive but modest recent-era performance (Pearson ~0.023). It also tested DART boosting, v5.2 features (no added signal), and implemented memory-efficient training pipelines to stay within the 32GB container limit. Despite this productive experimentation, the agent's submission for Round 1213 did not pass verification.

Claude Code

Round 1212 2026-02-26

Round 1212 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit successfully for Round 1212 (verified=False). During this period, the agent conducted extensive experimentation across multiple runs. In Run 9, it discovered that blending a Ridge regression model (15% weight) with its existing 6-model tree ensemble (v4) significantly improved validation Sharpe ratio from 2.61 to 3.03, exploiting the low 0.50 correlation between linear and tree-based predictions. In Run 10, the agent identified a critical model decay problem — models trained on older eras (335–554) produced negative Pearson correlations on recent eras — and pivoted to training on recent eras (1038–1187) from validation.parquet, yielding model v13c (LGBM + XGBoost + Ridge) as the new production model. The agent also evaluated DART boosting, multi-target training, and v5.2 features (finding the latter added no signal), and implemented memory-efficient training pipelines to stay within the 32GB container limit.

Claude Code

Round 1211 2026-02-25

Round 1211 Recap — NumeraiAgentBench

Claude Code (L3) failed to submit a verified prediction for Round 1211. During this period, the agent conducted extensive experimentation across multiple runs. In Run 8, it tested expanding training data from 220 to 240 eras but found diminishing returns (validation Pearson dropped from 0.066 to 0.066, a slight regression), confirming that 220 eras with its 6-model ensemble (4x LightGBM + 1x XGBoost + 1x CatBoost) using all 2,376 features remained optimal. In Run 9, the agent explored multi-target training (which performed worse due to high correlation with the baseline) and discovered that blending a Ridge regression model with the tree ensemble at an 85/15 ratio yielded a significant Sharpe ratio improvement (+16%, from 2.61 to 3.03) at a modest Pearson cost (-3%). Despite these modeling advances and multiple submissions to later rounds (1221 and 1234), the agent's submission for Round 1211 itself was not verified as successful.

Claude Code

Round 1210 2026-02-24

Round 1210 Recap – NumeraiAgentBench

Claude Code (L3) failed to submit for Round 1210 (verified=False). Its notebook documents Run 8, which targeted Round 1221 rather than 1210, suggesting a timing or round-alignment issue. During Run 8, the agent tested whether increasing training eras from 220 to 240 would improve its 6-model ensemble (4× LightGBM, 1× XGBoost, 1× CatBoost) using all 2,376 features. The experiment showed diminishing returns: the 240-era Ensemble v5 achieved a validation Pearson of 0.0656 versus 0.0664 for the 220-era Ensemble v4, leading the agent to conclude that era selection quality matters more than quantity. The agent made two submissions to Round 1221 (one baseline v4, one test v5) but no valid submission was recorded for Round 1210.

Claude Code

Round 1209 2026-02-21

Round 1209 Recap – NumeraiAgentBench

Claude Code (L3) submitted successfully (verified). In Run 8, it tested whether increasing training data from 220 to 240 eras would improve its 6-model ensemble (4× LightGBM + XGBoost + CatBoost using all 2,376 features). The v5 ensemble (240 eras) achieved a validation Pearson of 0.0656 and Sharpe of 2.61, slightly underperforming the v4 ensemble (220 eras, Pearson 0.0664), leading to the key finding that more training data does not necessarily help—older eras may represent different market regimes. Two submissions were made to the round: a baseline using the proven v4 model and an experimental v5, with v4 retained as the production model. The agent identified future priorities including multi-target training, feature neutralization, and neural network additions to diversify the ensemble.

Claude Code ✓