Can AI beat bettors — or the market? A new stress test says: not yet.
General Reasoning, a research team led by former Meta AI researcher Ross Taylor, put eight leading large language models to the test in a realistic, high-frequency sports-betting environment and gave frontier AI perhaps its harshest report card so far. The benchmark, KellyBench, simulates an entire 2023–24 English Premier League season and forces agents to decide stake sizes using the Kelly criterion, the classic 1956 formula that tells you what fraction of your bankroll to bet when you believe you have an edge.
The headline result: every model lost money. Several went fully bankrupt.
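For readers who have not met the Kelly criterion, here is a minimal Python sketch of the standard formula for a single bet at decimal odds. The function name and the example numbers are illustrative assumptions, not taken from the KellyBench code.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    p_win: the bettor's own estimate of the probability that the bet wins.
    decimal_odds: total payout per unit staked, stake included (2.0 = even money).
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    b = decimal_odds - 1.0  # net profit per unit staked if the bet wins
    f = (p_win * decimal_odds - 1.0) / b  # same as (b*p - (1 - p)) / b
    return max(f, 0.0)  # a non-positive edge means no bet


# Hypothetical bet: the model says 53%, even-money odds imply 50%.
full = kelly_fraction(0.53, 2.0)
half = 0.5 * full  # "fractional Kelly" scales the stake down to cut variance
print(f"full Kelly: {full:.3f} of bankroll, half Kelly: {half:.3f}")
```

In this made-up case a 3-percentage-point edge at even money justifies staking about 6% of the bankroll at full Kelly, and less under fractional Kelly. The formula is only as good as the probability estimate fed into it.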
What was tested
- Eight top models (including Anthropic’s Claude Opus 4.6, xAI’s Grok 4.20, Google’s Gemini Flash, and OpenAI’s GPT-5.4, plus GLM-5 and Kimi K2.5) were each given a virtual bankroll and told to build end-to-end ML betting strategies across the season.
- KellyBench is deliberately hard: 120 matchdays, shifting data, promoted teams with no history, and an adapting market that gets smarter week-by-week. That makes it far closer to real-world trading than most static AI benchmarks.
How badly did they fail?
- Grok 4.20 failed all three seeded runs: bankrupt in one and forfeited mid-season in the other two.
- Gemini Flash forfeited two of three runs after placing a single ~£273,000 wager on what it estimated was only a 3-percentage-point edge — and losing it.
- Claude Opus 4.6, the highest-scoring model on the sophistication rubric, still lost about 11% on average.
- GPT-5.4 was the most conservative: it ran 160 tool calls before concluding that its log-loss (0.974) was marginally worse than the market’s (0.971), meaning it had no demonstrable edge, so it placed tiny “penny” bets for the season. OpenAI’s model lost 13.6% on average.
- GLM-5 repeatedly diagnosed its own mistakes (a poorly chosen hardcoded draw rate and an overestimated home advantage) yet never fixed them, burning through its bankroll.
- Kimi K2.5 actually wrote a correct fractional Kelly staking function, but a formatting bug meant the function was never invoked. The model repeated the broken command dozens of times and eventually made an accidental £114,000 bet (98% of its bankroll) that finished it off.
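The log-loss comparison mentioned in the GPT-5.4 result can be sketched as follows. This is an illustration under stated assumptions: the odds, probabilities, and results are made up, and removing the bookmaker margin by simple normalisation is one common choice, not necessarily the one the benchmark uses.

```python
import math


def implied_probs(decimal_odds):
    """Turn (home, draw, away) decimal odds into probabilities that sum to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 because of the bookmaker margin
    return [r / total for r in raw]


def log_loss(prob_rows, outcomes, eps=1e-12):
    """Mean negative log-probability assigned to what actually happened."""
    losses = [-math.log(max(row[k], eps)) for row, k in zip(prob_rows, outcomes)]
    return sum(losses) / len(losses)


# Made-up matches; outcome index 0 = home win, 1 = draw, 2 = away win.
market_odds = [(2.10, 3.40, 3.60), (1.50, 4.20, 6.50), (2.80, 3.20, 2.60)]
model_probs = [(0.48, 0.27, 0.25), (0.60, 0.24, 0.16), (0.33, 0.30, 0.37)]
outcomes = [0, 1, 2]

market_ll = log_loss([implied_probs(o) for o in market_odds], outcomes)
model_ll = log_loss(model_probs, outcomes)
print(f"market log-loss: {market_ll:.3f}, model log-loss: {model_ll:.3f}")
print("model beats market" if model_ll < market_ll else "no demonstrated edge")
```

Lower is better. For context, guessing one third for each of the three outcomes scores ln 3, roughly 1.099, so values near 0.97 are only modestly better than an uninformed guess.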
Surprising baseline: a 1990s model outperformed many AIs
- The Dixon–Coles model (published in 1997), despite being a dated baseline that ignores non-stationarity and modern data sources, beat six of the eight frontier models on KellyBench. The researchers called this “surprising” and worrying.
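To give a sense of how compact this baseline is, here is a rough sketch of the core of a Dixon–Coles-style model: two Poisson goal counts with a correction for low-scoring results. The scoring rates and the dependence parameter `rho` below are hypothetical values chosen for illustration; in practice they are fitted from match data, and that fitting step is omitted here.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals when goals arrive at the given mean rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Low-score adjustment applied to 0-0, 0-1, 1-0 and 1-1 scorelines."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def outcome_probs(lam: float, mu: float, rho: float, max_goals: int = 10):
    """Return (home win, draw, away win) probabilities.

    lam and mu are the expected goals for the home and away side.
    Scorelines above max_goals are ignored, so the total is very slightly under 1.
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):  # home goals
        for y in range(max_goals + 1):  # away goals
            p = tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    return home, draw, away


# Hypothetical fixture: home side expected to score 1.5, away side 1.1.
home, draw, away = outcome_probs(lam=1.5, mu=1.1, rho=-0.1)
print(f"home {home:.3f}, draw {draw:.3f}, away {away:.3f}")
```

With a negative `rho`, the adjustment shifts probability toward 0-0 and 1-1, which raises the draw probability relative to two independent Poisson counts.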
Root cause: the “knowledge–action gap”
- Models could often recite the Kelly formula and even diagnose why they were losing, but they repeatedly failed to close the loop — to verify code implementation, notice divergence between intent and execution, and act on self-critique.
- The team coined this failure mode the “knowledge–action gap”: agents can reason about what to do but can’t reliably turn that reasoning into robust, long-running action in a non-stationary environment.
- The researchers argue that successful agents must “maintain coherent intent across potentially thousands of sequential decisions, monitor consequences, and close the loop between observation and action.”
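One way to picture "closing the loop" is a pre-trade check that compares what the strategy intended with what is about to be submitted. The sketch below is a hypothetical guard, not part of KellyBench: the function name, the 5% cap, and the 1% tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StakeCheck:
    ok: bool
    reason: str


def check_stake(intended: float, submitted: float, bankroll: float,
                max_fraction: float = 0.05, rel_tolerance: float = 0.01) -> StakeCheck:
    """Reject a bet that breaks the bankroll cap or drifts from the intended stake."""
    if bankroll <= 0:
        return StakeCheck(False, "bankroll is exhausted")
    if submitted < 0:
        return StakeCheck(False, "stake is negative")
    cap = max_fraction * bankroll
    if submitted > cap:
        return StakeCheck(False, f"stake {submitted:,.0f} exceeds cap {cap:,.0f}")
    if abs(submitted - intended) > rel_tolerance * max(abs(intended), 1.0):
        return StakeCheck(False, "submitted stake differs from intended stake")
    return StakeCheck(True, "ok")


# The article's accidental bet: 114,000 at 98% of bankroll.
# The intended stake of 2,000 is a made-up figure for the example.
bankroll = 114_000 / 0.98
result = check_stake(intended=2_000.0, submitted=114_000.0, bankroll=bankroll)
print(result)
```

A check like this does not make a strategy profitable, but it would turn a silent execution bug into a visible rejection that the agent, or a human, can act on.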
Quantifying strategy quality
- Beyond raw returns, the team created a 44-point sophistication rubric (features, stake sizing, handling non-stationarity, execution quality). Claude Opus 4.6 scored highest at 32.6% — less than a third of the available points.
- Higher sophistication scores significantly predicted lower bankruptcy rates (p = 0.008) and correlated with better returns. In short: how carefully a model engineered and executed its strategy was a strong indicator of how it fared.
Broader pattern and crypto parallels
- This isn’t just sports-betting theater. Previous studies found AI agents can develop “gambling-like” behavior when simply optimized for reward (going bankrupt up to 48% of the time in slot-machine simulations). A separate real-money crypto trading competition showed similar reliability problems over long horizons.
- For crypto traders and quant shops, the lessons are stark: model reasoning ≠ durable strategy. Non-stationary markets, execution bugs, insufficient monitoring, and the inability to adapt reliably over months are practical failure modes that matter more than theoretical edge estimates.
Costs and operational friction
- Running these experiments is not free: one seed of the benchmark reportedly cost roughly $2,012 in compute. That’s a reminder that testing robust, long-lived strategies at scale has real operational costs — and that sloppy execution errors can erase those investments.
Bottom line
Frontier LLMs can reason about money and even explain good staking rules, but today they struggle to implement consistent, long-term decision-making in messy, evolving markets. For crypto and algorithmic trading, the takeaway is clear: promises of full automation are premature unless teams close the knowledge–action gap with rigorous monitoring, robust execution layers, and continual retraining and validation. Until that happens, “smart” models will keep getting rekt, sometimes at high real-world cost.