⚡ Quick Summary
This GovAI technical report argues that the next performance frontier is not bigger pre-training runs but scaling the compute used at inference. Ord distinguishes two paths: (1) inference-at-deployment (spending more test-time compute per task), and (2) inference-during-training (using heavy test-time reasoning to generate data and then distilling it). Each path reshapes risk, business models, and regulation, especially regimes keyed to training-compute thresholds (EU AI Act, California SB 53). He forecasts fewer simultaneously served model copies, higher marginal costs, uneven access to capabilities, and reduced usefulness of “training FLOPs” as a proxy. If labs push iterated distillation and amplification, timelines could compress and transparency could fall.
🧩 What’s Covered
- The paradigm shift. Evidence suggests diminishing returns from simply scaling pre-training beyond GPT-4 due to data quality limits; OpenAI’s o-series shows large gains by allocating more compute to reasoning at test time (Figure 1, p.5).
- Two loci for scaling:
  - Inference-at-deployment. Consequences include:
    - Fewer concurrent copies of frontier models, since each served instance needs far more compute; even a 100× increase in per-task inference could cut immediate deployment to roughly 1% of the prior scale (see the worked cost sketch after this list).
    - Costlier “first human-level” systems, blunting immediate societal impact and possibly allowing time for demonstrations and safety work before mass roll-out.
    - Lower payoff from stealing weights, since costs shift to run-time compute, though targeted misuse on a few high-value tasks could still benefit.
    - Reduced stakes of open-weight releases for broad capability diffusion; risks remain for narrow, high-compute tasks.
    - Unequal performance across users and tasks (higher-value tasks and wealthier users get more compute), with early pricing already tiered by reasoning budget.
    - Industry structure shifts from software-like economics (high fixed costs, low marginal costs) toward higher marginal costs, potentially enabling more competitors.
    - Less need for monolithic data centers; inference can be distributed geographically, weakening oversight tools tied to mega-facilities and chip chokepoints.
    - Compute-threshold complications: models trained below 10²⁵–10²⁶ FLOPs could, with 10³–10⁵× inference, rival “regulated” capability tiers (Figure 3, p.12); remedies include accounting for inference budgets or shifting to capabilities-based triggers (see the effective-compute sketch after this list).
  - Inference-during-training.
    - Synthetic data generation to overcome the scarcity of high-quality tokens (e.g., in math and code, where objective verification is strong).
    - Iterated Distillation and Amplification (IDA): amplify a base model with heavy test-time search and reasoning, then distill the result into a faster model, and repeat (Box 1 & Fig. 4, pp.15–16), inspired by AlphaGo Zero. If the engineering hurdles are solved, this could amount to recursive self-improvement for LLMs (see the loop sketch after this list).
    - Governance effects: less public visibility into the state of the art during internal scaling, tougher readiness planning, and potentially shorter AGI timelines.
- Appendix model. A simple cost frame shows when deployment compute dominates totals and why 10× inference can dwarf 10× training in lifetime cost, depending on usage (pp.21–22); a minimal numerical sketch follows below.
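To make the appendix-style cost frame and the concurrent-copies point concrete, here is a minimal Python sketch. The training budget, per-query compute, query volume, and serving-fleet capacity are all illustrative assumptions, not figures from the report:

```python
# Illustrative lifetime-compute sketch (made-up numbers, not from the report).
# Lifetime compute = training compute + per-query inference compute * number of queries.

TRAINING_FLOP = 1e25          # hypothetical pre-training budget
BASE_INFERENCE_FLOP = 1e15    # hypothetical compute per query at the old budget
QUERIES = 1e12                # hypothetical lifetime query volume

def lifetime_flop(training, per_query, queries):
    """Total compute spent over the model's lifetime."""
    return training + per_query * queries

baseline = lifetime_flop(TRAINING_FLOP, BASE_INFERENCE_FLOP, QUERIES)
ten_x_training = lifetime_flop(10 * TRAINING_FLOP, BASE_INFERENCE_FLOP, QUERIES)
ten_x_inference = lifetime_flop(TRAINING_FLOP, 10 * BASE_INFERENCE_FLOP, QUERIES)

print(f"baseline:      {baseline:.2e} FLOP")
print(f"10x training:  {ten_x_training:.2e} FLOP")   # adds 9e25 FLOP
print(f"10x inference: {ten_x_inference:.2e} FLOP")  # adds 9e27 FLOP, dominating at this usage

# Concurrent-copies intuition: with a fixed serving fleet, the number of
# simultaneously served instances scales inversely with per-instance compute,
# so a 100x inference increase supports ~1% as many concurrent copies.
FLEET_FLOP_PER_S = 1e20       # hypothetical serving capacity
copies_before = FLEET_FLOP_PER_S / BASE_INFERENCE_FLOP
copies_after = FLEET_FLOP_PER_S / (100 * BASE_INFERENCE_FLOP)
print(f"concurrent copies: {copies_before:.0f} -> {copies_after:.0f}")
```

At this hypothetical usage level the extra inference compute (about 9×10²⁷ FLOP) dwarfs the extra training compute (9×10²⁵ FLOP); with much lower query volumes the comparison can flip, which is the usage dependence the appendix highlights.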
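The compute-threshold point can be illustrated with a rough effective-compute sketch. The `compute_equivalent_gain` factor is a placeholder for the task-dependent exchange rate between extra inference and extra training compute discussed around Figure 3; the simple multiplication below is only a heuristic, not the report's method:

```python
# Sketch of how a training-FLOP trigger can miss inference-scaled systems.
# The numbers and the plain multiplication are illustrative assumptions.

THRESHOLD_FLOP = 1e26  # upper end of the 1e25-1e26 range the report discusses

def training_trigger(training_flop, threshold=THRESHOLD_FLOP):
    """A regulatory trigger keyed only to training compute."""
    return training_flop >= threshold

def capability_equivalent(training_flop, compute_equivalent_gain):
    """Treat extra inference as if it were extra training compute (crude heuristic)."""
    return training_flop * compute_equivalent_gain

model_training_flop = 1e24   # well below the trigger as trained
gain = 1e3                   # hypothetical equivalent gain from a 1e3-1e5x inference budget

print(training_trigger(model_training_flop))                               # False
print(training_trigger(capability_equivalent(model_training_flop, gain)))  # True
```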
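The IDA loop in Box 1 / Figure 4 can be summarised with a toy simulation. Everything here is an assumption for illustration: "skill" is an abstract score, amplification gains scale with the log of the test-time budget, and distillation retains 80% of the gain; none of these functional forms come from the report:

```python
# Toy simulation of an Iterated Distillation and Amplification (IDA) loop.

import math

def amplify(skill, inference_multiplier):
    # Heavy test-time search/reasoning boosts effective skill (diminishing returns).
    return skill + math.log10(inference_multiplier)

def distill(base_skill, amplified_skill, retention=0.8):
    # Fine-tuning on amplified outputs recovers most, not all, of the gain cheaply.
    return base_skill + retention * (amplified_skill - base_skill)

def ida(skill, rounds=5, inference_multiplier=1e3):
    for r in range(rounds):
        amplified = amplify(skill, inference_multiplier)   # slow, expensive model
        skill = distill(skill, amplified)                  # fast model for the next round
        print(f"round {r + 1}: fast-model skill = {skill:.2f}")
    return skill

ida(skill=1.0)
```

The point of the sketch is the structure rather than the numbers: each round spends heavy inference compute to outperform the current fast model, then bakes part of that gain back into cheap weights, so the next round starts from a higher base, in the AlphaGo Zero pattern.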
💡 Why It Matters
Policy regimes that hinge on training FLOPs risk blind spots as capability moves to run-time budgets. Procurement, safety cases, and market access will increasingly depend on declared and enforced inference limits, capabilities-based evaluations at specified compute budgets, and entity-based oversight of firms scaling private, internal training loops. Governance agility—not static thresholds—becomes the differentiator.
❓ What’s Missing
- Operational metrics for auditing inference budgets across cloud/edge deployments.
- Concrete test suites that bind capabilities to compute envelopes (e.g., evals at 10², 10³, 10⁴ reasoning-token tiers; a sketch of such a harness follows this list).
- Market/consumer impacts modeling (pricing, access disparities) and grid/power analyses under massively distributed inference.
- IDA feasibility evidence for LLMs beyond analogy to games; timelines, plateau modes, and risk controls remain speculative.
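As a sketch of what such compute-conditioned evaluations might look like, here is a hypothetical harness outline; `run_model` and `grade` are placeholders for a real serving API and task-specific scoring, and the tier values simply mirror the ones named above:

```python
# Hypothetical compute-conditioned eval harness: the same benchmark is run at fixed
# reasoning-token tiers, so every capability score carries its inference budget.
# run_model and grade are placeholders, not a real API.

REASONING_TOKEN_TIERS = [10**2, 10**3, 10**4]

def run_model(prompt: str, max_reasoning_tokens: int) -> str:
    """Placeholder: call the system under test with a hard cap on reasoning tokens."""
    raise NotImplementedError

def grade(answer: str, reference: str) -> float:
    """Placeholder: task-specific scoring in [0, 1]."""
    raise NotImplementedError

def compute_conditioned_eval(tasks: list[dict]) -> dict[int, float]:
    """Return {reasoning-token tier: mean score} for a list of {prompt, reference} tasks."""
    results = {}
    for tier in REASONING_TOKEN_TIERS:
        scores = [grade(run_model(t["prompt"], tier), t["reference"]) for t in tasks]
        results[tier] = sum(scores) / len(scores)
    return results
```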
👥 Best For
Regulators and AI offices designing post-EU-AI-Act secondary rules; corporate AI policy leads building usage-tiered risk controls; evaluators developing compute-conditioned capability tests; investors assessing shifts to high-marginal-cost AI business models.
📄 Source Details
GovAI Technical Report, October 2025, 25 pp.; Figures: o-series performance vs cost (p.12), IDA ladder (pp.15–16); Appendix: lifetime compute cost model (pp.21–22). Author: Toby Ord, Oxford Martin AI Governance Initiative.
📝 Thanks to
Toby Ord for framing the inference era; Seb Krier, acknowledged in the report, for sparking ideas; GovAI for publication and context.