AI Governance Library

Inference Scaling and AI Governance — Toby Ord (GovAI, Oct 2025)

“The shift from scaling up pre-training compute to scaling up inference compute may have profound effects on AI governance… potentially undermining measures that rely on training-compute thresholds.”  
Inference Scaling and AI Governance — Toby Ord (GovAI, Oct 2025)

⚡ Quick Summary

This GovAI technical report argues that the next performance frontier is not bigger pre-training runs but scaling the compute used at inference. Ord distinguishes two paths: (1) inference-at-deployment (spending more test-time compute per task), and (2) inference-during-training (using heavy test-time reasoning to generate data and then distilling it). Each path reshapes risk, business models, and regulation—especially regimes keyed to training-compute thresholds (EU AI Act, CA SB53). He forecasts fewer simultaneously served model copies, higher marginal costs, uneven capability access, and reduced usefulness of “training FLOPs” as a proxy. If labs push iterated distillation & amplification, timelines could compress and transparency fall.  

🧩 What’s Covered

  • The paradigm shift. Evidence suggests diminishing returns from simply scaling pre-training beyond GPT-4 due to data quality limits; OpenAI’s o-series shows large gains by allocating more compute to reasoning at test time (Figure 1, p.5).  
  • Two loci for scaling:
    • Inference-at-deployment. Consequences include:
      1. Fewer concurrent copies of frontier models, because each served instance needs far more compute; a 100× increase in per-task inference compute could cut the number of simultaneously deployable copies to ~1% of the prior scale.  
      2. Costlier “first human-level” systems, blunting immediate societal impact and possibly allowing demo/safety time before mass roll-out.  
      3. Lower payoff from stealing weights (since costs shift to run-time compute), though targeted misuse on a few tasks could still benefit.  
      4. Reduced stakes of open-weights for broad capability diffusion; risks remain for narrow high-compute tasks.  
      5. Unequal performance across users/tasks (rich tasks and rich users get more compute), with early pricing already tiered by reasoning budget.  
      6. Industry structure flips from software-like (high fixed, low marginal) toward higher marginal costs, potentially enabling more competitors.  
      7. Less need for monolithic data centers; inference can distribute geographically, weakening oversight tools tied to mega-facilities/chip chokepoints.  
      8. Compute-threshold complications: models trained below 10²⁵–10²⁶ FLOPs could, with 10³–10⁵× inference, rival “regulated” capability tiers (Figure 3, p.12). Remedies include accounting for inference budgets or shifting to capabilities-based triggers.  
    • Inference-during-training.
      • Synthetic data generation to overcome high-quality token scarcity (e.g., math/code where objective verification is strong).  
      • Iterated Distillation & Amplification (IDA): amplify a base model with heavy test-time search/reasoning, then distill into a faster model; repeat (Box 1 & Fig.4, pp.15–16), inspired by AlphaGo Zero. This could amount to recursive self-improvement for LLMs—if engineering hurdles are solved. A toy numeric sketch follows this list.  
      • Governance effects: less public visibility of SOTA during internal scaling, tougher readiness planning, and potentially shorter AGI timelines.  
  • Appendix model. A simple cost frame shows when deployment compute dominates totals and why 10× inference can dwarf 10× training in lifetime cost, depending on usage (pp.21–22); a minimal sketch of this accounting also follows below.  
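
To make the IDA loop concrete, here is a toy numeric sketch in Python. Its dynamics are illustrative assumptions of mine, not the report's: amplifying with a k× inference budget is treated as adding capability log-linearly (loosely mirroring test-time scaling curves), and distillation is assumed to retain most of that gain in a fast model that seeds the next round.

```python
import math

# Toy model of the amplify-then-distill IDA loop (Box 1).
# Assumptions for illustration only: a k-times inference budget lifts
# capability by ~log10(k) "units", and distillation keeps ~90% of the
# lift in a fast model that becomes the next round's base.

def ida_rounds(base_capability, inference_multiplier, retention, rounds):
    capability = base_capability
    for r in range(1, rounds + 1):
        gain = math.log10(inference_multiplier)  # amplify: slow but stronger
        capability += retention * gain           # distill: fast, keeps most gain
        print(f"round {r}: fast-model capability = {capability:.2f}")
    return capability

ida_rounds(base_capability=1.0, inference_multiplier=1_000, retention=0.9, rounds=5)
```

The loop compounds: each distilled model seeds a stronger amplified teacher, which is the sense in which IDA could amount to recursive self-improvement for LLMs; whether real distillation retains gains at anything like this rate is exactly the open engineering question the report flags.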
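On the appendix's lifetime-cost frame, a minimal sketch with magnitudes of my own choosing (the report supplies the accounting, not these numbers): lifetime compute is training compute plus per-query inference compute times total queries served.

```python
# All magnitudes below are assumptions for illustration only.
TRAIN_FLOP = 1e25            # assumed pre-training run
INFER_FLOP_PER_QUERY = 1e15  # assumed per-query inference cost
QUERIES = 1e11               # assumed lifetime query volume

def lifetime_flop(train, infer_per_query, queries):
    # Lifetime compute = training + per-query inference x queries served.
    return train + infer_per_query * queries

base      = lifetime_flop(TRAIN_FLOP,      INFER_FLOP_PER_QUERY,      QUERIES)
train_10x = lifetime_flop(10 * TRAIN_FLOP, INFER_FLOP_PER_QUERY,      QUERIES)
infer_10x = lifetime_flop(TRAIN_FLOP,      10 * INFER_FLOP_PER_QUERY, QUERIES)

print(f"baseline:      {base:.2e} FLOP")       # 1.10e+26
print(f"10x training:  {train_10x:.2e} FLOP")  # 2.00e+26, under 2x lifetime cost
print(f"10x inference: {infer_10x:.2e} FLOP")  # 1.01e+27, roughly 9x lifetime cost
```

At heavy usage, deployment compute dominates, so a 10× inference budget moves lifetime cost far more than a 10× training run. The same accounting underlies the threshold complication in item 8: none of this run-time compute is visible to a trigger that counts training FLOPs alone.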

💡 Why it matters?

Policy regimes that hinge on training FLOPs risk blind spots as capability moves to run-time budgets. Procurement, safety cases, and market access will increasingly depend on declared and enforced inference limits, capabilities-based evaluations at specified compute budgets, and entity-based oversight of firms scaling private, internal training loops. Governance agility—not static thresholds—becomes the differentiator.  

❓ What’s Missing

  • Operational metrics for auditing inference budgets across cloud/edge deployments.  
  • Concrete test suites that bind capabilities to compute envelopes (e.g., evals at 10², 10³, 10⁴ reasoning-token tiers); a hypothetical harness is sketched after this list.  
  • Market/consumer impacts modeling (pricing, access disparities) and grid/power analyses under massively distributed inference.  
  • IDA feasibility evidence for LLMs beyond analogy to games; timelines, plateau modes, and risk controls remain speculative.  
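
On the second gap, here is a hypothetical sketch of what a compute-conditioned eval harness could look like; `generate` is a stand-in for whatever interface caps reasoning tokens, not any real provider API.

```python
from typing import Callable

# Hypothetical harness: score one task suite at several reasoning-token
# budgets, binding any capability claim to an explicit compute envelope.

def eval_at_budgets(
    generate: Callable[[str, int], str],  # (prompt, max_reasoning_tokens) -> answer
    tasks: list[tuple[str, str]],         # (prompt, expected_answer) pairs
    budgets: tuple[int, ...] = (100, 1_000, 10_000),
) -> dict[int, float]:
    results = {}
    for budget in budgets:
        correct = sum(
            generate(prompt, budget).strip() == expected
            for prompt, expected in tasks
        )
        results[budget] = correct / len(tasks)
    return results
```

A capability-based trigger could then key on the score at a declared tier (say, accuracy at the 10⁴-token envelope) rather than on training FLOPs alone.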

👥 Best For

Regulators and AI offices designing post-EU-AI-Act secondary rules; corporate AI policy leads building usage-tiered risk controls; evaluators developing compute-conditioned capability tests; investors assessing shifts to high-marginal-cost AI business models.  

📄 Source Details

GovAI Technical Report, October 2025, 25 pp.; Figures: o-series performance vs cost (p.12), IDA ladder (pp.15–16); Appendix: lifetime compute cost model (pp.21–22). Author: Toby Ord, Oxford Martin AI Governance Initiative.  

📝 Thanks to

Toby Ord for framing the inference era; Seb Krier (acknowledged) for sparking ideas; GovAI for publication and context.  

About the author
Jakub Szarmach
