AI Governance Library

Inference Scaling and AI Governance — Toby Ord (GovAI, Oct 2025)

“The shift from scaling up pre-training compute to scaling up inference compute may have profound effects on AI governance… potentially undermining measures that rely on training-compute thresholds.”  
Inference Scaling and AI Governance — Toby Ord (GovAI, Oct 2025)

⚡ Quick Summary

This GovAI technical report argues that the next performance frontier is not bigger pre-training runs but scaling the compute used at inference. Ord distinguishes two paths: (1) inference-at-deployment (spending more test-time compute per task), and (2) inference-during-training (using heavy test-time reasoning to generate data and then distilling it). Each path reshapes risk, business models, and regulation—especially regimes keyed to training-compute thresholds (EU AI Act, CA SB53). He forecasts fewer simultaneously served model copies, higher marginal costs, uneven capability access, and reduced usefulness of “training FLOPs” as a proxy. If labs push iterated distillation & amplification, timelines could compress and transparency fall.  

🧩 What’s Covered

  • The paradigm shift. Evidence suggests diminishing returns from simply scaling pre-training beyond GPT-4 due to data quality limits; OpenAI’s o-series shows large gains by allocating more compute to reasoning at test time (Figure 1, p.5).  
  • Two loci for scaling:
    • Inference-at-deployment. Consequences include:
      1. Fewer concurrent copies of frontier models, because each served instance needs far more compute; a 100× increase in per-task inference compute could cut the number of simultaneously deployable copies to ~1% of the prior scale.  
      2. Costlier “first human-level” systems, blunting immediate societal impact and possibly allowing demo/safety time before mass roll-out.  
      3. Lower payoff from stealing weights (since costs shift to run-time compute), though targeted misuse on a few tasks could still benefit.  
      4. Reduced stakes of open-weights for broad capability diffusion; risks remain for narrow high-compute tasks.  
      5. Unequal performance across users/tasks (rich tasks and rich users get more compute), with early pricing already tiered by reasoning budget.  
      6. Industry structure flips from software-like (high fixed, low marginal) toward higher marginal costs, potentially enabling more competitors.  
      7. Less need for monolithic data centers; inference can distribute geographically, weakening oversight tools tied to mega-facilities/chip chokepoints.  
      8. Compute-threshold complications: models trained below 10²⁵–10²⁶ FLOPs could, with 10³–10⁵× inference, rival “regulated” capability tiers (Figure 3, p.12). Remedies include accounting for inference budgets or shifting to capabilities-based triggers.  
    • Inference-during-training.
      • Synthetic data generation to overcome high-quality token scarcity (e.g., math/code where objective verification is strong).  
      • Iterated Distillation & Amplification (IDA): amplify a base model with heavy test-time search/reasoning, then distill into a faster model; repeat (Box 1 & Fig.4, pp.15–16), inspired by AlphaGo Zero. This could amount to recursive self-improvement for LLMs—if engineering hurdles are solved. A toy numeric sketch follows this list.  
      • Governance effects: less public visibility of SOTA during internal scaling, tougher readiness planning, and potentially shorter AGI timelines.  
  • Appendix model. A simple cost frame shows when deployment compute dominates totals and why 10× inference can dwarf 10× training in lifetime cost, depending on usage (pp.21–22); a minimal sketch of this accounting also follows below.  
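
To make the IDA loop concrete, here is a toy numeric sketch in Python. Its dynamics are illustrative assumptions of mine, not the report's: amplifying with a k× inference budget is treated as adding capability log-linearly (loosely mirroring test-time scaling curves), and distillation is assumed to retain most of that gain in a fast model that seeds the next round.

```python
import math

# Toy model of the amplify-then-distill IDA loop (Box 1).
# Assumptions for illustration only: a k-times inference budget lifts
# capability by ~log10(k) "units", and distillation keeps ~90% of the
# lift in a fast model that becomes the next round's base.

def ida_rounds(base_capability, inference_multiplier, retention, rounds):
    capability = base_capability
    for r in range(1, rounds + 1):
        gain = math.log10(inference_multiplier)  # amplify: slow but stronger
        capability += retention * gain           # distill: fast, keeps most gain
        print(f"round {r}: fast-model capability = {capability:.2f}")
    return capability

ida_rounds(base_capability=1.0, inference_multiplier=1_000, retention=0.9, rounds=5)
```

The loop compounds: each distilled model seeds a stronger amplified teacher, which is the sense in which IDA could amount to recursive self-improvement for LLMs; whether real distillation retains gains at anything like this rate is exactly the open engineering question the report flags.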
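On the appendix's lifetime-cost frame, a minimal sketch with magnitudes of my own choosing (the report supplies the accounting, not these numbers): lifetime compute is training compute plus per-query inference compute times total queries served.

```python
# All magnitudes below are assumptions for illustration only.
TRAIN_FLOP = 1e25            # assumed pre-training run
INFER_FLOP_PER_QUERY = 1e15  # assumed per-query inference cost
QUERIES = 1e11               # assumed lifetime query volume

def lifetime_flop(train, infer_per_query, queries):
    # Lifetime compute = training + per-query inference x queries served.
    return train + infer_per_query * queries

base      = lifetime_flop(TRAIN_FLOP,      INFER_FLOP_PER_QUERY,      QUERIES)
train_10x = lifetime_flop(10 * TRAIN_FLOP, INFER_FLOP_PER_QUERY,      QUERIES)
infer_10x = lifetime_flop(TRAIN_FLOP,      10 * INFER_FLOP_PER_QUERY, QUERIES)

print(f"baseline:      {base:.2e} FLOP")       # 1.10e+26
print(f"10x training:  {train_10x:.2e} FLOP")  # 2.00e+26, under 2x lifetime cost
print(f"10x inference: {infer_10x:.2e} FLOP")  # 1.01e+27, roughly 9x lifetime cost
```

At heavy usage, deployment compute dominates, so a 10× inference budget moves lifetime cost far more than a 10× training run. The same accounting underlies the threshold complication in item 8: none of this run-time compute is visible to a trigger that counts training FLOPs alone.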

💡 Why it matters?

Policy regimes that hinge on training FLOPs risk blind spots as capability moves to run-time budgets. Procurement, safety cases, and market access will increasingly depend on declared and enforced inference limits, capabilities-based evaluations at specified compute budgets, and entity-based oversight of firms scaling private, internal training loops. Governance agility—not static thresholds—becomes the differentiator.  

❓ What’s Missing

  • Operational metrics for auditing inference budgets across cloud/edge deployments.  
  • Concrete test suites that bind capabilities to compute envelopes (e.g., evals at 10², 10³, 10⁴ reasoning-token tiers); a hypothetical harness is sketched after this list.  
  • Market/consumer impacts modeling (pricing, access disparities) and grid/power analyses under massively distributed inference.  
  • IDA feasibility evidence for LLMs beyond analogy to games; timelines, plateau modes, and risk controls remain speculative.  
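
On the second gap, here is a hypothetical sketch of what a compute-conditioned eval harness could look like; `generate` is a stand-in for whatever interface caps reasoning tokens, not any real provider API.

```python
from typing import Callable

# Hypothetical harness: score one task suite at several reasoning-token
# budgets, binding any capability claim to an explicit compute envelope.

def eval_at_budgets(
    generate: Callable[[str, int], str],  # (prompt, max_reasoning_tokens) -> answer
    tasks: list[tuple[str, str]],         # (prompt, expected_answer) pairs
    budgets: tuple[int, ...] = (100, 1_000, 10_000),
) -> dict[int, float]:
    results = {}
    for budget in budgets:
        correct = sum(
            generate(prompt, budget).strip() == expected
            for prompt, expected in tasks
        )
        results[budget] = correct / len(tasks)
    return results
```

A capability-based trigger could then key on the score at a declared tier (say, accuracy at the 10⁴-token envelope) rather than on training FLOPs alone.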

👥 Best For

Regulators and AI offices designing post-EU-AI-Act secondary rules; corporate AI policy leads building usage-tiered risk controls; evaluators developing compute-conditioned capability tests; investors assessing shifts to high-marginal-cost AI business models.  

📄 Source Details

GovAI Technical Report, October 2025, 25 pp.; Figures: o-series performance vs cost (p.12), IDA ladder (pp.15–16); Appendix: lifetime compute cost model (pp.21–22). Author: Toby Ord, Oxford Martin AI Governance Initiative.  

📝 Thanks to

Toby Ord for framing the inference era; Seb Krier (acknowledged) for sparking ideas; GovAI for publication and context.  

About the author
Jakub Szarmach
