Learning from other domains to advance AI evaluation and testing

⚡ Quick Summary

As AI capabilities grow rapidly, so does the urgency to build robust evaluation and testing practices. This August 2025 report by Microsoft’s Office of Responsible AI analyzes how other high-stakes sectors—like civil aviation, pharmaceuticals, and nuclear power—have embedded testing into their governance systems. Drawing on expert case studies, the paper identifies core principles that can inform AI governance: balancing pre- and post-deployment testing, standardizing methodologies, and building evaluation infrastructure. It also surveys ongoing initiatives like the AI Safety Institutes and proposes six detailed policy recommendations. This is a foundational text for policymakers and technical leaders working at the intersection of AI safety, assurance, and regulation.

🧩 What’s Covered

The report is structured into three parts and two appendices:

1. Lessons from Eight Domains

The study covers civil aviation, pharmaceuticals, medical devices, nuclear power, bank stress testing, cybersecurity, genome editing, and nanotechnology. It identifies key distinctions in evaluation ecosystems, such as:

Strict, pre-deployment testing regimes (e.g. aviation, pharma) rely on high standardization and regulation but are slow to adapt.
Dynamic governance models (e.g. cybersecurity, banking) emphasize post-deployment monitoring and flexibility.
General-purpose technologies (GPTs) like genome editing and nanotech face challenges with upstream vs. downstream governance—balancing broad rules with context-specific risks.

2. Ongoing Initiatives

The report highlights multistakeholder and public-sector efforts:

The EU AI Act, U.S. AI Action Plan, and Singapore’s AI Verify initiative embed AI evaluation in policy.
Tools like AILuminate, PyRIT, and Microsoft’s ADeLe framework show how industry is advancing benchmarks.
International joint testing exercises (e.g. UK-US red teaming of OpenAI’s o1 model) demonstrate a growing global evaluation ecosystem.

3. Policy Recommendations

Six recommendations are presented, including:

Shift focus toward evaluating end-to-end AI applications, not just models.
Invest in multistakeholder evaluation infrastructure.
Develop sector-specific frameworks and post-deployment monitoring.

Appendices

Appendix A outlines types of AI evaluation (auditing, red teaming, uplift studies, risk-focused evaluations, policy adherence testing).

Appendix B summarizes each domain’s testing methods, timing, thresholds, and role in governance via a clear comparison table.

💡 Why it matters?

AI evaluation is emerging as a linchpin of responsible AI governance. But unlike legacy domains with decades of regulation and consensus on testing (e.g., pharma or aviation), AI’s rapid pace and general-purpose nature demand new frameworks that are adaptive, rigorous, and scalable. This report offers a rare systems-level perspective, helping policymakers avoid pitfalls (like overemphasizing pre-deployment testing) and take cues from real-world governance successes. With frontier AI capabilities and associated risks accelerating, these insights are timely and directly applicable to shaping national and organizational AI evaluation strategies.

❓ What’s Missing

The report does not deeply engage with non-Western governance models, particularly from China or India.
There’s limited discussion of AI-specific metrics—such as robustness, interpretability, or alignment—beyond broad categories.
Case studies, while insightful, are summarized too briefly in the main body and rely on external references for depth.
Implementation barriers (e.g. costs, talent shortages, or legal constraints) receive less attention than conceptual recommendations.

👥 Best For

Policymakers designing AI regulatory frameworks
AI assurance researchers and practitioners
Risk, compliance, and ethics teams in AI-driven companies
Standards bodies and multistakeholder initiatives
Developers of high-stakes or general-purpose AI systems

📄 Source Details

Title: Learning from other domains to advance AI evaluation and testing

Publisher: Microsoft Office of Responsible AI

Date: August 2025

Contributors: Experts from academia (e.g. Alta Charo, Daniel Carpenter), regulators (e.g. Kathryn Judge, Ciaran Martin), and organizations like MLCommons

Length: 24 pages

Access: Internal white paper with supporting links to case studies and external tools

📝 Thanks to

Alta Charo, Jennifer Dionne, Daniel Carpenter, MLCommons, and the Oxford Martin AI Governance Initiative for shaping the cross-domain case studies that underpin this analysis.