U.S. AI Model Testing Expands With Google, Microsoft and xAI

May 5, 2026
Original ReadBasket illustration about CAISI and pre-release frontier AI model testing.

The U.S. government is getting a deeper look at the next wave of frontier AI models before they reach the public. On May 5, 2026, the Center for AI Standards and Innovation, or CAISI, at NIST announced new agreements with Google DeepMind, Microsoft and xAI for pre-deployment evaluations and targeted research.

The practical meaning is straightforward: more leading AI labs are agreeing to let federal evaluators test advanced systems before public release, then continue assessing them after deployment. That does not make CAISI a product approval board, and NIST did not say the government will have veto power over releases. It does, however, make independent government testing a more routine part of the frontier AI release cycle.

What CAISI Announced

CAISI says the agreements will let it evaluate models before they are publicly available, carry out post-deployment assessments and run research into frontier AI capabilities. NIST also said CAISI has already completed more than 40 evaluations, including evaluations of state-of-the-art models that remain unreleased.

The new deals build on earlier U.S. AI Safety Institute agreements with OpenAI and Anthropic. Those 2024 agreements gave the government access to major new models before and after public release for safety research, testing and evaluation. The latest announcement widens that structure to include Google DeepMind, Microsoft and xAI.

Why Early Access Matters

Frontier AI models are now being evaluated for more than general helpfulness or chatbot safety. CAISI’s public materials say the center focuses on demonstrable risks such as cybersecurity, biosecurity and chemical weapons, while also assessing U.S. and foreign AI capabilities and the state of international AI competition.

That focus matters because the most serious risks are often capability-driven. A model that is better at coding, tool use and long-horizon reasoning may also become more useful for vulnerability discovery, reconnaissance or other dual-use tasks. The point of early testing is to see those behaviors before a broad public rollout makes them harder to contain.

From AI Safety Institute to CAISI

CAISI is not simply a rebranded research group. In June 2025, the Commerce Department announced that the former U.S. AI Safety Institute would become the Center for AI Standards and Innovation, with a stronger emphasis on measurement science, voluntary standards, national security and U.S. competitiveness.

That framing aligns with the White House’s America’s AI Action Plan, which centers on accelerating innovation, building AI infrastructure and leading internationally on security. The result is a testing posture that tries to avoid heavy release licensing while still giving the government technical visibility into the most capable systems.

How the Testing Could Work

NIST says developers frequently provide CAISI with models that have reduced or removed safeguards so evaluators can test national security-related capabilities and risks more thoroughly. That detail is important. Testing a heavily restricted public interface can miss what a model is capable of when safeguards fail, are bypassed or are intentionally removed in controlled settings.

CAISI also says evaluators from across government may participate through its TRAINS Taskforce, an interagency group focused on AI national security concerns. The agreements support testing in classified environments, which suggests the government wants to examine sensitive threat scenarios without pushing all details into public reports.

What Industry Gets Out of It

For AI labs, the incentive is not only regulatory goodwill. Early government testing can produce feedback before launch, support voluntary product improvements and give companies a stronger answer when enterprise buyers ask how advanced models were assessed. Microsoft, for example, said its new work with CAISI and the U.K. AI Security Institute will focus on testing frontier models, assessing safeguards and improving evaluation science.

For buyers, the key is to read these agreements correctly. A CAISI evaluation is not the same thing as a blanket guarantee that a model is safe for every workflow. It is a signal that the developer is participating in a more serious evaluation process, especially around national security risk, but customers still need their own governance, red-team testing, logging and incident response.

The Open Questions

The biggest unresolved issue is transparency. Some testing will necessarily involve sensitive details, especially where cybersecurity or classified evaluations are involved. But if the public sees only high-level announcements, it will be difficult for enterprises, researchers and policymakers to compare how different models performed or to judge whether the feedback materially changed release decisions.

The other question is whether voluntary agreements can keep up with model capability. CAISI’s approach depends on cooperation from frontier developers and on enough government talent to test systems deeply. If the testing becomes too slow, labs may treat it as process overhead. If it is too opaque, the market may treat it as a trust badge without enough substance behind it.

The Bottom Line

The new CAISI agreements are best understood as a shift in AI release norms. The U.S. government is not publicly claiming approval authority over Google DeepMind, Microsoft or xAI models. It is building a more formal channel for early access, technical measurement and national security review.

That is a meaningful development for AI policy and enterprise risk management. The frontier AI market is moving too fast for safety claims based only on vendor self-assessment. CAISI’s challenge is to turn early access into useful evidence, and to do it without slowing legitimate deployment or hiding all of the important findings from the people who have to buy, deploy and govern these systems.

Jeff McGilligan

Jeff McGilligan is a ReadBasket technology writer focused on artificial intelligence, startups, cybersecurity, digital platforms, and the business moves shaping the internet. He turns complex announcements from companies like OpenAI, Anthropic, Google, Microsoft, Tesla, and xAI into clear, practical analysis for readers who want the context, risks, and commercial impact behind the headline.
