Guaranteeing Foundation Models for High-Stakes Settings: DARPA’s AIQ Program

December 18, 2025
AIQ: Artificial Intelligence Quantified

Kitware, a leader in developing software for AI test and evaluation, has been awarded a $3.7M contract by the Defense Advanced Research Projects Agency (DARPA) to develop a novel AI evaluation framework and perform AI test and evaluation for the Artificial Intelligence Quantified (AIQ) program. AIQ will combine theoretical and empirical approaches to assess and understand the capabilities of AI to enable guaranteed performance in domains such as defense, intelligence, manufacturing, and medicine. By providing deeper insights into when these systems can be relied upon and when caution is warranted, the program aims to make high-stakes AI deployment safer, more predictable, and more reliable.

Kitware’s Role: MAGNET for AI Evaluation

MAGNET: Mathematical Assurance and Generative AI Network Evaluation Toolkit

Kitware will lead test and evaluation (T&E) efforts on the AIQ program as a Technical Area 2 (TA2) performer through the development of the Mathematical Assurance and Generative AI Network Evaluation Toolkit (MAGNET). TA1 performers will create mathematical theories and models that predict the outputs of AI transformer models with formal guarantees, providing verifiable performance bounds instead of relying on empirical studies. MAGNET will evaluate TA1 theories at scale by empirically validating their outcome predictions on large, relevant datasets and current, full-scale AI models. To perform this evaluation, MAGNET will provide a framework that ensures AI systems can be reliably tested across a variety of tasks, datasets, and modalities. MAGNET will leverage open source datasets, models, and tasks where possible, and will employ generative and adversarial techniques to augment evaluation datasets with high-difficulty, out-of-domain problems.
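As a rough illustration of what this kind of theory-versus-experiment check might involve (this is our own minimal sketch, not the MAGNET API; names such as `TheoryPrediction` and `validate_prediction` are hypothetical), a small harness could compare a claimed performance bound against an empirically observed score:

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TheoryPrediction:
    """Hypothetical TA1-style claim: accuracy on a task should fall within [lower, upper]."""
    task_name: str
    lower_bound: float
    upper_bound: float


def empirical_accuracy(model: Callable[[str], str],
                       examples: Sequence[tuple]) -> float:
    """Run the model over (input, expected) pairs and measure accuracy."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


def validate_prediction(pred: TheoryPrediction, observed: float) -> bool:
    """Check whether the observed score lies inside the theory's claimed bounds."""
    return pred.lower_bound <= observed <= pred.upper_bound


if __name__ == "__main__":
    # Toy stand-ins: a fake "model" and a tiny labeled dataset.
    random.seed(0)
    dataset = [(f"input-{i}", "yes" if i % 2 == 0 else "no") for i in range(200)]
    toy_model = lambda x: "yes" if random.random() < 0.55 else "no"

    claim = TheoryPrediction("toy-binary-task", lower_bound=0.45, upper_bound=0.65)
    score = empirical_accuracy(toy_model, dataset)
    print(f"observed accuracy = {score:.3f}, claim holds: {validate_prediction(claim, score)}")
```

In practice, the evaluation would run real foundation models over large, task-relevant datasets rather than toy stand-ins, but the structure of the check, comparing a formal prediction against measured behavior, is the same.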

Project Updates

The Conference on Neural Information Processing Systems (NeurIPS 2025) recently accepted Kitware's workshop paper related to the MAGNET system, which focuses on the design of structured and transparent AI evaluations. In this design, evaluation cards document the critical metadata, constraints, and claims associated with each AI system, serving as a "contract" between developers and evaluators.
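To give a sense of what such a card could contain (a sketch under our own assumptions; the field names below are illustrative and not taken from the workshop paper), an evaluation card might be encoded as a small structured record:

```python
from dataclasses import dataclass


@dataclass
class EvaluationCard:
    """Hypothetical, simplified evaluation card capturing the metadata,
    constraints, and claims an evaluation might record."""
    system_name: str
    version: str
    task: str                 # e.g. "visual question answering"
    modalities: list          # e.g. ["text", "image"]
    datasets: list            # evaluation datasets the claims apply to
    constraints: list         # conditions under which the claims hold
    claims: dict              # metric name -> claimed value or bound
    notes: str = ""


card = EvaluationCard(
    system_name="example-vlm",
    version="0.1",
    task="visual question answering",
    modalities=["text", "image"],
    datasets=["public-vqa-subset"],
    constraints=["English prompts only", "images under 1 MP"],
    claims={"accuracy_lower_bound": 0.70},
    notes="Illustrative example only.",
)
print(card.claims)
```

The value of a record like this is that both sides of the "contract" can see exactly which claims are being tested and under what constraints they are expected to hold.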

MAGNET is designed to:

  • Validate AI system performance empirically, ensuring evaluations are transparent and reproducible.
  • Generate dynamic benchmarks—including text, image, and multimodal tasks—to test generalization and robustness.
  • Support scalable inference and testing across CPUs and GPUs in hybrid cloud configurations.
  • Employ generative and adversarial techniques to create high-difficulty, out-of-domain evaluation datasets (a toy sketch follows this list).
  • Provide flexible evaluation workflows that minimize friction for developers while maximizing insight for testers.
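
As a toy illustration of the augmentation idea above (our sketch only; MAGNET's actual generative and adversarial techniques use trained models and are not described here, and `perturb_text` is a hypothetical stand-in), a benchmark item can be perturbed to produce harder variants:

```python
import random


def perturb_text(example: str, rate: float = 0.1, seed: int = 0) -> str:
    """Toy 'adversarial-style' perturbation: randomly swap adjacent characters.
    This only illustrates the shape of a benchmark-hardening step."""
    rng = random.Random(seed)
    chars = list(example)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


benchmark = ["What is the capital of France?",
             "Summarize the following paragraph."]
hardened = [perturb_text(q, rate=0.15, seed=i) for i, q in enumerate(benchmark)]
for original, variant in zip(benchmark, hardened):
    print(f"{original!r} -> {variant!r}")
```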

This work will be presented as a poster at the Evaluating the Evolving LLM Lifecycle workshop at NeurIPS in San Diego in early December 2025, highlighting the team’s ongoing research in AI evaluation.


Looking Ahead

AIQ is a broad, complex program with a large number of performer teams, and it represents an important step toward mathematical guarantees for AI deployment. Kitware’s experience in DARPA programs related to explainable and responsible AI positions our team to deliver scalable, reliable, and transparent AI evaluation methods spanning the wide range of TA1 approaches.

By creating rigorous, operationally relevant evaluation tools, AIQ aims to answer the fundamental question: when can AI be trusted in high-stakes settings? The outcomes of this program will not only help ensure the safe deployment of AI but also set a new standard for responsible and verifiable AI systems.

Kitware: Ensuring safe and effective foundation models

Kitware is proud to bring our expertise in AI evaluation to the AIQ program. Through our leadership of the MAGNET effort, we are developing scalable, transparent, and reliable tools that help ensure AI systems can be trusted in high-stakes scenarios. With experience in other responsible AI programs for DARPA and IARPA, we continue to advance AI assurance, bridging the gap between research and real-world deployment.

Partner with Kitware to elevate the safety, transparency, and trustworthiness of your AI initiatives. Our experts collaborate with organizations to design and deploy AI systems that perform reliably when it matters most.

Let’s connect to start a conversation about how we can support your projects and goals.

Acknowledgement of Support and Disclaimer
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001125CE017. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).
