How teams go from AI uncertainty to production confidence
NVIDIA validates PeopleNet at scale, with statistical confidence no benchmark could provide
AI System Description
NVIDIA PeopleNet, an object detection model from NVIDIA for detecting persons in images, evaluated here on a pedestrian-detection scenario.
Client & Context
NVIDIA is a global leader in accelerated computing and AI infrastructure. PeopleNet served as the reference computer vision model for this assessment, which was presented at NVIDIA GTC in March 2025.
Problem
AI systems need rigorous testing to be deployed with confidence, but generic benchmarks tend to be superficial, while ad-hoc context-specific evaluations can be unreliable or misleading. For a vision model like PeopleNet, characterising behaviour requires examining performance across operational properties (such as subject size, lighting, and perceived attributes) and under realistic perturbations (such as blur), all with statistical confidence quantification suitable for supporting important decisions.
What QuantPi Tested
QuantPi conducted a black-box assessment of PeopleNet using its platform, characterising the model's behaviour across performance, robustness, fairness, and bias in a single unified assessment. NVIDIA NIM services were integrated into the test pipeline to enable automated annotation of contextual properties that were not present as ground-truth labels in the original dataset.
How QuantPi Tested
Testing was structured around two axes: events of interest (the model's intended detection behaviour) and contextual properties (such as subject size, lighting, perceived gender, and environmental perturbations including Gaussian blur). The assessment combined 15 contextual properties, 7 perturbation types, and 2 performance measures. We generated approximately 1 million metric values used to quantify data representativity, technical biases, fairness, environmental robustness, measurement robustness, fairness-robustness, and robustness biases. NV-CLIP, accessed via NVIDIA's NIM service, was integrated to provide automated perceived-gender attribution where ground truth was unavailable; statistical confidence intervals were adjusted to account for annotation noise introduced by the automated labeller.
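As a rough illustration of what a confidence interval on a per-property estimate can look like, the sketch below computes a percentile bootstrap interval for recall within two lighting subgroups. The data, the group names, and the choice of the percentile bootstrap are assumptions made for illustration; the source does not state which interval method was used, and the sketch does not cover the adjustment for annotation noise.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the resampled mean.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical per-person detection outcomes (1 = detected, 0 = missed),
# grouped by an illustrative contextual property value.
detections = {
    "daylight": [1] * 90 + [0] * 10,
    "low_light": [1] * 70 + [0] * 30,
}

results = {group: bootstrap_ci(hits) for group, hits in detections.items()}
for group, (recall, lower, upper) in results.items():
    print(f"{group}: recall={recall:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Reporting the interval alongside the point estimate makes it visible when a gap between two subgroups is too small to be distinguished from sampling noise.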
Multi-dimensional behavioural characterisation of PeopleNet across 15 contextual properties and 7 perturbation types, with statistical confidence intervals on every estimate.
NVIDIA NIM services integrated into the testing pipeline, demonstrating how NVIDIA services can be composed within the evaluation workflow itself.
Methodology and findings presented at NVIDIA GTC, March 2025.
Fortune 100 enterprise cut hallucination risk in its internal policy RAG
AI System Description
A Retrieval-Augmented Generation (RAG) assistant built to help operations teams navigate dense internal Corporate Sales Policies (CSP). The application queries an extensive internal corpus, including structural policy slides, operational manuals, and FAQs, to generate direct answers alongside their retrieved document contexts.
Client & Context
The client is a Fortune 100 enterprise technology and infrastructure provider. They deployed the AI system to accelerate operational workflows and unify standard operational compliance across regional divisions. Because these guidelines govern high-stakes commercial agreements, verifying absolute system correctness was a critical corporate requirement.
Problem
As the internal corporate policy system expanded, the technical challenge shifted from initial deployment to achieving rigorous, production-grade optimization. Initial validation efforts relied primarily on manual review cycles, which naturally limited the baseline testing scale and created development bottlenecks for the engineering team. Additionally, evaluation depended on standard semantic benchmarks like basic cosine similarity. While effective for high-level monitoring, these standard checks lacked the granular resolution needed to isolate subtle contextual gaps, noise propagation, or text alignment inconsistencies across the pipeline.
What QuantPi Tested
QuantPi independently evaluated the retriever and extractor layers to isolate failure modes across multiple distinct parameters:
- Query Complexity: Drops in stability between simple requests and complex multi-variable inquiries.
- Intent Stratification: Performance variances across diverse question types, including definitions, factoids, and conditional logic instructions.
- Source Structures: Behavior changes across different data layouts, including raw text, formatted lists, and tables.
- Simulated Noise: System resilience against real-world query ambiguity, technical abbreviations, typos, and rephrasings.
How QuantPi Tested
QuantPi deployed its testing framework onto the customer’s secure corporate cluster, managing infrastructure dependencies without requiring external data access. The framework synthetically generated an expanded evaluation dataset of 500 gold-standard Q&A pairs tailored to the application's domain. QuantPi then utilized automated "Delta" (factual overlap) and "Gamma" (novelty) logic filters to audit the generated pipeline, successfully cleaning out 12.5% of data points that lacked factual document grounding. Finally, the platform systematically executed 20 distinct test scenarios via controlled hyperparameter sweeps such as varying chunk sizes, overlap lengths, system prompt behaviors, and retrieval depths.
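The grounding audit described above can be pictured with a much simpler stand-in. The sketch below drops generated Q&A pairs whose answer shares too few tokens with its source passage. It is not the "Delta" or "Gamma" logic itself, whose details the source does not give; the token-overlap score, the 0.6 threshold, and the example texts are all assumptions.

```python
import re

def tokens(text):
    """Lowercased word tokens; a deliberately crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(answer, source):
    """Fraction of answer tokens that also appear in the source passage."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(source)) / len(answer_tokens)

def filter_grounded(pairs, threshold=0.6):
    """Keep only Q&A pairs whose answer overlaps enough with its source."""
    return [p for p in pairs if overlap_score(p["answer"], p["source"]) >= threshold]

# Hypothetical generated Q&A pairs; texts and threshold are illustrative only.
policy_passage = "Discounts above 20 percent need director approval before signing."
pairs = [
    {"question": "What is the approval limit?",
     "answer": "Discounts above 20 percent need director approval.",
     "source": policy_passage},
    {"question": "Who signs renewals?",
     "answer": "Renewals are signed by the regional finance team.",
     "source": policy_passage},
]

kept = filter_grounded(pairs)
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

Here the second pair is removed because nothing in its answer can be traced to the passage it was supposedly generated from.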
47% Relative Improvement in End-to-End Success: Advanced prompt optimization vastly increased pipeline reliability, driving fewer outright failures across complex policy paths.
36% Relative Gain in Faithfulness: Fine-tuning the system drastically minimized hallucination risks, ensuring generated answers remain firmly grounded in original context.
Automated Validation at Scale: Replaced an ad-hoc, 70-sample manual constraint with a scalable, automated framework testing 20 distinct scenarios across 500 validated data points.
Continuous robustness assurance for a vision model in claims processing
AI System Description
Computer vision model that segments vehicle damage from photos to automate insurance claim estimation.
Client & Context
A European technology provider for AI-driven automotive claims management. Computer vision on damage photos sits at the core of their offering, directly driving cost estimation and settlement decisions for partner insurers.
Problem
The model needed to perform reliably across real-world conditions such as rain, night, and varying car angles, but validating robustness required expensive manual data annotation that took weeks per cycle.
What QuantPi Tested
The QuantPi platform was used to test the model's performance in depth and on a continuous basis, enabling improvements to the computer vision model in a matter of hours instead of weeks.
How QuantPi Tested
Automated perturbation testing was run across weather and lighting conditions using synthetic transformations (Stable Diffusion + CLIP for image tagging). Segmentation accuracy (IoU) was then evaluated across the tagged attribute subgroups to isolate where the model failed silently.
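For readers unfamiliar with the metric, the sketch below shows how IoU can be computed for binary masks and averaged per tagged subgroup. The tiny 2x2 masks and the tag names are invented for illustration, and the nested-list mask format is an assumption chosen to keep the example dependency-free.

```python
from collections import defaultdict

def iou(pred_mask, true_mask):
    """Intersection over union for two binary masks given as nested 0/1 lists."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so treat that case as IoU 1.0.
    return intersection / union if union else 1.0

def mean_iou_by_tag(samples):
    """Average IoU per condition tag."""
    scores = defaultdict(list)
    for sample in samples:
        scores[sample["tag"]].append(iou(sample["pred"], sample["true"]))
    return {tag: sum(values) / len(values) for tag, values in scores.items()}

# Hypothetical samples: tag from an image tagger, masks from model and labels.
samples = [
    {"tag": "clear", "pred": [[1, 1], [0, 0]], "true": [[1, 1], [0, 0]]},
    {"tag": "rain", "pred": [[1, 0], [0, 0]], "true": [[1, 1], [0, 0]]},
    {"tag": "rain", "pred": [[0, 0], [0, 0]], "true": [[1, 1], [0, 0]]},
]

by_tag = mean_iou_by_tag(samples)
print(by_tag)
```

Breaking the score down by tag is what exposes a subgroup with low IoU that an overall average would hide.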
Significantly reduced manual effort for Data Scientists
Increased model performance
Reduced claims processing time
European recruiting platform built quantitative evidence of non-discrimination
AI System Description
LLM-based candidate recommender system that automatically generates candidate recommendations for recruiters from job listings.
Client & Context
A major European online recruitment platform connecting millions of job seekers and employers. Its candidate recommender directly shapes who gets surfaced for a role, making non-discrimination both a regulatory and reputational imperative.
Problem
The recommender system needed to be validated for non-discrimination across age, gender, and ethnic origin, and the resulting evidence had to be defensible under Germany's General Equal Treatment Act (AGG) and the EU AI Act's provisions for high-risk employment AI.
What QuantPi Tested
The QuantPi Platform was used to test whether the recommender treated applicants fairly across protected attributes, producing quantitative evidence of non-discrimination at both the individual and group level.
How QuantPi Tested
False negative rates and selection rates were measured across protected attributes (age, gender, ethnic origin) and their perturbations, isolating disparities at both the individual and group level. The quantitative results were then incorporated into a structured assurance case, which was co-developed with TIC partner TÜV AI.Lab and links each piece of evidence to specific legal requirements under the AGG and the EU AI Act.
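The two group-level measures named above can be sketched as follows. The record format, the group labels, and the notion of a "qualified" ground-truth flag are assumptions for illustration; the source does not describe how ground truth was defined.

```python
from collections import defaultdict

def rates_by_group(records):
    """Selection rate and false negative rate per group.

    Each record holds: group, qualified (ground truth), recommended (model output).
    """
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "qualified": 0, "missed": 0})
    for record in records:
        c = counts[record["group"]]
        c["n"] += 1
        c["selected"] += int(record["recommended"])
        if record["qualified"]:
            c["qualified"] += 1
            c["missed"] += int(not record["recommended"])
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "selection_rate": c["selected"] / c["n"],
            # FNR is undefined when a group has no qualified candidates.
            "fnr": c["missed"] / c["qualified"] if c["qualified"] else None,
        }
    return rates

def rec(group, qualified, recommended):
    return {"group": group, "qualified": qualified, "recommended": recommended}

# Hypothetical candidates in two illustrative groups.
records = (
    [rec("A", True, True)] * 3 + [rec("A", True, False)]
    + [rec("B", True, True), rec("B", True, False)] + [rec("B", False, False)] * 2
)

rates = rates_by_group(records)
for group, r in rates.items():
    print(group, r)
```

Comparing these rates between groups gives the group-level view; the individual-level view mentioned in the text would instead compare the system's output for the same candidate before and after perturbing a protected attribute.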
Bias surfaced and reduced
System performance improved
Compliance argument formalized in an assurance case mapped to AGG and EU AI Act
Standards-aligned assurance for an agentic AI claims-processing system
AI System Description
Agentic AI system for end-to-end insurance claims processing. A supervisor agent orchestrates the full claim workflow: authentication, accident classification, policy check, damage estimation, and final payout or repair-shop booking. This is done through conversational interactions with the claimant, including the processing of submitted damage photos.
Client & Context
A European technology provider for AI-driven claims management. Agentic AI sits at the core of their next-generation claims platform, automating end-to-end interactions with claimants on behalf of partner insurers.
Problem
The agent had to be validated against a level of operational complexity that ad-hoc testing could not cover. The customer's existing setup consisted of 32 hand-crafted scenarios and 17 metrics on a Databricks foundation, with only 2 formal test criteria. That was too narrow to give the QA team confidence over an end-to-end agent that handles authentication, classification, damage estimation, payout and repair-shop booking while supporting varying user behaviours and accident types. Test costs were also opaque, making it hard to plan stability analysis or larger off-hours runs.
What QuantPi Tested
QuantPi built and operated an end-to-end, standards-aligned assurance pipeline for the customer's agentic system. It covered scenario authoring, parametrised variation generation, simulation execution, trace property extraction, metric evaluation, and report generation. This pipeline was integrated into the customer's CI/CD environment for nightly runs.
How QuantPi Tested
41 base scenarios provided by the customer's QA team were parameterised across user identity, user kindness, time of accident, accident object, and user evasiveness, yielding 1,200 derived test scenarios. Each scenario carried explicit success criteria modelled as conditions over trace properties, referencing a function library that grew to 85 metrics over the engagement and a pipeline extracting more than 350 trace state properties. Risk was assessed against 8 criteria adapted from ISO/IEC 25010 and ISTQB, spanning functional completeness and appropriateness, time behaviour and resource utilisation, probabilistic and non-deterministic behaviour, functional robustness, ethical bias (inclusivity) and technical bias. Per-test and per-batch token costs were quantified, making testing depth a controllable parameter rather than an opaque one.
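A minimal sketch of the two ideas in this paragraph, deriving scenarios from parameter axes and expressing success criteria as conditions over trace properties, might look like the following. The parameter values, scenario names, trace fields, and the turn budget of 12 are hypothetical and far smaller than the real setup.

```python
from itertools import product

# Hypothetical parameter axes; names follow the text, values are illustrative.
parameters = {
    "user_kindness": ["polite", "rude"],
    "user_evasiveness": ["direct", "evasive"],
    "accident_object": ["car", "wall", "animal"],
}

def derive_scenarios(base_scenarios, parameters):
    """Cross every base scenario with every combination of parameter values."""
    names = list(parameters)
    derived = []
    for base in base_scenarios:
        for values in product(*(parameters[name] for name in names)):
            derived.append({"base": base, **dict(zip(names, values))})
    return derived

# Success criteria written as predicates over extracted trace properties.
criteria = {
    "user_authenticated": lambda trace: trace["authenticated"] is True,
    "within_turn_budget": lambda trace: trace["turns"] <= 12,
}

def evaluate(trace, criteria):
    """Return a pass/fail verdict per criterion for one simulated run."""
    return {name: check(trace) for name, check in criteria.items()}

scenarios = derive_scenarios(["rear_end_collision", "parking_damage"], parameters)
trace = {"authenticated": True, "turns": 15}  # hypothetical extracted properties
verdict = evaluate(trace, criteria)

print(len(scenarios), "derived scenarios")
print(verdict)
```

Because the number of derived scenarios is the product of the axis sizes, adding or trimming parameter values is one direct way to trade testing depth against token cost.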
Coverage and depth scaled significantly: from 32 to 1,200 test scenarios, from 17 to 85 metrics, and from 2 to 8 standards-aligned criteria drawn from ISO/IEC 25010 and ISTQB.
Continuous testing integrated into the customer's CI/CD pipeline, with testing depth and cost made measurable and controllable per test and per batch.
20 system issues surfaced through the engagement; the customer's tester team was enabled to operate the pipeline autonomously, supported by reusable Jupyter notebooks and recorded onboarding sessions.
Black-box assurance for Unique's agentic AI investment research assistant
AI System Description
Agentic RAG system for private banking. The Investment Research Assistant helps relationship managers query stock universes, analyze fact sheets, and generate tailored investment recommendations for end clients.
Client & Context
Unique is the vertical leader in agentic AI for financial services, serving 40+ blue-chip institutions including LGT Private Banking, Pictet Group, and Julius Baer. As one of the first European companies certified to ISO 42001, they hold their AI systems to a deliberately high assurance bar.
Problem
Investment recommendations have direct financial and regulatory consequences for end clients. Unique needed rigorous, independent evidence that their agentic AI assistant performed reliably across accuracy, hallucination, and robustness, including under realistic query variations such as typos, domain shifts, and varying complexity. In addition, the underlying retrieval layer had to surface the right documents.
What QuantPi Tested
The QuantPi Platform was used to assess the system at two layers: the agentic Investment Research Assistant itself, and the Document Search subsystem that feeds it. The assessment focused on accuracy, hallucination/faithfulness, and robustness under realistic query conditions.
How QuantPi Tested
For the agent: Cosine Similarity measured response accuracy against ground truth, and a claim-decomposition Faithfulness metric (LLM-as-a-judge) quantified hallucination, both evaluated across varied query lengths, types, domains, complexities, and typo perturbations. For the retrieval layer: Word Overlap Rate, Mean Reciprocal Rank, and Lenient Retrieval Accuracy were computed across diverse queries to surface inaccurate and irrelevant results. Tests ran on ~200 samples (real + perturbed) in a secure staging environment, with QuantPi's Platform generating reusable test scenarios that can be re-applied to future model versions or alternate LLM providers.
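Two of the metrics named above have standard textbook definitions, sketched here in plain Python. The embedding vectors and document identifiers are made up, and how responses were embedded for the cosine comparison is not stated in the source.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant document per query (0 if none)."""
    total = 0.0
    for retrieved, gold in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Hypothetical retrieval output for three queries and their relevant documents.
ranked_results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8"]]
relevant = [{"d1"}, {"d5"}, {"d9"}]

mrr = mean_reciprocal_rank(ranked_results, relevant)
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"MRR={mrr:.2f}, cosine(same)={same_direction:.1f}, cosine(orthogonal)={orthogonal:.1f}")
```

The three queries contribute 1, 1/2, and 0 respectively, so a single query with no relevant document retrieved pulls the mean down noticeably.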
Quantitative evidence on accuracy, hallucination, and robustness for both the agent and its retrieval layer
Reusable, provider-agnostic test infrastructure; scenarios transfer across LLM swaps and future versions
Methodology and findings published as a pilot case study by Singapore's AI Verify Foundation
Quality assurance for Violet AI's RAG system in life-sciences manufacturing
AI System Description
LLM + RAG system for factory-floor troubleshooting in life-sciences manufacturing. The assistant gives line workers instant access to a large internal knowledge base, enabling rapid identification and resolution of production issues.
Client & Context
Violet AI builds LLM-based assistants for factory-floor troubleshooting, with a focus on the life-sciences sector. Because errors in manufacturing guidance carry both regulatory and operational consequences, system reliability is a core product requirement rather than an afterthought.
Problem
The RAG system needed to perform reliably across diverse troubleshooting scenarios: accurate retrieval from an extensive knowledge base, faithful LLM responses free of misinformation, and the level of evidence required to operate confidently in a regulated environment. Achieving this called for rigorous, repeatable evaluation across the full RAG pipeline, not ad-hoc spot-checks.
What QuantPi Tested
The QuantPi platform was used for end-to-end validation of Violet AI's RAG system, assessing both the retrieval layer and the LLM generation step against the operational diversity the system would encounter in production.
How QuantPi Tested
Retrieval performance was assessed using Lenient Accuracy, Strict Accuracy, and Mean Reciprocal Rank to verify that the system surfaced the right documents from Violet AI's knowledge base. LLM faithfulness was quantified using metrics from the RAGAs framework combined with QuantPi's own out-of-the-box faithfulness metrics, ensuring responses stayed grounded in retrieved context.
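The source does not define Lenient Accuracy or Strict Accuracy, so the sketch below rests on one plausible reading, stated here as an assumption: a query counts as a lenient hit if at least one relevant document appears in the top k results, and as a strict hit only if every relevant document does. Document identifiers and the cutoff k=3 are invented.

```python
def lenient_hit(retrieved, gold, k):
    """Assumed definition: at least one relevant document in the top k."""
    return any(doc in gold for doc in retrieved[:k])

def strict_hit(retrieved, gold, k):
    """Assumed definition: every relevant document appears in the top k."""
    top_k = retrieved[:k]
    return bool(gold) and all(doc in top_k for doc in gold)

def accuracy(hit_fn, runs, k=3):
    """Share of queries that count as a hit under the given definition."""
    return sum(hit_fn(run["retrieved"], run["gold"], k) for run in runs) / len(runs)

# Hypothetical retrieval runs against a troubleshooting knowledge base.
runs = [
    {"retrieved": ["sop_12", "sop_07", "faq_3"], "gold": {"sop_12", "sop_07"}},
    {"retrieved": ["sop_01", "faq_9", "sop_44"], "gold": {"sop_44", "sop_45"}},
    {"retrieved": ["faq_1", "faq_2", "faq_4"], "gold": {"sop_30"}},
    {"retrieved": ["sop_30", "faq_2", "faq_4"], "gold": {"sop_30"}},
]

lenient = accuracy(lenient_hit, runs)
strict = accuracy(strict_hit, runs)
print(f"lenient={lenient:.2f}, strict={strict:.2f}")
```

Under these assumed definitions the gap between the two numbers indicates how often the retriever finds some, but not all, of the documents a question needs.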
Optimized resource allocation: automated evaluation reduced manual effort for data scientists
Delivery of a best-in-class solution to demanding life-sciences customers
Enhanced system performance: improved accuracy and reliability across the RAG pipeline
