Testing approach

How QuantPi validates object detection systems

Domain information and operational design domain (ODD)

Every validation cycle defines an explicit operational envelope: input document topology (e.g., multi-column slides, process manuals, FAQs), query distribution patterns, structural chunking constraints, and target retrieval parameters.

Dimensional Decomposition

The safety- and quality-relevant dimensions derived from the operating domain cluster into six axes along which performance is measured:
• Subcomponent alignment (isolating retrieval weaknesses from generative flaws)
• Context relevance (precision and completeness of retrieved text chunks)
• Faithfulness (factual grounding of the output within the source context)
• Answer relevance (direct alignment of the final response to user intent)
• Robustness to query perturbations (stability under typos, semantic drift, and rephrasing)
• Retrieval drift (performance consistency across expanding knowledge bases over time)
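
To make the context relevance axis concrete, the sketch below computes precision and recall over retrieved chunk IDs. The function name, the chunk IDs, and the set-based definition are assumptions chosen for illustration; they are not taken from the QuantPi platform.

```python
def context_precision_recall(retrieved, relevant):
    """Set-based precision and recall over chunk IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved chunks that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant chunks that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: four chunks retrieved, three chunks labeled relevant.
precision, recall = context_precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c2", "c4", "c7"],
)
```

Here two of the four retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall 2/3).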

Acceptance Criteria

Acceptance criteria are predefined, measurable performance thresholds that the system must meet on each tested dimension to qualify for deployment. For RAG use cases, these are typically evaluation metrics (including faithfulness, recall, and precision scores) measured per knowledge domain, with stricter thresholds enforced on compliance-heavy or high-risk topics.

Deployment Decision

Models meeting all critical thresholds are cleared for deployment; partial failures trigger "guided development" by localizing the exact subcomponent (retriever or generator) requiring optimization.
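
One way to picture the threshold check and the localization step is the minimal sketch below. The metric names, the metric-to-subcomponent mapping, the domains, and all numbers are hypothetical and serve only as an illustration under those assumptions.

```python
from dataclasses import dataclass

# Hypothetical mapping from each metric to the subcomponent it mainly measures.
METRIC_OWNER = {
    "context_precision": "retriever",
    "context_recall": "retriever",
    "faithfulness": "generator",
    "answer_relevance": "generator",
}


@dataclass
class GateResult:
    cleared: bool
    failing: dict  # subcomponent -> list of (domain, metric, score, threshold)


def deployment_gate(scores, thresholds):
    """scores and thresholds are {domain: {metric: value}}; every threshold must be met."""
    failing = {}
    for domain, required in thresholds.items():
        for metric, minimum in required.items():
            # A missing score is treated as 0.0.
            score = scores.get(domain, {}).get(metric, 0.0)
            if score < minimum:
                owner = METRIC_OWNER.get(metric, "unknown")
                failing.setdefault(owner, []).append((domain, metric, score, minimum))
    return GateResult(cleared=not failing, failing=failing)


# Made-up thresholds: the compliance domain is held to stricter values.
thresholds = {
    "general_faq": {"context_recall": 0.80, "faithfulness": 0.85},
    "compliance": {"context_recall": 0.90, "faithfulness": 0.95},
}
# Made-up evaluation scores.
scores = {
    "general_faq": {"context_recall": 0.88, "faithfulness": 0.91},
    "compliance": {"context_recall": 0.93, "faithfulness": 0.92},
}
result = deployment_gate(scores, thresholds)
```

With these illustrative numbers the system is not cleared: the only failure is faithfulness in the compliance domain, which the mapping attributes to the generator rather than the retriever.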

Driving Real-World Impact with Trusted AI

Real-world examples of how companies use QuantPi to build trustworthy AI, from identifying weaknesses to achieving reliable, production-grade performance.

Car damage detection for insurance claims

A computer vision model that analyzes vehicle images to identify and segment damage, enabling automated insurance coverage estimates for customers.

Client/Partner

ControlExpert

Challenge

Ensuring the robustness of a car damage segmentation model under diverse real-world conditions (such as weather, time of day, and car pose) is difficult and requires costly manual data annotation.

Assessment Scope

The QuantPi platform was used to test the model's performance in depth and to enable continuous improvements to the computer vision model in a matter of hours instead of weeks.

Metrics

Automated weather perturbation testing (e.g., day/night, clear/rain) was implemented using Stable Diffusion to generate the perturbed images and CLIP models to identify outdoor images.

Data annotation was automated by using a CLIP model to dynamically tag images with relevant attributes (e.g., car pose, indoor/outdoor, raindrops).

Model performance was evaluated using IoU across tagged attributes to identify weaknesses and guide targeted improvements.
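
The per-attribute IoU evaluation can be sketched as follows. This is a minimal illustration, assuming binary masks flattened into 0/1 sequences and a set of tags per image; the function names and the toy data are hypothetical and do not reflect the actual pipeline.

```python
from collections import defaultdict


def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return intersection / union if union else 1.0  # two empty masks agree


def mean_iou_by_tag(samples):
    """samples: iterable of (pred_mask, true_mask, tags); returns {tag: mean IoU}."""
    per_tag = defaultdict(list)
    for pred_mask, true_mask, tags in samples:
        score = iou(pred_mask, true_mask)
        for tag in tags:
            per_tag[tag].append(score)
    return {tag: sum(values) / len(values) for tag, values in per_tag.items()}


# Toy data: (predicted mask, ground-truth mask, attribute tags).
samples = [
    ([1, 1, 0, 0], [1, 1, 0, 0], {"outdoor", "clear"}),
    ([1, 1, 0, 0], [1, 0, 0, 0], {"outdoor", "rain"}),
    ([0, 1, 1, 0], [1, 1, 0, 0], {"indoor"}),
]
report = mean_iou_by_tag(samples)
weakest = min(report, key=report.get)  # attribute with the lowest mean IoU
```

Grouping the same IoU scores by tag is what exposes a weak slice: in the toy data the mean IoU is lowest for the "indoor" tag, which would be the first candidate for targeted improvement.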

Business Impact

- Reduced manual effort for data scientists
- Increased model performance
- Reduced claims processing time

RAG & LLM for Factory Troubleshooting in Life Sciences

An LLM- and RAG-based AI system for factory troubleshooting that helps line workers quickly identify the root cause of issues and resolve them faster.

Client/Partner

Violet AI

Challenge

Violet AI needed to demonstrate the quality and reliability of their RAG-based AI systems, particularly regarding document retrieval accuracy and response faithfulness, to meet compliance demands in the life sciences industry, including the EU AI Act.

Assessment Scope

The QuantPi platform was leveraged to address these challenges, rigorously evaluating the RAG and LLM systems' performance and preparing Violet AI for compliance.

Metrics

We evaluated RAG system quality using metrics for retrieval accuracy (LAcc, SAcc, MRR) and for LLM faithfulness and hallucination (based on the RAGAs framework and QuantPi's existing out-of-the-box metrics).
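
Of the retrieval metrics named above, Mean Reciprocal Rank (MRR) has a standard definition: the average over queries of 1/rank of the first relevant document, counting 0 when none is retrieved. The sketch below implements that definition; the function name, document IDs, and data are hypothetical, and LAcc and SAcc are left out because the text does not define them.

```python
def mean_reciprocal_rank(ranked_results, relevant_ids):
    """ranked_results: one ranked list of doc IDs per query.
    relevant_ids: one set of relevant doc IDs per query."""
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results) if ranked_results else 0.0


# Toy data: first relevant hit at rank 2, at rank 1, and not retrieved at all.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1"}, {"d2"}, {"d0"}]
mrr = mean_reciprocal_rank(ranked, relevant)
```

For the toy data the reciprocal ranks are 1/2, 1, and 0, so the MRR is 0.5.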

Business Impact

- Reduced manual effort for data scientists
- Increased model performance
- Reduced claims processing time

AI-Powered Investment Research Assistant

Agentic workflow using LLMs as an orchestrator for summarization, multi-turn chatbots, and data extraction from unstructured sources.

Client/Partner

Unique

Challenge

The system faced significant risks of failing to meet regulatory requirements or internal company guidelines (e.g., recommending restricted products). Other critical concerns included the potential oversharing of client data, inconsistent advice quality across different customer segments, and the risk of poor financial outcomes.

Assessment Scope

QuantPi provided a comprehensive, black-box AI testing technology to evaluate the model for performance, bias, robustness, and data quality. The pilot testing focused specifically on three core areas: accuracy, hallucination, and reliability.

Metrics

Hallucination Risks: Measured via a faithfulness metric to ensure responses were grounded in the provided context.
Document Search Risks: Assessed using Word Overlap Rate, Mean Reciprocal Rank (MRR), and Lenient Retrieval Accuracy.
Reliability: Tested across various query difficulty levels, domain bias, and typo tolerance.
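
A typo tolerance check of the kind listed under Reliability could look like the sketch below: each query is perturbed with one adjacent-character swap, and the test reports how often the top-ranked document stays the same. The typo model, the stability measure, and the toy keyword retriever are all assumptions made for illustration, not the method used in the pilot.

```python
import random


def add_typo(query, rng):
    """Swap two adjacent characters at a random position (a simple typo model)."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def typo_stability(retrieve, queries, seed=0):
    """Fraction of queries whose top-ranked document is unchanged after one typo."""
    if not queries:
        return 1.0
    rng = random.Random(seed)  # fixed seed keeps the test repeatable
    stable = sum(retrieve(q)[0] == retrieve(add_typo(q, rng))[0] for q in queries)
    return stable / len(queries)


# Hypothetical stand-in for a real retriever: ranks documents by shared words.
DOCS = {
    "fees": "fund management fees and costs",
    "risk": "portfolio risk and volatility",
}


def toy_retrieve(query):
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(DOCS[d].split())))


stability = typo_stability(toy_retrieve, ["management fees", "portfolio risk"])
```

In practice `toy_retrieve` would be replaced by a call to the system under test, and the same loop could be repeated per query difficulty level or domain to compare stability across slices.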

Business Impact

- Risk Mitigation: Identified and addressed risks related to advisor misuse and inadequate clarity in investment recommendations.
- Performance Assurance: Guaranteed the grounding of AI-generated advice in provided financial contexts to prevent misinformation.

CV Matching System

StepStone develops an LLM-based candidate recommender system which automatically provides recruiters with candidate recommendations purely based on job listings.

Client/Partner

StepStone, TÜV AI.Lab

Challenge

StepStone needed to validate and justify the fairness of their LLM-based candidate recommender system, particularly regarding non-discrimination of applicants based on age, gender, and ethnic origin.

Assessment Scope

The QuantPi Platform was used to assess the system's false negative and selection rates with regard to the sensitive attributes and their perturbations, resulting in quantitative test results on non-discrimination at the individual and group level.
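
At the group level, the two rates mentioned above can be computed per sensitive-attribute group as in the sketch below. The record layout, the group labels, and the toy data are hypothetical; the selection rate is the share of candidates recommended, and the false negative rate is the share of qualified candidates who were not recommended.

```python
from collections import defaultdict


def group_rates(records):
    """records: iterable of (group, qualified, recommended) with boolean flags.

    Returns {group: (selection_rate, false_negative_rate)}.
    """
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "qualified": 0, "missed": 0})
    for group, qualified, recommended in records:
        c = counts[group]
        c["n"] += 1
        c["selected"] += int(recommended)
        c["qualified"] += int(qualified)
        c["missed"] += int(qualified and not recommended)
    rates = {}
    for group, c in counts.items():
        selection_rate = c["selected"] / c["n"]
        fnr = c["missed"] / c["qualified"] if c["qualified"] else 0.0
        rates[group] = (selection_rate, fnr)
    return rates


# Toy data: (group, candidate is qualified, candidate was recommended).
records = [
    ("A", True, True), ("A", True, False), ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", True, True), ("B", False, False), ("B", True, False),
]
rates = group_rates(records)
selection_gap = abs(rates["A"][0] - rates["B"][0])
```

In the toy data group A has a selection rate of 0.5 and group B of 0.25, a gap of 0.25. This sketch covers only the group-level comparison; an individual-level test would instead perturb the sensitive attribute of a single candidate and check whether the recommendation changes.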

Metrics

An assurance case was built together with TIC partner TÜV AI.Lab. This document serves as a structured line of argumentation and links the quantitative evidence generated in the QuantPi Platform to applicable legal requirements under Germany's Equal Treatment Act (AGG) and future provisions of the EU AI Act.

Business Impact

- Reduced bias and discrimination
- Increased system performance
- Compliance documented in assurance case

Retrieval-Augmented Generation (RAG) Testing

What output-based metrics conceal when they do not take components into consideration:

QuantPi's evaluation of object detection systems produces:

See QuantPi's continuous robustness assurance for a computer vision model in automotive claims processing
