testing approach

How QuantPi validates multi-modal AI systems

Domain Information

Every assessment defines an explicit operational design domain (ODD) per modality and across modality combinations: per-modality input distributions, target tasks, expected fusion behaviour, and modality-specific degradation budgets.

Dimensional Decomposition

Multi-modal behaviour is characterised across six dimensions spanning within-modality and cross-modality axes:
• Per-modality input quality: Resolution, length, noise level evaluated independently per modality.
• Modality-specific perturbations: Visual blur, audio noise, text typos applied to each input stream.
• Cross-modal coherence: System consistency when modalities agree vs. when they disagree.
• Modality weighting: Whether the system over- or under-relies on each modality relative to expected behaviour.
• Task-type sensitivity across modality combinations: Descriptive, explanatory, predictive, counterfactual reasoning.
• Robustness to single-modality shifts: Behaviour stability when one input degrades while others remain intact.

To eliminate evaluation bias, testing leverages AI-driven user simulation models to stress-test execution boundaries. All performance scores are strictly reported as a Metric + Confidence Interval pair to statistically quantify uncertainty stemming from data constraints or stochastic model environments.
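
As an illustration of the Metric + Confidence Interval pairing, the sketch below computes a percentile bootstrap interval around mean accuracy. The function name bootstrap_ci, the 95% level, the resample count, and the sample data are assumptions made for this example and do not describe QuantPi's implementation.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap.

    Hypothetical helper for illustration only.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)
    n = len(scores)
    point = sum(scores) / n
    means = []
    for _ in range(n_resamples):
        # Resample the per-sample scores with replacement.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return point, means[lo_idx], means[hi_idx]

# Made-up per-sample correctness flags (1 = correct, 0 = incorrect).
per_sample_correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
metric, lower, upper = bootstrap_ci(per_sample_correct)
print(f"accuracy = {metric:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With few samples the interval is wide, which is the uncertainty from data constraints that the pair is meant to expose.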

Acceptance Criteria

Acceptance criteria are predefined, measurable thresholds the system must meet per modality and per cross-modal slice, with stricter thresholds enforced where modality combinations are safety- or compliance-critical.

Deployment Decision

Systems meeting all per-modality and cross-modal thresholds are cleared for deployment; partial failures localise whether the root cause sits in a single modality, in the integration layer, or in a specific modality intersection.
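
A minimal sketch of how per-slice thresholds could feed a deployment decision is shown below. Everything in it is assumed for illustration: the SliceResult type, the "+" naming convention for cross-modal slices, the choice to compare the lower confidence bound against the threshold, and the simple localisation heuristic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliceResult:
    name: str           # e.g. "image", "text", or "image+text" (assumed naming)
    lower_bound: float  # lower end of the metric's confidence interval
    threshold: float    # predefined acceptance threshold for this slice

def deployment_decision(results):
    """Return (verdict, failed_slice_names) using an assumed heuristic."""
    failed = [r.name for r in results if r.lower_bound < r.threshold]
    if not failed:
        return "cleared for deployment", failed
    cross = [r.name for r in results if "+" in r.name]
    failed_single = [n for n in failed if "+" not in n]
    failed_cross = [n for n in failed if "+" in n]
    if failed_single:
        return "root cause likely in a single modality", failed
    if len(failed_cross) == len(cross) and len(cross) > 1:
        # Every cross-modal slice fails while each modality passes alone.
        return "root cause likely in the integration layer", failed
    return "root cause likely in a specific modality intersection", failed

# Made-up results; the image+text slice carries a stricter threshold.
results = [
    SliceResult("image", 0.92, 0.90),
    SliceResult("text", 0.95, 0.90),
    SliceResult("audio", 0.91, 0.85),
    SliceResult("image+text", 0.88, 0.95),
    SliceResult("image+audio", 0.90, 0.85),
]
verdict, failed = deployment_decision(results)
print(verdict, failed)
```

Comparing the lower bound rather than the point estimate is a conservative choice: a slice passes only if the metric clears its threshold even at the pessimistic end of the interval.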

Driving Real-World Impact with Trusted AI

Real-world examples of how companies use QuantPi to build trustworthy AI — from identifying weaknesses to achieving reliable, production-grade performance

Car damage detection for insurance claims

A computer vision model analyzes vehicle images to identify and segment damage, enabling automated insurance coverage estimates for customers.

Client/Partner

ControlExpert

Challenge

Ensuring the robustness of a car damage segmentation model under diverse real-world conditions (such as weather, time of day, and car pose) is difficult and requires costly manual data annotation.

Assessment Scope

The QuantPi platform was used to test the model’s performance in depth and to enable continuous improvements to the computer vision model within hours instead of weeks.

Metrics

Automated weather perturbation testing (e.g. day/night, clear/rain) was implemented using Stable Diffusion, with CLIP models used to identify outdoor images.

Data annotation was automated by using a CLIP model to dynamically tag images with relevant attributes (e.g. car pose, indoor/outdoor, raindrops).

Model performance was evaluated using IoU across the tagged attributes to identify weaknesses and guide targeted improvements.
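
The per-attribute IoU evaluation can be sketched as follows. Masks are represented as flat lists of 0/1 pixel labels, and the sample structure, tag names, and values are invented for illustration. The sketch does not cover the CLIP tagging or Stable Diffusion steps.

```python
from collections import defaultdict

def iou(pred_mask, true_mask):
    """Intersection over union for two equally sized binary masks."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

def mean_iou_by_tag(samples):
    """Average IoU over all samples carrying each attribute tag."""
    scores = defaultdict(list)
    for sample in samples:
        score = iou(sample["pred"], sample["true"])
        for tag in sample["tags"]:
            scores[tag].append(score)
    return {tag: sum(v) / len(v) for tag, v in scores.items()}

# Made-up samples: predicted mask, ground-truth mask, attribute tags.
samples = [
    {"pred": [1, 1, 0, 0], "true": [1, 1, 0, 0], "tags": ["outdoor", "clear"]},
    {"pred": [1, 0, 0, 0], "true": [1, 1, 0, 0], "tags": ["outdoor", "rain"]},
    {"pred": [0, 1, 1, 0], "true": [0, 1, 1, 1], "tags": ["indoor"]},
]
by_tag = mean_iou_by_tag(samples)
for tag, score in sorted(by_tag.items(), key=lambda kv: kv[1]):
    print(f"{tag}: {score:.2f}")
```

Sorting by score puts the weakest attribute slices first, which is where targeted improvements would start.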

Business Impact

- Reduced manual effort for data scientists
- Increased model performance
- Reduced claims processing time

RAG & LLM for Factory Troubleshooting in Life Sciences

An LLM- and RAG-based AI system for factory troubleshooting that helps line workers identify the root cause of issues quickly and resolve them faster.

Client/Partner

Violet AI

Challenge

Violet AI needed to demonstrate the quality and reliability of their RAG-based AI systems, particularly regarding document retrieval accuracy and response faithfulness, to meet compliance demands in the life sciences industry, including the EU AI Act.

Assessment Scope

The QuantPi platform was leveraged to address these challenges, rigorously evaluating the RAG and LLM systems' performance and preparing Violet AI for compliance.

Metrics

We evaluated RAG system quality using metrics for retrieval accuracy (LAcc, SAcc, MRR) and for LLM faithfulness/hallucination (based on the RAGAs framework and QuantPi's existing out-of-the-box metrics).
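
The retrieval metrics can be illustrated with a short sketch. MRR follows its standard definition (the mean of 1/rank of the first relevant document). The source does not define LAcc and SAcc, so the lenient and strict accuracy definitions below are assumptions: lenient counts a query as correct if any relevant document appears in the top k, strict only if all of them do. Document ids and k are made up.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def retrieval_metrics(queries, k=3):
    """queries: list of (ranked_ids, relevant_ids) pairs."""
    n = len(queries)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / n
    # Assumed definition: at least one relevant document in the top k.
    lenient = sum(1 for r, rel in queries if set(r[:k]) & set(rel)) / n
    # Assumed definition: every relevant document in the top k.
    strict = sum(1 for r, rel in queries if set(rel) <= set(r[:k])) / n
    return {"mrr": mrr, "lenient_acc": lenient, "strict_acc": strict}

# Made-up retrieval results for three queries.
queries = [
    (["d1", "d2", "d3"], {"d1"}),
    (["d4", "d5", "d6"], {"d5", "d9"}),
    (["d7", "d8", "d2"], {"d3"}),
]
metrics = retrieval_metrics(queries)
print(metrics)
```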

Business Impact

- Reduced manual effort for data scientists
- Increased model performance
- Reduced claims processing time

AI-Powered Investment Research Assistant

Agentic workflow using LLMs as an orchestrator for summarization, multi-turn chatbots, and data extraction from unstructured sources.

Client/Partner

Unique

Challenge

The system risked failing to meet regulatory requirements or internal company guidelines (e.g., by recommending restricted products). Other critical concerns included potential oversharing of client data, inconsistent advice quality across customer segments, and the risk of poor financial outcomes.

Assessment Scope

QuantPi provided a comprehensive black-box AI testing technology to evaluate the model for performance, bias, robustness, and data quality. The pilot testing focused on three core areas: accuracy, hallucination, and reliability.

Metrics

Hallucination Risks: Measured via a faithfulness metric to ensure responses were grounded in the provided context.
Document Search Risks: Assessed using Word Overlap Rate, Mean Reciprocal Rank (MRR), and Lenient Retrieval Accuracy.
Reliability: Tested across various query difficulty levels, domain bias, and typo tolerance.
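
A typo-tolerance check of the kind listed under Reliability could look like the sketch below. The perturbation (swapping adjacent letters), the stability score, and the toy_system stand-in are all assumptions for illustration. A real test would call the system under evaluation and compare answers with a task-appropriate metric rather than exact equality.

```python
import random

def add_typos(text, rate=0.1, seed=0):
    """Swap adjacent letters at roughly `rate` of eligible positions."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

def typo_tolerance(system, queries, rate=0.1, seed=0):
    """Fraction of queries whose output is unchanged after typo injection."""
    stable = 0
    for idx, query in enumerate(queries):
        if system(query) == system(add_typos(query, rate, seed + idx)):
            stable += 1
    return stable / len(queries)

def toy_system(query):
    """Hypothetical stand-in: routes a query by keyword."""
    return "fees" if "fee" in query.lower() else "other"

queries = [
    "What is the management fee for this fund?",
    "Summarise the latest quarterly report.",
    "Are there any entry fees?",
]
score = typo_tolerance(toy_system, queries, rate=0.2)
print(f"typo tolerance: {score:.2f}")
```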

Business Impact

- Risk Mitigation: Identified and addressed risks related to advisor misuse and inadequate clarity in investment recommendations.
- Performance Assurance: Guaranteed the grounding of AI-generated advice in provided financial contexts to prevent misinformation.

CV Matching System

StepStone develops an LLM-based candidate recommender system which automatically provides recruiters with candidate recommendations purely based on job listings.

Client/Partner

StepStone, TÜV AI.LAB

Challenge

StepStone needed to validate and justify the fairness of its LLM-based candidate recommender system, particularly regarding non-discrimination against applicants based on age, gender, and ethnic origin.

Assessment Scope

The QuantPi Platform was used to assess the system's false negative and selection rates with regard to the sensitive attributes and their perturbations, yielding quantitative test results on non-discrimination at the individual and group level.
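
The group-level part of such an assessment can be sketched with per-group selection and false negative rates. The record layout, group labels, and data below are invented for illustration. Individual-level tests, which perturb a sensitive attribute on a single profile, are not shown.

```python
def group_rates(records):
    """records: iterable of (group, selected, qualified) with boolean flags.

    selection_rate      = selected / all candidates in the group
    false_negative_rate = qualified but not selected / qualified in the group
    """
    stats = {}
    for group, selected, qualified in records:
        s = stats.setdefault(group, {"n": 0, "selected": 0, "qualified": 0, "missed": 0})
        s["n"] += 1
        s["selected"] += int(selected)
        if qualified:
            s["qualified"] += 1
            s["missed"] += int(not selected)
    return {
        g: {
            "selection_rate": s["selected"] / s["n"],
            "false_negative_rate": (s["missed"] / s["qualified"]) if s["qualified"] else None,
        }
        for g, s in stats.items()
    }

# Made-up recommender outcomes for two hypothetical groups.
records = [
    ("A", True, True), ("A", True, True), ("A", False, True), ("A", False, False),
    ("B", True, True), ("B", False, True), ("B", False, True), ("B", False, False),
]
rates = group_rates(records)
for group, r in rates.items():
    print(group, r)
```

Large gaps between groups on either rate would be the quantitative signal to investigate. What counts as an acceptable gap is a policy and legal question that the code does not answer.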

Metrics

An assurance case was built together with TIC partner TÜV AI.LAB. This document serves as a structured line of argumentation and links the quantitative evidence generated in the QuantPi Platform to applicable legal requirements under Germany’s General Equal Treatment Act (AGG) and future provisions of the EU AI Act.

Business Impact

- Reduced bias and discrimination
- Increased system performance
- Compliance documented in assurance case

QuantPi's multi-modal evaluation produces:

See how QuantPi's multi-modal testing capabilities have been leveraged by enterprise customers.