What are typical RAG system failure modes?

Typical failure modes include: Retriever and Extractor Omissions (failing to locate relevant chunks in the source corpus, or ignoring valid retrieved content during generation), Noise Propagation (compounding errors when the extraction layer fails to filter out irrelevant retrieved context), and Knowledge Bias & Hallucinations (generating ungrounded or fabricated factual assertions drawn from the model's internal weights rather than from retrieved evidence).

How does QuantPi validate RAG systems?

QuantPi validates RAG systems through four steps: 1) Domain Information and Operational Design Domain - defining an explicit operational design domain, including input document topology, query distribution patterns, and target retrieval parameters. 2) Dimensional Decomposition - breaking the system down into components for isolated testing. 3) Acceptance Criteria - predefined, measurable performance thresholds, including faithfulness, recall, and precision scores. 4) Deployment Decision - models meeting all critical thresholds are cleared for deployment; partial failures trigger guided development.

What does QuantPi's RAG evaluation produce?

QuantPi's RAG evaluation produces: a root-cause diagnostic map that isolates end-to-end pipeline failure points across atomic information-flow states; a recommendation with concrete next steps per failure mode (prompt, chunk size, retriever, reranker, model), sliced by relevant metadata; and a traceable, repeatable evidence package supporting deployment decisions and compliance tracking.

testing approach

How QuantPi validates RAG systems

Domain Information and Operational Design Domain

Every validation cycle defines an explicit operational design domain (ODD): input document topology (e.g. multi-column slides, process manuals, FAQs), query distribution patterns, structural chunking constraints, and target retrieval parameters.

Dimensional Decomposition

The safety- and quality-relevant dimensions derived from the ODD cluster into six axes along which performance is measured:
• Subcomponent alignment: Isolating retrieval weaknesses from generative flaws.
• Context relevance: Precision and completeness of retrieved text chunks.
• Faithfulness: Factual grounding of the output within the source context.
• Answer relevance: Direct alignment of the final response to user intent.
• Robustness to query perturbations: Stability under typos, semantic drift, abbreviations and rephrasing.
• Retrieval drift: Performance consistency across expanding knowledge bases over time.
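As an illustration of the context relevance axis, the following is a minimal sketch of chunk-level precision and recall against a hand-labeled set of relevant chunks. The function name and the chunk IDs are hypothetical and are not taken from the QuantPi platform; it assumes each query comes with a labeled set of relevant chunk IDs.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) of retrieved chunk IDs against a labeled relevant set."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved chunks that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant chunks that were retrieved (completeness).
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 chunks retrieved, 3 labeled relevant, 2 in common.
precision, recall = context_precision_recall(
    ["c1", "c2", "c3", "c4"], ["c2", "c4", "c7"]
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scoring retrieval separately from generation in this way is what allows a retrieval weakness to be told apart from a generative flaw.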

Acceptance Criteria

Acceptance criteria are predefined, measurable performance thresholds the system must meet on each tested dimension to qualify for deployment. For RAG use cases, these are typically evaluation metrics (including faithfulness, recall, and precision scores) measured per knowledge domain, with stricter thresholds enforced on compliance-heavy or high-risk topics.
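A minimal sketch of how per-domain acceptance criteria could be encoded and checked is shown below. The threshold values and the domain names ("general_faq", "regulatory") are illustrative assumptions, not values used by QuantPi.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Thresholds:
    """Minimum acceptable score per metric (all values are assumptions)."""
    faithfulness: float
    recall: float
    precision: float


# Hypothetical thresholds: stricter for a compliance-heavy domain.
THRESHOLDS_BY_DOMAIN = {
    "general_faq": Thresholds(faithfulness=0.85, recall=0.80, precision=0.70),
    "regulatory": Thresholds(faithfulness=0.95, recall=0.90, precision=0.80),
}


def failed_metrics(scores, thresholds):
    """Return the names of metrics whose score is below the required minimum."""
    return [
        name
        for name, minimum in asdict(thresholds).items()
        if scores.get(name, 0.0) < minimum
    ]


# The same measured scores pass in one domain and fail in the stricter one.
scores = {"faithfulness": 0.90, "recall": 0.85, "precision": 0.75}
for domain, thresholds in THRESHOLDS_BY_DOMAIN.items():
    print(domain, failed_metrics(scores, thresholds))
```

Fixing the thresholds before the evaluation runs keeps the pass/fail outcome from being tuned after the fact.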

Deployment Decision

Models meeting all critical thresholds are cleared for deployment; partial failures trigger "guided development" by localizing the exact subcomponent (retriever or generator) requiring optimization.
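The decision rule can be sketched as follows. The mapping from metric to subcomponent and the metric names are assumptions made for illustration; a real pipeline may attribute failures differently.

```python
# Hypothetical mapping from metric to the subcomponent it primarily reflects.
METRIC_OWNER = {
    "context_recall": "retriever",
    "context_precision": "retriever",
    "faithfulness": "generator",
    "answer_relevance": "generator",
}


def deployment_decision(scores, thresholds):
    """Return ("deploy", []) or ("guided development", [failing subcomponents])."""
    failing = sorted(
        {
            METRIC_OWNER[metric]
            for metric, minimum in thresholds.items()
            if scores.get(metric, 0.0) < minimum
        }
    )
    if not failing:
        return "deploy", []
    return "guided development", failing


# Illustrative values: only context recall misses its threshold.
thresholds = {metric: 0.80 for metric in METRIC_OWNER}
scores = {
    "context_recall": 0.70,
    "context_precision": 0.90,
    "faithfulness": 0.95,
    "answer_relevance": 0.90,
}
decision, components = deployment_decision(scores, thresholds)
print(decision, components)
```

In this example the generator metrics pass, so optimization effort is directed at the retriever only.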

Driving Real-World Impact with Trusted AI

Real-world examples of how companies use QuantPi to build trustworthy AI — from identifying weaknesses to achieving reliable, production-grade performance

Car damage detection for insurance claims

A computer vision model that analyzes vehicle images to identify and segment damage, enabling automated insurance coverage estimates for customers.

Client/Partner

ControlExpert

Challenge

Ensuring the robustness of a car damage segmentation model under diverse real-world conditions (such as weather, time of day, and car pose) is difficult and requires costly manual data annotation.

Assessment Scope

The QuantPi platform was used to test the model’s performance in depth and on a continuous basis, allowing improvements to the computer vision model in a matter of hours instead of weeks.

Metrics

Automated weather perturbation testing (e.g. day/night, clear/rain) was implemented using Stable Diffusion and CLIP models to identify outdoor images.

Data annotation was automated by using a CLIP model to dynamically tag images with relevant attributes (e.g. car pose, indoor/outdoor, raindrops).

Model performance was evaluated by using IoU across tagged attributes to identify weaknesses and guide targeted improvements.
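A minimal sketch of IoU sliced by tagged attributes is given below. It assumes binary masks flattened to lists of 0/1 and attribute tags already assigned to each image; the sample data is invented for illustration and is not from the ControlExpert evaluation.

```python
from collections import defaultdict


def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly.
    return intersection / union if union else 1.0


def mean_iou_by_tag(samples):
    """Average IoU per attribute tag; samples are (pred, target, tags) triples."""
    buckets = defaultdict(list)
    for pred, target, tags in samples:
        score = iou(pred, target)
        for tag in tags:
            buckets[tag].append(score)
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}


# Invented toy masks (4 pixels each) with hypothetical attribute tags.
samples = [
    ([1, 1, 0, 0], [1, 1, 0, 0], ["outdoor", "day"]),
    ([1, 0, 0, 0], [1, 1, 0, 0], ["outdoor", "rain"]),
    ([0, 0, 1, 1], [0, 1, 1, 1], ["indoor"]),
]
report = mean_iou_by_tag(samples)
for tag, score in sorted(report.items(), key=lambda item: item[1]):
    print(f"{tag}: {score:.2f}")
```

Sorting the slices by score puts the weakest conditions first, which is where targeted improvements would start.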

Business Impact

- Reduced manual effort for data scientists
- Increased model performance
- Reduced claims processing time

RAG & LLM for Factory Troubleshooting in Life Sciences

An LLM- and RAG-based AI system for factory troubleshooting that assists line workers in quickly identifying the root cause of issues and resolving them.

Client/Partner

Violet AI

Challenge

Violet AI needed to demonstrate the quality and reliability of their RAG-based AI systems, particularly regarding document retrieval accuracy and response faithfulness, to meet compliance demands in the life sciences industry, including the EU AI Act.

Assessment Scope

The QuantPi platform was leveraged to address these challenges, rigorously evaluating the RAG and LLM systems' performance and preparing Violet AI for compliance.

Metrics

We evaluated RAG system quality using metrics for retrieval accuracy (LAcc, SAcc, MRR) and for LLM faithfulness and hallucination (based on the RAGAs framework and QuantPi's existing out-of-the-box metrics).
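For reference, Mean Reciprocal Rank (MRR) averages the reciprocal rank of the first relevant document over all queries. The sketch below uses the standard definition; the document IDs are invented, and LAcc and SAcc are not covered because their exact definitions are not given here.

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries: mean of 1/rank of the first relevant document (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Invented example: first relevant hit at rank 2, at rank 1, and never.
rankings = [["d3", "d1", "d2"], ["d5", "d4"], ["d9", "d8"]]
relevant = [{"d1"}, {"d5"}, {"d0"}]
mrr = mean_reciprocal_rank(rankings, relevant)
print(f"MRR = {mrr:.2f}")
```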

Business Impact

- Reduced manual effort for data scientists
- Increased model performance
- Reduced claims processing time

AI-Powered Investment Research Assistant

Agentic workflow using LLMs as an orchestrator for summarization, multi-turn chatbots, and data extraction from unstructured sources.

Client/Partner

Unique

Challenge

The system faced significant risks of failing to meet regulatory requirements or internal company guidelines (e.g., recommending restricted products). Other critical concerns included the potential oversharing of client data, inconsistent advice quality across different customer segments, and the risk of poor financial outcomes.

Assessment Scope

QuantPi provided a comprehensive, black-box AI testing technology to evaluate the model for performance, bias, robustness, and data quality. The pilot testing focused specifically on three core areas: accuracy, hallucination, and reliability.

Metrics

Hallucination Risks: Measured via a faithfulness metric to ensure responses were grounded in the provided context.
Document Search Risks: Assessed using Word Overlap Rate, Mean Reciprocal Rank (MRR), and Lenient Retrieval Accuracy.
Reliability: Tested across various query difficulty levels, domain bias, and typo tolerance.
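Typo tolerance can be probed by perturbing queries and checking whether the top retrieved document changes. The sketch below is one possible setup under stated assumptions: typos are simulated by swapping adjacent letters, and `toy_retrieve` is a hypothetical character-trigram retriever standing in for the real search component.

```python
import random


def add_typos(query, rate, seed=0):
    """Swap adjacent letters with probability `rate` (a simple typo model)."""
    rng = random.Random(seed)
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def top1_stability(retrieve, queries, rate, seed=0):
    """Fraction of queries whose top-1 result is unchanged after adding typos."""
    if not queries:
        return 1.0
    unchanged = sum(retrieve(q) == retrieve(add_typos(q, rate, seed)) for q in queries)
    return unchanged / len(queries)


# Hypothetical two-document corpus and trigram-overlap retriever.
DOCS = {"fees": "fund management fees and costs", "risk": "portfolio risk and volatility"}


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


def toy_retrieve(query):
    q = trigrams(query.lower())
    return max(sorted(DOCS), key=lambda doc_id: len(q & trigrams(DOCS[doc_id])))


queries = ["management fees", "portfolio risk"]
print(top1_stability(toy_retrieve, queries, rate=0.3))
```

Fixing the random seed keeps the perturbed queries identical across runs, so stability scores from different system versions remain comparable.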

Business Impact

- Risk Mitigation: Identified and addressed risks related to advisor misuse and inadequate clarity in investment recommendations.
- Performance Assurance: Guaranteed the grounding of AI-generated advice in provided financial contexts to prevent misinformation.

CV Matching System

StepStone develops an LLM-based candidate recommender system which automatically provides recruiters with candidate recommendations purely based on job listings.

Client/Partner

StepStone, TÜV AI.Lab

Challenge

StepStone needed to validate and justify the fairness of their LLM-based candidate recommender system, particularly regarding non-discrimination of applicants based on age, gender, and ethnic origin.

Assessment Scope

The QuantPi Platform was used to assess the system's false negative rates and selection rates with regard to the sensitive attributes and their perturbations, resulting in quantitative test results on non-discrimination at both the individual and the group level.
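A minimal sketch of the group-level part of such a test is shown below: selection rate and false negative rate computed per group. The record format, group labels, and data are invented for illustration, and the individual-level perturbation tests are not covered.

```python
from collections import defaultdict


def group_rates(records):
    """Per-group selection rate and false negative rate.

    records: iterable of (group, qualified, selected) with boolean flags.
    A false negative is a qualified candidate who was not selected.
    """
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "qualified": 0, "missed": 0})
    for group, qualified, selected in records:
        c = counts[group]
        c["n"] += 1
        c["selected"] += int(selected)
        if qualified:
            c["qualified"] += 1
            c["missed"] += int(not selected)
    return {
        group: {
            "selection_rate": c["selected"] / c["n"],
            "false_negative_rate": c["missed"] / c["qualified"] if c["qualified"] else 0.0,
        }
        for group, c in counts.items()
    }


# Invented records for two hypothetical groups A and B.
records = [
    ("A", True, True), ("A", True, False), ("A", False, False), ("A", True, True),
    ("B", True, True), ("B", True, False), ("B", True, False), ("B", False, False),
]
rates = group_rates(records)
for group, values in sorted(rates.items()):
    print(group, values)
```

Comparing these rates across groups gives a quantitative view of whether qualified candidates from one group are overlooked more often than those from another.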

Metrics

An assurance case was built together with TIC partner TÜV AI.Lab. This document serves as a structured line of argumentation and links the quantitative evidence generated in the QuantPi Platform to applicable legal requirements under Germany’s Equal Treatment Act (AGG) and future provisions of the EU AI Act.

Business Impact

- Reduced bias and discrimination
- Increased system performance
- Compliance documented in assurance case

Trustworthy RAG Systems

What output-only metrics miss when subcomponents are ignored
