Enterprise RAG: how to prevent hallucinations on corporate datasets
LLM hallucinations on enterprise data are not a prompt engineering issue. They are an architectural one. Here is how to fix them properly.
Published on May 11, 2026 · 7 min read
The problem nobody wants to admit
An insurance company deployed an internal GPT-4-based chatbot to answer questions about contracts. Three weeks in, an operator noticed the system was citing non-existent clauses with complete confidence. The internal reputational damage was significant. The rollback cost exceeded the original implementation budget.
This is not an isolated case. According to aggregated Gartner data and internal consulting surveys, 67% of enterprise RAG projects fail or are scaled back within six months of go-live. The reason is almost never the model itself — it is the retrieval architecture.
---
What actually causes hallucinations in RAG systems
In an enterprise RAG system, hallucinations have a precise cause: the model receives insufficient, ambiguous, or incorrect context, and fills the gaps with inferences not grounded in real data.
The three most common patterns:
1. **Incorrect chunk retrieval.** Documents are split into fixed-size chunks (e.g. 512 tokens) without respecting semantic structure. A paragraph about contract penalties gets separated from the contract header. The retriever fetches the penalty clause, but the model has no idea which contract it belongs to.
2. **Overly permissive similarity scores.** The retriever returns documents with a cosine similarity of 0.62, while the minimum acceptable threshold for legal or regulatory text should be ≥ 0.80. The model works with irrelevant context and compensates by fabricating.
3. **No grounding verification.** No layer checks whether the model's statements are traceable to the retrieved chunks. The model can cite data not present in the context, and nothing catches it.
---
The correct architecture for enterprise RAG
Semantic chunking, not fixed-size
Drop fixed-size chunking. Use document-structure-aware chunking:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " "],
    length_function=len,
)
```

For structured documents (contracts, technical manuals, regulations), implement a hierarchical parser that preserves contextual breadcrumbs: every chunk carries metadata about its parent document, section, and article number.
```python
chunk_metadata = {
    "source": "supply_contract_2024_v3.pdf",
    "section": "Article 7 - Penalties",
    "page": 12,
    "doc_type": "legal",
    "last_updated": "2024-03-15",
}
```

This metadata is retrieved alongside the chunk and included in the prompt. The model always knows where the information comes from.
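As one way to wire this in, here is a minimal sketch of assembling retrieved chunks and their metadata into the prompt; the `format_context` helper and the `(text, metadata)` chunk structure are illustrative assumptions, not a specific library API:

```python
# Minimal sketch: assemble retrieved chunks + metadata into a grounded prompt.
# The (text, metadata) pair structure is an assumption for illustration.
retrieved_chunks = [
    ("In case of late delivery, a penalty of 0.5% per week applies ...", chunk_metadata),
]
query = "What penalties apply for late delivery?"

def format_context(chunks):
    blocks = []
    for i, (text, meta) in enumerate(chunks, start=1):
        header = f"[chunk_{i}] {meta['source']} | {meta['section']} | p. {meta['page']}"
        blocks.append(f"{header}\n{text}")
    return "\n\n".join(blocks)

prompt = (
    "Answer using ONLY the context below and cite the chunks you rely on.\n\n"
    f"CONTEXT:\n{format_context(retrieved_chunks)}\n\n"
    f"QUESTION: {query}"
)
```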
Two-stage retrieval with re-ranking
A single vector retriever is not sufficient for complex enterprise datasets. The correct approach:
- Stage 1 — Candidate retrieval: broad retrieval with vector search (top-k = 20, low threshold; sketched below)
- Stage 2 — Re-ranking: a cross-encoder (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2) re-evaluates the relevance of the 20 candidates and selects the top-5 truly relevant to the query
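Stage 1 can be a plain bi-encoder search. A minimal sketch with sentence-transformers and in-memory cosine similarity (the embedding model choice and the `chunk_texts` list are assumptions; a production system would use a vector database):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stage 1: broad candidate retrieval with a bi-encoder (top-k = 20, low threshold).
chunk_texts = [
    "Article 7 - Penalties: late delivery incurs a 0.5% weekly penalty ...",
    "Article 3 - Delivery terms: goods are delivered DAP ...",
]
query = "What penalties apply for late delivery?"

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunk_texts, normalize_embeddings=True)
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
similarities = chunk_embeddings @ query_embedding
top_k = np.argsort(similarities)[::-1][:20]
candidate_chunks = [chunk_texts[i] for i in top_k]
```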
Stage 2 then re-ranks those candidates:

```python
from sentence_transformers import CrossEncoder

# Stage 2: cross-encoder re-ranking of the 20 candidates against the query.
re_ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = re_ranker.predict([(query, chunk) for chunk in candidate_chunks])
top_chunks = [
    candidate_chunks[i]
    for i in sorted(range(len(scores)), key=lambda x: scores[x], reverse=True)[:5]
]
```

The re-ranker drastically reduces noise in the context passed to the model.
Explicit confidence thresholds
Define minimum thresholds by document type:
| Document type | Minimum cosine similarity |
|---|---|
| Regulations / compliance | 0.82 |
| Contracts | 0.80 |
| Technical knowledge base | 0.75 |
| Internal FAQs | 0.70 |
If no chunk exceeds the threshold, the system must respond with an explicit fallback: "I could not find sufficiently relevant information in the corporate dataset to answer this question." — never a fabricated answer.
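A minimal sketch of this gating step, using the thresholds from the table above (the `doc_type` keys and the scored-chunk structure are illustrative assumptions):

```python
# Per-document-type similarity thresholds, mirroring the table above.
MIN_SIMILARITY = {
    "compliance": 0.82,
    "legal": 0.80,
    "technical_kb": 0.75,
    "faq": 0.70,
}

FALLBACK_MESSAGE = (
    "I could not find sufficiently relevant information in the "
    "corporate dataset to answer this question."
)

def gate_chunks(scored_chunks, doc_type):
    """Keep only chunks at or above the type-specific threshold."""
    threshold = MIN_SIMILARITY[doc_type]
    return [(chunk, score) for chunk, score in scored_chunks if score >= threshold]

scored_chunks = [("Article 7 - Penalties ...", 0.84), ("Unrelated clause ...", 0.61)]
if not gate_chunks(scored_chunks, doc_type="legal"):
    answer = FALLBACK_MESSAGE  # explicit fallback instead of generation
```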
Grounding verification layer
The last line of defense: a component that verifies post-generation that every factual claim in the response is traceable to the context chunks provided.
Practical approaches:
- NLI-based: use a Natural Language Inference model (e.g. facebook/bart-large-mnli) to verify the response is entailed by the context (a minimal sketch follows this list)
- Citation forcing: force the model to cite the source chunk for each claim via a structured prompt
- Confidence scoring: ask the model to self-assess its certainty on a 1-5 scale before returning the answer
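A minimal sketch of the NLI-based check with facebook/bart-large-mnli via Hugging Face transformers; the 0.8 entailment threshold and the `is_grounded` helper are illustrative assumptions rather than a fixed recipe:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

def is_grounded(context: str, claim: str, threshold: float = 0.8) -> bool:
    """Return True if the claim is entailed by the context chunk."""
    inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # bart-large-mnli label order: [contradiction, neutral, entailment]
    return probs[2].item() >= threshold

# Flag any answer sentence not entailed by at least one retrieved chunk.
# `answer_sentences` is assumed to be the model's answer split into sentences.
ungrounded = [
    sentence for sentence in answer_sentences
    if not any(is_grounded(chunk, sentence) for chunk in top_chunks)
]
```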
System prompt (excerpt):
"For every factual claim in your response, indicate in square brackets the source chunk number [chunk_N]. If you cannot find support in the provided chunks, write explicitly: INFORMATION NOT AVAILABLE IN DATASET."---
Production monitoring
A RAG system without monitoring is blind. Minimum metrics to track:
- Retrieval recall@5: the share of queries for which a relevant chunk appears among the top-5 retrieved chunks
- Answer faithfulness: ratio of context-grounded statements in the response vs. total statements (tool: RAGAS)
- Fallback rate: percentage of queries that do not meet the confidence threshold — values above 15% signal dataset coverage problems
- Per-stage latency: separate retrieval latency from generation latency to identify bottlenecks
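A minimal sketch of per-stage instrumentation; `retrieve` and `generate` are hypothetical placeholders for your own pipeline stages, and `FALLBACK_MESSAGE` is the constant from the thresholds section:

```python
import time

metrics = {"queries": 0, "fallbacks": 0}

def answer_query(query: str):
    metrics["queries"] += 1

    t0 = time.perf_counter()
    chunks = retrieve(query)  # hypothetical retrieval stage
    retrieval_ms = (time.perf_counter() - t0) * 1000

    if not chunks:  # nothing above the confidence threshold
        metrics["fallbacks"] += 1
        return FALLBACK_MESSAGE, {"retrieval_ms": retrieval_ms}

    t1 = time.perf_counter()
    answer = generate(query, chunks)  # hypothetical generation stage
    generation_ms = (time.perf_counter() - t1) * 1000

    return answer, {"retrieval_ms": retrieval_ms, "generation_ms": generation_ms}

# A fallback rate above ~15% signals dataset coverage problems.
fallback_rate = metrics["fallbacks"] / max(metrics["queries"], 1)
```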
```python
# Evaluation example with RAGAS (Python API)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

eval_dataset = Dataset.from_json("./eval_dataset.json")
results = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
```

---
Operational take-away
- Fixed-size chunking is the leading cause of hallucinations in enterprise RAG. Replace it with hierarchical semantic chunking with contextual metadata.
- Implement two-stage retrieval. A single embedding vector store is not enough for enterprise datasets.
- Define confidence thresholds by document type. An explicit fallback is always better than a fabricated answer.
- Add a grounding verification layer before returning any response to the user.
- Measure faithfulness with RAGAS or equivalent tools. Without metrics, quality is guesswork.
A well-built RAG system does not eliminate every risk of error, but it makes every error traceable and correctable. That is the difference between a prototype and an enterprise system.
---
Evviva Group supports IT partners in designing and implementing enterprise-grade RAG architectures. If you are evaluating a similar deployment, we are open to a technical conversation.