Generative AI·June 8, 2026·5 min read

Citation-Guard: Production RAG Patterns for Regulated Fintech

Naive RAG passes the demo and fails the audit. Citation-guard keeps fintech AI honest: retrieve with citations, quote numbers instead of writing them, abstain when unsure, verify before shipping. With pseudocode, a pipeline diagram, and the metrics that matter.

By JustSoftLab Team

A wealth platform demoed an AI assistant that answered client questions in plain English. It worked well until someone asked about early-withdrawal penalties and it returned the wrong number, stated with the same confidence as every correct answer. If a client acts on a number like that, you have a regulatory complaint.

Most fintech AI dies in the gap between a strong demo and a system you can put in front of regulated users. We see it constantly. We wrote up the broader set in "10 RAG architecture mistakes fintechs make in production"; this piece covers the failure that gets companies in real trouble: confident wrong answers.

The tail is the whole problem

The benchmark said 92% accuracy. In a consumer app, 92% makes a good product. In finance, the wrong 8% lands on the edge cases that carry the most liability, and it reads in the same tone as the right 92%.

You cannot ship "usually correct" to someone making a financial decision. So the real engineering question is how to make the system refuse to answer when it would be wrong.

We have shipped this across several production fintech systems: RAG over regulatory documents, client communications, and financial data. We call the pattern that survives compliance review citation-guard. The pipeline:

Query
  │
  ▼
Retrieve (vector search → ranked spans WITH source refs)
  │
  ▼
Generate (constrained: every claim cites a span; numbers are quoted, not written)
  │
  ▼
Verify (does each span actually support the claim? do quoted numbers match?)
  │            │
  │ pass       └─ fail ──► regenerate once, else ABSTAIN
  ▼
Answer + citations     —or—     "I don't have a reliable answer for that"

Four rules make it work.

Rule 1 — Retrieve with citations: every claim maps to a source span

Most pipelines retrieve chunks, stuff them into the prompt, and let the model synthesize. Hallucination enters during that synthesis: the model blends sources, fills gaps, and you can't tell which sentence came from where.

Citation-guard inverts the contract. Retrieval returns spans with identity: document, section, character range. The model then attaches a span reference to every factual sentence.

type Span = { docId: string; section: string; start: number; end: number; text: string };

async function retrieve(query: string): Promise<Span[]> {
  const hits = await vectorSearch(query, { k: 12 });
  // keep provenance on every hit — never flatten to raw text
  return hits.map(toSpanWithRefs);
}

Your retrieval layer sets the ceiling. Chunking, embeddings, and the vector store all matter. We benchmarked that layer in "Postgres pgvector vs Pinecone" and covered the accuracy techniques in "RAG for reliable AI". Citation-guard sits on top of good retrieval; it doesn't replace it.

If a claim can't be traced to a retrieved span, it doesn't ship. That rule removes the worst failure mode: plausible sentences with no source.
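One way to enforce that rule is a gate that runs before anything else touches the draft. The sketch below is a minimal illustration, not the production check: the Claim shape, the keyed span table, and the function names are assumptions introduced here, and only the Span type comes from the snippet above.

```typescript
type Span = { docId: string; section: string; start: number; end: number; text: string };

// Hypothetical shape: one factual sentence plus the key of the span it cites.
type Claim = { text: string; citation: string | null };

// Returns the claims that cannot be traced to a span we actually retrieved.
function untraceableClaims(claims: Claim[], spans: Record<string, Span>): Claim[] {
  return claims.filter((claim) => {
    if (claim.citation === null) return true; // no citation at all
    // A citation that points at a span we never retrieved is as bad as none.
    return !Object.prototype.hasOwnProperty.call(spans, claim.citation);
  });
}

// An answer ships only if every claim is traceable.
function canShip(claims: Claim[], spans: Record<string, Span>): boolean {
  return untraceableClaims(claims, spans).length === 0;
}

// Illustrative data only.
const exampleSpans: Record<string, Span> = {
  s1: { docId: "fee-schedule", section: "4.2", start: 120, end: 188, text: "(span text)" },
};
const exampleClaims: Claim[] = [
  { text: "Sentence backed by a retrieved span.", citation: "s1" },
  { text: "Plausible sentence with no source.", citation: null },
];
const blocked = untraceableClaims(exampleClaims, exampleSpans); // contains the second claim only
```

This gate only checks that a citation exists and resolves. Whether the span actually supports the sentence is a separate question, handled by the verification pass in Rule 4.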

Rule 2 — For numbers, quote; never paraphrase

Numbers are where pattern-matching betrays you. "$10,000" becomes "$100,000". "3.5%" becomes "3.05%". Both look authoritative.

So the system extracts numeric values from the span and renders them exactly as written. The model picks which number is relevant. It never writes the number itself.

// The model emits a reference, not a literal number.
// The cite token is resolved from the source AFTER generation.
// CITE_TOKEN is assumed to be a global RegExp with two capture groups: span id and field name.
const rendered = answer.replace(CITE_TOKEN, (_match, spanId, field) =>
  extractVerbatim(spans[spanId], field)  // pulls the exact characters from the source
);

The value comes from the source document, never from the model's next-token probability.
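To make the mechanics concrete, here is a self-contained sketch of the resolution step. The token format `[[cite:<spanId>:<field>]]`, the NumericSpan shape with pre-computed character offsets, and the helper names are all assumptions made for illustration; the article does not specify how the token is encoded or how numeric fields are located inside a span.

```typescript
// Assumed token format (hypothetical): [[cite:<spanId>:<field>]]
const CITE_TOKEN = /\[\[cite:([A-Za-z0-9_-]+):([A-Za-z0-9_-]+)\]\]/g;

// Hypothetical span shape: numeric fields are located at indexing time
// and stored as character offsets into the span text.
type NumericSpan = {
  text: string;
  fields: Record<string, { start: number; end: number }>;
};

// Own-property lookup, so names like "constructor" never resolve by accident.
function own<T>(table: Record<string, T>, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Copies the exact characters from the source; no parsing, rounding, or reformatting.
function extractVerbatim(span: NumericSpan, field: string): string {
  const loc = own(span.fields, field);
  if (!loc) throw new Error("unknown field: " + field);
  return span.text.slice(loc.start, loc.end);
}

// Replaces every cite token in the draft with the characters it points at.
// An unresolvable token throws, so the caller can regenerate or abstain.
function renderAnswer(draft: string, spans: Record<string, NumericSpan>): string {
  return draft.replace(CITE_TOKEN, (_match: string, spanId: string, field: string) => {
    const span = own(spans, spanId);
    if (!span) throw new Error("unknown span: " + spanId);
    return extractVerbatim(span, field);
  });
}

// Illustrative data only.
const sourceText = "Early withdrawal incurs a fee of 3.5% on the withdrawn amount.";
const feeStart = sourceText.indexOf("3.5%");
const numericSpans: Record<string, NumericSpan> = {
  s1: { text: sourceText, fields: { fee: { start: feeStart, end: feeStart + 4 } } },
};
const renderedExample = renderAnswer("The early-withdrawal fee is [[cite:s1:fee]].", numericSpans);
// renderedExample === "The early-withdrawal fee is 3.5%."
```

Failing loudly on an unknown span or field is a deliberate choice in this sketch: a token that cannot be resolved should never degrade into a number the model made up.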

Rule 3 — Abstain over guess

The instinct is to make the assistant always helpful. In regulated finance, that instinct is a liability. An assistant that guesses when context is thin does more damage than one that declines and routes the user to a human.

function shouldAnswer(spans: Span[], topScore: number): boolean {
  if (topScore < RETRIEVAL_CONFIDENCE_FLOOR) return false; // nothing solid retrieved
  if (spans.length === 0) return false;
  return true;
}
// else: "I don't have a reliable answer for that — here's who can help."

You tune RETRIEVAL_CONFIDENCE_FLOOR to your risk tolerance. Treat a higher abstention rate as a cost you choose, not a failure you suffer. In finance, refusing to answer beats answering wrong.
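One way to make that tuning explicit is to sweep candidate floors over a labeled evaluation set and read off the trade-off. The sketch below assumes you have such a set, with a top retrieval score and a correctness label per query; the EvalCase shape, the function names, and the sample values are hypothetical.

```typescript
// Hypothetical evaluation record: one labeled query from an offline test set.
type EvalCase = { topScore: number; answerWasCorrect: boolean };

type FloorStats = { floor: number; abstentionRate: number; wrongAnswerRate: number };

// For a candidate floor, measure how often the system would abstain and how
// often it would ship a wrong answer, both as shares of all queries.
function statsAtFloor(cases: EvalCase[], floor: number): FloorStats {
  const answered = cases.filter((c) => c.topScore >= floor); // mirrors shouldAnswer
  const wrong = answered.filter((c) => !c.answerWasCorrect);
  const total = cases.length || 1; // avoid dividing by zero on an empty set
  return {
    floor,
    abstentionRate: (cases.length - answered.length) / total,
    wrongAnswerRate: wrong.length / total,
  };
}

// Pick the lowest floor whose wrong-answer rate fits the risk budget.
// Raising the floor can only remove answers, so the lowest passing floor
// is also the one with the least abstention.
function pickFloor(cases: EvalCase[], candidates: number[], maxWrongRate: number): FloorStats | null {
  const sorted = candidates.slice().sort((a, b) => a - b);
  for (const floor of sorted) {
    const stats = statsAtFloor(cases, floor);
    if (stats.wrongAnswerRate <= maxWrongRate) return stats;
  }
  return null; // no candidate is safe enough
}

// Illustrative data only.
const evalCases: EvalCase[] = [
  { topScore: 0.9, answerWasCorrect: true },
  { topScore: 0.8, answerWasCorrect: true },
  { topScore: 0.6, answerWasCorrect: false },
  { topScore: 0.5, answerWasCorrect: true },
  { topScore: 0.3, answerWasCorrect: false },
];
const chosenFloor = pickFloor(evalCases, [0.7, 0.4, 0.55], 0);
// With a zero wrong-answer budget, only the 0.7 floor passes, at the cost of abstaining on 3 of 5 queries.
```

Retrieval score is only a proxy for answer quality, so a floor chosen this way needs re-checking whenever the embedding model, chunking, or corpus changes.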

Rule 4 — Verify before ship

Add a guard between generation and delivery. Before an answer reaches the user, a verification pass checks each claim against the spans it cites.

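async function verify(answer: Answer): Promise<boolean> {
  for (const claim of answer.claims) {
    const span = answer.spans[claim.citation];
    if (!span) return false;                                   // uncited claim
    // entails() calls a model, so it is assumed to return a Promise;
    // without await, the negated Promise is always false and the check never fails.
    if (!(await entails(span.text, claim.text))) return false; // span doesn't support it
    // Substring match: "10,000" also matches inside "110,000". Tighten if your sources need it.
    if (claim.numbers.some((n) => !span.text.includes(n))) return false; // number drift
  }
  return true;
}
// fail → regenerate once with stricter constraints, else abstain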

entails() can be a smaller, cheaper model scoped to one yes/no question: does this span support this sentence? With it, the system stands behind the answer instead of just emitting it.
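The "regenerate once, else abstain" control flow can be sketched as a small wrapper around generation and verification. This is an illustration under stated assumptions: generate and verify are injected stand-ins, the Draft and Outcome types and the strict flag are hypothetical, and the calls are synchronous only to keep the example short. Real model calls would be awaited.

```typescript
// Hypothetical shapes for this sketch.
type Draft = { text: string };
type Outcome =
  | { kind: "answer"; text: string; attempts: number }
  | { kind: "abstain"; text: string; attempts: number };

const ABSTAIN_MESSAGE = "I don't have a reliable answer for that";

function answerWithGuard(
  query: string,
  generate: (query: string, strict: boolean) => Draft, // strict = regenerate with tighter constraints
  verify: (draft: Draft) => boolean,
  maxAttempts: number = 2 // first try plus one stricter regeneration
): Outcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const draft = generate(query, attempt > 1);
    if (verify(draft)) return { kind: "answer", text: draft.text, attempts: attempt };
  }
  // Every attempt failed verification: decline instead of shipping an unverified draft.
  return { kind: "abstain", text: ABSTAIN_MESSAGE, attempts: maxAttempts };
}

// Stub generator: the first draft is unsupported, the strict retry is supported.
const stubGenerate = (_query: string, strict: boolean): Draft => ({
  text: strict ? "supported draft" : "unsupported draft",
});
const stubVerify = (draft: Draft): boolean => draft.text === "supported draft";

const recovered = answerWithGuard("example query", stubGenerate, stubVerify);
const declined = answerWithGuard("example query", stubGenerate, () => false);
```

Capping the loop at two attempts keeps latency bounded and avoids retrying until some draft happens to slip past the verifier.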

Measure the right things

Accuracy alone hides the failure mode. Track these instead:

Citation coverage — share of factual sentences with a valid source span. Aim for 100%.
Numeric fidelity — share of numbers that match their source exactly.
Abstention rate — how often the system declines. Watch it over time.
Audit traceability — for any past answer, can you reconstruct which spans produced it? In regulated finance you have to.
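The first three metrics fall out of the same log that makes the fourth possible. The sketch below assumes one audit record per query that stores the cited span id for each factual sentence and, for each number, both the rendered and the source characters. The record shape and field names are hypothetical, and a non-null spanId is assumed to have been validated when the record was written.

```typescript
// Hypothetical audit record, written once per query at answer time.
type LoggedNumber = { rendered: string; sourceText: string }; // as shown vs. as in the cited span
type AuditRecord = {
  abstained: boolean;
  sentences: { text: string; spanId: string | null }[]; // factual sentences only
  numbers: LoggedNumber[];
};

type Metrics = { citationCoverage: number; numericFidelity: number; abstentionRate: number };

function ratio(part: number, whole: number): number {
  return whole === 0 ? 1 : part / whole; // nothing to check counts as fully covered
}

function computeMetrics(log: AuditRecord[]): Metrics {
  let sentences = 0, cited = 0, numbers = 0, exact = 0, abstained = 0;
  for (const record of log) {
    if (record.abstained) { abstained++; continue; } // abstentions carry no claims
    for (const s of record.sentences) { sentences++; if (s.spanId !== null) cited++; }
    for (const n of record.numbers) { numbers++; if (n.rendered === n.sourceText) exact++; }
  }
  return {
    citationCoverage: ratio(cited, sentences),
    numericFidelity: ratio(exact, numbers),
    abstentionRate: log.length === 0 ? 0 : abstained / log.length,
  };
}

// Illustrative data only.
const auditLog: AuditRecord[] = [
  {
    abstained: false,
    sentences: [{ text: "Cited sentence.", spanId: "s1" }, { text: "Uncited sentence.", spanId: null }],
    numbers: [{ rendered: "3.5%", sourceText: "3.5%" }],
  },
  { abstained: true, sentences: [], numbers: [] },
  {
    abstained: false,
    sentences: [{ text: "Cited.", spanId: "s2" }, { text: "Also cited.", spanId: "s3" }],
    numbers: [{ rendered: "$100,000", sourceText: "$10,000" }],
  },
];
const metrics = computeMetrics(auditLog);
// 3 of 4 sentences cited, 1 of 2 numbers exact, 1 of 3 queries abstained.
```

Numbers are compared as strings on purpose: "3.50%" and "3.5%" are numerically equal, but only an exact character match shows the value was quoted and not rewritten.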

What it costs

Citation-guard is not free. It adds retrieval discipline, a verification pass, and an abstention rate that frustrates anyone expecting an always-on oracle. Latency rises, and coverage dips before it climbs as retrieval improves.

In return you get an answer you can defend. When compliance asks where a number came from, the system points to a document, a section, and a span instead of shrugging about model weights.

Across our systems, changing model size barely moved these metrics. The guardrails did. Teams that reach for a bigger, pricier model to fix hallucinations are solving the wrong problem.

Takeaway

Naive RAG optimizes for the average case and demos well. Regulated fintech turns on the tail. Build for the tail: cite every claim, quote every number, abstain when unsure, verify before you ship. That architecture passes the audit, not only the demo.

We build production AI for fintech, healthcare, and other high-stakes domains, engineered to survive compliance review rather than only impress in a demo. If you're shipping AI into a regulated product, let's compare notes on what holds up.
