JustSoftLab

Engineering reference

Production-grade RAG for regulated finance.

Generic RAG patterns break under fintech load — regulatory citations get hallucinated, audit trails are missing, retrieval misses jurisdiction-scoped data, and cost models fall apart at scale. This is the reference architecture we ship: seven layers, with the trade-offs we've learned the hard way on production deployments.

On this page

  • Why generic RAG fails for fintech
  • A 7-layer reference architecture
  • Compliance controls and audit trail design
  • Tech stack we ship (vector store, embeddings, reranker, generation)
  • Cost economics at production scale
  • When to use Postgres + pgvector vs hosted vector DB

The fintech RAG problem

Why generic RAG patterns break in regulated finance.

Most RAG tutorials assume open-domain Q&A on clean corpora. Production fintech RAG is a different problem: the corpus is large, regulated, and version-sensitive; users ask questions where a hallucinated answer creates legal liability; compliance asks for full lineage on every response; and retrieval has to scope to the right jurisdiction, product, and effective date.

We've seen the same failure modes repeatedly across fintech engagements — chunking that splits regulatory clauses mid-sentence, vector-only retrieval that misses CUSIPs and regulation citations, missing audit trails that fail SR 11-7 model risk reviews, and cost models that work at 10K queries/month and fall apart at 1M. The architecture below is what we've converged on after fixing those failures.

For a deeper treatment of the failure modes — including code-level patterns and migration paths — read our companion article: 10 RAG Architecture Mistakes Fintechs Make in Their First Production Deployment.

Reference architecture

Seven layers from source documents to audited answers.

1. Ingestion & lineage

Source-system connectors (S3, SFTP, Confluence, Salesforce, custom APIs) with full document lineage. Every chunk traces back to a source document, version, page, and effective date — required for SOX, PCI-DSS, and SR 11-7 audits.

  • S3 / SFTP / Confluence connectors
  • Document versioning
  • Effective-date tracking
  • PII redaction at ingest
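
As a minimal sketch of chunk-level lineage, assuming a simple in-memory model: the class names, field names, and `in_effect` helper below are illustrative, not a description of any specific connector. The idea is that every chunk carries its source, version, page, and effective date, and gets a deterministic ID so re-ingestion does not create duplicates.

```python
from dataclasses import dataclass
from datetime import date
import hashlib


@dataclass(frozen=True)
class ChunkLineage:
    """Provenance carried by every chunk (field names are illustrative)."""
    source_uri: str        # e.g. an S3 key or a Confluence page URL
    doc_version: str       # version label from the source system
    page: int              # 1-based page in the source document
    effective_date: date   # date the document's content takes effect


@dataclass(frozen=True)
class Chunk:
    text: str
    lineage: ChunkLineage

    @property
    def chunk_id(self) -> str:
        # Deterministic ID: the same text from the same source version and
        # page always hashes to the same value, so re-ingestion is idempotent.
        key = "|".join([
            self.lineage.source_uri,
            self.lineage.doc_version,
            str(self.lineage.page),
            self.text,
        ])
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def in_effect(chunks: list[Chunk], as_of: date) -> list[Chunk]:
    """Keep only chunks whose document was already effective on `as_of`."""
    return [c for c in chunks if c.lineage.effective_date <= as_of]
```

Keeping the effective date on the chunk itself, rather than only on the parent document, lets the retrieval layer filter by date without a join.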

2. Chunking strategy

Document-type-aware chunking. Regulatory PDFs respect section/clause boundaries; transactional logs chunk by logical event; structured docs preserve table integrity. In our experience, generic fixed-size splitters (the usual 512-token default) are the most common cause of fintech RAG failure in production.

  • Section-aware splitting
  • Clause boundary preservation
  • Table-aware chunking
  • Sliding context windows
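
Section-aware splitting can be sketched roughly as follows. This assumes headings look like "Section 4", "Article 12", or "§ 3.1" at the start of a line, and it measures size in characters rather than tokens to stay dependency-free; both are simplifications, and the function names are hypothetical. Table handling and sliding windows are left out.

```python
import re

# Assumed heading convention: a line starting with "Section 4",
# "Article 12", or "§ 3.1". Real corpora need their own patterns.
HEADING = re.compile(r"^(?:Section|Article|§)\s*\d+(?:\.\d+)*", re.MULTILINE)
# Sentence/clause boundary: whitespace that follows "." or ";".
SENTENCE_END = re.compile(r"(?<=[.;])\s+")


def split_sections(text: str) -> list[str]:
    """Split text so that each piece starts at a section heading."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p]


def chunk_section(section: str, max_chars: int = 1200) -> list[str]:
    """Pack whole sentences into chunks; never cut inside a sentence.

    A single sentence longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for sentence in SENTENCE_END.split(section):
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def chunk_document(text: str, max_chars: int = 1200) -> list[str]:
    """Chunk section by section, so no chunk spans two sections."""
    return [c for s in split_sections(text) for c in chunk_section(s, max_chars)]
```

Because chunks are built per section, a clause is never merged with text from the neighbouring section, which is the property a fixed-size splitter cannot guarantee.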

3. Hybrid retrieval

Vector similarity (pgvector or Pinecone) combined with BM25 keyword matching and metadata filtering. Pure-vector retrieval misses exact identifiers (CUSIP, ISIN, regulation citations) — hybrid recovers them. Metadata filters scope queries to the right jurisdiction, time window, or product line.

  • pgvector HNSW indexing
  • BM25 keyword fallback
  • Metadata pre-filtering
  • Reciprocal rank fusion
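
Reciprocal rank fusion, the step that merges the vector and BM25 result lists, is small enough to sketch in full. The version below assumes each retriever returns an ordered list of chunk IDs; `k = 60` is the constant commonly used with RRF, and the function name is ours.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of IDs into one ranking.

    Each list contributes 1 / (k + rank) per item (rank starts at 1),
    so items that rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy example: "c" is only third in the vector list but first in BM25,
# which is the typical situation for an exact identifier such as a CUSIP.
vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF works on ranks only, so it needs no score normalisation between cosine similarity and BM25, which is the main reason to prefer it over weighted score sums as a starting point.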

4. Re-ranking

Cohere Rerank or a fine-tuned cross-encoder rescores the top N candidates by query relevance. In our experience this cuts the hallucination rate roughly in half on regulatory Q&A workloads compared with vector-only retrieval, at a small per-query latency cost that still has to be budgeted.

  • Cohere Rerank v3
  • Cross-encoder fine-tuning
  • Top-N rescoring
  • Latency budgeting
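
Top-N rescoring has the same shape whichever reranker sits behind it. The sketch below takes the scorer as a plain callable so it does not assume any vendor API; `token_overlap` is a toy stand-in for a real cross-encoder, and `top_n` and `keep` are hypothetical parameter names.

```python
from typing import Callable, Sequence


def rerank_top_n(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 50,
    keep: int = 8,
) -> list[str]:
    """Rescore only the first `top_n` candidates and return the best `keep`.

    Capping `top_n` bounds the number of scorer calls per query, which is
    the main lever for keeping reranker latency inside a budget.
    """
    head = list(candidates[:top_n])
    ranked = sorted(head, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:keep]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer: count of shared lowercase tokens (stand-in only)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    "wire transfer limits",
    "card dispute process",
    "international wire transfer fees",
    "branch hours",
]
best = rerank_top_n("wire transfer fees", docs, token_overlap, top_n=3, keep=2)
```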

5. Citations & audit trail

Every generated answer carries source citations: document, page, and span. Audit logs persist the full retrieval window, the prompt, the model output, and the citation set — joinable to user identity and timestamp. Regulators ask "what did the system tell whom, when, and why" — this answers it.

  • Span-level citations
  • Immutable audit log
  • User-action joining
  • Retention policies
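
One way to make an audit log tamper-evident is to hash-chain its entries. The sketch below is an assumption about how that could look, not a description of a specific product: the record fields mirror the list above (retrieval window, prompt, output, citations, user, timestamp), and all names are hypothetical. A production log would also need write-once storage and retention handling.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # prev_hash value for the first entry


def _digest(record: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is stable.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], *, user_id: str, query: str,
                 retrieved_chunk_ids: list[str], prompt: str, output: str,
                 citations: list[dict]) -> dict:
    """Append one answer record, chained to the previous entry's hash."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_chunk_ids": retrieved_chunk_ids,
        "prompt": prompt,
        "output": output,
        "citations": citations,  # e.g. {"doc": ..., "page": ..., "span": [start, end]}
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

Because each entry's hash covers the previous hash, editing one historical answer invalidates every entry after it, which is what makes "what did the system tell whom, when" checkable after the fact.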

6. Evaluation & observability

A golden-set evaluation harness runs on every model, prompt, or retriever change. Hallucination rate, citation accuracy, refusal rate, and latency are tracked as deployment gates, not afterthoughts. Production observability comes from OpenTelemetry traces tied to individual retrieval components.

  • Golden-set eval harness
  • Hallucination metrics
  • OpenTelemetry tracing
  • A/B prompt testing
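
A deployment gate over those four metrics might look like the sketch below. It assumes each golden-set question has already been run and judged; the `EvalResult` fields and the threshold values are hypothetical placeholders, and real thresholds should come from your own baseline.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    hallucinated: bool       # answer contained an unsupported claim
    citation_correct: bool   # cited span actually supports the answer
    refused: bool            # system declined to answer
    latency_ms: float


# Hypothetical gate values, for illustration only.
THRESHOLDS = {
    "max_hallucination_rate": 0.02,
    "min_citation_accuracy": 0.95,
    "max_refusal_rate": 0.10,
    "max_p95_latency_ms": 2500.0,
}


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate per-question results into the four gate metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("golden set produced no results")
    latencies = sorted(r.latency_ms for r in results)
    p95_rank = (95 * n + 99) // 100  # nearest-rank percentile, integer math
    return {
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "citation_accuracy": sum(r.citation_correct for r in results) / n,
        "refusal_rate": sum(r.refused for r in results) / n,
        "p95_latency_ms": latencies[p95_rank - 1],
    }


def gate_failures(metrics: dict[str, float],
                  thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the names of metrics that violate their threshold."""
    checks = [
        ("hallucination_rate",
         metrics["hallucination_rate"] <= thresholds["max_hallucination_rate"]),
        ("citation_accuracy",
         metrics["citation_accuracy"] >= thresholds["min_citation_accuracy"]),
        ("refusal_rate",
         metrics["refusal_rate"] <= thresholds["max_refusal_rate"]),
        ("p95_latency_ms",
         metrics["p95_latency_ms"] <= thresholds["max_p95_latency_ms"]),
    ]
    return [name for name, ok in checks if not ok]
```

In a CI pipeline, a non-empty list from `gate_failures` would fail the build, so a prompt or retriever change cannot ship with a regressed metric.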

7. Compliance controls

PII de-identification at ingest, role-based access on retrieval, encryption at rest with customer-managed keys, model output filtering for restricted topics, refusal templates that stay inside policy. Compliance signs off as a reviewer before launch — not a blocker after.

  • PII de-identification
  • CMK encryption
  • Output policy filters
  • Refusal templates
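
Two of these controls are easy to illustrate: pattern-based PII redaction at ingest and a role check at retrieval time. The sketch below is deliberately minimal and rests on assumptions: the two regexes cover only US-style SSNs and simple email addresses, the `allowed_roles` field is hypothetical, and real de-identification usually needs a dedicated PII detection service on top of patterns like these.

```python
import re

# Illustrative patterns only; production redaction needs broader coverage.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace each matched PII value with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def authorized(chunks: list[dict], user_roles: set[str]) -> list[dict]:
    """Keep chunks whose allowed_roles intersect the caller's roles.

    Applied before generation, so restricted text never reaches the prompt.
    """
    return [c for c in chunks if set(c["allowed_roles"]) & user_roles]


sample = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
```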

Tech stack we ship

Default choices and the alternatives we benchmark against.

Every choice below is opinionated. We'll go off-default for legitimate reasons — regulatory data residency, an existing infrastructure constraint, or a measured benchmark on your data. Not because something is trendy.

Layer | Default | Alternatives we benchmark against
Vector store | Postgres + pgvector (HNSW) | Pinecone, Weaviate, Qdrant
Embeddings | Voyage 3 large, OpenAI text-embedding-3-large | Cohere Embed v3, fine-tuned domain models
Retrieval | Hybrid (pgvector + BM25 + metadata filters) | Vespa, Elasticsearch, OpenSearch
Reranker | Cohere Rerank v3 | Cross-encoder fine-tuned on golden set
Generation | Anthropic Claude Sonnet / Opus | GPT-4 family, Gemini 1.5 Pro, fine-tuned Llama 3
Eval harness | Custom golden set + Ragas | TruLens, Phoenix, LangSmith
Observability | OpenTelemetry + Sentry + custom dashboards | Datadog APM, New Relic, Honeycomb
Cloud | AWS (us-east-1, eu-west-1) with VPC isolation | GCP, Azure, on-prem Kubernetes

Cost economics

Where the money actually goes at production scale.

Most fintech RAG cost surprises come from three places: re-embedding (every document update re-runs an embedding model), reranker latency (which forces overprovisioning), and observability (audit-grade logging is not free). Generation cost, the headline LLM bill, is usually the largest single line item but rarely the source of the surprise.

Our typical mix at 1M monthly queries on a regulated corpus of 5–10M chunks: roughly 40% generation, 25% embedding and re-embedding, 15% reranker, 15% vector store and compute, 5% observability and storage. Numbers shift hard with prompt design, retrieval window size, and how often the source corpus changes.
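
To turn that mix into a budget, the arithmetic is a straight allocation. The sketch below uses the percentages quoted above; the total monthly spend passed in is a made-up input for illustration, not a figure from our deployments.

```python
import math

# Shares from the typical mix described above.
COST_MIX = {
    "generation": 0.40,
    "embedding_and_reembedding": 0.25,
    "reranker": 0.15,
    "vector_store_and_compute": 0.15,
    "observability_and_storage": 0.05,
}


def allocate(total_monthly_usd: float) -> dict[str, float]:
    """Split a monthly budget across components according to COST_MIX."""
    if not math.isclose(sum(COST_MIX.values()), 1.0):
        raise ValueError("cost shares must sum to 1.0")
    return {name: round(total_monthly_usd * share, 2)
            for name, share in COST_MIX.items()}


def cost_per_query(total_monthly_usd: float,
                   monthly_queries: int = 1_000_000) -> float:
    """Blended cost per query at the given monthly volume."""
    return total_monthly_usd / monthly_queries


# Hypothetical input: a 50,000 USD monthly total at 1M queries.
budget = allocate(50_000)
blended = cost_per_query(50_000)
```

Rerunning the allocation with your own shares is a quick way to see which component a prompt or retrieval-window change will actually move.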

The pgvector vs hosted-vector-DB choice often dominates the cost-and-compliance conversation. We benchmarked both head-to-head on production load: Postgres + pgvector vs Pinecone — a production benchmark to 50M vectors.

For a deterministic ballpark on your specific scope, our Project Estimator runs a six-step wizard and emits a PDF — no sales call required to get a number.

Ready to build

Build fintech RAG that passes audit on day one.

45 minutes with our fintech AI engineers. We'll review your data, regulatory constraints, and integration surface — and tell you honestly what the architecture needs to look like.
