Calculating the cost of generative AI — and how to keep it under control
Honest engineering breakdown of GenAI cost in 2026 — four implementation paths, current API economics, where the money actually goes, and the cost surprises we see most often in production.

GenAI cost surprises rarely come from the headline LLM bill. They come from three places teams underestimate at scoping: data preparation (often 25–30% of the budget alone), evaluation and observability infrastructure (audit-grade logging is not free), and re-embedding cycles when source documents update. The model API bill usually falls outside the top three line items, which is why obsessing over per-token pricing while ignoring the rest is the most common pattern we see lead to cost overruns.
This article maps the four real implementation paths for GenAI in business — from "use the API as-is" to "fine-tune an open-source model on customer infrastructure" — with current 2026 pricing, the cost components that dominate at production scale, and the engagement patterns we actually ship.
For agent-specific cost framing, see "How much does AI agent development cost". For the production-grade reference architecture in regulated finance, see /fintech/rag.
The four implementation paths
Every GenAI project picks one of four paths. The right choice is workload-specific — there is no universally cheapest option. The wrong path costs 5–10× more than the right one for the same outcome.
Path 1: Closed-source API as-is. Use a hosted foundation model (Claude, GPT, Gemini) via API with prompt engineering only. Fastest to ship, no fine-tuning, no infrastructure. Works for content generation, structured extraction, classification, and most knowledge-grounded Q&A when paired with retrieval-augmented generation (RAG). Cost ranges from cents per query at low volume to six-figure monthly bills at enterprise scale.
Path 2: Fine-tuned closed-source. Hosted model + provider-managed fine-tuning on your data. OpenAI, Google Vertex AI, Anthropic via Bedrock all support this. Improves response quality for narrow domains, but costs add up — both the fine-tuning step and ongoing per-token charges that don't drop. Best for organizations with domain-specific accuracy needs that still want vendor infrastructure. Total cost typically $25K–$80K including fine-tuning fees and 6–12 months of usage.
Path 3: Open-source self-hosted as-is. Llama 3, Mistral, Phi, Gemma deployed on your cloud or on-prem infrastructure. No per-token API bill, full data residency, full control over the deployment. Requires real DevOps capability and ongoing operational cost. Production-typical at $30K–$80K including infrastructure setup, integration, and first-year operations.
Path 4: Open-source fine-tuned. Same as Path 3 plus fine-tuning the base model on proprietary data. Maximum customization and data control. Required when latency is critical, regulatory load is heavy, and domain accuracy matters more than ship speed. Total cost $100K–$250K+ including data prep, fine-tuning, infrastructure, integration, and ongoing MLOps.
The decision factors that actually matter:
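- Data residency. HIPAA, SR 11-7, EU AI Act high-risk deployments, and government workloads often push toward Path 3 or 4. Hosted APIs are non-starters wherever the provider cannot meet the residency, contractual, and audit terms the regulation demands.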
- Volume. API economics dominate at low volume; infrastructure economics dominate at high volume. Crossover for typical workloads is around 1M queries/month, but workload shape matters more than the gross number.
- Latency. Hosted models have 800ms–5s typical generation latency. Fine-tuned small models on customer infrastructure run 50–300ms. For real-time interactions, Path 3/4 wins.
- Iteration speed. Path 1 ships fastest. If you can pilot on Path 1 and validate the workload before committing to Path 3/4, that's almost always the right sequencing.
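The decision factors above can be sketched as a rule-of-thumb function. This is an illustrative heuristic, not a scoping tool: the `Workload` fields, the `suggest_path` name, and the hard cut-offs (1M queries/month, 800 ms) are assumptions lifted from the ranges quoted in this section, and real scoping weighs workload shape rather than single thresholds.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    residency_required: bool   # data must stay on customer-controlled infrastructure
    monthly_queries: int       # expected steady-state volume
    latency_budget_ms: int     # generation latency the product can tolerate
    needs_domain_tuning: bool  # prompting plus RAG alone is not accurate enough


def suggest_path(w: Workload) -> int:
    """Return a suggested implementation path (1-4) for a workload."""
    # Assumed thresholds: ~1M queries/month crossover, 800 ms hosted-latency floor.
    self_host = (
        w.residency_required
        or w.monthly_queries >= 1_000_000
        or w.latency_budget_ms < 800
    )
    if self_host:
        return 4 if w.needs_domain_tuning else 3
    return 2 if w.needs_domain_tuning else 1


if __name__ == "__main__":
    pilot = Workload(False, 50_000, 3_000, False)
    print(suggest_path(pilot))
```

Even when the function returns 3 or 4, piloting on Path 1 first remains the sequencing recommended above whenever residency rules allow it.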
API economics in 2026
Current per-token pricing for the foundation models we deploy most often:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Notes |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Strong general capability, broad ecosystem |
| GPT-4o mini | $0.15 | $0.60 | Cost-efficient for most production workloads |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Strongest reasoning per token, our default for fintech |
| Claude Haiku 4.5 | $0.80 | $4.00 | Fast, cheap, good for high-volume workloads |
| Gemini 2.5 Pro | $1.25 | $5.00 | Strong multimodal, good GCP integration |
| Gemini 2.5 Flash | $0.075 | $0.30 | Cheapest credible model in the tier |
For visual workloads: DALL·E 3 runs $0.04 per standard 1024×1024 image; high-definition or larger sizes climb to $0.08–$0.12. Imagen 3 and Flux are competitive alternatives. Sora and Veo are bundled in ChatGPT Pro and Gemini Advanced subscriptions.
A practical worked example. A customer-support workload running 200K queries per month, average prompt 2K tokens input + 400 tokens output, on Claude Sonnet 4.6:
- Input: 200K × 2K = 400M tokens × $3.00/1M = $1,200
- Output: 200K × 400 = 80M tokens × $15.00/1M = $1,200
- Monthly API cost: ~$2,400
Move that to Claude Haiku 4.5 with the same volume: ~$640/month. Move to Gemini 2.5 Flash: ~$54/month. Model selection alone moves the API bill by roughly 44× without changing any other architecture.
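What this misses: the same workload on a self-hosted Llama 3.1 8B instance with a single A100 GPU costs about $7K/month in cloud GPU spend (or $1.5K/month amortized on owned hardware), and that bill stays flat with query count until the GPU's throughput is saturated. At the Sonnet rate above ($2,400 for 200K queries, or $0.012 per query), $7K/month buys roughly 580K queries, so the self-hosted SLM economics start to win somewhere past ~600K queries/month. Against Haiku-class pricing ($0.0032 per query), the break-even moves out to about 2.2M queries/month.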
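The arithmetic in this section can be reproduced with a few lines of Python. This is a minimal sketch under the same assumptions as the worked example (fixed tokens per query, list prices from the table above); the function names are ours, and the break-even ignores GPU throughput limits, operations staffing, and any quality difference between models, so it is sensitive to which per-query rate you plug in.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (0.80, 4.00),
    "Gemini 2.5 Flash": (0.075, 0.30),
}


def monthly_api_cost(model, queries, input_tokens, output_tokens):
    """Monthly API spend in USD for a fixed per-query token profile."""
    input_price, output_price = PRICES[model]
    input_cost = queries * input_tokens / 1_000_000 * input_price
    output_cost = queries * output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


def break_even_queries(fixed_monthly_cost, api_cost_per_query):
    """Monthly volume at which a flat self-hosting bill equals API spend."""
    if api_cost_per_query <= 0:
        raise ValueError("api_cost_per_query must be positive")
    return fixed_monthly_cost / api_cost_per_query


if __name__ == "__main__":
    profile = dict(queries=200_000, input_tokens=2_000, output_tokens=400)
    for model in PRICES:
        monthly = monthly_api_cost(model, **profile)
        per_query = monthly / profile["queries"]
        volume = break_even_queries(7_000, per_query)
        print(f"{model}: ${monthly:,.0f}/month; "
              f"matches a $7K/month GPU at {volume:,.0f} queries/month")
```

Swapping in your own token profile matters more than the model list: doubling the average prompt length doubles the input line and halves the break-even volume for input-heavy workloads.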
Fine-tuning closed-source
OpenAI, Google Vertex, and Anthropic via Bedrock all support fine-tuning their hosted models. The economics:
- One-time fine-tuning fee. Currently $25/M training tokens for OpenAI GPT-4o, comparable rates for Vertex and Bedrock. A typical 50K-example fine-tune at 1K tokens average runs ~$1,250 in fine-tuning fees alone.
- Inference fees stay the same. A fine-tuned GPT-4o still costs $2.50 input / $10 output per 1M tokens. You're not getting cheaper inference — you're getting better outputs.
- Engineering time. Data prep, eval set construction, fine-tuning supervision, and quality validation typically run $20K–$60K of engineering effort.
Total for a serious closed-source fine-tune: $25K–$80K, 6–12 weeks. Worth it when the domain accuracy gain justifies the engineering effort and the output quality lift is measurable on a clean eval set.
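The fine-tuning fee above is a straight multiplication, sketched here so you can plug in your own dataset size. The `epochs` parameter is an assumption on our part: training fees are typically billed per trained token, so multiple passes over the data multiply the fee, but check your provider's billing rules before relying on it.

```python
def fine_tuning_fee(examples, avg_tokens_per_example, price_per_million_tokens, epochs=1):
    """One-time training fee in USD, assuming billing per trained token."""
    trained_tokens = examples * avg_tokens_per_example * epochs
    return trained_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # The 50K-example, 1K-token case from the text at $25 per 1M training tokens.
    print(fine_tuning_fee(50_000, 1_000, 25.0))
    # Same dataset trained for three passes (hypothetical setting).
    print(fine_tuning_fee(50_000, 1_000, 25.0, epochs=3))
```

Either way the fee stays small next to the $20K–$60K of engineering effort, which is the figure to scrutinize when scoping.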
For a deeper take on when fine-tuning is worth it vs. RAG, see our LLM training stages article.
Open-source self-hosted
The cost profile inverts here. No per-token bill, but real infrastructure cost.

GPU costs. A single NVIDIA A100 80GB runs $10K–$20K to purchase outright, or roughly $3–$5/hour on AWS, GCP, or Azure. A Llama 3.1 8B model fits comfortably on a single A100 with room for batched inference. A 70B model needs multiple GPUs: a 4× A100 setup runs $40K–$80K to purchase, or roughly $15–$25/hour on cloud.
Cloud vs on-prem. For workloads under 50% sustained utilization, cloud GPUs are cheaper. Above that threshold, on-prem starts to pay back over 12–18 months. For regulated workloads with data residency requirements, on-prem may be the only option regardless of utilization.
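The utilization threshold can be checked with a simple payback calculation. This is a sketch under stated assumptions: cloud GPUs are billed only for the hours actually used, a month has 730 hours, and `onprem_monthly_opex` (power, maintenance) is a placeholder you should replace with your own figure.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def payback_months(purchase_price, cloud_hourly_rate, utilization, onprem_monthly_opex=0.0):
    """Months until buying a GPU becomes cheaper than renting at this utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    cloud_monthly = cloud_hourly_rate * HOURS_PER_MONTH * utilization
    monthly_saving = cloud_monthly - onprem_monthly_opex
    if monthly_saving <= 0:
        return float("inf")  # on-prem never pays back under these inputs
    return purchase_price / monthly_saving


if __name__ == "__main__":
    # Mid-range A100 figures from the text, with an assumed $300/month of on-prem opex.
    for utilization in (0.2, 0.5, 0.8):
        months = payback_months(15_000, 4.0, utilization, onprem_monthly_opex=300)
        print(f"{utilization:.0%} utilization: {months:.1f} months")
```

With these inputs, 50% utilization pays back in about 13 months, while low utilization stretches the payback past the useful life of the hardware.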
Integration and deployment. Wiring the model into your business systems (ERP, CRM, support tooling) typically runs $15K–$40K of engineering effort. This is the line item most often underestimated in initial scoping — the LLM is the easy part; the integration is where projects stall.
Data storage. Cloud storage at $0.021–$0.023/GB/month for vector indices, document corpora, and training data. For a typical RAG corpus of 5–10M chunks plus embeddings, monthly storage runs $50–$200. On-prem storage shifts this to capital expense.
Electricity and maintenance. For on-prem deployments, $2K–$5K per year covers electricity and basic hardware maintenance. Cloud deployments roll this into the hourly GPU rate.
Total for a mid-sized self-hosted GenAI deployment (Llama 3.1 8B + RAG, mid-volume traffic): $30K–$80K initial + $7K–$20K recurring annually.

Open-source fine-tuned
The most expensive path, and the right one for narrow stable workloads where domain accuracy is critical and data residency is required.
Hardware/cloud cost. Same as Path 3, scaled for fine-tuning compute. Fine-tuning a 7B–13B SLM with LoRA runs roughly $500–$3K in compute. Full fine-tuning of the base model is 10× more.
Development time. 6–12 months of fine-tuning, evaluation, and deployment work. Mix of senior AI engineering and DevOps, totaling roughly $80K–$200K depending on team composition (in-house at $150K average loaded annual; outsourced at $75–$120/hour through senior delivery partners).
Data preparation. $20K–$50K typical for a serious fine-tuning corpus — collection, cleaning, labeling, deduplication, eval set construction. This is where most projects either succeed or fail. Synthetic data generation is increasingly used to fill gaps without expanding labeling payroll.
Maintenance. $10K–$30K annual for monitoring, retraining cycles, prompt and retrieval tuning, model drift correction. Production AI is a learning system, not a one-time build.
Total for a serious open-source fine-tuned deployment: $100K–$250K initial + $15K–$30K recurring annually. For workloads where Path 4 fits — fintech RAG with regulatory load, healthcare AI with HIPAA, life sciences with 21 CFR Part 11 — the cost is justified by what no other path can deliver.
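As a sanity check, the component ranges in this section can be summed to see how they relate to the quoted total. The sketch below is an assumption about scope: it includes only the three items with explicit one-time figures here (LoRA compute, development, data preparation) and leaves out the Path 3 infrastructure and recurring maintenance.

```python
# (low, high) in USD, taken from the component ranges in this section.
COMPONENTS = {
    "lora_compute": (500, 3_000),
    "development": (80_000, 200_000),
    "data_preparation": (20_000, 50_000),
}


def total_range(components):
    """Sum the low ends and the high ends of each component range."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(COMPONENTS)
    print(f"${low:,} to ${high:,}")
```

Development and data preparation account for nearly all of the total; fine-tuning compute is a rounding error by comparison.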
Where the money actually goes
When we quote a GenAI project, the line items rarely match the "pay for the model" mental model that buyers come in with. Production budget distribution we see across engagements:
- Engineering effort (~30–35%) — architecture, prompt engineering, retrieval and orchestration code, integration adapters, evaluation harnesses
- Data preparation (~25–30%) — collecting, cleaning, labeling, structuring source corpora; building eval sets; PII redaction at ingest
- Integration and middleware (~15–20%) — connecting the AI to ERPs, CRMs, ticketing systems with proper authentication and idempotency handling
- Evaluation, safety, HITL (~10%) — golden-set evaluation, hallucination metrics, refusal-rate tracking, bias testing, human-in-the-loop workflows
- Infrastructure and compute (~5–10%) — GPU costs for inference and fine-tuning; vector store and storage
- Compliance overhead (~5%, multiplier in regulated industries) — audits, encryption, model interpretability, legal review
- Year-2 maintenance (separate budget) — 15–20% of initial build annually for retraining, prompt tuning, monitoring
Note that the "API bill" or "model cost" doesn't appear as a top-three line item for most production deployments. That's because the model is the cheapest part of any serious GenAI system — the system around it is where the cost lives.
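To turn the distribution above into a first-pass budget, a small helper can split a total by the midpoint of each range. This is a sketch, not our quoting method: using midpoints is an assumption, the category keys are our own shorthand, and regulated projects should raise the compliance share.

```python
# Midpoints of the percentage ranges listed above (assumed representative).
SHARES = {
    "engineering": 32.5,
    "data_preparation": 27.5,
    "integration": 17.5,
    "evaluation_safety_hitl": 10.0,
    "infrastructure_compute": 7.5,
    "compliance": 5.0,
}


def allocate(budget):
    """Split an initial build budget across line items, normalizing the shares."""
    total = sum(SHARES.values())
    return {name: budget * share / total for name, share in SHARES.items()}


def year_two_maintenance(budget):
    """Annual maintenance range at 15-20% of the initial build."""
    return budget * 0.15, budget * 0.20


if __name__ == "__main__":
    for name, amount in allocate(150_000).items():
        print(f"{name}: ${amount:,.0f}")
    low, high = year_two_maintenance(150_000)
    print(f"year-2 maintenance: ${low:,.0f} to ${high:,.0f}")
```

On a $150K build, this puts infrastructure and compute at about $11K against nearly $49K of engineering effort.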
A real cost-economics flip: AI Dungeon
A useful concrete example. Latitude, the startup behind AI Dungeon, used OpenAI's GPT models to power its text generation. As traffic grew, the OpenAI bill plus AWS infrastructure reached $200K/month. After switching to a different generative AI provider and adjusting architecture, monthly operating cost dropped to ~$100K, and the company restructured monetization around a subscription tier for advanced features.
The lesson isn't that one provider is cheaper than another. It's that architecture and provider choice can move costs by 2× or more without changing the user-facing product. At enterprise scale, the right architecture choice is the largest cost lever in any GenAI deployment.
Two JSL case studies
GenAI sales training platform powered by RAG. A US-based SaaS company that specializes in corporate education partnered with us to reduce sales rep onboarding time. Traditional onboarding ran 6 months and cost $100K+ per rep. We engineered a modular GenAI training platform on OpenAI GPT-4 plus a custom RAG pipeline. Internal content (PDFs, presentations, transcripts) was parsed into structured text, embedded with OpenAI and SentenceTransformers, and made queryable via adaptive retrieval. Few-shot learning, personalized lesson generation from resumes, dynamic difficulty calibration, and a live Q&A module rounded out the product. Hosted on Microsoft Azure (Service Bus, SQL Server, Blob Storage), with LLM services kept modular for future model swaps.
- Estimated cost: $100K–$200K
- Duration: under 4 months
- Team: 1 AI engineer, 1 front-end, 1 back-end, 0.5 QA, 0.5 PM
- Outcome: 92% reduction in onboarding cycle (six months to two weeks)
- GenAI components were ~20% of total budget; the rest went to platform features (user roles, monetization flows, subscription logic)
Melody Sage — GenAI music learning platform. Internal R&D project to test how GenAI could transform personalized education. Goal: a fully autonomous GenAI tutor that develops custom curricula and responds intelligently to learner queries in real time. Built on Google Cloud (Vertex AI, Gemini 2.5 Pro, Imagen 3) plus a custom RAG pipeline and AI agent flow. Document AI and Vertex AI parse uploaded materials into chunks; Gemini and Imagen 3 generate structured lessons and illustrated covers. The consultation agent augments internal knowledge with real-time web search via Google Search API, with a self-assessment step to evaluate conflicting sources. Two-step prompting reduces manual overhead in quiz generation. Early experiments with Claude 3.5 produced high-quality results but introduced latency complexity; Gemini 2.5 Pro proved more efficient on GCP infrastructure.
- Estimated cost: $100K–$200K (full product); ~1 month and 3-person team for the R&D prototype
- Duration: ~1 month (prototype) or 2–4 months (full product)
- Team: 1 AI engineer, 1 full-stack, 1 DevOps
- GenAI components were ~20% of total cost; the rest went to business logic and supporting infrastructure
The 20% pattern is consistent across most GenAI engagements we ship. The model and RAG pipeline are real engineering, but they're a fraction of what it takes to put a GenAI product in front of paying users.
Common cost surprises in production
Five places we see budgets routinely overrun:
- Re-embedding cycles. Source documents update; embeddings need to refresh. For a 5M-chunk corpus updating monthly, re-embedding alone runs $500–$2,500/month depending on the embedding model. Forecast it.
- Reranker latency overprovisioning. Cohere Rerank or a fine-tuned cross-encoder adds 50–200ms to query latency. To meet p99 SLAs, infrastructure gets overprovisioned. Honest capacity planning saves 20–40% on infrastructure spend.
- Audit-grade observability. Joining model output to user identity, retrieval window, and timestamp for compliance audits is operationally expensive. Plan for it from the start; retrofitting is 3× the cost.
- Compliance review cycles. Healthcare and finance projects often hit 6–8 weeks of compliance review on architecture and outputs before launch. Budget the calendar time, not just the engineering time.
- Data preparation rework. Teams that scope $10K of data prep for a serious project routinely spend $40K when the corpus turns out to need normalization, de-duplication, and PII redaction. Audit data quality before fixing the budget.
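Forecasting the re-embedding line item is a short calculation. The sketch below is illustrative only: the tokens-per-chunk figure and the per-token price are placeholders rather than quotes from any provider, and `changed_fraction` is a hypothetical knob for pipelines that re-embed only the chunks whose content changed.

```python
def reembedding_cost(chunks, avg_tokens_per_chunk, price_per_million_tokens, changed_fraction=1.0):
    """Embedding spend in USD for one refresh cycle."""
    if not 0.0 <= changed_fraction <= 1.0:
        raise ValueError("changed_fraction must be between 0 and 1")
    tokens = chunks * avg_tokens_per_chunk * changed_fraction
    return tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Placeholder inputs: 5M chunks, 500 tokens each, $0.50 per 1M tokens.
    full = reembedding_cost(5_000_000, 500, 0.50)
    partial = reembedding_cost(5_000_000, 500, 0.50, changed_fraction=0.2)
    print(f"full refresh: ${full:,.0f}; 20% changed: ${partial:,.0f}")
```

Tracking which chunks actually changed is usually the cheapest way to shrink this number, since the full-refresh cost scales linearly with corpus size.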
What to consider before scoping
A practical pre-scoping checklist:
- Buy vs build. Where would GenAI actually become a differentiator vs. a generic SaaS feature? The right scope avoids vendor lock-in but doesn't reinvent commodity capabilities.
- MLOps capacity. Does the in-house team have the skills to test, fine-tune, and maintain ML models? If not, partner with a delivery team that does.
- Compute access. Cloud GPU quotas, on-prem hardware, or hybrid? Capacity is often the gating factor on timeline.
- PoC capability. Can you (or your partner) ship a proof of concept that validates the business case before committing to full implementation? PoCs are 3–4 weeks and 5–10% of full project cost — almost always the right first step.
- Privacy and security. Are the encryption, access control, and audit logging in place to handle the data the AI will see?
Having those answers before scoping moves the cost conversation from theoretical to actionable.
FAQ
How much does it cost to implement generative AI in business? The honest range: a few hundred dollars per month for SaaS-based tools at one extreme, $250K+ for a custom enterprise-grade fine-tuned open-source deployment at the other. Most production engagements land in the $40K–$200K range for the initial build. The biggest cost driver isn't the model — it's the path (closed-source API vs fine-tuned vs self-hosted) and the regulatory load.
What's the cheapest way to start? Path 1 (closed-source API as-is) plus RAG. Validates the workload, ships in 8–12 weeks, lets you measure quality on real traffic before committing to fine-tuning or self-hosting. We use this as the default starting point unless residency or latency constraints rule it out.
When does self-hosting beat API economics? Crossover for typical workloads is around 1–3M queries/month. Below that, hosted APIs are cheaper. Above it, self-hosting wins. The exact crossover depends on prompt length, output length, and which model class you're using: at the list prices above, GPT-4o vs GPT-4o mini changes the math by roughly 17×.
How do I avoid the AI Dungeon scenario (cost spiral)? Three patterns. First, route by query type: simple queries go to a small fast model, complex ones to a frontier model. Second, cache deterministically answerable queries so the model doesn't re-think the same question. Third, monitor the unit economics monthly. If cost per query is climbing while quality stays flat, your architecture has drifted.
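The first two patterns can be sketched in a few lines. This is a minimal illustration, not a production router: `RoutedClient`, the word-count complexity heuristic, and the two model callables are all hypothetical, and the cache here stores every answer, whereas a real system would cache only queries whose answer does not depend on the user or the time.

```python
import hashlib
from typing import Callable, Dict


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class RoutedClient:
    def __init__(self, small_model: Callable[[str], str],
                 frontier_model: Callable[[str], str],
                 max_simple_words: int = 30):
        self.small_model = small_model
        self.frontier_model = frontier_model
        self.max_simple_words = max_simple_words
        self._cache: Dict[str, str] = {}
        self.calls = {"cache": 0, "small": 0, "frontier": 0}

    def answer(self, query: str) -> str:
        key = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
        if key in self._cache:
            self.calls["cache"] += 1
            return self._cache[key]
        # Hypothetical heuristic: short queries are treated as simple.
        if len(query.split()) <= self.max_simple_words:
            self.calls["small"] += 1
            result = self.small_model(query)
        else:
            self.calls["frontier"] += 1
            result = self.frontier_model(query)
        self._cache[key] = result
        return result


if __name__ == "__main__":
    client = RoutedClient(lambda q: "small answer", lambda q: "frontier answer")
    client.answer("What are your opening hours?")
    client.answer("what are your  opening hours?")
    print(client.calls)
```

The `calls` counter covers the third pattern in miniature: multiplying each bucket by its per-query cost gives the blended unit cost to watch month over month.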
Should we do a PoC first? Almost always yes. A 3–4 week PoC at 5–10% of full project cost validates the business case, surfaces integration risks, and lets you scope the full project on real data. Skipping the PoC is the most common reason GenAI projects overrun.
Ready to scope a GenAI project? Run the Project Estimator for a deterministic ballpark across all four implementation paths, or book a 45-minute Discovery with our GenAI engineers — we'll help you decide which path actually fits your workload.










