BYOK
Bring-your-own-Kubernetes deployment for open-source LLMs and small language models. Full data sovereignty, predictable GPU economics, and up to 100x cost reduction over frontier APIs — governed by the same Vharta control plane you already trust.
Why self-hosted AI
Enterprise AI has outgrown "send everything to a SaaS endpoint." Cost, privacy, and lock-in are now board-level concerns.
Frontier LLM bills scale linearly with usage, yet SLMs handle 80% of enterprise tasks (classification, extraction, routing) at 1-5% of the cost. Right-size the model to the task, not the other way around.
Regulated industries — finance, healthcare, government, legal — cannot send customer data to third-party SaaS LLM APIs. BYOK keeps every prompt, completion, and embedding inside your VPC.
Pricing changes, deprecations, and rate limits from a single provider put your roadmap at risk. Open models running on open infrastructure mean you control the stack end-to-end.
BYOK Clusters
The Vharta control plane orchestrates AI workloads onto Kubernetes clusters you already own and operate. The data plane stays inside your network perimeter — always.
Prompts, completions, embeddings, and fine-tuning data stay within your cluster boundary. Vharta's control plane orchestrates workloads without ever touching your data plane.
Deploy on your reserved capacity: EKS, GKE, AKS, OpenShift, or bare-metal Kubernetes. Capture committed-use discounts and amortize on-prem GPUs instead of paying hyperscaler premiums.
NetworkPolicies, service meshes, egress firewalls, and private DNS continue to apply. No exceptions carved out for AI traffic. Air-gapped deployments supported.
Data residency stays in-region. Audit logs, immutable records, and encryption-at-rest inherit from your existing Kubernetes posture. SOC 2, HIPAA, GDPR, and FedRAMP workloads supported.
Ollama Provisioning
Vharta provisions and manages Ollama inside your clusters. Pick a model from the registry or declare it in Git — we handle packaging, GPU scheduling, version control, and progressive rollout.
Any model in the Ollama registry is one manifest away, with popular choices spanning reasoning, code, and multilingual workloads.
Treat models like any other infrastructure artifact — versioned, reviewable, and continuously deployed.
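As a sketch of what that declarative flow could look like: the custom resource below is hypothetical (the `vharta.io` group, `ModelDeployment` kind, and every field name are illustrative, not Vharta's published API). In a GitOps setup the spec would live as YAML in your repo; here it is expressed as a Python dict and applied with the standard Kubernetes Python client.

```python
# Illustrative only: "ModelDeployment" and the vharta.io API group are
# hypothetical stand-ins for whatever CRD the control plane actually uses.
from kubernetes import client, config

model_deployment = {
    "apiVersion": "vharta.io/v1alpha1",   # hypothetical group/version
    "kind": "ModelDeployment",            # hypothetical kind
    "metadata": {"name": "qwen2-7b", "namespace": "ai-inference"},
    "spec": {
        "runtime": "ollama",
        "model": "qwen2:7b",              # any tag from the Ollama registry
        "replicas": 2,
        "resources": {"limits": {"nvidia.com/gpu": 1}},
        "rollout": {"strategy": "canary", "stepPercent": 25},
    },
}

config.load_kube_config()                 # or load_incluster_config() in-cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="vharta.io",
    version="v1alpha1",
    namespace="ai-inference",
    plural="modeldeployments",
    body=model_deployment,
)
```

Because the spec is just data, it reviews like any other pull request: a model upgrade is a one-line diff with a canary rollout attached.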
SLMs & DSLMs
Not every task needs a 400B frontier model. Small language models (SLMs) and domain-specific language models (DSLMs) beat frontier APIs on cost, latency, and privacy for most enterprise workloads.
Reserve frontier APIs for what they're actually good at: the hardest reasoning, novel tool-use, and last-mile quality on the top 5-20% of your traffic.
| Tier | Size | Cost / 1M tokens | Latency (p50) | Best-fit use cases |
|---|---|---|---|---|
| SLM (Small) | 0.5B – 3B params | $0.02 – $0.10 | 20 – 80 ms | Classification, routing, intent detection, PII extraction |
| SLM (Mid) | 7B – 14B params | $0.15 – $0.50 | 60 – 200 ms | NER, structured extraction, short summarization, tool selection |
| DSLM | 7B – 32B fine-tuned | $0.20 – $0.80 | 80 – 300 ms | Domain summarization, contract clause detection, coding assist |
| Open Large | 70B+ params | $1 – $4 | 300 – 900 ms | Complex reasoning, multi-doc synthesis, agentic planning |
| Frontier API | Proprietary | $3 – $75 | 600 – 2000 ms | Hardest reasoning, novel tool-use, last-mile quality |
Indicative figures. Actual cost and latency depend on hardware, batching, and quantization. Local inference also saves 50-200ms of network round-trip compared to cloud APIs.
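As a back-of-the-envelope example of what right-sizing buys, the sketch below prices 1B tokens per month under a frontier-only setup versus an 80/15/5 tiered split, using midpoints from the indicative ranges above. The traffic shares and per-token prices are assumptions, not measurements.

```python
# Back-of-the-envelope monthly cost for 1B tokens/month, using illustrative
# midpoints of the indicative ranges above. All numbers are assumptions.
MONTHLY_TOKENS_M = 1_000  # 1B tokens = 1,000 "million-token" units

frontier_only = MONTHLY_TOKENS_M * 10.00           # ~$10 / 1M tokens, frontier midrange

tiered = (
    MONTHLY_TOKENS_M * 0.80 * 0.06     # 80% -> small SLM   (~$0.06 / 1M)
    + MONTHLY_TOKENS_M * 0.15 * 0.50   # 15% -> DSLM        (~$0.50 / 1M)
    + MONTHLY_TOKENS_M * 0.05 * 10.00  #  5% -> frontier escalation
)

print(f"frontier-only: ${frontier_only:,.0f}/mo")        # $10,000/mo
print(f"tiered:        ${tiered:,.0f}/mo")               # $623/mo
print(f"savings:       {frontier_only / tiered:.0f}x")   # ~16x
```

Even this conservative split lands around 16x cheaper; pushing a larger share of traffic onto small models is where the bigger multiples come from.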
Use Cases
Four concrete patterns regulated customers run on Vharta with BYOK clusters and Ollama-provisioned models.
Extract entities, amounts, counterparties, and obligations from statements, invoices, and SWIFT messages with a fine-tuned 7B SLM. Keeps PII inside the bank's VPC and cuts per-document cost from dollars to fractions of a cent.
A DSLM fine-tuned on de-identified discharge summaries produces structured handoffs under HIPAA with no PHI egress. Sub-200ms responses enable real-time assistance during clinician workflows.
Run a fine-tuned DSLM over MSAs, NDAs, and SaaS agreements to flag non-standard indemnity, liability, and IP clauses. Attorney-client privilege preserved — nothing crosses the firm's network boundary.
A 3B classifier routes tickets to the right queue, detects sentiment, and extracts product context in under 50ms. Frontier models are invoked only for the top 5% of cases that need them, as sketched below.
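A minimal sketch of that last pattern, assuming a local Ollama endpoint on its default port. The model tag, queue labels, and prompt are illustrative; production routing would add retries and escalation to a larger model for low-confidence cases.

```python
# Minimal sketch of the ticket-routing pattern against a local Ollama
# endpoint. Model tag, labels, and prompt are illustrative assumptions.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port
QUEUES = ["billing", "outage", "account", "feature-request"]

def route_ticket(ticket_text: str) -> str:
    prompt = (
        f"Classify this support ticket into exactly one of: {', '.join(QUEUES)}. "
        f"Reply with the label only.\n\nTicket: {ticket_text}"
    )
    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen2.5:3b",          # illustrative 3B classifier
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},  # deterministic labels
    }, timeout=30)
    label = resp.json()["response"].strip().lower()
    # Fall back to a default queue if the model answers off-label.
    return label if label in QUEUES else "account"

print(route_ticket("I was charged twice for my subscription this month."))
```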
Platform Integration
BYOK clusters and open models plug into the Vharta control plane as first-class providers. Every capability you rely on for SaaS LLMs applies equally to models running on your own hardware.
Self-hosted models are first-class providers in the Vharta AI Gateway. Route by cost, latency, or capability with the same policies, rate limits, and failover rules you apply to SaaS providers.
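Vharta's gateway configuration isn't reproduced here, but the routing logic reduces to something like the sketch below. Every provider name and number is hypothetical; the point is that self-hosted and SaaS providers sit in one candidate pool ranked by the same cost and latency policies.

```python
# Hypothetical routing sketch: pick the cheapest provider that meets a
# latency budget, falling back to the next candidate on failure.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    cost_per_m: float   # $ per 1M tokens
    p50_ms: int
    self_hosted: bool

PROVIDERS = [
    Provider("ollama/qwen2.5:7b", 0.30, 120, True),
    Provider("ollama/llama3.1:70b", 2.50, 600, True),
    Provider("frontier-api", 15.00, 1200, False),
]

def pick(latency_budget_ms: int, pii: bool) -> list[Provider]:
    """Candidates in failover order: cheapest first, within the latency
    budget. PII-bearing requests never leave self-hosted providers."""
    ok = [p for p in PROVIDERS
          if p.p50_ms <= latency_budget_ms and (p.self_hosted or not pii)]
    return sorted(ok, key=lambda p: p.cost_per_m)

for p in pick(latency_budget_ms=800, pii=True):
    print(p.name)  # ollama/qwen2.5:7b, then ollama/llama3.1:70b
```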
OPA policies evaluate every model invocation. Restrict which tenants can use which models, enforce PII redaction, and require approvals for production model changes.
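OPA's standard REST decision API (`POST /v1/data/<policy-path>`) makes the enforcement point easy to picture. In this sketch the policy path and input fields are hypothetical; the API shape itself is stock OPA.

```python
# Query an OPA sidecar before dispatching an inference call. The policy
# path ("vharta/model/allow") and input shape are hypothetical; the
# /v1/data REST decision API is standard OPA.
import requests

OPA_URL = "http://localhost:8181/v1/data/vharta/model/allow"

def is_allowed(tenant: str, model: str, environment: str) -> bool:
    decision = requests.post(OPA_URL, json={
        "input": {
            "tenant": tenant,
            "model": model,
            "environment": environment,
        }
    }, timeout=5).json()
    return decision.get("result", False)  # deny by default if undefined

if not is_allowed("acme-bank", "ollama/qwen2.5:7b", "production"):
    raise PermissionError("model invocation denied by policy")
```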
Every prompt, completion, tool call, and model version is logged to the same append-only audit store as your SaaS traffic. One compliance story across hosted and self-hosted inference.
GPU-seconds, memory-hours, and token throughput roll up into the same per-tenant metering you use for SaaS LLM spend. Budget alerts and chargeback reports include self-hosted usage.
Talk to our team about BYOK deployment, Ollama provisioning, and the right SLM/DSLM mix for your workloads. We'll help you map a path from frontier-only spend to sovereign, right-sized AI.