BYOK
Bring-your-own-Kubernetes deployment for open-source LLMs and small language models. Full data sovereignty, predictable GPU economics, and up to 100x cost reduction over frontier APIs — governed by the same Vharta control plane you already trust.
Why self-hosted AI
Enterprise AI has outgrown "send everything to a SaaS endpoint." Cost, privacy, and lock-in are now board-level concerns.
Frontier LLM bills scale linearly with usage, yet SLMs handle 80% of enterprise tasks (classification, extraction, routing) at 1-5% of the cost. Right-size the model to the task, not the other way around.
Regulated industries — finance, healthcare, government, legal — cannot send customer data to third-party SaaS LLM APIs. BYOK keeps every prompt, completion, and embedding inside your VPC.
Pricing changes, deprecations, and rate limits from a single provider put your roadmap at risk. Open models running on open infrastructure mean you control the stack end-to-end.
BYOK Clusters
The Vharta control plane orchestrates AI workloads onto Kubernetes clusters you already own and operate. The data plane stays inside your network perimeter — always.
Prompts, completions, embeddings, and fine-tuning data stay within your cluster boundary. Vharta's control plane orchestrates workloads without ever touching your data plane.
Deploy on your reserved capacity: EKS, GKE, AKS, OpenShift, or bare-metal Kubernetes. Capture committed-use discounts and amortize on-prem GPUs instead of paying hyperscaler premiums.
NetworkPolicies, service meshes, egress firewalls, and private DNS continue to apply. No exceptions carved out for AI traffic. Air-gapped deployments supported.
Data residency stays in-region. Audit logs, immutable records, and encryption-at-rest inherit from your existing Kubernetes posture. SOC 2, HIPAA, GDPR, and FedRAMP workloads supported.
Ollama Provisioning
Vharta provisions and manages Ollama inside your clusters. Pick a model from the registry or declare it in Git — we handle packaging, GPU scheduling, version control, and progressive rollout.
Any model in the Ollama registry is one manifest away, with popular choices spanning reasoning, code, and multilingual workloads.
Treat models like any other infrastructure artifact — versioned, reviewable, and continuously deployed.
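As a sketch of what that declarative flow could look like: the custom resource below is hypothetical (the `vharta.io` group, `ModelDeployment` kind, and every field name are illustrative, not Vharta's published API). In a GitOps setup the spec would live as YAML in your repo; here it is expressed as a Python dict and applied with the standard Kubernetes Python client.

```python
# Illustrative only: "ModelDeployment" and the vharta.io API group are
# hypothetical stand-ins for whatever CRD the control plane actually uses.
from kubernetes import client, config

model_deployment = {
    "apiVersion": "vharta.io/v1alpha1",   # hypothetical group/version
    "kind": "ModelDeployment",            # hypothetical kind
    "metadata": {"name": "qwen2-7b", "namespace": "ai-inference"},
    "spec": {
        "runtime": "ollama",
        "model": "qwen2:7b",              # any tag from the Ollama registry
        "replicas": 2,
        "resources": {"limits": {"nvidia.com/gpu": 1}},
        "rollout": {"strategy": "canary", "stepPercent": 25},
    },
}

config.load_kube_config()                 # or load_incluster_config() in-cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="vharta.io",
    version="v1alpha1",
    namespace="ai-inference",
    plural="modeldeployments",
    body=model_deployment,
)
```

Because the spec is just data, it reviews like any other pull request: a model upgrade is a one-line diff with a canary rollout attached.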
SLMs & DSLMs
Not every task needs a 400B frontier model. Small language models (SLMs) and domain-specific language models (DSLMs) beat frontier APIs on cost, latency, and privacy for most enterprise workloads.
Reserve frontier APIs for what they're actually good at: the hardest reasoning, novel tool-use, and last-mile quality on the top 5-20% of your traffic.
| Tier | Size | Cost / 1M tokens | Latency (p50) | Best-fit use cases |
|---|---|---|---|---|
| SLM (Small) | 0.5B – 3B params | $0.02 – $0.10 | 20 – 80 ms | Classification, routing, intent detection, PII extraction |
| SLM (Mid) | 7B – 14B params | $0.15 – $0.50 | 60 – 200 ms | NER, structured extraction, short summarization, tool selection |
| DSLM | 7B – 32B fine-tuned | $0.20 – $0.80 | 80 – 300 ms | Domain summarization, contract clause detection, coding assist |
| Open Large | 70B+ params | $1 – $4 | 300 – 900 ms | Complex reasoning, multi-doc synthesis, agentic planning |
| Frontier API | Proprietary | $3 – $75 | 600 – 2000 ms | Hardest reasoning, novel tool-use, last-mile quality |
Indicative figures. Actual cost and latency depend on hardware, batching, and quantization. Local inference also saves 50-200ms of network round-trip compared to cloud APIs.
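As a back-of-the-envelope example of what right-sizing buys, the sketch below prices 1B tokens per month under a frontier-only setup versus an 80/15/5 tiered split, using midpoints from the indicative ranges above. The traffic shares and per-token prices are assumptions, not measurements.

```python
# Back-of-the-envelope monthly cost for 1B tokens/month, using illustrative
# midpoints of the indicative ranges above. All numbers are assumptions.
MONTHLY_TOKENS_M = 1_000  # 1B tokens = 1,000 "million-token" units

frontier_only = MONTHLY_TOKENS_M * 10.00           # ~$10 / 1M tokens, frontier midrange

tiered = (
    MONTHLY_TOKENS_M * 0.80 * 0.06     # 80% -> small SLM   (~$0.06 / 1M)
    + MONTHLY_TOKENS_M * 0.15 * 0.50   # 15% -> DSLM        (~$0.50 / 1M)
    + MONTHLY_TOKENS_M * 0.05 * 10.00  #  5% -> frontier escalation
)

print(f"frontier-only: ${frontier_only:,.0f}/mo")        # $10,000/mo
print(f"tiered:        ${tiered:,.0f}/mo")               # $623/mo
print(f"savings:       {frontier_only / tiered:.0f}x")   # ~16x
```

Even this conservative split lands around 16x cheaper; pushing a larger share of traffic onto small models is where the bigger multiples come from.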
Use Cases
Four concrete patterns regulated customers run on Vharta with BYOK clusters and Ollama-provisioned models.
Extract entities, amounts, counterparties, and obligations from statements, invoices, and SWIFT messages with a fine-tuned 7B SLM. Keeps PII inside the bank's VPC and cuts per-document cost from dollars to fractions of a cent.
A DSLM fine-tuned on de-identified discharge summaries produces structured handoffs under HIPAA with no PHI egress. Sub-200ms responses enable real-time assistance during clinician workflows.
Run a fine-tuned DSLM over MSAs, NDAs, and SaaS agreements to flag non-standard indemnity, liability, and IP clauses. Attorney-client privilege preserved — nothing crosses the firm's network boundary.
A 3B classifier routes tickets to the right queue, detects sentiment, and extracts product context in under 50ms. Frontier models are invoked only for the top 5% of cases that need them, as sketched below.
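A minimal sketch of that last pattern, assuming a local Ollama endpoint on its default port. The model tag, queue labels, and prompt are illustrative; production routing would add retries and escalation to a larger model for low-confidence cases.

```python
# Minimal sketch of the ticket-routing pattern against a local Ollama
# endpoint. Model tag, labels, and prompt are illustrative assumptions.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port
QUEUES = ["billing", "outage", "account", "feature-request"]

def route_ticket(ticket_text: str) -> str:
    prompt = (
        f"Classify this support ticket into exactly one of: {', '.join(QUEUES)}. "
        f"Reply with the label only.\n\nTicket: {ticket_text}"
    )
    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen2.5:3b",          # illustrative 3B classifier
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},  # deterministic labels
    }, timeout=30)
    label = resp.json()["response"].strip().lower()
    # Fall back to a default queue if the model answers off-label.
    return label if label in QUEUES else "account"

print(route_ticket("I was charged twice for my subscription this month."))
```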
Platform Integration
BYOK clusters and open models plug into the Vharta control plane as first-class providers. Every capability you rely on for SaaS LLMs applies equally to models running on your own hardware.
Self-hosted models are first-class providers in the Vharta AI Gateway. Route by cost, latency, or capability with the same policies, rate limits, and failover rules you apply to SaaS providers.
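Vharta's gateway configuration isn't reproduced here, but the routing logic reduces to something like the sketch below. Every provider name and number is hypothetical; the point is that self-hosted and SaaS providers sit in one candidate pool ranked by the same cost and latency policies.

```python
# Hypothetical routing sketch: pick the cheapest provider that meets a
# latency budget, falling back to the next candidate on failure.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    cost_per_m: float   # $ per 1M tokens
    p50_ms: int
    self_hosted: bool

PROVIDERS = [
    Provider("ollama/qwen2.5:7b", 0.30, 120, True),
    Provider("ollama/llama3.1:70b", 2.50, 600, True),
    Provider("frontier-api", 15.00, 1200, False),
]

def pick(latency_budget_ms: int, pii: bool) -> list[Provider]:
    """Candidates in failover order: cheapest first, within the latency
    budget. PII-bearing requests never leave self-hosted providers."""
    ok = [p for p in PROVIDERS
          if p.p50_ms <= latency_budget_ms and (p.self_hosted or not pii)]
    return sorted(ok, key=lambda p: p.cost_per_m)

for p in pick(latency_budget_ms=800, pii=True):
    print(p.name)  # ollama/qwen2.5:7b, then ollama/llama3.1:70b
```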
OPA policies evaluate every model invocation. Restrict which tenants can use which models, enforce PII redaction, and require approvals for production model changes.
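OPA's standard REST decision API (`POST /v1/data/<policy-path>`) makes the enforcement point easy to picture. In this sketch the policy path and input fields are hypothetical; the API shape itself is stock OPA.

```python
# Query an OPA sidecar before dispatching an inference call. The policy
# path ("vharta/model/allow") and input shape are hypothetical; the
# /v1/data REST decision API is standard OPA.
import requests

OPA_URL = "http://localhost:8181/v1/data/vharta/model/allow"

def is_allowed(tenant: str, model: str, environment: str) -> bool:
    decision = requests.post(OPA_URL, json={
        "input": {
            "tenant": tenant,
            "model": model,
            "environment": environment,
        }
    }, timeout=5).json()
    return decision.get("result", False)  # deny by default if undefined

if not is_allowed("acme-bank", "ollama/qwen2.5:7b", "production"):
    raise PermissionError("model invocation denied by policy")
```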
Every prompt, completion, tool call, and model version is logged to the same append-only audit store as your SaaS traffic. One compliance story across hosted and self-hosted inference.
GPU-seconds, memory-hours, and token throughput roll up into the same per-tenant metering you use for SaaS LLM spend. Budget alerts and chargeback reports include self-hosted usage.
Talk to our team about BYOK deployment, Ollama provisioning, and the right SLM/DSLM mix for your workloads. We'll help you map a path from frontier-only spend to sovereign, right-sized AI.