Shared vs Dedicated LLM Serving

LLM Serving is delivered in two service models. Shared LLM Serving Service is the standard product — multi-tenant, pay-as-you-go, available via the T-Cloud Marketplace and the API Key self-serve portal. Dedicated LLM Serving reserves GPU hardware exclusively for your contract, with throughput limited only by the hardware itself.

Both models expose the same OpenAI-compatible API and the same catalog of open-source models. The difference is how capacity is allocated and how you pay for it.

At a Glance

	Shared LLM Serving Service	Dedicated LLM Serving
Tenancy	Multi-tenant — capacity is pooled across all customers	Single-tenant — hardware reserved for your contract
Pricing	Pay-as-you-go, billed per token	Fixed monthly fee per GPU equivalent
Rate limits	Per-plan RPM and TPM ceilings — best-effort, not part of the SLA	None — limited only by the hardware you reserve
Models	The full catalog available on the Plans page	Any model that fits on the reserved hardware (including private fine-tuned models)
Performance (latency, throughput)	Best-effort, varies with platform load; not part of the SLA	Predictable on the reserved hardware; performance SLAs negotiable
Availability SLA	99.9% API availability on the standard product (business hours); covers reachability only	Custom availability and performance SLAs available — negotiated per contract
Ordering	Self-service via T-Cloud Marketplace and the API Key self-serve portal	Contact the AIFS team for a quote
Best for	Steady production traffic, exploration, prototyping	Tight latency SLAs, bursty workloads, high sustained throughput, private models

Pick a Shared Plan: Compare Essential, Professional, and Agentic, with pricing and rate limits per model.

Dedicated LLM Serving: Full hardware throughput, no usage limits, host any compatible model.

How Rate Limits Differ

On the Shared LLM Serving Service, every model is published with an RPM (requests per minute) and TPM (tokens per minute) value for each plan — see Rate Limits for the full table.

These published values are best-case ceilings: the maximum you are allowed to consume, not a throughput Telekom commits to deliver. They are not part of the SLA.

Closed-source models (GPT, Claude, Gemini) — Telekom forwards your requests to the upstream provider. The published RPM/TPM are the contractual throttling ceilings; the upstream provider enforces the actual rate limit and the achieved throughput depends on the provider’s capacity.
Open-source models on T-Cloud (Llama, Mistral, Qwen, GPT-OSS, etc.) — these run on shared GPU infrastructure operated by T-Systems. The published RPM/TPM are still ceilings, but effective throughput can be substantially lower at peak load, when many tenants are calling the same model concurrently. End-to-end latency and time-to-first-token can also vary with platform load.

The Shared LLM Serving Service SLA commits to 99.9% API availability only. It does not cover RPM, TPM, throughput, latency, or time-to-first-token.
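Because the shared ceilings are best-effort, a client should expect to be throttled now and then. The following is a minimal sketch of a retry loop with exponential backoff and jitter. It assumes the API signals throttling with HTTP 429, as is conventional for OpenAI-compatible endpoints; this page does not confirm the status code. `send` is a hypothetical callable standing in for the real HTTP request.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry `send` while it reports throttling.

    `send` is a hypothetical zero-argument callable that performs one
    request and returns (status_code, body). Treating 429 as the
    throttling signal is an assumption, not documented behaviour.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff, capped, with jitter so that many
            # clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))

    # Every attempt was throttled; hand the last response to the caller.
    return status, body
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting. If a workload hits this path routinely at peak, that is a sign it may have outgrown the plan's ceilings.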

On Dedicated LLM Serving, there are no per-minute ceilings. You consume as many tokens as your reserved hardware can physically produce. Throughput, latency, and time-to-first-token become predictable: they depend on your model, batch size, and prompt length, not on what other customers are doing, and can be tied to contractual performance SLAs.

Choosing Between Shared and Dedicated LLM Serving

Shared LLM Serving Service is the right fit when:

Your traffic is steady and predictable, comfortably within the plan’s RPM/TPM ceilings
You can tolerate occasional latency variance at peak hours (latency is not covered by the shared SLA)
You don’t have a contractual time-to-first-token or end-to-end latency commitment to your end users
You’re prototyping, exploring, or in early production and want pay-as-you-go pricing
You want self-service ordering via the T-Cloud Marketplace or the API Key self-serve portal

The shared service is suitable for the vast majority of production workloads. Most customers never need to move off it.
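One way to judge whether traffic sits "comfortably within" a plan's ceilings is to compare peak demand against the published RPM and TPM values. The sketch below is illustrative only: the ceiling numbers are made up, the 50% headroom threshold is an arbitrary assumption, and it assumes TPM counts prompt plus completion tokens (check the Rate Limits page for the actual definition).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PlanCeilings:
    """Published per-plan ceilings for one model (values are hypothetical)."""
    rpm: int  # requests per minute
    tpm: int  # tokens per minute


def peak_utilization(
    ceilings: PlanCeilings,
    peak_requests_per_minute: float,
    avg_tokens_per_request: float,
) -> Tuple[float, float]:
    """Return the fraction of the RPM and TPM ceilings used at peak."""
    rpm_util = peak_requests_per_minute / ceilings.rpm
    tpm_util = peak_requests_per_minute * avg_tokens_per_request / ceilings.tpm
    return rpm_util, tpm_util


def fits_comfortably(
    ceilings: PlanCeilings,
    peak_requests_per_minute: float,
    avg_tokens_per_request: float,
    headroom: float = 0.5,
) -> bool:
    """True if peak demand stays at or below `headroom` of both ceilings."""
    rpm_util, tpm_util = peak_utilization(
        ceilings, peak_requests_per_minute, avg_tokens_per_request
    )
    return max(rpm_util, tpm_util) <= headroom
```

Because the ceilings are maxima rather than commitments, leaving generous headroom matters more here than it would against a guaranteed quota.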

Dedicated LLM Serving is the right fit when:

Your workload is bursty — short windows of very high token demand that would exceed the shared service’s best-effort rate ceilings
You need contractual latency or throughput SLAs — time-to-first-token or end-to-end latency that the shared service cannot promise (and structurally does not include in its SLA)
You need sustained high throughput beyond what the Agentic plan’s best-effort ceilings offer
You want to host a private, fine-tuned, or custom model that isn’t in the shared catalog
Your compliance posture requires single-tenant infrastructure
You want predictable monthly cost instead of usage-based billing

In exchange for a fixed monthly fee, Dedicated LLM Serving gives you exclusive use of the reserved hardware and its full throughput. See Dedicated LLM Serving for the full description, GPU options, and the ordering process.
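For the cost side of the decision, a rough break-even estimate compares per-token billing against the fixed fee. This is a sketch under simplifying assumptions: a single blended price per million tokens (real plans may price input and output tokens differently) and made-up figures. Actual prices come from the Plans page and from the AIFS quote.

```python
def shared_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Usage-based cost for a month at a single blended per-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens_per_month(
    fixed_monthly_fee: float, price_per_million_tokens: float
) -> float:
    """Monthly token volume at which usage-based cost equals the fixed fee."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return fixed_monthly_fee / price_per_million_tokens * 1_000_000


# Hypothetical figures, for illustration only.
fee = 5000.0    # fixed monthly fee for the reserved hardware
price = 2.0     # blended price per million tokens on a shared plan
threshold = break_even_tokens_per_month(fee, price)
print(f"Break-even at {threshold:,.0f} tokens per month")
```

Cost is only one input. The latency, burst, compliance, and private-model criteria above can justify Dedicated LLM Serving well below the break-even volume.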

Hybrid Deployments

Many enterprise customers run a hybrid setup:

A Dedicated LLM Serving instance for the hot path — the latency-sensitive, high-throughput workload (e.g. a customer-facing chatbot)
The Shared LLM Serving Service for everything else — internal tooling, batch jobs, evaluation runs, less-critical features

Both consume the same API and the same SDK. Routing between them is a configuration concern, not an architectural one.
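Since both services speak the same API, routing can be a lookup table rather than separate code paths. A minimal sketch follows; the base URLs, environment variable names, and workload labels are all hypothetical placeholders, not real endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    base_url: str     # placeholder URL, not a real endpoint
    api_key_env: str  # name of the environment variable holding the key


# Hypothetical endpoints for the two service models.
ENDPOINTS = {
    "dedicated": Endpoint("https://dedicated.example.invalid/v1", "DEDICATED_API_KEY"),
    "shared": Endpoint("https://shared.example.invalid/v1", "SHARED_API_KEY"),
}

# Only the hot path is pinned to dedicated hardware; everything else
# falls through to the shared service.
ROUTES = {"chatbot": "dedicated"}
DEFAULT_TIER = "shared"


def endpoint_for(workload: str) -> Endpoint:
    """Pick the endpoint for a workload label from configuration."""
    return ENDPOINTS[ROUTES.get(workload, DEFAULT_TIER)]
```

The returned base URL and key can then be handed to whichever OpenAI-compatible client the application already uses, so moving a workload between services means editing `ROUTES`, not application code.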

For a tailored quote covering Shared, Dedicated LLM Serving, or a combination, contact the AIFS team at ai@t-systems.com.