Shared vs Dedicated LLM Serving
LLM Serving is delivered in two service models. Shared LLM Serving Service is the standard product — multi-tenant, pay-as-you-go, available via the T-Cloud Marketplace and the API Key self-serve portal. Dedicated LLM Serving reserves GPU hardware exclusively for your contract, with throughput limited only by the hardware itself.
Both models expose the same OpenAI-compatible API and the same catalog of open-source models. The difference is how capacity is allocated and how you pay for it.
At a Glance
Section titled “At a Glance”| Shared LLM Serving Service | Dedicated LLM Serving | |
|---|---|---|
| Tenancy | Multi-tenant — capacity is pooled across all customers | Single-tenant — hardware reserved for your contract |
| Pricing | Pay-as-you-go, billed per token | Fixed monthly fee per GPU equivalent |
| Rate limits | Per-plan RPM and TPM ceilings — best-effort, not part of the SLA | None — limited only by the hardware you reserve |
| Models | The full catalog available on the Plans page | Any model that fits on the reserved hardware (including private fine-tuned models) |
| Performance (latency, throughput) | Best-effort, varies with platform load; not part of the SLA | Predictable on the reserved hardware; performance SLAs negotiable |
| Availability SLA | 99.9% API availability on the standard product (business hours); covers reachability only | Custom availability and performance SLAs available — negotiated per contract |
| Ordering | Self-service via T-Cloud Marketplace and the API Key self-serve portal | Contact the AIFS team for a quote |
| Best for | Steady production traffic, exploration, prototyping | Tight latency SLAs, bursty workloads, high sustained throughput, private models |
How Rate Limits Differ
Section titled “How Rate Limits Differ”On the Shared LLM Serving Service, every model is published with an RPM (requests per minute) and TPM (tokens per minute) value for each plan — see Rate Limits for the full table.
These published values are best-effort, best-case ceilings — the maximum you are allowed to consume, not a value Telekom commits to deliver. They are not part of the SLA.
- Closed-source models (GPT, Claude, Gemini) — Telekom forwards your requests to the upstream provider. The published RPM/TPM are the contractual throttling ceilings; the upstream provider enforces the actual rate limit and the achieved throughput depends on the provider’s capacity.
- Open-source models on T-Cloud (Llama, Mistral, Qwen, GPT-OSS, etc.) — these run on shared GPU infrastructure operated by T-Systems. The published RPM/TPM are still ceilings, but effective throughput can be substantially lower at peak load, when many tenants are calling the same model concurrently. End-to-end latency and time-to-first-token can also vary with platform load.
The Shared LLM Serving Service SLA commits 99.9% API availability only. It does not cover RPM, TPM, throughput, latency, or time-to-first-token.
On Dedicated LLM Serving, there are no per-minute ceilings. You consume as many tokens as your reserved hardware can physically produce. Throughput, latency, and time-to-first-token become deterministic — they depend on your model, batch size, and prompt length, not on what other customers are doing — and can be tied to contractual performance SLAs.
Choosing Between Shared and Dedicated LLM Serving
Section titled “Choosing Between Shared and Dedicated LLM Serving”Shared LLM Serving Service is the right fit when:
Section titled “Shared LLM Serving Service is the right fit when:”- Your traffic is steady and predictable, comfortably within the plan’s RPM/TPM ceilings
- You can tolerate occasional latency variance at peak hours (latency is not covered by the shared SLA)
- You don’t have a contractual time-to-first-token or end-to-end latency commitment to your end users
- You’re prototyping, exploring, or in early production and want pay-as-you-go pricing
- You want self-service ordering via the T-Cloud Marketplace or the API Key self-serve portal
The shared service is suitable for the vast majority of production workloads. Most customers never need to move off it.
Dedicated LLM Serving is the right fit when:
Section titled “Dedicated LLM Serving is the right fit when:”- Your workload is bursty — short windows of very high token demand that would exceed the shared service’s best-effort rate ceilings
- You need contractual latency or throughput SLAs — time-to-first-token or end-to-end latency that the shared service cannot promise (and structurally does not include in its SLA)
- You need sustained high throughput beyond what the Agentic plan’s best-effort ceilings offer
- You want to host a private, fine-tuned, or custom model that isn’t in the shared catalog
- Your compliance posture requires single-tenant infrastructure
- You want predictable monthly cost instead of usage-based billing
Dedicated LLM Serving trades a fixed monthly fee for a hard guarantee on hardware throughput. See Dedicated LLM Serving for the full description, GPU options, and the ordering process.
Hybrid Deployments
Section titled “Hybrid Deployments”Many enterprise customers run a hybrid setup:
- A Dedicated LLM Serving instance for the hot path — the latency-sensitive, high-throughput workload (e.g. a customer-facing chatbot)
- The Shared LLM Serving Service for everything else — internal tooling, batch jobs, evaluation runs, less-critical features
Both consume the same API and the same SDK. Routing between them is a configuration concern, not an architectural one.
For a tailored quote covering Shared, Dedicated LLM Serving, or a combination, contact the AIFS team at ai@t-systems.com.