Skip to content

Shared vs Dedicated LLM Serving

LLM Serving is delivered in two service models. Shared LLM Serving Service is the standard product — multi-tenant, pay-as-you-go, available via the T-Cloud Marketplace and the API Key self-serve portal. Dedicated LLM Serving reserves GPU hardware exclusively for your contract, with throughput limited only by the hardware itself.

Both models expose the same OpenAI-compatible API and the same catalog of open-source models. The difference is how capacity is allocated and how you pay for it.

Shared LLM Serving ServiceDedicated LLM Serving
TenancyMulti-tenant — capacity is pooled across all customersSingle-tenant — hardware reserved for your contract
PricingPay-as-you-go, billed per tokenFixed monthly fee per GPU equivalent
Rate limitsPer-plan RPM and TPM ceilings — best-effort, not part of the SLANone — limited only by the hardware you reserve
ModelsThe full catalog available on the Plans pageAny model that fits on the reserved hardware (including private fine-tuned models)
Performance (latency, throughput)Best-effort, varies with platform load; not part of the SLAPredictable on the reserved hardware; performance SLAs negotiable
Availability SLA99.9% API availability on the standard product (business hours); covers reachability onlyCustom availability and performance SLAs available — negotiated per contract
OrderingSelf-service via T-Cloud Marketplace and the API Key self-serve portalContact the AIFS team for a quote
Best forSteady production traffic, exploration, prototypingTight latency SLAs, bursty workloads, high sustained throughput, private models

On the Shared LLM Serving Service, every model is published with an RPM (requests per minute) and TPM (tokens per minute) value for each plan — see Rate Limits for the full table.

These published values are best-effort, best-case ceilings — the maximum you are allowed to consume, not a value Telekom commits to deliver. They are not part of the SLA.

  • Closed-source models (GPT, Claude, Gemini) — Telekom forwards your requests to the upstream provider. The published RPM/TPM are the contractual throttling ceilings; the upstream provider enforces the actual rate limit and the achieved throughput depends on the provider’s capacity.
  • Open-source models on T-Cloud (Llama, Mistral, Qwen, GPT-OSS, etc.) — these run on shared GPU infrastructure operated by T-Systems. The published RPM/TPM are still ceilings, but effective throughput can be substantially lower at peak load, when many tenants are calling the same model concurrently. End-to-end latency and time-to-first-token can also vary with platform load.

The Shared LLM Serving Service SLA commits 99.9% API availability only. It does not cover RPM, TPM, throughput, latency, or time-to-first-token.

On Dedicated LLM Serving, there are no per-minute ceilings. You consume as many tokens as your reserved hardware can physically produce. Throughput, latency, and time-to-first-token become deterministic — they depend on your model, batch size, and prompt length, not on what other customers are doing — and can be tied to contractual performance SLAs.

Choosing Between Shared and Dedicated LLM Serving

Section titled “Choosing Between Shared and Dedicated LLM Serving”

Shared LLM Serving Service is the right fit when:

Section titled “Shared LLM Serving Service is the right fit when:”
  • Your traffic is steady and predictable, comfortably within the plan’s RPM/TPM ceilings
  • You can tolerate occasional latency variance at peak hours (latency is not covered by the shared SLA)
  • You don’t have a contractual time-to-first-token or end-to-end latency commitment to your end users
  • You’re prototyping, exploring, or in early production and want pay-as-you-go pricing
  • You want self-service ordering via the T-Cloud Marketplace or the API Key self-serve portal

The shared service is suitable for the vast majority of production workloads. Most customers never need to move off it.

Dedicated LLM Serving is the right fit when:

Section titled “Dedicated LLM Serving is the right fit when:”
  • Your workload is bursty — short windows of very high token demand that would exceed the shared service’s best-effort rate ceilings
  • You need contractual latency or throughput SLAs — time-to-first-token or end-to-end latency that the shared service cannot promise (and structurally does not include in its SLA)
  • You need sustained high throughput beyond what the Agentic plan’s best-effort ceilings offer
  • You want to host a private, fine-tuned, or custom model that isn’t in the shared catalog
  • Your compliance posture requires single-tenant infrastructure
  • You want predictable monthly cost instead of usage-based billing

Dedicated LLM Serving trades a fixed monthly fee for a hard guarantee on hardware throughput. See Dedicated LLM Serving for the full description, GPU options, and the ordering process.

Many enterprise customers run a hybrid setup:

  • A Dedicated LLM Serving instance for the hot path — the latency-sensitive, high-throughput workload (e.g. a customer-facing chatbot)
  • The Shared LLM Serving Service for everything else — internal tooling, batch jobs, evaluation runs, less-critical features

Both consume the same API and the same SDK. Routing between them is a configuration concern, not an architectural one.

For a tailored quote covering Shared, Dedicated LLM Serving, or a combination, contact the AIFS team at ai@t-systems.com.