Rate Limits

Rate limits protect the service and ensure fair usage. Limits are defined per plan and vary by model. Every RPM and TPM value below is marked with an asterisk (^*); the service note near the end of this page sets out the conditions under which these values are provided.

Limits by Plan

The table below shows representative figures across the three plans:

Metric	Essential	Professional	Agentic
Models available	42	44	44
Representative RPM ^*	300 ^*	600 ^*	1,000 ^*

Example: Per-model limits across plans

To illustrate how limits scale, here are the limits for two representative models:

GPT-OSS 120B (T-Cloud, Germany):

Plan	RPM ^*	Input TPM ^*	Output TPM ^*
Essential	300 ^*	300,000 ^*	150,000 ^*
Professional	600 ^*	600,000 ^*	300,000 ^*
Agentic	1,000 ^*	2,000,000 ^*	1,000,000 ^*

GPT-5.2 (Azure, EU):

Plan	RPM ^*	Input TPM ^*	Output TPM ^*
Essential	30,000 ^*	3,000,000 ^*	1,500,000 ^*
Professional	60,000 ^*	6,000,000 ^*	3,000,000 ^*
Agentic	100,000 ^*	10,000,000 ^*	5,000,000 ^*

For the full per-model breakdown for your plan, see the Plans & Pricing page or download the Service Description PDF.
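As a rough worked example of how the three ceilings interact, the sketch below estimates which limit is reached first for a given average request size. It uses the Essential figures for GPT-OSS 120B from the table above; the function name is hypothetical, and treating Input TPM and Output TPM as independent ceilings is an assumption based on how the table is laid out.

```python
# Published best-effort ceilings for GPT-OSS 120B on the Essential plan
# (taken from the table above).
RPM = 300
INPUT_TPM = 300_000
OUTPUT_TPM = 150_000


def max_requests_per_minute(avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Return the request rate at which the first ceiling is reached.

    Assumes positive token counts and that input and output TPM are
    enforced as separate ceilings (an assumption, not a documented rule).
    """
    by_input = INPUT_TPM / avg_input_tokens
    by_output = OUTPUT_TPM / avg_output_tokens
    return min(RPM, by_input, by_output)


# Short prompts: the RPM ceiling is reached first.
print(max_requests_per_minute(avg_input_tokens=500, avg_output_tokens=200))    # 300
# Long prompts: the Input TPM ceiling is reached first.
print(max_requests_per_minute(avg_input_tokens=4_000, avg_output_tokens=200))  # 75.0
```

With these figures, requests averaging more than 1,000 input tokens would be limited by Input TPM before they reach 300 RPM.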

How Rate Limits Work

TPM (Tokens per Minute): best-effort ceiling on tokens processed per minute, published separately for input and output tokens; not an SLA-backed throughput target
RPM (Requests per Minute): best-effort ceiling on API requests per minute; not an SLA-backed request-rate target
Limits apply at the contract level — all API keys you generate share the same quota
Both input and output tokens count toward TPM limits

In other words, RPM/TPM tell you when the platform will start throttling your traffic. They do not tell you the rate at which the platform will serve your traffic. On the shared service, the realized rate depends on platform load and other tenants’ usage of the same model.
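Because the quota is shared by every API key under the contract, it can help to track usage on the client side before sending a request. The following is a minimal sketch of such a tracker. The class name is hypothetical, and the rolling 60-second window is an assumption: this page does not state how the platform measures a minute.

```python
import time
from collections import deque


class MinuteBudget:
    """Track requests and tokens spent in the trailing 60 seconds."""

    def __init__(self, rpm: int, tpm: int, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock      # injectable so the class can be tested
        self.events = deque()   # one (timestamp, tokens) pair per request

    def _expire(self, now: float) -> None:
        # Drop requests that have left the 60-second window.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True if it fits under both ceilings."""
        now = self.clock()
        self._expire(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) >= self.rpm or used_tokens + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True
```

A single `MinuteBudget` instance would need to be shared by all workers and API keys that draw on the same contract. Since the published values are ceilings rather than guarantees, staying under the budget does not rule out a 429 when the platform is busy.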

Rate Limit Response Headers

Responses from Azure-hosted models (GPT, o-series) include headers that help you track usage against limits:

Header	Description
`x-ratelimit-limit-requests`	Maximum requests allowed per minute
`x-ratelimit-limit-tokens`	Maximum tokens allowed per minute
`x-ratelimit-remaining-requests`	Requests remaining in the current window
`x-ratelimit-remaining-tokens`	Tokens remaining in the current window
`x-ratelimit-reset-requests`	Time until the request limit resets
`x-ratelimit-reset-tokens`	Time until the token limit resets

import os
import httpx

# Use httpx directly to inspect response headers
response = httpx.post(
    "https://llm-server.llmhub.t-systems.net/v2/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json={
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 10,
    },
)

print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining:   {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Resets in:          {response.headers.get('x-ratelimit-reset-requests')}")

curl -i -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }'

# The -i flag shows response headers including:
# x-ratelimit-remaining-requests: 19
# x-ratelimit-remaining-tokens: 18740
# x-ratelimit-reset-requests: 3s
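The reset headers arrive as strings such as `3s`, so they need converting before they can be passed to `time.sleep`. The helper below is a sketch under the assumption that values are built from number-and-unit pairs (`ms`, `s`, `m`, `h`), for example `250ms` or `1m30s`. Only the `3s` form appears on this page; the other forms and the function name are assumptions.

```python
import re

# Seconds per unit. Only "s" is confirmed by the example output above.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def reset_to_seconds(value: str) -> float:
    """Convert a duration such as '3s' or '1m30s' to seconds."""
    parts = _PART.findall(value)
    # Reject input that is not made up entirely of number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != value:
        raise ValueError(f"unrecognised duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)
```

For example, `reset_to_seconds(response.headers["x-ratelimit-reset-requests"])` would give a wait time in seconds for the Python snippet above, provided the header is present.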

Handling Rate Limits

When you exceed a rate limit, the API returns a 429 Too Many Requests error. Best practices:

Monitor response headers — Check x-ratelimit-remaining-* headers to avoid hitting limits
Implement exponential backoff — Wait longer between retries
Batch requests — Combine multiple small requests into fewer larger ones
Cache responses — Avoid repeating identical requests
Use the Queue API — For batch workloads, use asynchronous requests to spread load

import time

from openai import OpenAI, RateLimitError

client = OpenAI()

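def safe_completion(messages, max_retries=3):
    """Call the chat completions API, retrying 429 errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Llama-3.3-70B-Instruct",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # no retries left, so skip the final wait
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise RuntimeError("Max retries exceeded")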
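The "Cache responses" practice listed above can be sketched as a small in-memory cache keyed by a hash of the request payload. The class and its `send` callable are hypothetical names for illustration. The approach only makes sense for requests where an identical answer is acceptable, such as deterministic prompts.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the request payload."""

    def __init__(self, send):
        self.send = send   # callable that performs the real API request
        self.store = {}
        self.hits = 0

    @staticmethod
    def key(model: str, messages: list) -> str:
        # sort_keys makes the key independent of dict key order.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, messages: list):
        k = self.key(model, messages)
        if k in self.store:
            self.hits += 1          # served locally, no quota consumed
            return self.store[k]
        result = self.send(model, messages)
        self.store[k] = result
        return result
```

One possible wiring is `ResponseCache(lambda model, messages: client.chat.completions.create(model=model, messages=messages))`. A production cache would also need an eviction policy and an expiry time, which this sketch leaves out.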

Service note on the published RPM and TPM values

^* The Requests-per-Minute (RPM) and Tokens-per-Minute (TPM) figures throughout this page are best-effort, best-case ceilings for the Shared LLM Serving Service. They denote the maximum traffic permitted under the contract — not a guaranteed level of throughput. The following conditions apply:

Not part of the Service Level Agreement (SLA). The 99.9% availability SLA covers API reachability only; it does not extend to RPM, TPM, throughput, end-to-end latency, or time-to-first-token.
For open-source models hosted on T-Cloud, realised throughput may fall substantially below the published ceilings during periods of peak concurrent demand. End-to-end latency and time-to-first-token also vary with platform load.
The “Representative RPM” figure in the Limits-by-Plan table is an illustrative value for a representative T-Cloud model; precise limits vary by model. Professional and Agentic include Premium-tier models (e.g. Claude Opus) in addition to the Standard catalogue available on Essential.
For contractual guarantees on throughput, end-to-end latency, or time-to-first-token, see Dedicated LLM Serving, where such commitments can be negotiated as performance SLAs on reserved hardware.

Need Higher Limits?

Upgrade your plan — Higher tiers have significantly higher TPM and RPM limits
Move to Dedicated LLM Serving — On Dedicated LLM Serving, there are no RPM/TPM ceilings. You reserve GPU hardware and consume as many tokens as it can produce — useful for bursty traffic or tight latency SLAs that the shared service cannot guarantee
Contact: T-Cloud Marketplace or reach out to the AIFS team at ai@t-systems.com