Rate Limits

Rate limits protect the service and ensure fair usage. Limits are defined per plan and vary by model. Every RPM and TPM value below is marked with an asterisk (*); see the service note at the end of this page for the full conditions under which these values are provided.

The table below shows representative figures across the three plans:

| Metric | Essential | Professional | Agentic |
| --- | --- | --- | --- |
| Models available | 42 | 44 | 44 |
| Representative RPM* | 300* | 600* | 1,000* |

To illustrate how limits scale, here are the limits for two representative models:

GPT-OSS 120B (T-Cloud, Germany):

| Plan | RPM* | Input TPM* | Output TPM* |
| --- | --- | --- | --- |
| Essential | 300* | 300,000* | 150,000* |
| Professional | 600* | 600,000* | 300,000* |
| Agentic | 1,000* | 2,000,000* | 1,000,000* |

GPT-5.2 (Azure, EU):

| Plan | RPM* | Input TPM* | Output TPM* |
| --- | --- | --- | --- |
| Essential | 30,000* | 3,000,000* | 1,500,000* |
| Professional | 60,000* | 6,000,000* | 3,000,000* |
| Agentic | 100,000* | 10,000,000* | 5,000,000* |

For the full per-model breakdown for your plan, see the Plans & Pricing page or download the Service Description PDF.
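Because RPM, input TPM, and output TPM are separate ceilings, the one that binds first depends on how large your requests are. The following is a minimal sketch of that arithmetic, using the Essential figures for GPT-OSS 120B from the table above. `PlanLimits` and `max_requests_per_minute` are hypothetical helper names introduced for illustration, and the per-request token counts are assumed example values, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanLimits:
    rpm: int         # requests per minute
    input_tpm: int   # input tokens per minute
    output_tpm: int  # output tokens per minute


def max_requests_per_minute(limits: PlanLimits, input_tokens: int, output_tokens: int):
    """Return (sustainable requests per minute, name of the ceiling that binds first)."""
    if input_tokens <= 0 or output_tokens <= 0:
        raise ValueError("token counts per request must be positive")
    candidates = {
        "rpm": limits.rpm,
        "input_tpm": limits.input_tpm // input_tokens,
        "output_tpm": limits.output_tpm // output_tokens,
    }
    binding = min(candidates, key=candidates.get)
    return candidates[binding], binding


# Essential plan, GPT-OSS 120B (values copied from the table above)
essential = PlanLimits(rpm=300, input_tpm=300_000, output_tpm=150_000)

# Assumed workload: 2,000 input tokens and 500 output tokens per request
print(max_requests_per_minute(essential, input_tokens=2_000, output_tokens=500))
# (150, 'input_tpm')
```

With this assumed workload the input TPM ceiling is reached at 150 requests per minute, well before the 300 RPM ceiling, so the RPM figure alone would overstate the usable request rate.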

  • TPM (Tokens per Minute): best-effort ceiling on tokens processed per minute, published separately for input and output tokens; not an SLA-backed throughput target
  • RPM (Requests per Minute): best-effort ceiling on API requests per minute; not an SLA-backed request-rate target
  • Limits apply at the contract level: all API keys you generate share the same quota
  • Both input and output tokens count toward TPM limits

In other words, RPM/TPM tell you when the platform will start throttling your traffic. They do not tell you the rate at which the platform will serve your traffic. On the shared service, the realized rate depends on platform load and other tenants’ usage of the same model.

Responses from Azure-hosted models (GPT, o-series) include headers that help you track usage against limits:

| Header | Description |
| --- | --- |
| x-ratelimit-limit-requests | Maximum requests allowed per minute |
| x-ratelimit-limit-tokens | Maximum tokens allowed per minute |
| x-ratelimit-remaining-requests | Requests remaining in the current window |
| x-ratelimit-remaining-tokens | Tokens remaining in the current window |
| x-ratelimit-reset-requests | Time until the request limit resets |
| x-ratelimit-reset-tokens | Time until the token limit resets |

    import os

    import httpx

    # Use httpx directly to inspect the rate-limit headers on the response
    response = httpx.post(
        "https://llm-server.llmhub.t-systems.net/v2/chat/completions",
        headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
        json={
            "model": "gpt-4.1",
            "messages": [{"role": "user", "content": "Hello"}],
            "max_tokens": 10,
        },
    )

    # Headers are only present on responses from Azure-hosted models
    print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
    print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
    print(f"Resets in: {response.headers.get('x-ratelimit-reset-requests')}")
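To act on these headers rather than just print them, it can help to collect them into one object first. The following is a minimal sketch: `RateLimitSnapshot` and `snapshot_from_headers` are hypothetical names introduced here, and the code assumes the limit and remaining values are plain integers. The reset headers are left out because this page does not specify their format. Missing headers, as on responses from models that are not Azure-hosted, become `None`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RateLimitSnapshot:
    limit_requests: Optional[int]
    remaining_requests: Optional[int]
    limit_tokens: Optional[int]
    remaining_tokens: Optional[int]

    def request_headroom(self) -> Optional[float]:
        """Fraction of the request quota still available, or None if unknown."""
        if not self.limit_requests or self.remaining_requests is None:
            return None
        return self.remaining_requests / self.limit_requests


def _to_int(value: Optional[str]) -> Optional[int]:
    # Absent or non-numeric header values are treated as unknown
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def snapshot_from_headers(headers: Mapping[str, str]) -> RateLimitSnapshot:
    """Build a snapshot from any header mapping, e.g. `response.headers` from httpx."""
    return RateLimitSnapshot(
        limit_requests=_to_int(headers.get("x-ratelimit-limit-requests")),
        remaining_requests=_to_int(headers.get("x-ratelimit-remaining-requests")),
        limit_tokens=_to_int(headers.get("x-ratelimit-limit-tokens")),
        remaining_tokens=_to_int(headers.get("x-ratelimit-remaining-tokens")),
    )
```

A caller could, for example, slow down once `request_headroom()` drops below a threshold of its choosing. Note that a plain `dict` is case-sensitive, whereas `httpx` header objects match names case-insensitively.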

When you exceed a rate limit, the API returns a 429 Too Many Requests error. Best practices:

  1. Monitor response headers: check the x-ratelimit-remaining-* headers to avoid hitting limits
  2. Implement exponential backoff: wait longer between successive retries
  3. Batch requests: combine multiple small requests into fewer larger ones
  4. Cache responses: avoid repeating identical requests
  5. Use the Queue API: for batch workloads, use asynchronous requests to spread load

    import time

    from openai import OpenAI, RateLimitError

    # Reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment
    client = OpenAI()


    def safe_completion(messages, max_retries=3):
        """Call the chat completions endpoint, backing off exponentially on 429s."""
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="Llama-3.3-70B-Instruct",
                    messages=messages,
                )
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # out of retries: surface the 429 to the caller
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
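Because the quota is shared by every API key on the contract, several workers that hit a 429 at the same moment and wait the same fixed interval will also retry at the same moment. Adding random jitter spreads those retries out. The sketch below shows the "full jitter" variant of exponential backoff; `backoff_delay` and `retry_with_backoff` are hypothetical helpers, and the base and cap values are assumed defaults to tune for your workload.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random.random) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def retry_with_backoff(call, is_rate_limit, max_retries: int = 5, sleep=time.sleep):
    """Run `call()`; on a rate-limit error, sleep with jitter and try again.

    `is_rate_limit(exc)` decides whether an exception is retryable. Other errors,
    and the final failed attempt, are re-raised unchanged.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if not is_rate_limit(exc) or attempt == max_retries - 1:
                raise
            sleep(backoff_delay(attempt))
```

With the OpenAI client, this could be called as `retry_with_backoff(lambda: client.chat.completions.create(...), lambda e: isinstance(e, RateLimitError))`.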

Service note on the published RPM and TPM values

* The Requests-per-Minute (RPM) and Tokens-per-Minute (TPM) figures throughout this page are best-effort, best-case ceilings for the Shared LLM Serving Service. They denote the maximum traffic permitted under the contract — not a guaranteed level of throughput. The following conditions apply:

  • Not part of the Service Level Agreement (SLA). The 99.9% availability SLA covers API reachability only; it does not extend to RPM, TPM, throughput, end-to-end latency, or time-to-first-token.
  • For open-source models hosted on T-Cloud, realised throughput may fall substantially below the published ceilings during periods of peak concurrent demand. End-to-end latency and time-to-first-token also vary with platform load.
  • The “Representative RPM” figure in the Limits-by-Plan table is an illustrative value for a representative T-Cloud model; precise limits vary by model. Professional and Agentic include Premium-tier models (e.g. Claude Opus) in addition to the Standard catalogue available on Essential.
  • For contractual guarantees on throughput, end-to-end latency, or time-to-first-token, see Dedicated LLM Serving, where such commitments can be negotiated as performance SLAs on reserved hardware.

If you need higher limits, you have the following options:

  • Upgrade your plan: higher tiers have significantly higher TPM and RPM limits
  • Move to Dedicated LLM Serving: there are no RPM/TPM ceilings on Dedicated LLM Serving. You reserve GPU hardware and consume as many tokens as it can produce, which is useful for bursty traffic or for tight latency SLAs that the shared service cannot guarantee
  • Contact us: via the T-Cloud Marketplace, or reach out to the AIFS team at ai@t-systems.com