
Rate Limits

Rate limits protect the service and ensure fair usage.

Limits vary by both plan and model. The table below shows representative figures across the three plans:

| Metric | Essential | Professional | Agentic |
| --- | --- | --- | --- |
| Models available | 42 | 44 | 44 |
| Representative RPM † | 300 | 1,000 | 3,000 |

† Example RPM for a representative T-Cloud model — exact limits vary by model. Professional and Agentic add Premium-tier models (e.g. Claude Opus) on top of the Standard catalog available in Essential.

To illustrate how limits scale across plans, here are the figures for two representative models:

GPT-OSS 120B (T-Cloud, Germany):

| Plan | RPM | Input TPM | Output TPM |
| --- | --- | --- | --- |
| Essential | 300 | 300,000 | 150,000 |
| Professional | 600 | 600,000 | 300,000 |
| Agentic | 1,000 | 2,000,000 | 1,000,000 |

GPT-5.2 (Azure, EU):

| Plan | RPM | Input TPM | Output TPM |
| --- | --- | --- | --- |
| Essential | 30,000 | 3,000,000 | 1,500,000 |
| Professional | 60,000 | 6,000,000 | 3,000,000 |
| Agentic | 100,000 | 10,000,000 | 5,000,000 |

For the full per-model breakdown for your plan, see the Plans & Pricing page or download the Service Description PDF.

  • TPM (Tokens per Minute) — Maximum number of tokens processed per minute
  • RPM (Requests per Minute) — Maximum number of API requests per minute
  • Limits are applied per API key
  • Both input and output tokens count toward TPM limits; as the tables above show, each direction has its own budget (Input TPM and Output TPM)
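Because limits are applied per API key, every process sharing a key also shares its budget. As an illustration only (RequestThrottle is a hypothetical helper, not part of the service), here is a minimal client-side sliding-window throttle that keeps request volume under a chosen RPM:

```python
import time
from collections import deque

class RequestThrottle:
    """Sliding-window limiter: at most rpm_limit sends per rolling 60 s."""

    def __init__(self, rpm_limit: int):
        self.rpm_limit = rpm_limit
        self.sent = deque()  # monotonic timestamps of recent sends

    def wait(self):
        now = time.monotonic()
        # Evict sends older than the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) >= self.rpm_limit:
            # Sleep until the oldest send falls out of the window.
            time.sleep(60 - (now - self.sent[0]))
        self.sent.append(time.monotonic())

# Example: stay under the Essential-plan budget for GPT-OSS 120B (300 RPM).
throttle = RequestThrottle(rpm_limit=300)
# Call throttle.wait() before each API request.
```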

Responses from Azure-hosted models (GPT, o-series) include headers that help you track usage against limits:

| Header | Description |
| --- | --- |
| `x-ratelimit-limit-requests` | Maximum requests allowed per minute |
| `x-ratelimit-limit-tokens` | Maximum tokens allowed per minute |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
| `x-ratelimit-reset-tokens` | Time until the token limit resets |
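For example, you can call the endpoint with httpx and read these headers off the raw response: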
```python
import os

import httpx

# Use httpx directly to inspect response headers
response = httpx.post(
    "https://llm-server.llmhub.t-systems.net/v2/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json={
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 10,
    },
)

print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Resets in: {response.headers.get('x-ratelimit-reset-requests')}")
```

When you exceed a rate limit, the API returns a 429 Too Many Requests error. Best practices for staying under the limits and recovering gracefully:

  1. Monitor response headers — Check x-ratelimit-remaining-* headers to avoid hitting limits
  2. Implement exponential backoff — Wait progressively longer between retries (see the retry sketch after this list)
  3. Batch requests — Combine multiple small requests into fewer larger ones
  4. Cache responses — Avoid repeating identical requests
  5. Use the Queue API — For batch workloads, use asynchronous requests to spread load
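The sketch below shows the exponential backoff from item 2 using the OpenAI SDK's RateLimitError. It assumes the client is configured for the LLM Hub endpoint through environment variables; the model name is the one used in the examples above.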
```python
import time

from openai import OpenAI, RateLimitError

# Assumes OPENAI_API_KEY (and, if needed, OPENAI_BASE_URL pointing at the
# LLM Hub endpoint) are set in the environment.
client = OpenAI()

def safe_completion(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Llama-3.3-70B-Instruct",
                messages=messages,
            )
        except RateLimitError:
            wait_time = 2 ** attempt  # 1 s, 2 s, 4 s, ...
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
```
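Usage is the same as a plain completions call; for example:

```python
reply = safe_completion([{"role": "user", "content": "Hello"}])
print(reply.choices[0].message.content)
```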
  • Upgrade your plan — Higher tiers have significantly higher TPM and RPM limits
  • Dedicated instances — For enterprise workloads, contact us for dedicated GPU resources with custom rate limits
  • Contact — via the T-Cloud Marketplace, or reach out to the AIFS team at ai@t-systems.com