
Rate Limits

Rate limits protect the service and ensure fair usage. Limits are defined per plan tier and vary by model.


Limits by Plan Tier

Metric                      Basic          Standard
Tokens per Minute (TPM)     18,750 input   Up to 150,000 input
Requests per Minute (RPM)   20             Up to 600

Exact limits vary by model. See the Plans & Pricing page for per-model details, or download the Service Description PDF.


How Rate Limits Work

  • TPM (Tokens per Minute) — Maximum number of tokens processed per minute
  • RPM (Requests per Minute) — Maximum number of API requests per minute
  • Limits are applied per API key
  • Both input and output tokens count toward TPM limits
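
The per-minute limits above can be tracked client-side before a request is even sent. The sketch below is illustrative only: the default limit values are hypothetical and the `RateBudget` class is not part of any SDK; real limits depend on your plan tier and model.

```python
import time

class RateBudget:
    """Minimal client-side tracker for TPM/RPM-style limits (illustrative)."""

    def __init__(self, rpm=20, tpm=18_750):
        self.rpm = rpm
        self.tpm = tpm
        self.window_start = time.monotonic()
        self.requests = 0
        self.tokens = 0

    def allow(self, estimated_tokens):
        """Return True if a request of estimated_tokens fits in the current minute."""
        now = time.monotonic()
        if now - self.window_start >= 60:
            # A new one-minute window has started: reset both counters
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        if self.requests + 1 > self.rpm or self.tokens + estimated_tokens > self.tpm:
            return False
        self.requests += 1
        self.tokens += estimated_tokens
        return True
```

A caller would check `budget.allow(estimate)` before each request and wait for the next window when it returns `False`.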

Rate Limit Response Headers

Responses from Azure-hosted models (GPT, o-series) include headers that help you track usage against limits:

note

Rate limit headers are currently returned by Azure-hosted models (e.g., gpt-4.1, o3-mini). Open-source models hosted on T-Cloud (e.g., Llama-3.3-70B-Instruct) do not return these headers.

Header                            Description
x-ratelimit-limit-requests        Maximum requests allowed per minute
x-ratelimit-limit-tokens          Maximum tokens allowed per minute
x-ratelimit-remaining-requests    Requests remaining in the current window
x-ratelimit-remaining-tokens      Tokens remaining in the current window
x-ratelimit-reset-requests        Time until the request limit resets
x-ratelimit-reset-tokens          Time until the token limit resets

Reading Headers

import os
import httpx

# Use httpx directly to inspect response headers
response = httpx.post(
    "https://llm-server.llmhub.t-systems.net/v2/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json={
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 10,
    },
)

print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Resets in: {response.headers.get('x-ratelimit-reset-requests')}")
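
These headers can also drive proactive throttling: pause before the limit is hit instead of reacting to 429s. The helper below is a sketch, not part of any SDK; it assumes the remaining-requests header holds a plain integer string and treats absent or unparseable values (e.g., from open-source models that return no headers) as "don't pause".

```python
def should_pause(headers, min_remaining=5):
    """Return True when x-ratelimit-remaining-requests is below a safety margin."""
    remaining = headers.get("x-ratelimit-remaining-requests")
    if remaining is None:
        return False  # header absent (e.g., T-Cloud-hosted models): nothing to act on
    try:
        return int(remaining) < min_remaining
    except ValueError:
        return False  # unexpected format: fail open rather than block requests
```

A caller might sleep until the time given by `x-ratelimit-reset-requests` whenever this returns `True`.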

Handling Rate Limits

When you exceed a rate limit, the API returns a 429 Too Many Requests error. Best practices:

  1. Monitor response headers — Check x-ratelimit-remaining-* headers to avoid hitting limits
  2. Implement exponential backoff — Wait longer between retries
  3. Batch requests — Combine multiple small requests into fewer larger ones
  4. Cache responses — Avoid repeating identical requests
  5. Use the Queue API — For batch workloads, use asynchronous requests to spread load
A minimal retry helper with exponential backoff:

import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def safe_completion(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Llama-3.3-70B-Instruct",
                messages=messages,
            )
        except RateLimitError:
            # Back off exponentially: 1s, 2s, 4s, ...
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
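
Best practice 4 (caching) can be sketched just as simply. The `cached_completion` helper below is hypothetical, not part of the OpenAI SDK; it assumes deterministic usage (e.g., temperature=0) so that identical inputs warrant identical outputs.

```python
import hashlib
import json

_cache = {}

def cached_completion(client, model, messages):
    """Return a cached response for identical (model, messages) inputs,
    avoiding repeat requests that consume RPM/TPM budget."""
    # Canonical JSON gives a stable cache key for equal inputs
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]
```

For production use, a bounded cache with expiry (e.g., an LRU with TTL) is preferable to an unbounded dict.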

Need Higher Limits?

  • Upgrade your plan — Higher tiers have significantly higher TPM and RPM limits
  • Dedicated instances — For enterprise workloads, contact us for dedicated GPU resources with custom rate limits
  • Contact: T-Cloud Marketplace or reach out to the AIFS team
© Deutsche Telekom AG