# Rate Limits
Rate limits protect the service and ensure fair usage. Limits are defined per plan tier and vary by model.
## Limits by Plan Tier
| Metric | Basic | Standard 4000 |
|---|---|---|
| Tokens per Minute (TPM) | 18,750 input | Up to 150,000 input |
| Requests per Minute (RPM) | 20 | Up to 600 |
Exact limits vary by model. See the Plans & Pricing page for per-model details, or download the Service Description PDF.
## How Rate Limits Work
- TPM (Tokens per Minute) — Maximum number of tokens processed per minute
- RPM (Requests per Minute) — Maximum number of API requests per minute
- Limits are applied per API key
- Both input and output tokens count toward TPM limits
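As an illustration, the effective request budget is set by whichever limit binds first. The arithmetic below uses the Basic-tier numbers from the table above; the average tokens per request is an assumed workload figure, not a service value:

```python
# Estimate the effective request budget from a TPM limit.
tpm_limit = 18_750              # tokens per minute (Basic tier, from the table above)
rpm_limit = 20                  # requests per minute (Basic tier)
avg_tokens_per_request = 1_200  # assumed input + output tokens per call

# The binding constraint is whichever limit you exhaust first.
tpm_bound = tpm_limit // avg_tokens_per_request  # requests/min allowed by tokens
effective_rpm = min(rpm_limit, tpm_bound)

print(f"Token-bound budget: {tpm_bound} requests/min")
print(f"Effective budget:   {effective_rpm} requests/min")
```

With these numbers the token budget (15 requests/min) binds before the request budget (20 requests/min), so reducing tokens per request raises throughput more than anything else.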
## Rate Limit Response Headers
Responses from Azure-hosted models (GPT, o-series) include headers that help you track usage against limits:
> **Note:** Rate limit headers are currently returned by Azure-hosted models (e.g., gpt-4.1, o3-mini). Open-source models hosted on T-Cloud (e.g., Llama-3.3-70B-Instruct) do not return these headers.
| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | Maximum requests allowed per minute |
| `x-ratelimit-limit-tokens` | Maximum tokens allowed per minute |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
| `x-ratelimit-reset-tokens` | Time until the token limit resets |
## Reading Headers

**Python**

```python
import os

import httpx

# Use httpx directly to inspect response headers
response = httpx.post(
    "https://llm-server.llmhub.t-systems.net/v2/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json={
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 10,
    },
)

print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Resets in: {response.headers.get('x-ratelimit-reset-requests')}")
```

**curl**

```shell
curl -i -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }'

# The -i flag shows response headers including:
# x-ratelimit-remaining-requests: 19
# x-ratelimit-remaining-tokens: 18740
# x-ratelimit-reset-requests: 3s
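The reset headers can be used to throttle proactively instead of waiting for a 429. The sketch below is hypothetical helper code, not part of any SDK, and the `3s`/`250ms` value formats it parses are an assumption based on the example output above:

```python
import re
import time

def parse_reset_seconds(value):
    """Parse a reset header such as '3s' or '250ms' into seconds.

    Assumes the '<number>s' / '<number>ms' formats seen in the example
    output above; anything else falls back to a 1-second default.
    """
    if value is None:
        return 1.0
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s)", value.strip())
    if not match:
        return 1.0
    amount, unit = match.groups()
    return float(amount) / 1000 if unit == "ms" else float(amount)

def wait_if_exhausted(headers):
    """Sleep until the request window resets when no requests remain."""
    remaining = headers.get("x-ratelimit-remaining-requests")
    if remaining is not None and int(remaining) == 0:
        time.sleep(parse_reset_seconds(headers.get("x-ratelimit-reset-requests")))
```

Call `wait_if_exhausted(response.headers)` after each request to pause only when the window is actually empty.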
## Handling Rate Limits

When you exceed a rate limit, the API returns a `429 Too Many Requests` error. Best practices:

- Monitor response headers — Check the `x-ratelimit-remaining-*` headers to avoid hitting limits
- Implement exponential backoff — Wait longer between retries
- Batch requests — Combine multiple small requests into fewer larger ones
- Cache responses — Avoid repeating identical requests
- Use the Queue API — For batch workloads, use asynchronous requests to spread load
For example, a simple exponential-backoff wrapper:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def safe_completion(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Llama-3.3-70B-Instruct",
                messages=messages,
            )
        except RateLimitError:
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
```
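The caching practice above can be sketched as a small in-memory cache keyed on the request payload. This is a minimal illustration, not a library feature — `cached_completion` and the cache policy are hypothetical, and it only makes sense for deterministic settings (e.g., `temperature=0`):

```python
import json

_cache = {}

def cached_completion(client, model, messages):
    """Return a cached response for identical (model, messages) requests.

    Identical prompts hit the API only once; repeats are served from
    memory, saving both RPM and TPM budget.
    """
    key = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    if key not in _cache:
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]
```

Production systems may prefer a TTL-based or persistent cache, but the idea is the same: never spend rate-limit budget on a request you have already answered.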
## Need Higher Limits?
- Upgrade your plan — Higher tiers have significantly higher TPM and RPM limits
- Dedicated instances — For enterprise workloads, contact us for dedicated GPU resources with custom rate limits
- Contact: T-Cloud Marketplace or reach out to the AIFS team