
Asynchronous Requests (Queue API)

This content is for v1.0.0. Go to the latest version for the most up-to-date documentation.


Submit long-running API requests to a queue for asynchronous processing. The Queue API mirrors the standard endpoints but processes requests in the background, making it ideal for batch workloads and large inference jobs.

What you’ll learn:

  • How to submit requests to the Queue API
  • When to use async vs. synchronous endpoints
  • How to poll for and retrieve results

Use the /queue endpoints when:

  • Long-running inference — Large models or complex prompts that may exceed synchronous timeout limits
  • Batch processing — Submitting many requests without waiting for each to complete
  • Background jobs — Fire-and-forget workloads where you process results later
  • Rate limit management — Spreading load across time instead of bursting

For interactive or real-time use cases (chatbots, streaming), use the standard /v2 endpoints instead.


Every standard endpoint has a /queue equivalent:

| Standard Endpoint | Queue Endpoint | Description |
| --- | --- | --- |
| `POST /v2/chat/completions` | `POST /queue/chat/completions` | Chat completion |
| `POST /v2/completions` | `POST /queue/completions` | Text completion |
| `POST /v2/embeddings` | `POST /queue/embeddings` | Embeddings |
| `POST /v2/audio/transcriptions` | `POST /queue/audio/transcriptions` | Audio transcription |
| `POST /v2/audio/translations` | `POST /queue/audio/translations` | Audio translation |
| `POST /v2/images/generations` | `POST /queue/images/generations` | Image generation |
| `POST /v2/images/edits` | `POST /queue/images/edits` | Image editing |
| `GET /v2/models` | `GET /queue/models` | List models |

The request body for each queue endpoint is identical to its standard counterpart — just change the base path from /v2 to /queue.
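In practice that means URL construction is the only thing that changes. A minimal sketch (the `endpoint` helper is illustrative, not part of the API):

```python
BASE = "https://llm-server.llmhub.t-systems.net"

def endpoint(path: str, queued: bool = False) -> str:
    """Build a full URL; swap the /v2 prefix for /queue when queued=True."""
    prefix = "/queue" if queued else "/v2"
    return f"{BASE}{prefix}{path}"

endpoint("/chat/completions")               # .../v2/chat/completions
endpoint("/chat/completions", queued=True)  # .../queue/chat/completions
```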


Submit a chat completion to the queue:

```shell
curl -X POST "https://llm-server.llmhub.t-systems.net/queue/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a detailed analysis of renewable energy trends in Europe."}
    ],
    "max_tokens": 2000
  }'
```
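The response follows the familiar OpenAI chat-completion shape, so the assistant text sits under `choices[0].message.content`. A sketch of extracting it from a captured response body (the sample payload is illustrative):

```python
import json

# Illustrative response body; the real one comes from the curl call above.
raw = """
{
  "choices": [
    {"message": {"role": "assistant", "content": "Renewable energy in Europe..."}}
  ]
}
"""

data = json.loads(raw)
print(data["choices"][0]["message"]["content"])
```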

Process multiple prompts efficiently using the queue:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

prompts = [
    "Summarize the benefits of solar energy.",
    "Explain how wind turbines generate electricity.",
    "Describe the future of hydrogen fuel cells.",
]

async def process_prompt(prompt):
    response = await client.chat.completions.create(
        model="Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    return response.choices[0].message.content

async def main():
    tasks = [process_prompt(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt[:50]}...")
        print(f"Result: {result[:100]}...\n")

asyncio.run(main())
```
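When fanning out many queued requests, a single failed call makes a plain `asyncio.gather` raise and discard the rest. A sketch of collecting partial results instead with `return_exceptions=True` (the `flaky_call` coroutine is a stand-in for the client calls above):

```python
import asyncio

async def flaky_call(i: int) -> str:
    """Stand-in for a queued API call; fails for one input."""
    if i == 1:
        raise RuntimeError("simulated queue error")
    return f"result-{i}"

async def main() -> list:
    tasks = [flaky_call(i) for i in range(3)]
    # return_exceptions=True returns raised exceptions as values
    # instead of aborting the whole batch.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(main())
for r in results:
    if isinstance(r, Exception):
        print(f"failed: {r}")
    else:
        print(f"ok: {r}")
```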

Generate embeddings through the queue:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

response = client.embeddings.create(
    model="text-embedding-bge-m3",
    input="The benefits of renewable energy in Europe",
)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
```
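The returned vectors can be compared directly, for example with cosine similarity. A minimal sketch over plain Python lists (stand-ins for `response.data[i].embedding`):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical vectors -> 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors -> 0.0
```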
Transcribe audio asynchronously:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcript.text)
```
Generate an image through the queue:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

result = client.images.generate(
    model="gpt-image-1",
    prompt="A futuristic data center powered by renewable energy",
)
print(result.data[0].b64_json[:50] + "...")
```
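The `b64_json` field is a base64-encoded image. A sketch of decoding it to a file (the payload below is a stand-in for `result.data[0].b64_json`; the `save_image` helper is illustrative):

```python
import base64

def save_image(b64_payload: str, path: str) -> int:
    """Decode a base64 image payload, write it to disk, return bytes written."""
    raw = base64.b64decode(b64_payload)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Stand-in payload; in practice pass result.data[0].b64_json.
payload = base64.b64encode(b"fake-png-bytes").decode()
save_image(payload, "generated.png")
```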