
Chat Completions

The Chat Completions API is the primary way to interact with LLMs on AI Foundation Services. It's fully compatible with the OpenAI Chat API.


What you'll learn:

  • How to send chat completion requests with system and user messages
  • How to use streaming for real-time responses
  • How to use the Completion and Responses APIs
  • Key parameters for controlling output

Basic Usage

```bash
curl -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is T-Cloud?"}
    ],
    "temperature": 0.1,
    "max_tokens": 256
  }'
```

Streaming

Enable streaming to receive tokens as they're generated. Set stream: true in your request:

```python
from openai import OpenAI

# Reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment
client = OpenAI()

stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem about AI."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
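If you need the complete response text in addition to the live output, a common pattern is to accumulate the deltas as they arrive. A minimal sketch; `collect_stream` is an illustrative helper, not part of the SDK:

```python
def collect_stream(stream):
    """Accumulate streamed delta content into the full response text.

    `stream` is the iterator returned by chat.completions.create(stream=True).
    Chunks may have empty choices or a None delta.content, so both are checked.
    """
    parts = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)
```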

For full streaming documentation including error handling, function calling with streams, and stream options, see the Streaming guide.


Completion API

The Completion API sends raw text directly to the LLM without the chat message format.

```python
from openai import OpenAI

client = OpenAI()

completion = client.completions.create(
    model="Llama-3.3-70B-Instruct",
    prompt="What is the Python programming language?",
    stream=False,
    temperature=0.2,
    max_tokens=128,
)

print(completion.choices[0].text)
```
Info: The Completion API is only available for open-source models. To get correct behavior, you must follow the model's specific prompt template. We recommend using the Chat API instead.
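To illustrate what following a prompt template means: Llama 3 models, for example, expect special tokens around each message in the raw prompt. A sketch of the structure the Chat API would otherwise build for you (the template below is shown for illustration; verify it against the model card before relying on it):

```python
# Illustrative: the special tokens Llama 3 expects around each message.
# With the Completion API you must construct this string yourself;
# the Chat API applies the template automatically.
system = "You are a helpful assistant."
user = "What is the Python programming language?"

prompt = (
    "<|begin_of_text|>"
    f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

print(prompt)
```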


Responses API

With an up-to-date version of the OpenAI Python package, you can also use the Responses API:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    input="Write a one-sentence bedtime story about a unicorn.",
)

print(response.output_text)
```

You can also combine the input and instructions fields:

```python
response = client.responses.create(
    model="gpt-4.1",
    instructions="You are a professional copywriter. Focus on benefits rather than features.",
    input="Create a product description for NoiseGuard Pro Headphones.",
    temperature=0.7,
    max_output_tokens=200,
)

print(response.output[0].content[0].text)
```
Info: The Responses API is currently fully supported only for OpenAI models (e.g., gpt-4.1, gpt-4o). T-Cloud-hosted open-source models have limited support.


Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `model` | string | Model ID (e.g., `Llama-3.3-70B-Instruct`) |
| `messages` | array | List of message objects with `role` and `content` |
| `temperature` | float | Sampling temperature (0–2). Lower = more deterministic |
| `max_tokens` | integer | Maximum tokens to generate |
| `stream` | boolean | Enable streaming responses |
| `top_p` | float | Nucleus sampling parameter |
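A common rule of thumb is to tune either temperature or top_p, not both at once. A request body aimed at near-deterministic output might look like this (a sketch; the prompt and values are illustrative):

```python
import json

# Illustrative request body: temperature 0.0 for reproducible answers;
# top_p is left at its default rather than tuned alongside temperature.
payload = {
    "model": "Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Name three prime numbers."}],
    "temperature": 0.0,
    "max_tokens": 64,
}

print(json.dumps(payload, indent=2))
```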

For the full API specification, see the API Reference.


Next Steps

© Deutsche Telekom AG