Multimodal (Vision)

AI Foundation Services provides vision models that can analyze images alongside text. Use the same Chat Completions API with image content.

What you’ll learn:

  • How to analyze images from URLs
  • How to send local images via base64 encoding
  • Which models support vision capabilities

Analyze an image from a URL:
curl -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"}}
        ]
      }
    ],
    "max_tokens": 1024
  }'
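
The same request works through the OpenAI Python SDK used below; here is a minimal sketch (the client reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment):

from openai import OpenAI

# Picks up OPENAI_API_KEY and OPENAI_BASE_URL from the environment
client = OpenAI()

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"},
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)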

You can also pass a local image as a base64-encoded string:

import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path):
    # Read the file and return its contents as a base64 string
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

base64_image = encode_image("/path/to/your/image.jpg")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # Embed the image as a data URL; the MIME type
                    # must match the actual file format
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=1000,
)
print(response.choices[0].message.content)

Model                          Provider           Capabilities
Qwen3-VL-30B-A3B-Instruct-FP8  T-Cloud (Germany)  Image understanding, OCR
gemini-2.5-flash               Google Cloud       Image + video understanding
gpt-4.1                        Azure              Image understanding

Check Available Models for the latest list.
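
You can also query the endpoint at runtime; a quick sketch, assuming the service exposes the standard OpenAI-compatible models list:

from openai import OpenAI

client = OpenAI()

# Print the IDs of all models the endpoint currently serves
for model in client.models.list():
    print(model.id)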


Next steps:

  • Visual RAG — Index and retrieve from documents with text + image understanding
  • Function Calling — Connect models to external tools
  • Streaming — Stream responses for better UX