Multimodal (Vision)
AI Foundation Services provides vision models that can analyze images alongside text. Use the same Chat Completions API with image content.
What you’ll learn:
- How to analyze images from URLs
- How to send local images via base64 encoding
- Which models support vision capabilities
Analyze an Image from URL
```bash
curl -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"}}
        ]
      }
    ],
    "max_tokens": 1024
  }'
```

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"
                    },
                },
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.chat.completions.create({
  model: "gemini-2.5-flash",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        {
          type: "image_url",
          image_url: {
            url: "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400",
          },
        },
      ],
    },
  ],
  max_tokens: 1024,
});

console.log(response.choices[0].message.content);
```
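You can also send several images in one request, for example to compare two photos. A minimal Python sketch, assuming the model accepts multiple image parts per message (the example.com URLs are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder URLs for illustration only.
urls = [
    "https://example.com/photo-1.jpg",
    "https://example.com/photo-2.jpg",
]

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            # A single content list can mix one text part with several image parts.
            "content": [
                {"type": "text", "text": "What differs between these two images?"},
                *[{"type": "image_url", "image_url": {"url": u}} for u in urls],
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```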
Analyze a Local Image (Base64)
You can also pass a local image as a base64-encoded string:
```python
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


base64_image = encode_image("/path/to/your/image.jpg")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=1000,
)

print(response.choices[0].message.content)
```
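The data URL above hard-codes `image/jpeg`. If your local files may also be PNG or WebP, you can derive the MIME type from the file name instead. A small sketch using Python's standard-library `mimetypes` module (the helper name `build_data_url` is ours, not part of the API):

```python
import base64
import mimetypes


def build_data_url(image_path):
    # Guess the MIME type from the file extension (e.g. image/png, image/webp).
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Usage: pass the result as the image URL in the request above, e.g.
# {"type": "image_url", "image_url": {"url": build_data_url("/path/to/your/image.png")}}
```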
Available Vision Models

| Model | Provider | Capabilities |
|---|---|---|
| Qwen3-VL-30B-A3B-Instruct-FP8 | T-Cloud (Germany) | Image understanding, OCR |
| gemini-2.5-flash | Google Cloud | Image + video understanding |
| gpt-4.1 | Azure | Image understanding |
Check Available Models for the latest list.
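Since the service exposes an OpenAI-compatible API, you can also query the models endpoint from the SDK. A minimal sketch (whether vision capability is flagged in the listing depends on the deployment):

```python
from openai import OpenAI

client = OpenAI()

# Print the ID of every model the service currently exposes.
for model in client.models.list():
    print(model.id)
```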
Next Steps
- Visual RAG — Index and retrieve from documents with text + image understanding
- Function Calling — Connect models to external tools
- Streaming — Stream responses for better UX