# Multimodal (Vision)

AI Foundation Services provides vision models that can analyze images alongside text. Use the same Chat Completions API with image content.
## Prerequisites

- An API key (get one here)
- OpenAI SDK or HTTP client installed (Quickstart)
- A vision-capable model (see Available Models)
**What you'll learn:**

- How to analyze images from URLs
- How to send local images via base64 encoding
- Which models support vision capabilities
## Analyze an Image from URL

**curl**

```bash
curl -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"}}
        ]
      }
    ],
    "max_tokens": 1024
  }'
```
**Python**

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"
                    },
                },
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```
**Node.js**

```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.chat.completions.create({
  model: "gemini-2.5-flash",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        {
          type: "image_url",
          image_url: {
            url: "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400",
          },
        },
      ],
    },
  ],
  max_tokens: 1024,
});

console.log(response.choices[0].message.content);
```
## Analyze a Local Image (Base64)
You can also pass a local image as a base64-encoded string:
```python
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(image_path):
    """Read a local image file and return its base64-encoded contents."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


base64_image = encode_image("/path/to/your/image.jpg")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=1000,
)

print(response.choices[0].message.content)
```
## Available Vision Models

| Model | Provider | Capabilities |
|---|---|---|
| Qwen3-VL-30B-A3B-Instruct-FP8 | T-Cloud (Germany) | Image understanding, OCR |
| gemini-2.5-flash | Google Cloud | Image + video understanding |
| gpt-4.1 | Azure | Image understanding |
Check Available Models for the latest list.
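Because the service speaks the OpenAI API, you can also check at runtime which of the documented vision models the endpoint currently serves. A minimal sketch; the `VISION_MODELS` set is a snapshot of the table above and may lag behind the live service:

```python
# Vision-capable model IDs as documented on this page; treat this set
# as a snapshot, not a source of truth.
VISION_MODELS = {
    "Qwen3-VL-30B-A3B-Instruct-FP8",
    "gemini-2.5-flash",
    "gpt-4.1",
}


def vision_model_ids(model_ids):
    """Filter a list of model IDs down to those known to accept images."""
    return [m for m in model_ids if m in VISION_MODELS]
```

With the OpenAI SDK you could feed this `[m.id for m in client.models.list()]` to see which documented vision models your endpoint exposes.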
## Next Steps
- Visual RAG — Index and retrieve from documents with text + image understanding
- Function Calling — Connect models to external tools
- Streaming — Stream responses for better UX