Audio (Speech-to-Text)

AI Foundation Services provides Whisper-based audio models for transcription and translation, compatible with the OpenAI Audio API.

Prerequisites
  • An API key
  • The OpenAI SDK or an HTTP client installed (see the Quickstart)
  • An audio file (MP3, WAV, M4A, etc.)

What you'll learn:

  • How to transcribe audio to text in the original language
  • How to translate audio from any language to English
  • Available audio models and parameters

List Audio Models

Audio models have model_type: "STT" in their metadata. Use the models endpoint and filter:

from openai import OpenAI

client = OpenAI()

models = client.models.list()
for model in models.data:
    if model.meta_data.get("model_type") == "STT":
        print(model.id)
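
The filter above assumes every model entry exposes a meta_data mapping. As a minimal sketch, assuming some entries may lack that field, the same filter can be written as a small helper. The name stt_model_ids, the "LLM" model type, and the sample IDs other than whisper-large-v3 are illustrative placeholders, not values confirmed by the service.

```python
from types import SimpleNamespace


def stt_model_ids(models):
    """Return the IDs of models whose metadata marks them as speech-to-text."""
    ids = []
    for model in models:
        # meta_data may be missing on some entries; treat that as "not STT".
        meta = getattr(model, "meta_data", None) or {}
        if meta.get("model_type") == "STT":
            ids.append(model.id)
    return ids


# Stand-in objects shaped like the entries of client.models.list().data.
sample_models = [
    SimpleNamespace(id="whisper-large-v3", meta_data={"model_type": "STT"}),
    SimpleNamespace(id="example-chat-model", meta_data={"model_type": "LLM"}),
    SimpleNamespace(id="example-model-without-metadata"),
]

print(stt_model_ids(sample_models))
```

With a real client, the call would presumably be stt_model_ids(client.models.list().data).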

Audio Transcription

The transcription API converts audio into text in the same language as the input. If the language parameter is omitted, the language is auto-detected from the first 30 seconds of audio.

from openai import OpenAI

client = OpenAI()

with open("/path/to/audio_file.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        # language="en"  # Optional: specify language
    )

print(f"Transcription: {transcription.text}")

Example output:

Transcription: The stale smell of old beer lingers. It takes heat to bring out the odor.
A cold dip restores health and zest. A salt pickle tastes fine with ham.
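
When the spoken language is known in advance, passing it explicitly avoids relying on auto-detection. The following is a minimal sketch of that choice, not the service's prescribed usage: transcribe is a hypothetical helper, and it forwards language only when a value is given so that omitting it falls back to auto-detection.

```python
def transcribe(client, path, model="whisper-large-v3", language=None):
    """Transcribe one audio file and return the recognized text.

    `client` is expected to be an OpenAI SDK client (or an object with the
    same audio.transcriptions.create interface).
    """
    kwargs = {"model": model}
    if language is not None:
        # Only send a language hint when the caller supplied one;
        # otherwise the service auto-detects the language.
        kwargs["language"] = language

    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return result.text


# Example usage (requires a configured API key and a real audio file):
#   from openai import OpenAI
#   text = transcribe(OpenAI(), "/path/to/audio_file.mp3", language="en")
#   print(f"Transcription: {text}")
```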

Audio Translation

The translation API translates audio from any language into English.

from openai import OpenAI

client = OpenAI()

with open("/path/to/audio_file.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=audio_file,
        temperature=1.0,
    )

print(f"Translation: {translation.text}")
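
Transcription and translation differ only in the endpoint that is called, so one function can serve both. The sketch below illustrates that under stated assumptions: speech_to_text and its to_english flag are hypothetical names, and the default temperature of 0.0 is chosen here for repeatable output rather than taken from the service's defaults.

```python
def speech_to_text(client, path, to_english=False,
                   model="whisper-large-v3", temperature=0.0):
    """Return text for an audio file.

    to_english=False keeps the original language (transcription);
    to_english=True returns English text (translation).
    """
    # Both endpoints expose the same create(...) call in the OpenAI SDK.
    if to_english:
        endpoint = client.audio.translations
    else:
        endpoint = client.audio.transcriptions

    with open(path, "rb") as audio_file:
        result = endpoint.create(
            model=model,
            file=audio_file,
            temperature=temperature,
        )
    return result.text


# Example usage (requires a configured API key and a real audio file):
#   from openai import OpenAI
#   client = OpenAI()
#   print(speech_to_text(client, "/path/to/audio_file.mp3"))
#   print(speech_to_text(client, "/path/to/audio_file.mp3", to_english=True))
```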

Parameters

| Parameter | Type | Description |
|---|---|---|
| model | string | Audio model ID (e.g., whisper-large-v3) |
| file | file | The audio file to process |
| language | string | Optional. ISO language code. Auto-detected if omitted. |
| temperature | float | 0.0 for deterministic output, higher values for more varied output |

Key Features

  1. Auto Language Detection — Identifies input language from the first 30 seconds
  2. Customizable Output — Adjust behavior with language and temperature parameters
  3. Efficient Processing — Low latency for both transcription and translation

© Deutsche Telekom AG