## Vision: Analyzing Images with AI
Send images to vision-capable models for analysis, description, and understanding. This example shows how to analyze an image from a URL using GPT-4o with high-detail processing for better accuracy.
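A minimal sketch of such a request, written against the OpenAI Python SDK directly; the image URL is a placeholder, and any gateway routing (for example a custom `base_url`) is left out as an assumption about your setup.

```python
from openai import OpenAI

client = OpenAI()  # set base_url here if routing through a gateway (assumption)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {
                "type": "image_url",
                # "detail": "high" trades extra tokens and latency for accuracy
                "image_url": {"url": "https://example.com/photo.jpg", "detail": "high"},
            },
        ],
    }],
)
print(response.choices[0].message.content)
```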
## Audio Understanding: Analyzing Audio with AI

If your chat application already handles text input, you can add audio input and output as well: include audio in the modalities array and use an audio-capable model such as gpt-4o-audio-preview.

### Audio Input to Model
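A rough sketch of sending audio into the model, assuming the Chat Completions `input_audio` content type; the WAV path is a placeholder. Adding `"audio"` to `modalities` (plus an `audio` parameter with a voice) would make the model speak its reply back.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Read a local WAV file and base64-encode it (path is a placeholder)
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],  # add "audio" here to get spoken output back
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is being said in this recording?"},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```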
## Text-to-Speech: Converting Text to Audio
Convert text into natural-sounding speech using AI voice models. This example demonstrates generating an MP3 audio file from text using the “alloy” voice. The result is saved to a local file for playback.
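A minimal sketch using the OpenAI Python SDK's speech endpoint; the input text and output filename are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Stream the synthesized speech straight to an MP3 file
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech synthesis.",
) as response:
    response.stream_to_file("speech.mp3")
```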
## Speech-to-Text: Transcribing Audio Files

Convert audio files into text using AI transcription models. This example shows how to transcribe an MP3 file using OpenAI’s Whisper model, with an optional context prompt to improve accuracy.
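A sketch assuming Whisper via the transcriptions endpoint; the file path and prompt text are placeholders.

```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # Optional context prompt: spelling out names and jargon improves accuracy
        prompt="The speakers discuss Bifrost, multimodal APIs, and latency.",
    )
print(transcript.text)
```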
## Advanced Vision Examples

### Multiple Images
Send multiple images in a single request for comparison or analysis. This is useful for comparing products, analyzing changes over time, or understanding relationships between different visual elements.
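One way such a request might look, again with placeholder URLs: each image becomes its own content part alongside the text instruction.

```python
from openai import OpenAI

client = OpenAI()

urls = [
    "https://example.com/product-a.jpg",  # placeholder URLs
    "https://example.com/product-b.jpg",
]

# Build one user message containing the instruction plus every image
content = [{"type": "text", "text": "Compare these two products."}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in urls]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```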
### Base64 Images

Process local images by encoding them as base64 data URLs. This approach is ideal when you need to analyze images stored locally on your system without uploading them to external URLs first.
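A sketch of the encode-and-send flow; the file path and the `image/png` MIME type are assumptions about your local image.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local image as a data URL (path and MIME type are placeholders)
with open("diagram.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{b64}"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```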
## Audio Configuration Options

### Voice Selection for Speech Synthesis
OpenAI provides six distinct voice options, each with different characteristics. This example generates sample audio files for each voice so you can compare and choose the one that best fits your application.
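A sketch that loops over the six published voices and writes one sample per voice; the sample sentence and filenames are illustrative.

```python
from openai import OpenAI

client = OpenAI()

VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
SAMPLE = "The quick brown fox jumps over the lazy dog."

# Generate one sample file per voice for side-by-side comparison
for voice in VOICES:
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice=voice, input=SAMPLE
    ) as response:
        response.stream_to_file(f"sample_{voice}.mp3")
```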
### Audio Formats

Generate audio in different formats depending on your use case: MP3 for general use, Opus for web streaming, AAC for mobile apps, and FLAC for high-quality audio applications.
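A sketch varying `response_format` on the speech endpoint; the use-case-to-format mapping simply mirrors the list above.

```python
from openai import OpenAI

client = OpenAI()

# Map each use case to an output format supported by the speech endpoint
formats = {"general": "mp3", "web": "opus", "mobile": "aac", "archival": "flac"}

for use_case, fmt in formats.items():
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input="Testing audio output formats.",
        response_format=fmt,
    ) as response:
        response.stream_to_file(f"{use_case}.{fmt}")
```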
## Transcription Options

### Language Specification
Improve transcription accuracy by specifying the source language. This is particularly helpful for non-English audio or when the audio contains technical terms or specific domain vocabulary.
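A sketch passing an ISO-639-1 language hint; the German-language file path is a placeholder.

```python
from openai import OpenAI

client = OpenAI()

with open("interview_de.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="de",  # ISO-639-1 code for the source language
    )
print(transcript.text)
```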
### Response Formats

Choose between simple text output or detailed JSON responses with timestamps. The verbose JSON format provides word-level and segment-level timing information, useful for creating subtitles or analyzing speech patterns.
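A sketch requesting verbose JSON with word- and segment-level timestamps; the attribute access on `segments` reflects the current OpenAI Python SDK and is an assumption if you use a different client.

```python
from openai import OpenAI

client = OpenAI()

with open("lecture.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

# Segment-level timing, e.g. for generating subtitles
for seg in transcript.segments:
    print(f"[{seg.start:.2f}s - {seg.end:.2f}s] {seg.text}")
```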
## Provider Support

Different providers support different multimodal capabilities:

| Provider | Vision | Text-to-Speech | Speech-to-Text |
|---|---|---|---|
| OpenAI | ✅ GPT-4V, GPT-4o | ✅ TTS-1, TTS-1-HD | ✅ Whisper |
| Anthropic | ✅ Claude 3 Sonnet/Opus | ❌ | ❌ |
| Google Vertex | ✅ Gemini Pro Vision | ✅ | ✅ |
| Azure OpenAI | ✅ GPT-4V | ✅ | ✅ Whisper |
## Next Steps
- Streaming Responses - Real-time multimodal processing
- Tool Calling - Combine with external tools
- Provider Configuration - Multiple providers for different capabilities
- Core Features - Advanced Bifrost capabilities

