
Overview

ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text operations. Bifrost performs the following conversions:
  • Model ID mapping - Uses provider model identifier directly
  • Voice configuration - Maps voice settings (stability, similarity, boost, speed, style)
  • Response format conversion - Speech format handling (MP3, Opus, PCM/WAV)
  • Timestamp support - Character-level timing alignment for TTS
  • Transcription with alignment - Word and character-level timing, diarization, and additional formats
  • Pronunciation dictionaries - Support for custom pronunciation rules
  • Voice quality parameters - Stability, similarity boost, and speaker boost controls

Supported Operations

| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Speech (TTS) | ✅ | ✅ | /v1/text-to-speech/{voice_id} |
| Transcriptions (STT) | ✅ | - | /v1/speech-to-text |
| List Models | ✅ | - | /v1/models |
| Chat Completions | ❌ | - | - |
| Responses API | ❌ | - | - |
| Text Completions | ❌ | - | - |
| Embeddings | ❌ | - | - |

Unsupported Operations (❌): Chat Completions, Responses API, Text Completions, and Embeddings are not supported by ElevenLabs (an audio-focused provider). These return UnsupportedOperationError.

Note: ElevenLabs also supports a "Speech with Timestamps" endpoint at /v1/text-to-speech/{voice_id}/with-timestamps (non-streaming only) for enhanced timestamp information.

1. Speech (Text-to-Speech)

Request Parameters

Core Parameters

| Parameter | Mapping | Notes |
|---|---|---|
| input.input | text | The text to convert to speech (required) |
| model | model_id | Model identifier (e.g., "eleven_multilingual_v2") |
| response_format | Query param output_format | Speech format (see Response Format) |

Voice Configuration

Voice settings are optional and controlled via params:
| Parameter | ElevenLabs Mapping | Default | Range |
|---|---|---|---|
| speed | voice_settings.speed | 1.0 | 0.5-2.0 |
| extra_params.stability | voice_settings.stability | 0.5 | 0-1.0 |
| extra_params.similarity_boost | voice_settings.similarity_boost | 0.75 | 0-1.0 |
| extra_params.use_speaker_boost | voice_settings.use_speaker_boost | true | boolean |
| extra_params.style | voice_settings.style | 0 | 0-1.0 |
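The merge of caller overrides onto these defaults can be sketched as follows. This is a minimal illustration, not Bifrost's actual implementation; the helper name `build_voice_settings` is hypothetical.

```python
# Hypothetical sketch: assemble an ElevenLabs voice_settings object from
# Bifrost-style params, falling back to the documented defaults above.

DEFAULTS = {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "use_speaker_boost": True,
    "style": 0,
    "speed": 1.0,
}

def build_voice_settings(speed=None, extra_params=None):
    """Merge caller overrides onto the documented defaults."""
    extra_params = extra_params or {}
    settings = dict(DEFAULTS)
    if speed is not None:
        settings["speed"] = speed
    for key in ("stability", "similarity_boost", "use_speaker_boost", "style"):
        if key in extra_params:
            settings[key] = extra_params[key]
    return settings
```

Any parameter the caller omits keeps its documented default, so a request can set only the fields it cares about.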

Advanced Parameters

Use extra_params for ElevenLabs-specific TTS features:
```bash
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "eleven_multilingual_v2",
    "input": {"input": "Hello, how are you?"},
    "voice": "21m00Tcm4TlvDq8ikWAM",
    "response_format": "mp3",
    "stability": 0.5,
    "similarity_boost": 0.75,
    "use_speaker_boost": true,
    "style": 0,
    "speed": 1.0,
    "language_code": "en",
    "seed": 42,
    "previous_text": "Context text",
    "next_text": "Future context",
    "apply_text_normalization": "auto"
  }'
```

Advanced TTS Parameters

| Parameter | Type | Description |
|---|---|---|
| language_code | string | Language code (e.g., "en", "es") |
| seed | integer | Reproducible output (0-4294967295) |
| previous_text | string | Previous text context for consistency |
| next_text | string | Next text context for consistency |
| previous_request_ids | string[] | Previous request IDs for continuity |
| next_request_ids | string[] | Next request IDs for continuity |
| apply_text_normalization | string | Text normalization mode: "auto", "on", "off" |
| apply_language_text_normalization | boolean | Apply language-specific text normalization |

Response Format

| Format | Output | Quality | Bitrate |
|---|---|---|---|
| mp3 | MP3 | High | 128 kbps @ 44100 Hz |
| opus | Opus | High | 128 kbps @ 48000 Hz |
| wav / pcm | PCM WAV | Lossless | 16-bit @ 44100 Hz |

Defaults to MP3 format if not specified. The format is passed via the query parameter output_format.
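The mapping from Bifrost's response_format values to output_format query strings might look like the sketch below. The exact identifier strings here are assumptions modeled on ElevenLabs' naming conventions and the bitrates in the table above, not values confirmed by this document.

```python
# Illustrative mapping from response_format to an ElevenLabs-style
# output_format query string. Identifier strings are assumptions.
OUTPUT_FORMATS = {
    "mp3": "mp3_44100_128",    # MP3, 128 kbps @ 44100 Hz
    "opus": "opus_48000_128",  # Opus, 128 kbps @ 48000 Hz
    "wav": "pcm_44100",        # 16-bit PCM @ 44100 Hz
    "pcm": "pcm_44100",
}

def output_format_for(response_format=None):
    # Falls back to MP3 when no format (or an unknown one) is given.
    return OUTPUT_FORMATS.get(response_format or "mp3", OUTPUT_FORMATS["mp3"])
```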

Timestamps Support

To get character-level timing alignment, enable with_timestamps:
```json
{
  "with_timestamps": true
}
```
When enabled, the endpoint /v1/text-to-speech/{voice_id}/with-timestamps is used and the response includes:
  • audio_base64 - Audio data as base64-encoded string
  • alignment.char_start_times_ms - Character start times in milliseconds
  • alignment.char_end_times_ms - Character end times in milliseconds
  • alignment.characters - Array of characters
  • normalized_alignment - Same as alignment but for normalized text

Response Conversion

Non-Timestamp Response

```json
{
  "audio": "<binary audio data>"
}
```

Timestamp Response

```json
{
  "audio_base64": "<base64 encoded audio>",
  "alignment": {
    "char_start_times_ms": [0, 150, 280, ...],
    "char_end_times_ms": [150, 280, 420, ...],
    "characters": ["H", "e", "l", "l", "o", ...]
  },
  "normalized_alignment": {
    "char_start_times_ms": [...],
    "char_end_times_ms": [...],
    "characters": [...]
  }
}
```
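A client consuming this response needs to base64-decode the audio and pair each character with its timing window. A minimal sketch (the helper name is hypothetical):

```python
import base64

def parse_timestamp_response(resp):
    """Decode the audio and zip characters with their (start, end) times in ms."""
    audio = base64.b64decode(resp["audio_base64"])
    align = resp["alignment"]
    chars = list(zip(
        align["characters"],
        align["char_start_times_ms"],
        align["char_end_times_ms"],
    ))
    return audio, chars
```

The three alignment arrays are parallel, so `zip` is enough to recover per-character timing tuples.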

Streaming

Streaming speech returns audio in chunks as they are generated:
```json
{
  "type": "audio.delta",
  "audio": "<binary audio chunk>"
}
```

Final chunk:

```json
{
  "type": "audio.done"
}
```
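A consumer can accumulate the delta chunks until the terminal event arrives. This is a minimal sketch against the event shapes shown above; the function name is hypothetical.

```python
# Sketch: accumulate streamed audio chunks until the audio.done event.

def collect_audio(events):
    buf = bytearray()
    for event in events:
        if event["type"] == "audio.delta":
            buf.extend(event["audio"])
        elif event["type"] == "audio.done":
            break
    return bytes(buf)
```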

2. Transcription (Speech-to-Text)

Request Parameters

Input Source

Choose one of the following (mutually exclusive):
| Parameter | Type | Description |
|---|---|---|
| input.file | bytes | Audio file content (WAV, MP3, etc.) |
| extra_params.cloud_storage_url | string | URL to a cloud-hosted audio file |

Error: Providing both or neither results in an error.
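The mutual-exclusivity check can be expressed as follows; this is an illustrative sketch, not Bifrost's actual validation code.

```python
def validate_transcription_input(file=None, cloud_storage_url=None):
    """Enforce the documented rule: exactly one input source must be set."""
    if (file is None) == (cloud_storage_url is None):
        raise ValueError(
            "Provide exactly one of input.file or extra_params.cloud_storage_url"
        )
```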

Core Parameters

| Parameter | Mapping | Description |
|---|---|---|
| model | model_id | Model identifier (required) |
| params.language | language_code | Language code (ISO 639-1, e.g., "en") |

Advanced Parameters

Use extra_params for transcription-specific features:
```bash
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=eleven_latest" \
  -F "language_code=en" \
  -F "tag_audio_events=true" \
  -F "num_speakers=2" \
  -F "timestamps_granularity=word" \
  -F "diarize=true" \
  -F "diarization_threshold=0.5" \
  -F "temperature=0.1" \
  -F "seed=42" \
  -F "use_multi_channel=true" \
  -F "webhook=true" \
  -F "webhook_id=webhook-123"
```

Transcription Options

| Parameter | Type | Description |
|---|---|---|
| tag_audio_events | boolean | Tag audio events (background noise, music, etc.) |
| num_speakers | integer | Expected number of speakers (for diarization) |
| timestamps_granularity | string | Timestamp level: "none", "word", "character" |
| diarize | boolean | Identify different speakers |
| diarization_threshold | float | Speaker diarization sensitivity (0.0-1.0) |
| file_format | string | Input format: "pcm_s16le_16", "other" |
| temperature | float | Transcription temperature (0.0-1.0) |
| seed | integer | Reproducible transcription |
| use_multi_channel | boolean | Process multi-channel audio separately |
| webhook | boolean | Enable webhook for async processing |
| webhook_id | string | Webhook endpoint ID |
| webhook_metadata | object/string | Additional webhook metadata |
| cloud_storage_url | string | URL to cloud-hosted audio (alternative to file) |

Additional Formats

Request multiple output formats simultaneously:
```json
{
  "additional_formats": [
    {
      "format": "segmented_json",
      "include_speakers": true,
      "include_timestamps": true,
      "segment_on_silence_longer_than_s": 1.0,
      "max_segment_duration_s": 30.0
    },
    {
      "format": "srt",
      "max_segment_duration_s": 30.0
    }
  ]
}
```
Supported formats: segmented_json, docx, pdf, txt, html, srt

Response Conversion

Basic Transcription

```json
{
  "transcript": {
    "language_code": "en",
    "language_probability": 0.95,
    "text": "Full transcribed text...",
    "words": [
      {
        "text": "Hello",
        "start": 0.0,
        "end": 0.5,
        "type": "word",
        "speaker_id": "speaker_1",
        "logprob": -0.05
      }
    ]
  }
}
```

With Diarization

When diarize: true, the response includes speaker identification:
```json
{
  "transcript": {
    "text": "Hello how are you?",
    "words": [
      {
        "text": "Hello",
        "speaker_id": "speaker_1"
      },
      {
        "text": "how",
        "speaker_id": "speaker_2"
      }
    ]
  }
}
```
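A common post-processing step is to fold the flat word list into per-speaker utterances. A minimal sketch against the response shape above (the helper name is hypothetical):

```python
# Sketch: merge consecutive words that share a speaker_id into utterances.

def group_by_speaker(words):
    utterances = []
    for word in words:
        speaker = word.get("speaker_id")
        if utterances and utterances[-1][0] == speaker:
            # Same speaker as the previous word: extend the utterance.
            utterances[-1] = (speaker, utterances[-1][1] + " " + word["text"])
        else:
            utterances.append((speaker, word["text"]))
    return utterances
```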

With Timestamps

Character-level timing when timestamps_granularity: "character":
```json
{
  "words": [
    {
      "text": "Hello",
      "characters": [
        {"text": "H", "start": 0.0, "end": 0.1},
        {"text": "e", "start": 0.1, "end": 0.2}
      ]
    }
  ]
}
```

With Additional Formats

```json
{
  "transcript": { ... },
  "additional_formats": [
    {
      "requested_format": "srt",
      "file_extension": "srt",
      "content_type": "text/plain",
      "is_base64_encoded": false,
      "content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..."
    }
  ]
}
```
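Clients typically persist each returned format, decoding the content only when the is_base64_encoded flag is set. A sketch under those assumptions (the helper name and file naming are illustrative):

```python
import base64
import pathlib

def save_additional_formats(response, out_dir="."):
    """Write each returned format to disk, decoding base64 content when flagged."""
    paths = []
    for fmt in response.get("additional_formats", []):
        data = fmt["content"]
        raw = base64.b64decode(data) if fmt["is_base64_encoded"] else data.encode()
        path = pathlib.Path(out_dir) / f"transcript.{fmt['file_extension']}"
        path.write_bytes(raw)
        paths.append(path)
    return paths
```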

Caveats

Severity: High Behavior: Voice ID must be provided for TTS requests Impact: Request fails without voice configuration Code: elevenlabs.go:198-208
Severity: High Behavior: Either file or cloud_storage_url must be provided (not both) Impact: Request fails with ambiguous input Code: elevenlabs.go:471-478
Severity: Low Behavior: Response formats (MP3, Opus, WAV) mapped via format string Impact: Format parameter passed as query string to endpoint Code: elevenlabs.go:712-715, utils.go:5-35
Severity: Low Behavior: Timestamp requests use /with-timestamps endpoint variant Impact: Switches endpoint based on with_timestamps flag Code: elevenlabs.go:195-205
Severity: Low Behavior: Transcription uses multipart/form-data, not JSON Impact: File and parameters sent as form fields Code: elevenlabs.go:480-690

3. List Models

Request Parameters

| Parameter | Type | Description |
|---|---|---|
| (none) | - | No parameters required |

Returns available models with their capabilities and language support.

Response Conversion

```json
{
  "models": [
    {
      "model_id": "eleven_multilingual_v2",
      "name": "Eleven Multilingual v2",
      "description": "Multilingual speech synthesis",
      "serves_pro_voices": true,
      "token_cost_factor": 1.0,
      "can_do_text_to_speech": true,
      "can_do_voice_conversion": true,
      "can_use_style": true,
      "can_use_speaker_boost": true,
      "languages": [
        {"language_id": "en", "name": "English"},
        {"language_id": "es", "name": "Spanish"}
      ],
      "requires_alpha_access": false,
      "max_characters_request_free_user": 1000,
      "max_characters_request_subscribed_user": 100000,
      "maximum_text_length_per_request": 5000,
      "model_rates": {
        "character_cost_multiplier": 1.0
      }
    }
  ]
}
```
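A caller can filter this list by capability flags, for example to find TTS-capable models. A minimal sketch against the response shape above (the function name is hypothetical):

```python
# Sketch: select TTS-capable model IDs from a list-models response.

def tts_models(models_response):
    return [
        m["model_id"]
        for m in models_response.get("models", [])
        if m.get("can_do_text_to_speech")
    ]
```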