Overview
vLLM is an OpenAI-compatible provider for self-hosted inference. Bifrost delegates to the shared OpenAI provider implementation. Key characteristics:
- OpenAI compatibility - Chat, text completions, embeddings, and streaming
- Self-hosted - Typically runs at http://localhost:8000 or your own server
- Optional authentication - API key often omitted for local instances
- Responses API - Supported via chat completion fallback
Supported Operations
| Operation | Non-Streaming | Streaming | Endpoint |
|---|---|---|---|
| Chat Completions | ✅ | ✅ | /v1/chat/completions |
| Responses API | ✅ | ✅ | /v1/chat/completions |
| Text Completions | ✅ | ✅ | /v1/completions |
| Embeddings | ✅ | - | /v1/embeddings |
| List Models | ✅ | - | /v1/models |
| Image Generation | ❌ | ❌ | - |
| Speech (TTS) | ❌ | ❌ | - |
| Transcriptions (STT) | ✅ | ✅ | /v1/audio/transcriptions |
| Files | ❌ | ❌ | - |
| Batch | ❌ | ❌ | - |
Unsupported Operations (❌): Image Generation, Speech, Files, and Batch are not supported and return UnsupportedOperationError.
Authentication
- API key: Optional. For local vLLM instances, the key is often left empty.
- When set, the key is sent as Authorization: Bearer <key>.
Configuration
- Base URL: Default is http://localhost:8000. Override via the provider's network_config.base_url.
- Model names: Depend on the models loaded in your vLLM instance (e.g. meta-llama/Llama-3.2-1B-Instruct, or BAAI/bge-m3 for embeddings).
Getting started
- Run a vLLM server (via Docker or pip).
- Verify the server is reachable (see the sketch after this list).
- Use Bifrost with the model prefix vllm/<model_id> (e.g. vllm/meta-llama/Llama-3.2-1B-Instruct).
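A quick way to verify the server is a model listing, sketched here with the OpenAI Python client pointed directly at the vLLM instance. The address matches vLLM's default (http://localhost:8000) and the API key is a placeholder, since local instances often run without authentication; adjust both to your setup.

```python
from openai import OpenAI

# Talk to the vLLM server directly to confirm it is up.
# base_url matches vLLM's default; the API key is a placeholder because
# local instances often run without authentication.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

for model in vllm.models.list():
    print(model.id)  # e.g. meta-llama/Llama-3.2-1B-Instruct
```

Once the server responds, route requests through Bifrost with the vllm/ model prefix, as shown in the sections below.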
1. Chat Completions
vLLM supports standard OpenAI chat completion parameters. For the full parameter reference, see OpenAI Chat Completions. Message types, tools, and streaming follow the same behavior.
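Below is a sketch of a chat completion (plus a streamed variant) routed through Bifrost with the OpenAI Python client. The gateway address http://localhost:8080/v1 and the API key are assumptions; point base_url at your Bifrost deployment's OpenAI-compatible endpoint and use whatever key it expects.

```python
from openai import OpenAI

# Assumed Bifrost gateway address and placeholder key; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-key")

# The vllm/ prefix routes the request to the vLLM provider.
response = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(response.choices[0].message.content)

# Streaming uses the same request with stream=True.
stream = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```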
2. Responses API
Bifrost converts Responses API requests to Chat Completions and back.
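A sketch of the same flow through the Responses API surface of the OpenAI Python client; on the vLLM side this becomes a /v1/chat/completions call. The gateway address and key are the same assumptions as above.

```python
from openai import OpenAI

# Assumed gateway address and placeholder key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-key")

# Bifrost converts this into a chat completion against vLLM and maps the
# result back into the Responses API shape.
response = client.responses.create(
    model="vllm/meta-llama/Llama-3.2-1B-Instruct",
    input="Give one advantage of self-hosted inference.",
)
print(response.output_text)
```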
3. Text Completions
| Parameter | Mapping |
|---|---|
| prompt | Sent as-is |
| max_tokens | max_tokens |
| temperature | temperature |
| top_p | top_p |
| stop | stop sequences |
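A text completion sketch exercising the parameters from the table above (same assumed gateway address and key):

```python
from openai import OpenAI

# Assumed gateway address and placeholder key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-key")

# Maps onto vLLM's /v1/completions with the parameters listed above.
completion = client.completions.create(
    model="vllm/meta-llama/Llama-3.2-1B-Instruct",
    prompt="The main benefits of self-hosting models are",
    max_tokens=64,
    temperature=0.7,
    top_p=0.9,
    stop=["\n\n"],
)
print(completion.choices[0].text)
```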
4. Embeddings
vLLM supports /v1/embeddings. Use model IDs exposed by your vLLM server (e.g. BAAI/bge-m3).
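An embeddings sketch using BAAI/bge-m3, one of the model IDs mentioned above (same assumed gateway address and key; the model must be loaded on the vLLM server):

```python
from openai import OpenAI

# Assumed gateway address and placeholder key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-key")

# Maps onto vLLM's /v1/embeddings.
result = client.embeddings.create(
    model="vllm/BAAI/bge-m3",
    input=[
        "Bifrost routes requests to self-hosted vLLM.",
        "vLLM exposes an OpenAI-compatible API.",
    ],
)
print(len(result.data), "vectors of dimension", len(result.data[0].embedding))
```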
5. List Models
Lists models from your vLLM instance via /v1/models. Available models depend on what is loaded on the server.
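Listing models through the gateway, sketched with the same assumed client setup:

```python
from openai import OpenAI

# Assumed gateway address and placeholder key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="bifrost-key")

# Returns whatever models the vLLM server currently has loaded, via /v1/models.
for model in client.models.list():
    print(model.id)
```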
Caveats
Default base URL is localhost
Severity: Low
Behavior: Default base URL is http://localhost:8000.
Impact: For remote or custom ports, set network_config.base_url in the provider config.

Error responses with HTTP 200
Severity: Low
Behavior: vLLM may return HTTP 200 with an error payload (e.g. {"error": {"code": 404, "message": "..."}}) instead of 4xx/5xx.
Impact: Bifrost normalizes these into standard error responses so clients see consistent error handling.

