
Overview

vLLM is an OpenAI-compatible provider for self-hosted inference. Bifrost delegates vLLM requests to its shared OpenAI provider implementation. Key characteristics:
  • OpenAI compatibility - Chat, text completions, embeddings, and streaming
  • Self-hosted - Typically runs at http://localhost:8000 or your own server
  • Optional authentication - API key often omitted for local instances
  • Responses API - Supported via chat completion fallback

Supported Operations

Operation            | Non-Streaming | Streaming | Endpoint
---------------------|---------------|-----------|--------------------------
Chat Completions     | ✅            | ✅        | /v1/chat/completions
Responses API        | ✅            | ✅        | /v1/chat/completions
Text Completions     | ✅            | ✅        | /v1/completions
Embeddings           | ✅            | -         | /v1/embeddings
List Models          | ✅            | -         | /v1/models
Image Generation     | ❌            | ❌        | -
Speech (TTS)         | ❌            | ❌        | -
Transcriptions (STT) | ✅            | ✅        | /v1/audio/transcriptions
Files                | ❌            | ❌        | -
Batch                | ❌            | ❌        | -
Unsupported Operations (❌): Image Generation, Speech, Files, and Batch are not supported and return UnsupportedOperationError.

Authentication

  • API key: Optional. For local vLLM instances, the key is often left empty.
  • When set, the key is sent as Authorization: Bearer <key>.
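
A minimal sketch of a direct call to a vLLM server that enforces an API key (the key value is a placeholder; keyless local instances need no header at all):

# Header only needed if the vLLM server enforces an API key
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer YOUR_VLLM_API_KEY"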

Configuration

  • Base URL: Default is http://localhost:8000. Override via provider network_config.base_url.
  • Model names: Depend on the models loaded in your vLLM instance (e.g. meta-llama/Llama-3.2-1B-Instruct, BAAI/bge-m3 for embeddings).
# Send requests through the Bifrost gateway (http://localhost:8080 here);
# the gateway forwards to your local or remote vLLM instance (default: http://localhost:8000)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Gateway provider config: set base_url for remote vLLM
# "network_config": { "base_url": "http://vllm-endpoint:8000" }

Getting started

  1. Run a vLLM server (Docker or pip; a pip-based sketch follows this list). Example with Docker:
    docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct
    
  2. Verify the server:
    curl http://localhost:8000/v1/models
    
  3. Use Bifrost with model prefix vllm/<model_id> (e.g. vllm/meta-llama/Llama-3.2-1B-Instruct).
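
As a pip-based alternative to the Docker command in step 1, a sketch assuming a recent vLLM release that ships the vllm serve CLI:

# Install vLLM and serve the model on the default OpenAI-compatible port
pip install vllm
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8000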

1. Chat Completions

vLLM supports standard OpenAI chat completion parameters. For full parameter reference, see OpenAI Chat Completions. Message types, tools, and streaming follow the same behavior.
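
For example, streaming uses the standard OpenAI stream flag. The sketch below reuses the gateway port and model from the Configuration example above:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'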

2. Responses API

Bifrost converts Responses API requests to Chat Completions and back:
BifrostResponsesRequest
  → ToChatRequest()
  → ChatCompletion
  → ToBifrostResponsesResponse()
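
A minimal sketch, assuming the gateway exposes an OpenAI-style /v1/responses route on the same port as the earlier examples (the upstream call is still made to /v1/chat/completions, per the conversion flow above):

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "input": "Hello"
  }'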

3. Text Completions

Parameter   | Mapping
------------|----------------
prompt      | Sent as-is
max_tokens  | max_tokens
temperature | temperature
top_p       | top_p
stop        | stop sequences
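
A sketch of a text completion request using the parameters above (same gateway port and model assumptions as the earlier examples; the prompt and stop sequence are arbitrary):

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": ["\n\n"]
  }'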

4. Embeddings

vLLM supports /v1/embeddings. Use model IDs exposed by your vLLM server (e.g. BAAI/bge-m3).
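
A minimal sketch, assuming the gateway is reachable at the same port as above and an embedding model such as BAAI/bge-m3 is loaded on the server:

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/BAAI/bge-m3",
    "input": "Hello world"
  }'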

5. List Models

Lists models from your vLLM instance via /v1/models. Available models depend on what is loaded on the server.
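
A sketch of listing models through the gateway (same port assumption as the earlier examples); compare with the direct server check in step 2 of Getting started:

curl http://localhost:8080/v1/models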

Caveats

  • Severity: Low
    Behavior: Default base URL is http://localhost:8000.
    Impact: For remote or custom ports, set network_config.base_url in the provider config.
  • Severity: Low
    Behavior: vLLM may return HTTP 200 with an error payload (e.g. {"error": {"code": 404, "message": "..."}}) instead of 4xx/5xx.
    Impact: Bifrost normalizes these into standard error responses so clients see consistent error handling.