Overview

Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. Bifrost delegates to the OpenAI implementation while supporting Ollama’s unique configuration requirements. Key characteristics:
  • Local-first deployment - Run models locally or on private infrastructure
  • OpenAI API compatibility - Identical request/response format
  • Full feature support - Chat, text, embeddings, and streaming
  • Tool calling - Complete function definition and execution
  • Self-hosted - No external API dependency required

Supported Operations

Operation | Non-Streaming | Streaming | Endpoint
Chat Completions | ✅ | ✅ | /v1/chat/completions
Responses API | ✅ | ✅ | /v1/chat/completions
Text Completions | ✅ | ✅ | /v1/completions
Embeddings | ✅ | - | /v1/embeddings
List Models | ✅ | - | /v1/models
Image Generation | ❌ | ❌ | -
Speech (TTS) | ❌ | ❌ | -
Transcriptions (STT) | ❌ | ❌ | -
Files | ❌ | ❌ | -
Batch | ❌ | ❌ | -
Unsupported Operations (❌): Speech, Transcriptions, Files, and Batch are not supported by the upstream Ollama API. These return UnsupportedOperationError.

Ollama is self-hosted. Ensure you have an Ollama instance running and configured with the correct BaseURL (e.g., http://localhost:11434).

1. Chat Completions

Request Parameters

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.

Filtered Parameters

Removed for Ollama compatibility:
  • prompt_cache_key - Not supported
  • verbosity - Not supported
  • store - Not supported
  • service_tier - Not supported
Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to OpenAI Chat Completions.
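For illustration, a chat completion with a tool definition sent through the gateway might look like the request below. The gateway address (localhost:8080), the model name, and the get_weather function are placeholder assumptions; any tool-capable model pulled into Ollama will do.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'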

2. Responses API

Converted internally to Chat Completions:
ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
Same parameter support as Chat Completions.

3. Text Completions

Ollama supports the legacy text completion format:
Parameter | Mapping
prompt | Direct pass-through
max_tokens | max_tokens
temperature, top_p | Direct pass-through
stop | Stop sequences
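As a sketch, a legacy text completion request routed through the gateway could look like this; the gateway address and model name are assumptions carried over from the configuration example below.

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
    "stop": ["\n\n"]
  }'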

4. Embeddings

Ollama supports text embeddings:
Parameter | Notes
input | Text or array of texts
model | Embedding model name
encoding_format | "float" or "base64"
dimensions | Custom output dimensions (optional)
Response returns embedding vectors with token usage.
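For example, an embeddings request through the gateway might look like the following; the model name (ollama/nomic-embed-text) is an assumption, so substitute whichever embedding model you have pulled into Ollama.

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text",
    "input": ["first text to embed", "second text to embed"],
    "encoding_format": "float"
  }'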

5. List Models

Lists models currently loaded in Ollama with capabilities and context information.
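For example, assuming the gateway from the configuration section below is listening on localhost:8080:

curl http://localhost:8080/v1/models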

Unsupported Features

Feature | Reason
Speech/TTS | Not offered by Ollama API
Transcription/STT | Not offered by Ollama API
Batch Operations | Not offered by Ollama API
File Management | Not offered by Ollama API
Ollama follows the OpenAI API specification for request format and error handling. Authentication is optional and depends on deployment: no authentication is required for local access, and an optional Bearer token can be used for protected instances.

Critical: BaseURL must be explicitly configured to point to your Ollama instance (e.g., http://localhost:11434 for local, https://ollama.example.com for remote).

Configuration

# Point to local Ollama instance
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Gateway needs to be configured with Ollama BaseURL
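One common way to provide the BaseURL is through the gateway's configuration file. The snippet below is an illustrative sketch only: it assumes a providers map with a network_config.base_url field, which may not match your gateway version; consult the Bifrost configuration reference for the authoritative schema.

{
  "providers": {
    "ollama": {
      "network_config": {
        "base_url": "http://localhost:11434"
      }
    }
  }
}

Only the requirement that the base URL point at your Ollama instance comes from this page; the field names above are assumptions.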
Environment Setup:
  1. Install Ollama from https://ollama.ai
  2. Pull a model:
    ollama pull llama3.1
    ollama pull mistral
    ollama pull neural-chat
    
  3. Start Ollama server:
    ollama serve
    
  4. Verify it’s running:
    curl http://localhost:11434/api/tags
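Before routing traffic through the gateway, you can also confirm which models are available locally with the Ollama CLI:

ollama list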
    

Performance Considerations

Streaming for Large Models: For a better user experience with large models, enable streaming:
{
  "model": "llama3.1:latest",
  "messages": [...],
  "stream": true
}
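For example, an end-to-end streaming call through the gateway could look like this; curl's -N flag disables output buffering so chunks print as they arrive, and the gateway address and prefixed model name follow the configuration example above.

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Write a short poem about local inference"}],
    "stream": true
  }'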
Token Context: Different models have different context windows:
  • Llama 3.1 70B: 128K tokens
  • Mistral 7B: 32K tokens
  • Neural Chat 7B: 8K tokens
GPU Acceleration: Ollama automatically uses the GPU when one is available. For CPU-only deployments, make sure request timeouts are generous enough for slower generation.
Model | Size | Context | Speed
llama3.1:latest | Varies | 128K | Fast
mistral:latest | 7B | 32K | Very Fast
neural-chat:latest | 7B | 8K | Very Fast
orca-mini:latest | 3B | 3K | Very Fast
openchat:latest | 7B | 8K | Very Fast

Caveats

Severity | Behavior | Impact | Code
High | BaseURL must be explicitly configured (no default) | Requests fail without proper configuration | NewOllamaProvider validates BaseURL is set
Low | Cache control directives are removed from messages | Prompt caching features don't work | Stripped during JSON marshaling
Low | OpenAI-specific parameters filtered out | prompt_cache_key, verbosity, store removed | filterOpenAISpecificParameters
Low | User field > 64 characters silently dropped | Longer user identifiers are lost | SanitizeUserField enforces 64-char max