Overview

Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. Bifrost delegates to the OpenAI implementation while supporting Ollama’s unique configuration requirements. Key characteristics:
  • Local-first deployment - Run models locally or on private infrastructure
  • OpenAI API compatibility - Identical request/response format
  • Full feature support - Chat, text, embeddings, and streaming
  • Tool calling - Complete function definition and execution
  • Self-hosted - No external API dependency required

Supported Operations

Operation | Non-Streaming | Streaming | Endpoint
Chat Completions | ✅ | ✅ | /v1/chat/completions
Responses API | ✅ | ✅ | /v1/chat/completions
Text Completions | ✅ | ✅ | /v1/completions
Embeddings | ✅ | - | /v1/embeddings
List Models | ✅ | - | /v1/models
Image Generation | ❌ | ❌ | -
Speech (TTS) | ❌ | ❌ | -
Transcriptions (STT) | ❌ | ❌ | -
Files | ❌ | ❌ | -
Batch | ❌ | ❌ | -
Unsupported Operations (❌): Speech, Transcriptions, Files, and Batch are not supported by the upstream Ollama API. These return UnsupportedOperationError.

Ollama is self-hosted. Ensure you have an Ollama instance running and configured with the correct BaseURL (e.g., http://localhost:11434).

1. Chat Completions

Request Parameters

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.

Filtered Parameters

Removed for Ollama compatibility:
  • prompt_cache_key - Not supported
  • verbosity - Not supported
  • store - Not supported
  • service_tier - Not supported
Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to OpenAI Chat Completions.
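For illustration, a chat completion with a tool definition sent through the gateway might look like the request below. The gateway address (localhost:8080), the model name, and the get_weather function are placeholder assumptions; any tool-capable model pulled into Ollama will do.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'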

2. Responses API

Converted internally to Chat Completions:
ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
Same parameter support as Chat Completions.

3. Text Completions

Ollama supports the legacy text completion format:
Parameter | Mapping
prompt | Direct pass-through
max_tokens | max_tokens
temperature, top_p | Direct pass-through
stop | Stop sequences
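As a sketch, a legacy text completion request routed through the gateway could look like this; the gateway address and model name are assumptions carried over from the configuration example below.

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
    "stop": ["\n\n"]
  }'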

4. Embeddings

Ollama supports text embeddings:
Parameter | Notes
input | Text or array of texts
model | Embedding model name
encoding_format | "float" or "base64"
dimensions | Custom output dimensions (optional)
Response returns embedding vectors with token usage.
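For example, an embeddings request through the gateway might look like the following; the model name (ollama/nomic-embed-text) is an assumption, so substitute whichever embedding model you have pulled into Ollama.

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text",
    "input": ["first text to embed", "second text to embed"],
    "encoding_format": "float"
  }'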

5. List Models

Lists models currently loaded in Ollama with capabilities and context information.
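For example, assuming the gateway from the configuration section below is listening on localhost:8080:

curl http://localhost:8080/v1/models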

Unsupported Features

Feature | Reason
Speech/TTS | Not offered by Ollama API
Transcription/STT | Not offered by Ollama API
Batch Operations | Not offered by Ollama API
File Management | Not offered by Ollama API
Ollama follows the OpenAI API specification for request format and error handling. Authentication is optional and depends on deployment: no authentication is required for local access, and an optional Bearer token can be used for protected instances.

Critical: BaseURL must be explicitly configured to point to your Ollama instance (e.g., http://localhost:11434 for local, https://ollama.example.com for remote).

Configuration

# Point to local Ollama instance
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Gateway needs to be configured with Ollama BaseURL
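One common way to provide the BaseURL is through the gateway's configuration file. The snippet below is an illustrative sketch only: it assumes a providers map with a network_config.base_url field, which may not match your gateway version; consult the Bifrost configuration reference for the authoritative schema.

{
  "providers": {
    "ollama": {
      "network_config": {
        "base_url": "http://localhost:11434"
      }
    }
  }
}

Only the requirement that the base URL point at your Ollama instance comes from this page; the field names above are assumptions.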
Environment Setup:
  1. Install Ollama from https://ollama.ai
  2. Pull a model:
    ollama pull llama3.1
    ollama pull mistral
    ollama pull neural-chat
    
  3. Start Ollama server:
    ollama serve
    
  4. Verify it’s running:
    curl http://localhost:11434/api/tags
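Before routing traffic through the gateway, you can also confirm which models are available locally with the Ollama CLI:

ollama list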
    

Performance Considerations

Streaming for Large Models: For a better user experience with large models, enable streaming:
{
  "model": "llama3.1:latest",
  "messages": [...],
  "stream": true
}
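For example, an end-to-end streaming call through the gateway could look like this; curl's -N flag disables output buffering so chunks print as they arrive, and the gateway address and prefixed model name follow the configuration example above.

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Write a short poem about local inference"}],
    "stream": true
  }'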
Token Context: Different models have different context windows:
  • Llama 3.1 70B: 128K tokens
  • Mistral 7B: 32K tokens
  • Neural Chat 7B: 8K tokens
GPU Acceleration: Ollama automatically uses the GPU when one is available. For CPU-only deployments, make sure request timeouts are generous enough for slower generation.
Model | Size | Context | Speed
llama3.1:latest | Varies | 128K | Fast
mistral:latest | 7B | 32K | Very Fast
neural-chat:latest | 7B | 8K | Very Fast
orca-mini:latest | 3B | 3K | Very Fast
openchat:latest | 7B | 8K | Very Fast

Caveats

Severity | Behavior | Impact | Code
High | BaseURL must be explicitly configured (no default) | Requests fail without proper configuration | NewOllamaProvider validates BaseURL is set
Low | Cache control directives are removed from messages | Prompt caching features don't work | Stripped during JSON marshaling
Low | OpenAI-specific parameters filtered out | prompt_cache_key, verbosity, store removed | filterOpenAISpecificParameters
Low | User field > 64 characters silently dropped | Longer user identifiers are lost | SanitizeUserField enforces 64-char max