> ## Documentation Index
> Fetch the complete documentation index at: https://docs.getbifrost.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Ollama

> Ollama API conversion guide - local inference, OpenAI-compatible format, streaming, tool calling, and embeddings

## Overview

Ollama is a **local-first, OpenAI-compatible inference engine** for running large language models on personal computers or servers. Bifrost delegates to the OpenAI implementation while supporting Ollama's unique configuration requirements. Key characteristics:

* **Local-first deployment** - Run models locally or on private infrastructure
* **OpenAI API compatibility** - Identical request/response format
* **Full feature support** - Chat, text, embeddings, and streaming
* **Tool calling** - Complete function definition and execution
* **Self-hosted** - No external API dependency required

### Supported Operations

| Operation            | Non-Streaming | Streaming | Endpoint               |
| -------------------- | ------------- | --------- | ---------------------- |
| Chat Completions     | ✅             | ✅         | `/v1/chat/completions` |
| Responses API        | ✅             | ✅         | `/v1/chat/completions` |
| Text Completions     | ✅             | ✅         | `/v1/completions`      |
| Embeddings           | ✅             | -         | `/v1/embeddings`       |
| List Models          | ✅             | -         | `/v1/models`           |
| Image Generation     | ❌             | ❌         | -                      |
| Speech (TTS)         | ❌             | ❌         | -                      |
| Transcriptions (STT) | ❌             | ❌         | -                      |
| Files                | ❌             | ❌         | -                      |
| Batch                | ❌             | ❌         | -                      |

<Note>
  **Unsupported Operations** (❌): Speech, Transcriptions, Files, and Batch are not supported by the upstream Ollama API. These return `UnsupportedOperationError`.

  Ollama is self-hosted. Ensure you have an Ollama instance running and configured with the correct BaseURL (e.g., `http://localhost:11434`).
</Note>

***

# 1. Chat Completions

## Request Parameters

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).

### Filtered Parameters

Removed for Ollama compatibility:

* `prompt_cache_key` - Not supported
* `verbosity` - Anthropic-specific
* `store` - Not supported
* `service_tier` - Not supported

Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).

***

# 2. Responses API

Converted internally to Chat Completions:

```
ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
```

Same parameter support as Chat Completions.

***

# 3. Text Completions

Ollama supports legacy text completion format:

| Parameter              | Mapping             |
| ---------------------- | ------------------- |
| `prompt`               | Direct pass-through |
| `max_tokens`           | max\_tokens         |
| `temperature`, `top_p` | Direct pass-through |
| `stop`                 | Stop sequences      |

***

# 4. Embeddings

Ollama supports text embeddings:

| Parameter         | Notes                               |
| ----------------- | ----------------------------------- |
| `input`           | Text or array of texts              |
| `model`           | Embedding model name                |
| `encoding_format` | "float" or "base64"                 |
| `dimensions`      | Custom output dimensions (optional) |

Response returns embedding vectors with token usage.

***

# 5. List Models

Lists models currently loaded in Ollama with capabilities and context information.

***

## Unsupported Features

| Feature           | Reason                    |
| ----------------- | ------------------------- |
| Speech/TTS        | Not offered by Ollama API |
| Transcription/STT | Not offered by Ollama API |
| Batch Operations  | Not offered by Ollama API |
| File Management   | Not offered by Ollama API |

<Note>
  Ollama follows the OpenAI API specification for request format and error handling. Authentication is optional and depends on deployment (no authentication required for local access, optional Bearer token for protected instances).

  **Critical**: BaseURL must be explicitly configured pointing to your Ollama instance (e.g., `http://localhost:11434` for local, `https://ollama.example.com` for remote).
</Note>

***

## Configuration

<Tabs>
  <Tab title="Gateway">
    ```bash theme={null}
    # Point to local Ollama instance
    curl -X POST http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ollama/llama3.1:latest",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

    # Gateway needs to be configured with Ollama BaseURL
    ```
  </Tab>

  <Tab title="Go SDK">
    ```go theme={null}
    config := &schemas.ProviderConfig{
        NetworkConfig: schemas.NetworkConfig{
            BaseURL: "http://localhost:11434",  // Required!
            DefaultRequestTimeoutInSeconds: 30,
        },
    }
    provider, _ := ollama.NewOllamaProvider(config, logger)

    response, _ := provider.ChatCompletion(ctx, key, request)
    ```
  </Tab>
</Tabs>

**Environment Setup:**

1. Install Ollama from [https://ollama.ai](https://ollama.ai)
2. Pull a model:
   ```bash theme={null}
   ollama pull llama3.1
   ollama pull mistral
   ollama pull neural-chat
   ```
3. Start Ollama server:
   ```bash theme={null}
   ollama serve
   ```
4. Verify it's running:
   ```bash theme={null}
   curl http://localhost:11434/api/tags
   ```

***

## Performance Considerations

**Streaming for Large Models:**
For better user experience with large models, use streaming:

```json theme={null}
{
  "model": "llama3.1:latest",
  "messages": [...],
  "stream": true
}
```

**Token Context:**
Different models have different context windows:

* Llama 3.1 70B: 128K tokens
* Mistral 7B: 32K tokens
* Neural Chat 7B: 8K tokens

**GPU Acceleration:**
Ollama automatically uses GPU if available. For CPU-only, ensure timeout is sufficient.

***

## Popular Models

| Model              | Size   | Context | Speed     |
| ------------------ | ------ | ------- | --------- |
| llama3.1:latest    | Varies | 128K    | Fast      |
| mistral:latest     | 7B     | 32K     | Very Fast |
| neural-chat:latest | 7B     | 8K      | Very Fast |
| orca-mini:latest   | 3B     | 3K      | Very Fast |
| openchat:latest    | 7B     | 8K      | Very Fast |

***

## Caveats

<Accordion title="BaseURL Configuration Required">
  **Severity**: High
  **Behavior**: BaseURL must be explicitly configured - no default
  **Impact**: Requests fail without proper configuration
  **Code**: NewOllamaProvider validates BaseURL is set
</Accordion>

<Accordion title="Cache Control Stripped">
  **Severity**: Low
  **Behavior**: Cache control directives are removed from messages
  **Impact**: Prompt caching features don't work
  **Code**: Stripped during JSON marshaling
</Accordion>

<Accordion title="Parameter Filtering">
  **Severity**: Low
  **Behavior**: OpenAI-specific parameters filtered out
  **Impact**: prompt\_cache\_key, verbosity, store removed
  **Code**: filterOpenAISpecificParameters
</Accordion>

<Accordion title="User Field Size Limit">
  **Severity**: Low
  **Behavior**: User field > 64 characters silently dropped
  **Impact**: Longer user identifiers are lost
  **Code**: SanitizeUserField enforces 64-char max
</Accordion>
