> ## Documentation Index
> Fetch the complete documentation index at: https://docs.getbifrost.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Semantic Caching

> Intelligent response caching based on semantic similarity. Reduce costs and latency by serving cached responses for semantically similar requests.

## Overview

Semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs. This dramatically reduces API costs and latency for repeated or similar queries.

**Key Benefits:**

* **Cost Reduction**: Avoid expensive LLM API calls for similar requests
* **Improved Performance**: Sub-millisecond cache retrieval vs multi-second API calls
* **Intelligent Matching**: Semantic similarity beyond exact text matching
* **Streaming Support**: Full streaming response caching with proper chunk ordering

***

## Core Features

* **Dual-Layer Caching**: Exact hash matching + semantic similarity search (customizable threshold)
* **Vector-Powered Intelligence**: Uses embeddings to find semantically similar requests
* **Dynamic Configuration**: Per-request TTL and threshold overrides via headers/context
* **Model/Provider Isolation**: Separate caching per model and provider combination

***

## Vector Store Setup

Semantic caching requires a configured vector store. Bifrost supports the following vector databases:

<CardGroup cols={2}>
  <Card title="Weaviate" icon="database" href="/integrations/vector-databases/weaviate">
    Production-ready vector database with gRPC support.
  </Card>

  <Card title="Redis / Valkey" icon="database" href="/integrations/vector-databases/redis">
    High-performance in-memory vector store using RediSearch-compatible APIs.
  </Card>

  <Card title="Qdrant" icon="database" href="/integrations/vector-databases/qdrant">
    Rust-based vector search engine with advanced filtering.
  </Card>

  <Card title="Pinecone" icon="database" href="/integrations/vector-databases/pinecone">
    Managed vector database service with serverless options.
  </Card>
</CardGroup>

<Info>
  For detailed setup instructions and configuration options for each vector store, see the [Vector Store documentation](/architecture/framework/vector-store).
</Info>

**Quick Example (Weaviate):**

<Tabs group="vector-store-setup">
  <Tab title="Go SDK">
    ```go theme={null}
    import (
        "context"
        "github.com/maximhq/bifrost/framework/vectorstore"
    )

    // Configure vector store (example: Weaviate)
    vectorConfig := &vectorstore.Config{
        Enabled: true,
        Type:    vectorstore.VectorStoreTypeWeaviate,
        Config: vectorstore.WeaviateConfig{
            Scheme: "http",
            Host:   "localhost:8080",
        },
    }

    // Create vector store
    store, err := vectorstore.NewVectorStore(context.Background(), vectorConfig, logger)
    if err != nil {
        log.Fatal("Failed to create vector store:", err)
    }
    ```
  </Tab>

  <Tab title="config.json">
    ```json theme={null}
    {
      "vector_store": {
        "enabled": true,
        "type": "weaviate",
        "config": {
          "host": "localhost:8080",
          "scheme": "http"
        }
      }
    }
    ```
  </Tab>
</Tabs>

***

## Semantic Cache Configuration

> **UI Note**: The current Web UI flow configures provider-backed semantic caching. If you want direct-only mode (`dimension: 1` with no `provider`), configure it through `config.json`.

<Tabs group="cache-config">
  <Tab title="Go SDK">
    ```go theme={null}
    import (
        "github.com/maximhq/bifrost/plugins/semanticcache"
        "github.com/maximhq/bifrost/core/schemas"
    )

    // Configure semantic cache plugin
    cacheConfig := &semanticcache.Config{
        // Embedding model configuration (Required)
        Provider:       schemas.OpenAI,
        EmbeddingModel: "text-embedding-3-small",
        Dimension:     1536,
        
        // Cache behavior
        TTL:       5 * time.Minute,  // Time to live for cached responses (default: 5 minutes)
        Threshold: 0.8,              // Similarity threshold for cache lookup (default: 0.8)
        CleanUpOnShutdown: true,     // Clean up cache on shutdown (default: false)
        
        // Conversation behavior
        ConversationHistoryThreshold: 5,    // Skip caching if conversation has > N messages (default: 3)
        ExcludeSystemPrompt: bifrost.Ptr(false), // Exclude system messages from cache key (default: false)
        
        // Advanced options
        CacheByModel:    bifrost.Ptr(true),  // Include model in cache key (default: true)
        CacheByProvider: bifrost.Ptr(true),  // Include provider in cache key (default: true)
    }

    // Create plugin
    plugin, err := semanticcache.Init(context.Background(), cacheConfig, logger, store)
    if err != nil {
        log.Fatal("Failed to create semantic cache plugin:", err)
    }

    // Add to Bifrost config
    bifrostConfig := schemas.BifrostConfig{
        LLMPlugins: []schemas.LLMPlugin{plugin},
        // ... other config
    }
    ```
  </Tab>

  <Tab title="Web UI">
    <img src="https://mintcdn.com/bifrost/haPSvjWru9cl-Jd-/media/ui-semantic-cache-config.png?fit=max&auto=format&n=haPSvjWru9cl-Jd-&q=85&s=174f448c78600e2a3cbb4fc820fb0fed" alt="Semantic Cache Plugin Configuration" width="3492" height="2358" data-path="media/ui-semantic-cache-config.png" />

    **Prerequisites**: A vector store must be configured and enabled in `config.json`, and at least one provider must be configured, before the toggle becomes available.

    1. **Navigate to the Config page** in the Bifrost UI and find the **Plugins** section.

    2. **Toggle** the **Enable Semantic Caching** switch to enable it. The configuration form expands below.

    3. **Fill in the fields** across the four sections:

    **Provider and Model Settings** (required for semantic mode):

    * **Configured Providers**: Dropdown of providers already set up in Bifrost. The selected provider's API keys are inherited automatically.
    * **Embedding Model**: The embedding model to use (e.g. `text-embedding-3-small`).

    **Cache Settings**:

    * **TTL (seconds)**: How long cached responses are kept (default: 300 s).
    * **Similarity Threshold**: Cosine similarity cutoff for a cache hit (0–1, default: 0.8).
    * **Dimension**: Vector dimension matching your embedding model (e.g. 1536 for `text-embedding-3-small`).

    **Conversation Settings**:

    * **Conversation History Threshold**: Skip caching when the conversation has more than this many messages (default: 3).
    * **Exclude System Prompt** (toggle): Exclude system messages from cache-key generation.

    **Cache Behavior**:

    * **Cache by Model** (toggle): Include the model name in the cache key (default: on).
    * **Cache by Provider** (toggle): Include the provider name in the cache key (default: on).

    4. Click **Save**. Changes are persisted and applied immediately for enabled plugins via the API reload path; other plugin changes (e.g. via `config.json`) may still require a restart.
  </Tab>

  <Tab title="config.json">
    ```json theme={null}
    {
      "plugins": [
        {
          "enabled": true,
          "name": "semantic_cache",
          "config": {        
            "provider": "openai",
            "embedding_model": "text-embedding-3-small",
            "dimension": 1536,
            
            "cleanup_on_shutdown": true,
            "ttl": "5m",
            "threshold": 0.8,
            
            "conversation_history_threshold": 3,
            "exclude_system_prompt": false,
            
            "cache_by_model": true,
            "cache_by_provider": true
          }
        }
      ]
    }
    ```

    > **Note**: Provider API keys are inherited automatically from the global provider configuration. You do not need to (and cannot) specify keys inside the plugin config.

    **TTL Format Options:**

    * Duration strings: `"30s"`, `"5m"`, `"1h"`, `"24h"`
    * Numeric seconds: `300` (5 minutes), `3600` (1 hour)
  </Tab>
</Tabs>

***

## Direct Hash Mode (Embedding-Free)

Direct hash mode provides exact-match caching without requiring an embedding provider. Each request is hashed deterministically based on its normalized input, parameters, and stream flag. Identical requests produce cache hits; different wording is a cache miss.

Exact-match direct entries are stored and retrieved using a deterministic cache ID. This keeps repeated direct cache lookups fast and consistent across retries, streaming responses, and restarts.

**When to use direct hash mode:**

* You only need exact-match deduplication (no fuzzy/semantic matching)
* You cannot or do not want to call an external embedding API
* You want the lowest possible latency with zero embedding overhead
* Cost-sensitive environments where embedding API calls add up

### Setup

To enable direct-only mode globally, set `dimension: 1` and omit the `provider` and `embedding_model` fields from the plugin config. The plugin will automatically fall back to direct search only.

> **Important**: If you specify `dimension: 1` and also provide a `provider`, Bifrost treats the config as provider-backed semantic mode, not direct-only mode. To use direct-only mode, omit the `provider` field entirely.

<Warning>
  A vector store is still required as the storage backend, even in direct hash mode. See [Recommended Vector Store](#recommended-vector-store) below for the best choice.
</Warning>

<Tabs group="direct-hash-setup">
  <Tab title="Go SDK">
    ```go theme={null}
    import (
        "github.com/maximhq/bifrost/plugins/semanticcache"
    )

    cacheConfig := &semanticcache.Config{
        // No Provider or EmbeddingModel -- direct hash mode only
        Dimension: 1, // Placeholder; entries are stored as metadata-only (no embedding vectors). Change dimension before switching to dual-layer mode to avoid mixed-dimension issues.

        TTL:               5 * time.Minute,
        CleanUpOnShutdown: true,
        CacheByModel:      bifrost.Ptr(true),
        CacheByProvider:   bifrost.Ptr(true),
    }

    plugin, err := semanticcache.Init(ctx, cacheConfig, logger, store)
    ```
  </Tab>

  <Tab title="Helm">
    ```yaml theme={null}
    bifrost:
      plugins:
        semanticCache:
          enabled: true
          config:
            dimension: 1
            ttl: "5m"
            cleanup_on_shutdown: true
            cache_by_model: true
            cache_by_provider: true
    ```
  </Tab>

  <Tab title="config.json">
    ```json theme={null}
    {
      "plugins": [
        {
          "enabled": true,
          "name": "semantic_cache",
          "config": {
            "dimension": 1,
            "ttl": "5m",
            "cleanup_on_shutdown": true,
            "cache_by_model": true,
            "cache_by_provider": true
          }
        }
      ]
    }
    ```
  </Tab>
</Tabs>

When initialized this way, all requests automatically use direct hash matching regardless of the `x-bf-cache-type` header. No embeddings are generated, and no embedding provider credentials are needed.

### Recommended Vector Store

**Redis/Valkey-compatible stores** are recommended for direct hash mode. They do not require vectors for metadata-only entries, and all cache fields are indexed as TAG fields for fast exact-match lookups.

<Warning>
  Qdrant and Pinecone are not compatible with direct hash mode when no embedding provider is configured. These stores require a vector for every entry; the plugin's zero-vector placeholder codepath requires an initialised embedding client, so storage will fail if no provider is set. Weaviate requires a vector per entry as well and is therefore also not recommended for direct-only mode.
</Warning>

<Tabs group="direct-hash-redis">
  <Tab title="Helm">
    ```yaml theme={null}
    vectorStore:
      enabled: true
      type: redis
      redis:
        external:
          enabled: true
          host: "redis-or-valkey.example.com"
          port: 6379
          password: "your-redis-password"
    ```
  </Tab>

  <Tab title="config.json">
    ```json theme={null}
    {
      "vector_store": {
        "enabled": true,
        "type": "redis",
        "config": {
          "addr": "localhost:6379"
        }
      }
    }
    ```

    <Info>
      For Valkey deployments, keep `vector_store.type` as `"redis"` and point `config.addr` to your Valkey endpoint.
    </Info>
  </Tab>
</Tabs>

### Per-Request Cache Type Override

When the plugin is initialized **without** an embedding provider (direct-only mode), all requests use direct hash matching automatically. The `x-bf-cache-type` header has no effect.

When the plugin is initialized **with** an embedding provider (dual-layer mode), you can force direct-only matching on specific requests using the `x-bf-cache-type: direct` header. See [Cache Type Control](#cache-type-control) for details.

***

## Cache Triggering

<Warning>
  **Cache Key is mandatory**: Semantic caching only activates when a cache key is provided. Without a cache key, requests bypass caching entirely.
</Warning>

<Tabs group="cache-triggering">
  <Tab title="Go SDK">
    Must set cache key in request context:

    ```go theme={null}
    // This request WILL be cached
    ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
    response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), request)

    // This request will NOT be cached (no context value)
    response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), request)
    ```
  </Tab>

  <Tab title="HTTP API">
    Must set cache key in request header `x-bf-cache-key`:

    ```bash theme={null}
    # This request WILL be cached
    curl -H "x-bf-cache-key: session-123" ...

    # This request will NOT be cached (no header)
    curl ...
    ```
  </Tab>
</Tabs>

## Per-Request Overrides

Override default TTL and similarity threshold per request:

<Tabs group="per-request-overrides">
  <Tab title="Go SDK">
    You can set TTL and threshold in the request context using the semantic cache context keys:

    ```go theme={null}
    // Go SDK: Custom TTL and threshold
    ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
    ctx = context.WithValue(ctx, semanticcache.CacheTTLKey, 30*time.Second)
    ctx = context.WithValue(ctx, semanticcache.CacheThresholdKey, 0.9)
    ```
  </Tab>

  <Tab title="HTTP API">
    You can set TTL and threshold in the request headers `x-bf-cache-ttl` and `x-bf-cache-threshold`:

    ```bash theme={null}
    # HTTP: Custom TTL and threshold
    curl -H "x-bf-cache-key: session-123" \
         -H "x-bf-cache-ttl: 30s" \
         -H "x-bf-cache-threshold: 0.9" ...
    ```
  </Tab>
</Tabs>

***

## Advanced Cache Control

### Cache Type Control

Control which caching mechanism to use per request:

<Tabs group="cache-type-control">
  <Tab title="Go SDK">
    ```go theme={null}
    // Use only direct hash matching (fastest)
    ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
    ctx = context.WithValue(ctx, semanticcache.CacheTypeKey, semanticcache.CacheTypeDirect)

    // Use only semantic similarity search
    ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")  
    ctx = context.WithValue(ctx, semanticcache.CacheTypeKey, semanticcache.CacheTypeSemantic)

    // Default behavior: Direct + semantic fallback (if not specified)
    ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
    ```
  </Tab>

  <Tab title="HTTP API">
    ```bash theme={null}
    # Direct hash matching only
    curl -H "x-bf-cache-key: session-123" \
         -H "x-bf-cache-type: direct" ...

    # Semantic similarity search only  
    curl -H "x-bf-cache-key: session-123" \
         -H "x-bf-cache-type: semantic" ...

    # Default: Both (if header not specified)
    curl -H "x-bf-cache-key: session-123" ...
    ```
  </Tab>
</Tabs>

### No-Store Control

Disable response caching while still allowing cache reads:

<Tabs group="no-store-control">
  <Tab title="Go SDK">
    ```go theme={null}
    // Read from cache but don't store the response
    ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
    ctx = context.WithValue(ctx, semanticcache.CacheNoStoreKey, true)
    ```
  </Tab>

  <Tab title="HTTP API">
    ```bash theme={null}
    # Read from cache but don't store response
    curl -H "x-bf-cache-key: session-123" \
         -H "x-bf-cache-no-store: true" ...
    ```
  </Tab>
</Tabs>

***

## Conversation Configuration

### History Threshold Logic

The `ConversationHistoryThreshold` setting skips caching for conversations with many messages to prevent false positives:

**Why this matters:**

* **Semantic False Positives**: Long conversation histories have high probability of semantic matches with unrelated conversations due to topic overlap
* **Direct Cache Inefficiency**: Long conversations rarely have exact hash matches, making direct caching less effective
* **Performance**: Reduces vector store load by filtering out low-value caching scenarios

```json theme={null}
{
  "conversation_history_threshold": 3  // Skip caching if > 3 messages in conversation
}
```

**Recommended Values:**

* **1-2**: Very conservative (may miss valuable caching opportunities)
* **3-5**: Balanced approach (default: 3)
* **10+**: Cache longer conversations (higher false positive risk)

### System Prompt Handling

Control whether system messages are included in cache key generation:

```json theme={null}
{
  "exclude_system_prompt": false  // Include system messages in cache key (default)
}
```

**When to exclude (`true`):**

* System prompts change frequently but content is similar
* Multiple system prompt variations for same use case
* Focus caching on user content similarity

**When to include (`false`):**

* System prompts significantly change response behavior
* Each system prompt requires distinct cached responses
* Strict response consistency requirements

***

## Cache Management

### Cache Metadata Location

When responses are served from semantic cache, 3 key variables are automatically added to the response:

**Location**: `response.ExtraFields.CacheDebug` (as a JSON object)

**Fields**:

* `CacheHit` (boolean): `true` if the response was served from the cache, `false` when lookup fails.
* `HitType` (string): `"semantic"` for similarity match, `"direct"` for hash match
* `CacheID` (string): Unique cache entry ID for management operations (present only for cache hits)

**Semantic Cache Only**:

* `ProviderUsed` (string): Provider used for the calculating semantic match embedding. (present for both cache hits and misses)
* `ModelUsed` (string): Model used for the calculating semantic match embedding. (present for both cache hits and misses)
* `InputTokens` (number): Number of tokens extracted from the request for the semantic match embedding calculation. (present for both cache hits and misses)
* `Threshold` (number): Similarity threshold used for the match. (present only for cache hits)
* `Similarity` (number): Similarity score for the match. (present only for cache hits)

Example HTTP Response:

```json theme={null}
{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "direct",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
    }
  }
}

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "semantic",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "threshold": 0.8,
      "similarity": 0.95,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 100
    }
  }
}

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": false,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 20
    }
  }
}
```

These variables allow you to detect cached responses and get the cache entry ID needed for clearing specific entries.

### Clear Specific Cache Entry

Use the request ID from cached responses to clear specific entries:

<Tabs group="cache-clear">
  <Tab title="Go SDK">
    ```go theme={null}
    // Clear specific entry by request ID
    err := plugin.ClearCacheForRequestID("550e8400-e29b-41d4-a716-446655440000")

    // Clear all entries for a cache key  
    err := plugin.ClearCacheForKey("support-session-456")
    ```
  </Tab>

  <Tab title="HTTP API">
    ```bash theme={null}
    # Clear specific cached entry by request ID
    curl -X DELETE http://localhost:8080/api/cache/clear/550e8400-e29b-41d4-a716-446655440000

    # Clear all entries for a cache key
    curl -X DELETE http://localhost:8080/api/cache/clear-by-key/support-session-456
    ```
  </Tab>
</Tabs>

### Cache Lifecycle & Cleanup

The semantic cache automatically handles cleanup to prevent storage bloat:

**Automatic Cleanup:**

* **TTL Expiration**: Entries are automatically removed when TTL expires
* **Shutdown Cleanup**: All cache entries are cleared from the vector store namespace and the namespace itself when Bifrost client shuts down
* **Namespace Isolation**: Each Bifrost instance uses isolated vector store namespaces to prevent conflicts

**Manual Cleanup Options:**

* Clear specific entries by request ID (see examples above)
* Clear all entries for a cache key
* Restart Bifrost to clear all cache data

<Warning>
  The semantic cache namespace and all its cache entries are deleted when Bifrost client shuts down **only if `cleanup_on_shutdown` is set to `true`**. By default (`cleanup_on_shutdown: false`), cache data persists between restarts. DO NOT use the plugin's namespace for external purposes.
</Warning>

<Warning>
  **Dimension Changes**: If you update the `dimension` config, the existing namespace will contain data with mixed dimensions, causing retrieval issues. To avoid this, either use a different `vector_store_namespace` or set `cleanup_on_shutdown: true` before restarting.
</Warning>

***

<Info>
  **Vector Store Requirement**: Semantic caching requires a configured vector store. Bifrost supports Weaviate, Redis/Valkey-compatible endpoints, Qdrant, and Pinecone. See the [Vector Store documentation](/architecture/framework/vector-store) for setup details.
</Info>
