Semantic Caching

Overview

Semantic caching uses vector similarity search to cache AI responses and serve them for semantically similar requests, even when the exact wording differs. This reduces API costs and latency for repeated or similar queries.

Key Benefits:

Cost Reduction: Avoid expensive LLM API calls for similar requests
Improved Performance: Sub-millisecond cache retrieval vs multi-second API calls
Intelligent Matching: Semantic similarity beyond exact text matching
Streaming Support: Full streaming response caching with proper chunk ordering

Core Features

Dual-Layer Caching: Exact hash matching + semantic similarity search (customizable threshold)
Vector-Powered Intelligence: Uses embeddings to find semantically similar requests
Dynamic Configuration: Per-request TTL and threshold overrides via headers/context
Model/Provider Isolation: Separate caching per model and provider combination
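
The dual-layer lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the plugin's implementation: the `entry` type, `hashRequest`, `cosine`, and `lookup` are hypothetical names, the entries live in a plain slice rather than a vector store, and cosine similarity is assumed as the similarity measure.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math"
)

// entry is a hypothetical cache record; none of these names come from Bifrost.
type entry struct {
	scope     string    // provider + "/" + model, so lookups stay isolated
	hash      string    // exact-match key for the direct layer
	embedding []float64 // request embedding for the semantic layer
	response  string
}

// hashRequest derives the exact-match key from the scope and the prompt text.
func hashRequest(scope, prompt string) string {
	sum := sha256.Sum256([]byte(scope + "\x00" + prompt))
	return hex.EncodeToString(sum[:])
}

// cosine returns the cosine similarity of two equal-length vectors, or 0 if
// the lengths differ or either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup tries the direct (hash) layer first, then falls back to the semantic
// layer. It returns the response and the hit type: "direct", "semantic", or ""
// on a miss. Entries from other provider/model scopes are never considered.
func lookup(entries []entry, scope, prompt string, emb []float64, threshold float64) (string, string) {
	h := hashRequest(scope, prompt)
	for _, e := range entries {
		if e.scope == scope && e.hash == h {
			return e.response, "direct"
		}
	}
	best, bestScore := "", -1.0
	for _, e := range entries {
		if e.scope != scope {
			continue
		}
		if s := cosine(emb, e.embedding); s > bestScore {
			best, bestScore = e.response, s
		}
	}
	if bestScore >= threshold {
		return best, "semantic"
	}
	return "", ""
}

func main() {
	scope := "openai/gpt-4o-mini"
	entries := []entry{{
		scope:     scope,
		hash:      hashRequest(scope, "What is the capital of France?"),
		embedding: []float64{1, 0}, // toy 2-dimensional embedding
		response:  "Paris",
	}}

	// Identical wording: served by the direct layer.
	fmt.Println(lookup(entries, scope, "What is the capital of France?", []float64{1, 0}, 0.8))
	// Different wording, nearby embedding: served by the semantic layer.
	fmt.Println(lookup(entries, scope, "Which city is France's capital?", []float64{0.9, 0.1}, 0.8))
	// Same wording under another model: a miss, because scopes are isolated.
	resp, hit := lookup(entries, "openai/gpt-4o", "What is the capital of France?", []float64{1, 0}, 0.8)
	fmt.Printf("%q %q\n", resp, hit)
}
```

Raising the threshold in this sketch trades hit rate for precision: fewer paraphrases match, but unrelated requests are less likely to be served a cached answer.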

Vector Store Setup

Go SDK

import (
    "context"
    "log"

    "github.com/maximhq/bifrost/core/schemas"
    "github.com/maximhq/bifrost/framework/vectorstore"
)

// Configure vector store
vectorConfig := &vectorstore.Config{
    Enabled: true,
    Type:    vectorstore.VectorStoreTypeWeaviate,
    Config: vectorstore.WeaviateConfig{
        Scheme:    "http",
        Host:      "localhost:8080",
    },
}

// Create vector store
store, err := vectorstore.NewVectorStore(context.Background(), vectorConfig, logger)
if err != nil {
    log.Fatal("Failed to create vector store:", err)
}

Semantic Cache Configuration

Go SDK

import (
    "context"
    "log"
    "time"

    bifrost "github.com/maximhq/bifrost/core"
    "github.com/maximhq/bifrost/core/schemas"
    "github.com/maximhq/bifrost/plugins/semanticcache"
)

// Configure semantic cache plugin
cacheConfig := semanticcache.Config{
    // Embedding model configuration (Required)
    Provider:       schemas.OpenAI,
    Keys:          []schemas.Key{{Value: "sk-..."}},
    EmbeddingModel: "text-embedding-3-small",
    Dimension:     1536,
    
    // Cache behavior
    TTL:       5 * time.Minute,  // Time to live for cached responses (default: 5 minutes)
    Threshold: 0.8,              // Similarity threshold for cache lookup (default: 0.8)
    CleanUpOnShutdown: true,     // Clean up cache on shutdown (default: false)
    
    // Conversation behavior
    ConversationHistoryThreshold: 5,    // Skip caching if conversation has > N messages (default: 3)
    ExcludeSystemPrompt: bifrost.Ptr(false), // Exclude system messages from cache key (default: false)
    
    // Advanced options
    CacheByModel:    bifrost.Ptr(true),  // Include model in cache key (default: true)
    CacheByProvider: bifrost.Ptr(true),  // Include provider in cache key (default: true)
}

// Create plugin
plugin, err := semanticcache.Init(context.Background(), cacheConfig, logger, store)
if err != nil {
    log.Fatal("Failed to create semantic cache plugin:", err)
}

// Add to Bifrost config
bifrostConfig := schemas.BifrostConfig{
    Plugins: []schemas.Plugin{plugin},
    // ... other config
}

Cache Triggering

Cache Key is mandatory: Semantic caching only activates when a cache key is provided. Without a cache key, requests bypass caching entirely.

Go SDK

Set the cache key in the request context:

// This request WILL be cached
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
response, err := client.ChatCompletionRequest(ctx, request)

// This request will NOT be cached (no cache key in the context)
response, err = client.ChatCompletionRequest(context.Background(), request)

Per-Request Overrides

Override default TTL and similarity threshold per request:

Go SDK

Set the TTL and threshold in the request context, using the plugin's context keys:

// Go SDK: Custom TTL and threshold
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
ctx = context.WithValue(ctx, semanticcache.CacheTTLKey, 30*time.Second)
ctx = context.WithValue(ctx, semanticcache.CacheThresholdKey, 0.9)

Advanced Cache Control

Cache Type Control

Control which caching mechanism to use per request:

Go SDK

// Use only direct hash matching (fastest)
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
ctx = context.WithValue(ctx, semanticcache.CacheTypeKey, semanticcache.CacheTypeDirect)

// Use only semantic similarity search
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")  
ctx = context.WithValue(ctx, semanticcache.CacheTypeKey, semanticcache.CacheTypeSemantic)

// Default behavior: Direct + semantic fallback (if not specified)
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")

No-Store Control

Disable response caching while still allowing cache reads:

Go SDK

// Read from cache but don't store the response
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
ctx = context.WithValue(ctx, semanticcache.CacheNoStoreKey, true)

Conversation Configuration

History Threshold Logic

The ConversationHistoryThreshold setting skips caching for conversations with many messages to prevent false positives.

Why this matters:

Semantic False Positives: Long conversation histories have high probability of semantic matches with unrelated conversations due to topic overlap
Direct Cache Inefficiency: Long conversations rarely have exact hash matches, making direct caching less effective
Performance: Reduces vector store load by filtering out low-value caching scenarios

{
  "conversation_history_threshold": 3  // Skip caching if > 3 messages in conversation
}

Recommended Values:

1-2: Very conservative (may miss valuable caching opportunities)
3-5: Balanced approach (default: 3)
10+: Cache longer conversations (higher false positive risk)
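
The boundary behavior of the threshold ("more than N messages") is easy to get wrong by one, so a small sketch may help. It assumes every message in the conversation counts toward the limit, which the text above does not confirm; `message` and `shouldSkipCaching` are hypothetical names.

```go
package main

import "fmt"

// message is a hypothetical, minimal chat message type.
type message struct {
	Role    string
	Content string
}

// shouldSkipCaching reports whether a conversation exceeds the history
// threshold. Caching is skipped only when the message count is strictly
// greater than the threshold. Assumption: all roles count toward the total.
func shouldSkipCaching(msgs []message, threshold int) bool {
	return len(msgs) > threshold
}

func main() {
	conv := []message{
		{Role: "system", Content: "You are a helpful assistant."},
		{Role: "user", Content: "What is semantic caching?"},
		{Role: "assistant", Content: "It serves cached answers for similar requests."},
	}
	// Exactly 3 messages with threshold 3: still eligible for caching.
	fmt.Println(shouldSkipCaching(conv, 3))

	conv = append(conv, message{Role: "user", Content: "How is similarity measured?"})
	// 4 messages with threshold 3: caching is skipped.
	fmt.Println(shouldSkipCaching(conv, 3))
}
```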

System Prompt Handling

Control whether system messages are included in cache key generation:

{
  "exclude_system_prompt": false  // Include system messages in cache key (default)
}

When to exclude (true):

System prompts change frequently but content is similar
Multiple system prompt variations for same use case
Focus caching on user content similarity

When to include (false):

System prompts significantly change response behavior
Each system prompt requires distinct cached responses
Strict response consistency requirements
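
The effect of this setting on the direct (hash) layer can be sketched as follows. This is an illustration only: `message` and `cacheKey` are hypothetical, and SHA-256 over role and content is an assumed keying scheme, not necessarily what the plugin does.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// message is a hypothetical, minimal chat message type.
type message struct {
	Role    string
	Content string
}

// cacheKey hashes the conversation into an exact-match key. When
// excludeSystem is true, system messages do not contribute to the key.
func cacheKey(msgs []message, excludeSystem bool) string {
	h := sha256.New()
	for _, m := range msgs {
		if excludeSystem && m.Role == "system" {
			continue
		}
		// A zero byte separates fields so ("ab","c") and ("a","bc") differ.
		h.Write([]byte(m.Role))
		h.Write([]byte{0})
		h.Write([]byte(m.Content))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	question := message{Role: "user", Content: "Summarize our refund policy."}
	a := []message{{Role: "system", Content: "Answer formally."}, question}
	b := []message{{Role: "system", Content: "Answer casually."}, question}

	// Included (default): different system prompts produce different keys.
	fmt.Println(cacheKey(a, false) == cacheKey(b, false))
	// Excluded: both conversations share one key and one cached response.
	fmt.Println(cacheKey(a, true) == cacheKey(b, true))
}
```

In this sketch, excluding the system prompt means the formal and casual variants would share a cached response, which is exactly the trade-off listed above.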

Cache Management

Cache Metadata Location

When responses are served from the semantic cache, three key variables are automatically added to the response.

Location: response.ExtraFields.CacheDebug (as a JSON object)

Fields:

CacheHit (boolean): true if the response was served from the cache, false on a cache miss.
HitType (string): "semantic" for similarity match, "direct" for hash match
CacheID (string): Unique cache entry ID for management operations (present only for cache hits)

Semantic Cache Only:

ProviderUsed (string): Provider used to compute the embedding for the semantic match (present for both cache hits and misses).
ModelUsed (string): Model used to compute the embedding for the semantic match (present for both cache hits and misses).
InputTokens (number): Number of tokens extracted from the request for the embedding computation (present for both cache hits and misses).
Threshold (number): Similarity threshold used for the match. (present only for cache hits)
Similarity (number): Similarity score for the match. (present only for cache hits)

Example HTTP Response:

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "direct",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001"
    }
  }
}

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "semantic",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "threshold": 0.8,
      "similarity": 0.95,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 100
    }
  }
}

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": false,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 20
    }
  }
}

These variables allow you to detect cached responses and get the cache entry ID needed for clearing specific entries.
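
For clients that call the HTTP API directly, the metadata above can be read with the standard library alone. The sketch below mirrors the JSON field names shown in the examples; the `cacheDebug` and `envelope` struct types and the `parseCacheDebug` helper are hypothetical and are not Bifrost SDK types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cacheDebug mirrors the cache_debug JSON object from the examples above.
// It is a hypothetical client-side type, not one exported by Bifrost.
type cacheDebug struct {
	CacheHit     bool    `json:"cache_hit"`
	HitType      string  `json:"hit_type,omitempty"`
	CacheID      string  `json:"cache_id,omitempty"`
	Threshold    float64 `json:"threshold,omitempty"`
	Similarity   float64 `json:"similarity,omitempty"`
	ProviderUsed string  `json:"provider_used,omitempty"`
	ModelUsed    string  `json:"model_used,omitempty"`
	InputTokens  int     `json:"input_tokens,omitempty"`
}

// envelope captures only the part of the response body this sketch needs.
type envelope struct {
	ExtraFields struct {
		CacheDebug *cacheDebug `json:"cache_debug"`
	} `json:"extra_fields"`
}

// parseCacheDebug extracts cache_debug from a response body. It returns nil
// (and no error) when the response carries no cache metadata.
func parseCacheDebug(body []byte) (*cacheDebug, error) {
	var env envelope
	if err := json.Unmarshal(body, &env); err != nil {
		return nil, err
	}
	return env.ExtraFields.CacheDebug, nil
}

func main() {
	body := []byte(`{
	  "extra_fields": {
	    "cache_debug": {
	      "cache_hit": true,
	      "hit_type": "semantic",
	      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
	      "threshold": 0.8,
	      "similarity": 0.95
	    }
	  }
	}`)

	dbg, err := parseCacheDebug(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if dbg != nil && dbg.CacheHit {
		// Keep the cache ID if the entry may need to be cleared later.
		fmt.Printf("served from cache (%s), id=%s\n", dbg.HitType, dbg.CacheID)
	}
}
```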

Clear Specific Cache Entry

Use the cache entry ID returned with a cached response (cache_id, passed as the request ID below) to clear specific entries:

Go SDK

// Clear specific entry by request ID
err := plugin.ClearCacheForRequestID("550e8400-e29b-41d4-a716-446655440000")

// Clear all entries for a cache key
err = plugin.ClearCacheForKey("support-session-456")

Cache Lifecycle & Cleanup

The semantic cache automatically handles cleanup to prevent storage bloat.

Automatic Cleanup:

TTL Expiration: Entries are automatically removed when TTL expires
Shutdown Cleanup: When cleanup_on_shutdown is true, all cache entries and the vector store namespace itself are deleted when the Bifrost client shuts down
Namespace Isolation: Each Bifrost instance uses isolated vector store namespaces to prevent conflicts

Manual Cleanup Options:

Clear specific entries by request ID (see examples above)
Clear all entries for a cache key
Restart Bifrost to clear all cache data (only when cleanup_on_shutdown is set to true)

The semantic cache namespace and all its cache entries are deleted when Bifrost client shuts down only if cleanup_on_shutdown is set to true. By default (cleanup_on_shutdown: false), cache data persists between restarts. DO NOT use the plugin’s namespace for external purposes.

Dimension Changes: If you update the dimension config, the existing namespace will contain data with mixed dimensions, causing retrieval issues. To avoid this, either use a different vector_store_namespace or set cleanup_on_shutdown: true before restarting.

Vector Store Requirement: Semantic caching requires a configured vector store (currently Weaviate only). Without vector store setup, the plugin will not function.
