Semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs. This dramatically reduces API costs and latency for repeated or similar queries.

Key Benefits:
Cost Reduction: Avoid expensive LLM API calls for similar requests
Improved Performance: Sub-millisecond cache retrieval vs multi-second API calls
Intelligent Matching: Semantic similarity beyond exact text matching
Streaming Support: Full streaming response caching with proper chunk ordering
UI Note: The current Web UI flow configures provider-backed semantic caching. If you want direct-only mode (dimension: 1 with no provider), configure it through config.json.
Go SDK
Web UI
config.json
```go
import (
	"context"
	"log"
	"time"

	bifrost "github.com/maximhq/bifrost/core" // provides bifrost.Ptr
	"github.com/maximhq/bifrost/core/schemas"
	"github.com/maximhq/bifrost/plugins/semanticcache"
)

// Configure the semantic cache plugin
cacheConfig := &semanticcache.Config{
	// Embedding model configuration (required)
	Provider:       schemas.OpenAI,
	EmbeddingModel: "text-embedding-3-small",
	Dimension:      1536,

	// Cache behavior
	TTL:               5 * time.Minute, // Time to live for cached responses (default: 5 minutes)
	Threshold:         0.8,             // Similarity threshold for cache lookup (default: 0.8)
	CleanUpOnShutdown: true,            // Clean up cache on shutdown (default: false)

	// Conversation behavior
	ConversationHistoryThreshold: 5,                  // Skip caching if conversation has > N messages (default: 3)
	ExcludeSystemPrompt:          bifrost.Ptr(false), // Exclude system messages from cache key (default: false)

	// Advanced options
	CacheByModel:    bifrost.Ptr(true), // Include model in cache key (default: true)
	CacheByProvider: bifrost.Ptr(true), // Include provider in cache key (default: true)
}

// Create the plugin
plugin, err := semanticcache.Init(context.Background(), cacheConfig, logger, store)
if err != nil {
	log.Fatal("Failed to create semantic cache plugin:", err)
}

// Add to the Bifrost config
bifrostConfig := schemas.BifrostConfig{
	LLMPlugins: []schemas.LLMPlugin{plugin},
	// ... other config
}
```
Prerequisites: A vector store must be configured and enabled in config.json, and at least one provider must be configured, before the toggle becomes available.
Navigate to the Config page in the Bifrost UI and find the Plugins section.
Toggle the Enable Semantic Caching switch to enable it. The configuration form expands below.
Fill in the fields across the four sections:
Provider and Model Settings (required for semantic mode):
Configured Providers: Dropdown of providers already set up in Bifrost. The selected provider’s API keys are inherited automatically.
Embedding Model: The embedding model to use (e.g. text-embedding-3-small).
Cache Settings:
TTL (seconds): How long cached responses are kept (default: 300 s).
Similarity Threshold: Cosine similarity cutoff for a cache hit (0–1, default: 0.8).
Dimension: Vector dimension matching your embedding model (e.g. 1536 for text-embedding-3-small).
Conversation Settings:
Conversation History Threshold: Skip caching when the conversation has more than this many messages (default: 3).
Exclude System Prompt (toggle): Exclude system messages from cache-key generation.
Cache Behavior:
Cache by Model (toggle): Include the model name in the cache key (default: on).
Cache by Provider (toggle): Include the provider name in the cache key (default: on).
Click Save. Changes are persisted and applied immediately for enabled plugins via the API reload path; other plugin changes (e.g. via config.json) may still require a restart.
Note: Provider API keys are inherited automatically from the global provider configuration. You do not need to (and cannot) specify keys inside the plugin config.
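For reference, a config.json sketch of the same provider-backed settings is shown below. The keys cleanup_on_shutdown and vector_store_namespace appear elsewhere on this page; the plugin entry shape and the remaining key names are assumptions derived from the Go config fields, so verify them against your Bifrost version's config schema before copying:

```json
{
  "plugins": [
    {
      "name": "semanticcache",
      "enabled": true,
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "dimension": 1536,
        "ttl": "5m",
        "threshold": 0.8,
        "conversation_history_threshold": 3,
        "exclude_system_prompt": false,
        "cache_by_model": true,
        "cache_by_provider": true,
        "cleanup_on_shutdown": false
      }
    }
  ]
}
```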
Direct hash mode provides exact-match caching without requiring an embedding provider. Each request is hashed deterministically based on its normalized input, parameters, and stream flag. Identical requests produce cache hits; different wording is a cache miss.

Exact-match direct entries are stored and retrieved using a deterministic cache ID. This keeps repeated direct cache lookups fast and consistent across retries, streaming responses, and restarts.

When to use direct hash mode:
You only need exact-match deduplication (no fuzzy/semantic matching)
You cannot or do not want to call an external embedding API
You want the lowest possible latency with zero embedding overhead
Cost-sensitive environments where embedding API calls add up
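The hashing idea described above can be sketched as a deterministic digest over a canonical form of the normalized input, parameters, and stream flag. This is an illustrative assumption (SHA-256 over canonical JSON), not Bifrost's actual key-derivation code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"strings"
)

// directHashKey sketches deterministic request hashing: identical requests
// always produce the same key, while any change in wording, parameters, or
// the stream flag produces a different key (and therefore a cache miss).
// encoding/json sorts map keys, so the serialized form is deterministic.
func directHashKey(input string, params map[string]any, stream bool) string {
	payload, _ := json.Marshal(struct {
		Input  string         `json:"input"`
		Params map[string]any `json:"params"`
		Stream bool           `json:"stream"`
	}{strings.TrimSpace(input), params, stream})
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

func main() {
	p := map[string]any{"temperature": 0.0}
	a := directHashKey("What is the capital of France?", p, false)
	b := directHashKey("What is the capital of France?", p, false)
	c := directHashKey("Tell me France's capital city", p, false)
	fmt.Println(a == b) // identical requests share one key: cache hit
	fmt.Println(a == c) // different wording yields a different key: cache miss
}
```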
To enable direct-only mode globally, set dimension: 1 and omit the provider and embedding_model fields from the plugin config. The plugin will automatically fall back to direct search only.
Important: If you specify dimension: 1 and also provide a provider, Bifrost treats the config as provider-backed semantic mode, not direct-only mode. To use direct-only mode, omit the provider field entirely.
A vector store is still required as the storage backend, even in direct hash mode. See Recommended Vector Store below for the best choice.
Go SDK
Helm
config.json
```go
import (
	"github.com/maximhq/bifrost/plugins/semanticcache"
)

cacheConfig := &semanticcache.Config{
	// No Provider or EmbeddingModel -- direct hash mode only
	// Dimension 1 is a placeholder; entries are stored as metadata-only (no
	// embedding vectors). Change the dimension before switching to dual-layer
	// mode to avoid mixed-dimension issues.
	Dimension:         1,
	TTL:               5 * time.Minute,
	CleanUpOnShutdown: true,
	CacheByModel:      bifrost.Ptr(true),
	CacheByProvider:   bifrost.Ptr(true),
}

plugin, err := semanticcache.Init(ctx, cacheConfig, logger, store)
```
When initialized this way, all requests automatically use direct hash matching regardless of the x-bf-cache-type header. No embeddings are generated, and no embedding provider credentials are needed.
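Expressed in config.json, direct-only mode might look like the sketch below. The plugin entry shape and key names are assumptions derived from the Go config fields (note there is no provider or embedding_model key); verify against your Bifrost version's config schema:

```json
{
  "plugins": [
    {
      "name": "semanticcache",
      "enabled": true,
      "config": {
        "dimension": 1,
        "ttl": "5m",
        "cleanup_on_shutdown": true,
        "cache_by_model": true,
        "cache_by_provider": true
      }
    }
  ]
}
```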
Redis/Valkey-compatible stores are recommended for direct hash mode. They do not require vectors for metadata-only entries, and all cache fields are indexed as TAG fields for fast exact-match lookups.
Qdrant and Pinecone are not compatible with direct hash mode when no embedding provider is configured. These stores require a vector for every entry, and the plugin's zero-vector placeholder codepath requires an initialized embedding client, so storage will fail if no provider is set. Weaviate also requires a vector per entry and is therefore not recommended for direct-only mode either.
When the plugin is initialized without an embedding provider (direct-only mode), all requests use direct hash matching automatically, and the x-bf-cache-type header has no effect.

When the plugin is initialized with an embedding provider (dual-layer mode), you can force direct-only matching on specific requests using the x-bf-cache-type: direct header. See Cache Type Control for details.
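For example, in dual-layer mode a single request can be pinned to exact-match lookup by combining the headers from this page (the endpoint path shown is illustrative):

```shell
# Force exact-hash matching for this request only; semantic lookup is skipped
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-type: direct" \
     http://localhost:8080/v1/chat/completions ...
```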
Cache Key is mandatory: Semantic caching only activates when a cache key is provided. Without a cache key, requests bypass caching entirely.
Go SDK
HTTP API
Must set cache key in request context:
```go
// This request WILL be cached
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), request)

// This request will NOT be cached (no context value)
response, err = client.ChatCompletionRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), request)
```
Must set cache key in request header x-bf-cache-key:
```shell
# This request WILL be cached
curl -H "x-bf-cache-key: session-123" ...

# This request will NOT be cached (no header)
curl ...
```
Disable response caching while still allowing cache reads:
Go SDK
HTTP API
```go
// Read from cache but don't store the response
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
ctx = context.WithValue(ctx, semanticcache.CacheNoStoreKey, true)
```
```shell
# Read from cache but don't store the response
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-no-store: true" ...
```
When a response is served from the semantic cache, debug information is automatically added to the response.

Location: response.ExtraFields.CacheDebug (as a JSON object)

Fields:
CacheHit (boolean): true if the response was served from the cache, false when lookup fails.
HitType (string): "semantic" for similarity match, "direct" for hash match
CacheID (string): Unique cache entry ID for management operations (present only for cache hits)
Semantic Cache Only:
ProviderUsed (string): Provider used to compute the semantic-match embedding (present for both cache hits and misses)
ModelUsed (string): Model used to compute the semantic-match embedding (present for both cache hits and misses)
InputTokens (number): Number of tokens extracted from the request for the embedding computation (present for both cache hits and misses)
Threshold (number): Similarity threshold used for the match. (present only for cache hits)
Similarity (number): Similarity score for the match. (present only for cache hits)
Use the request ID from cached responses to clear specific entries:
Go SDK
HTTP API
```go
// Clear a specific entry by request ID
err := plugin.ClearCacheForRequestID("550e8400-e29b-41d4-a716-446655440000")

// Clear all entries for a cache key
err = plugin.ClearCacheForKey("support-session-456")
```
```shell
# Clear a specific cached entry by request ID
curl -X DELETE http://localhost:8080/api/cache/clear/550e8400-e29b-41d4-a716-446655440000

# Clear all entries for a cache key
curl -X DELETE http://localhost:8080/api/cache/clear-by-key/support-session-456
```
The semantic cache automatically handles cleanup to prevent storage bloat.

Automatic Cleanup:
TTL Expiration: Entries are automatically removed when TTL expires
Shutdown Cleanup: All cache entries and the vector store namespace itself are removed when the Bifrost client shuts down (if cleanup_on_shutdown is enabled; see below)
Namespace Isolation: Each Bifrost instance uses isolated vector store namespaces to prevent conflicts
Manual Cleanup Options:
Clear specific entries by request ID (see examples above)
Clear all entries for a cache key
Restart Bifrost to clear all cache data
The semantic cache namespace and all its cache entries are deleted when the Bifrost client shuts down only if cleanup_on_shutdown is set to true. By default (cleanup_on_shutdown: false), cache data persists across restarts. DO NOT use the plugin’s namespace for external purposes.
Dimension Changes: If you update the dimension config, the existing namespace will contain data with mixed dimensions, causing retrieval issues. To avoid this, either use a different vector_store_namespace or set cleanup_on_shutdown: true before restarting.
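For example, when moving to a larger embedding model, the fresh-namespace option could look like the fragment below (vector_store_namespace is the key used on this page; the dimension and namespace values are illustrative):

```json
{
  "dimension": 3072,
  "vector_store_namespace": "semantic-cache-3072"
}
```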
Vector Store Requirement: Semantic caching requires a configured vector store. Bifrost supports Weaviate, Redis/Valkey-compatible endpoints, Qdrant, and Pinecone. See the Vector Store documentation for setup details.