Overview
Semantic caching uses vector similarity search to intelligently cache AI responses, serving cached results for semantically similar requests even when the exact wording differs. This dramatically reduces API costs and latency for repeated or similar queries.
Key Benefits:
- Cost Reduction: Avoid expensive LLM API calls for similar requests
- Improved Performance: Sub-millisecond cache retrieval vs multi-second API calls
- Intelligent Matching: Semantic similarity beyond exact text matching
- Streaming Support: Full streaming response caching with proper chunk ordering
Core Features
- Dual-Layer Caching: Exact hash matching + semantic similarity search (customizable threshold)
- Vector-Powered Intelligence: Uses embeddings to find semantically similar requests
- Dynamic Configuration: Per-request TTL and threshold overrides via headers/context
- Model/Provider Isolation: Separate caching per model and provider combination
Vector Store Setup
- Go SDK
- config.json
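The code tabs above were not captured in this copy. As a rough illustration only, a Weaviate-backed vector store entry in config.json might look like the following; every key name and value here is an assumption, so consult the Bifrost vector store reference for the exact schema:

```json
{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "scheme": "http",
      "host": "localhost:8080"
    }
  }
}
```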
Semantic Cache Configuration
- Go SDK
- Web UI
- config.json
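The configuration tabs were likewise not captured. The sketch below shows the kinds of options this page describes (default TTL, similarity threshold, conversation history threshold, system prompt handling, and the cleanup/namespace settings mentioned later). The keys cleanup_on_shutdown, dimension, and vector_store_namespace appear verbatim in this document; the remaining key names and all values are illustrative assumptions:

```json
{
  "plugins": [
    {
      "name": "semantic_cache",
      "config": {
        "ttl": "5m",
        "threshold": 0.8,
        "conversation_history_threshold": 3,
        "exclude_system_prompt": false,
        "cleanup_on_shutdown": false,
        "dimension": 1536,
        "vector_store_namespace": "bifrost_semantic_cache"
      }
    }
  ]
}
```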
Cache Triggering
Cache Key is mandatory: Semantic caching only activates when a cache key is provided. Without a cache key, requests bypass caching entirely.
- Go SDK
- HTTP API
You must set the cache key in the request context:
Per-Request Overrides
Override the default TTL and similarity threshold per request:
- Go SDK
- HTTP API
You can set the TTL and threshold in the request context, under the keys you configured in the plugin config:
Advanced Cache Control
Cache Type Control
Control which caching mechanism to use per request:
- Go SDK
- HTTP API
No-Store Control
Disable response caching while still allowing cache reads:
- Go SDK
- HTTP API
Conversation Configuration
History Threshold Logic
The ConversationHistoryThreshold setting skips caching for conversations with many messages to prevent false positives:
Why this matters:
- Semantic False Positives: Long conversation histories have high probability of semantic matches with unrelated conversations due to topic overlap
- Direct Cache Inefficiency: Long conversations rarely have exact hash matches, making direct caching less effective
- Performance: Reduces vector store load by filtering out low-value caching scenarios
Recommended values:
- 1-2: Very conservative (may miss valuable caching opportunities)
- 3-5: Balanced approach (default: 3)
- 10+: Cache longer conversations (higher false positive risk)
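The threshold logic above can be sketched as a simple predicate. Note the exact comparison the plugin uses (strictly greater vs. greater-or-equal) is an assumption; this sketch skips caching once the message count exceeds the threshold:

```go
package main

import "fmt"

// shouldAttemptCache mirrors the documented behavior: once a conversation's
// message count exceeds ConversationHistoryThreshold, caching is skipped to
// avoid semantic false positives on long, topic-overlapping histories.
func shouldAttemptCache(messageCount, historyThreshold int) bool {
	return messageCount <= historyThreshold
}

func main() {
	for _, n := range []int{1, 3, 4, 10} {
		fmt.Printf("messages=%d cache=%v\n", n, shouldAttemptCache(n, 3))
	}
}
```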
System Prompt Handling
Control whether system messages are included in cache key generation.
When set to true (system prompts excluded from the cache key), useful when:
- System prompts change frequently but the content is similar
- There are multiple system prompt variations for the same use case
- You want to focus caching on user content similarity
When set to false (system prompts included in the cache key), useful when:
- System prompts significantly change response behavior
- Each system prompt requires distinct cached responses
- Strict response consistency requirements
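The effect of this toggle on direct (hash) matching can be illustrated with a toy cache-key hash. This is not the plugin's actual key derivation, just a stand-in showing why excluding system messages lets prompts that differ only in their system text still collide on the same cache entry:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type message struct{ Role, Content string }

// directCacheHash is an illustrative stand-in for a hash-based cache key:
// it hashes message roles and contents in order, optionally dropping
// system messages so they don't affect the key.
func directCacheHash(msgs []message, excludeSystem bool) string {
	h := sha256.New()
	for _, m := range msgs {
		if excludeSystem && m.Role == "system" {
			continue
		}
		// Null separators keep field boundaries unambiguous.
		h.Write([]byte(m.Role + "\x00" + m.Content + "\x00"))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := []message{{"system", "You are terse."}, {"user", "What is 2+2?"}}
	b := []message{{"system", "You are verbose."}, {"user", "What is 2+2?"}}
	fmt.Println(directCacheHash(a, true) == directCacheHash(b, true))   // same key: system ignored
	fmt.Println(directCacheHash(a, false) == directCacheHash(b, false)) // distinct keys
}
```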
Cache Management
Cache Metadata Location
When a request passes through the semantic cache, debug metadata is automatically added to the response.
Location: response.ExtraFields.CacheDebug (as a JSON object)
Fields:
- CacheHit (boolean): true if the response was served from the cache, false when the lookup fails
- HitType (string): "semantic" for a similarity match, "direct" for a hash match
- CacheID (string): Unique cache entry ID for management operations (present only for cache hits)
- ProviderUsed (string): Provider used to calculate the semantic-match embedding (present for both cache hits and misses)
- ModelUsed (string): Model used to calculate the semantic-match embedding (present for both cache hits and misses)
- InputTokens (number): Number of tokens extracted from the request for the embedding calculation (present for both cache hits and misses)
- Threshold (number): Similarity threshold used for the match (present only for cache hits)
- Similarity (number): Similarity score for the match (present only for cache hits)
Clear Specific Cache Entry
Use the cache entry ID (CacheID) from cached responses to clear specific entries:
- Go SDK
- HTTP API
Cache Lifecycle & Cleanup
The semantic cache automatically handles cleanup to prevent storage bloat.
Automatic Cleanup:
- TTL Expiration: Entries are automatically removed when their TTL expires
- Shutdown Cleanup: All cache entries and the vector store namespace itself are cleared when the Bifrost client shuts down (only when cleanup_on_shutdown is enabled; see the note below)
- Namespace Isolation: Each Bifrost instance uses isolated vector store namespaces to prevent conflicts
Manual Cleanup:
- Clear specific entries by cache entry ID (see examples above)
- Clear all entries for a cache key
- Restart Bifrost to clear all cache data
The semantic cache namespace and all its cache entries are deleted when the Bifrost client shuts down only if cleanup_on_shutdown is set to true. By default (cleanup_on_shutdown: false), cache data persists between restarts. Do not use the plugin's namespace for external purposes.
Dimension Changes: If you update the dimension config, the existing namespace will contain data with mixed dimensions, causing retrieval issues. To avoid this, either use a different vector_store_namespace or set cleanup_on_shutdown: true before restarting.
Vector Store Requirement: Semantic caching requires a configured vector store (currently Weaviate only). Without a vector store setup, the plugin will not function.

