Skip to main content

Overview

Bifrost can cache LLM responses and replay them for repeated requests, avoiding a round-trip to the provider. It offers two complementary lookup paths:
  • Direct (hash) matching — deterministic, exact-match replay. The request is normalized and hashed; an identical request is served instantly. No embeddings required.
  • Semantic (similarity) matching — embedding-based lookup that serves a cached answer when a new request is close enough to a previous one, even if the wording differs.
Both paths can run together (direct first, semantic on miss), or you can run direct-only with no embedding provider at all.
In the Web UI this feature is labeled Local Cache (under Settings → Caching). “Semantic caching” refers to the embedding-based mode; “direct” mode is the embedding-free path. They are the same plugin (semantic_cache).
Key benefits:
  • Cost reduction — skip paid LLM calls for repeated or similar prompts.
  • Lower latency — sub-millisecond cache reads vs. multi-second provider calls.
  • Two modes — exact-match deduplication (direct) or fuzzy similarity (semantic).
  • Streaming support — streamed responses are cached and replayed chunk-by-chunk.

How it works

A few things that trip up first-time users — read these before configuring:
  1. A cache key is mandatory. Caching only engages when a request carries a cache key (the x-bf-cache-key header, or the CacheKey context value in the Go SDK). Without one — and without a configured default_cache_key — the request bypasses the cache entirely. This is the single most common reason “nothing is being cached.”
  2. Direct runs before semantic. When both paths are enabled, a direct hash hit is served first; the semantic search only runs on a direct miss. You can narrow a request to one path with the x-bf-cache-type header.
  3. Writes are asynchronous. On a cache miss, Bifrost returns the provider’s response immediately and stores it in the background, so the first request never blocks on a cache write.
  4. Entries persist across restarts. Cache entries live in your vector store with a per-entry expiry (expires_at). They are not purged when Bifrost shuts down — a restart keeps serving warm cache (see Cache lifecycle).
What gets cached: chat completions, text completions, the Responses API (including WebSocket), embeddings, transcriptions, speech, and image generation — including their streaming variants.
Latency overhead. The cache lookup itself adds latency to every cache-enabled request, and the cost differs per path:
  • Direct lookup — one vector store round-trip per request, hit or miss. Sub-millisecond to a few milliseconds with a local Redis/Valkey; higher with remote or managed stores. (Computing the request hash itself is in-process and takes microseconds — the round-trip is the only real cost.)
  • Semantic lookup — runs on every direct miss, and must embed the incoming request before it can search. That means one embedding API call to your provider (typically tens to a few hundred milliseconds) plus a vector similarity search, paid upfront regardless of the outcome. A semantic hit therefore costs roughly an embedding round-trip — not the near-instant replay of a direct hit — and a semantic miss pays the embedding call on top of the full LLM call, making it slower than running without the cache.
  • Cache writes — asynchronous; they add no latency to the response.

Prerequisites

  1. A vector store is required as the storage backend for both modes — even direct-only mode stores its entries there. Bifrost supports:

Redis / Valkey

In-memory, RediSearch-compatible. Recommended for direct-only mode.

Weaviate

Production-ready vector database with gRPC support.

Qdrant

Rust-based vector search engine with advanced filtering.

Pinecone

Managed, serverless vector database service.
  1. An embedding-capable provider — only if you want semantic mode. Direct-only mode needs no provider.
See the Vector Store documentation for per-store setup. The vector store must be enabled in config.json before the Enable Caching toggle becomes available in the UI.
Minimal vector store config (Redis/Valkey):
{
  "vector_store": {
    "enabled": true,
    "type": "redis",
    "config": {
      "addr": "localhost:6379"
    }
  }
}
For Valkey, keep vector_store.type as "redis" and point config.addr at your Valkey endpoint.

Configuration

Local Cache configuration page
  1. Configure and enable a vector store in config.json (see Prerequisites). Without it, the toggle stays disabled.
  2. In the Bifrost UI, go to Settings → Caching. You’ll see the Local Cache panel.
  3. Flip Enable Caching on. The plugin loads live — no server restart needed.
  4. Pick a Cache Mode using the tabs at the top of the panel:
    • Direct only — exact-match caching. No provider or embeddings. Cheapest path; ideal for stable, repeated prompts.
    • Direct + Semantic — adds vector similarity on top of direct matching. Requires an embedding-capable provider. (This tab is disabled until at least one embedding-capable provider is configured.)
  5. For semantic mode, fill in the embedding provider, model, and dimension that appear below the tabs:
    • Configured Providers — an embedding-capable provider already set up in Bifrost. Its API keys are inherited automatically.
    • Embedding Model — e.g. text-embedding-3-small.
    • Dimension — the vector size the model produces. Must match the model exactly (e.g. 1536 for text-embedding-3-small, 3072 for text-embedding-3-large, 768 for many Cohere/Voyage models).
  6. Tune Cache Settings, Storage & Cache Key, Conversation Settings, and Cache Key Composition (all explained in the field reference below).
  7. Click Save Changes. Config changes mutate the live plugin in place.
  8. Send a request with an x-bf-cache-key header to start caching (see Triggering the cache).

Field reference

FieldTypeDefaultDescription
providerstringEmbedding provider. Required for semantic mode; omit for direct-only.
embedding_modelstringEmbedding model name. Required when provider is set.
dimensionintegerVector size. Use 1 for direct-only mode; the embedding model’s real dimension (> 1) for semantic mode. Required.
ttlduration / seconds5m (300s)How long entries live before they expire. Accepts a duration string ("5m") or numeric seconds (300).
thresholdnumber (0–1)0.8Minimum cosine similarity for a semantic hit. Semantic mode only.
conversation_history_thresholdinteger3Skip caching when a conversation has more than this many messages. UI range: 1–50.
exclude_system_promptbooleanfalseExclude system messages from cache-key generation.
cache_by_modelbooleantrueInclude the model name in the cache key (different models won’t share entries).
cache_by_providerbooleantrueInclude the provider name in the cache key (different providers won’t share entries).
vector_store_namespacestringBifrostSemanticCachePluginBucket/index where entries live. Changing it points the plugin at a different (possibly empty) bucket; old entries aren’t deleted, just no longer queried.
default_cache_keystring"" (empty)Fallback cache key used when a request doesn’t send x-bf-cache-key. Left empty, caching is disabled for any request without the header.

Direct vs. semantic mode

Direct onlyDirect + Semantic
MatchesExact (normalized) requestExact and semantically similar
Embedding providerNot neededRequired
Cost per missZero embedding costOne embedding call per miss
Added latencyOne vector store round-trip per requestStore round-trip, plus an embedding call + similarity search on every direct miss
Best forStable, repeated prompts; strict dedupParaphrased / varied user queries
dimension1The model’s real vector size (> 1)

Direct-only setup

Direct mode hashes each request deterministically from its normalized input, parameters, and stream flag. Identical requests hit; any difference is a miss. The deterministic cache ID keeps repeated lookups consistent across retries, streaming, and restarts. To enable direct-only mode, set dimension: 1 and omit provider and embedding_model. In the UI, pick the Direct only tab.
If you set dimension: 1 and also provide a provider, Bifrost treats the config as semantic mode, not direct-only. To use direct-only mode, omit provider entirely.
{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "dimension": 1,
        "ttl": "5m",
        "cache_by_model": true,
        "cache_by_provider": true
      }
    }
  ]
}
In direct-only mode, all requests use hash matching regardless of the x-bf-cache-type header — no embeddings are generated and no embedding credentials are needed. Redis/Valkey-compatible stores are recommended for direct-only mode. They don’t require a vector for metadata-only entries, and all cache fields are indexed as TAG fields for fast exact-match lookups.
Qdrant, Pinecone, and Weaviate are not suitable for direct-only mode. They require a vector for every entry; the plugin’s zero-vector placeholder codepath needs an initialized embedding client, so storage fails when no provider is configured. Use Redis/Valkey for direct-only.

Triggering the cache

A cache key is mandatory. Caching only activates when a request carries a cache key. Without one (and without a configured default_cache_key), the request bypasses caching entirely.
The cache key is the partition every lookup and write is scoped to — it’s part of the cache entry’s identity alongside the model and provider. It exists for two reasons:
  • Isolation (no cross-talk). Entries are only ever matched within the same key. A request under tenant-A can never be served a response cached under tenant-B, even if the prompts are identical. This prevents one user, tenant, or feature from leaking cached answers to another — the key is how you draw that boundary (per user, per session, per feature, per tenant, etc.).
  • Explicit opt-in. Caching changes behavior — a response can be replayed instead of freshly generated. Requiring a key makes that a deliberate choice per request (or per deployment via default_cache_key), so you never accidentally serve a cached answer where you wanted a live one.
Pick a key granularity that matches how much you want to share: a coarse key (e.g. a feature name) maximizes hit rate across users; a fine key (e.g. a per-user or per-session ID) keeps caches private at the cost of fewer hits.
Set the cache key in the x-bf-cache-key header:
# This request WILL be cached
curl -H "x-bf-cache-key: session-123" ...

# This request will NOT be cached (no header, no default_cache_key)
curl ...

Per-request overrides

Every plugin default can be overridden per request via headers (HTTP) or context keys (Go SDK).
HeaderContext key (Go)ValueEffect
x-bf-cache-keyCacheKeystringScope this request to a cache partition. Required (or default_cache_key) for caching to engage.
x-bf-cache-ttlCacheTTLKeyduration string or secondsOverride TTL for this request. Invalid values are ignored.
x-bf-cache-thresholdCacheThresholdKeyfloat (0–1)Override the semantic similarity threshold. Clamped to [0,1].
x-bf-cache-typeCacheTypeKeydirect or semanticLimit lookup to a single path.
x-bf-cache-no-storeCacheNoStoreKeytrueSkip writing the response (still serves cached hits).
# Custom TTL and threshold
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-ttl: 30s" \
     -H "x-bf-cache-threshold: 0.9" ...

# Force direct-only matching
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-type: direct" ...

# Read from cache but don't store the response
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-no-store: true" ...
In direct-only mode (no embedding provider), x-bf-cache-type and x-bf-cache-threshold have no effect — every request uses direct matching.

Cache management

Every cached or cache-checked response carries debug metadata so you can confirm caching is working and capture the entry’s ID for management. Location: response.ExtraFields.CacheDebug
FieldWhen presentDescription
cache_hitalwaystrue if served from cache, false otherwise.
cache_idalwaysStorage ID of the entry — use it to invalidate later.
hit_typeon hit"direct" or "semantic".
thresholdon semantic hitSimilarity threshold used.
similarityon semantic hitActual cosine similarity score.
provider_usedwhen semantic search ranEmbedding provider used.
model_usedwhen semantic search ranEmbedding model used.
input_tokenswhen semantic search ranTokens consumed computing the embedding.
Examples:
// Direct hit
{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "direct",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001"
    }
  }
}

// Semantic hit
{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "semantic",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "threshold": 0.8,
      "similarity": 0.95,
      "provider_used": "openai",
      "model_used": "text-embedding-3-small",
      "input_tokens": 100
    }
  }
}

// Miss (semantic search ran but found nothing close enough)
{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": false,
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "provider_used": "openai",
      "model_used": "text-embedding-3-small",
      "input_tokens": 20
    }
  }
}
On a streamed response, only the final chunk carries the full cache_debug payload.
Cache outcomes also surface in Logs without inspecting the raw response: Log detail sheet showing the Semantic Cache badge and Caching Details block
  • Hit-type badge — a cache hit is tagged with a Direct Cache or Semantic Cache badge on the log entry.
  • Cache row — each cached request shows a Cache (hit) / Cache (miss) row with the copyable cache_id.
  • Caching Details block — expands to the cache_debug fields: cache type, and for semantic hits the embedding provider, embedding model, threshold, similarity score, and embedding input tokens.
  • Local Caching filter — the logs filter sidebar lets you filter requests by hit type (Direct cache / Semantic cache).

Invalidation

Use the cache_id from cache_debug to invalidate entries.
# Clear a specific cached entry by cache ID
curl -X DELETE http://localhost:8080/api/cache/clear/550e8500-e29b-41d4-a725-446655440001

# Clear all entries for a cache key
curl -X DELETE http://localhost:8080/api/cache/clear-by-key/support-session-456

Lifecycle & Cleanup

  • TTL expiration — every entry is stored with an expires_at timestamp. Expired entries are no longer served and are swept out over time.
  • Entries persist across restarts — cache data lives in your vector store and is not purged when Bifrost shuts down. A restart resumes serving the existing (unexpired) cache. To wipe entries, use the cache-clear APIs or clear the namespace in your vector store directly.
  • Namespace isolation — each vector_store_namespace is an independent cache pool. Use distinct namespaces to keep separate caches from colliding.
Changing dimension, provider, or embedding_model: a vector store namespace can hold vectors of one dimension only. The namespace is not recreated automatically — CreateNamespace is a no-op when the class/collection already exists. If the new embedding model produces a different vector size, subsequent writes fail (size mismatch) and reads silently miss. Before saving such a change, either:
  • point vector_store_namespace at a fresh name, or
  • drop the existing class/index in your vector store.

Troubleshooting

Most common cause: no cache key. Caching only engages when a request sends x-bf-cache-key (or you’ve set a default_cache_key). Confirm the header is present, then check cache_debug on the response.
Expected. The cache is populated after the first response is returned (writes are asynchronous). Send the same request again to see a hit.
  • Verify dimension exactly matches your embedding model’s output size.
  • Lower the threshold (e.g. 0.80.75) if genuinely-similar prompts aren’t matching.
  • Check cache_debug.similarity on a miss to see how close you got.
You changed dimension/provider/embedding_model against an existing namespace. See the dimension-change warning — use a fresh namespace or drop the old class/index.
No embedding-capable provider is configured. Add one under Providers first; its keys are inherited automatically.
No vector store is enabled. Configure and enable one in config.json (see Prerequisites).
You’re likely using Qdrant, Pinecone, or Weaviate, which require a vector per entry. Switch to Redis/Valkey for direct-only mode.

Next steps