Overview

Bifrost offers two powerful methods for routing requests across AI providers, each serving different use cases:
  1. Governance-based Routing: Explicit, user-defined routing rules configured via Virtual Keys
  2. Adaptive Load Balancing: Automatic, performance-based routing powered by real-time metrics (Enterprise feature)
When both methods are available, governance takes precedence because users have explicitly defined their routing preferences through provider configurations on Virtual Keys.
When to use which method:
  • Use Governance when you need explicit control, compliance requirements, or specific cost optimization strategies
  • Use Adaptive Load Balancing for automatic performance optimization and minimal configuration overhead

The Model Catalog

The Model Catalog is Bifrost’s central registry that tracks which models are available from which providers. It powers both governance-based routing and adaptive load balancing by maintaining an up-to-date mapping of models to providers.

Data Sources

The Model Catalog combines two data sources:
  1. Pricing Data (Primary source)
    • Downloaded from a remote URL (configurable, defaults to Maxim’s pricing endpoint)
    • Contains model names, pricing tiers, and provider mappings
    • Synced to database on startup and refreshed every hour
    • Used for cost calculation and initial model-to-provider mapping
  2. Provider List Models API (Secondary source)
    • Calls each provider’s /v1/models endpoint (see the sketch after this list)
    • Enriches the catalog with provider-specific models and aliases
    • Called on Bifrost startup and when providers are added/updated
    • Adds models that may not be in pricing data yet
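For reference, a provider’s list models call is an ordinary OpenAI-compatible GET /v1/models request. A minimal Go sketch of such a fetch (shown against OpenAI’s public endpoint; Bifrost’s internal client is more involved):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// modelList mirrors the OpenAI-style GET /v1/models response shape.
type modelList struct {
	Data []struct {
		ID string `json:"id"` // model name, e.g. "gpt-4o"
	} `json:"data"`
}

func main() {
	req, err := http.NewRequest("GET", "https://api.openai.com/v1/models", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Bifrost treats this as a warning and keeps the provider usable
		// with models from pricing data.
		fmt.Println("failed to list models:", err)
		return
	}
	defer resp.Body.Close()

	var list modelList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		fmt.Println("failed to decode response:", err)
		return
	}
	for _, m := range list.Data {
		fmt.Println(m.ID)
	}
}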

Syncing Behavior

When Bifrost starts:
  1. Pricing data is loaded from the remote URL
  2. If successful, data is stored in the database (if config store is available)
  3. Model pool is populated from pricing data
  4. List models API is called for all configured providers
  5. Results are added to the model pool
If the list models API fails for a provider:
{"level":"warn","message":"failed to list models for provider ollama: failed to execute HTTP request to provider API"}
  • This is logged as a warning but does not stop startup
  • The provider can still be used with models from pricing data
While Bifrost is running:
  • Pricing data: Background worker checks every hour and syncs if interval elapsed
  • List models API: Re-fetched when provider is added/updated via API or dashboard
Sync failures are handled gracefully, in layers:
  • If the pricing URL fails but the database has existing data → use the database records (requires a config store)
  • If the pricing URL fails and no database data exists → startup fails with an error
  • If the list models API fails → log a warning and continue with pricing data only
A Virtual Key with an empty allowed_models list keeps working through partial failures: Bifrost validates supported models against whatever the Model Catalog currently holds. This multi-layered approach ensures routing continues even with partial sync failures; a sketch of the fallback logic follows.
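A sketch of this layered fallback in Go; all type and function names here are illustrative stand-ins, not Bifrost’s actual internals:

package main

import "fmt"

// Illustrative types only; not Bifrost's actual internals.
type PricingEntry struct{ Model, Provider string }

type ConfigStore interface {
	SavePricing([]PricingEntry) error
	LoadPricing() ([]PricingEntry, error)
}

// loadPricing applies the layered fallback described above: remote URL
// first, then existing database records, otherwise a hard startup error.
func loadPricing(fetchRemote func() ([]PricingEntry, error), store ConfigStore) ([]PricingEntry, error) {
	entries, err := fetchRemote()
	if err == nil {
		if store != nil {
			_ = store.SavePricing(entries) // cache for future startups
		}
		return entries, nil
	}
	if store != nil {
		if cached, dbErr := store.LoadPricing(); dbErr == nil && len(cached) > 0 {
			return cached, nil // remote failed; fall back to the database
		}
	}
	return nil, fmt.Errorf("pricing sync failed and no cached data available: %w", err)
}

func main() {
	fetch := func() ([]PricingEntry, error) { return nil, fmt.Errorf("remote unreachable") }
	if _, err := loadPricing(fetch, nil); err != nil {
		fmt.Println(err) // no remote data and no config store -> startup fails
	}
}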

How It’s Used in Routing

When a Virtual Key has empty allowed_models:
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": [],  // Empty = use Model Catalog
      "weight": 0.5
    }
  ]
}
Bifrost checks the Model Catalog:
  • Request for gpt-4o → ✅ Allowed (catalog shows OpenAI supports this)
  • Request for claude-3-sonnet → ❌ Rejected (catalog shows OpenAI doesn’t support this)
The Model Catalog is essential for cross-provider routing. Without it, Bifrost wouldn’t know that gpt-4o is available from both OpenAI and Azure, limiting routing flexibility.
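Conceptually, the catalog acts as a model-to-providers map. A toy Go sketch of the validation above (catalog contents hard-coded for illustration):

package main

import "fmt"

func main() {
	// Toy stand-in for the Model Catalog: model -> providers that serve it.
	catalog := map[string][]string{
		"gpt-4o":          {"openai", "azure"},
		"claude-3-sonnet": {"anthropic", "bedrock"},
	}

	// Validation for a Virtual Key whose openai provider_config has an
	// empty allowed_models list: fall back to the catalog.
	allowed := func(provider, model string) bool {
		for _, p := range catalog[model] {
			if p == provider {
				return true
			}
		}
		return false
	}

	fmt.Println(allowed("openai", "gpt-4o"))          // true  -> request allowed
	fmt.Println(allowed("openai", "claude-3-sonnet")) // false -> request rejected
}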

Governance-based Routing

Governance-based routing allows you to explicitly define which providers and models should handle requests for a specific Virtual Key. This method provides precise control over routing decisions.

How It Works

When a Virtual Key has provider_configs defined:
  1. Request arrives with a Virtual Key (e.g., x-bf-vk: vk-prod-main)
  2. Model validation: Bifrost checks if the requested model is allowed for any configured provider
  3. Provider filtering: Providers are filtered based on:
    • Model availability in allowed_models
    • Budget limits (current usage vs max limit)
    • Rate limits (tokens/requests per time window)
  4. Weighted selection: A provider is selected using weighted random distribution (see the sketch after this list)
  5. Provider prefix added: The model string becomes provider/model (e.g., openai/gpt-4o)
  6. Fallbacks created: The remaining providers, sorted by weight (descending), are added as fallbacks
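A minimal Go sketch of steps 4–6 above (weighted random selection, provider prefixing, and fallback ordering); the types and helper here are illustrative, not Bifrost’s actual code:

package main

import (
	"fmt"
	"math/rand"
	"sort"
)

type providerConfig struct {
	Provider string
	Weight   float64
}

// pickWeighted selects one provider by weighted random draw and returns
// it along with the rest, sorted by weight (descending), as fallbacks.
func pickWeighted(configs []providerConfig) (string, []string) {
	total := 0.0
	for _, c := range configs {
		total += c.Weight
	}
	r := rand.Float64() * total
	chosen := configs[len(configs)-1].Provider
	for _, c := range configs {
		if r < c.Weight {
			chosen = c.Provider
			break
		}
		r -= c.Weight
	}
	var rest []providerConfig
	for _, c := range configs {
		if c.Provider != chosen {
			rest = append(rest, c)
		}
	}
	sort.Slice(rest, func(i, j int) bool { return rest[i].Weight > rest[j].Weight })
	fallbacks := make([]string, len(rest))
	for i, c := range rest {
		fallbacks[i] = c.Provider
	}
	return chosen, fallbacks
}

func main() {
	// Providers already filtered for model availability, budget, and rate limits.
	configs := []providerConfig{{"openai", 0.3}, {"azure", 0.7}}
	provider, fallbacks := pickWeighted(configs)
	fmt.Printf("model: %s/gpt-4o, fallbacks: %v\n", provider, fallbacks)
}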

Configuration Example

{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"],
      "weight": 0.3,
      "budget": {
        "max_limit": 100.0,
        "current_usage": 45.0
      }
    },
    {
      "provider": "azure",
      "allowed_models": ["gpt-4o"],
      "weight": 0.7,
      "rate_limit": {
        "token_max_limit": 100000,
        "token_reset_duration": "1m"
      }
    }
  ]
}

Request Flow

Step 1: Request with Virtual Key

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod-main" \
  -d '{"model": "gpt-4o", "messages": [...]}'
Step 2: Governance Evaluation

  • OpenAI: ✅ Has gpt-4o in allowed_models, budget OK, weight 0.3
  • Azure: ✅ Has gpt-4o in allowed_models, rate limit OK, weight 0.7
Step 3: Weighted Selection

  • 70% chance → Azure
  • 30% chance → OpenAI
Step 4: Request Transformation

{
  "model": "azure/gpt-4o",
  "messages": [...],
  "fallbacks": ["openai/gpt-4o"]
}

Key Features

| Feature | Description |
| --- | --- |
| Explicit Control | Define exactly which providers and models are accessible |
| Budget Enforcement | Automatically exclude providers exceeding budget limits |
| Rate Limit Protection | Skip providers that have hit rate limits |
| Weighted Distribution | Control traffic distribution with custom weights |
| Automatic Fallbacks | Failed providers automatically retry with next highest weight |

Best Practices

Assign higher weights to cheaper providers for cost-sensitive workloads:
{
  "provider_configs": [
    {"provider": "groq", "weight": 0.7},
    {"provider": "openai", "weight": 0.3}
  ]
}
Create different Virtual Keys for dev/staging/prod with different provider access:
{
  "virtual_keys": [
    {
      "id": "vk-dev",
      "provider_configs": [{"provider": "ollama"}]
    },
    {
      "id": "vk-prod",
      "provider_configs": [{"provider": "openai"}, {"provider": "azure"}]
    }
  ]
}
Restrict specific Virtual Keys to compliant providers:
{
  "provider_configs": [
    {"provider": "azure", "allowed_models": ["gpt-4o"]},
    {"provider": "bedrock", "allowed_models": ["claude-3-sonnet-20240229"]}
  ]
}
Empty allowed_models: When left empty, Bifrost uses the Model Catalog (populated from pricing data and the provider’s list models API) to determine which models are supported. See the Model Catalog section above for how syncing works. For configuration instructions, see Governance Routing.

Adaptive Load Balancing

Enterprise Feature: Adaptive Load Balancing is available in Bifrost Enterprise. Contact us to enable it.
Adaptive Load Balancing automatically optimizes routing based on real-time performance metrics. It operates at two levels to provide both macro-level provider selection and micro-level key optimization.

Two-Level Architecture

Why Two Levels?

Separating provider selection (direction) from key selection (route) enables:
  • Provider-level optimization: Choose the best provider for a model based on aggregate performance
  • Key-level optimization: Within that provider, choose the best API key based on individual key performance
  • Resilience: Even when provider is specified (by governance or user), key-level load balancing still optimizes which API key to use

Level 1: Direction (Provider Selection)

When it runs: Only when the model string has no provider prefix (e.g., gpt-4o)
How it works:
  1. Model catalog lookup: Find all configured providers that support the requested model
  2. Provider filtering: Filter based on:
    • Allowed models from keys configuration
    • Keys availability for the provider
  3. Performance scoring: Calculate scores for each provider based on:
    • Error rates (50% weight)
    • Latency (20% weight, using MV-TACOS algorithm)
    • Utilization (5% weight)
    • Momentum bias (recovery acceleration)
  4. Smart selection: Choose provider using weighted random with jitter and exploration
  5. Fallbacks created: Remaining providers sorted by performance score (descending) are added as fallbacks

Level 2: Route (Key Selection)

When it runs: Always, even when provider is already specified (by governance, user, or Level 1)
How it works:
  1. Get available keys: Fetch all keys for the selected provider
  2. Filter by configuration: Apply model restrictions from key configuration
  3. Performance scoring: Calculate score for each key based on:
    • Error rates (recent failures)
    • Latency (response time)
    • TPM hits (rate limit violations)
    • Current state (Healthy, Degraded, Failed, Recovering)
  4. Weighted random selection: Choose key with exploration (25% chance to probe recovering keys)
  5. Circuit breaker: Skip keys with zero weight (TPM hits, repeated failures); a sketch of this loop follows below
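A hedged Go sketch of that loop (the state names, zero-weight circuit breaker, and 25% exploration figure come from the description above; the structure itself is illustrative):

package main

import (
	"fmt"
	"math/rand"
)

type apiKey struct {
	ID     string
	Weight float64 // performance-derived weight; 0 means circuit-broken
	State  string  // "Healthy", "Degraded", "Failed", "Recovering"
}

// selectKey applies the route-level rules: skip zero-weight keys
// (circuit breaker), occasionally probe recovering keys (exploration),
// otherwise do a weighted random draw over the remaining keys.
func selectKey(keys []apiKey) (apiKey, bool) {
	var live, recovering []apiKey
	for _, k := range keys {
		if k.Weight == 0 {
			continue // circuit breaker: TPM hits or repeated failures
		}
		if k.State == "Recovering" {
			recovering = append(recovering, k)
		}
		live = append(live, k)
	}
	if len(live) == 0 {
		return apiKey{}, false
	}
	// 25% chance to probe a recovering key so it can re-earn traffic.
	if len(recovering) > 0 && rand.Float64() < 0.25 {
		return recovering[rand.Intn(len(recovering))], true
	}
	total := 0.0
	for _, k := range live {
		total += k.Weight
	}
	r := rand.Float64() * total
	for _, k := range live {
		if r < k.Weight {
			return k, true
		}
		r -= k.Weight
	}
	return live[len(live)-1], true
}

func main() {
	keys := []apiKey{
		{"azure-key-1", 0.8, "Healthy"},
		{"azure-key-2", 0.1, "Recovering"},
		{"azure-key-3", 0.0, "Failed"}, // skipped entirely
	}
	if k, ok := selectKey(keys); ok {
		fmt.Println("selected:", k.ID)
	}
}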

Scoring Algorithm

The load balancer computes a performance score for each provider-model combination:

Score = (P_error × 0.5) + (P_latency × 0.2) + (P_util × 0.05) - M_momentum
Lower penalties = Higher weights = More traffic. The system self-heals by quickly penalizing failing routes but enabling fast recovery once issues are resolved.
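To make the formula concrete, a worked example with illustrative penalty values (not real measurements): a route with P_error = 0.10, P_latency = 0.50, P_util = 0.40, and a momentum credit M_momentum = 0.02 scores

Score = (0.10 × 0.5) + (0.50 × 0.2) + (0.40 × 0.05) - 0.02 = 0.05 + 0.10 + 0.02 - 0.02 = 0.15

A competing route identical except for P_error = 0.50 scores 0.35; since lower penalties translate to higher weights, the first route receives proportionally more traffic.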

Request Flow

Step 1: Request without Provider Prefix

curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "gpt-4o", "messages": [...]}'
Step 2: Model Catalog Lookup

Providers supporting gpt-4o: [openai, azure, groq]
Step 3: Performance Evaluation

  • OpenAI: Score 0.92 (low latency, 99% success rate)
  • Azure: Score 0.85 (medium latency, 98% success rate)
  • Groq: Score 0.65 (high latency recently)
Step 4: Provider Selection

OpenAI selected (highest score within jitter band)
Step 5: Request Transformation

{
  "model": "openai/gpt-4o",
  "messages": [...],
  "fallbacks": ["azure/gpt-4o", "groq/gpt-4o"]
}

Key Features

| Feature | Description |
| --- | --- |
| Automatic Optimization | No manual weight tuning required |
| Real-time Adaptation | Weights recomputed every 5 seconds based on live metrics |
| Circuit Breakers | Failing routes automatically removed from rotation |
| Fast Recovery | 90% penalty reduction in 30 seconds after issues resolve |
| Health States | Routes transition between Healthy, Degraded, Failed, and Recovering |
| Smart Exploration | 25% chance to probe potentially recovered routes |

Dashboard Visibility

Monitor load balancing performance in real-time:
[Image: Adaptive Load Balancing Dashboard]
The dashboard shows:
  • Weight distribution across provider-model-key routes
  • Performance metrics (error rates, latency, success rates)
  • State transitions (Healthy → Degraded → Failed → Recovering)
  • Actual vs expected traffic distribution

How Governance and Load Balancing Interact

When both methods are available in your Bifrost deployment, they work together in a complementary way across two levels.
Key Insight: Load balancing has two levels:
  • Level 1 (Direction/Provider): Skipped when provider is already specified
  • Level 2 (Route/Key): Always runs, even when provider is specified
This means key-level optimization works regardless of how the provider was chosen!

Execution Flow

Execution Order

  1. HTTPTransportIntercept (Governance Plugin - Provider Level)
    • Runs first in the request pipeline
    • Checks if Virtual Key has provider_configs
    • If yes: adds provider prefix (e.g., azure/gpt-4o)
    • Result: Provider is selected by governance rules
  2. Middleware (Load Balancing Plugin - Provider Level / Direction)
    • Runs after HTTPTransportIntercept
    • Checks if model string contains "/"
    • If yes: skips provider selection (already determined by governance or user)
    • If no: performs performance-based provider selection
    • Result: Provider prefix added if not already present
  3. KeySelector (Load Balancing - Key Level / Route)
    • Always runs during request execution in Bifrost core
    • Gets all keys for the selected provider
    • Filters keys based on model restrictions
    • Scores each key by performance metrics
    • Selects best key using weighted random + exploration
    • Result: Optimal key selected within the provider
Important: Even when governance specifies azure/gpt-4o, load balancing still optimizes which Azure key to use based on performance metrics. This is the power of the two-level architecture!
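A condensed Go sketch of this execution order (plugin and helper names are illustrative, not Bifrost’s real interfaces):

package main

import (
	"fmt"
	"strings"
)

// routeRequest condenses the execution order above: governance first,
// then load balancing Level 1 only if no provider prefix exists yet,
// then key selection (Level 2) always.
func routeRequest(model, virtualKey string) {
	// 1. Governance plugin: adds a provider prefix when the Virtual Key
	// has provider_configs.
	if provider, ok := governanceProvider(virtualKey); ok && !strings.Contains(model, "/") {
		model = provider + "/" + model
	}

	// 2. Load balancing Level 1 (direction): skipped when a prefix is present.
	if !strings.Contains(model, "/") {
		model = bestProviderFor(model) + "/" + model
	}

	// 3. Load balancing Level 2 (route): always runs, whoever chose the provider.
	provider := strings.SplitN(model, "/", 2)[0]
	fmt.Printf("forwarding %s using key %s\n", model, bestKeyFor(provider))
}

// Illustrative stand-ins for governance rules and performance scoring.
func governanceProvider(vk string) (string, bool) { return "azure", vk == "vk-prod-main" }
func bestProviderFor(model string) string         { return "openai" }
func bestKeyFor(provider string) string           { return provider + "-key-1" }

func main() {
	routeRequest("gpt-4o", "vk-prod-main") // governance picks azure; Level 2 picks the key
	routeRequest("gpt-4o", "")             // no VK: Level 1 picks the provider; Level 2 the key
}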

Example Scenarios

Setup:
  • Virtual Key has provider_configs defined
  • No adaptive load balancing enabled
Request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod-main" \
  -d '{"model": "gpt-4o", "messages": [...]}'
Behavior:
  1. Governance applies weighted provider routing → selects Azure (70% weight)
  2. Model becomes azure/gpt-4o
  3. Standard key selection (non-adaptive) chooses an Azure key based on static weights
  4. Request forwarded to Azure with selected key

Provider vs Key Selection Rules

| Scenario | Provider Selection | Key Selection |
| --- | --- | --- |
| VK with provider_configs | Governance (weighted random) | Standard or Adaptive (if enabled) |
| VK without provider_configs + LB | Load Balancing Level 1 (performance) | Load Balancing Level 2 (performance) |
| No VK + LB | Load Balancing Level 1 (performance) | Load Balancing Level 2 (performance) |
| Model with provider prefix + LB | Skip (already specified) | Load Balancing Level 2 (performance) ✅ |
| No Load Balancing enabled | Governance or User or Model Catalog | Standard (static weights) |
Critical Insight:
  • Provider selection follows a precedence hierarchy: Governance → user-specified provider prefix → Load Balancing Level 1
  • Key selection runs independently and benefits from load balancing even when provider is predetermined
This separation is what makes the two-level architecture so powerful!

Choosing the Right Approach

  1. Use Governance When:
    • ✅ Compliance requirements: Need to ensure data stays in specific regions or providers
    • ✅ Cost optimization: Want explicit control over traffic distribution to cheaper providers
    • ✅ Budget enforcement: Need hard limits on spending per provider
    • ✅ Environment separation: Different teams/apps need different provider access
    • ✅ Rate limit management: Need to respect provider-specific rate limits
  2. Use Load Balancing When:
    • ✅ Performance optimization: Want automatic routing to best-performing providers
    • ✅ Minimal configuration: Prefer hands-off operation with intelligent defaults
    • ✅ Dynamic workloads: Traffic patterns change frequently
    • ✅ Automatic failover: Need instant adaptation to provider issues
    • ✅ Multi-provider redundancy: Want seamless provider switching based on availability
  3. Use Both When:
    • ✅ Hybrid requirements: Some Virtual Keys need governance, others can use load balancing
    • ✅ Progressive rollout: Start with governance, gradually adopt load balancing
    • ✅ Selective optimization: Governance for sensitive workloads, load balancing for others

Additional Resources