Overview

Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:
| Parameter | Scope | Default | Description |
|---|---|---|---|
| Concurrency | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| Buffer Size | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| Initial Pool Size | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |
These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
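For reference, a minimal configuration that states these defaults explicitly looks like this (a sketch using the same keys as the examples later on this page):
{
    "config": {
        "initial_pool_size": 5000,
        "drop_excess_requests": false
    },
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 1000,
                "buffer_size": 5000
            }
        }
    }
}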

Understanding the Parameters

Concurrency (Per Provider)

What it does: Controls two aspects of provider performance:
  1. Worker Goroutines: The number of goroutines that process requests for each provider. Each worker pulls requests from the provider’s queue and executes them against the provider’s API.
  2. Provider Pool Pre-warming: Pre-allocates provider-specific response objects (e.g., AnthropicMessageResponse, OpenAIResponse) in sync pools to reduce allocations during request handling.
Impact:
  • Higher concurrency = More parallel requests to the provider, higher throughput, more pre-allocated response objects
  • Lower concurrency = Fewer parallel requests, lower resource usage, respects provider rate limits
Default: 1000 workers per provider
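For example, to run a provider with fewer workers than the default (here OpenAI is capped at 100 workers with a 500-request queue):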
{
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 100,
                "buffer_size": 500
            }
        }
    }
}

Buffer Size (Per Provider)

What it does: Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.
Impact:
  • Larger buffer = More requests can be queued during traffic spikes, handles burst traffic better
  • Smaller buffer = Lower memory footprint, faster backpressure signals to clients
Default: 5000 requests per provider queue
Queue Full Behavior: Controlled by drop_excess_requests:
  • false (default): New requests block until queue space is available
  • true: New requests are immediately dropped with an error when queue is full
Constraint: Buffer size must be greater than or equal to concurrency. If concurrency > buffer_size, provider setup will fail.
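For example, a provider sized for 500 RPS would pair a 750-request queue (1.5× concurrency, see Sizing Guidelines below) with the global overflow setting (a sketch):
{
    "config": {
        "drop_excess_requests": false
    },
    "providers": {
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 750
            }
        }
    }
}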

Initial Pool Size (Global)

What it does: Controls the number of pre-allocated objects in Bifrost’s internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
Pooled Objects:
  • Channel messages (request wrappers)
  • Response channels
  • Error channels
  • Stream channels
  • Plugin pipelines
  • Request objects
Impact:
  • Higher initial pool = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
  • Lower initial pool = Lower initial memory footprint, may cause more allocations under load
Default: 5000 objects per pool
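For example, to raise the pool size above the default for a higher-traffic deployment: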
{
    "config": {
        "initial_pool_size": 10000,
        "drop_excess_requests": false
    }
}

Sizing Guidelines

Concurrency & Buffer Size (Per Provider)

Configure these settings per provider based on the expected RPS for that specific provider:
| Provider RPS | Concurrency | Buffer Size |
|---|---|---|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
Example: If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with concurrency: 2000, buffer_size: 3000 and Anthropic with concurrency: 500, buffer_size: 750.
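In config form, that example looks like this (providers block only; the global pool size is covered in the next section):
{
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 2000,
                "buffer_size": 3000
            }
        },
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 750
            }
        }
    }
}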
Formula:
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
This ratio ensures:
  • Enough queue capacity to absorb traffic bursts
  • Workers are never starved for work
  • Backpressure is applied before memory exhaustion

Initial Pool Size (Global)

Configure this setting based on total RPS across all providers combined:
| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---|---|---|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
Formula:
initial_pool_size = 1.5 × total_expected_rps
Additionally, ensure:
initial_pool_size >= max(buffer_size across all providers)
This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
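Worked example, continuing the 2000 RPS OpenAI + 500 RPS Anthropic scenario from above:
total_expected_rps = 2000 + 500 = 2500
initial_pool_size  = 1.5 × 2500 = 3750
max buffer_size    = 3000 (OpenAI)  →  3750 >= 3000 ✓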

Multi-Node Deployments

When running multiple Bifrost instances behind a load balancer, size the aggregate settings from your total expected RPS, then divide them by the number of nodes to get each node's settings.

Formula

Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes

Example: 10,000 RPS Across 4 Nodes

Total capacity (aggregate across all 4 nodes):
  • Total RPS: 10,000 RPS
  • Per-node RPS: ~2,500 RPS per node
Single node settings for 10,000 RPS (if running on one node):
  • Concurrency: 10000
  • Buffer Size: 15000
  • Initial Pool Size: 15000
Per-node settings (4 nodes, 10,000 RPS total):
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|---|---|---|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
{
    "config": {
        "initial_pool_size": 3750,
        "drop_excess_requests": false
    },
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 2500,
                "buffer_size": 3750
            }
        },
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 2500,
                "buffer_size": 3750
            }
        }
    }
}
Kubernetes Horizontal Pod Autoscaling: When using HPA, configure settings for your minimum replica count. As pods scale up, each node handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to dynamically adjust settings based on replica count.

Provider-Specific Tuning

Different providers have different rate limits and latency characteristics. Tune each provider independently:

Provider Rate Limit Considerations

| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|---|---|---|---|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |

Example: Mixed Provider Configuration

{
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 200,
                "buffer_size": 1000
            }
        },
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 100,
                "buffer_size": 500
            }
        },
        "groq": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 2500
            }
        },
        "ollama": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 20,
                "buffer_size": 100
            }
        }
    }
}

Queue Overflow Handling

When the provider queue reaches capacity, Bifrost’s behavior is controlled by drop_excess_requests:

Blocking Mode (Default)

{
    "config": {
        "drop_excess_requests": false
    }
}
  • New requests wait until queue space is available
  • Ensures no requests are lost
  • May increase latency during high load
  • Suitable for critical workloads where every request matters

Drop Mode

{
    "config": {
        "drop_excess_requests": true
    }
}
  • New requests are immediately rejected when queue is full
  • Returns error: "request dropped: queue is full"
  • Maintains consistent latency for accepted requests
  • Suitable for real-time applications where stale requests are useless
Best Practice: Use drop_excess_requests: true with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
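For example, a provider sized for 1000 RPS following that guidance (a sketch):
{
    "config": {
        "drop_excess_requests": true
    },
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 1000,
                "buffer_size": 1500
            }
        }
    }
}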

Monitoring and Diagnostics

Key Metrics to Monitor

| Metric | Healthy Range | Action if Exceeded |
|---|---|---|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |

Health Check Endpoint

The Gateway exposes health and metrics endpoints:
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics

Best Practices Summary

Start Conservative

Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.

Monitor Continuously

Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.

Match Provider Limits

Don’t set concurrency higher than provider rate limits allow. You’ll just get rate-limited.

Plan for Bursts

Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.

Quick Reference

// Formula
concurrency       = expected_rps
buffer_size       = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes