Overview
Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:

| Parameter | Scope | Default | Description |
|---|---|---|---|
| Concurrency | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| Buffer Size | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| Initial Pool Size | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |
These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
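Throughout this guide, examples are shown as config.json-style JSON sketches. The keys concurrency, buffer_size, and drop_excess_requests appear elsewhere in this document; the initial_pool_size key name and the exact nesting (a global client block plus per-provider blocks) are assumptions for illustration, so confirm them against the config schema of your Bifrost version. The Go SDK exposes the same settings programmatically. A minimal sketch with the default values:

```json
{
  "client": {
    "initial_pool_size": 5000,
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "concurrency": 1000,
      "buffer_size": 5000
    }
  }
}
```

Concurrency and buffer size are configured per provider, while the initial pool size is set once globally.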
Understanding the Parameters
Concurrency (Per Provider)
What it does: Controls two aspects of provider performance:
- Worker Goroutines: The number of goroutines that process requests for each provider. Each worker pulls requests from the provider's queue and executes them against the provider's API.
- Provider Pool Pre-warming: Pre-allocates provider-specific response objects (e.g., AnthropicMessageResponse, OpenAIResponse) in sync pools to reduce allocations during request handling.
Impact:
- Higher concurrency = More parallel requests to the provider, higher throughput, more pre-allocated response objects
- Lower concurrency = Fewer parallel requests, lower resource usage, respects provider rate limits
Default: 1000 workers per provider
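A per-provider sketch, using the assumed key names and nesting from the overview; values are illustrative, and the Go SDK exposes the equivalent per-provider setting:

```json
{
  "providers": {
    "openai": {
      "concurrency": 1000
    },
    "anthropic": {
      "concurrency": 200
    }
  }
}
```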
Buffer Size (Per Provider)
What it does: Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.
Impact:
- Larger buffer = More requests can be queued during traffic spikes, handles burst traffic better
- Smaller buffer = Lower memory footprint, faster backpressure signals to clients
Default: 5000 requests per provider queue
Queue Full Behavior: Controlled by drop_excess_requests:
- false (default): New requests block until queue space is available
- true: New requests are immediately dropped with an error when the queue is full
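A sketch of per-provider queue sizing that follows the 1.5x guideline from the sizing tables below (key names and nesting assumed as in the overview):

```json
{
  "providers": {
    "openai": {
      "concurrency": 1000,
      "buffer_size": 1500
    }
  }
}
```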
Initial Pool Size (Global)
What it does: Controls the number of pre-allocated objects in Bifrost's internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
Pooled Objects:
- Channel messages (request wrappers)
- Response channels
- Error channels
- Stream channels
- Plugin pipelines
- Request objects
Impact:
- Higher initial pool = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
- Lower initial pool = Lower initial memory footprint, may cause more allocations under load
Default: 5000 objects per pool
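A sketch of the global pool setting, assuming an initial_pool_size key at the client level as noted in the overview:

```json
{
  "client": {
    "initial_pool_size": 5000
  }
}
```

Because this value is global, size it from the total RPS across all providers combined (see Sizing Guidelines below).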
Sizing Guidelines
Concurrency & Buffer Size (Per Provider)
Configure these settings per provider based on the expected RPS for that specific provider:

| Provider RPS | Concurrency | Buffer Size |
|---|---|---|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
Example: If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with concurrency: 2000, buffer_size: 3000 and Anthropic with concurrency: 500, buffer_size: 750 (see the sketch after this list). This sizing ensures:
- Enough queue capacity to absorb traffic bursts
- Workers are never starved for work
- Backpressure is applied before memory exhaustion
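The worked example above as a config.json-style sketch (key names and provider nesting assumed as in the overview):

```json
{
  "providers": {
    "openai": {
      "concurrency": 2000,
      "buffer_size": 3000
    },
    "anthropic": {
      "concurrency": 500,
      "buffer_size": 750
    }
  }
}
```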
Initial Pool Size (Global)
Configure this setting based on total RPS across all providers combined:

| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---|---|---|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
Multi-Node Deployments
When running multiple Bifrost instances behind a load balancer, size the settings for your total expected RPS and then divide by the number of nodes to get each node's configuration.
Formula
Per-node value = Total (aggregate) value ÷ Number of nodes
Example: 10,000 RPS Across 4 Nodes
Total capacity (aggregate across all 4 nodes):
- Total RPS: 10,000 RPS
- Per-node RPS: ~2,500 RPS per node
- Concurrency: 10000
- Buffer Size: 15000
- Initial Pool Size: 15000
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|---|---|---|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
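Each of the four nodes then runs the per-node values from the table above; a sketch (key names and nesting assumed as in the overview):

```json
{
  "client": {
    "initial_pool_size": 3750
  },
  "providers": {
    "openai": {
      "concurrency": 2500,
      "buffer_size": 3750
    }
  }
}
```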
Provider-Specific Tuning
Different providers have different rate limits and latency characteristics. Tune each provider independently.

Provider Rate Limit Considerations
| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|---|---|---|---|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |
Example: Mixed Provider Configuration
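A sketch of a mixed deployment drawing on the recommendations in the table above, with a higher-tier OpenAI setup, a more conservative Anthropic setup, and a local Ollama instance (all values and key nesting are illustrative):

```json
{
  "providers": {
    "openai": {
      "concurrency": 500,
      "buffer_size": 750
    },
    "anthropic": {
      "concurrency": 200,
      "buffer_size": 300
    },
    "ollama": {
      "concurrency": 25,
      "buffer_size": 50
    }
  }
}
```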
Queue Overflow Handling
When the provider queue reaches capacity, Bifrost's behavior is controlled by drop_excess_requests:
Blocking Mode (Default)
- New requests wait until queue space is available
- Ensures no requests are lost
- May increase latency during high load
- Suitable for critical workloads where every request matters
Drop Mode
- New requests are immediately rejected when queue is full
- Returns error: "request dropped: queue is full"
- Maintains consistent latency for accepted requests
- Suitable for real-time applications where stale requests are useless
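To opt into drop mode, enable the flag (shown here at the global client level, which is an assumption; check the config schema for your Bifrost version):

```json
{
  "client": {
    "drop_excess_requests": true
  }
}
```

Callers should be prepared to handle the dropped-request error, for example by retrying with backoff or shedding the work.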
Monitoring and Diagnostics
Key Metrics to Monitor
| Metric | Healthy Range | Action if Exceeded |
|---|---|---|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |
Health Check Endpoint
The Gateway exposes health and metrics endpoints.

Best Practices Summary
Start Conservative
Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
Monitor Continuously
Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
Match Provider Limits
Don’t set concurrency higher than provider rate limits allow. You’ll just get rate-limited.
Plan for Bursts
Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.
Quick Reference
Related Documentation
- Provider Configuration - Complete provider setup guide
- Custom Providers - Creating custom provider integrations
- Deployment - Production deployment guides

