Overview

Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:
| Parameter | Scope | Default | Description |
|---|---|---|---|
| Concurrency | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| Buffer Size | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| Initial Pool Size | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |
These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
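For reference, a minimal configuration that states these defaults explicitly looks like this (a sketch using the same keys as the examples later on this page):
{
    "config": {
        "initial_pool_size": 5000,
        "drop_excess_requests": false
    },
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 1000,
                "buffer_size": 5000
            }
        }
    }
}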

Understanding the Parameters

Concurrency (Per Provider)

What it does: Controls two aspects of provider performance:
  1. Worker Goroutines: The number of goroutines that process requests for each provider. Each worker pulls requests from the provider’s queue and executes them against the provider’s API.
  2. Provider Pool Pre-warming: Pre-allocates provider-specific response objects (e.g., AnthropicMessageResponse, OpenAIResponse) in sync pools to reduce allocations during request handling.
Impact:
  • Higher concurrency = More parallel requests to the provider, higher throughput, more pre-allocated response objects
  • Lower concurrency = Fewer parallel requests, lower resource usage, respects provider rate limits
Default: 1000 workers per provider
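For example, to run a provider with fewer workers than the default (here OpenAI is capped at 100 workers with a 500-request queue):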
{
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 100,
                "buffer_size": 500
            }
        }
    }
}

Buffer Size (Per Provider)

What it does: Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.
Impact:
  • Larger buffer = More requests can be queued during traffic spikes, handles burst traffic better
  • Smaller buffer = Lower memory footprint, faster backpressure signals to clients
Default: 5000 requests per provider queue
Queue Full Behavior: Controlled by drop_excess_requests:
  • false (default): New requests block until queue space is available
  • true: New requests are immediately dropped with an error when queue is full
Constraint: Buffer size must be greater than or equal to concurrency. If concurrency > buffer_size, provider setup will fail.
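For example, a provider sized for 500 RPS would pair a 750-request queue (1.5× concurrency, see Sizing Guidelines below) with the global overflow setting (a sketch):
{
    "config": {
        "drop_excess_requests": false
    },
    "providers": {
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 750
            }
        }
    }
}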

Initial Pool Size (Global)

What it does: Controls the number of pre-allocated objects in Bifrost’s internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
Pooled Objects:
  • Channel messages (request wrappers)
  • Response channels
  • Error channels
  • Stream channels
  • Plugin pipelines
  • Request objects
Impact:
  • Higher initial pool = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
  • Lower initial pool = Lower initial memory footprint, may cause more allocations under load
Default: 5000 objects per pool
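For example, to raise the pool size above the default for a higher-traffic deployment: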
{
    "config": {
        "initial_pool_size": 10000,
        "drop_excess_requests": false
    }
}

Sizing Guidelines

Concurrency & Buffer Size (Per Provider)

Configure these settings per provider based on the expected RPS for that specific provider:
| Provider RPS | Concurrency | Buffer Size |
|---|---|---|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
Example: If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with concurrency: 2000, buffer_size: 3000 and Anthropic with concurrency: 500, buffer_size: 750.
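In config form, that example looks like this (providers block only; the global pool size is covered in the next section):
{
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 2000,
                "buffer_size": 3000
            }
        },
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 750
            }
        }
    }
}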
Formula:
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
This ratio ensures:
  • Enough queue capacity to absorb traffic bursts
  • Workers are never starved for work
  • Backpressure is applied before memory exhaustion

Initial Pool Size (Global)

Configure this setting based on total RPS across all providers combined:
| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---|---|---|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
Formula:
initial_pool_size = 1.5 × total_expected_rps
Additionally, ensure:
initial_pool_size >= max(buffer_size across all providers)
This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
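Worked example, continuing the 2000 RPS OpenAI + 500 RPS Anthropic scenario from above:
total_expected_rps = 2000 + 500 = 2500
initial_pool_size  = 1.5 × 2500 = 3750
max buffer_size    = 3000 (OpenAI)  →  3750 >= 3000 ✓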

Multi-Node Deployments

When running multiple Bifrost instances behind a load balancer, size the aggregate settings from your total expected RPS, then divide them by the number of nodes to get each node's settings.

Formula

Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes

Example: 10,000 RPS Across 4 Nodes

Total capacity (aggregate across all 4 nodes):
  • Total RPS: 10,000 RPS
  • Per-node RPS: ~2,500 RPS per node
Single node settings for 10,000 RPS (if running on one node):
  • Concurrency: 10000
  • Buffer Size: 15000
  • Initial Pool Size: 15000
Per-node settings (4 nodes, 10,000 RPS total):
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|---|---|---|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
{
    "config": {
        "initial_pool_size": 3750,
        "drop_excess_requests": false
    },
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 2500,
                "buffer_size": 3750
            }
        },
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 2500,
                "buffer_size": 3750
            }
        }
    }
}
Kubernetes Horizontal Pod Autoscaling: When using HPA, configure settings for your minimum replica count. As pods scale up, each node handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to dynamically adjust settings based on replica count.

Provider-Specific Tuning

Different providers have different rate limits and latency characteristics. Tune each provider independently:

Provider Rate Limit Considerations

| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|---|---|---|---|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |

Example: Mixed Provider Configuration

{
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 200,
                "buffer_size": 1000
            }
        },
        "anthropic": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 100,
                "buffer_size": 500
            }
        },
        "groq": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 2500
            }
        },
        "ollama": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 20,
                "buffer_size": 100
            }
        }
    }
}

Queue Overflow Handling

When the provider queue reaches capacity, Bifrost’s behavior is controlled by drop_excess_requests:

Blocking Mode (Default)

{
    "config": {
        "drop_excess_requests": false
    }
}
  • New requests wait until queue space is available
  • Ensures no requests are lost
  • May increase latency during high load
  • Suitable for critical workloads where every request matters

Drop Mode

{
    "config": {
        "drop_excess_requests": true
    }
}
  • New requests are immediately rejected when queue is full
  • Returns error: "request dropped: queue is full"
  • Maintains consistent latency for accepted requests
  • Suitable for real-time applications where stale requests are useless
Best Practice: Use drop_excess_requests: true with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
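For example, a provider sized for 1000 RPS following that guidance (a sketch):
{
    "config": {
        "drop_excess_requests": true
    },
    "providers": {
        "openai": {
            "keys": [...],
            "concurrency_and_buffer_size": {
                "concurrency": 1000,
                "buffer_size": 1500
            }
        }
    }
}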

Monitoring and Diagnostics

Key Metrics to Monitor

| Metric | Healthy Range | Action if Exceeded |
|---|---|---|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |

Health Check Endpoint

The Gateway exposes health and metrics endpoints:
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics

Best Practices Summary

Start Conservative

Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.

Monitor Continuously

Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.

Match Provider Limits

Don’t set concurrency higher than provider rate limits allow. You’ll just get rate-limited.

Plan for Bursts

Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.

Quick Reference

// Formula
concurrency       = expected_rps
buffer_size       = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes