# Performance Tuning

> Optimize Bifrost for high throughput with concurrency, buffer sizing, and memory pool configuration

## Overview

Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:

| Parameter             | Scope        | Default | Description                                                    |
| --------------------- | ------------ | ------- | -------------------------------------------------------------- |
| **Concurrency**       | Per Provider | 1000    | Number of worker goroutines processing requests simultaneously |
| **Buffer Size**       | Per Provider | 5000    | Maximum requests that can be queued before blocking/dropping   |
| **Initial Pool Size** | Global       | 5000    | Pre-allocated objects in sync pools to reduce GC pressure      |

<Info>
  These defaults are suitable for most production deployments handling up to \~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
</Info>

***

## Understanding the Parameters

### Concurrency (Per Provider)

**What it does:** Controls two aspects of provider performance:

1. **Worker Goroutines:** The number of goroutines that process requests for each provider. Each worker pulls requests from the provider's queue and executes them against the provider's API.
2. **Provider Pool Pre-warming:** Pre-allocates provider-specific response objects (e.g., `AnthropicMessageResponse`, `OpenAIResponse`) in sync pools to reduce allocations during request handling.

**Impact:**

* **Higher concurrency** = More parallel requests to the provider, higher throughput, more pre-allocated response objects
* **Lower concurrency** = Fewer parallel requests, lower resource usage, easier to stay within provider rate limits

**Default:** `1000` workers per provider

<Tabs>
  <Tab title="Gateway (config.json)">
    ```json theme={null}
    {
        "providers": {
            "openai": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 100,
                    "buffer_size": 500
                }
            }
        }
    }
    ```
  </Tab>

  <Tab title="Go SDK">
    ```go theme={null}
    func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
        return &schemas.ProviderConfig{
            NetworkConfig: schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                Concurrency: 100, // 100 concurrent workers
                BufferSize:  500, // 500 request queue capacity
            },
        }, nil
    }
    ```
  </Tab>
</Tabs>
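
To make the worker model above concrete, here is a minimal sketch of a fixed set of goroutines draining a buffered channel. It illustrates the general pattern only and is not Bifrost's internal code; the `request` type and the `runWorkers` and `processAll` helpers are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for a queued provider request.
type request struct{ id int }

// runWorkers starts `concurrency` goroutines that drain queue until it is
// closed, and returns how many requests each worker handled.
func runWorkers(concurrency int, queue <-chan request, handle func(request)) []int {
	counts := make([]int, concurrency)
	var wg sync.WaitGroup
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for req := range queue {
				handle(req)
				counts[w]++ // each worker writes only its own slot
			}
		}(w)
	}
	wg.Wait()
	return counts
}

// processAll pushes n requests through a queue of the given capacity and
// returns the total number of requests handled by all workers.
func processAll(concurrency, bufferSize, n int) int {
	queue := make(chan request, bufferSize)
	go func() {
		for i := 0; i < n; i++ {
			queue <- request{id: i} // blocks while the queue is full
		}
		close(queue)
	}()
	total := 0
	for _, c := range runWorkers(concurrency, queue, func(request) {}) {
		total += c
	}
	return total
}

func main() {
	fmt.Println(processAll(4, 8, 100)) // prints 100: every request is handled once
}
```

Raising `concurrency` in this sketch increases how many requests are in flight at once, while `bufferSize` only changes how many can wait.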

### Buffer Size (Per Provider)

**What it does:** Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.

**Impact:**

* **Larger buffer** = More requests can be queued during traffic spikes, handles burst traffic better
* **Smaller buffer** = Lower memory footprint, faster backpressure signals to clients

**Default:** `5000` requests per provider queue

**Queue Full Behavior:** Controlled by `drop_excess_requests`:

* `false` (default): New requests block until queue space is available
* `true`: New requests are immediately dropped with an error when the queue is full

<Warning>
  **Constraint:** Buffer size must be greater than or equal to concurrency. If `concurrency > buffer_size`, provider setup will fail.
</Warning>
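
If you generate provider settings programmatically, a pre-flight check along these lines can catch the constraint above before deployment. This is a sketch: the `sizing` type and `validate` method are hypothetical and do not mirror Bifrost's own validation, and the positivity check is an extra assumption of this example.

```go
package main

import (
	"errors"
	"fmt"
)

// sizing holds the two per-provider settings. Hypothetical type for this
// example, not part of Bifrost's schemas package.
type sizing struct {
	Concurrency int
	BufferSize  int
}

var errBufferTooSmall = errors.New("buffer_size must be >= concurrency")

// validate rejects settings that break the documented constraint.
// Requiring positive values is an assumption made for this sketch.
func (s sizing) validate() error {
	if s.Concurrency <= 0 || s.BufferSize <= 0 {
		return errors.New("concurrency and buffer_size must be positive")
	}
	if s.Concurrency > s.BufferSize {
		return fmt.Errorf("%w: concurrency=%d, buffer_size=%d",
			errBufferTooSmall, s.Concurrency, s.BufferSize)
	}
	return nil
}

func main() {
	fmt.Println(sizing{Concurrency: 100, BufferSize: 500}.validate())
	fmt.Println(sizing{Concurrency: 500, BufferSize: 100}.validate())
}
```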

### Initial Pool Size (Global)

**What it does:** Controls the number of pre-allocated objects in Bifrost's internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.

**Pooled Objects:**

* Channel messages (request wrappers)
* Response channels
* Error channels
* Stream channels
* Plugin pipelines
* Request objects

**Impact:**

* **Higher initial pool** = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
* **Lower initial pool** = Lower initial memory footprint, may cause more allocations under load

**Default:** `5000` objects per pool

<Tabs>
  <Tab title="Gateway (config.json)">
    ```json theme={null}
    {
        "config": {
            "initial_pool_size": 10000,
            "drop_excess_requests": false
        }
    }
    ```
  </Tab>

  <Tab title="Go SDK">
    ```go theme={null}
    bifrostConfig := schemas.BifrostConfig{
        Account:            myAccount,
        InitialPoolSize:    10000, // Pre-warm pools with 10,000 objects
        DropExcessRequests: false,
    }

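    client, err := bifrost.Init(ctx, bifrostConfig)
    if err != nil {
        log.Fatalf("failed to initialize Bifrost: %v", err) // requires the standard "log" package
    }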
    ```
  </Tab>
</Tabs>
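
The idea of pre-warming can be sketched with Go's standard `sync.Pool`. The example below is illustrative only and assumes nothing about Bifrost's actual pool implementation; `message`, `messagePool`, and the 1024-byte buffer capacity are hypothetical choices.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// message is a hypothetical pooled object carrying a reusable buffer.
type message struct {
	payload []byte
}

// messagePool wraps sync.Pool and counts how many objects were created.
type messagePool struct {
	pool      sync.Pool
	allocated atomic.Int64
}

// newMessagePool creates the pool and seeds it with initialSize objects so
// early requests can reuse them instead of allocating.
func newMessagePool(initialSize int) *messagePool {
	p := &messagePool{}
	p.pool.New = func() any {
		p.allocated.Add(1)
		return &message{payload: make([]byte, 0, 1024)}
	}
	for i := 0; i < initialSize; i++ {
		p.pool.Put(p.pool.New())
	}
	return p
}

// get returns a pooled message, or a fresh one if the pool is empty.
func (p *messagePool) get() *message { return p.pool.Get().(*message) }

// put resets the message and returns it to the pool for reuse.
func (p *messagePool) put(m *message) {
	m.payload = m.payload[:0]
	p.pool.Put(m)
}

func main() {
	p := newMessagePool(100)
	fmt.Println(p.allocated.Load()) // prints 100: objects created during pre-warming
	m := p.get()
	m.payload = append(m.payload, "hello"...)
	p.put(m)
}
```

Note that `sync.Pool` may release idle objects during garbage collection, so with this approach pre-warming is best-effort rather than a guarantee against later allocations.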

***

## Sizing Guidelines

### Concurrency & Buffer Size (Per Provider)

Configure these settings **per provider** based on the expected RPS for that specific provider:

| Provider RPS | Concurrency | Buffer Size |
| ------------ | ----------- | ----------- |
| 100          | 100         | 150         |
| 500          | 500         | 750         |
| 1000         | 1000        | 1500        |
| 2500         | 2500        | 3750        |
| 5000         | 5000        | 7500        |
| 10000        | 10000       | 15000       |

<Info>
  **Example:** If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with `concurrency: 2000, buffer_size: 3000` and Anthropic with `concurrency: 500, buffer_size: 750`.
</Info>

**Formula:**

```
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
```

This ratio ensures:

* Enough queue capacity to absorb traffic bursts
* Workers are never starved for work
* Backpressure is applied before memory exhaustion
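
As a worked example of the formula, the sketch below derives both settings from an expected RPS figure. The `providerSizing` type and `sizeForRPS` function are hypothetical helpers, and rounding the buffer up for odd inputs is a choice made for this example.

```go
package main

import "fmt"

// providerSizing holds the derived per-provider settings (hypothetical type).
type providerSizing struct {
	Concurrency int
	BufferSize  int
}

// sizeForRPS applies concurrency = rps and buffer_size = 1.5 x rps,
// rounding the buffer up so it never falls below 1.5x.
func sizeForRPS(expectedRPS int) providerSizing {
	return providerSizing{
		Concurrency: expectedRPS,
		BufferSize:  (expectedRPS*3 + 1) / 2,
	}
}

func main() {
	fmt.Printf("%+v\n", sizeForRPS(2000)) // prints {Concurrency:2000 BufferSize:3000}
	fmt.Printf("%+v\n", sizeForRPS(500))  // prints {Concurrency:500 BufferSize:750}
}
```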

### Initial Pool Size (Global)

Configure this setting based on **total RPS across all providers combined**:

| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
| ------------------------- | ----------------- | --------------- |
| 100                       | 150               | \~50 MB         |
| 500                       | 750               | \~100 MB        |
| 1000                      | 1500              | \~200 MB        |
| 2500                      | 3750              | \~400 MB        |
| 5000                      | 7500              | \~800 MB        |
| 10000                     | 15000             | \~1.5 GB        |

<Note>
  Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
</Note>

**Formula:**

```
initial_pool_size = 1.5 × total_expected_rps
```

Additionally, ensure:

```
initial_pool_size >= max(buffer_size across all providers)
```

This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
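
Both conditions can be combined into one calculation. The following sketch uses a hypothetical `initialPoolSize` helper that takes the total expected RPS plus the buffer sizes you configured for each provider.

```go
package main

import "fmt"

// initialPoolSize returns the larger of 1.5 x totalRPS (rounded up) and the
// largest configured provider buffer size. Hypothetical helper.
func initialPoolSize(totalRPS int, bufferSizes ...int) int {
	size := (totalRPS*3 + 1) / 2
	for _, b := range bufferSizes {
		if b > size {
			size = b
		}
	}
	return size
}

func main() {
	// Two providers at 500 RPS each, buffers sized with the 1.5x rule.
	fmt.Println(initialPoolSize(1000, 750, 750)) // prints 1500

	// A provider with an oversized buffer raises the floor.
	fmt.Println(initialPoolSize(820, 1000, 500, 2500, 100)) // prints 2500
}
```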

***

## Multi-Node Deployments

When running multiple Bifrost instances behind a load balancer, size the settings for your total expected RPS, then **divide each value by the number of nodes** to get the per-node settings.

### Formula

```
Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
```

### Example: 10,000 RPS Across 4 Nodes

**Total capacity (aggregate across all 4 nodes):**

* Total: 10,000 RPS
* Per node: \~2,500 RPS

**Single node settings for 10,000 RPS (if running on one node):**

* Concurrency: 10000
* Buffer Size: 15000
* Initial Pool Size: 15000

**Per-node settings (4 nodes, 10,000 RPS total):**

| Parameter         | Total (Aggregate) | Per Node (4 nodes) |
| ----------------- | ----------------- | ------------------ |
| Concurrency       | 10000             | 2500               |
| Buffer Size       | 15000             | 3750               |
| Initial Pool Size | 15000             | 3750               |

<Tabs>
  <Tab title="Gateway (config.json)">
    ```json theme={null}
    {
        "config": {
            "initial_pool_size": 3750,
            "drop_excess_requests": false
        },
        "providers": {
            "openai": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 2500,
                    "buffer_size": 3750
                }
            },
            "anthropic": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 2500,
                    "buffer_size": 3750
                }
            }
        }
    }
    ```
  </Tab>

  <Tab title="Go SDK">
    ```go theme={null}
    const numNodes = 4

    func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
        // Total capacity divided by number of nodes
        // Total: 10,000 RPS across 4 nodes = 2,500 RPS per node
        return &schemas.ProviderConfig{
            NetworkConfig: schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                Concurrency: 10000 / numNodes, // 2500 per node
                BufferSize:  15000 / numNodes, // 3750 per node
            },
        }, nil
    }

    // In main initialization
    bifrostConfig := schemas.BifrostConfig{
        Account:         myAccount,
        InitialPoolSize: 15000 / numNodes, // 3750 per node
    }
    ```
  </Tab>
</Tabs>

<Tip>
  **Kubernetes Horizontal Pod Autoscaling:** When using HPA, size the settings for your minimum replica count, since that is when each pod carries the largest share of traffic. As pods scale up, each pod handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to dynamically adjust settings based on replica count.
</Tip>
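
One way to derive per-node values from a replica count is sketched below. The environment variable name `BIFROST_REPLICAS` is purely hypothetical (Bifrost does not define it as far as this page states), as are the `perNode` and `replicaCount` helpers. Rounding up is a choice made here so the per-node values never add up to less than the total.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// perNode divides a cluster-wide total across nodes, rounding up.
func perNode(total, nodes int) int {
	if nodes < 1 {
		nodes = 1
	}
	return (total + nodes - 1) / nodes
}

// replicaCount reads the replica count from an environment variable and
// falls back to a default when the variable is unset or invalid.
func replicaCount(envVar string, fallback int) int {
	n, err := strconv.Atoi(os.Getenv(envVar))
	if err != nil || n < 1 {
		return fallback
	}
	return n
}

func main() {
	nodes := replicaCount("BIFROST_REPLICAS", 4) // hypothetical variable name
	fmt.Printf("nodes=%d concurrency=%d buffer_size=%d initial_pool_size=%d\n",
		nodes, perNode(10000, nodes), perNode(15000, nodes), perNode(15000, nodes))
}
```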

***

## Provider-Specific Tuning

Different providers have different rate limits and latency characteristics, so tune each provider independently.

### Provider Rate Limit Considerations

| Provider     | Typical Rate Limits            | Recommended Concurrency | Notes                                 |
| ------------ | ------------------------------ | ----------------------- | ------------------------------------- |
| OpenAI       | 500-10000 RPM (varies by tier) | 100-500                 | Higher tiers support more concurrency |
| Anthropic    | 1000-4000 RPM (varies by tier) | 50-200                  | More conservative rate limits         |
| Bedrock      | Per-model limits               | 100-300                 | Check AWS quotas for your account     |
| Azure OpenAI | Deployment-specific            | 100-500                 | Configure per-deployment              |
| Vertex AI    | Per-model quotas               | 100-300                 | Check GCP quotas                      |
| Groq         | Very high throughput           | 500-1000                | Designed for high concurrency         |
| Ollama       | Local resource bound           | 10-50                   | Limited by local GPU/CPU              |
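
Rate limits are usually quoted in requests per minute, while concurrency is a count of in-flight requests. A rough way to relate the two is Little's law: in-flight requests are approximately the arrival rate multiplied by the average latency. The sketch below applies that estimate; the `concurrencyForRateLimit` helper is hypothetical, and the latency figure is something you would need to measure for your own workload.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// concurrencyForRateLimit estimates how many in-flight requests are needed
// to sustain a provider's RPM limit at a given average latency.
// It always returns at least 1.
func concurrencyForRateLimit(rpm int, avgLatency time.Duration) int {
	rps := float64(rpm) / 60.0
	inFlight := rps * avgLatency.Seconds()
	return int(math.Max(1, math.Ceil(inFlight)))
}

func main() {
	// 3000 RPM is 50 RPS; at 2 seconds per request, about 100 are in flight.
	fmt.Println(concurrencyForRateLimit(3000, 2*time.Second)) // prints 100
}
```

Treat the result as an upper bound worth testing against, not a guarantee: token-based limits and latency variance can lower the useful concurrency.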

### Example: Mixed Provider Configuration

<Tabs>
  <Tab title="Gateway (config.json)">
    ```json theme={null}
    {
        "providers": {
            "openai": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 200,
                    "buffer_size": 1000
                }
            },
            "anthropic": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 100,
                    "buffer_size": 500
                }
            },
            "groq": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 500,
                    "buffer_size": 2500
                }
            },
            "ollama": {
                "keys": [...],
                "concurrency_and_buffer_size": {
                    "concurrency": 20,
                    "buffer_size": 100
                }
            }
        }
    }
    ```
  </Tab>

  <Tab title="Go SDK">
    ```go theme={null}
    func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
        switch provider {
        case schemas.OpenAI:
            return &schemas.ProviderConfig{
                NetworkConfig: schemas.DefaultNetworkConfig,
                ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                    Concurrency: 200,
                    BufferSize:  1000,
                },
            }, nil
        case schemas.Anthropic:
            return &schemas.ProviderConfig{
                NetworkConfig: schemas.DefaultNetworkConfig,
                ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                    Concurrency: 100,
                    BufferSize:  500,
                },
            }, nil
        case schemas.Groq:
            return &schemas.ProviderConfig{
                NetworkConfig: schemas.DefaultNetworkConfig,
                ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                    Concurrency: 500,
                    BufferSize:  2500,
                },
            }, nil
        case schemas.Ollama:
            return &schemas.ProviderConfig{
                NetworkConfig: schemas.DefaultNetworkConfig,
                ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                    Concurrency: 20,
                    BufferSize:  100,
                },
            }, nil
        default:
            return &schemas.ProviderConfig{
                NetworkConfig:            schemas.DefaultNetworkConfig,
                ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
            }, nil
        }
    }
    ```
  </Tab>
</Tabs>

***

## Queue Overflow Handling

When the provider queue reaches capacity, Bifrost's behavior is controlled by `drop_excess_requests`:

### Blocking Mode (Default)

```json theme={null}
{
    "config": {
        "drop_excess_requests": false
    }
}
```

* New requests **wait** until queue space is available
* Ensures no requests are lost
* May increase latency during high load
* Suitable for critical workloads where every request matters

### Drop Mode

```json theme={null}
{
    "config": {
        "drop_excess_requests": true
    }
}
```

* New requests are **immediately rejected** when the queue is full
* Returns error: `"request dropped: queue is full"`
* Maintains consistent latency for accepted requests
* Suitable for real-time applications where stale requests are useless

<Tip>
  **Best Practice:** Use `drop_excess_requests: true` with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
</Tip>
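
The difference between the two modes maps naturally onto a Go buffered channel: a plain send blocks when the channel is full, while a `select` with a `default` branch fails fast. The sketch below illustrates that pattern only; the `enqueue` function is hypothetical and is not Bifrost's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueFull = errors.New("request dropped: queue is full")

// enqueue models the two overflow modes on a buffered channel.
func enqueue(queue chan<- int, req int, dropExcess bool) error {
	if dropExcess {
		select {
		case queue <- req:
			return nil
		default:
			// Queue is full: reject immediately instead of waiting.
			return errQueueFull
		}
	}
	queue <- req // blocks until a worker frees a slot
	return nil
}

func main() {
	queue := make(chan int, 2)
	for i := 0; i < 3; i++ {
		// The third request is dropped because the queue holds only two.
		fmt.Println(i, enqueue(queue, i, true))
	}
}
```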

***

## Monitoring and Diagnostics

### Key Metrics to Monitor

| Metric                | Healthy Range          | Action if Outside Range        |
| --------------------- | ---------------------- | ------------------------------ |
| Queue depth           | \< 50% of buffer\_size | Increase buffer or concurrency |
| Request latency (p99) | \< 2x average          | Check provider rate limits     |
| Dropped requests      | 0                      | Increase buffer\_size          |
| Memory usage          | Stable                 | Reduce pool/buffer sizes       |
| Goroutine count       | Stable                 | Check for goroutine leaks      |
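
The queue depth threshold in the table can be expressed as a simple check. The sketch below assumes you have access to a queue's current depth and capacity (for a Go channel, `len` and `cap`); the `queueHealthy` helper is hypothetical, and how Bifrost exposes queue depth is not covered here.

```go
package main

import "fmt"

// queueHealthy reports whether depth is below 50% of bufferSize.
// A zero bufferSize is treated as unhealthy.
func queueHealthy(depth, bufferSize int) bool {
	return depth*2 < bufferSize
}

func main() {
	queue := make(chan int, 10)
	for i := 0; i < 4; i++ {
		queue <- i
	}
	fmt.Println(queueHealthy(len(queue), cap(queue))) // prints true: 4 of 10 slots used

	queue <- 4
	fmt.Println(queueHealthy(len(queue), cap(queue))) // prints false: 5 of 10 slots used
}
```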

### Health Check Endpoint

The Gateway exposes health and metrics endpoints:

```bash theme={null}
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```

***

## Best Practices Summary

<CardGroup cols={2}>
  <Card title="Start Conservative" icon="shield">
    Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
  </Card>

  <Card title="Monitor Continuously" icon="chart-line">
    Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
  </Card>

  <Card title="Match Provider Limits" icon="scale-balanced">
    Don't set concurrency higher than provider rate limits allow. You'll just get rate-limited.
  </Card>

  <Card title="Plan for Bursts" icon="bolt">
    Set buffer\_size to 1.5x concurrency to handle traffic spikes without dropping requests.
  </Card>
</CardGroup>

### Quick Reference

```
// Formula
concurrency       = expected_rps
buffer_size       = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes
```

***

## Related Documentation

* **[Provider Configuration](../quickstart/gateway/provider-configuration)** - Complete provider setup guide
* **[Custom Providers](./custom-providers)** - Creating custom provider integrations
* **[Deployment](../deployment-guides/)** - Production deployment guides
