Overview
Async inference uses a fire-and-forget pattern for gateway requests: submit a normal inference payload to an async endpoint, get a job_id immediately, and poll later for the final result.
This is a gateway-only feature: it is not available in the Go SDK, and it requires a Logs Store to be configured.
How It Works
Submitting to an async endpoint creates a job record and returns a job_id immediately. A background worker then executes the request against the provider and stores the result (or error), which you can fetch by polling until the result's TTL expires.
Supported Endpoints
Streaming is not supported on async endpoints.
| Request Type | Submit (POST) | Poll (GET) |
|---|---|---|
| Text completions | /v1/async/completions | /v1/async/completions/{job_id} |
| Chat completions | /v1/async/chat/completions | /v1/async/chat/completions/{job_id} |
| Responses API | /v1/async/responses | /v1/async/responses/{job_id} |
| Embeddings | /v1/async/embeddings | /v1/async/embeddings/{job_id} |
| Speech | /v1/async/audio/speech | /v1/async/audio/speech/{job_id} |
| Transcriptions | /v1/async/audio/transcriptions | /v1/async/audio/transcriptions/{job_id} |
| Image generations | /v1/async/images/generations | /v1/async/images/generations/{job_id} |
| Image edits | /v1/async/images/edits | /v1/async/images/edits/{job_id} |
| Image variations | /v1/async/images/variations | /v1/async/images/variations/{job_id} |
Submitting a Request
Use the same JSON body as the synchronous endpoint, but switch to the /v1/async/ path.
curl -X POST http://localhost:8080/v1/async/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-vk: sk-bf-your-virtual-key" \
-H "x-bf-async-job-result-ttl: 3600" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Summarize the latest release notes in 3 bullets"
}
]
}'
Response (202 Accepted)
{
"id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
"status": "pending",
"created_at": "2026-02-19T08:10:17.831Z"
}
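If you are scripting the flow, you can capture the returned id for the polling step. A minimal sketch, assuming jq is available (any JSON parser works; the 202 body's id field is the job_id used for polling):
JOB_ID=$(curl -s -X POST http://localhost:8080/v1/async/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: sk-bf-your-virtual-key" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.id')
echo "$JOB_ID"   # e.g. 1e89b165-d4fe-49e8-beb2-3e157f2df02f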
Polling for Results
Use GET on the matching endpoint with the returned job_id.
curl -X GET http://localhost:8080/v1/async/chat/completions/1e89b165-d4fe-49e8-beb2-3e157f2df02f \
-H "x-bf-vk: sk-bf-your-virtual-key"
Response codes:
- 202 Accepted: job is still pending or processing
- 200 OK: job is completed or failed
Pending example (202)
{
"id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
"status": "pending",
"created_at": "2026-02-19T08:10:17.831Z"
}
Completed example (200)
{
"id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
"status": "completed",
"created_at": "2026-02-19T08:10:17.831Z",
"completed_at": "2026-02-19T08:10:19.412Z",
"expires_at": "2026-02-19T09:10:19.412Z",
"status_code": 200,
"result": {
"id": "chatcmpl-123",
"object": "chat.completion"
}
}
Failed example (200)
{
"id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
"status": "failed",
"created_at": "2026-02-19T08:10:17.831Z",
"completed_at": "2026-02-19T08:10:19.412Z",
"expires_at": "2026-02-19T09:10:19.412Z",
"status_code": 429,
"error": {
"error": {
"message": "rate limit exceeded",
"type": "rate_limit_error"
}
}
}
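Putting submit and poll together: the loop below is a minimal sketch, assuming $JOB_ID holds the id captured at submit (as in the jq sketch above) and using an illustrative fixed 2-second delay. Note that a terminal 200 can still carry "status": "failed", so inspect the body.
# Poll with the same virtual key the job was created with.
while true; do
  HTTP_CODE=$(curl -s -o /tmp/async_result.json -w '%{http_code}' \
    -H "x-bf-vk: sk-bf-your-virtual-key" \
    "http://localhost:8080/v1/async/chat/completions/$JOB_ID")
  if [ "$HTTP_CODE" = "200" ]; then
    cat /tmp/async_result.json   # terminal state: check .status for completed vs failed
    break
  fi
  sleep 2   # 202: still pending or processing
done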
Job Lifecycle
| Status | Meaning | Transition Trigger |
|---|---|---|
| pending | Job record is created and queued | Immediate status on submit |
| processing | Background worker has picked up the job | Worker starts execution |
| completed | Operation succeeded and result is stored | Provider call completes successfully |
| failed | Operation failed and error is stored | Provider call returns a Bifrost error |
Result TTL and Expiration
- Default TTL is 3600 seconds (1 hour).
- TTL starts from completion time, not submission time.
- Server default is configured in client.async_job_result_ttl.
- Per-request override uses the x-bf-async-job-result-ttl header (see the example after this list).
- If the header is invalid or <= 0, Bifrost falls back to the default TTL.
- Expired jobs return 404 Job not found or expired.
- Expired async jobs are cleaned up every minute.
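For example, to keep a result available for 24 hours instead of the default hour, pass the override header at submit time (86400 is just 24 x 3600; the payload is illustrative):
curl -X POST http://localhost:8080/v1/async/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: sk-bf-your-virtual-key" \
  -H "x-bf-async-job-result-ttl: 86400" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'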
Virtual Key Authorization
- If a job is created with a virtual key, the job stores that virtual key identity.
- Polling must use the same virtual key value.
- Missing or mismatched virtual keys fail lookup and return 404 Job not found or expired.
- Jobs created without a virtual key are not virtual-key scoped, so they can be polled by any caller that passes your gateway auth/middleware checks.
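For example, polling the same job with a different virtual key fails the lookup and returns the same 404 as an expired job (key value is illustrative):
curl -i http://localhost:8080/v1/async/chat/completions/1e89b165-d4fe-49e8-beb2-3e157f2df02f \
  -H "x-bf-vk: sk-bf-a-different-virtual-key"
# 404 Job not found or expired, even though the job exists under the original key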
Observability
- Async executions are logged like synchronous requests.
- The logging metadata includes isAsyncRequest: true, which appears as an Async badge in the Logs UI.
- Background execution still uses Bifrost request APIs, so LLM plugin hooks (governance, logging, cost tracking, etc.) are executed for the actual inference run.
Limitations
- Gateway-only feature (not available in Go SDK).
- Streaming is not supported on async endpoints.
- Requires Logs Store to register async routes.
- Jobs stuck in processing are not auto-expired by TTL cleanup; cleanup only deletes jobs with expires_at set (completed/failed).