Overview

Async inference uses a fire-and-forget pattern for gateway requests: submit a normal inference payload to an async endpoint, get a job_id immediately, and poll later for the final result.
This is a gateway-only feature: it is not available in the Go SDK, and it requires a Logs Store to be configured.

How It Works

Supported Endpoints

Streaming is not supported on async endpoints.
| Request Type | Submit (POST) | Poll (GET) |
| --- | --- | --- |
| Text completions | /v1/async/completions | /v1/async/completions/{job_id} |
| Chat completions | /v1/async/chat/completions | /v1/async/chat/completions/{job_id} |
| Responses API | /v1/async/responses | /v1/async/responses/{job_id} |
| Embeddings | /v1/async/embeddings | /v1/async/embeddings/{job_id} |
| Speech | /v1/async/audio/speech | /v1/async/audio/speech/{job_id} |
| Transcriptions | /v1/async/audio/transcriptions | /v1/async/audio/transcriptions/{job_id} |
| Image generations | /v1/async/images/generations | /v1/async/images/generations/{job_id} |
| Image edits | /v1/async/images/edits | /v1/async/images/edits/{job_id} |
| Image variations | /v1/async/images/variations | /v1/async/images/variations/{job_id} |
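As the table shows, each poll URL is simply the submit path with the job ID appended. A minimal Python sketch of that convention (the `poll_url` helper is illustrative, not part of the gateway API):

```python
def poll_url(base_url: str, submit_path: str, job_id: str) -> str:
    """Join the gateway base URL, async submit path, and job_id
    into the matching poll URL."""
    return f"{base_url.rstrip('/')}{submit_path}/{job_id}"

url = poll_url(
    "http://localhost:8080",
    "/v1/async/chat/completions",
    "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
)
print(url)
# http://localhost:8080/v1/async/chat/completions/1e89b165-d4fe-49e8-beb2-3e157f2df02f
```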

Submitting a Request

Use the same JSON body as the synchronous endpoint, but switch to the /v1/async/ path.
curl -X POST http://localhost:8080/v1/async/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: sk-bf-your-virtual-key" \
  -H "x-bf-async-job-result-ttl: 3600" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Summarize the latest release notes in 3 bullets"
      }
    ]
  }'
Response (202 Accepted)
{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "pending",
  "created_at": "2026-02-19T08:10:17.831Z"
}

Polling for Results

Use GET on the matching endpoint with the returned job_id.
curl -X GET http://localhost:8080/v1/async/chat/completions/1e89b165-d4fe-49e8-beb2-3e157f2df02f \
  -H "x-bf-vk: sk-bf-your-virtual-key"
Response codes:
  • 202 Accepted: job is still pending or processing
  • 200 OK: job is completed or failed
Pending example (202)
{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "pending",
  "created_at": "2026-02-19T08:10:17.831Z"
}
Completed example (200)
{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "completed",
  "created_at": "2026-02-19T08:10:17.831Z",
  "completed_at": "2026-02-19T08:10:19.412Z",
  "expires_at": "2026-02-19T09:10:19.412Z",
  "status_code": 200,
  "result": {
    "id": "chatcmpl-123",
    "object": "chat.completion"
  }
}
Failed example (200)
{
  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
  "status": "failed",
  "created_at": "2026-02-19T08:10:17.831Z",
  "completed_at": "2026-02-19T08:10:19.412Z",
  "expires_at": "2026-02-19T09:10:19.412Z",
  "status_code": 429,
  "error": {
    "error": {
      "message": "rate limit exceeded",
      "type": "rate_limit_error"
    }
  }
}

Job Lifecycle

| Status | Meaning | Transition Trigger |
| --- | --- | --- |
| pending | Job record is created and queued | Immediate status on submit |
| processing | Background worker has picked up the job | Worker starts execution |
| completed | Operation succeeded and result is stored | Provider call completes successfully |
| failed | Operation failed and error is stored | Provider call returns a Bifrost error |
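The lifecycle above can be read as a small state machine: two transient states and two terminal ones. A sketch of the allowed transitions, derived only from the table (the structure is illustrative, not a gateway data type):

```python
# Allowed status transitions, per the lifecycle table above.
TRANSITIONS = {
    "pending": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}

def is_terminal(status: str) -> bool:
    """A job stops changing once it has no outgoing transitions."""
    return not TRANSITIONS[status]

print(is_terminal("processing"))  # False
print(is_terminal("completed"))   # True
```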

Result TTL and Expiration

  • Default TTL is 3600 seconds (1 hour).
  • TTL starts from completion time, not submission time.
  • Server default is configured in client.async_job_result_ttl.
  • Per-request override uses x-bf-async-job-result-ttl.
  • If the header is invalid or <= 0, Bifrost falls back to the default TTL.
  • Expired jobs return 404 Job not found or expired.
  • Expired async jobs are cleaned up every minute.

Virtual Key Authorization

  • If a job is created with a virtual key, the job stores that virtual key identity.
  • Polling must use the same virtual key value.
  • Missing or mismatched virtual keys fail lookup and return 404 Job not found or expired.
  • Jobs created without a virtual key are not virtual-key scoped, so they can be polled by any caller that passes your gateway auth/middleware checks.

Observability

  • Async executions are logged like synchronous requests.
  • The logging metadata includes isAsyncRequest: true, which appears as an Async badge in the Logs UI.
  • Background execution still uses Bifrost request APIs, so LLM plugin hooks (governance, logging, cost tracking, etc.) are executed for the actual inference run.

Limitations

  • Gateway-only feature (not available in Go SDK).
  • Streaming is not supported on async endpoints.
  • Requires Logs Store to register async routes.
  • Jobs stuck in processing are not auto-expired by TTL cleanup. Cleanup only deletes jobs with expires_at set (completed/failed).