Overview

Replicate is architecturally different from other providers in Bifrost. It uses a prediction-based API where every request creates a “prediction” that runs asynchronously. Each model on Replicate defines its own input schema, making it highly flexible but requiring model-specific parameter knowledge.

Key Architectural Differences

  1. Prediction-Based System: All operations create predictions via /v1/predictions or deployment endpoints
  2. Model-Specific Inputs: Each model has its own parameter schema (use extra_params for model-specific fields)
  3. Async/Sync Modes: Predictions can run synchronously (with Prefer: wait header) or asynchronously (with polling)
  4. Flexible Output: Output can be strings, arrays, URLs, or data URIs depending on the model

Supported Operations

Operation             Non-Streaming   Streaming   Endpoint
Chat Completions      ✅              ✅          /v1/predictions
Responses API         ✅              ✅          /v1/predictions
Text Completions      ✅              ✅          /v1/predictions
Image Generation      ✅              ✅          /v1/predictions
Files                 ✅              -           /v1/files
List Models           ✅              -           /v1/deployments
Embeddings            ❌              -           -
Speech (TTS)          ❌              -           -
Transcriptions (STT)  ❌              -           -
Batch                 ❌              -           -
List Models returns account-specific deployments only, not all public models on Replicate.

Model Identification

Replicate models can be specified in three ways:

1. Version ID

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/5c7d5dc6dd8bf75c1acaa8565735e7986bc5b66206b55cca93cb72c9bf15ccaa",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

2. Model Name

Format: owner/model-name
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/meta/llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

3. Deployment

Configure deployed models in the Replicate key configuration. Deployments map custom model identifiers to actual deployment paths.
Configuration Example:
{
  "provider": "replicate",
  "value": "your-api-key",
  "replicate_key_config": {
    "deployments": {
      "my-model": "owner/my-deployment-name"
    }
  }
}
Usage:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Prediction Modes

Sync Mode

Bifrost uses sync mode when the Prefer: wait header is present in the request. The request blocks until the prediction completes or times out (default 60 seconds).
How it works:
  1. Creates prediction with Prefer: wait=60 header
  2. Replicate holds connection open for up to 60 seconds
  3. If prediction completes within timeout, returns result immediately
  4. If timeout expires, falls back to polling mode
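For example, a client opts into sync mode by forwarding the header with an otherwise normal request:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d '{
    "model": "replicate/meta/llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}]
  }'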

Async Mode (Polling)

Async mode is the default for Replicate predictions. Bifrost automatically polls the prediction URL every 2 seconds until completion. Status Flow: starting → processing → succeeded/failed/canceled
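A rough shell equivalent of this polling loop, run directly against Replicate's API (the prediction ID is illustrative; the URL comes from urls.get in the create response, and jq is assumed to be installed):
# Poll the prediction every 2 seconds until it reaches a terminal status
PREDICTION_URL="https://api.replicate.com/v1/predictions/abc123"
while true; do
  STATUS=$(curl -s "$PREDICTION_URL" \
    -H "Authorization: Bearer $REPLICATE_API_TOKEN" | jq -r '.status')
  case "$STATUS" in
    succeeded|failed|canceled) break ;;   # terminal states
  esac
  sleep 2
done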

1. Chat Completions

Message Conversion

  • System Messages: Extracted from the messages array and concatenated into the system_prompt field.
  • User/Assistant Messages: Preserved as conversation context. Text content from content blocks is concatenated with newlines.
  • Image Content: Non-base64 image URLs from message content blocks are extracted and passed as the image_input array.
// Input
{
  "messages": [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Hello"}
  ]
}

// Converted to Replicate format
{
  "input": {
    "system_prompt": "You are helpful",
    "prompt": "Hello",
    "messages": [...] // Original messages array also included
  }
}

System Prompt Filtering

Important: Not all Replicate models support the system_prompt field. For unsupported models, the system prompt is automatically prepended to the conversation prompt (see the example after this list).
Models without system_prompt support:
  • meta/meta-llama-3-8b
  • meta/llama-2-70b
  • openai/gpt-oss-20b
  • openai/o1-mini
  • xai/grok-4
  • All deepseek-ai/deepseek* models (e.g., deepseek-r1, deepseek-v3)
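For one of these models, the conversion example above would instead fold the system prompt into the prompt field, roughly like this (the exact separator between system prompt and conversation is an implementation detail):
// Converted input for a model without system_prompt support
{
  "input": {
    "prompt": "You are helpful\n\nHello",
    "messages": [...] // Original messages array still included
  }
}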

Model-Specific Parameters

Use extra_params to pass model-specific parameters. These are flattened into the input object:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/meta/llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "min_new_tokens": 10
  }'
Model Schema Discovery: Each Replicate model has unique parameters. Check the model’s documentation on replicate.com or use the OpenAPI schema from the model version to discover available parameters.

Response Conversion

Field Mapping

  • Output:
    • String → choices[0].message.content
    • Array of strings → joined and mapped to choices[0].message.content
    • Object with text field → text value mapped to choices[0].message.content
  • Status: succeeded → finish_reason: "stop", failed → finish_reason: "error"
  • Metrics: input_token_count → prompt_tokens, output_token_count → completion_tokens

Example Response

{
  "id": "abc123",
  "model": "meta/llama-2-7b-chat",
  "object": "chat.completion",
  "created": 1234567890,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}

Streaming

Replicate streaming uses Server-Sent Events (SSE) with the following event types:
Event Type   Description      Data Format
output       Content chunk    Plain text string
done         Completion       JSON: {"reason": ""} (empty = success)
error        Error occurred   JSON: {"detail": "error message"}
Streaming Flow:
  1. Bifrost sets stream: true in prediction input
  2. Replicate returns urls.stream in initial response
  3. Bifrost connects to stream URL and processes SSE events
  4. output events → content deltas
  5. done event → final chunk with finish_reason
Done Event Reasons:
  • Empty or no reason = success (finish_reason: "stop")
  • "canceled" = prediction was canceled
  • "error" = prediction failed

2. Responses API

The Responses API is converted internally to Chat Completions or native Replicate format depending on the model:
// Responses request → Replicate prediction conversion
ResponsesRequest → ReplicatePredictionRequest → ReplicatePredictionResponse → BifrostResponsesResponse
Conversion Logic:
  1. For OpenAI models with gpt-5-structured: Uses native Responses format with input_item_list, tools, and json_schema support
  2. For all other models: Converted to Chat Completions format using message conversion logic
Same parameter mapping and system prompt handling as Chat Completions.
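Assuming the standard OpenAI-compatible route (/v1/responses), a minimal request looks like:
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/meta/llama-2-7b-chat",
    "input": "Hello"
  }'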

Response Format

Responses follow standard Responses API format with status mapping:
Replicate Status   Responses Status
succeeded          completed
failed             failed
canceled           cancelled
processing         in_progress
starting           queued

3. Text Completions (Legacy)

Conversion

  • Prompt array: Joined with newlines into single prompt field
  • top_k: Pass via extra_params (model-specific)

Example

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/meta/llama-2-7b",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.8,
    "top_k": 40
  }'

Response

Same conversion as chat completions: output string/array → choices[0].text, with usage metrics from prediction metrics.

4. Image Generation

Parameter Mapping

{
  "prompt": "prompt",
  "n": "number_of_images",
  "aspect_ratio": "aspect_ratio",
  "resolution": "resolution",
  "output_format": "output_format",
  "quality": "quality",
  "background": "background",
  "seed": "seed",
  "negative_prompt": "negative_prompt",
  "num_inference_steps": "num_inference_steps",
  "input_images": "input_images"
}

Input Image Field Mapping

Important: Different Replicate models expect input images in different fields. Bifrost automatically maps input_images to the correct field based on the model.
Field Mapping by Model:
  • image_prompt: black-forest-labs/flux-1.1-pro, black-forest-labs/flux-1.1-pro-ultra, black-forest-labs/flux-pro, black-forest-labs/flux-1.1-pro-ultra-finetuned
  • input_image: black-forest-labs/flux-kontext-pro, black-forest-labs/flux-kontext-max, black-forest-labs/flux-kontext-dev
  • image: black-forest-labs/flux-dev, black-forest-labs/flux-fill-pro, black-forest-labs/flux-dev-lora, black-forest-labs/flux-krea-dev
  • input_images: All other models (default)
For models that expect a single image field (image_prompt, input_image, image), only the first image from the input_images array is used.

Example

curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/black-forest-labs/flux-schnell",
    "prompt": "A serene mountain landscape at sunset",
    "aspect_ratio": "16:9",
    "output_format": "webp",
    "num_inference_steps": 4,
    "seed": 42
  }'
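A request that supplies a source image might look like this (the image URL is illustrative); per the field mapping above, Bifrost places the first input_images entry into this model's input_image field:
curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replicate/black-forest-labs/flux-kontext-pro",
    "prompt": "Make the sky a deep purple",
    "input_images": ["https://example.com/source.jpg"]
  }'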

Response Conversion

Replicate output can be:
  • Single URL: String → data[0].url
  • Multiple URLs: Array → data[i].url for each image
  • Data URIs: Base64-encoded images in data URI format
{
  "id": "xyz789",
  "created": 1234567890,
  "model": "black-forest-labs/flux-schnell",
  "data": [
    {
      "url": "https://replicate.delivery/pbxt/...",
      "index": 0
    }
  ],
  "usage": {
    "input_tokens": 15,
    "output_tokens": 0,
    "total_tokens": 15
  }
}

Streaming

Image generation streaming provides progressive image updates as data URIs.
SSE Events:
  • output: Data URI chunk (partial image)
  • done: Final completion with reason
  • error: Error details
Flow:
  1. Each output event contains a complete data URI (e.g., data:image/webp;base64,...)
  2. Progressive refinement shows generation progress
  3. done event signals completion with final image
  4. Each chunk includes Index, ChunkIndex, and B64JSON fields
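Upstream, an image stream looks roughly like this (data URIs truncated for brevity):
event: output
data: data:image/webp;base64,UklGRi4A...

event: output
data: data:image/webp;base64,UklGRl5C...

event: done
data: {"reason": ""}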

5. Files API

Replicate’s Files API supports uploading, listing, and managing files for use in predictions.

Upload

Request: Multipart form-data
Field          Type     Required   Notes
file           binary   Yes        File content
filename       string   No         Custom filename
content_type   string   No         MIME type (auto-detected from extension)
Example:
curl -X POST http://localhost:8080/v1/files \
  -H "Authorization: Bearer $API_KEY" \
  -F "[email protected]" \
  -F "filename=my-document.pdf"
Response:
{
  "id": "file_abc123",
  "object": "file",
  "bytes": 12345,
  "created_at": 1234567890,
  "filename": "my-document.pdf",
  "purpose": "batch",
  "status": "processed"
}

List Files

Query Parameters:
Parameter   Type     Notes
limit       int      Results per page
after       string   Pagination cursor
Example:
curl -X GET "http://localhost:8080/v1/files?limit=20" \
  -H "Authorization: Bearer $API_KEY"
Pagination: Replicate uses cursor-based pagination with a next URL in the response; Bifrost serializes this into the after cursor.
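Subsequent pages pass the cursor back (cursor value illustrative):
curl -X GET "http://localhost:8080/v1/files?limit=20&after=cD0yMDI0..." \
  -H "Authorization: Bearer $API_KEY"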

Retrieve / Delete

Operations:
  • GET /v1/files/{file_id} - Retrieve file metadata
  • DELETE /v1/files/{file_id} - Delete file
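Example:
curl -X GET http://localhost:8080/v1/files/file_abc123 \
  -H "Authorization: Bearer $API_KEY"

curl -X DELETE http://localhost:8080/v1/files/file_abc123 \
  -H "Authorization: Bearer $API_KEY"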

File Content Download

Replicate requires signed download URLs with owner, expiry, and signature parameters.
Required Parameters in ExtraParams:
Parameter   Type     Description
owner       string   File owner username
expiry      int64    Unix timestamp for expiration
signature   string   Base64-encoded HMAC-SHA256 signature

Signature Format: HMAC-SHA256 of "{owner} {file_id} {expiry}" using the Files API signing secret.
Example:
curl -X POST http://localhost:8080/v1/files/file_abc123/content \
  -H "Content-Type: application/json" \
  -d '{
    "owner": "my-username",
    "expiry": 1735689600,
    "signature": "base64-encoded-signature"
  }'
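A sketch of producing the signature value with openssl, assuming you hold the Files API signing secret (values taken from the example above):
# HMAC-SHA256 over "{owner} {file_id} {expiry}", base64-encoded
SECRET="your-files-api-signing-secret"
printf '%s' "my-username file_abc123 1735689600" \
  | openssl dgst -sha256 -hmac "$SECRET" -binary \
  | openssl base64 -A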

6. List Models

Endpoint: /v1/models
List Models returns account-specific deployments only, not all public models on Replicate.
Deployments are private or organization-owned models running on dedicated infrastructure. The response includes:
{
  "data": [
    {
      "id": "replicate/my-org/my-deployment",
      "name": "my-deployment",
      "owner": "my-org"
    }
  ],
  "has_more": false
}
Usage:
  1. List your deployments via this endpoint
  2. Use deployment name as model identifier: replicate/my-org/my-deployment
  3. Predictions route to deployment-specific endpoint: /v1/deployments/my-org/my-deployment/predictions

Extra Parameters

Model-Specific Parameters

The most important feature for Replicate integration is extra_params. Parameters not in Bifrost’s standard schema are flattened directly into the prediction input object.

How It Works

// Request with extra params
{
  "model": "replicate/stability-ai/sdxl",
  "prompt": "A photo of an astronaut",
  "temperature": 0.7,          // Standard param
  "guidance_scale": 7.5,       // Model-specific (extra param)
  "num_inference_steps": 50,   // Model-specific (extra param)
  "scheduler": "DPMSolverMultistep"  // Model-specific (extra param)
}

// Converted to Replicate prediction input
{
  "version": "...",
  "input": {
    "prompt": "A photo of an astronaut",
    "temperature": 0.7,
    "guidance_scale": 7.5,       // Flattened from extra_params
    "num_inference_steps": 50,   // Flattened from extra_params
    "scheduler": "DPMSolverMultistep"  // Flattened from extra_params
  }
}

Discovering Model Parameters

Each Replicate model has unique parameters. To find available parameters:
  1. Model Page: Visit the model on replicate.com
  2. OpenAPI Schema: Available at /v1/models/{owner}/{name}/versions/{version_id} (includes openapi_schema)
  3. Cog Definition: Check the model’s source code (if public)
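For example, the input schema can be pulled straight from Replicate's API and inspected with jq (replace {version_id} with a real version):
curl -s "https://api.replicate.com/v1/models/meta/llama-2-7b-chat/versions/{version_id}" \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  | jq '.openapi_schema.components.schemas.Input.properties'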

Caveats

Severity: Medium
Behavior: Not all models support the system_prompt field. For unsupported models, the system prompt is prepended to the conversation prompt.
Impact: Prompt structure differs between models.
Models Affected: meta/meta-llama-3-8b, meta/llama-2-70b, openai/gpt-oss-20b, openai/o1-mini, xai/grok-4, and all deepseek-ai/deepseek* models
Code: chat.go:300-318

Severity: Medium
Behavior: Different models expect input images in different fields (image_prompt, input_image, image, input_images).
Impact: Bifrost automatically maps input_images to the correct field based on the model.
Models Affected: Flux family models (see Input Image Field Mapping table)
Code: images.go:192-209

Severity: Low
Behavior: Only non-base64 image URLs from message content blocks are extracted to image_input.
Impact: Base64-encoded images in messages are ignored.
Code: chat.go:58-63

Severity: Medium
Behavior: Each model has a unique input schema; standard parameters may not work for all models.
Impact: Requires checking model documentation for available parameters.
Mitigation: Use extra_params for model-specific fields.