The Hugging Face provider (`core/providers/huggingface`) implements a complex integration that supports multiple inference providers (such as `hf-inference`, `fal-ai`, `cerebras`, and `sambanova`) through a unified interface.
## Overview
The Hugging Face provider implements custom logic for:

- Multiple inference backends: Routes requests to multiple inference providers (see the table below)
- Dynamic model aliasing: Transforms model IDs based on provider-specific mappings
- Heterogeneous request formats: Supports JSON, raw binary, and base64-encoded payloads
- Provider-specific constraints: Handles varying payload limits and format restrictions
## Supported Inference Providers
The Hugging Face provider supports routing to multiple inference backends. Below is the current list of supported providers and their capabilities (as of December 2025):

| Provider | Chat | Embedding | Speech (TTS) | Transcription (ASR) |
|---|---|---|---|---|
| hf-inference | ✅ | ✅ | ❌ | ✅ |
| cerebras | ✅ | ❌ | ❌ | ❌ |
| cohere | ✅ | ❌ | ❌ | ❌ |
| fal-ai | ❌ | ❌ | ✅ | ✅ |
| featherless-ai | ✅ | ❌ | ❌ | ❌ |
| fireworks | ✅ | ❌ | ❌ | ❌ |
| groq | ✅ | ❌ | ❌ | ❌ |
| hyperbolic | ✅ | ❌ | ❌ | ❌ |
| nebius | ✅ | ✅ | ❌ | ❌ |
| novita | ✅ | ❌ | ❌ | ❌ |
| nscale | ✅ | ❌ | ❌ | ❌ |
| ovhcloud-ai-endpoints | ✅ | ❌ | ❌ | ❌ |
| public-ai | ✅ | ❌ | ❌ | ❌ |
| replicate | ❌ | ❌ | ✅ | ✅ |
| sambanova | ✅ | ✅ | ❌ | ❌ |
| scaleway | ✅ | ✅ | ❌ | ❌ |
| together | ✅ | ❌ | ❌ | ❌ |
| z-ai | ✅ | ❌ | ❌ | ❌ |
Provider capabilities may change over time. For the most up-to-date information, refer to the Hugging Face Inference Providers documentation. Checkmarks (✅) indicate capabilities supported by the inference provider itself.
All Chat-supported models automatically support Responses (`v1/responses`) as well via Bifrost's internal conversion logic.

## Model Aliases & Identification
Unlike standard providers where model IDs are direct strings (e.g., `gpt-4`), Hugging Face models in Bifrost are identified by a composite key to route requests to the correct inference backend.
Format: `huggingface/[inference_provider]/[model_id]`
- `inference_provider`: The backend service (e.g., `hf-inference`, `fal-ai`, `cerebras`).
- `model_id`: The actual model identifier on Hugging Face Hub (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`).

Example:

```
huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct
```
This parsing logic is handled in `utils.go` and `models.go`, allowing Bifrost to dynamically route requests based on the model string.
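As a rough illustration, the composite key can be split at the first slash after the inference provider; the helper name below is hypothetical and is not the actual function in `utils.go` (it also assumes the leading `huggingface/` prefix has already been stripped by the router):

```go
package main

import (
	"fmt"
	"strings"
)

// parseHuggingFaceModel is a hypothetical helper that splits a composite model
// string of the form "[inference_provider]/[model_id]" into its two parts.
func parseHuggingFaceModel(model string) (inferenceProvider, modelID string, err error) {
	parts := strings.SplitN(model, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid huggingface model string: %q", model)
	}
	return parts[0], parts[1], nil
}

func main() {
	provider, modelID, err := parseHuggingFaceModel("hf-inference/meta-llama/Meta-Llama-3-8B-Instruct")
	if err != nil {
		panic(err)
	}
	fmt.Println(provider) // hf-inference
	fmt.Println(modelID)  // meta-llama/Meta-Llama-3-8B-Instruct
}
```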
## Request Handling Differences
The Hugging Face provider handles various tasks (Chat, Speech, Transcription) which often require different request structures depending on the underlying inference provider.

### Inference Provider Constraints
Different inference providers have specific limitations and requirements.

#### Payload Limit
The HuggingFace API enforces a 2 MB request body limit across all request types (Chat, Embedding, Speech, Transcription). This constraint applies to:

- JSON request payloads
- Raw audio bytes in transcription requests
- Any other request body data
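A minimal sketch of how such a pre-flight guard might look before a request is dispatched; the constant and function names are illustrative, not the provider's actual identifiers:

```go
package huggingface

import "fmt"

// maxRequestBodyBytes mirrors the 2 MB limit described above (the constant name
// is an assumption, not taken from the provider's code).
const maxRequestBodyBytes = 2 * 1024 * 1024

// checkPayloadSize is a hypothetical guard that rejects request bodies exceeding
// the Hugging Face API's 2 MB limit before any network call is made.
func checkPayloadSize(body []byte) error {
	if len(body) > maxRequestBodyBytes {
		return fmt.Errorf("request body is %d bytes, exceeding the 2 MB limit", len(body))
	}
	return nil
}
```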
#### fal-ai Audio Format Restrictions
The `fal-ai` provider has strict audio format requirements:
- Supported Format: Only MP3 (`audio/mpeg`) is accepted
- Rejected Formats: WAV (`audio/wav`) and other formats are explicitly rejected
- Encoding: Audio must be provided as a base64-encoded Data URI in the `audio_url` field
See `core/providers/huggingface/transcription.go` for how this restriction is handled.
### Speech (Text-to-Speech)
For Text-to-Speech (TTS) requests, the implementation differs from a standard pipeline request:

- No Pipeline Tag: The `HuggingFaceSpeechRequest` struct does not include a `pipeline_tag` field in the JSON body, even though the model might be tagged as `text-to-speech` on the Hub.
- Structure: see the sketch after this list.
- Implementation: See `core/providers/huggingface/speech.go`.
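A minimal sketch of what such a request struct could look like; the field name and JSON tag are assumptions for illustration, not the actual definition in `types.go`. The only detail taken from the source is that no `pipeline_tag` field is serialized:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HuggingFaceSpeechRequest is sketched with an assumed field; the real struct
// lives in core/providers/huggingface/types.go. Note that there is deliberately
// no pipeline_tag field, matching the behavior described above.
type HuggingFaceSpeechRequest struct {
	Inputs string `json:"inputs"` // text to synthesize (assumed field name)
}

func main() {
	body, _ := json.Marshal(HuggingFaceSpeechRequest{Inputs: "Hello from Bifrost"})
	fmt.Println(string(body)) // {"inputs":"Hello from Bifrost"}
}
```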
### Transcription (Automatic Speech Recognition)
The Transcription implementation (`core/providers/huggingface/transcription.go`) exhibits a “pattern-breaking” behavior where the request format changes significantly based on the inference provider.
#### 1. hf-inference (Raw Bytes)
When using the standard `hf-inference` provider, the API expects the raw audio bytes directly in the request body, not a JSON object.
- Content-Type: Audio MIME type (e.g., `audio/mpeg`).
- Body: Raw binary data from `request.Input.File`.
- Payload Limit: Maximum 2 MB for the raw audio bytes.
- Logic: see the sketch after this list.
- URL Pattern: `/hf-inference/models/{model_name}` (no `/pipeline/` suffix for ASR).
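A sketch of how such a raw-bytes request could be assembled; the base URL and function name are assumptions for illustration, not taken from the provider's code:

```go
package huggingface

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// buildASRRequest sketches the hf-inference transcription call described above:
// the raw audio bytes are the entire request body and Content-Type carries the
// audio MIME type. The base URL shown here is an assumption.
func buildASRRequest(ctx context.Context, apiKey, modelName, mimeType string, audio []byte) (*http.Request, error) {
	if len(audio) > 2*1024*1024 {
		return nil, fmt.Errorf("audio payload exceeds the 2 MB limit")
	}
	url := fmt.Sprintf("https://router.huggingface.co/hf-inference/models/%s", modelName)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(audio))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", mimeType) // e.g. audio/mpeg
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}
```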
#### 2. fal-ai (JSON with Base64 Data URI)
When using `fal-ai` through the Hugging Face provider, the API expects a JSON body containing the audio as a base64-encoded Data URI.
- Content-Type: `application/json`.
- Body: JSON object with an `audio_url` field.
- Audio Format Restriction: Only MP3 (`audio/mpeg`) is supported; WAV files are rejected.
- Encoding: Audio is base64-encoded and prefixed with a Data URI scheme.
- Logic: see the sketch after this list.
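A sketch of how the `fal-ai` body could be built; the struct and helper names are illustrative rather than the actual definitions in `transcription.go`:

```go
package huggingface

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// falAITranscriptionBody sketches the JSON payload described above: the audio is
// delivered as a base64-encoded Data URI in the audio_url field.
type falAITranscriptionBody struct {
	AudioURL string `json:"audio_url"`
}

// buildFalAIBody rejects non-MP3 input and wraps the audio bytes in a
// data:audio/mpeg;base64,... URI, mirroring the constraints listed above.
func buildFalAIBody(mimeType string, audio []byte) ([]byte, error) {
	if mimeType != "audio/mpeg" {
		return nil, fmt.Errorf("fal-ai only accepts audio/mpeg, got %s", mimeType)
	}
	dataURI := "data:audio/mpeg;base64," + base64.StdEncoding.EncodeToString(audio)
	return json.Marshal(falAITranscriptionBody{AudioURL: dataURI})
}
```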
#### Dual Fields in `types.go`
To support these divergent requirements, the `HuggingFaceTranscriptionRequest` struct in `types.go` contains fields for both scenarios, which are used mutually exclusively:
- `Inputs`: Used when a JSON body is sent with the raw bytes (most providers except `hf-inference` and `fal-ai`).
- `AudioURL`: Used exclusively for `fal-ai`; must be a base64-encoded Data URI in MP3 format.
- Note: For `hf-inference`, the entire request body is raw audio bytes; no JSON structure is used at all.
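A sketch of how such a dual-field struct could be declared; the field types, JSON tags, and `omitempty` usage are assumptions, with only the field names `Inputs` and `AudioURL` taken from the description above:

```go
package huggingface

// HuggingFaceTranscriptionRequest is sketched with assumed types and JSON tags;
// the actual definition lives in core/providers/huggingface/types.go. The two
// fields are mutually exclusive: Inputs for providers that take the audio inside
// a JSON body, AudioURL for fal-ai's base64 Data URI format. For hf-inference
// neither is used, because the request body is the raw audio bytes themselves.
type HuggingFaceTranscriptionRequest struct {
	Inputs   string `json:"inputs,omitempty"`
	AudioURL string `json:"audio_url,omitempty"`
}
```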
### Raw JSON Body Handling
While most providers strictly serialize a struct to JSON, the Hugging Face provider's `Transcription` method demonstrates a hybrid approach depending on the inference provider, as shown in the sketch below:
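A condensed sketch of that branching, combining the behaviors described in the subsections above; the function signature and inline structs are illustrative, not the method's actual implementation:

```go
package huggingface

import (
	"encoding/base64"
	"encoding/json"
)

// buildTranscriptionBody sketches the hybrid body construction: raw bytes for
// hf-inference, a base64 Data URI JSON body for fal-ai (MP3 input assumed to be
// validated beforehand), and a plain JSON body for the remaining providers.
func buildTranscriptionBody(provider, mimeType string, audio []byte) (contentType string, body []byte, err error) {
	switch provider {
	case "hf-inference":
		// Raw binary body; Content-Type carries the audio MIME type.
		return mimeType, audio, nil
	case "fal-ai":
		// JSON body with a base64-encoded Data URI in audio_url.
		uri := "data:audio/mpeg;base64," + base64.StdEncoding.EncodeToString(audio)
		body, err = json.Marshal(struct {
			AudioURL string `json:"audio_url"`
		}{AudioURL: uri})
		return "application/json", body, err
	default:
		// JSON body carrying the audio in the inputs field.
		body, err = json.Marshal(struct {
			Inputs string `json:"inputs"`
		}{Inputs: string(audio)})
		return "application/json", body, err
	}
}
```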
### Embedding Requests
For embedding requests, different providers expect different field names:

- Standard providers (most): Use the `input` field
- `hf-inference`: Uses the `inputs` field (plural)

`embedding.go` populates both fields to ensure compatibility across providers, as sketched below.
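A sketch of a request struct that carries both spellings so either backend finds the field it expects; the struct and function names are assumptions, not the actual code in `embedding.go`:

```go
package huggingface

import "encoding/json"

// huggingFaceEmbeddingRequest (hypothetical name) declares both input and inputs
// so that standard providers and hf-inference each find the field they expect.
type huggingFaceEmbeddingRequest struct {
	Input  []string `json:"input,omitempty"`
	Inputs []string `json:"inputs,omitempty"`
}

// newEmbeddingBody populates both fields with the same texts, mirroring the
// compatibility approach described above.
func newEmbeddingBody(texts []string) ([]byte, error) {
	return json.Marshal(huggingFaceEmbeddingRequest{Input: texts, Inputs: texts})
}
```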
### Differences in Inference Provider Constraints
This multi-mode approach allows the provider to support diverse API contracts within a single implementation structure, accommodating:

- Legacy endpoints that expect raw binary data
- Modern JSON APIs with different schema expectations
- Third-party providers (like `fal-ai`) with custom requirements
- Performance optimizations (raw bytes avoid JSON overhead for `hf-inference`)
## Model Discovery & Caching
The provider implements sophisticated model discovery using the Hugging Face Hub API.

### List Models Flow
- Parallel Queries: Fetches models from multiple inference providers concurrently
- Filter by Pipeline Tag: Uses `pipeline_tag` (e.g., `text-to-speech`, `feature-extraction`) to determine supported methods
- Aggregate Results: Combines responses from all providers into a unified list
- Model ID Format: Returns models as `huggingface/{provider}/{model_id}`
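A rough sketch of the fan-out/fan-in pattern this flow implies, with a hypothetical `fetchModels` helper standing in for the actual Hub API call:

```go
package huggingface

import "sync"

// listAllModels fans out one goroutine per inference provider, collects each
// provider's model IDs, and aggregates them into a single list in the
// huggingface/{provider}/{model_id} format. fetchModels is a hypothetical
// helper that would query the Hugging Face Hub API for one provider.
func listAllModels(providers []string, fetchModels func(provider string) []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []string
	)
	for _, p := range providers {
		wg.Add(1)
		go func(provider string) {
			defer wg.Done()
			for _, id := range fetchModels(provider) {
				mu.Lock()
				all = append(all, "huggingface/"+provider+"/"+id)
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return all
}
```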
### Provider Model Mapping Cache
The provider maintains a cache (`modelProviderMappingCache`) to map Hugging Face model IDs to provider-specific model identifiers:
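The internal layout of that cache isn't shown here; a minimal mutex-guarded map along these lines illustrates the idea (everything other than the cache's purpose and the name `modelProviderMappingCache` is an assumption):

```go
package huggingface

import "sync"

// modelProviderMappingCache maps a Hugging Face Hub model ID to the identifier a
// specific inference provider expects. The map shape and locking shown here are
// assumptions; only the cache's purpose is taken from the description above.
var (
	cacheMu                   sync.RWMutex
	modelProviderMappingCache = map[string]map[string]string{} // hubModelID -> provider -> providerModelID
)

// lookupProviderModelID returns the provider-specific model ID, if cached.
func lookupProviderModelID(hubModelID, provider string) (string, bool) {
	cacheMu.RLock()
	defer cacheMu.RUnlock()
	id, ok := modelProviderMappingCache[hubModelID][provider]
	return id, ok
}
```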
## Best Practices
When working with the Hugging Face provider:

- Check Payload Size: Ensure request bodies are under 2 MB
- Audio Format: Use MP3 for `fal-ai`; avoid WAV files
- Model Aliases: Always specify the provider in the model string: `huggingface/{provider}/{model}`
- Error Handling: Implement retries for 404 errors (cache invalidation scenarios)
- Provider Selection: Use `auto` for automatic provider selection based on model capabilities
- Pipeline Tags: Verify the model's `pipeline_tag` matches your use case (chat, embedding, TTS, ASR)

