The Hugging Face provider (`core/providers/huggingface`) implements a complex integration that supports multiple inference providers (such as `hf-inference`, `fal-ai`, `cerebras`, and `sambanova`) through a unified interface.
## Overview
The Hugging Face provider implements custom logic for:

- Multiple inference backends: Routes requests to multiple inference providers (see the table below)
- Dynamic model aliasing: Transforms model IDs based on provider-specific mappings
- Heterogeneous request formats: Supports JSON, raw binary, and base64-encoded payloads
- Provider-specific constraints: Handles varying payload limits and format restrictions
## Supported Inference Providers
The Hugging Face provider supports routing to multiple inference backends. Below is the current list of supported providers and their capabilities (as of December 2025):

| Provider | Chat | Embedding | Speech (TTS) | Transcription (ASR) |
|---|---|---|---|---|
| hf-inference | ✅ | ✅ | ❌ | ✅ |
| cerebras | ✅ | ❌ | ❌ | ❌ |
| cohere | ✅ | ❌ | ❌ | ❌ |
| fal-ai | ❌ | ❌ | ✅ | ✅ |
| featherless-ai | ✅ | ❌ | ❌ | ❌ |
| fireworks | ✅ | ❌ | ❌ | ❌ |
| groq | ✅ | ❌ | ❌ | ❌ |
| hyperbolic | ✅ | ❌ | ❌ | ❌ |
| nebius | ✅ | ✅ | ❌ | ❌ |
| novita | ✅ | ❌ | ❌ | ❌ |
| nscale | ✅ | ❌ | ❌ | ❌ |
| ovhcloud-ai-endpoints | ✅ | ❌ | ❌ | ❌ |
| public-ai | ✅ | ❌ | ❌ | ❌ |
| replicate | ❌ | ❌ | ✅ | ✅ |
| sambanova | ✅ | ✅ | ❌ | ❌ |
| scaleway | ✅ | ✅ | ❌ | ❌ |
| together | ✅ | ❌ | ❌ | ❌ |
| z-ai | ✅ | ❌ | ❌ | ❌ |
Provider capabilities may change over time. For the most up-to-date information, refer to the Hugging Face Inference Providers documentation. Checkmarks (✅) indicate capabilities supported by the inference provider itself.
All Chat-supported models automatically support Responses (`v1/responses`) as well via Bifrost's internal conversion logic.

## Model Aliases & Identification
Unlike standard providers where model IDs are direct strings (e.g., `gpt-4`), Hugging Face models in Bifrost are identified by a composite key to route requests to the correct inference backend.
Format: `huggingface/[inference_provider]/[model_id]`
- `inference_provider`: The backend service (e.g., `hf-inference`, `fal-ai`, `cerebras`).
- `model_id`: The actual model identifier on Hugging Face Hub (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`).

Example:

```
huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct
```
This parsing logic is handled in `utils.go` and `models.go`, allowing Bifrost to dynamically route requests based on the model string.
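As a rough illustration, the composite key can be split at the first slash after the inference provider; the helper name below is hypothetical and is not the actual function in `utils.go` (it also assumes the leading `huggingface/` prefix has already been stripped by the router):

```go
package main

import (
	"fmt"
	"strings"
)

// parseHuggingFaceModel is a hypothetical helper that splits a composite model
// string of the form "[inference_provider]/[model_id]" into its two parts.
func parseHuggingFaceModel(model string) (inferenceProvider, modelID string, err error) {
	parts := strings.SplitN(model, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid huggingface model string: %q", model)
	}
	return parts[0], parts[1], nil
}

func main() {
	provider, modelID, err := parseHuggingFaceModel("hf-inference/meta-llama/Meta-Llama-3-8B-Instruct")
	if err != nil {
		panic(err)
	}
	fmt.Println(provider) // hf-inference
	fmt.Println(modelID)  // meta-llama/Meta-Llama-3-8B-Instruct
}
```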
## Request Handling Differences
The Hugging Face provider handles various tasks (Chat, Speech, Transcription) which often require different request structures depending on the underlying inference provider.

### Inference Provider Constraints
Different inference providers have specific limitations and requirements.

#### Payload Limit
The HuggingFace API enforces a 2 MB request body limit across all request types (Chat, Embedding, Speech, Transcription). This constraint applies to:

- JSON request payloads
- Raw audio bytes in transcription requests
- Any other request body data
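A minimal sketch of how such a pre-flight guard might look before a request is dispatched; the constant and function names are illustrative, not the provider's actual identifiers:

```go
package huggingface

import "fmt"

// maxRequestBodyBytes mirrors the 2 MB limit described above (the constant name
// is an assumption, not taken from the provider's code).
const maxRequestBodyBytes = 2 * 1024 * 1024

// checkPayloadSize is a hypothetical guard that rejects request bodies exceeding
// the Hugging Face API's 2 MB limit before any network call is made.
func checkPayloadSize(body []byte) error {
	if len(body) > maxRequestBodyBytes {
		return fmt.Errorf("request body is %d bytes, exceeding the 2 MB limit", len(body))
	}
	return nil
}
```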
#### fal-ai Audio Format Restrictions
The `fal-ai` provider has strict audio format requirements:
- Supported Format: Only MP3 (`audio/mpeg`) is accepted
- Rejected Formats: WAV (`audio/wav`) and other formats are explicitly rejected
- Encoding: Audio must be provided as a base64-encoded Data URI in the `audio_url` field
See `core/providers/huggingface/transcription.go` for how this restriction is handled.
### Speech (Text-to-Speech)
For Text-to-Speech (TTS) requests, the implementation differs from a standard pipeline request:

- No Pipeline Tag: The `HuggingFaceSpeechRequest` struct does not include a `pipeline_tag` field in the JSON body, even though the model might be tagged as `text-to-speech` on the Hub.
- Structure: see the sketch after this list.
- Implementation: See `core/providers/huggingface/speech.go`.
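A minimal sketch of what such a request struct could look like; the field name and JSON tag are assumptions for illustration, not the actual definition in `types.go`. The only detail taken from the source is that no `pipeline_tag` field is serialized:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HuggingFaceSpeechRequest is sketched with an assumed field; the real struct
// lives in core/providers/huggingface/types.go. Note that there is deliberately
// no pipeline_tag field, matching the behavior described above.
type HuggingFaceSpeechRequest struct {
	Inputs string `json:"inputs"` // text to synthesize (assumed field name)
}

func main() {
	body, _ := json.Marshal(HuggingFaceSpeechRequest{Inputs: "Hello from Bifrost"})
	fmt.Println(string(body)) // {"inputs":"Hello from Bifrost"}
}
```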
### Transcription (Automatic Speech Recognition)
The Transcription implementation (`core/providers/huggingface/transcription.go`) exhibits a “pattern-breaking” behavior where the request format changes significantly based on the inference provider.
#### 1. hf-inference (Raw Bytes)
When using the standard `hf-inference` provider, the API expects the raw audio bytes directly in the request body, not a JSON object.
- Content-Type: Audio MIME type (e.g., `audio/mpeg`).
- Body: Raw binary data from `request.Input.File`.
- Payload Limit: Maximum 2 MB for the raw audio bytes.
- Logic: see the sketch after this list.
- URL Pattern: `/hf-inference/models/{model_name}` (no `/pipeline/` suffix for ASR).
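A sketch of how such a raw-bytes request could be assembled; the base URL and function name are assumptions for illustration, not taken from the provider's code:

```go
package huggingface

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// buildASRRequest sketches the hf-inference transcription call described above:
// the raw audio bytes are the entire request body and Content-Type carries the
// audio MIME type. The base URL shown here is an assumption.
func buildASRRequest(ctx context.Context, apiKey, modelName, mimeType string, audio []byte) (*http.Request, error) {
	if len(audio) > 2*1024*1024 {
		return nil, fmt.Errorf("audio payload exceeds the 2 MB limit")
	}
	url := fmt.Sprintf("https://router.huggingface.co/hf-inference/models/%s", modelName)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(audio))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", mimeType) // e.g. audio/mpeg
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}
```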
#### 2. fal-ai (JSON with Base64 Data URI)
When using `fal-ai` through the Hugging Face provider, the API expects a JSON body containing the audio as a base64-encoded Data URI.
- Content-Type: `application/json`.
- Body: JSON object with an `audio_url` field.
- Audio Format Restriction: Only MP3 (`audio/mpeg`) is supported; WAV files are rejected.
- Encoding: Audio is base64-encoded and prefixed with a Data URI scheme.
- Logic: see the sketch after this list.
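A sketch of how the `fal-ai` body could be built; the struct and helper names are illustrative rather than the actual definitions in `transcription.go`:

```go
package huggingface

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// falAITranscriptionBody sketches the JSON payload described above: the audio is
// delivered as a base64-encoded Data URI in the audio_url field.
type falAITranscriptionBody struct {
	AudioURL string `json:"audio_url"`
}

// buildFalAIBody rejects non-MP3 input and wraps the audio bytes in a
// data:audio/mpeg;base64,... URI, mirroring the constraints listed above.
func buildFalAIBody(mimeType string, audio []byte) ([]byte, error) {
	if mimeType != "audio/mpeg" {
		return nil, fmt.Errorf("fal-ai only accepts audio/mpeg, got %s", mimeType)
	}
	dataURI := "data:audio/mpeg;base64," + base64.StdEncoding.EncodeToString(audio)
	return json.Marshal(falAITranscriptionBody{AudioURL: dataURI})
}
```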
#### Dual Fields in `types.go`
To support these divergent requirements, the `HuggingFaceTranscriptionRequest` struct in `types.go` contains fields for both scenarios, which are used mutually exclusively:
- `Inputs`: Used when a JSON body is sent with the raw bytes (most providers except `hf-inference` and `fal-ai`).
- `AudioURL`: Used exclusively for `fal-ai`; must be a base64-encoded Data URI in MP3 format.
- Note: For `hf-inference`, the entire request body is raw audio bytes; no JSON structure is used at all.
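A sketch of how such a dual-field struct could be declared; the field types, JSON tags, and `omitempty` usage are assumptions, with only the field names `Inputs` and `AudioURL` taken from the description above:

```go
package huggingface

// HuggingFaceTranscriptionRequest is sketched with assumed types and JSON tags;
// the actual definition lives in core/providers/huggingface/types.go. The two
// fields are mutually exclusive: Inputs for providers that take the audio inside
// a JSON body, AudioURL for fal-ai's base64 Data URI format. For hf-inference
// neither is used, because the request body is the raw audio bytes themselves.
type HuggingFaceTranscriptionRequest struct {
	Inputs   string `json:"inputs,omitempty"`
	AudioURL string `json:"audio_url,omitempty"`
}
```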
### Raw JSON Body Handling
While most providers strictly serialize a struct to JSON, the Hugging Face provider's `Transcription` method demonstrates a hybrid approach depending on the inference provider, as shown in the sketch below:
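A condensed sketch of that branching, combining the behaviors described in the subsections above; the function signature and inline structs are illustrative, not the method's actual implementation:

```go
package huggingface

import (
	"encoding/base64"
	"encoding/json"
)

// buildTranscriptionBody sketches the hybrid body construction: raw bytes for
// hf-inference, a base64 Data URI JSON body for fal-ai (MP3 input assumed to be
// validated beforehand), and a plain JSON body for the remaining providers.
func buildTranscriptionBody(provider, mimeType string, audio []byte) (contentType string, body []byte, err error) {
	switch provider {
	case "hf-inference":
		// Raw binary body; Content-Type carries the audio MIME type.
		return mimeType, audio, nil
	case "fal-ai":
		// JSON body with a base64-encoded Data URI in audio_url.
		uri := "data:audio/mpeg;base64," + base64.StdEncoding.EncodeToString(audio)
		body, err = json.Marshal(struct {
			AudioURL string `json:"audio_url"`
		}{AudioURL: uri})
		return "application/json", body, err
	default:
		// JSON body carrying the audio in the inputs field.
		body, err = json.Marshal(struct {
			Inputs string `json:"inputs"`
		}{Inputs: string(audio)})
		return "application/json", body, err
	}
}
```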
### Embedding Requests
For embedding requests, different providers expect different field names:

- Standard providers (most): Use the `input` field
- `hf-inference`: Uses the `inputs` field (plural)

`embedding.go` populates both fields to ensure compatibility across providers, as sketched below.
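A sketch of a request struct that carries both spellings so either backend finds the field it expects; the struct and function names are assumptions, not the actual code in `embedding.go`:

```go
package huggingface

import "encoding/json"

// huggingFaceEmbeddingRequest (hypothetical name) declares both input and inputs
// so that standard providers and hf-inference each find the field they expect.
type huggingFaceEmbeddingRequest struct {
	Input  []string `json:"input,omitempty"`
	Inputs []string `json:"inputs,omitempty"`
}

// newEmbeddingBody populates both fields with the same texts, mirroring the
// compatibility approach described above.
func newEmbeddingBody(texts []string) ([]byte, error) {
	return json.Marshal(huggingFaceEmbeddingRequest{Input: texts, Inputs: texts})
}
```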
### Differences in Inference Provider Constraints
This multi-mode approach allows the provider to support diverse API contracts within a single implementation structure, accommodating:

- Legacy endpoints that expect raw binary data
- Modern JSON APIs with different schema expectations
- Third-party providers (like `fal-ai`) with custom requirements
- Performance optimizations (raw bytes avoid JSON overhead for `hf-inference`)
## Model Discovery & Caching
The provider implements sophisticated model discovery using the Hugging Face Hub API.

### List Models Flow
- Parallel Queries: Fetches models from multiple inference providers concurrently
- Filter by Pipeline Tag: Uses `pipeline_tag` (e.g., `text-to-speech`, `feature-extraction`) to determine supported methods
- Aggregate Results: Combines responses from all providers into a unified list
- Model ID Format: Returns models as `huggingface/{provider}/{model_id}`
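A rough sketch of the fan-out/fan-in pattern this flow implies, with a hypothetical `fetchModels` helper standing in for the actual Hub API call:

```go
package huggingface

import "sync"

// listAllModels fans out one goroutine per inference provider, collects each
// provider's model IDs, and aggregates them into a single list in the
// huggingface/{provider}/{model_id} format. fetchModels is a hypothetical
// helper that would query the Hugging Face Hub API for one provider.
func listAllModels(providers []string, fetchModels func(provider string) []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []string
	)
	for _, p := range providers {
		wg.Add(1)
		go func(provider string) {
			defer wg.Done()
			for _, id := range fetchModels(provider) {
				mu.Lock()
				all = append(all, "huggingface/"+provider+"/"+id)
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return all
}
```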
### Provider Model Mapping Cache
The provider maintains a cache (`modelProviderMappingCache`) to map Hugging Face model IDs to provider-specific model identifiers:
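The internal layout of that cache isn't shown here; a minimal mutex-guarded map along these lines illustrates the idea (everything other than the cache's purpose and the name `modelProviderMappingCache` is an assumption):

```go
package huggingface

import "sync"

// modelProviderMappingCache maps a Hugging Face Hub model ID to the identifier a
// specific inference provider expects. The map shape and locking shown here are
// assumptions; only the cache's purpose is taken from the description above.
var (
	cacheMu                   sync.RWMutex
	modelProviderMappingCache = map[string]map[string]string{} // hubModelID -> provider -> providerModelID
)

// lookupProviderModelID returns the provider-specific model ID, if cached.
func lookupProviderModelID(hubModelID, provider string) (string, bool) {
	cacheMu.RLock()
	defer cacheMu.RUnlock()
	id, ok := modelProviderMappingCache[hubModelID][provider]
	return id, ok
}
```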
## Best Practices
When working with the Hugging Face provider:

- Check Payload Size: Ensure request bodies are under 2 MB
- Audio Format: Use MP3 for `fal-ai`; avoid WAV files
- Model Aliases: Always specify the provider in the model string: `huggingface/{provider}/{model}`
- Error Handling: Implement retries for 404 errors (cache invalidation scenarios)
- Provider Selection: Use `auto` for automatic provider selection based on model capabilities
- Pipeline Tags: Verify the model's `pipeline_tag` matches your use case (chat, embedding, TTS, ASR)

