This page covers the most common problems encountered when deploying Bifrost with Helm, along with diagnostic commands and fixes.

Pod Not Starting

Quick diagnostics

# Show pod status
kubectl get pods -l app.kubernetes.io/name=bifrost

# Show pod events (most useful first step)
kubectl describe pod -l app.kubernetes.io/name=bifrost

# Show pod logs (use --previous if the pod has already crashed)
kubectl logs -l app.kubernetes.io/name=bifrost
kubectl logs -l app.kubernetes.io/name=bifrost --previous

Image pull errors (ErrImagePull / ImagePullBackOff)

# Check which image is being pulled
kubectl describe pod -l app.kubernetes.io/name=bifrost | grep "Image:"

# Verify imagePullSecrets are attached
kubectl get pod -l app.kubernetes.io/name=bifrost -o jsonpath='{.items[0].spec.imagePullSecrets}'

# Test secret manually
kubectl get secret <pull-secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
Common causes:
  • image.tag not set — the chart requires it; the pod will not start without it
  • Pull secret missing or expired (ECR tokens expire after 12 hours)
  • Incorrect image.repository for enterprise registry
# Fix: set the correct tag
helm upgrade bifrost bifrost/bifrost --reuse-values --set image.tag=v1.4.11
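
The same fix can be pinned in a values file. The `image.repository` and `image.tag` fields are referenced above; the `imagePullSecrets` shape below is the common chart convention and is an assumption here, so confirm it against the chart's values.yaml:

```yaml
# values.yaml sketch (verify field names against the chart's values.yaml)
image:
  repository: your-registry.example.com/bifrost   # enterprise registry, if used
  tag: v1.4.11                                    # required; the pod will not start without it
imagePullSecrets:
  - name: <pull-secret-name>
```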

PVC not binding (Pending)

# Check PVC status
kubectl get pvc -l app.kubernetes.io/instance=bifrost

# Show binding events
kubectl describe pvc -l app.kubernetes.io/instance=bifrost
Common causes:
  • No Persistent Volume provisioner in the cluster
  • storageClass set to a class that doesn’t exist
  • ReadWriteOnce access mode with multiple replicas (SQLite PVCs are single-node)
# List available storage classes
kubectl get storageclass

# Fix: pin to a valid storage class
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.storageClass=standard

ConfigMap / Secret errors

# View the generated ConfigMap (contains rendered config.json)
kubectl get configmap bifrost-config -o yaml

# View secrets the pod depends on
kubectl get secret -l app.kubernetes.io/instance=bifrost

# Decode a specific secret value
kubectl get secret bifrost-encryption -o jsonpath='{.data.key}' | base64 -d

CrashLoopBackOff

# Get last log lines before the crash
kubectl logs -l app.kubernetes.io/name=bifrost --previous --tail=50

# Common causes shown in logs:
# "encryption key is required" → bifrost.encryptionKey or encryptionKeySecret not set
# "failed to connect to database" → see Database section below
# "image.tag is required" → set image.tag in values

Database Connection Issues

Embedded PostgreSQL

# Check if the PostgreSQL pod is running
kubectl get pods -l app.kubernetes.io/name=bifrost-postgresql

# Connect directly to inspect the database
kubectl exec -it deployment/bifrost-postgresql -- psql -U bifrost -d bifrost

# Test connectivity from the Bifrost pod
kubectl exec -it deployment/bifrost -- nc -zv bifrost-postgresql 5432

# Check PostgreSQL logs
kubectl logs deployment/bifrost-postgresql --tail=50

External PostgreSQL

# Test connectivity from within the cluster
kubectl run pg-test --image=postgres:16-alpine --rm -it --restart=Never -- \
  psql "host=your-db-host dbname=bifrost user=bifrost sslmode=require"

# Verify the secret value is correct
kubectl get secret postgres-credentials -o jsonpath='{.data.password}' | base64 -d

# Check that the external host/port is reachable
kubectl exec -it deployment/bifrost -- nc -zv your-db-host 5432
Common causes:
  • sslMode: disable when the database requires SSL — set sslMode: require
  • Password in secret doesn’t match the database user
  • Network policy blocking pod → database traffic
  • Database not UTF8 encoded (see PostgreSQL UTF8 Requirement)
# Fix: update the secret and restart
kubectl create secret generic postgres-credentials \
  --from-literal=password='correct-password' \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl rollout restart deployment/bifrost
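
For reference, a values-file sketch of an external-database block covering the causes above. The field names here are assumptions based on the options mentioned in this section (`sslMode`, the `postgres-credentials` secret); check the chart's values.yaml for the exact keys:

```yaml
# External PostgreSQL values sketch (field names are assumptions)
database:
  host: your-db-host
  port: 5432
  name: bifrost
  user: bifrost
  sslMode: require                      # not "disable" if the server enforces SSL
  existingSecret: postgres-credentials  # must hold the actual database password
  existingSecretKey: password
```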

Ingress Not Working

# Check ingress resource status
kubectl describe ingress bifrost

# Check if the ingress controller is running
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# View ingress controller logs for routing errors
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

# Verify DNS resolves to the correct load balancer IP
nslookup bifrost.yourdomain.com
kubectl get ingress bifrost -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Test without TLS first
curl -v http://bifrost.yourdomain.com/health
Common causes:
  • ingress.className not set or set to a class not installed in the cluster
  • TLS certificate not issued yet (cert-manager issuance can take anywhere from seconds to several minutes, depending on the challenge type)
  • Service port mismatch — Bifrost listens on 8080 by default
# Check cert-manager certificate status
kubectl get certificate -l app.kubernetes.io/instance=bifrost
kubectl describe certificate bifrost-tls
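
A minimal ingress values sketch tying these checks together. `ingress.className` is referenced above; the rest follows common chart conventions and should be verified against the chart's values.yaml:

```yaml
# Ingress values sketch (structure is the common chart convention, not verified)
ingress:
  enabled: true
  className: nginx                 # must match an IngressClass installed in the cluster
  hosts:
    - host: bifrost.yourdomain.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: bifrost-tls      # the cert-manager Certificate checked above
      hosts:
        - bifrost.yourdomain.com
```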

Secret and Credential Issues

Provider API key not resolving

If Bifrost logs show env.OPENAI_API_KEY: not set or similar:
# Check the env var is present in the running pod
kubectl exec -it deployment/bifrost -- env | grep OPENAI

# Verify the providerSecrets secret exists with the right key
kubectl get secret provider-api-keys -o yaml

# Check the providerSecrets configuration rendered correctly
kubectl get configmap bifrost-config -o yaml | grep -A5 providers
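
If the chart exposes a generic extra-environment hook (an assumption; the `extraEnv` name below is hypothetical, though many charts provide one), a provider key can be wired in explicitly with standard Kubernetes `secretKeyRef` syntax:

```yaml
# Hypothetical extraEnv hook; the secretKeyRef shape itself is standard Kubernetes
extraEnv:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: provider-api-keys   # secret checked above
        key: OPENAI_API_KEY       # must match a data key in that secret
```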

Encryption key issues

# Verify the secret exists and contains the right key name
kubectl get secret bifrost-encryption -o yaml

# Check the exact key name matches encryptionKeySecret.key in values
# Default key name is "encryption-key" — if you used "key", set:
#   bifrost.encryptionKeySecret.key: "key"
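
A values sketch matching the note above, pointing the chart at a secret whose data key is "key" rather than the default "encryption-key" (the `name` field is an assumption; confirm the structure in the chart's values.yaml):

```yaml
bifrost:
  encryptionKeySecret:
    name: bifrost-encryption
    key: "key"            # default is "encryption-key"; override to match your secret
```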

High Memory Usage

# Check current resource usage
kubectl top pods -l app.kubernetes.io/name=bifrost

# Check if OOM kills are happening
kubectl describe pod -l app.kubernetes.io/name=bifrost | grep -A3 "OOMKilled\|Limits"

# View resource requests/limits on running pods
kubectl get pod -l app.kubernetes.io/name=bifrost \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
Increase resource limits:
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set resources.limits.memory=4Gi \
  --set resources.requests.memory=1Gi
Tune Go runtime (see Docker Tuning):
env:
  - name: GOGC
    value: "200"          # run GC less often
  - name: GOMEMLIMIT
    value: "3500MiB"      # hard memory ceiling slightly below the container limit

High CPU Usage / Latency

# Check CPU usage
kubectl top pods -l app.kubernetes.io/name=bifrost

# Check if HPA is scaling correctly
kubectl get hpa bifrost
kubectl describe hpa bifrost
Common causes:
  • initialPoolSize too small — goroutines queuing up; increase to 500-1000
  • dropExcessRequests: false with a small pool — queue depth grows without bound
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set bifrost.client.initialPoolSize=1000 \
  --set bifrost.client.dropExcessRequests=true

Autoscaling Issues

HPA not scaling

# Check HPA status and current metrics
kubectl describe hpa bifrost

# Verify metrics server is installed
kubectl top nodes
kubectl top pods

# Common fix: metrics server not installed
# Install with:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Pods scaling down too aggressively (drops active SSE streams)

The default scaleDown.stabilizationWindowSeconds: 300 and preStop sleep of 15 seconds should prevent this. If streams are still being cut:
terminationGracePeriodSeconds: 120   # increase if streams run longer than 105s

autoscaling:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # wait 10 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300           # remove at most 1 pod per 5 min

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 30"]  # give load balancer more time to drain
helm upgrade bifrost bifrost/bifrost --reuse-values -f graceful-shutdown-values.yaml

SQLite / PVC Issues

StatefulSet migration (upgrading from chart < v2.0.0)

Older chart versions used a Deployment + manual PVC. v2.0.0 moved SQLite to a StatefulSet. If upgrading:
# 1. Scale down the old deployment
kubectl scale deployment bifrost --replicas=0

# 2. Note the existing PVC name
kubectl get pvc

# 3. Upgrade, pointing at the existing claim
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.existingClaim=<your-old-pvc-name> \
  --set image.tag=v1.4.11

Data lost after upgrade

# Check if PVCs still exist (they persist after helm uninstall)
kubectl get pvc -l app.kubernetes.io/instance=bifrost

# Re-attach by setting existingClaim
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.existingClaim=<pvc-name>

Cluster Mode Issues

Peers not discovering each other

# Check gossip port is reachable between pods
kubectl exec -it bifrost-0 -- nc -zv bifrost-1.bifrost-headless 7946

# View gossip-related log lines
kubectl logs -l app.kubernetes.io/name=bifrost --tail=100 | grep -i gossip

# Check the headless service exists
kubectl get svc bifrost-headless
For Kubernetes-based discovery, verify the service account has pod list permissions:
kubectl auth can-i list pods --as=system:serviceaccount:default:bifrost
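
If that check fails, a minimal Role and RoleBinding grant the access. This is standard Kubernetes RBAC; the names below assume a release called "bifrost" in the "default" namespace, so adjust to your install:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: bifrost-peer-discovery
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: bifrost-peer-discovery
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: bifrost-peer-discovery
subjects:
  - kind: ServiceAccount
    name: bifrost          # the service account checked with "kubectl auth can-i"
    namespace: default
```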

Useful Diagnostic Commands

# Full state dump for a support ticket
kubectl get all -l app.kubernetes.io/instance=bifrost
kubectl describe pod -l app.kubernetes.io/name=bifrost > pod-describe.txt
kubectl logs -l app.kubernetes.io/name=bifrost --tail=200 > pod-logs.txt

# View the full rendered config.json
kubectl get configmap bifrost-config -o jsonpath='{.data.config\.json}' | jq .

# Check current Helm values (shows all overrides)
helm get values bifrost

# Check Helm release status
helm status bifrost

# View Helm release history
helm history bifrost

Still Stuck?