Configuration Troubleshooting

Overview

This guide covers common configuration issues, how to diagnose them, and proven solutions.

Quick Diagnostics

Check Environment Variables

# View all environment variables for a service
docker compose exec orchestrator env | sort

# Check specific variable
docker compose exec orchestrator env | grep MAX_COST_PER_REQUEST

# Check if variable is set
docker compose exec orchestrator printenv MAX_COST_PER_REQUEST

Verify Configuration Files

# Check if config file exists
docker compose exec orchestrator ls -la ./config/

# View config file contents
docker compose exec orchestrator cat ./config/features.yaml

# Check for syntax errors
docker compose exec orchestrator cat ./config/models.yaml | yq .

Check Service Health

# Gateway health
curl http://localhost:8080/health

# Orchestrator metrics
curl http://localhost:2112/metrics

# Agent Core health
grpcurl -plaintext localhost:50051 list
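
The two HTTP checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `http_ok` is an illustration and not part of Shannon, and the URLs you pass in are the ones from the commands above.

```python
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP error
        return False
```

For example, `http_ok("http://localhost:8080/health")` should return True when the gateway is up, and False when the port is closed or the endpoint answers with an error status.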

Common Issues

1. Services Won’t Start

Missing Environment Variables

Symptoms:

Service crashes immediately
Logs show “variable not set” errors
Container exits with code 1

Diagnosis:

docker compose logs orchestrator | grep -i "not set\|missing\|required"

Solution:

# Check .env file exists
ls -la .env

# Verify required variables are set
grep -E "OPENAI_API_KEY|POSTGRES" .env

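# Copy from example only if .env is missing (-n never overwrites an existing file)
cp -n .env.example .env
nano .env  # Fill in required values

# Recreate services so they pick up the new values (restart alone does not reload .env)
docker compose up -d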

Required Variables:

At least one LLM provider key (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.)
Database credentials (POSTGRES_*)
Redis connection (REDIS_*)
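
The three requirements above can be checked mechanically. This is a rough sketch, not Shannon's own validation: the function names are hypothetical, and the exact variable names inside each group are assumptions, since the list only gives prefixes such as POSTGRES_* and REDIS_*.

```python
import re

# Groups mirror the list above; the concrete names are illustrative assumptions.
LLM_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
REQUIRED_PREFIXES = ("POSTGRES_", "REDIS_")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop inline comments and surrounding quotes
        value = re.sub(r"\s+#.*$", "", value).strip().strip("'\"")
        env[key.strip()] = value
    return env


def find_problems(env: dict[str, str]) -> list[str]:
    """Return one message per requirement group that has no non-empty variable."""
    problems = []
    if not any(env.get(k) for k in LLM_KEYS):
        problems.append("no LLM provider key set")
    for prefix in REQUIRED_PREFIXES:
        if not any(k.startswith(prefix) and v for k, v in env.items()):
            problems.append(f"no {prefix}* variable set")
    return problems
```

Feeding it the contents of .env, for example `find_problems(parse_env(open(".env").read()))`, returns an empty list when every group has at least one non-empty value.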

Invalid Configuration Syntax

Symptoms:

“Failed to parse config” errors
YAML syntax errors
Service fails to start

Diagnosis:

# Check YAML syntax
docker compose exec orchestrator cat ./config/features.yaml | yq .

Solution:

# Validate YAML locally
yq eval '.' ./config/features.yaml

# List top-level keys and list items to spot indentation mistakes
grep -nE "^\s+- |^\w+:" ./config/features.yaml

# Reset to defaults
cp ./config/features.yaml.example ./config/features.yaml
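
One frequent cause of YAML parse failures is a tab character used for indentation, which YAML does not allow. The sketch below only looks for that one problem and is not a full validator; the function name is illustrative.

```python
def find_tab_indents(yaml_text: str) -> list[int]:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-space, non-tab character
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad
```

Running it over the contents of ./config/features.yaml gives the line numbers to re-indent with spaces.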

2. Authentication Failures

Gateway Returns 401 Unauthorized

Symptoms:

All requests return 401
“Unauthorized” error
API key rejected

Diagnosis:

# Check if auth is enabled
docker compose exec gateway env | grep GATEWAY_SKIP_AUTH

# Test with curl
curl -v http://localhost:8080/api/v1/tasks \
  -H "X-API-Key: sk_test_123456" 2>&1 | grep "401"

Solution 1: Disable auth for development

# Add to .env
GATEWAY_SKIP_AUTH=1

# Recreate the gateway so it picks up the new value (restart alone does not reload .env)
docker compose up -d gateway

# Test
curl http://localhost:8080/api/v1/tasks

Solution 2: Use valid API key

# Insert API key in database
docker compose exec postgres psql -U shannon -d shannon -c "
INSERT INTO auth.api_keys (key, user_id, tenant_id, name, enabled)
VALUES ('sk_test_123456', gen_random_uuid(), gen_random_uuid(), 'Test Key', true);
"

# Test with key
curl -H "X-API-Key: sk_test_123456" \
  http://localhost:8080/api/v1/tasks

JWT Secret Not Set

Symptoms:

“JWT secret not configured” error
Authentication middleware fails

Solution:

# Generate secure secret
JWT_SECRET=$(openssl rand -base64 32)

# Add to .env
echo "JWT_SECRET=$JWT_SECRET" >> .env

# Recreate the gateway so it picks up the new value (restart alone does not reload .env)
docker compose up -d gateway
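
On a host without openssl, an equivalent secret can be produced with the Python standard library. This is a small sketch with a hypothetical helper name; like the openssl command above, it base64-encodes 32 random bytes.

```python
import base64
import secrets


def generate_secret(num_bytes: int = 32) -> str:
    """Return `num_bytes` of cryptographically secure randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")
```

The returned string can be written to .env as the JWT_SECRET value in the same way as the shell variant.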

3. Database Connection Issues

Cannot Connect to PostgreSQL

Symptoms:

“connection refused” errors
“dial tcp: connect: connection refused”
Services crash on startup

Diagnosis:

# Check if PostgreSQL is running
docker compose ps postgres

# Check PostgreSQL logs
docker compose logs postgres --tail=50

# Test connection
docker compose exec postgres pg_isready -U shannon

Solution 1: PostgreSQL not started

# Start PostgreSQL
docker compose up -d postgres

# Wait until PostgreSQL accepts connections
until docker compose exec postgres pg_isready -U shannon; do sleep 1; done

# Restart dependent services
docker compose restart gateway orchestrator
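
When the wait has to happen from a script on the host, polling the TCP port is a simple approximation. The sketch below is an assumption-laden stand-in for pg_isready: it only proves that something accepts connections on the port, not that PostgreSQL is ready for queries, and the function name is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # Not accepting connections yet; retry shortly
    return False
```

For example, `wait_for_port("localhost", 5432)` returns True as soon as the published PostgreSQL port accepts a connection, or False after 30 seconds.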

Solution 2: Wrong credentials

# Verify .env settings
grep POSTGRES .env

# Should match docker-compose.yml
docker compose exec postgres psql -U shannon -d shannon -c "SELECT 1;"

# If password wrong, update .env and restart
docker compose down
docker compose up -d

Solution 3: Port conflict

# Check if port 5432 is in use
lsof -i :5432

# If conflict, change port in .env
POSTGRES_PORT=5433

# Update docker-compose.yml
# Restart
docker compose down
docker compose up -d

Database Schema Not Initialized

Symptoms:

“table does not exist” errors
“column not found” errors
SQL errors in logs

Solution:

# Run migrations
docker compose exec orchestrator make migrate

# Or reset database (⚠️ DESTRUCTIVE)
docker compose down -v
docker compose up -d

4. Redis Connection Issues

Cannot Connect to Redis

Symptoms:

“connection refused” to Redis
Session state not persisting
Cache misses

Diagnosis:

# Check Redis status
docker compose ps redis

# Test connection
docker compose exec redis redis-cli ping

# Check logs
docker compose logs redis --tail=20

Solution:

# Start Redis
docker compose up -d redis

# Test connection
docker compose exec redis redis-cli ping
# Should return: PONG

# Restart dependent services
docker compose restart gateway orchestrator llm-service

Redis Authentication Failed

Symptoms:

“NOAUTH Authentication required”
Connection works but commands fail

Solution:

# Check if password is set
docker compose exec redis redis-cli CONFIG GET requirepass

# If password required, add to .env
REDIS_PASSWORD=your-password

# Or disable auth (development only)
docker compose exec redis redis-cli CONFIG SET requirepass ""

# Recreate services so they pick up the new .env values (restart alone does not reload them)
docker compose up -d

5. LLM Provider Issues

API Key Invalid or Expired

Symptoms:

“Invalid API key” errors
401 from LLM provider
Tasks fail immediately

Diagnosis:

# Check which provider is configured
docker compose exec llm-service env | grep API_KEY

# Test OpenAI key
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

# Test Anthropic key (the API requires the anthropic-version header)
curl https://api.anthropic.com/v1/messages \
  -H "X-API-Key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"Hi"}]}'

Solution:

# Update key in .env
OPENAI_API_KEY=sk-...new-key...

# Recreate the LLM service so it picks up the new key (restart alone does not reload .env)
docker compose up -d llm-service

# Verify
docker compose logs llm-service | grep "API key"

Rate Limit Exceeded

Symptoms:

429 errors from LLM provider
“Rate limit exceeded” in logs
Tasks timeout or fail

Solution 1: Wait for rate limit reset

# Check rate limit headers
docker compose logs llm-service | grep "rate"

# Reset windows vary by provider; per-minute limits usually clear within about 60 seconds
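
Waiting for the reset can be automated on the calling side with exponential backoff. This is a generic sketch, not how Shannon's LLM service handles 429s; `RateLimited` and `call_with_backoff` are hypothetical names, and `request` stands for whatever function performs your call and raises `RateLimited` on an HTTP 429.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's request function on an HTTP 429 (hypothetical)."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call `request()`; on RateLimited, wait 1s, 2s, 4s, ... (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.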

Solution 2: Configure rate limiting

# Add to .env
RATE_LIMIT_REQUESTS=50  # Lower than provider limit
RATE_LIMIT_WINDOW=60

# Recreate the LLM service so it picks up the new values (restart alone does not reload .env)
docker compose up -d llm-service

Solution 3: Use multiple providers

# Configure fallback providers in models.yaml
providers:
  - id: openai
    primary: true
  - id: anthropic
    fallback: true
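
Conceptually, a primary/fallback setup tries providers in order and moves on when one fails. The sketch below illustrates that idea only; it is not Shannon's routing code, and the function name and the shape of the `providers` argument are assumptions.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, result) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # e.g. a 429 or quota error from that provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

With `[("openai", call_openai), ("anthropic", call_anthropic)]`, where both callables are your own client wrappers, the second provider is used only when the first raises.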

Quota Exceeded

Symptoms:

“insufficient_quota” errors
“You exceeded your current quota”
All LLM calls fail

Solution:

# Check quota
# OpenAI: https://platform.openai.com/account/usage
# Anthropic: https://console.anthropic.com/settings/limits

# Add credits or upgrade plan
# Or use different provider
OPENAI_API_KEY=
ANTHROPIC_API_KEY=sk-ant-...

# Recreate the LLM service so it picks up the new keys (restart alone does not reload .env)
docker compose up -d llm-service

6. Model Configuration Issues

Model Not Found

Symptoms:

“model not found” errors
“invalid model” errors
Tasks fail with model errors

Diagnosis:

# Check configured models
docker compose exec llm-service cat ./config/models.yaml | grep "id:"

# Check environment variables
docker compose exec orchestrator env | grep MODEL

Solution:

# Use valid model IDs in .env
DEFAULT_MODEL_TIER=small
COMPLEXITY_MODEL_ID=gpt-5  # Verify this exists
DECOMPOSITION_MODEL_ID=claude-sonnet-4-20250514

# Or configure in models.yaml
docker compose exec orchestrator cat ./config/models.yaml

# Recreate both services so they pick up the new values (restart alone does not reload .env)
docker compose up -d orchestrator llm-service
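
To cross-check the model IDs in .env against models.yaml, the `grep "id:"` step can be turned into a small comparison. This sketch assumes models are declared on lines of the form `id: <name>` (optionally as list items); the real layout of models.yaml may differ, and both function names are hypothetical.

```python
import re


def configured_model_ids(models_yaml: str) -> set[str]:
    """Collect every value of an `id:` key, similar to `grep "id:"` but normalized."""
    pattern = re.compile(r"^\s*(?:-\s*)?id:\s*['\"]?([^'\"#\s]+)", re.MULTILINE)
    return set(pattern.findall(models_yaml))


def unknown_models(requested: dict[str, str], models_yaml: str) -> dict[str, str]:
    """Return the env entries whose model ID is not listed in models.yaml."""
    known = configured_model_ids(models_yaml)
    return {var: model for var, model in requested.items() if model not in known}
```

Passing `{"COMPLEXITY_MODEL_ID": "gpt-5"}` together with the file contents returns an empty dict when the ID is declared, and the offending entry otherwise.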

7. Budget and Cost Issues

Tasks Exceed Budget

Symptoms:

“Budget exceeded” errors
Tasks fail with cost errors
MAX_COST_PER_REQUEST exceeded

Solution:

# Increase budget limits
# In .env:
MAX_COST_PER_REQUEST=1.00  # Increase from 0.50
MAX_TOKENS_PER_REQUEST=20000  # Increase from 10000

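# Recreate the orchestrator so it picks up the new values (restart alone does not reload .env)
docker compose up -d orchestrator

# Or use cheaper models
DEFAULT_MODEL_TIER=small  # Route to the small, lower-cost tier instead of a larger model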
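
To judge how far to raise the limits, it helps to estimate what a typical request costs. The arithmetic below is a sketch under stated assumptions: the per-1K-token prices are placeholders you supply from your provider's price list, the defaults mirror the 0.50 and 10000 values above, and the function is not how Shannon computes cost internally.

```python
def check_budget(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k,
                 max_cost=0.50, max_tokens=10000):
    """Return (estimated_cost, violations) for a single request."""
    total_tokens = prompt_tokens + completion_tokens
    cost = (prompt_tokens / 1000 * price_in_per_1k
            + completion_tokens / 1000 * price_out_per_1k)
    violations = []
    if total_tokens > max_tokens:
        violations.append(f"tokens {total_tokens} > {max_tokens}")
    if cost > max_cost:
        violations.append(f"cost {cost:.2f} > {max_cost:.2f}")
    return cost, violations
```

A request that reports a token violation but no cost violation suggests raising only MAX_TOKENS_PER_REQUEST; the reverse suggests a cheaper model tier.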

Budget Enforcement Not Working

Symptoms:

Costs exceed limits
No budget errors

Diagnosis:

# Check budget enforcement
docker compose exec orchestrator env | grep LLM_DISABLE_BUDGETS

Solution:

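# Let the orchestrator enforce budgets
LLM_DISABLE_BUDGETS=1  # Turns off the LLM service's own budget checks so the orchestrator is the single enforcer

# Set limits
MAX_COST_PER_REQUEST=0.50
MAX_TOKENS_PER_REQUEST=10000

# Recreate any service whose environment changed (restart alone does not reload .env)
docker compose up -d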

8. Performance Issues

Slow Task Execution

Symptoms:

Tasks take 2-3x expected time
High latency
Timeouts

Diagnosis:

# Check resource usage
docker stats

# Check worker concurrency
docker compose exec orchestrator env | grep WORKER

# Check tool parallelism
docker compose exec orchestrator env | grep TOOL_PARALLELISM

Solution 1: Increase parallelism

# In .env:
TOOL_PARALLELISM=10  # Increase from 5
WORKER_ACT_CRITICAL=20  # Increase from 10

# Recreate the orchestrator so it picks up the new values (restart alone does not reload .env)
docker compose up -d orchestrator

Solution 2: Enable caching

# In .env:
ENABLE_CACHE=true
CACHE_SIMILARITY_THRESHOLD=0.95

# Recreate the LLM service so it picks up the new values (restart alone does not reload .env)
docker compose up -d llm-service

Solution 3: Optimize model selection

# Use faster models
DEFAULT_MODEL_TIER=small  # Small-tier models generally respond faster than large ones

High Memory Usage

Symptoms:

OOM errors
Container restarts
High swap usage

Diagnosis:

docker stats

Solution:

# Reduce cache sizes
HISTORY_WINDOW_MESSAGES=25  # Reduce from 50
STREAMING_RING_CAPACITY=500  # Reduce from 1000

# Limit tool parallelism
TOOL_PARALLELISM=3  # Reduce from 5

# Recreate services so they pick up the new values (restart alone does not reload .env)
docker compose up -d

9. Streaming Issues

SSE Connection Drops

Symptoms:

SSE stream disconnects
Events stop mid-task
“Connection closed” errors

Solution 1: Increase timeouts

# In nginx/proxy config:
proxy_read_timeout 600s;
proxy_connect_timeout 600s;

# In docker-compose.yml for gateway:
GATEWAY_READ_TIMEOUT=600

Solution 2: Handle reconnection

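# Client-side reconnection
# stream_events() and process() are placeholders for your own SSE client code
import time

while True:
    try:
        for event in stream_events(task_id):
            process(event)
        break  # Stream ended normally: task completed
    except ConnectionError:
        time.sleep(2)  # Brief pause, then reconnect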

Events Not Received

Symptoms:

No events in stream
Empty SSE response
Stream connects but no data

Diagnosis:

# Check if events are being created
docker compose exec postgres psql -U shannon -d shannon -c "
SELECT COUNT(*) FROM event_logs WHERE workflow_id = 'task_abc123';
"

# Check Redis streams
docker compose exec redis redis-cli XLEN "stream:task_abc123"

Solution:

# Verify admin server is running
docker compose ps orchestrator

# Check admin server endpoint
curl http://localhost:8081/health

# Restart orchestrator
docker compose restart orchestrator
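
When a stream connects but looks empty, it is worth checking whether the raw response contains only keep-alive comments rather than events. The sketch below parses raw SSE text following the standard wire format (fields separated by a colon, events terminated by a blank line); the function name is hypothetical, and the event names Shannon emits are not assumed here.

```python
def parse_sse(raw: str) -> list[dict[str, str]]:
    """Split raw SSE text into events; a blank line ends each event."""
    events, current = [], {}
    for line in raw.splitlines():
        if line == "":
            if current:
                events.append(current)
                current = {}
            continue
        if line.startswith(":"):
            continue  # Comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "data" and "data" in current:
            current["data"] += "\n" + value  # Multi-line data is joined with newlines
        else:
            current[field] = value
    return events  # A trailing event without its blank line is dropped
```

An empty result for a non-empty response body means the server sent only comments or an unterminated event, which points at the producer side rather than the client.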

10. Tool Execution Issues

Python Code Execution Fails

Symptoms:

“WASI interpreter not found”
Python code tools fail
Sandbox errors

Solution:

# Download Python WASI interpreter
./scripts/setup_python_wasi.sh

# Or manual download
wget https://github.com/vmware-labs/webassembly-language-runtimes/releases/download/python%2F3.11.4%2B20230908-ba7c2cf/python-3.11.4.wasm
mkdir -p ./wasm-interpreters
mv python-3.11.4.wasm ./wasm-interpreters/

# Verify path in .env
PYTHON_WASI_WASM_PATH=./wasm-interpreters/python-3.11.4.wasm

# Recreate both services so they pick up the new path (restart alone does not reload .env)
docker compose up -d agent-core llm-service

Tool Timeout

Symptoms:

“Tool execution timeout” errors
Tools hang indefinitely
WASI timeout errors

Solution:

# Increase timeouts
WASI_TIMEOUT_SECONDS=120  # Increase from 60
ENFORCE_TIMEOUT_SECONDS=180  # Increase from 90

# Recreate agent-core so it picks up the new values (restart alone does not reload .env)
docker compose up -d agent-core

Configuration Validation

Validate All Settings

#!/bin/bash

echo "=== Shannon Configuration Validation ==="

# Check .env file
if [ ! -f .env ]; then
  echo "❌ .env file not found"
  exit 1
fi
echo "✓ .env file exists"

# Check required variables
required_vars=(
  "POSTGRES_HOST"
  "REDIS_HOST"
  "TEMPORAL_HOST"
)

for var in "${required_vars[@]}"; do
  if grep -q "^${var}=" .env; then
    echo "✓ $var is set"
  else
    echo "❌ $var is missing"
  fi
done

# Check at least one LLM provider
if grep -qE "^(OPENAI|ANTHROPIC|GOOGLE)_API_KEY=.+" .env; then
  echo "✓ LLM provider configured"
else
  echo "❌ No LLM provider API key set"
fi

# Check services are running
echo ""
echo "=== Service Health ==="
services=("postgres" "redis" "temporal" "qdrant" "orchestrator" "agent-core" "llm-service" "gateway")

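# Ask Compose for the running services once, then match names exactly
running=$(docker compose ps --services --status running)

for service in "${services[@]}"; do
  if grep -qx "$service" <<< "$running"; then
    echo "✓ $service is running"
  else
    echo "❌ $service is not running"
  fi
done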

echo ""
echo "=== Endpoint Tests ==="

# Test Gateway
if curl -f -s http://localhost:8080/health > /dev/null; then
  echo "✓ Gateway health check passed"
else
  echo "❌ Gateway health check failed"
fi

# Test Orchestrator metrics
if curl -f -s http://localhost:2112/metrics > /dev/null; then
  echo "✓ Orchestrator metrics available"
else
  echo "❌ Orchestrator metrics failed"
fi

echo ""
echo "=== Configuration Validation Complete ==="

Best Practices

1. Use Environment-Specific Configs

# Development (.env.development)
ENVIRONMENT=dev
DEBUG=true
GATEWAY_SKIP_AUTH=1

# Production (.env.production)
ENVIRONMENT=prod
DEBUG=false
GATEWAY_SKIP_AUTH=0
JWT_SECRET=<secure-secret>
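
A small guard can catch development settings that leak into a production file. This is an illustrative sketch using only the variables shown above; the function name is hypothetical and the checks are not exhaustive.

```python
def production_warnings(env: dict[str, str]) -> list[str]:
    """Flag development-only settings in a production environment mapping."""
    warnings = []
    if env.get("GATEWAY_SKIP_AUTH", "0") == "1":
        warnings.append("GATEWAY_SKIP_AUTH=1 disables authentication")
    if env.get("DEBUG", "false").lower() == "true":
        warnings.append("DEBUG is enabled")
    if not env.get("JWT_SECRET"):
        warnings.append("JWT_SECRET is not set")
    return warnings
```

Run against the parsed contents of .env.production, an empty list means none of these three development defaults are present.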

2. Document Custom Settings

# In .env, add comments
# Custom rate limit for high-volume API
RATE_LIMIT_REQUESTS=500  # Increased for enterprise tier

3. Version Control

# .gitignore
.env
.env.local

# Commit templates
.env.example
.env.template

4. Regular Validation

# Add to CI/CD
./scripts/validate-config.sh

5. Monitor Configuration

# Track configuration changes
git diff .env.example

# Alert on critical changes
# Monitor environment variables in production

Quick Fixes Checklist

When things go wrong, try these in order:

1. Restart all services: docker compose restart
2. Check logs: docker compose logs --tail=50
3. Verify .env file exists and has required variables
4. Test database connection: docker compose exec postgres pg_isready
5. Test Redis: docker compose exec redis redis-cli ping
6. Verify at least one LLM API key is set
7. Check disk space: df -h
8. Check memory: docker stats
9. Full reset (last resort): docker compose down -v && docker compose up -d

Getting Help

If issues persist:

Collect logs:
```
docker compose logs > shannon-logs.txt
```

Export configuration:

docker compose exec orchestrator env | grep -viE "API_KEY|SECRET|PASSWORD|TOKEN" > config.txt

Check GitHub issues: https://github.com/Kocoro-lab/Shannon/issues
Join Discord: https://discord.gg/NB7C2fMcQR
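
Before attaching an exported configuration to an issue, it can be safer to mask secret values than to drop the lines, so that readers can still see which variables are set. A minimal sketch, assuming sensitive keys contain API_KEY, SECRET, PASSWORD, or TOKEN; the pattern and function name are illustrative and may need extending for your setup.

```python
import re

SENSITIVE = re.compile(r"API_KEY|SECRET|PASSWORD|TOKEN", re.IGNORECASE)


def redact_env(env_output: str) -> str:
    """Mask the value of every KEY=VALUE line whose key looks sensitive."""
    lines = []
    for line in env_output.splitlines():
        key, sep, _ = line.partition("=")
        if sep and SENSITIVE.search(key):
            line = f"{key}=<redacted>"
        lines.append(line)
    return "\n".join(lines)
```

Apply it to the output of the env command before saving, and review the result by eye, since a key-name pattern cannot catch every secret.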

Related Documentation

Environment Variables: Complete variable reference
Docker Compose: Docker deployment guide
Troubleshooting: General troubleshooting
Performance Tuning: Performance optimization
