
Quick Diagnostics

Before diving into specific issues, run these quick checks:
# Check all services are running
docker compose ps

# View recent logs from all services
docker compose logs --tail=50

# Check specific service health
curl http://localhost:8080/health
curl http://localhost:8000/health  # LLM Service
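The same quick checks can be scripted. A minimal health poller, assuming the `/health` endpoints above return HTTP 200 when a service is up (adjust ports if you remapped them in docker-compose.yml):

```python
# Poll the Shannon health endpoints shown above.
from urllib.request import urlopen
from urllib.error import URLError

ENDPOINTS = {
    "gateway": "http://localhost:8080/health",
    "llm-service": "http://localhost:8000/health",
}

def check_health(url: str, timeout: float = 2.0):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (URLError, OSError):
        return None

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        status = check_health(url)
        print(f"{name}: {'OK' if status == 200 else f'DOWN ({status})'}")
```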

Installation & Setup Issues

Docker Compose Fails to Start

Symptoms:
  • Services won’t start
  • Exit code errors
  • Container crashes immediately
Common Causes:
Cause - Docker daemon not running
Check:
docker info
Solution:
# macOS
open -a Docker

# Linux
sudo systemctl start docker

# Verify
docker info
Cause - Port conflicts
Check which ports are in use:
# Check all Shannon ports
lsof -i :8080  # Gateway
lsof -i :50051 # Agent Core
lsof -i :50052 # Orchestrator
lsof -i :8000  # LLM Service
lsof -i :5432  # PostgreSQL
lsof -i :6379  # Redis
lsof -i :6333  # Qdrant
lsof -i :7233  # Temporal
Solution - Kill conflicting processes:
# Find process using port
lsof -ti :8080

# Kill the process (macOS/Linux)
kill -9 $(lsof -ti :8080)
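If `lsof` isn't available, the port list above can be checked with a small standard-library script. A successful TCP connect means something is already listening on that port:

```python
# Quick scan of the Shannon ports listed above (127.0.0.1 only).
import socket

SHANNON_PORTS = {
    8080: "Gateway", 50051: "Agent Core", 50052: "Orchestrator",
    8000: "LLM Service", 5432: "PostgreSQL", 6379: "Redis",
    6333: "Qdrant", 7233: "Temporal",
}

def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    for port, service in SHANNON_PORTS.items():
        state = "in use" if is_port_in_use(port) else "free"
        print(f"{port:>5} ({service}): {state}")
```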
Solution - Change Shannon ports: Edit docker-compose.yml to use different ports:
gateway:
  ports:
    - "8081:8080"  # Use 8081 instead of 8080
Cause - Insufficient Docker resources
Check Docker resources:
docker system df
docker stats
Solution - Increase Docker resources:
  • macOS: Docker Desktop → Preferences → Resources
    • RAM: Minimum 8GB (16GB recommended)
    • CPUs: Minimum 4 cores
    • Disk: Minimum 20GB free
  • Linux: Edit Docker daemon config
    sudo nano /etc/docker/daemon.json
    
    {
      "default-ulimits": {
        "nofile": {
          "Name": "nofile",
          "Hard": 64000,
          "Soft": 64000
        }
      }
    }
    
Error: WARNING: The OPENAI_API_KEY variable is not set
Solution:
# Create .env from template
make setup

# Or manually
cp .env.example .env

# Add your API keys
echo "OPENAI_API_KEY=sk-..." >> .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
Error: python_wasi/bin/python3.11: No such file or directory
Solution:
# Download and setup Python WASI (20MB)
./scripts/setup_python_wasi.sh

# Verify installation
ls -lh python_wasi/bin/python3.11

API & Connection Issues

401 Unauthorized

Symptoms:
  • HTTP 401 responses
  • “Unauthorized” error messages
Diagnosis:
# Check if auth is enabled
docker compose exec orchestrator env | grep GATEWAY_SKIP_AUTH
Solution - Disable auth (development only). Edit .env:
GATEWAY_SKIP_AUTH=1  # 1 = disabled, 0 = enabled
Restart:
docker compose restart gateway
Test:
curl http://localhost:8080/api/v1/tasks
# Should work without X-API-Key header
Solution - Authenticate requests with an API key:
curl -H "X-API-Key: sk_test_123456" \
  http://localhost:8080/api/v1/tasks
Python SDK:
from shannon import ShannonClient

client = ShannonClient(
    base_url="http://localhost:8080",
    api_key="sk_test_123456"
)

Connection Refused / Service Unavailable

Symptoms:
  • connection refused
  • dial tcp: connect: connection refused
  • Services not responding
Diagnosis:
# Check service status
docker compose ps

# Check specific service logs
docker compose logs orchestrator --tail=50
docker compose logs agent-core --tail=50
docker compose logs llm-service --tail=50

# Test endpoints
curl http://localhost:8080/health
curl http://localhost:50052  # Should fail - gRPC doesn't support HTTP GET
Wait for all services to initialize:
# Watch logs until services are ready
docker compose logs -f

# Look for these messages:
# orchestrator: "gRPC server listening on :50052"
# agent-core: "Server started on :50051"
# llm-service: "Uvicorn running on http://0.0.0.0:8000"
# gateway: "Gateway listening on :8080"
Typical startup time: 30-60 seconds
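The readiness lines above can be checked programmatically. A small helper that scans captured `docker compose logs` output for those messages (the substrings are taken from the log lines shown here; if your build logs differ, adjust the patterns):

```python
# Detect which services have logged their readiness message.
READY_PATTERNS = {
    "orchestrator": "gRPC server listening on :50052",
    "agent-core": "Server started on :50051",
    "llm-service": "Uvicorn running on http://0.0.0.0:8000",
    "gateway": "Gateway listening on :8080",
}

def ready_services(log_text: str) -> set:
    """Return the set of services whose readiness line appears in the logs."""
    return {name for name, pattern in READY_PATTERNS.items()
            if pattern in log_text}

if __name__ == "__main__":
    sample = ("orchestrator  | gRPC server listening on :50052\n"
              "gateway       | Gateway listening on :8080\n")
    up = ready_services(sample)
    print("ready:", sorted(up))
    print("waiting on:", sorted(set(READY_PATTERNS) - up))
```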
Check for crash errors:
docker compose logs orchestrator | grep -i error
docker compose logs orchestrator | grep -i fatal
Restart crashed service:
docker compose restart orchestrator
docker compose restart agent-core
docker compose restart llm-service
Full reset if needed:
docker compose down
docker compose up -d
Check PostgreSQL:
docker compose logs postgres --tail=20

# Test connection
docker compose exec postgres psql -U shannon -d shannon -c "SELECT 1;"
Solution:
# Restart database
docker compose restart postgres

# Wait for it to be ready
docker compose exec postgres pg_isready -U shannon

Task Stuck in RUNNING or QUEUED State

Symptoms:
  • Task never completes
  • Status remains RUNNING for hours
  • No progress updates
Diagnosis:
# Check Temporal workflows
docker compose logs temporal --tail=100

# Check orchestrator worker
docker compose logs orchestrator | grep -i workflow

# View task in Temporal UI
open http://localhost:8088
Check LLM service logs:
docker compose logs llm-service | grep -i "api key\|unauthorized\|quota"
Solution:
# Verify API keys in .env
grep -E "OPENAI_API_KEY|ANTHROPIC_API_KEY" .env

# Test API key
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

# Update .env with valid key
nano .env

# Restart LLM service
docker compose restart llm-service
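The curl key check above can also be done from Python, which is handy in CI. A sketch using only the standard library; it returns True for a valid key, False on a 401/403, and None when the network is unreachable:

```python
# Validate an OpenAI API key against the public models endpoint.
import os
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def openai_key_valid(key: str, timeout: float = 5.0):
    req = Request("https://api.openai.com/v1/models",
                  headers={"Authorization": f"Bearer {key}"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except HTTPError as e:
        # 401/403 means the key itself was rejected.
        return False if e.code in (401, 403) else None
    except (URLError, OSError):
        return None

if __name__ == "__main__":
    print(openai_key_valid(os.environ.get("OPENAI_API_KEY", "")))
```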
Restart Temporal workers:
docker compose restart orchestrator

# Check workflow in Temporal UI
open http://localhost:8088
# Navigate to Workflows → Find your workflow → View execution history
Force workflow termination (last resort):
# In Temporal UI: Workflows → Select workflow → Terminate
Check circuit breaker status:
docker compose logs orchestrator | grep -i "circuit"
Circuit breakers protect against cascading failures:
  • LLM Service circuit breaker
  • Database circuit breaker
  • Redis circuit breaker
Solution - Wait for automatic recovery (30-60 seconds), or restart services:
docker compose restart orchestrator agent-core llm-service

Budget & Cost Issues

Budget Exceeded Errors

Symptoms:
  • budget exceeded error
  • Tasks fail with cost limit errors
  • HTTP 429 (Rate Limited) or Payment Required responses
Diagnosis:
# Check budget configuration
docker compose exec orchestrator env | grep BUDGET
docker compose exec orchestrator env | grep MAX_COST
Edit .env:
MAX_COST_PER_REQUEST=1.00    # Increase from 0.50
MAX_TOKENS_PER_REQUEST=20000  # Increase from 10000
Restart:
docker compose restart orchestrator llm-service
Budgets are configured server-side via environment variables; the SDK does not accept per-request budget parameters. To reduce cost, prefer cheaper execution modes:
# Instead of forcing advanced mode, let Shannon pick:
client.submit_task(query="...")  # mode auto-selected

# Advanced → Standard → Simple (cheapest)
Cost comparison:
  • Simple: 1 LLM call, $0.01-0.05
  • Standard: 3-5 LLM calls, $0.05-0.20
  • Advanced: 10+ LLM calls, $0.20-1.00+
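The bands above can be encoded as a small lookup for pre-flight budget warnings in client code. Illustrative only; actual cost depends on the model and token usage:

```python
# Rough per-task cost bands (USD) from the comparison above.
COST_BANDS = {
    "simple":   (0.01, 0.05),   # 1 LLM call
    "standard": (0.05, 0.20),   # 3-5 LLM calls
    "advanced": (0.20, 1.00),   # 10+ LLM calls (can exceed $1)
}

def fits_budget(mode: str, max_cost: float) -> bool:
    """True if the mode's worst-case band cost stays within max_cost."""
    low, high = COST_BANDS[mode]
    return high <= max_cost

if __name__ == "__main__":
    # With MAX_COST_PER_REQUEST=0.50, advanced-mode tasks may be rejected:
    print(fits_budget("standard", 0.50))  # True
    print(fits_budget("advanced", 0.50))  # False
```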
⚠️ Warning: Only for development/testing
Edit .env:
LLM_DISABLE_BUDGETS=1  # Disable budget checks
Restart:
docker compose restart orchestrator llm-service

Performance Issues

Slow Response Times

Symptoms:
  • Tasks take 2-3x longer than expected
  • High latency
  • Timeouts
Diagnosis:
# Check resource usage
docker stats

# Check for slow queries
docker compose logs postgres | grep "duration:"

# Check Redis latency
docker compose exec redis redis-cli --latency

# Check Qdrant performance
curl http://localhost:6333/metrics
Check resources:
docker stats
# Look for CPU > 80% or Memory near limit
Increase Docker resources:
  • macOS: Docker Desktop → Resources → increase RAM to 16GB, CPUs to 6
  • Linux: More powerful machine or reduce concurrent workflows
Tune worker concurrency in .env:
WORKER_ACT_CRITICAL=5   # Reduce from 10
WORKER_WF_CRITICAL=3     # Reduce from 5
TOOL_PARALLELISM=2       # Reduce from 5
First request is always slower (10-30s). Subsequent requests use caching:
  • LLM response cache (Redis)
  • Session context cache
  • Tool result cache
Solution: Warm up with a test request
curl -X POST http://localhost:8080/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{"query": "Hello"}'
Increase pool size in .env:
DB_MAX_OPEN_CONNS=50    # Increase from 25
DB_MAX_IDLE_CONNS=10    # Increase from 5
Restart:
docker compose restart orchestrator

Tokens > 0 but empty result

Symptoms:
  • Database or logs show non‑zero completion tokens, but the final result text is empty.
  • Complex prompts return nothing while simple prompts work.
Cause:
  • Some GPT‑5 chat responses return content as structured parts instead of a plain string. Older parsing could miss the text. This is fixed by routing GPT‑5 models via the Responses API and defensively normalizing content for chat responses.
Fix (Shannon ≥ 2025‑11‑05):
  • LLM Service routes GPT‑5 models to the Responses API and prefers output_text when available.
  • Chat providers normalize content by joining text parts when a list is returned.
  • If you upgraded from an older build, restart the LLM Service to clear cached empty responses.
Verify:
  • Re‑run a long, multi‑paragraph prompt; the result length should be > 0 and the session history should include the assistant message.
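The symptom itself (tokens counted but nothing returned) is easy to test for in a monitoring script. A sketch, where the field names `completion_tokens` and `result` are illustrative; map them to whatever your task record actually exposes:

```python
# Flag task records that burned completion tokens but produced no text.
def has_empty_result_bug(task: dict) -> bool:
    tokens = task.get("completion_tokens", 0) or 0
    result = (task.get("result") or "").strip()
    return tokens > 0 and len(result) == 0

if __name__ == "__main__":
    print(has_empty_result_bug({"completion_tokens": 512, "result": ""}))    # True
    print(has_empty_result_bug({"completion_tokens": 512, "result": "ok"}))  # False
```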

High Memory Usage

Symptoms:
  • OOM (Out of Memory) errors
  • Container restarts
  • Swap usage high
Diagnosis:
docker stats

# Check session cache size
docker compose logs orchestrator | grep "session.*cache"
Edit config/shannon.yaml or set env vars:
# Reduce session cache
SESSION_CACHE_SIZE=5000  # From 10000

# Reduce history
SESSION_MAX_HISTORY=250  # From 500

# Reduce LRU caches
TOOL_CACHE_SIZE=1000     # From 5000
Restart:
docker compose restart orchestrator agent-core

Data & State Issues

Sessions Not Persisting

Symptoms:
  • Session context lost between requests
  • Agent doesn’t remember previous tasks
Diagnosis:
# Check Redis connectivity
docker compose exec orchestrator nc -zv redis 6379

# Check session data
docker compose exec redis redis-cli KEYS "session:*"
Check Redis status:
docker compose ps redis
docker compose logs redis --tail=20
Restart Redis:
docker compose restart redis
Test connection:
docker compose exec redis redis-cli ping
# Should return "PONG"
Sessions expire after 30 days by default. Increase TTL in .env:
REDIS_TTL_SECONDS=7776000  # 90 days
Check session expiry:
docker compose exec redis redis-cli TTL "session:YOUR_SESSION_ID"
# Returns seconds until expiry, or -1 for no expiry
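To sanity-check TTL values like the 90-day setting above, a tiny converter helps interpret what the Redis TTL command returns:

```python
# Convert a Redis TTL (seconds) into something readable.
def ttl_days(seconds: int) -> float:
    return seconds / 86400  # 60 * 60 * 24

def describe_ttl(ttl: int) -> str:
    """Interpret the return value of the Redis TTL command."""
    if ttl == -1:
        return "no expiry"
    if ttl == -2:
        return "key does not exist"
    return f"expires in {ttl_days(ttl):.1f} days"

if __name__ == "__main__":
    print(ttl_days(7776000))  # 90.0
    print(describe_ttl(-1))   # no expiry
```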
Provide a stable session_id explicitly:
session_id = "user-123-conversation"

handle1 = client.submit_task("Load data", session_id=session_id)
handle2 = client.submit_task("Analyze data", session_id=session_id)

Database Migration Errors

Symptoms:
  • Table doesn’t exist errors
  • Column not found errors
  • Schema version mismatch
Solution:
# Run migrations
docker compose exec orchestrator make migrate

# Or reset database (⚠️ DESTRUCTIVE)
docker compose down -v  # Remove volumes
docker compose up -d

Debugging Tools

Viewing Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f orchestrator
docker compose logs -f agent-core
docker compose logs -f llm-service

# Last N lines
docker compose logs --tail=100 orchestrator

# Search logs
docker compose logs orchestrator | grep -i error
docker compose logs orchestrator | grep "task_id=YOUR_TASK_ID"

Temporal UI

Access: http://localhost:8088
Features:
  • View all workflows
  • See execution history
  • Replay failed workflows
  • Terminate stuck workflows
  • Time-travel debugging
Usage:
  1. Navigate to Workflows
  2. Search by workflow ID (task ID)
  3. View execution history to see where it failed
  4. Check Activity logs for detailed errors

Prometheus Metrics

# Orchestrator metrics
curl http://localhost:2112/metrics

# Agent Core metrics
curl http://localhost:2113/metrics

# LLM Service metrics
curl http://localhost:8000/metrics
Key metrics:
  • tasks_submitted_total
  • tasks_completed_total
  • tasks_failed_total
  • llm_requests_total
  • circuit_breaker_state
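The key metrics above can be pulled out of the Prometheus text format with a minimal parser. This is a sketch: it ignores comment lines, sums across label variants, and does not handle optional trailing timestamps:

```python
# Extract Shannon's key counters from Prometheus text-format output.
KEY_METRICS = {
    "tasks_submitted_total", "tasks_completed_total",
    "tasks_failed_total", "llm_requests_total", "circuit_breaker_state",
}

def parse_key_metrics(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blanks
        name_part, _, value = line.rpartition(" ")
        base = name_part.split("{", 1)[0]  # strip labels, if any
        if base in KEY_METRICS:
            try:
                out[base] = out.get(base, 0.0) + float(value)
            except ValueError:
                pass
    return out

if __name__ == "__main__":
    from urllib.request import urlopen
    from urllib.error import URLError
    try:
        with urlopen("http://localhost:2112/metrics", timeout=2) as resp:
            print(parse_key_metrics(resp.read().decode()))
    except (URLError, OSError):
        print("orchestrator metrics endpoint unreachable")
```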

Real-time Dashboard

Access: http://localhost:2111
Features:
  • Live task execution
  • Event streams
  • Token usage graphs
  • System health

Getting Help

Quick Reference Commands

# Health checks
curl http://localhost:8080/health
curl http://localhost:8000/health

# Service status
docker compose ps
docker stats

# Restart services
docker compose restart orchestrator
docker compose restart agent-core
docker compose restart llm-service

# View logs
docker compose logs -f orchestrator

# Full reset
docker compose down -v
docker compose up -d

# Database access
docker compose exec postgres psql -U shannon -d shannon

# Redis CLI
docker compose exec redis redis-cli

# Check environment
docker compose exec orchestrator env | grep -E "OPENAI|ANTHROPIC"