Monitoring documentation is under development. Core concepts are outlined below.

Overview

Shannon provides comprehensive monitoring and observability features to track task execution, system performance, and resource usage in production environments.

Monitoring Capabilities

Task Monitoring

Track individual task execution (a polling sketch follows this list):
  • Execution status and progress
  • Resource consumption
  • Error rates and types
  • Latency metrics
  • Cost tracking
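
Task status can also be polled programmatically. A minimal sketch using the submit_task/get_status calls shown later on this page; the status and cost fields on the returned object are assumptions, not a documented schema:

import time

from shannon import ShannonClient

client = ShannonClient()
handle = client.submit_task(query="Summarize the Q3 report")

# Poll until the task reaches a terminal state (field names illustrative)
while True:
    status = client.get_status(handle.task_id)
    print("status:", status.status, "cost:", getattr(status, "cost", None))
    if status.status in ("completed", "failed"):
        break
    time.sleep(2)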

System Monitoring

Monitor Shannon infrastructure:
  • Service health status
  • API endpoint latency
  • Queue depths
  • Agent availability
  • LLM provider status

Metrics

Task Metrics

Metric | Description | Unit
task.latency | End-to-end task completion time | ms
task.cost | Total cost per task | USD
task.tokens.input | Input tokens consumed | count
task.tokens.output | Output tokens generated | count
task.iterations | Number of agent iterations | count
task.tools.invocations | Tool usage count | count

System Metrics

Metric | Description | Unit
api.latency | API response time | ms
api.requests | Request rate | req/s
api.errors | Error rate | errors/s
queue.depth | Tasks waiting in queue | count
agents.active | Active agent count | count

Health Checks

API Health

curl http://localhost:8080/health
Response (gateway):
{
  "status": "healthy",
  "version": "0.1.0",
  "time": "2025-01-20T10:00:00Z",
  "checks": {
    "gateway": "ok"
  }
}
Readiness (checks orchestrator connectivity):
curl http://localhost:8080/readiness
Response:
{
  "status": "ready",
  "version": "0.1.0",
  "time": "2025-01-20T10:00:02Z",
  "checks": {
    "orchestrator": "ok"
  }
}

Component Health

import requests

# Liveness: is the gateway process up?
health = requests.get("http://localhost:8080/health", timeout=5).json()
print("Gateway:", health.get("status"), health.get("checks"))

# Readiness: can the gateway reach the orchestrator?
ready = requests.get("http://localhost:8080/readiness", timeout=5).json()
print("Readiness:", ready.get("status"), ready.get("checks"))

Logging

Log Levels

Shannon uses structured logging with levels:
  • DEBUG - Detailed diagnostic information
  • INFO - General operational messages
  • WARN - Warning conditions
  • ERROR - Error conditions
  • FATAL - Critical failures

Log Format

{
  "timestamp": "2024-10-27T10:00:00Z",
  "level": "INFO",
  "service": "orchestrator",
  "task_id": "task-dev-1730000000",
  "message": "Task submitted",
  "metadata": {
    "mode": "standard",
    "estimated_cost": 0.15
  }
}
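
To emit logs in the same shape from your own code, a JSON formatter over Python's standard logging is enough. A minimal sketch mirroring the fields above (this is not Shannon's internal logger; the service name is illustrative):

import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        # Mirror the structured fields from the example above
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "my-client",
            "message": record.getMessage(),
            "metadata": getattr(record, "metadata", {}),
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("shannon-client")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Keys passed via extra= become attributes on the log record
logger.info("Task submitted", extra={"metadata": {"mode": "standard"}})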

Dashboards

Task Dashboard

Monitor task execution in real-time:
  • Active tasks
  • Completion rate
  • Average latency
  • Error rate
  • Cost per hour

System Dashboard

Track system health:
  • Service status
  • Resource utilization
  • Queue lengths
  • Provider availability

Alerting

Alert Types

Configure alerts for:
  • Task failures
  • Budget exceeded
  • High latency
  • Service degradation
  • Rate limiting

Alert Configuration

alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    action: notify
    channels: [email, slack]

  - name: budget_warning
    condition: daily_cost > 100
    action: notify
    channels: [email]

  - name: service_down
    condition: health_check_failed
    action: page
    channels: [pagerduty]

Prometheus Integration

Export metrics to Prometheus (example scrape targets for local dev):
# prometheus.yml
scrape_configs:
  - job_name: 'orchestrator'
    static_configs:
      - targets: ['localhost:2112']   # Go Orchestrator /metrics

  - job_name: 'agent_core'
    static_configs:
      - targets: ['localhost:2113']   # Rust Agent Core /metrics

  - job_name: 'llm_service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8000']   # Python LLM Service /metrics
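
Before pointing Prometheus at these targets, you can confirm each endpoint responds. A quick check against the local-dev ports from the config above:

import requests

# Each service exposes Prometheus text-format metrics at /metrics
targets = {
    "orchestrator": "http://localhost:2112/metrics",
    "agent_core": "http://localhost:2113/metrics",
    "llm_service": "http://localhost:8000/metrics",
}
for name, url in targets.items():
    body = requests.get(url, timeout=5).text
    print(name, "->", body.splitlines()[0] if body else "(empty)")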

Available Metrics

# HELP shannon_task_total Total number of tasks
# TYPE shannon_task_total counter
shannon_task_total{status="completed"} 1234
shannon_task_total{status="failed"} 12

# HELP shannon_task_duration_seconds Task execution duration
# TYPE shannon_task_duration_seconds histogram
shannon_task_duration_seconds_bucket{le="1.0"} 100
shannon_task_duration_seconds_bucket{le="5.0"} 450
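
Once scraped, these metrics can be queried with PromQL through Prometheus's HTTP API. A sketch computing the task failure ratio over a five-minute window (assumes Prometheus is listening on localhost:9090):

import requests

# Failed tasks as a share of all tasks, over a 5m window
query = (
    'sum(rate(shannon_task_total{status="failed"}[5m]))'
    ' / sum(rate(shannon_task_total[5m]))'
)
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": query},
    timeout=5,
).json()
for sample in resp["data"]["result"]:
    print("failure ratio:", sample["value"][1])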

Grafana Dashboards

Pre-built Grafana dashboards for:
  • Task analytics
  • Cost tracking
  • Performance monitoring
  • Error analysis

OpenTelemetry

Shannon supports OpenTelemetry for distributed tracing:
from opentelemetry import trace
from shannon import ShannonClient

tracer = trace.get_tracer(__name__)

# Wrap the Shannon calls in a span so they show up in the trace
with tracer.start_as_current_span("analyze_data"):
    client = ShannonClient()
    handle = client.submit_task(query="Analyze dataset")
    result = client.get_status(handle.task_id)
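
For those spans to leave the process, configure a tracer provider with an exporter before creating spans. A minimal sketch using the standard OpenTelemetry Python SDK with an OTLP exporter; the endpoint assumes a local collector, Jaeger, or Tempo listening on port 4317:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans to any OTLP-compatible backend (Collector, Jaeger, Tempo)
provider = TracerProvider(
    resource=Resource.create({"service.name": "shannon-client"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)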

Best Practices

  1. Set up alerts for critical metrics
  2. Monitor costs to prevent budget overruns
  3. Track error patterns to identify issues
  4. Use distributed tracing for debugging
  5. Archive logs for compliance
  6. Create custom dashboards for your use case
  7. Implement SLOs for reliability

Debugging

Enable Debug Logging

import logging
logging.basicConfig(level=logging.DEBUG)

# Now Shannon will output detailed logs
client = ShannonClient()

Trace Requests

Use distributed tracing via OpenTelemetry (see above) or raise logging verbosity in individual services. Refer to your observability stack's configuration (Jaeger or Tempo) for exporter setup.
