Monitoring documentation is under development. Core concepts are outlined below.
Overview
Shannon provides comprehensive monitoring and observability features to track task execution, system performance, and resource usage in production environments.
Monitoring Capabilities
Task Monitoring
Track individual task execution:
- Execution status and progress
- Resource consumption
- Error rates and types
- Latency metrics
- Cost tracking
System Monitoring
Monitor Shannon infrastructure:
- Service health status
- API endpoint latency
- Queue depths
- Agent availability
- LLM provider status
Metrics
Task Metrics
| Metric | Description | Unit |
|---|---|---|
| task.latency | End-to-end task completion time | ms |
| task.cost | Total cost per task | USD |
| task.tokens.input | Input tokens consumed | count |
| task.tokens.output | Output tokens generated | count |
| task.iterations | Number of agent iterations | count |
| task.tools.invocations | Tool usage count | count |
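These per-task metrics can also be aggregated client-side, for example to spot latency or cost drift across a batch of tasks. The record layout and the `summarize_tasks` helper below are illustrative sketches, not part of the Shannon API; only the metric names come from the table above.

```python
# Aggregate per-task metric records into summary statistics.
# Record keys mirror the task metrics table; the helper is illustrative.

def summarize_tasks(records):
    """Average latency (ms), total cost (USD), and total tokens for a batch."""
    if not records:
        return {"avg_latency_ms": 0.0, "total_cost_usd": 0.0, "total_tokens": 0}
    return {
        "avg_latency_ms": sum(r["task.latency"] for r in records) / len(records),
        "total_cost_usd": sum(r["task.cost"] for r in records),
        "total_tokens": sum(
            r["task.tokens.input"] + r["task.tokens.output"] for r in records
        ),
    }

records = [
    {"task.latency": 1200, "task.cost": 0.15,
     "task.tokens.input": 900, "task.tokens.output": 300},
    {"task.latency": 800, "task.cost": 0.05,
     "task.tokens.input": 400, "task.tokens.output": 100},
]
summary = summarize_tasks(records)
print(summary["avg_latency_ms"])  # 1000.0
```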
System Metrics
| Metric | Description | Unit |
|---|---|---|
| api.latency | API response time | ms |
| api.requests | Request rate | req/s |
| api.errors | Error rate | errors/s |
| queue.depth | Tasks waiting | count |
| agents.active | Active agent count | count |
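Since api.errors and api.requests are both rates, dividing one by the other gives the fraction of requests that fail, which is the usual input to an error-rate alert. The helper below is illustrative, not part of Shannon:

```python
# Derive an error ratio from the api.errors and api.requests rates above.

def error_ratio(errors_per_s, requests_per_s):
    """Fraction of requests that fail; 0.0 when there is no traffic."""
    return errors_per_s / requests_per_s if requests_per_s else 0.0

print(error_ratio(2.0, 40.0))  # 0.05
```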
Health Checks
API Health
```bash
curl http://localhost:8080/health
```
Response (gateway):
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "time": "2025-01-20T10:00:00Z",
  "checks": {
    "gateway": "ok"
  }
}
```
Readiness (checks orchestrator connectivity):
```bash
curl http://localhost:8080/readiness
```
Response:
```json
{
  "status": "ready",
  "version": "0.1.0",
  "time": "2025-01-20T10:00:02Z",
  "checks": {
    "orchestrator": "ok"
  }
}
```
Component Health
```python
import requests

# Liveness: the gateway itself is up.
health = requests.get("http://localhost:8080/health").json()
print("Gateway:", health.get("status"), health.get("checks"))

# Readiness: the gateway can also reach the orchestrator.
ready = requests.get("http://localhost:8080/readiness").json()
print("Readiness:", ready.get("status"), ready.get("checks"))
```
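A common use of the readiness endpoint is gating startup of dependent services. The sketch below assumes the payload shapes shown in the sample responses; `is_ok` and `wait_until_ready` are illustrative helpers, not part of the Shannon SDK.

```python
import time
import requests

def is_ok(payload, component):
    """True when the given component reports 'ok' in a health/readiness payload."""
    return payload.get("checks", {}).get(component) == "ok"

def wait_until_ready(url="http://localhost:8080/readiness",
                     timeout=30.0, interval=1.0):
    """Poll the readiness endpoint until the orchestrator check passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            payload = requests.get(url, timeout=5).json()
            if payload.get("status") == "ready" and is_ok(payload, "orchestrator"):
                return True
        except requests.RequestException:
            pass  # service not up yet; keep polling
        time.sleep(interval)
    return False
```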
Logging
Log Levels
Shannon uses structured logging with levels:
- DEBUG - Detailed diagnostic information
- INFO - General operational messages
- WARN - Warning conditions
- ERROR - Error conditions
- FATAL - Critical failures
Example log entry:

```json
{
  "timestamp": "2024-10-27T10:00:00Z",
  "level": "INFO",
  "service": "orchestrator",
  "task_id": "task-dev-1730000000",
  "message": "Task submitted",
  "metadata": {
    "mode": "standard",
    "estimated_cost": 0.15
  }
}
```
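Because log entries are structured JSON, they can be filtered programmatically without fragile text matching. `filter_logs` below is an illustrative sketch, assuming one JSON object per line with the level and service fields shown above.

```python
import json

def filter_logs(lines, level="ERROR", service=None):
    """Parse JSON log lines and keep entries at the given level (and service, if set)."""
    out = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if entry.get("level") != level:
            continue
        if service and entry.get("service") != service:
            continue
        out.append(entry)
    return out

lines = [
    '{"level": "INFO", "service": "orchestrator", "message": "Task submitted"}',
    '{"level": "ERROR", "service": "orchestrator", "message": "LLM call failed"}',
]
print(filter_logs(lines)[0]["message"])  # LLM call failed
```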
Dashboards
Task Dashboard
Monitor task execution in real-time:
- Active tasks
- Completion rate
- Average latency
- Error rate
- Cost per hour
System Dashboard
Track system health:
- Service status
- Resource utilization
- Queue lengths
- Provider availability
Alerting
Alert Types
Configure alerts for:
- Task failures
- Budget exceeded
- High latency
- Service degradation
- Rate limiting
Alert Configuration
```yaml
alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    action: notify
    channels: [email, slack]
  - name: budget_warning
    condition: daily_cost > 100
    action: notify
    channels: [email]
  - name: service_down
    condition: health_check_failed
    action: page
    channels: [pagerduty]
```
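To make the semantics of these rules concrete, the sketch below evaluates threshold conditions against a metrics snapshot. The rule schema mirrors the configuration above, but the evaluator is illustrative and is not Shannon's alerting engine.

```python
# Evaluate threshold alert rules against a metrics snapshot (illustrative).

RULES = [
    {"name": "high_error_rate", "metric": "error_rate", "threshold": 0.05},
    {"name": "budget_warning", "metric": "daily_cost", "threshold": 100},
]

def fired_alerts(metrics, rules=RULES):
    """Names of rules whose metric exceeds its threshold in the snapshot."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is not None and value > rule["threshold"]:
            fired.append(rule["name"])
    return fired

print(fired_alerts({"error_rate": 0.08, "daily_cost": 42}))  # ['high_error_rate']
```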
Prometheus Integration
Export metrics to Prometheus (example scrape targets for local dev):
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'orchestrator'
    static_configs:
      - targets: ['localhost:2112']  # Go Orchestrator /metrics
  - job_name: 'agent_core'
    static_configs:
      - targets: ['localhost:2113']  # Rust Agent Core /metrics
  - job_name: 'llm_service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8000']  # Python LLM Service /metrics
```
Available Metrics
```text
# HELP shannon_task_total Total number of tasks
# TYPE shannon_task_total counter
shannon_task_total{status="completed"} 1234
shannon_task_total{status="failed"} 12

# HELP shannon_task_duration_seconds Task execution duration
# TYPE shannon_task_duration_seconds histogram
shannon_task_duration_seconds_bucket{le="1.0"} 100
shannon_task_duration_seconds_bucket{le="5.0"} 450
```
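These samples follow the Prometheus text exposition format. As a minimal illustration of how to read them outside Prometheus, the parser below handles only the simple `name{labels} value` lines shown here, not the full format:

```python
import re

def parse_prometheus(text):
    """Parse simple Prometheus text-format samples into {(name, labels): value}."""
    samples = {}
    pattern = re.compile(r'^(\w+)(\{[^}]*\})?\s+([0-9.eE+-]+)$')
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = pattern.match(line)
        if m:
            name, labels, value = m.groups()
            samples[(name, labels or "")] = float(value)
    return samples

text = '''
# HELP shannon_task_total Total number of tasks
# TYPE shannon_task_total counter
shannon_task_total{status="completed"} 1234
shannon_task_total{status="failed"} 12
'''
samples = parse_prometheus(text)
print(samples[("shannon_task_total", '{status="completed"}')])  # 1234.0
```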
Grafana Dashboards
Pre-built Grafana dashboards for:
- Task analytics
- Cost tracking
- Performance monitoring
- Error analysis
OpenTelemetry
Shannon supports OpenTelemetry for distributed tracing:
```python
from opentelemetry import trace
from shannon import ShannonClient

tracer = trace.get_tracer(__name__)

# Wrap the Shannon calls in a span so they appear in the trace.
with tracer.start_as_current_span("analyze_data"):
    client = ShannonClient()
    handle = client.submit_task(query="Analyze dataset")
    result = client.get_status(handle.task_id)
```
Best Practices
- Set up alerts for critical metrics
- Monitor costs to prevent budget overruns
- Track error patterns to identify issues
- Use distributed tracing for debugging
- Archive logs for compliance
- Create custom dashboards for your use case
- Implement SLOs for reliability
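For the SLO practice, an availability target translates directly into an error budget, i.e. how much unavailability the target tolerates per window. A one-line helper (the 30-day window is an assumption, not a Shannon default):

```python
# Convert an availability SLO into an error budget for a rolling window.

def error_budget_minutes(slo, days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    return (1.0 - slo) * days * 24 * 60

print(round(error_budget_minutes(0.999), 1))  # 43.2 minutes for 99.9% over 30 days
```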
Debugging
Enable Debug Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)

from shannon import ShannonClient

# Now Shannon will output detailed logs
client = ShannonClient()
```
Trace Requests
Use distributed tracing via OpenTelemetry or increase logging verbosity in services. Refer to your observability stack configuration (Jaeger/Tempo) for exporters.
Next Steps