
Overview

Shannon provides comprehensive cost control features to prevent unexpected LLM charges and optimize spending. With built-in budget enforcement and intelligent routing, teams often see 60–90% cost reductions versus naive implementations, depending on workload.

Setting Budgets

Budgets are configured at the platform level (not per request via REST). Use environment variables in .env:
# LLM service budget guards
MAX_TOKENS_PER_REQUEST=10000    # Max tokens per request
MAX_COST_PER_REQUEST=0.50       # Max cost per request (USD)

# Apply changes
docker compose restart
Shannon enforces budgets during execution. When limits are reached, the system halts further spending and returns the best available result or an error depending on context.
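The enforcement can be pictured as a simple pre-flight guard that reads the same env vars as the .env above. This is an illustrative sketch, not Shannon's actual internals:

```python
import os

# Budget limits, read from the same env vars configured in .env
MAX_TOKENS = int(os.environ.get("MAX_TOKENS_PER_REQUEST", "10000"))
MAX_COST = float(os.environ.get("MAX_COST_PER_REQUEST", "0.50"))

def check_budget(tokens_so_far: int, cost_so_far: float) -> bool:
    """Return True if the request may keep spending, False if it must halt."""
    return tokens_so_far < MAX_TOKENS and cost_so_far < MAX_COST

# A request at 9,500 tokens and $0.48 may continue...
assert check_budget(9_500, 0.48)
# ...but hitting either ceiling halts further spending
assert not check_budget(10_000, 0.10)
assert not check_budget(2_000, 0.50)
```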

Model Tiers

Shannon categorizes models into tiers based on capability and cost:
| Tier   | Models                      | Cost per 1M tokens | Use Case                          |
|--------|-----------------------------|--------------------|-----------------------------------|
| SMALL  | gpt-5-mini, claude-haiku    | $0.15 – $0.25      | Simple queries, high volume       |
| MEDIUM | gpt-5, claude-sonnet        | $3.00 – $15.00     | General purpose tasks             |
| LARGE  | gpt-5-thinking, claude-opus | $15.00 – $75.00    | Complex reasoning, critical tasks |
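A back-of-the-envelope cost estimate follows directly from the table. The helper below is illustrative (not part of the SDK) and uses each tier's low-end price:

```python
# Low-end prices per 1M tokens, taken from the tier table above
TIER_PRICE_PER_1M = {"SMALL": 0.15, "MEDIUM": 3.00, "LARGE": 15.00}

def estimate_cost(tier: str, tokens: int) -> float:
    """Rough cost in USD for a request at the tier's low-end price."""
    return tokens / 1_000_000 * TIER_PRICE_PER_1M[tier]

# 10,000 tokens on SMALL vs LARGE: a 100x spread
print(f"{estimate_cost('SMALL', 10_000):.4f}")  # 0.0015
print(f"{estimate_cost('LARGE', 10_000):.4f}")  # 0.1500
```

This is why routing most traffic to the SMALL tier dominates the overall bill.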

Explicit Tier Preference

Set a preferred default tier via environment variable:
DEFAULT_MODEL_TIER=small   # small | medium | large

Intelligent Router

Shannon’s learning router automatically selects the cheapest model capable of handling each task.

How It Works

  1. Task Analysis: Analyzes complexity, required capabilities
  2. Model Selection: Starts with smallest viable model
  3. Quality Check: Validates output quality
  4. Learning: Remembers successful model-task pairings
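The escalation loop in steps 2–4 can be sketched as follows. The model names come from the tier table; the quality check and memory are stand-ins for the real router's logic:

```python
# Tiers in ascending cost order, per the tier table
TIERS = ["gpt-5-mini", "gpt-5", "gpt-5-thinking"]

def route(task: str, run_model, good_enough, memory: dict) -> str:
    """Start at the cheapest model and escalate until output passes the quality check."""
    start = memory.get(task, 0)             # step 4: reuse a remembered pairing
    for i in range(start, len(TIERS)):
        output = run_model(TIERS[i], task)  # step 2: try the smallest viable model
        if good_enough(output):             # step 3: validate output quality
            memory[task] = i                # step 4: remember the successful pairing
            return output
    return output  # best effort from the largest tier

# Toy stand-ins: only the medium tier and up "succeed" on this task
memory = {}
out = route("summarize report",
            run_model=lambda m, t: f"{m}:{t}",
            good_enough=lambda o: not o.startswith("gpt-5-mini"),
            memory=memory)
print(out)     # gpt-5:summarize report
print(memory)  # {'summarize report': 1}
```

The next occurrence of the same task starts directly at the remembered tier, skipping the failed attempt.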

Cost Savings

# Without intelligent routing
Traditional: Always use GPT-5 → $0.50 per task

# With Shannon's routing
Shannon:
  - 70% routed to gpt-5-mini → $0.01
  - 25% routed to gpt-5 → $0.15
  - 5% routed to gpt-5-thinking → $0.50
Blended average: ~$0.07 per task (~86% savings vs always GPT-5)
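Working the blend out explicitly from the breakdown above:

```python
# Routing mix and per-task costs from the breakdown above
mix = {
    "gpt-5-mini":     (0.70, 0.01),
    "gpt-5":          (0.25, 0.15),
    "gpt-5-thinking": (0.05, 0.50),
}

blended = sum(share * cost for share, cost in mix.values())
savings = 1 - blended / 0.50  # vs always using GPT-5 at $0.50/task

print(f"Blended cost: ${blended:.4f} per task")  # $0.0695
print(f"Savings: {savings:.0%}")                 # 86%
```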

Monitoring Router Decisions

Use the dashboard or SDK status to review costs:
status = client.wait(handle.task_id)
if status.metrics:
    print(f"Cost (USD): {status.metrics.cost_usd:.4f}")

Response Caching

Shannon caches LLM responses to eliminate redundant API calls:

Cache Strategy

  • Key: SHA256 hash of (messages + model + parameters)
  • TTL: Configurable (often ~1 hour via Redis TTL)
  • Storage: In-memory LRU + optional Redis for distributed caching
  • Hit Rate: Typical 30-50% for production workloads
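The cache key in the first bullet can be sketched as below; the field names and serialization are illustrative, and Shannon's actual encoding may differ:

```python
import hashlib
import json

def cache_key(messages: list, model: str, params: dict) -> str:
    """SHA256 over a canonical serialization of (messages + model + parameters)."""
    payload = json.dumps(
        {"messages": messages, "model": model, "params": params},
        sort_keys=True,  # canonical ordering so identical inputs hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key([{"role": "user", "content": "What is Python?"}], "gpt-5-mini", {"temperature": 0})
k2 = cache_key([{"role": "user", "content": "What is Python?"}], "gpt-5-mini", {"temperature": 0})
k3 = cache_key([{"role": "user", "content": "What's Python?"}], "gpt-5-mini", {"temperature": 0})
print(k1 == k2)  # True  -- identical request, cache hit
print(k1 == k3)  # False -- different phrasing, cache miss
```

Because any change to the messages, model, or parameters produces a different hash, only byte-identical requests hit the cache, which is why consistent phrasing matters (see Best Practices below).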

Cache Benefits

# First call: Cache miss
Task 1: "What is Python?" $0.002 (LLM call)

# Second call: Cache hit
Task 2: "What is Python?" $0.000 (cached)

Monitoring Cache Performance

status = client.wait(handle.task_id)
if status.metrics:
    if status.metrics.cost_usd == 0:
        print("Likely served from cache (no LLM cost)")
    else:
        print(f"Cost: ${status.metrics.cost_usd:.4f}")

Provider Rate Limits

Shannon respects provider rate limits automatically:

Configured Limits

From config/models.yaml:
providers:
  openai:
    rpm: 10000  # Requests per minute
    tpm: 2000000  # Tokens per minute
  anthropic:
    rpm: 4000
    tpm: 400000

Automatic Throttling

When approaching limits:
  1. Queues requests
  2. Spreads load over time
  3. Falls back to alternative providers if available
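Step 2 (spreading load) amounts to pacing requests at a rate derived from the provider's rpm limit. A minimal illustrative sketch, not Shannon's implementation:

```python
class Pacer:
    """Space requests so a provider's rpm limit is never exceeded."""

    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm  # minimum seconds between requests
        self.next_ok = 0.0          # earliest time the next request may be sent

    def wait_time(self, now: float) -> float:
        """Seconds to sleep before the next request may be sent."""
        delay = max(0.0, self.next_ok - now)
        self.next_ok = max(now, self.next_ok) + self.interval
        return delay

# At rpm=4000 (the anthropic limit above), requests are spaced 15 ms apart
p = Pacer(rpm=4000)
print(p.wait_time(0.0))  # 0.0    -- first request goes immediately
print(p.wait_time(0.0))  # 0.015  -- second must wait one interval
```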

Cost Monitoring

Track Spending Per Task

status = client.wait(handle.task_id)
if status.metrics:
    print(f"Tokens used: {status.metrics.tokens_used}")
    print(f"Cost: ${status.metrics.cost_usd:.4f}")

Aggregate Metrics

Shannon tracks cumulative costs in the dashboard:
  • Total spend by day/week/month
  • Cost per user/team
  • Cost per cognitive pattern
  • Token usage trends
Visit http://localhost:2111 to view real-time cost analytics.

Best Practices

1. Always Set Budgets

Never run production tasks without budget limits:
# ❌ Bad: no budget guards configured
client.submit_task(query="...")

# ✅ Good: budget guards set in .env before submitting
# (MAX_COST_PER_REQUEST=0.50, MAX_TOKENS_PER_REQUEST=10000)
client.submit_task(query="...")

2. Use Simple Mode When Possible

Complex patterns cost more:
# Simple query: single agent, minimal tokens (mode auto-selected)
client.submit_task(
    query="What is the capital of France?",
)

# Complex query: task decomposition justified (mode auto-selected)
client.submit_task(
    query="Research and compare 5 database technologies",
)

3. Leverage Caching

For repeated queries, use consistent phrasing to maximize cache hits:
# ❌ Bad: Different phrasing prevents cache hits
client.submit_task(query="What's Python?")
client.submit_task(query="Tell me about Python")
client.submit_task(query="Explain Python")

# ✅ Good: Consistent queries hit cache
standard_query = "What is Python?"
client.submit_task(query=standard_query)  # Cache miss
client.submit_task(query=standard_query)  # Cache hit
client.submit_task(query=standard_query)  # Cache hit

4. Monitor and Optimize

Review cost metrics regularly:
# Enable detailed logging
import logging
logging.basicConfig(level=logging.INFO)

total_cost = 0.0
for t in tasks:
    st = client.wait(t.task_id)
    if st.metrics:
        total_cost += st.metrics.cost_usd
        print(f"Task: ${st.metrics.cost_usd:.4f}, Running total: ${total_cost:.4f}")

5. Use Smaller Models First

Let the intelligent router prove when larger models are needed:
# Let Shannon choose
client.submit_task(query="...")  # Router auto-selects the cheapest viable tier

Model tier is selected by the platform router; the SDK accepts no per-request model-tier or budget parameters.

Cost Optimization Checklist

  • Set budget env vars (MAX_COST_PER_REQUEST, MAX_TOKENS_PER_REQUEST)
  • Let the router auto-select execution mode for straightforward queries
  • Enable response caching (default: enabled)
  • Set DEFAULT_MODEL_TIER=small when appropriate
  • Standardize query phrasing for cache hits
  • Monitor cost metrics in dashboard
  • Set up budget alerts (via Prometheus)
  • Review and optimize prompt templates
  • Use session context to reduce token usage
  • Enable learning router (default: enabled)

Example: Cost-Optimized Workflow

from shannon import ShannonClient

client = ShannonClient()

# High-volume, simple queries
simple_tasks = [
    "Classify sentiment: Great product!",
    "Classify sentiment: Terrible experience",
    "Classify sentiment: It's okay"
]

total_cost = 0
for query in simple_tasks:
    handle = client.submit_task(query=query)
    status = client.wait(handle.task_id)
    if status.metrics:
        total_cost += status.metrics.cost_usd

print(f"Total cost for 3 tasks: ${total_cost:.4f}")
# Example: ~$0.006 (~90% savings vs always GPT-5)

Next Steps