
Overview

Shannon provides comprehensive cost control features to prevent unexpected LLM charges and optimize spending. With built-in budget enforcement and intelligent routing, teams often see 60–90% cost reductions versus naive implementations, depending on workload.

Setting Budgets

Budgets are configured at the platform level (not per request via REST). Use environment variables in .env:
# LLM service budget guards
MAX_TOKENS_PER_REQUEST=10000    # Max tokens per request
MAX_COST_PER_REQUEST=0.50       # Max cost per request (USD)

# Apply changes
docker compose restart
Shannon enforces budgets during execution. When limits are reached, the system halts further spending and returns the best available result or an error depending on context.
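The enforcement can be pictured as a simple pre-flight guard that reads the same env vars as the .env above. This is an illustrative sketch, not Shannon's actual internals:

```python
import os

# Budget limits, read from the same env vars configured in .env
MAX_TOKENS = int(os.environ.get("MAX_TOKENS_PER_REQUEST", "10000"))
MAX_COST = float(os.environ.get("MAX_COST_PER_REQUEST", "0.50"))

def check_budget(tokens_so_far: int, cost_so_far: float) -> bool:
    """Return True if the request may keep spending, False if it must halt."""
    return tokens_so_far < MAX_TOKENS and cost_so_far < MAX_COST

# A request at 9,500 tokens and $0.48 may continue...
assert check_budget(9_500, 0.48)
# ...but hitting either ceiling halts further spending
assert not check_budget(10_000, 0.10)
assert not check_budget(2_000, 0.50)
```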

Model Tiers

Shannon categorizes models into tiers based on capability and cost:
| Tier   | Models                      | Cost per 1M tokens | Use Case                          |
|--------|-----------------------------|--------------------|-----------------------------------|
| SMALL  | gpt-5-mini, claude-haiku    | $0.15 – $0.25      | Simple queries, high volume       |
| MEDIUM | gpt-5, claude-sonnet        | $3.00 – $15.00     | General purpose tasks             |
| LARGE  | gpt-5-thinking, claude-opus | $15.00 – $75.00    | Complex reasoning, critical tasks |
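A back-of-the-envelope cost estimate follows directly from the table. The helper below is illustrative (not part of the SDK) and uses each tier's low-end price:

```python
# Low-end prices per 1M tokens, taken from the tier table above
TIER_PRICE_PER_1M = {"SMALL": 0.15, "MEDIUM": 3.00, "LARGE": 15.00}

def estimate_cost(tier: str, tokens: int) -> float:
    """Rough cost in USD for a request at the tier's low-end price."""
    return tokens / 1_000_000 * TIER_PRICE_PER_1M[tier]

# 10,000 tokens on SMALL vs LARGE: a 100x spread
print(f"{estimate_cost('SMALL', 10_000):.4f}")  # 0.0015
print(f"{estimate_cost('LARGE', 10_000):.4f}")  # 0.1500
```

This is why routing most traffic to the SMALL tier dominates the overall bill.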

Explicit Tier Preference

Set a preferred default tier via environment variable:
DEFAULT_MODEL_TIER=small   # small | medium | large

Intelligent Router

Shannon’s learning router automatically selects the cheapest model capable of handling each task.

How It Works

  1. Task Analysis: Analyzes complexity, required capabilities
  2. Model Selection: Starts with smallest viable model
  3. Quality Check: Validates output quality
  4. Learning: Remembers successful model-task pairings
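The escalation loop in steps 2–4 can be sketched as follows. The model names come from the tier table; the quality check and memory are stand-ins for the real router's logic:

```python
# Tiers in ascending cost order, per the tier table
TIERS = ["gpt-5-mini", "gpt-5", "gpt-5-thinking"]

def route(task: str, run_model, good_enough, memory: dict) -> str:
    """Start at the cheapest model and escalate until output passes the quality check."""
    start = memory.get(task, 0)             # step 4: reuse a remembered pairing
    for i in range(start, len(TIERS)):
        output = run_model(TIERS[i], task)  # step 2: try the smallest viable model
        if good_enough(output):             # step 3: validate output quality
            memory[task] = i                # step 4: remember the successful pairing
            return output
    return output  # best effort from the largest tier

# Toy stand-ins: only the medium tier and up "succeed" on this task
memory = {}
out = route("summarize report",
            run_model=lambda m, t: f"{m}:{t}",
            good_enough=lambda o: not o.startswith("gpt-5-mini"),
            memory=memory)
print(out)     # gpt-5:summarize report
print(memory)  # {'summarize report': 1}
```

The next occurrence of the same task starts directly at the remembered tier, skipping the failed attempt.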

Cost Savings

# Without intelligent routing
Traditional: Always use GPT-5 → $0.50 per task

# With Shannon's routing
Shannon:
  - 70% routed to gpt-5-mini → $0.01
  - 25% routed to gpt-5 → $0.15
  - 5% routed to gpt-5-thinking → $0.50
Blended average: ~$0.07 per task (~86% savings vs always GPT-5)
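Working the blend out explicitly from the breakdown above:

```python
# Routing mix and per-task costs from the breakdown above
mix = {
    "gpt-5-mini":     (0.70, 0.01),
    "gpt-5":          (0.25, 0.15),
    "gpt-5-thinking": (0.05, 0.50),
}

blended = sum(share * cost for share, cost in mix.values())
savings = 1 - blended / 0.50  # vs always using GPT-5 at $0.50/task

print(f"Blended cost: ${blended:.4f} per task")  # $0.0695
print(f"Savings: {savings:.0%}")                 # 86%
```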

Monitoring Router Decisions

Use the dashboard or SDK status to review costs:
status = client.wait(handle.task_id)
if status.metrics:
    print(f"Cost (USD): {status.metrics.cost_usd:.4f}")

Response Caching

Shannon caches LLM responses to eliminate redundant API calls:

Cache Strategy

  • Key: SHA256 hash of (messages + model + parameters)
  • TTL: Configurable (often ~1 hour via Redis TTL)
  • Storage: In-memory LRU + optional Redis for distributed caching
  • Hit Rate: Typical 30-50% for production workloads
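The cache key in the first bullet can be sketched as below; the field names and serialization are illustrative, and Shannon's actual encoding may differ:

```python
import hashlib
import json

def cache_key(messages: list, model: str, params: dict) -> str:
    """SHA256 over a canonical serialization of (messages + model + parameters)."""
    payload = json.dumps(
        {"messages": messages, "model": model, "params": params},
        sort_keys=True,  # canonical ordering so identical inputs hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key([{"role": "user", "content": "What is Python?"}], "gpt-5-mini", {"temperature": 0})
k2 = cache_key([{"role": "user", "content": "What is Python?"}], "gpt-5-mini", {"temperature": 0})
k3 = cache_key([{"role": "user", "content": "What's Python?"}], "gpt-5-mini", {"temperature": 0})
print(k1 == k2)  # True  -- identical request, cache hit
print(k1 == k3)  # False -- different phrasing, cache miss
```

Because any change to the messages, model, or parameters produces a different hash, only byte-identical requests hit the cache, which is why consistent phrasing matters (see Best Practices below).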

Cache Benefits

# First call: Cache miss
Task 1: "What is Python?" $0.002 (LLM call)

# Second call: Cache hit
Task 2: "What is Python?" $0.000 (cached)

Monitoring Cache Performance

status = client.wait(handle.task_id)
if status.metrics:
    if status.metrics.cost_usd == 0:
        print("Likely served from cache (no LLM cost)")
    else:
        print(f"Cost: ${status.metrics.cost_usd:.4f}")

Provider Rate Limits

Shannon respects provider rate limits automatically:

Configured Limits

From config/models.yaml:
providers:
  openai:
    rpm: 10000  # Requests per minute
    tpm: 2000000  # Tokens per minute
  anthropic:
    rpm: 4000
    tpm: 400000

Automatic Throttling

When approaching limits:
  1. Queues requests
  2. Spreads load over time
  3. Falls back to alternative providers if available
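Step 2 (spreading load) amounts to pacing requests at a rate derived from the provider's rpm limit. A minimal illustrative sketch, not Shannon's implementation:

```python
class Pacer:
    """Space requests so a provider's rpm limit is never exceeded."""

    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm  # minimum seconds between requests
        self.next_ok = 0.0          # earliest time the next request may be sent

    def wait_time(self, now: float) -> float:
        """Seconds to sleep before the next request may be sent."""
        delay = max(0.0, self.next_ok - now)
        self.next_ok = max(now, self.next_ok) + self.interval
        return delay

# At rpm=4000 (the anthropic limit above), requests are spaced 15 ms apart
p = Pacer(rpm=4000)
print(p.wait_time(0.0))  # 0.0    -- first request goes immediately
print(p.wait_time(0.0))  # 0.015  -- second must wait one interval
```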

Cost Monitoring

Track Spending Per Task

status = client.wait(handle.task_id)
if status.metrics:
    print(f"Tokens used: {status.metrics.tokens_used}")
    print(f"Cost: ${status.metrics.cost_usd:.4f}")

Aggregate Metrics

Shannon tracks cumulative costs in the dashboard:
  • Total spend by day/week/month
  • Cost per user/team
  • Cost per cognitive pattern
  • Token usage trends
Visit http://localhost:2111 to view real-time cost analytics.

Best Practices

1. Always Set Budgets

Never run production tasks without budget limits:
# ❌ Bad: no budget guards configured
client.submit_task(query="...")

# ✅ Good: budget guards set in .env before submitting
# (MAX_COST_PER_REQUEST=0.50, MAX_TOKENS_PER_REQUEST=10000)
client.submit_task(query="...")

2. Use Simple Mode When Possible

Complex patterns cost more:
# Simple query: single agent, minimal tokens (mode auto-selected)
client.submit_task(
    query="What is the capital of France?",
)

# Complex query: task decomposition justified (mode auto-selected)
client.submit_task(
    query="Research and compare 5 database technologies",
)

3. Leverage Caching

For repeated queries, use consistent phrasing to maximize cache hits:
# ❌ Bad: Different phrasing prevents cache hits
client.submit_task(query="What's Python?")
client.submit_task(query="Tell me about Python")
client.submit_task(query="Explain Python")

# ✅ Good: Consistent queries hit cache
standard_query = "What is Python?"
client.submit_task(query=standard_query)  # Cache miss
client.submit_task(query=standard_query)  # Cache hit
client.submit_task(query=standard_query)  # Cache hit

4. Monitor and Optimize

Review cost metrics regularly:
# Enable detailed logging
import logging
logging.basicConfig(level=logging.INFO)

total_cost = 0.0
for t in tasks:
    st = client.wait(t.task_id)
    if st.metrics:
        total_cost += st.metrics.cost_usd
        print(f"Task: ${st.metrics.cost_usd:.4f}, Running total: ${total_cost:.4f}")

5. Use Smaller Models First

Let the intelligent router prove when larger models are needed:
# Let Shannon choose
client.submit_task(query="...")  # Router auto-selects the cheapest viable tier

Model tier is selected by the platform router; the SDK accepts no per-request model-tier or budget parameters.

Cost Optimization Checklist

  • Set budget env vars (MAX_COST_PER_REQUEST, MAX_TOKENS_PER_REQUEST)
  • Let the router auto-select execution mode for straightforward queries
  • Enable response caching (default: enabled)
  • Set DEFAULT_MODEL_TIER=small when appropriate
  • Standardize query phrasing for cache hits
  • Monitor cost metrics in dashboard
  • Set up budget alerts (via Prometheus)
  • Review and optimize prompt templates
  • Use session context to reduce token usage
  • Enable learning router (default: enabled)

Example: Cost-Optimized Workflow

from shannon import ShannonClient

client = ShannonClient()

# High-volume, simple queries
simple_tasks = [
    "Classify sentiment: Great product!",
    "Classify sentiment: Terrible experience",
    "Classify sentiment: It's okay"
]

total_cost = 0
for query in simple_tasks:
    handle = client.submit_task(query=query)
    status = client.wait(handle.task_id)
    if status.metrics:
        total_cost += status.metrics.cost_usd

print(f"Total cost for 3 tasks: ${total_cost:.4f}")
# Example: ~$0.006 (~90% savings vs always GPT-5)

Next Steps