Recommendations

AI-generated optimization opportunities

Available savings
$59.9k/mo

Route 62% of gpt-4o calls to gpt-4o-mini

Classification & extraction prompts under 800 tokens show identical quality on mini. Auto-routing layer with confidence threshold of 0.92.

+$18.4k/molow effort96% confidence
new

Enable prompt caching on Claude Sonnet system prompts

Detected 1.2M repeated system prompts across customer-support workload. Caching reduces input cost by 90%.

+$12.2k/molow effort99% confidence
new

Batch async embeddings via OpenAI Batch API

Document indexing job runs in real-time but tolerates 24h SLA. Batch tier is 50% cheaper.

+$6.4k/momedium effort92% confidence
in progress

Compress retrieval context with semantic dedup

RAG pipeline sends ~40% redundant chunks. Dedup before LLM call reduces input tokens by ~32%.

+$9.8k/momedium effort88% confidence
new

Failover routing: Azure → OpenAI on rate-limit

Currently failing over to retry on Azure. Cross-provider routing eliminates 4% wasted retries.

+$2.2k/molow effort94% confidence
new

Switch internal eval suite to Haiku

Eval grader prompts perform within 1.2% of Sonnet but cost 12× less.

+$4.8k/molow effort91% confidence
implemented

Trim verbose JSON schema in tool definitions

Tool schemas average 2,400 tokens/request. Compaction layer cuts to ~800.

+$3.6k/mohigh effort84% confidence
new

Move bulk summarization to Gemini Flash

Daily report summarization workload (2.1M docs/mo) is latency-tolerant and Flash-compatible.

+$7.2k/momedium effort89% confidence
new