Recommendations
AI-generated optimization opportunities
Route 62% of gpt-4o calls to gpt-4o-mini
Classification & extraction prompts under 800 tokens show identical quality on mini. Auto-routing layer with confidence threshold of 0.92.
Enable prompt caching on Claude Sonnet system prompts
Detected 1.2M repeated system prompts across customer-support workload. Caching reduces input cost by 90%.
Batch async embeddings via OpenAI Batch API
Document indexing job runs in real-time but tolerates 24h SLA. Batch tier is 50% cheaper.
Compress retrieval context with semantic dedup
RAG pipeline sends ~40% redundant chunks. Dedup before LLM call reduces input tokens by ~32%.
Failover routing: Azure → OpenAI on rate-limit
Currently failing over to retry on Azure. Cross-provider routing eliminates 4% wasted retries.
Switch internal eval suite to Haiku
Eval grader prompts perform within 1.2% of Sonnet but cost 12× less.
Trim verbose JSON schema in tool definitions
Tool schemas average 2,400 tokens/request. Compaction layer cuts to ~800.
Move bulk summarization to Gemini Flash
Daily report summarization workload (2.1M docs/mo) is latency-tolerant and Flash-compatible.