cv usk(@cv_usk):

# Practices for Embedding AI Agents in Software
# Context Budget Allocator

🎯 The Hook
"Just put everything in the context window" sounds reasonable until your costs spike, your system instructions get pushed out, and the LLM ignores the most important retrieved documents buried in the middle.

🔥 The Problem
In RAG-powered agents, search results, conversation history, system instructions, and long-term memory all compete for the same finite token window. More input means higher cost but not necessarily better output. As conversations grow, system instructions shrink proportionally and behavior degrades. The "Lost in the Middle" phenomenon means information placed in the center of a long context gets less attention than content at the beginning or end.

💡 The Pattern
Divide the context window into named slots (system instructions, retrieval, history, memory), each with a maximum token ratio and priority. Reserve system instructions as a non-compressible fixed slot at 10-20%. Cap retrieval results at a reranked top-k of 3-8 documents. Compress conversation history via summarization when window usage exceeds a threshold. Arrange content to counter Lost in the Middle: critical information first, recent user input last. The higher the cost sensitivity, the tighter the top-k, the lower the compression threshold, and the shorter the history retention.

✅ When to Use
Use when:
- RAG or memory is active and candidate content could exceed 50% of the model's context window
- Cost sensitivity is medium or higher, with token volume affecting both cost and inference latency
- Multi-turn conversations accumulate history that crowds out other content types

Don't use when:
- Input is just system instructions plus a single user message, fitting within 30% of the window
- Using a long-context model with input under 20% of the window and low cost sensitivity

⚠️ Pitfalls
- Never compress system instructions. Losing tool definitions or safety rules breaks agent behavior entirely
- Raw top-k without reranking has low signal density. Retrieve 20 candidates, rerank to 3-8 with a cross-encoder
- Summarization is lossy. Key decisions and proper nouns can vanish. Combine with keyword extraction to preserve critical terms

🔧 Implementation Approach
- Model the context window as named slots (system/user/retrieval/history/memory) with a struct defining max token ratio, priority, and compressibility per slot
- Reserve system instructions as the highest-priority non-compressible fixed allocation, then distribute remaining budget to other slots in descending priority order
- Cap retrieval content by reranking vector search candidates with a cross-encoder before fitting within the slot budget, maximizing signal density
- Trigger summarization compression on the history slot when it exceeds budget, combining with keyword extraction to prevent loss of critical terms

#AIAgents #SoftwareArchitecture
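
The slot model and priority-ordered allocation described in the post can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the names `Slot`, `allocate`, `needs_compression`, and `layout_order` are hypothetical, and the ratios and priority values in `SLOTS` are assumptions chosen only to make the example concrete. Treating the `user` slot as non-compressible is also an assumption; the post only states that for system instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    max_ratio: float    # largest share of the window this slot may take
    priority: int       # higher value is served first
    compressible: bool  # False: content must never be summarized or cut


def allocate(window_tokens: int, slots: list[Slot], demand: dict[str, int]) -> dict[str, int]:
    """Grant each slot min(demand, cap, remaining budget)."""
    remaining = window_tokens
    grants: dict[str, int] = {}
    # Non-compressible slots are reserved first, then descending priority.
    for slot in sorted(slots, key=lambda s: (s.compressible, -s.priority)):
        cap = round(window_tokens * slot.max_ratio)
        want = demand.get(slot.name, 0)
        available = min(cap, remaining)
        if not slot.compressible and want > available:
            # A fixed slot that does not fit is a configuration error,
            # because silently truncating it would drop rules or tools.
            raise ValueError(f"{slot.name} needs {want} tokens, only {available} available")
        grant = min(want, available)
        grants[slot.name] = grant
        remaining -= grant
    return grants


def needs_compression(slots: list[Slot], demand: dict[str, int], grants: dict[str, int]) -> list[str]:
    """Names of compressible slots whose demand exceeds their grant."""
    return [s.name for s in slots if s.compressible and demand.get(s.name, 0) > grants[s.name]]


def layout_order(slots: list[Slot]) -> list[str]:
    """One reading of 'critical first, recent user input last'."""
    middle = sorted((s for s in slots if s.name != "user"), key=lambda s: -s.priority)
    return [s.name for s in middle] + [s.name for s in slots if s.name == "user"]


# Illustrative configuration only; every number here is an assumption.
SLOTS = [
    Slot("system", 0.20, 100, compressible=False),
    Slot("user", 0.10, 90, compressible=False),
    Slot("retrieval", 0.35, 70, compressible=True),
    Slot("history", 0.25, 50, compressible=True),
    Slot("memory", 0.10, 30, compressible=True),
]
```

With an 8,000-token window and demands of 1,200 (system), 300 (user), 4,000 (retrieval), 3,500 (history), and 500 (memory), `allocate` caps retrieval at 2,800 and history at 2,000, and `needs_compression` flags those two slots. That is where the reranked top-k cut and the summarization step from the post would apply.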

2026.06.18 02:56
