
2026.06.18 02:56

# Practices for Embedding AI Agents in Software # Context Budget Allocator

🎯 The Hook
"Just put everything in the context window" sounds reasonable until your costs spike, your system instructions get pushed out, and the LLM ignores the most important retrieved documents buried in the middle.

🔥 The Problem
In RAG-powered agents, search results, conversation history, system instructions, and long-term memory all compete for the same finite token window. More input means higher cost but not necessarily better output. As conversations grow, system instructions shrink proportionally and behavior degrades. The "Lost in the Middle" phenomenon means information placed in the center of a long context gets less attention than content at the beginning or end.

💡 The Pattern
Divide the context window into named slots (system instructions, retrieval, history, memory) each with a maximum token ratio and priority. Reserve system instructions as a non-compressible fixed slot at 10-20%. Cap retrieval results at a reranked top-k of 3-8 documents. Compress conversation history via summarization when window usage exceeds a threshold. Arrange content to counter Lost in the Middle: critical information first, recent user input last. The higher the cost sensitivity, the tighter the top-k, the lower the compression threshold, and the shorter the history retention.

✅ When to Use
Use when:
- RAG or memory is active and candidate content could exceed 50% of the model's context window
- Cost sensitivity is medium or higher, with token volume affecting both cost and inference latency
- Multi-turn conversations accumulate history that crowds out other content types
Don't use when:
- Input is just system instructions plus a single user message, fitting within 30% of the window
- Using a long-context model with input under 20% of the window and low cost sensitivity

⚠️ Pitfalls
- Never compress system instructions. Losing tool definitions or safety rules breaks agent behavior entirely
- Raw top-k without reranking has low signal density. Retrieve 20 candidates, rerank to 3-8 with a cross-encoder
- Summarization is lossy. Key decisions and proper nouns can vanish. Combine with keyword extraction to preserve critical terms

🔧 Implementation Approach
- Model the context window as named slots (system/user/retrieval/history/memory) with a struct defining max token ratio, priority, and compressibility per slot
- Reserve system instructions as the highest-priority non-compressible fixed allocation, then distribute remaining budget to other slots in descending priority order
- Cap retrieval content by reranking vector search candidates with a cross-encoder before fitting within the slot budget, maximizing signal density
- Trigger summarization compression on the history slot when it exceeds budget, combining with keyword extraction to prevent loss of critical terms

#AIAgents# #SoftwareArchitecture#
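
The slot model above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Slot`, `allocate`, the `SLOTS` ratios, and the whitespace-based `count_tokens` are all assumptions made for the example, and a real system would plug in the model's own tokenizer plus an actual summarizer for the slots reported as over budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    max_ratio: float    # upper bound as a share of the whole window
    priority: int       # lower value is served first
    compressible: bool  # False: content must fit whole, never summarized


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def allocate(window: int, slots: list[Slot], content: dict[str, str]):
    """Grant each slot a token budget in priority order.

    Returns (grants, over_budget): tokens granted per slot, and the names of
    compressible slots whose content must be trimmed or summarized to fit.
    """
    remaining = window
    grants: dict[str, int] = {}
    over_budget: list[str] = []
    for slot in sorted(slots, key=lambda s: s.priority):
        wanted = count_tokens(content.get(slot.name, ""))
        limit = min(round(window * slot.max_ratio), remaining)
        if wanted > limit:
            if not slot.compressible:
                # System instructions are never compressed: fail loudly instead.
                raise ValueError(f"{slot.name} exceeds its fixed allocation")
            over_budget.append(slot.name)
        grants[slot.name] = min(wanted, limit)
        remaining -= grants[slot.name]
    return grants, over_budget


# Hypothetical ratios; the system slot sits in the 10-20% band named above.
SLOTS = [
    Slot("system", 0.20, 0, compressible=False),
    Slot("retrieval", 0.40, 1, compressible=True),
    Slot("history", 0.30, 2, compressible=True),
    Slot("memory", 0.10, 3, compressible=True),
]
```

Raising on an oversized system slot, rather than silently truncating it, mirrors the "never compress system instructions" pitfall: the failure shows up at build time instead of as degraded agent behavior.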

cv usk@cv_usk

2026.06.17 06:35

# Practices for Embedding AI Agents in Software # Streaming with Progressive Commit

🎯 The Hook
You want to stream tokens instantly for responsiveness, but committing side effects before validation is dangerous. What if you could separate "showing" from "doing"?

🔥 The Problem
Agent responses have high latency variance, and making users wait for full generation degrades the experience. But committing tool side effects mid-stream means you need costly rollbacks when guardrails reject the output. You are stuck choosing between slow-but-safe and fast-but-risky.

💡 The Pattern
Streaming with Progressive Commit pushes generated tokens and tool results to the client via SSE/WebSocket in real time, while holding side effects (API writes, DB updates) in a commit buffer until validation passes. Events flow from preview (pending) to committed or rejected, and the client UI explicitly renders intermediate states. The higher the failure cost, the deeper the buffer: high-risk workflows hold all side effects until the entire sequence is validated.

✅ When to Use
Use when:
- A user-facing UI exists and time to first token directly impacts experience
- The agent performs write side effects via tools, and reversal is costly
- You need guardrail validation or dry-run checks before committing
Don't use when:
- Processing always finishes in a few seconds (streaming adds little value)
- Clients cannot support SSE/WebSocket connections
- The workflow is read-only with no side effects (commit buffer is unnecessary)

⚠️ Pitfalls
- If the client does not distinguish preview from committed/rejected events, users see unconfirmed results as final. Always render a "pending" intermediate state in the UI
- In long multi-step executions, the commit buffer can grow large. Checkpoint per step and release confirmed buffers to control memory
- When an SSE connection drops, the commit buffer survives on the server. Decide upfront whether to restore on reconnect or discard after a timeout

🔧 Implementation Approach
- Stream LLM tokens through a Stream Buffer directly to the client via SSE/WebSocket. Tool call results are accumulated in a separate Commit Buffer and sent as preview events
- After generation completes, validate each buffered tool call against guardrails. Send committed events for passes and rejected events for failures
- Execute tools initially as dry runs to produce previews, then commit only after validation passes, forming a two-phase execute-then-commit flow
- Design SSE events with distinct types (token, preview, committed, rejected) so the client UI can render an explicit "pending confirmation" intermediate state
- For high failure-cost workflows, keep the commit buffer deep and confirm all side effects only after the entire sequence validates. For low-risk cases, confirm per individual tool call

#AIAgents# #SoftwareArchitecture#
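
A rough Python sketch of the commit buffer, under stated assumptions: `PendingCall`, `CommitBuffer`, and the `validate`/`execute` callbacks are hypothetical names, the `events` list stands in for the SSE/WebSocket stream, and the dry-run step that would produce a real preview payload is left out. The four event types come from the post; everything else is illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PendingCall:
    call_id: str
    tool: str
    args: dict


@dataclass
class CommitBuffer:
    validate: Callable[[PendingCall], bool]  # hypothetical guardrail check
    execute: Callable[[PendingCall], None]   # hypothetical real side effect
    pending: list = field(default_factory=list)
    events: list = field(default_factory=list)  # stand-in for the SSE stream

    def emit(self, type_: str, **data) -> None:
        self.events.append({"type": type_, **data})

    def token(self, text: str) -> None:
        # Tokens are only shown, so they go straight to the client.
        self.emit("token", text=text)

    def preview(self, call: PendingCall) -> None:
        # The side effect is held back; the client only sees a pending preview.
        self.pending.append(call)
        self.emit("preview", call_id=call.call_id, tool=call.tool)

    def flush(self, all_or_nothing: bool = False) -> None:
        """Validate buffered calls, then commit or reject each one.

        With all_or_nothing=True (high failure cost), a single failed
        validation rejects the whole sequence.
        """
        verdicts = [(call, self.validate(call)) for call in self.pending]
        if all_or_nothing and not all(ok for _, ok in verdicts):
            verdicts = [(call, False) for call, _ in verdicts]
        for call, ok in verdicts:
            if ok:
                self.execute(call)
            self.emit("committed" if ok else "rejected", call_id=call.call_id)
        self.pending.clear()
```

Calling `flush()` after every tool call gives the shallow, low-risk variant; calling `flush(all_or_nothing=True)` once at the end gives the deep buffer for high-risk workflows.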

cv usk@cv_usk

2026.06.17 02:49

# Practices for Embedding AI Agents in Software # Tiered Memory

🎯 The Hook
Stuffing everything into the context window is a ticking time bomb: you'll hit token limits and permanently embed hallucinations in long-term memory at the same time.

🔥 The Problem
When agents handle multi-turn tasks with flat memory, two failures happen simultaneously. Conversation history, user attributes, intermediate results, and retrieved knowledge compete for the same finite window, pushing out older context and breaking continuity. Worse, if the LLM's speculative outputs get persisted directly, hallucinations take root in long-term memory and contaminate every future session.

💡 The Pattern
Separate memory into three tiers: working memory (in-context, ephemeral per turn), short-term memory (session store with TTL), and long-term memory (vector DB or KVS, persistent). Working memory is freely read-write and vanishes on context reset. Short-term entries carry trust tags distinguishing user-stated facts from LLM inferences. Promotion to long-term requires repeated confirmation or user approval, preventing hallucination from becoming permanent. The higher the failure cost, the stricter the promotion threshold and the longer the short-term TTL.

✅ When to Use
Use when:
- Information must carry across multiple sessions (user profiles, past decisions, accumulated knowledge)
- Intermediate results are likely to exceed 30% of the context window
- Distinguishing confirmed facts from speculation matters, and wrong memory has high downstream impact
Don't use when:
- Tasks complete in a single shot with no cross-session continuity needed
- All information fits comfortably in the context window
- Only write control is needed without tier separation

⚠️ Pitfalls
- The boundary between working and short-term memory blurs easily. Use explicit external store writes as the dividing line, not LLM internal state
- As long-term memory grows, irrelevant entries leak into context and trigger hallucinations. Filter search results by trust score
- In multi-agent setups, letting individual workers write directly to long-term memory breaks consistency. Centralize long-term writes through the supervisor

🔧 Implementation Approach
- Separate memory into three explicit layers: working memory (in-process dict), short-term memory (TTL-backed session store like Redis), and long-term memory (vector DB), using external store writes as the boundary
- Implement recall as a budget-constrained search across all three tiers, ranking results by relevance and trust score to control injection volume
- Gate promotion from short-term to long-term with a trust score threshold check and approval state verification, preventing unvalidated information from becoming persistent
- Design TTLs per memory type (minutes for real-time data, weeks for user preferences, indefinite for immutable attributes), shortening TTLs as failure cost increases

#AIAgents# #SoftwareArchitecture#
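
The promotion gate is the part that is easiest to get wrong, so here is a small Python sketch of it. Everything named here is an assumption for illustration: plain dicts stand in for the session store and the vector DB, `TieredMemory` and `ShortTermEntry` are hypothetical, and time is passed in explicitly as `now` so the behavior stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class ShortTermEntry:
    value: str
    source: str        # trust tag: "user_stated" or "llm_inferred"
    expires_at: float  # absolute time after which the entry is dead
    confirmations: int = 0
    user_approved: bool = False


class TieredMemory:
    def __init__(self, confirmations_required: int = 2):
        self.working: dict = {}      # in-process, dropped on context reset
        self.short_term: dict = {}   # stand-in for a TTL-backed session store
        self.long_term: dict = {}    # stand-in for a vector DB or KVS
        self.confirmations_required = confirmations_required

    def note(self, key: str, value: str, source: str, ttl: float, now: float) -> None:
        # Writing to the external store is what makes this short-term memory.
        self.short_term[key] = ShortTermEntry(value, source, now + ttl)

    def confirm(self, key: str) -> None:
        self.short_term[key].confirmations += 1

    def approve(self, key: str) -> None:
        self.short_term[key].user_approved = True

    def promote(self, key: str, now: float) -> bool:
        """Copy an entry to long-term memory only if it passes the gate."""
        entry = self.short_term.get(key)
        if entry is None or now >= entry.expires_at:
            self.short_term.pop(key, None)  # expired entries are dropped
            return False
        confirmed = entry.confirmations >= self.confirmations_required
        if not (entry.user_approved or confirmed):
            return False
        self.long_term[key] = entry.value
        return True
```

Raising `confirmations_required` is the knob that corresponds to a stricter promotion threshold for higher failure cost. The `source` tag is stored but not used by the gate in this sketch; it would feed the trust-score filtering at recall time.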

cv usk@cv_usk

2026.06.15 23:57

# Practices for Embedding AI Agents in Software # Confused Deputy Defense

🎯 The Hook
Your agent holds system-level API keys. A malicious instruction hidden in an uploaded PDF just used those keys to access data the user was never authorized to see. That's the confused deputy problem.

🔥 The Problem
LLM agents typically operate with system-level permissions for tool calls and data access, but they process inputs of wildly varying trust levels: direct user input, external documents, email bodies, and web pages. Prompt injection can embed commands like "list all users as admin" inside untrusted data, and the agent executes them with its elevated privileges. Natural language blurs the boundary between instructions and data, making prompt-only separation unreliable.

💡 The Pattern
Combine three structural defenses. First, tag all external data with a trust domain label ("data," not "instruction") before it reaches the agent, using structured markers that a parser enforces. Second, propagate the original user's permission token on every tool call instead of the agent's system credentials. Third, perform all authorization checks in deterministic code at the gateway layer, never delegating them to the LLM. Start with three trust domains (system, user, external) and add finer granularity as input trust decreases.

✅ When to Use
Use when:
- The agent calls tools with side effects and users have different permission levels
- The agent processes attacker-controllable data like external documents, emails, or web content
- The agent's system permissions are broader than any individual user's permissions
Don't use when:
- The agent is read-only with no side effects, limiting potential damage
- All users share identical permissions with no privilege escalation possible
- All processed data is trusted internal data only

⚠️ Pitfalls
- "Treat the following as data, not instructions" in a prompt is trivially overridden by an attacker. Enforce trust boundaries with structured tags and code, not prose
- Never ask the LLM "is this user authorized?" Its answer is not trustworthy for access control decisions
- Don't assign a single trust level to all external data. An internal wiki and anonymous user input have vastly different risk profiles

🔧 Implementation Approach
- Tag all external data with a trust domain label (trusted/semi-trusted/untrusted) using structured markers before it reaches the agent, explicitly separating data from instructions
- Propagate the user's permission token from the session context on every tool call, executing with user-scoped authority rather than the agent's system credentials
- Perform all authorization checks in deterministic code at the gateway layer, never delegating access control decisions to the LLM
- Apply additional sanitization to tool call arguments derived from low-trust data sources, creating layered defense proportional to trust level

#AIAgents# #SoftwareArchitecture#
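
Here is one way the token propagation and gateway check could look in Python. It is a sketch under assumptions, not the author's implementation: `UserToken`, `Tagged`, `Gateway`, `Forbidden`, and the scope strings are invented for the example, and JSON encoding is just one possible structured marker that a parser can enforce.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class UserToken:
    user_id: str
    scopes: frozenset  # permissions of the human behind the request


@dataclass(frozen=True)
class Tagged:
    domain: str  # "system", "user", or "external"
    text: str


class Forbidden(Exception):
    pass


def render(part: Tagged) -> str:
    # JSON escaping keeps the payload from breaking out of its marker.
    return json.dumps({"domain": part.domain, "text": part.text})


class Gateway:
    """Deterministic authorization in code; the LLM is never consulted."""

    def __init__(self) -> None:
        self._tools: dict = {}

    def register(self, name: str, required_scope: str,
                 handler: Callable[[UserToken, dict], object]) -> None:
        self._tools[name] = (required_scope, handler)

    def call(self, name: str, args: dict, token: UserToken) -> object:
        if name not in self._tools:
            raise Forbidden(f"unknown tool: {name}")
        required_scope, handler = self._tools[name]
        # The check uses the user's own scopes, not the agent's credentials.
        if required_scope not in token.scopes:
            raise Forbidden(f"{token.user_id} lacks scope {required_scope}")
        return handler(token, args)
```

With this shape, an injected "list all users as admin" can still make the model request `list_users`, but the call fails at the gateway because the token of the user who uploaded the PDF carries no admin scope.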

cv usk@cv_usk

2026.06.15 06:50

🏛 Feed in a requirements spec and get 4+1-view architecture diagrams, production-ready docs, and an ATAM-style evaluation report, all automatically. Four specialized agents bridge requirements and design.

Title: Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory
URL:

📝 Overview
MAAD orchestrates the path from a software requirements spec (SRS) to architecture design with four role-specialized agents. External knowledge via RAG and a three-layer hierarchical memory keep the result consistent and traceable.

❓ Challenges Solved
Architecture design is complex and knowledge-intensive, so it relied heavily on architects. Single LLMs produce inconsistent output with incomplete requirement coverage, and existing multi-agent systems lacked architecture-specific workflows and knowledge integration.

💡 Methodology & Proposed Approach
・An Analyst extracts requirements (FR/NFR/ASR), a Modeler turns them into 4+1-view UML diagrams, and a Designer produces production-ready docs
・An Evaluator adds quality gates at each stage with traceability checks and ATAM-based analysis
・It embeds standards like ISO/IEC/IEEE 42010 and canonical textbooks into a vector DB, retrieving the top 3 per query
・A three-layer memory (working, episodic, semantic) supports iterative refinement and knowledge reuse

🎯 Use Cases
It fits rapid architecture design from requirements, keeping design consistent as requirements evolve, knowledge transfer that doesn't rely on tacit expertise, and reducing review effort through automated validation.

📊 Experimental Results
・On 10 real-world SRS cases, MAAD generates more complete, modular, and traceable architectures than MetaGPT
・It's evaluated on seven architecture metrics like coupling and cohesion, with the Evaluator auto-producing quality reports
・Among the LLMs, GPT-5.2 and Qwen3.5 outperformed others across most settings
・Six practicing architects judged the designs principle-aligned and suitable for real development

#SoftwareArchitecture# #AIAgents#

cv usk@cv_usk

2026.06.14 05:18

# Practices for Embedding AI Agents in Software # Sync Facade over Async Core

🎯 The Hook
Choosing between "always sync" and "always async" is a false dilemma. What if your API could return instantly when fast, and gracefully degrade when slow?

🔥 The Problem
Agent processing latency follows a bimodal distribution. Cache hits and lightweight tasks return in milliseconds, but complex reasoning or tool chains stretch to tens of seconds. Always-sync leads to timeouts and connection exhaustion. Always-async forces polling even for sub-second responses.

💡 The Pattern
The Sync Facade always processes internally via an async pipeline. The outward-facing API waits up to a configurable threshold: if the job finishes in time, it returns a 200 with the result; if not, it returns a 202 with a job ID for async retrieval. Clients hit a single unified endpoint without worrying about latency bimodality. The threshold is tuned adaptively based on observed P95/P99 latency trends, not hardcoded.

✅ When to Use
Use when:
- Latency distribution is bimodal (mix of fast and slow completions)
- Existing clients expect a synchronous API contract
- Latency requirements vary per request (chat UI vs. batch)
Don't use when:
- Processing always finishes in a few seconds (use Sync Edge)
- Processing always exceeds 30s (use Durable Async from the start)

⚠️ Pitfalls
- Do not hardcode the sync-wait threshold. Observe P95/P99 trends via tracing and adjust adaptively
- Clients often miss 202 handling. Explicitly define the 202 response schema in your OpenAPI spec to prevent SDK generation gaps
- Distinguish worker crashes from simple timeouts during the sync wait. On worker failure, escalate to 202 immediately rather than waiting for the threshold

🔧 Implementation Approach
- All requests are internally processed via an async queue. The facade layer awaits the result up to a configurable sync-wait threshold, returning 200 with the result on success or 202 with a job_id on timeout
- Tune the sync-wait threshold adaptively by observing P95/P99 latency trends via tracing. Starting points are roughly 5-10s for web APIs and 30s for internal RPCs
- Use SSE or WebSocket for progress notifications after async escalation, with polling (3-5s interval) as a fallback for clients that cannot maintain persistent connections
- Keep the facade layer as a thin adapter with no business logic. The async core reuses the Durable Async Agent checkpoint and resume machinery directly
- Explicitly define the 202 response schema in the OpenAPI spec so that generated client SDKs correctly handle job ID retrieval and result polling

#AIAgents# #SoftwareArchitecture#
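
The 200-or-202 decision fits in a dozen lines of `asyncio`. This is a sketch with assumptions spelled out: `sync_facade` and the in-memory `JOBS` dict are hypothetical stand-ins for the facade layer and the durable job store, and the threshold is passed in as a plain argument rather than tuned from traces.

```python
import asyncio
import uuid

# Assumption: an in-memory dict stands in for the async core's job store.
JOBS: dict = {}


async def sync_facade(work, sync_wait: float):
    """Run `work` (a coroutine) and answer with (status, body).

    200 with the result if it finishes within sync_wait seconds,
    otherwise 202 with a job_id the client can use to fetch it later.
    """
    task = asyncio.create_task(work)
    # asyncio.wait leaves the task running on timeout; wait_for would cancel it.
    done, _pending = await asyncio.wait({task}, timeout=sync_wait)
    if task in done and not task.cancelled() and task.exception() is None:
        return 200, {"result": task.result()}
    # Timeout, or the worker failed early: escalate to async retrieval
    # without sitting out the rest of the threshold.
    job_id = uuid.uuid4().hex
    JOBS[job_id] = task
    return 202, {"job_id": job_id}
```

The choice of `asyncio.wait` over `asyncio.wait_for` is the important detail: the slow job has to keep running after the facade gives up waiting, otherwise the 202 points at a cancelled task.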

cv usk@cv_usk

2026.06.12 23:56

# Practices for Embedding AI Agents in Software # Read-Free / Write-Gated

🎯 The Hook
Approving every single tool call is a recipe for approval fatigue, where the rubber-stamp on a dangerous write operation is just one click away. Separate reads from writes and focus human attention where it matters.

🔥 The Problem
Agents mix side-effect-free reads with irreversible writes. Gating everything equally drowns humans in approval requests. Since reads dominate most workloads, approval fatigue sets in fast, and the critical write approvals get waved through without scrutiny. Remove all gates, though, and you risk irreversible damage from unchecked writes.

💡 The Pattern
Split tool calls into "read" (search, fetch, reference) and "write" (create, update, delete, send). Let reads flow freely while gating writes with authorization, validation, approval, and audit. Classify R/W statically at tool registration time in code, never by LLM judgment. Graduate write gate strictness by reversibility: irreversible operations like email sends or payments require human approval, while reversible ones like draft saves pass through policy validation only. This dramatically reduces approval fatigue while maintaining safety for side effects.

✅ When to Use
Use when:
- Read and write operations are mixed, with reads making up the majority
- Irreversible writes exist (email sends, payments, production DB changes)
- You need to preserve human review bandwidth for high-risk operations
Don't use when:
- Reads themselves access sensitive data (PII lookups, confidential documents) and need authorization too
- All operations are read-only with no writes at all
- It's an experimental environment where all operations are reversible and low-cost

⚠️ Pitfalls
- Never let the LLM classify read vs. write. Injection can make it label a write tool as "read," bypassing the gate entirely
- Watch for "reads with side effects" like API call counters or view history tracking
- Applying the same gate strictness to reversible and irreversible writes brings approval fatigue right back

🔧 Implementation Approach
- Assign type (read/write) and gate mode (none/auto/human_approval) statically at tool registration, making it structurally impossible for the LLM to reclassify at runtime
- Implement the write path as a pipeline of input validation, gate evaluation, execution, and full audit logging, while reads log only metadata
- Graduate write gate strictness using a reversibility flag, combining irreversible operations with mandatory dry-run as a prerequisite
- Enforce all gate logic in deterministic code at the gateway layer, with zero reliance on prompt-based access control

#AIAgents# #SoftwareArchitecture#
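
A compact Python sketch of static registration plus a gated write path. The names `Tool`, `ToolGateway`, `policy_ok`, and `human_approves` are assumptions for the example; the gate modes (none/auto/human_approval) are the ones listed above. The dry-run step for irreversible operations is omitted to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)  # frozen: classification cannot change after registration
class Tool:
    name: str
    kind: str  # "read" or "write"
    gate: str  # "none", "auto", or "human_approval"
    run: Callable[[dict], object]


class ToolGateway:
    def __init__(self, policy_ok: Callable[[Tool, dict], bool],
                 human_approves: Callable[[Tool, dict], bool]) -> None:
        self._tools: dict = {}
        self._policy_ok = policy_ok            # hypothetical policy validator
        self._human_approves = human_approves  # hypothetical approval hook
        self.audit: list = []

    def register(self, tool: Tool) -> None:
        if tool.kind not in ("read", "write"):
            raise ValueError(f"unknown kind: {tool.kind}")
        if tool.gate not in ("none", "auto", "human_approval"):
            raise ValueError(f"unknown gate: {tool.gate}")
        # In this sketch reads are ungated and every write carries a gate.
        if (tool.kind == "read") != (tool.gate == "none"):
            raise ValueError("reads are ungated; writes must be gated")
        self._tools[tool.name] = tool

    def call(self, name: str, args: dict) -> object:
        tool = self._tools[name]
        if tool.kind == "read":
            self.audit.append({"tool": name})  # metadata only
            return tool.run(args)
        if not self._policy_ok(tool, args):
            raise PermissionError(f"{name}: rejected by policy")
        if tool.gate == "human_approval" and not self._human_approves(tool, args):
            raise PermissionError(f"{name}: approval denied")
        result = tool.run(args)
        self.audit.append({"tool": name, "args": args, "result": result})
        return result
```

Because `kind` and `gate` live on a frozen record created at registration, nothing the model outputs at runtime can relabel `send_email` as a read; the only inputs the model controls are the tool name and its arguments.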
