write /goals like acceptance criteria.
/goal is now everywhere. Claude Code, Codex, Hermes, and other agents are adopting the same pattern: you set a completion condition, and the agent works autonomously until a fast evaluator model confirms the condition is met.
the feature is simple. writing good goals is not.
vague goals fail in two ways: the agent loops forever trying to satisfy an unclear condition, or the evaluator hallucinates success because there's nothing concrete to check against. both burn tokens for nothing.
here's what separates goals that work from goals that break:
good goals describe an observable end state.
"all tests in test/auth pass and lint is clean" works because the agent can run the tests, print the output, and the evaluator can confirm it from the transcript.
"every call site of the old API migrated and build succeeds" works because there's a verifiable artifact: the build output.
"CHANGELOG.md has an entry for each PR merged this week" works because it points to a concrete file with concrete content.
bad goals have no finish line.
"make the codebase better" fails because better by what metric? "refactor everything" fails because there's no exit condition. "fix the bugs" fails because which bugs, verified how?
the mental model that helps: if a human couldn't tell when the ticket is done, neither can the evaluator.
treat every /goal like a ticket you're assigning to a very literal junior developer who never gets tired. write the exact acceptance criteria you'd put in that ticket.
one more thing: complex multi-step objectives overwhelm a single /goal. "redesign auth, add OAuth, write tests, update docs" is four goals pretending to be one. break them into sequential /goal calls where each has a single verifiable finish line.
i wrote a detailed breakdown of /goal (article below) covering the full mechanics.
the three-tier memory of Hermes agent.
AI agents forget everything when your session ends. Hermes doesn't.
it has three memory layers, each at a different speed.
tier 1: two tiny markdown files
MEMORY.md (2,200 chars) and USER.md (1,375 chars). injected into the system prompt at session start as a frozen snapshot.
MEMORY.md holds project conventions, tool quirks, lessons learned. USER.md holds your profile: name, communication style, skill level.
these files are tiny on purpose. when MEMORY.md hits ~80% capacity, the agent consolidates: merges related entries, drops redundancy, keeps only the densest facts.
natural selection pressure applied to memory. the files stay small, but what's inside gets sharper over time.
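the trigger logic is simple enough to sketch. assuming the ~2,500-char budget and 80% threshold from the post (the real consolidation merge is done by the model, not by code like this):

```python
CAPACITY = 2_500   # hypothetical char budget for MEMORY.md
THRESHOLD = 0.8    # consolidate at ~80% capacity, per the post

def needs_consolidation(memory_text: str) -> bool:
    return len(memory_text) >= CAPACITY * THRESHOLD

def consolidate(entries: list[str]) -> list[str]:
    # toy consolidation: drop exact duplicates, keep first occurrence;
    # merging related entries is left to the model in the real agent
    seen, kept = set(), []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            kept.append(entry)
    return kept
```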
tier 2: full-text session search (SQLite + FTS5)
every conversation gets stored in SQLite with FTS5 indexing. the agent can search weeks of past sessions on demand.
when the agent calls session_search: FTS5 ranks matches in ~10ms over 10,000+ docs, an LLM summarizes the top hits, and a concise result returns to context.
tier 1 is always present but tiny. tier 2 has unlimited capacity but requires an active search. critical facts live in memory, everything else is searchable.
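hermes' actual schema isn't public, but the tier-2 idea is reproducible with Python's built-in sqlite3 and an FTS5 virtual table (assuming your Python build ships FTS5, which most do). the table and column names here are invented:

```python
import sqlite3

# in-memory sketch of tier 2: store transcripts, search with FTS5 ranking
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE sessions USING fts5(date, transcript)")
db.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        ("2025-11-01", "debugged the auth token refresh loop"),
        ("2025-11-03", "set up the postgres migration scripts"),
    ],
)

def session_search(query: str, k: int = 5) -> list[tuple[str, str]]:
    # bm25() is FTS5's built-in relevance score; lower means more relevant
    return db.execute(
        "SELECT date, transcript FROM sessions "
        "WHERE sessions MATCH ? ORDER BY bm25(sessions) LIMIT ?",
        (query, k),
    ).fetchall()
```

in the real system an LLM would summarize the top hits before they return to context; here you get the raw rows.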
tier 3: external memory providers
8 pluggable providers that run alongside tiers 1 and 2, never replacing them. three worth knowing: Honcho (dialectic user modeling, 12 identity layers), Holographic (local-first, HRR vectors, no external calls), and Supermemory (context fencing that prevents the same fact from being re-stored infinitely).
when active, hermes auto-syncs every turn: prefetch before, sync after, extract at session end.
how they compose in a single turn
this is the part most people miss. the tiers compose on every turn through a five-step cycle:
1. turn opens. tier 1 is already in prompt, tier 3 prefetches and prepends.
2. agent responds using all three tiers as context.
3. periodic nudge fires (~every 300s). the agent reflects: "has anything worth persisting happened?" if yes, it writes. if no, it returns silently.
4. memory written to MEMORY.md on disk. invisible this session because the prefix cache stays warm.
5. session closes. tier 2 logs the transcript, tier 3 extracts semantics. next session opens with the new state.
agent memory today is either always-on but shallow (stuff everything in the prompt) or deep but passive (vector store that never fires at the right time).
hermes composes across both: tiny always-present files for critical facts, full-text search for deep recall, external providers for semantic modeling, all orchestrated by a nudge that decides autonomously what's worth saving.
the agent doesn't just store memories. it curates them under pressure.
i wrote a full deep dive (article below) covering hermes agent's memory system, self-evolving skills, GEPA optimization, and how to set up multiple specialized agents on your machine.
What actually is GBrain?
(Y Combinator CEO's personal agent brain)
Every agent memory tool you've seen solves a simple problem: store facts, retrieve facts.
GBrain solves a different one. It gives your agent a knowledge system that wires itself, enriches itself, and compounds while you're not even using it.
Here's what makes it fundamentally different from Mem0, Zep, LangMem, or a CLAUDE.md file.
The standard approach to agent memory is vector-based. Your agent stores memories as embeddings, retrieves them by semantic similarity, and that's the loop. Some tools add a knowledge graph on top.
GBrain flips the model entirely. The source of truth is a folder of markdown files. One page per person, one page per company, one page per concept. Every page follows the same two-part structure:
Compiled truth on top: your current best understanding, rewritten as new evidence arrives
Timeline on the bottom: an append-only evidence trail that never gets edited
This is not a vector store with a markdown export. The markdown IS the system of record. You can open it in VS Code, edit it by hand, and gbrain sync picks up the changes.
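The two-part page structure is easy to sketch. This assumes a "## Timeline" heading as the section divider, which is my invention, not GBrain's documented format:

```python
from datetime import date

def update_page(page: str, new_truth: str, evidence: str) -> str:
    """Rewrite the compiled-truth section; append to the timeline.
    The timeline is never edited, only extended."""
    _, _, timeline = page.partition("## Timeline")
    timeline = timeline.rstrip() + f"\n- {date.today().isoformat()}: {evidence}\n"
    return f"{new_truth}\n\n## Timeline{timeline}"
```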
Now the part that makes this compound.
Every time a page is written, GBrain extracts entity references and creates typed relationship links: works_at, invested_in, founded, attended, advises. All deterministic, all regex-based, zero LLM calls.
The knowledge graph wires itself on every single write, without spending tokens.
So when you ask "who works at Acme AI?" or "what has Bob invested in this quarter?", the agent walks the graph instead of relying on vector similarity (which struggles with relational queries like these).
Search layers ~20 deterministic techniques in concert: intent classification, multi-query expansion, vector search, keyword search, reciprocal rank fusion, cosine re-scoring, compiled-truth boosting, and backlink ranking. Each catches what the others miss.
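one of those techniques, reciprocal rank fusion, is worth seeing concretely: it merges rankings from different retrievers (vector, keyword, etc.) using only ranks, so incompatible score scales don't matter. k=60 is the constant from the original RRF paper; GBrain's value is unknown:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(doc) = sum of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```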
But the real unlock is the compounding loop.
GBrain has a signal detector that fires on every message and captures entities in the background. Person mentioned once? They get a stub page. Three mentions across different sources? Web enrichment kicks in. After a meeting? Full pipeline.
The agent runs a dream cycle overnight: scans conversations, enriches missing entities, fixes broken citations, consolidates memory. You wake up and the brain is smarter than when you went to bed.
This is fundamentally different from memory systems that only store what you explicitly tell them to store.
Garry Tan (President and CEO of Y Combinator) built this to run his actual AI agents. It ships with 34 skills, runs on embedded PGLite (no server, ready in 2 seconds), and works as an MCP server for Claude Code, Cursor, and Windsurf.
GBrain:
As an AI Engineer, please learn:
- Harness engineering, not just prompt engineering
- Prompt caching vs. semantic caching tradeoffs
- KV cache management at scale
- Speculative decoding vs quantization
- Structured output failures & fallback chains
- Evals (LLM-as-judge + human evals)
- Cost attribution per feature, not just per model
- Agent guardrails & loop budgets
- LLM observability as a first-class discipline
- Model routing & graceful fallback logic
- Knowing when to fine-tune vs. in-context learning
Claude Code's architecture, mapped.
Claude Code is one of the most powerful agent harnesses out there. it's a lot more than "a CLI that calls claude." the actual system has six layers, and the model is just one node inside the loop.
the diagram breaks down every component:
Input Layer handles session management, permission gating, and YAML-based trust tiers before anything reaches the model.
Knowledge Layer holds the skill registry, context compressor (3-layer, 92% threshold), task graph, and cross-session memory store. this is where harness intelligence lives outside the weights.
Execution Layer runs tool dispatch through a typed registry with one handler per tool. bash, read, write, grep, glob, revert. streaming runtime handles parallel execution. prompt cache reuses stable prefixes at 10% cost.
Integration Layer connects the MCP runtime to external servers. filesystem, git, custom. tools register inward, memory writes outward to agent_memory.md.
Multi-Agent Layer is the most underappreciated piece. subagent spawner, teammate mailboxes over redis pub/sub, FSM protocol (IDLE→REQUEST→WAIT→RESPOND), autonomous board with atomic locks, and worktree isolation with per-task branches and conflict detection on merge.
Observability Layer wraps everything. event bus with lifecycle hooks, background executor running daemon threads non-blocking.
the master agent loop sits at the center. perception → action → observation. it's deliberately simple. a "dumb loop" where the model reasons and the harness mediates.
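that "dumb loop" fits in a few lines. this is a generic sketch of the pattern, not Claude Code's actual implementation — `model` and the tool handlers are stand-ins:

```python
def agent_loop(model, tools, max_steps=10):
    """The model decides; the harness dispatches the tool and feeds the
    observation back. A loop budget keeps it from running forever."""
    observation = None
    for _ in range(max_steps):
        action = model(observation)            # perception -> action
        if action["type"] == "finish":
            return action["result"]
        handler = tools[action["tool"]]        # typed dispatch, one handler per tool
        observation = handler(action["args"])  # action -> observation
    raise RuntimeError("loop budget exhausted")
```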
this is the architecture behind what feels like magic when you use claude code. it's not magic. it's harness engineering.
the article below is a deep-dive covering how Anthropic, OpenAI, LangChain, and others build this pattern from the ground up.
The MCP vs CLI debate.
For most of 2025, AI Engineers argued about it.
The skeptics had real numbers:
- Playwright MCP eats 13.7K tokens
- Chrome DevTools MCP eats 18K
- A 5-server setup burns 55K tokens before any work
The defenders pushed back:
- CLIs break on multi-tenant apps
- No typed contracts, so the agent guesses at outputs
- On unfamiliar APIs, agents waste turns parsing text
Both sides were arguing about the wrong thing.
In November 2025, Anthropic published "Code execution with MCP" and reframed it from first principles.
The problem was never the protocol. It was the habit of dumping every tool's full description into the model's context the moment a session starts. Add the data those tools return, passed through the model on every step, and a single workflow can balloon to 150K tokens. Most of which the model never needed.
The fix is to flip the model's job. Instead of the model calling tools through its context, the model writes code that calls tools through a runtime. The runtime is where tools live. The model only sees what it imports.
In Anthropic's example, a Google Drive transcript flows into a Salesforce CRM update. The old way loaded both tool schemas and piped the entire transcript through the model twice. The new way is ten lines of TypeScript that import what they need. Same task, 2K tokens. A 98.7% drop.
Cloudflare pushed the idea to its limit. They collapsed their entire 2,500-endpoint API from 1.17M tokens of schemas down to 1K tokens, by exposing just two functions: search and execute. The agent writes code that searches the catalog, then executes only what matches.
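the two-function pattern is simple enough to sketch. the catalog entries here are invented, but the shape is the point: the agent never sees the full schema set — it searches for what it needs and executes only that:

```python
# hypothetical tool catalog; in Cloudflare's case this indexes ~2,500 endpoints
CATALOG = {
    "dns.record.create": lambda **kw: {"created": kw},
    "dns.record.delete": lambda **kw: {"deleted": kw},
    "kv.namespace.list": lambda **kw: {"namespaces": []},
}

def search(query: str) -> list[str]:
    """The first of the two exposed functions: find matching tool names."""
    return [name for name in CATALOG if query in name]

def execute(name: str, **kwargs):
    """The second: run exactly one tool, loading nothing else into context."""
    return CATALOG[name](**kwargs)
```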
The new pattern has a name: Code Mode.
It is a runtime where the agent writes code that mixes two primitives. Bash, for anything with a binary already installed like git or curl. Typed module imports, for proprietary APIs where the type signatures load only when the agent actually imports the tool.
That second part is the unlock. Types travel with imports, so the agent gets a strict contract for the tools it picks, and pays nothing for the ones it skips.
MCP's typed contracts plus CLI's lazy loading, in one runtime. The agent picks per task.
"MCP is dead" was the wrong takeaway.
Anthropic just reported 300M MCP SDK downloads, up from 100M at the start of the year. The protocol is not dying. It is the fastest growing piece of agent infrastructure right now.
What died was loading every tool upfront. That was always a bad idea.
If you are building agents in 2026, the rule is simple. Tool definitions belong in code, not in context. The model writes a few lines that call them. The runtime does the rest.
That is what the debate was actually about.
Naive RAG vs. Blockify!
There's a new RAG approach that:
- cuts corpus size by 40x.
- reduces tokens per query by 3x.
- improves vector search relevance by 2.3x.
Blockify GitHub:
this is the most underrated update in the agent space right now.
your AI workflow runs for 47 minutes, burns 312 LLM calls, then crashes at step 8.
most frameworks make you restart from zero.
@crewAIInc just shipped checkpointing. think google docs autosave, but for your agent's work-in-progress. every flow method becomes a recovery point.
resume in one line. fork from any saved state into a new branch. edit past outputs and watch changes ripple downstream. visual TUI to inspect everything.
your pipelines stop being fragile one-shot jobs. they become resumable, inspectable, branchable processes.
zero extra infra. 100% open-source.
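the core idea, stripped of the framework, fits in one function. this is a generic checkpointing sketch, not CrewAI's actual API — persist state after every step so a crash resumes from the last completed one:

```python
import json
import pathlib

def run_flow(steps, state_file="flow_state.json"):
    """Run a list of step functions, writing a recovery point after each.
    Re-running after a crash skips steps that already completed."""
    path = pathlib.Path(state_file)
    state = json.loads(path.read_text()) if path.exists() else {"done": 0, "outputs": []}
    for i, step in enumerate(steps):
        if i < state["done"]:
            continue                            # completed before the crash
        state["outputs"].append(step(state["outputs"]))
        state["done"] = i + 1
        path.write_text(json.dumps(state))      # the "autosave"
    return state["outputs"]
```

forking from a saved state is then just copying the JSON file and continuing with a different step list.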
get started with CrewAI here:
(don't forget to star it ⭐️)