anybody who uses or is learning agentic systems SHOULD READ THIS
the install order I run before any new agentic project:
1. PRIVACY: direnv + a real secrets manager
install direnv, then plug it into your team's password manager (1Password CLI via op run, doppler, infisical, vault, pick one)
what direnv does: loads per-folder environment variables when you cd in, unloads when you cd out. the real move is wiring it into your secrets manager so credentials NEVER live in plain text on disk
what this stops:
- API keys accidentally committed to git history, the most common AI agent breach pattern in 2026
- credentials leaking from one project into another through a leftover export or your shell history
- shared .env files that one teammate quietly backs up to Dropbox
- secrets a laptop thief can read because they were sitting in plain text under /Users/you/projects
the part nobody mentions: most "my agent got jailbroken" stories actually trace back to one credential the agent had access to that it shouldn't have. scope keys to projects, scope projects to folders, and the blast radius of any single compromise drops dramatically
I shipped 2 agents with keys in .env files before switching. the day I plugged direnv into op run I stopped having that whole class of nightmare
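a minimal sketch of the consuming side, assuming direnv or op run has already injected the variable into the environment. require_secret, plaintext_env_files, and MY_AGENT_API_KEY are names made up for illustration, not part of direnv or the 1Password CLI:

```python
import os
from pathlib import Path


class MissingSecret(RuntimeError):
    """Raised when a credential was not injected into the environment."""


def require_secret(name, env=None):
    # Read from the process environment only; never fall back to a file on disk.
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise MissingSecret(f"{name} is not set; load it through your secrets manager")
    return value


def plaintext_env_files(root):
    # List .env-style files under root. .envrc is skipped because direnv owns it.
    # A hit is a prompt to inspect the file by hand, not proof of a leak.
    return sorted(
        p for p in Path(root).rglob(".env*")
        if p.is_file() and p.name != ".envrc"
    )


if __name__ == "__main__":
    for path in plaintext_env_files("."):
        print(f"check this file for plain-text secrets: {path}")
    # MY_AGENT_API_KEY is a hypothetical variable name.
    api_key = require_secret("MY_AGENT_API_KEY")
```

failing loudly at startup is the point: the agent refuses to run rather than quietly picking up a stale key from somewhere else.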
2. TOKENS: litellm or portkey as your model proxy
one URL that fronts every AI provider (Anthropic, OpenAI, Google, Mistral, local models). all your spend flows through one place
what it saves you:
- response caching keyed by prompt hash, which can cut your bill 30-60% on repeat tasks
- automatic fallback on rate limits (Sonnet hits a 429? falls to Opus, then GPT, then your local backup, no broken users)
- per-feature and per-user budget caps, block the call before it costs $200 instead of auditing it after
- model routing rules, cheap tasks to Haiku, expensive ones to Opus, never the wrong way
- PII redaction before requests leave your network, security side benefit
the part nobody mentions: every "$4k AI bill" story I've heard ends with "we didn't have a proxy in front." this is where you put guardrails around spend BEFORE the spend happens
I built my own router for 2 weeks. it took 20 minutes to replace with litellm. I will be embarrassed about this forever
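to make the mechanics concrete, here is a toy version of three of the behaviors listed above: a cache keyed by prompt hash, fallback on a rate limit, and a budget cap checked before the call goes out. this is NOT litellm's or portkey's API. TinyProxy, RateLimited, BudgetExceeded, and the flat per-call cost are all assumptions made for illustration:

```python
import hashlib


class RateLimited(Exception):
    """Stand-in for an HTTP 429 from a provider."""


class BudgetExceeded(Exception):
    """Raised before a call that would push spend over the cap."""


class TinyProxy:
    def __init__(self, providers, budget_usd):
        # providers: ordered list of (name, call_fn, estimated_cost_usd_per_call)
        self.providers = providers
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.cache = {}

    def complete(self, prompt):
        # Cache key is a hash of the prompt, so identical prompts cost nothing.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        for name, call_fn, cost in self.providers:
            # Block the call before it is made, not after the bill arrives.
            if self.spent_usd + cost > self.budget_usd:
                raise BudgetExceeded(f"{name} would exceed ${self.budget_usd}")
            try:
                text = call_fn(prompt)
            except RateLimited:
                continue  # fall through to the next provider in the chain
            self.spent_usd += cost
            self.cache[key] = (name, text)
            return self.cache[key]
        raise RuntimeError("every provider in the chain was rate limited")
```

a real proxy does the same thing per user and per feature, with real token pricing instead of a flat estimate.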
3. CONTEXT: uv + git commit on every passing eval
install uv (the new Python package manager, 10-100x faster than pip+venv, by the Astral team behind ruff). then commit every time an eval suite PASSES, with the model version and pass rate in the commit message
what this preserves:
- exact dependency set via uv.lock, you always know which packages your agent was using, no nasty surprises from a quiet update
- exact prompt + code state, you can reproduce any past run from a single git hash
- exact model version paired to exact pass rate, a paper trail when prod breaks weeks later
- one-command rollback to a known-working state when a refactor goes sideways
- a compliance story, every prompt version tied to a model version in your commit log
the security side: when something blows up in prod, you want to say "the prompt was version X, model was Sonnet 4.6.1, last eval pass rate was 94%." not "I think we deployed on Tuesday?" the first is an incident report. the second is a resignation letter
I've lost more agents to "I changed 3 prompts in one session and broke something" than to any actual bug
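a sketch of the commit-on-pass habit, assuming your eval run gives you a passed count and a total. the function names, the message format, the 0.9 threshold, and the model string in the usage line are all made up for illustration:

```python
import subprocess


def eval_commit_message(model, passed, total):
    # Put the model version and pass rate where `git log` will show them.
    rate = 100.0 * passed / total
    return f"eval pass: {passed}/{total} ({rate:.1f}%) model={model}"


def commit_passing_eval(model, passed, total, threshold=0.9, run=subprocess.run):
    # Only commit when the suite clears the (hypothetical) threshold.
    if total == 0 or passed / total < threshold:
        return None
    message = eval_commit_message(model, passed, total)
    # Stage everything, including uv.lock, so the dependency set is captured.
    run(["git", "add", "-A"], check=True)
    # --allow-empty records the passing run even if no files changed.
    run(["git", "commit", "--allow-empty", "-m", message], check=True)
    return message
```

called as commit_passing_eval("example-model-v1", 47, 50), this would leave a commit whose subject line carries the 94.0% pass rate next to the model name.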
4. VISIBILITY: mitmproxy in front of every LLM call
it's basically a wiretap for your agent. install it, trust its CA certificate so it can decrypt HTTPS, point your agent through it, and now you see every conversation your agent has with the model in real time
what actually shows up:
- every silent retry your SDK sneaks in when a call fails
- the full prompt being sent (including any creds you accidentally embedded)
- what the model returns BEFORE your code reacts to it
- exact token usage per call, per tool, per loop iteration (the usage numbers in each response, which you can turn into cost)
- responses that quietly trigger your code into doing something you didn't intend, this is where prompt injection lives
the part nobody talks about: if a website your agent scraped slipped instructions into its data, mitmproxy is how you SEE the moment your agent decides to follow them. without this layer, you're trusting your agent did the right thing, not verifying
I shipped 3 agents before adding this. I have no honest idea what they were doing in production
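a small mitmproxy script along these lines could log one line per LLM call. mitmproxy calls a module-level response(flow) hook for every completed exchange. the host list and the summarize_exchange helper are assumptions for illustration, and streamed responses will not parse as one JSON body, so usage comes back empty for those:

```python
import json

# Assumption: the agent talks to these API hosts. Adjust for your providers.
LLM_HOSTS = {"api.anthropic.com", "api.openai.com"}


def summarize_exchange(host, request_text, response_text):
    # Ignore traffic that is not going to a model provider.
    if host not in LLM_HOSTS:
        return None
    try:
        usage = json.loads(response_text).get("usage")
    except (ValueError, AttributeError):
        usage = None  # not a JSON object (error page, streamed body, etc.)
    return {"host": host, "request_chars": len(request_text), "usage": usage}


def response(flow):
    # mitmproxy event hook: runs once per completed request/response pair.
    summary = summarize_exchange(
        flow.request.pretty_host,
        flow.request.get_text() or "",
        flow.response.get_text() or "",
    )
    if summary is not None:
        print(json.dumps(summary))
```

saved as a file (say log_llm.py, a hypothetical name), it loads with mitmdump -s log_llm.py. the agent then needs its HTTPS proxy setting pointed at mitmproxy, which listens on port 8080 by default.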
5. EVALS: inspect-ai (the framework the labs actually use)
an eval framework is what tells you "this agent works" with numbers instead of vibes. inspect-ai was built by the UK AI Safety Institute (since renamed the AI Security Institute), and it's the one Anthropic and DeepMind also use for the eval reports you read in their papers. open source, MIT licensed
what your homegrown version won't have:
- run the same task across 5 different models and compare scores side by side
- pre-built tests for risky agent behavior (lying, manipulating, misusing tools)
- proper structure for evaluating tool-using agents, not just chat
- repeatable scoring, the same input always gets graded the same way
- reproducible eval seeds, so a flaky test is actually flaky and not just unlucky
I wrote my own eval harness 4 times across 4 projects. threw it out 4 times
if you ever want to say "my agent passes safety checks" out loud, the check has to come from a framework someone else can re-run. this is that framework
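to show what "repeatable scoring" and "reproducible seeds" mean in practice, here is a plain-Python toy. it is NOT inspect-ai's API. exact_match, pass_rates, and the idea of a model as a simple function from question to answer are assumptions for illustration:

```python
import random


def exact_match(output, target):
    # Deterministic scorer: the same output always gets the same grade.
    return output.strip().lower() == target.strip().lower()


def pass_rates(models, samples, n, seed=0):
    # A fixed seed means every run, and every model, sees the same n samples.
    rng = random.Random(seed)
    chosen = rng.sample(samples, n)
    rates = {}
    for name, model_fn in models.items():
        passed = sum(exact_match(model_fn(question), target) for question, target in chosen)
        rates[name] = passed / n
    return rates
```

because the subset and the scorer are both fixed, a difference in pass rate between two models (or two prompt versions) comes from the model, not from which questions happened to get drawn.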
the move that ties this together: keep a /lessons.md in every repo. every weird agent behavior, every edge case, every config change you find at 2am, write it down
you will not remember it. you'll come back in 3 weeks and the lessons file is the only reason you still know what's going on
lock these 5, keep the lessons file, your next agentic system takes 2 days instead of 2 months
p.s. half of "AI agent" content online is people who've never run mitmproxy on their own loop. they don't actually know what their agent is doing. they're shipping demo videos. don't be that guy