anybody who uses or is learning agentic systems SHOULD READ THIS
the install order I run before any new agentic project:
1. PRIVACY: direnv + a real secrets manager
install direnv, then plug it into your team's password manager (1Password CLI via op run, doppler, infisical, vault, pick one)
what direnv does: loads per-folder environment variables when you cd in, unloads when you cd out. the real move is wiring it into your secrets manager so credentials NEVER live in plain text on disk
what this stops:
- API keys accidentally committed to git history, one of the most common ways agent credentials leak
- credentials leaking from one project into another through your shell history
- shared .env files that one teammate quietly backs up to Dropbox
- secrets that survive a laptop theft because they were sitting in /Users/you/projects
the part nobody mentions: most "my agent got jailbroken" stories actually trace back to one credential the agent had access to that it shouldn't have. scope keys to projects, scope projects to folders, and the blast radius of any single compromise drops dramatically
I shipped 2 agents with keys in .env files before switching. the day I plugged direnv into op run I stopped having that whole class of nightmare
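a minimal sketch of the agent-side half of this, assuming your secrets arrive as environment variables (however direnv and your secrets manager inject them). `require_secrets`, `redact`, and `AGENT_API_KEY` are hypothetical names for illustration, not part of direnv or any secrets CLI:

```python
import os


class MissingSecret(RuntimeError):
    """Raised when a required credential was not injected into the environment."""


def require_secrets(names, env=None):
    """Return {name: value} for each required secret, or fail fast.

    Values are read only from the process environment, never from a file
    on disk, so there is no plaintext fallback to forget about.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingSecret("not injected: " + ", ".join(sorted(missing)))
    return {name: env[name] for name in names}


def redact(value, keep=4):
    """Mask a secret for logs, leaving only the last `keep` characters."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


if __name__ == "__main__":
    # AGENT_API_KEY is a placeholder name; use whatever your project defines.
    secrets = require_secrets(["AGENT_API_KEY"])
    print("loaded key:", redact(secrets["AGENT_API_KEY"]))
```

the point of failing fast: if you run the agent outside the project folder, it refuses to start instead of quietly falling back to some key lying around from another project.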
2. TOKENS: litellm or portkey as your model proxy
one URL that fronts every AI provider (Anthropic, OpenAI, Google, Mistral, local models). all your spend flows through one place
what it saves you:
- response caching keyed by prompt hash, which can cut your bill 30-60% on repeat tasks
- automatic fallback on rate limits (Sonnet hits a 429? falls to Opus, then GPT, then your local backup, no broken users)
- per-feature and per-user budget caps, block the call before it costs $200 instead of auditing it after
- model routing rules, cheap tasks to Haiku, expensive ones to Opus, never the wrong way
- PII redaction before requests leave your network, security side benefit
the part nobody mentions: every "$4k AI bill" story I've heard ends with "we didn't have a proxy in front." this is where you put guardrails around spend BEFORE the spend happens
I spent 2 weeks building my own router. replacing it with litellm took 20 minutes. I will be embarrassed about this forever
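to make the three core behaviors concrete (cache by prompt hash, fall back on a 429, block before the budget is blown), here's a toy sketch. this is NOT litellm's or portkey's API, just an illustration of what the proxy does for you. `ToyProxy`, `RateLimited`, `BudgetExceeded`, and the flat per-call cost are all assumptions made up for the example:

```python
import hashlib


class RateLimited(Exception):
    """Stand-in for a provider returning HTTP 429."""


class BudgetExceeded(RuntimeError):
    """Raised before a call that would push spend past the cap."""


class ToyProxy:
    def __init__(self, providers, budget_cents):
        # providers: ordered list of (name, call_fn, cost_cents_per_call).
        # A flat per-call cost is a simplification; real proxies price by token.
        self.providers = providers
        self.budget_cents = budget_cents
        self.spent_cents = 0
        self.cache = {}

    def complete(self, prompt):
        # Cache key is a hash of the prompt, so repeat prompts cost nothing.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]

        for name, call_fn, cost_cents in self.providers:
            # Check the budget BEFORE making the call, not after.
            if self.spent_cents + cost_cents > self.budget_cents:
                raise BudgetExceeded(f"{name} would exceed {self.budget_cents} cents")
            try:
                text = call_fn(prompt)
            except RateLimited:
                continue  # fall through to the next provider in the list
            self.spent_cents += cost_cents
            self.cache[key] = (name, text)
            return self.cache[key]

        raise RuntimeError("every provider was rate limited")
```

in practice you'd get all of this from config in the real proxy. the sketch is only here so the words "fallback" and "budget cap" map to something you can read in 30 lines.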
3. CONTEXT: uv + git commit on every passing eval
install uv (the new Python package manager, 10-100x faster than pip+venv, by the Astral team behind ruff). then commit every time an eval suite PASSES, with the model version and pass rate in the commit message
what this preserves:
- exact dependency set via uv.lock, you always know which packages your agent was using, no nasty surprises from a quiet update
- exact prompt + code state, you can reproduce any past run from a single git hash
- exact model version paired to exact pass rate, a paper trail when prod breaks weeks later
- one-command rollback to a known-working state when a refactor goes sideways
- a compliance story, every prompt version tied to a model version in your commit log
the security side: when something blows up in prod, you want to say "the prompt was version X, model was Sonnet 4.6.1, last eval pass rate was 94%." not "I think we deployed on Tuesday?" the first is an incident report. the second is a resignation letter
I've lost more agents to "I changed 3 prompts in one session and broke something" than to any actual bug
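a rough sketch of the "commit on every passing eval" habit, assuming your eval run gives you a passed count and a total. `eval_commit_message`, `commit_if_passing`, the message format, and the 0.9 threshold are all hypothetical choices, not a convention from uv or git:

```python
import subprocess


def eval_commit_message(model, passed, total, note=""):
    """Build a commit message that records model version and pass rate."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    rate = 100.0 * passed / total
    subject = f"eval pass: {model} {passed}/{total} ({rate:.1f}%)"
    return subject + (f"\n\n{note}" if note else "")


def commit_if_passing(model, passed, total, threshold=0.9, run=subprocess.run):
    """Commit the working tree only when the pass rate clears the threshold.

    `run` is injectable so the function can be exercised without touching git.
    Returns the commit message, or None if the suite did not pass.
    """
    message = eval_commit_message(model, passed, total)
    if passed / total < threshold:
        return None
    # Stages everything, including uv.lock, so the dependency set is
    # captured in the same commit as the prompts and code.
    run(["git", "add", "-A"], check=True)
    run(["git", "commit", "-m", message], check=True)
    return message
```

later, `git log --oneline` becomes your history of which model scored what, and any of those hashes is a state you can check out and re-run.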
4. VISIBILITY: mitmproxy in front of every LLM call
it's basically a wiretap for your agent. install it, point your agent through it, and now you see every conversation your agent has with the model in real time
what actually shows up:
- every silent retry your SDK sneaks in when a call fails
- the full prompt being sent (including any creds you accidentally embedded)
- what the model returns BEFORE your code reacts to it
- exact token usage per call, per tool, per loop iteration, which is what your cost is computed from
- responses that quietly trigger your code into doing something you didn't intend, this is where prompt injection lives
the part nobody talks about: if a website your agent scraped slipped instructions into its data, mitmproxy is how you SEE the moment your agent decides to follow them. without this layer, you're trusting your agent did the right thing, not verifying
I shipped 3 agents before adding this. I have no honest idea what they were doing in production
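a small sketch of what a mitmproxy addon script for this could look like. mitmproxy calls a module-level `response(flow)` hook once per completed response; the rest (`LLM_HOSTS`, `summarize_exchange`, the file name `llm_tap.py`, and the assumption that the provider returns a JSON body with a `usage` field) is illustrative and will need adjusting for your providers:

```python
import json

# Assumption: the hosts your agent talks to. Adjust for your providers.
LLM_HOSTS = {"api.anthropic.com", "api.openai.com"}


def summarize_exchange(host, status, request_text, response_text):
    """Pull out the fields worth eyeballing from one LLM HTTP exchange."""
    summary = {
        "host": host,
        "status": status,
        "request_chars": len(request_text or ""),
    }
    try:
        body = json.loads(response_text or "")
    except ValueError:
        body = None  # streaming or non-JSON responses land here
    if isinstance(body, dict) and isinstance(body.get("usage"), dict):
        summary["usage"] = body["usage"]
    return summary


def response(flow):
    """mitmproxy hook: runs once for every completed HTTP response."""
    if flow.request.pretty_host not in LLM_HOSTS:
        return
    summary = summarize_exchange(
        flow.request.pretty_host,
        flow.response.status_code,
        flow.request.get_text(strict=False),
        flow.response.get_text(strict=False),
    )
    print(json.dumps(summary))
```

you'd load it with something like `mitmdump -s llm_tap.py` and route the agent's traffic through the proxy. for HTTPS the agent's environment also has to trust mitmproxy's CA certificate, otherwise the calls just fail. retries show up as repeated lines for the same request, which is exactly the stuff your SDK hides from you.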
5. EVALS: inspect-ai (the framework the labs actually use)
an eval framework is what tells you "this agent works" with numbers instead of vibes. inspect-ai is the one Anthropic, DeepMind, and the UK AI Safety Institute use for the eval reports you read in their papers. open source, MIT licensed
what your homegrown version won't have:
- run the same task across 5 different models and compare scores side by side
- pre-built tests for risky agent behavior (lying, manipulating, misusing tools)
- proper structure for evaluating tool-using agents, not just chat
- repeatable scoring, the same input always gets graded the same way
- reproducible eval seeds, so a flaky test is actually flaky and not just unlucky
I wrote my own eval harness 4 times across 4 projects. threw it out 4 times
if you ever want to say "my agent passes safety checks" out loud, the check has to come from a framework someone else can re-run. this is that framework
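to show what "repeatable scoring" and "same task across models" mean in the smallest possible form, here's a toy sketch. it is NOT inspect-ai's API and not a replacement for it (that's the whole point of this section). `exact_match`, `run_eval`, and the stub models are hypothetical:

```python
def exact_match(output, target):
    """Deterministic scorer: the same (output, target) always gets the same grade."""
    return output.strip().lower() == target.strip().lower()


def run_eval(models, samples, scorer=exact_match):
    """Score every model on the same samples.

    models:  {name: callable taking a prompt and returning a string}
    samples: list of (prompt, target) pairs
    Returns {name: pass_rate} so results can be compared side by side.
    """
    if not samples:
        raise ValueError("need at least one sample")
    results = {}
    for name, model_fn in models.items():
        passed = sum(scorer(model_fn(prompt), target) for prompt, target in samples)
        results[name] = passed / len(samples)
    return results


if __name__ == "__main__":
    # Stub "models" so the sketch runs without any API key.
    samples = [("2+2=", "4"), ("capital of France?", "Paris")]
    models = {
        "stub-a": lambda p: {"2+2=": "4", "capital of France?": "paris"}[p],
        "stub-b": lambda p: "4",
    }
    for name, rate in run_eval(models, samples).items():
        print(f"{name}: {rate:.0%}")
```

everything a real framework adds on top (tool-use scaffolding, logging, pre-built safety tasks, other people being able to re-run it) is the part that took me 4 rewrites to admit I shouldn't be building myself.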
the move that ties this together: keep a /lessons.md in every repo. every weird agent behavior, every edge case, every config change you find at 2am, write it down
you will not remember it. you'll come back in 3 weeks and the lessons file is the only reason you still know what's going on
lock these 5, keep the lessons file, your next agentic system takes 2 days instead of 2 months
p.s. half of "AI agent" content online is people who've never run mitmproxy on their own loop. they don't actually know what their agent is doing. they're shipping demo videos. don't be that guy
Why Generic Humanoid Robots Will Fail — And What's Next
Imagine an alternate world where we never invented the car. In that world, a robotics engineer might reasonably conclude that robotic horses are the future — replace the living ones, keep the stables and saddles, ride them to work. Convenient, modern, and the roads stay free of manure. It sounds absurd only because you already know about cars.
We keep making the same mistake with humanoid robots.
Consider transportation. To finally make driving safe, we had two options: put a humanoid in the driver's seat, or embed sensing and compute directly into the vehicle. Waymo chose the latter. Nobody sits in the driver's seat. The vehicle exists purely to move people efficiently from A to B. The humanoid was not needed.
Consider a sock factory. Yes, you could replace workers with humanoid robots one-for-one on the assembly line — and gain maybe 2-3x efficiency. Or you could completely redesign the workflow around a purpose-built autonomous sewing system and eliminate most of the factory, the chairs, the cafeteria, the manual sewing machines, the HVAC, the doors, and the restrooms. The actual optimization is to side-step the previous human-imposed physical constraint.
Look at Ukraine. The front lines aren't filling up with Terminator-style humanoids carrying rifles. Human soldiers are being replaced by heterogeneous swarms of purpose-specific drones: some for reconnaissance, some for logistics, some for delivering munitions. War is being restructured around the desired outcome (survival), not the soldier's shape.
Consider a 1970s office. Want to move information through teams of people? We once used typists, paper, trucks to supply the paper, typewriters, and repair technicians. A linear improvement would have been to replace the human typist with a 10-fingered humanoid. What actually happened? The entire workflow (paper, printers, typewriter factories, delivery trucks, the desks, the offices) was obliterated. Email deleted the human clerk's entire universe.
Consider early cancer detection by mammography. Today, getting a mammogram requires expensive hardware, logistics infrastructure, human nurses and doctors, a biopsy workflow, a human pathologist with a microscope (imported from Germany or Japan), a written finding, and multiple physician reviews. Sure, you could replace the pathologist with a humanoid (the microscope focus knob requires finger dexterity) and get a modest efficiency gain (and faster responses at 2 am). Or, in the far more likely future, we all swallow a cancer detection pill every few months, and 24 hours later a color-changing sticker on our arm turns red or green. No hardware. No hospital. No logistics. No pathologist. No office. No desk. No humanoid. The workflow isn't optimized by a literal drop-in swap of a human pathologist for a humanoid. The entire workflow simply ceases to exist.
Consider life sciences research and drug development. We're seeing excitement about robot arms and humanoids pipetting water in research labs. Robot horses, episode 7. We don't design aircraft by crashing test planes; we simulate them entirely in software first. Biology will go the same way. The path to scalable drug discovery isn't robot arms in conventional wet labs demonstrating 10-fingered prowess in manipulating Eppendorf tubes filled with purple food coloring. Rather, we need in-silico biological models that evaluate billions of hypotheses computationally, with physical manipulation of atoms only at the very end.
The clear pattern. Efficient automation doesn't try to replicate a 10-fingered human in a static context. Automation eliminates physical rate-limiting steps in their entirety. That's why "classical" humanoid robots, as a generic category, will largely fail. They're robotic horses. They assume the infrastructure and workflows stay fixed and only the 10-fingered human is swapped out. That's not how economic and technological pressure works.
What actually matters? If humans continue to inhabit the physical world, then moving atoms will remain important, and that requires five things: atoms, energy, force generation and actuation, sensing, and compute. Everything else — form factor, number of limbs, type of end effector — is a variable to be optimized for the task.
So if you are a pathologist, a robotics engineer, a teacher, a parent, a politician, or a sewing factory owner - please think different. Most obviously, we should all anticipate, and build for, a future in which robots exhibit extreme physical fluidity: Two arms or four. Wheels or legs. Tentacles or flippers. Three fingers or twelve, or none at all. Eyes at the front, side, or tip of a tentacle. At OpenMind, we don't care what you look like right now - we got you, in all your physical form factors. OM2 ships in July, for all machines. Let's build.