LLM inference speed with vs. without speculative decoding:
(learn how it works below)
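Quick recap of the idea: a small draft model proposes a few tokens cheaply, and the large target model verifies all of them in a single forward pass, keeping the longest prefix it agrees with. Below is a minimal greedy-decoding sketch in Python (the published algorithm accepts/rejects via rejection sampling over both models' distributions; `target` and `draft` here are hypothetical callables returning logits of shape [seq_len, vocab]):

```python
import torch

def speculative_decode(target, draft, prompt_ids, k=4, max_new=64):
    # Hypothetical model interface: model(ids) -> logits [seq_len, vocab].
    ids = prompt_ids
    while ids.shape[0] - prompt_ids.shape[0] < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal = ids
        for _ in range(k):
            next_tok = draft(proposal)[-1].argmax()
            proposal = torch.cat([proposal, next_tok[None]])

        # 2) The expensive target model scores ALL k proposals in ONE pass.
        logits = target(proposal)

        # 3) Accept the longest prefix where the target agrees (greedy case).
        n = ids.shape[0]
        accepted = 0
        for i in range(k):
            if logits[n - 1 + i].argmax() == proposal[n + i]:
                accepted += 1
            else:
                break

        # 4) Append the target's own next token, so every iteration emits
        #    at least one token even when nothing was accepted.
        fix = logits[n - 1 + accepted].argmax()
        ids = torch.cat([proposal[: n + accepted], fix[None]])
    return ids
```

The speedup comes from step 2: verifying k drafted tokens costs one target forward pass instead of k.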
Claude vs. Claude Code vs. Cowork.
Anthropic offers three distinct ways to interact with Claude, and each one targets a fundamentally different workflow. Think of it as: Chat for thinking, Code for building, and Cowork for doing.
Here's a quick breakdown:
1️⃣ Claude Chat
This is the conversational AI assistant most people already know. You type a prompt, Claude responds, and you iterate together.
- Turn rough ideas into structured plans through conversation
- Write emails, reports, essays, and long-form content
- Research and summarize complex topics in minutes
- Analyze documents, PDFs, and images
- Build interactive prototypes through Artifacts
The key here is that everything happens through conversation. You're thinking with Claude, not delegating work to it.
It's available on every device, has a free tier, and supports persistent memory across sessions.
The tradeoff is that it has no direct access to your local files (upload only), and it can't generate raster images natively.
2️⃣ Claude Code
This is a terminal-native coding agent. You describe what you want in plain English, and Claude reads your codebase, writes code, runs tests, fixes errors, and ships the result.
- Build and debug entire features across the full codebase
- Write, run, and fix tests automatically
- Manage git workflows and create pull requests
- Spawn multiple parallel agents working on different parts of a task simultaneously
It handles the full development cycle end to end, from planning to execution to testing. With the CLAUDE.md configuration file, you can teach it your project's conventions, patterns, and constraints so it writes code the way your team expects.
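For illustration, a hypothetical CLAUDE.md might look like this (the file is free-form markdown; these conventions are invented for the example):

```markdown
# Project conventions

- TypeScript strict mode everywhere; no `any`.
- Tests live next to source files as *.test.ts; run them with `npm test`.
- Database access goes through src/db/ repositories, never directly from routes.
- Commit messages follow Conventional Commits.
```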
The tradeoff is a steeper learning curve compared to Chat, and token costs can add up during heavy sessions.
3️⃣ Claude Cowork
This is the newest addition. Anthropic describes it as Claude Code for the rest of your work.
It's an agentic desktop assistant that automates file management and repetitive tasks through a GUI. You describe an outcome, and Claude plans, executes, and delivers finished work: formatted documents, organized file systems, spreadsheets with working formulas, and synthesized research.
- Direct local file access and editing (no upload/download cycle)
- Schedule recurring tasks automatically
- Assign tasks remotely via Dispatch from your phone
- Computer Use lets Claude control your screen directly
It runs inside a sandboxed virtual machine on your computer, so Claude can only access folders you explicitly grant. You don't need to know how to code to use it.
The tradeoff is that your computer must stay awake for tasks to run, and it's still in research preview.
Here's how to think about choosing between them:
→ If you need to think through a problem or get writing/research help, use Chat
→ If you're building software and want an autonomous coding partner, use Code
→ If you have a clearly defined deliverable that involves local files and desktop workflows, use Cowork
All three are included in the same subscription starting at $20/month, which makes it one of the highest-leverage subscriptions in productivity software right now.
I've put together a visual below that maps the workflow of each product side by side.
Also, if you want to go deeper into Claude Code specifically, my co-founder wrote a detailed article covering the anatomy of the .claude/ folder, a complete guide to CLAUDE.md, custom commands, skills, agents, and permissions, and how to set them all up properly.
Read it below.
The full-stack AI engineering roadmap covering:
> Prompt engineering
> RAG systems
> Context engineering
> Fine-tuning
> Agents
> LLM deployment
> LLM optimization
> Safety, evals & observability
Free and open-source resources in the article below.
(don't forget to bookmark)
A smarter Claude model burns more tokens, not fewer!
And it's not a minor 3-5% difference.
It's 54% higher token usage.
It sounds counterintuitive, but MCPMark V2 benchmarks confirmed this across 21 backend tasks.
The reason has nothing to do with the model itself.
Instead, it has to do with what the agent needs to know before it can start building.
When you're building a full-stack app, Claude Code must understand the entire backend, like:
- what tables already exist
- what RLS policies are active
- what storage buckets are available
- which auth providers are configured
- and what edge functions are deployed
Most backends don't hand over this info cleanly.
For instance, with Supabase, asking for OAuth setup via MCP returns the entire auth docs, including sections on email/password, magic links, phone auth, SAML, and SSO.
That's 5-10x more tokens than the agent actually needed. And this happens on every MCP call across every domain.
The agent then discovers the state through separate calls to list_tables, execute_sql, and list_extensions, each returning a partial view.
Some info, like which auth providers are configured, isn't queryable through MCP at all.
And when something breaks, Supabase returns the same error code whether the rejection came from the platform layer or from the function code.
The agent has no way to infer accurately, so it cycles through code-level fixes for a problem that might not be in the code at all.
A better model has no magical way to skip these gaps.
In fact, it tries even harder to fill them, which means more discovery queries, more reasoning, and more retries. That's why the token cost went up with a better Claude model.
A smarter approach is actually implemented in InsForge, an open-source backend (self-hostable via Docker) that offers the same primitives as Supabase but structures everything around the assumption that an agent is operating the backend, not a human on a dashboard.
Before writing any code, a single CLI call returns the full backend topology in ~500 tokens.
The agent sees every table, auth provider, storage bucket, and available AI model in one structured response.
Instead of one broad skill that triggers on everything (as Supabase's does), it has four narrowly scoped skills:
- Creating tables activates only the CLI skill.
- Debugging broken code activates only the debug skill.
- Building the frontend activates only the SDK skill.
- Wiring third-party auth activates only the integrations skill.
This keeps the agent's cognitive load lean since it only loads what matches the current task.
The CLI returns structured JSON with semantic exit codes on every operation, so the agent always knows whether something succeeded or failed and why. There are no ambiguous 401s that may indicate three different things.
I tested both backends on the same full-stack RAG app and recorded the full sessions.
Supabase:
- consumed 10.4M tokens
- needed 10 manual interventions
InsForge:
- consumed 3.7M tokens
- completed the entire build without any errors
This isn't a Supabase-specific problem. Most backends were designed for humans who can see dashboards and interpret raw errors.
When an agent operates the backend instead, every missing piece of context needs a discovery call, and every ambiguous error enters a retry loop.
Fixing this requires giving agents structured backend context before they start writing code.
InsForge is an open-source implementation of exactly this, and you can self-host it via Docker.
GitHub repo (9k+ stars):
(don't forget to star it ⭐ )
You can find my walkthrough on building the full-stack RAG app with Supabase and InsForge in the article below.
Karpathy said something you'll regret ignoring:
"Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."
The reason most people can't do this today is because their AI has little to no memory of their work.
You sit in meetings, read threads, make decisions, and your brain quietly drops half of it by next week. Then you spend time re-reading, re-asking, re-explaining context to your own AI.
You can't remove yourself from the loop when YOU are the only one who remembers what happened.
That's why the smartest builders are setting up AI second brains that compound everything automatically.
Rowboat is an open-source implementation of exactly this, built on top of the same Markdown-and-Obsidian foundation that Karpathy uses, but extended into a work context.
Emails, meetings, decisions, commitments, deadlines: everything is linked in a knowledge graph that gets denser every day without you touching it.
And the whole setup runs 100% locally.
6 months from now, you'll either have an AI second brain or wish you did.
Find my full 100% local setup guide in the article quoted below to start today.
Here's the Rowboat Repo:
(don't forget to star it 🌟)
A time-complexity cheat sheet of 10 ML algorithms:
What's the inference time-complexity of KMeans?
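(To check yourself: at inference, KMeans just assigns each point to its nearest centroid, so with $k$ clusters in $d$ dimensions the cost is)

$$O(k \cdot d) \ \text{per sample}, \qquad O(n \cdot k \cdot d) \ \text{for a batch of } n \text{ samples.}$$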
The most comprehensive RL overview I've ever seen.
Kevin Murphy from Google DeepMind, who has over 128k citations, wrote this.
What makes this different from other RL resources:
→ It bridges classical RL with the modern LLM era:
There's an entire chapter dedicated to "LLMs and RL" covering:
- RLHF, RLAIF, and reward modeling
- PPO, GRPO, DPO, RLOO, REINFORCE++
- Training reasoning models
- Multi-turn RL for agents
- Test-time compute scaling
→ The fundamentals are crystal clear
Every major algorithm (value-based methods, policy gradients, actor-critic) is explained with mathematical rigor.
→ Model-based RL and world models get proper coverage
Covers Dreamer, MuZero, MCTS, and beyond, which is exactly where the field is heading.
→ Multi-agent RL section
Game theory, Nash equilibrium, and MARL for LLM agents.
I have shared the arXiv paper in the replies!
Layers of observability in AI systems, explained visually:
If you’re deploying LLM-powered apps to real users, you need to know what’s happening inside your pipeline at every step.
Here’s the mental model (see the attached diagram):
Think of your AI pipeline as a series of steps. For simplicity, consider RAG.
A user asks a question, it flows through multiple components, and eventually, a response comes out.
Each of those steps takes time, each step can fail, and each step has its own cost. And if you’re only looking at the input and output of the entire system, you will never have full visibility.
This is where traces and spans come in.
> A Trace captures the entire journey, from the moment a user submits a query to when they get a response. Look at the "Trace" column in the diagram below. One continuous bar that encompasses everything.
> Spans are the individual operations within that trace. Each colored box on the right represents a span.
Let’s understand what each span captures in this case:
- Query Span: The user submits a question. This is where your trace begins. You capture the raw input, timestamp, and session info.
- Embedding Span: The query hits the embedding model and becomes a vector. This span tracks token count and latency. If your embedding API is slow or hitting rate limits, you’ll catch it here.
- Retrieval Span: The vector goes to your database for similarity search. In our experience, this is where most RAG problems hide: bad chunks, low relevance scores, wrong top-k values. The retrieval span exposes all of it.
- Context Span: The retrieved chunks get assembled with your system prompt. This span shows you exactly what’s being fed to the LLM, so if the context is too long, you’ll see it here.
- Generation Span: Finally, the LLM produces a response. This span is usually the longest and most expensive. Input tokens, output tokens, latency, reasoning traces (if any): everything is logged for cost tracking and debugging.
This should make it clear that without span-level tracing, debugging is almost impossible.
You would just know that the response was bad, but you would never know if it was due to bad retrieval, bad context, or the LLM’s hallucination.
Cost tracking is another big one. Span-level tracking lets you see where the money is actually going.
One more thing: AI systems degrade over time. What worked last month might not work today. Span-level metrics let you catch drift early and tune each component independently.
Lastly, to clarify, a Trace is the container that ties everything together for a single request. When a user submits a query, a unique Trace ID gets generated. Every span that happens as part of that request carries this same Trace ID.
So if your system processes 1000 queries, you have 1000 traces. Each trace contains multiple spans (embedding, retrieval, generation, etc.), but they’re all linked by that one Trace ID.
The “Trace” column shows one long continuous bar. It starts when the query comes in and ends when the response goes out. All the colored spans on the right are nested inside it, linked by the same Trace ID.
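To make this concrete, here’s a minimal sketch of emitting such a trace with the OpenTelemetry Python SDK (span names mirror the diagram; the pipeline functions are hypothetical stubs, and a TracerProvider/exporter is assumed to be configured elsewhere to actually ship the spans):

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

# Hypothetical pipeline stubs so the sketch runs end to end.
embed = lambda q: [0.1, 0.2]
search = lambda v, top_k: ["chunk-1", "chunk-2"]
build_prompt = lambda chunks, q: f"Context: {chunks}\nQ: {q}"
llm = lambda p: "answer"

def answer(query: str) -> str:
    # The root span is the Trace's single continuous bar: one per request.
    with tracer.start_as_current_span("query") as root:
        root.set_attribute("user.query", query)

        # Each pipeline step becomes a child span, linked by the same Trace ID.
        with tracer.start_as_current_span("embedding") as span:
            vector = embed(query)
            span.set_attribute("tokens", len(query.split()))

        with tracer.start_as_current_span("retrieval") as span:
            chunks = search(vector, top_k=5)
            span.set_attribute("retrieval.top_k", 5)

        with tracer.start_as_current_span("context"):
            prompt = build_prompt(chunks, query)

        with tracer.start_as_current_span("generation") as span:
            response = llm(prompt)
            span.set_attribute("tokens.input", len(prompt.split()))
        return response
```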
If you want to see how component-level observability + evals are implemented in practice, I have quoted one of my posts below that uses the DeepEval open-source framework.
Read it below.
____
Find me →
@_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
A Python decorator is all you need to trace LLM apps (open-source).
Most LLM evals treat the app like an end-to-end black box.
But LLM apps need component-level evals and tracing since the issue can be anywhere inside the box, like the retriever, tool call, or the LLM itself.
In @deepeval, you can do that with just 3 lines of code:
- Trace individual LLM components (tools, retrievers, generators) with the "@observe" decorator.
- Attach different metrics to each part.
- Get a visual breakdown of what’s working and what’s not.
Done!
You don't need to refactor any of your existing code.
See the example below for a RAG app.
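Roughly, the pattern looks like this (a sketch based on deepeval's tracing docs; verify the import path and decorator arguments, such as per-component metrics, against the version you install — the retriever/generator backends are hypothetical stubs):

```python
from deepeval.tracing import observe

def search_index(query, top_k):   # hypothetical retriever backend
    return ["chunk about " + query]

def call_llm(query, chunks):      # hypothetical generator backend
    return f"Answer using {len(chunks)} chunks."

@observe()  # each decorated component becomes its own span in the trace
def retriever(query: str) -> list[str]:
    return search_index(query, top_k=5)

@observe()
def generator(query: str, chunks: list[str]) -> str:
    return call_llm(query, chunks)

@observe()
def rag_app(query: str) -> str:
    return generator(query, retriever(query))
```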
DeepEval is 100% open-source with 8,500+ stars, and you can easily self-host it so your data stays where you want.
Find the repo in the replies!
You're in an ML Engineer interview at Apple.
The interviewer asks:
"Two models are 88% accurate.
- Model A is 89% confident.
- Model B is 99% confident.
Which one would you pick?"
You: "Any would work since both have same accuracy."
Interview over.
Here's what you missed:
Modern neural networks can be misleading.
They are overconfident in their predictions.
For instance, I saw an experiment that used the CIFAR-100 dataset to compare LeNet with ResNet.
LeNet produced:
- Accuracy = ~0.55
- Average confidence = ~0.54
ResNet produced:
- Accuracy = ~0.7
- Average confidence = ~0.9
Despite being more accurate, the ResNet model is overconfident: it reports ~90% confidence while being only ~70% accurate in reality.
Calibration solves this.
A model is calibrated if the predicted probabilities align with the actual outcomes.
For instance, say a model predicts an event with a 70% probability. Then, ideally, out of 100 such predictions, ~70 should result in the event.
Handling this is important because the model will be used in decision-making.
In fact, an overconfident model that is not equally accurate can be highly misleading.
For example, say a government hospital must decide which patients receive an expensive medical test.
To ensure the funding is used optimally, doctors need a reliable probability estimate from the model to make this decision.
If the model isn't calibrated, it will produce overly confident predictions.
Reliability Diagrams are a visual way to inspect how well the model is currently calibrated.
More specifically, this diagram plots the expected sample accuracy as a function of the corresponding confidence value (softmax) output by the model.
If the model is perfectly calibrated, then the diagram should look like the identity function.
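A quick way to build one in scikit-learn (synthetic data invented for the example; calibration_curve bins predictions and returns per-bin accuracy vs. confidence):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic example: a model that is overconfident by construction,
# since the true event rate is p**2 < p for predicted probability p.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 5000)
y_true = (rng.uniform(0, 1, 5000) < y_prob**2).astype(int)

# Per-bin fraction of positives vs. per-bin mean predicted probability;
# plotting one against the other gives the reliability diagram.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
# Perfect calibration: prob_true == prob_pred (the identity line).
# Here prob_true < prob_pred in every bin: overconfidence.
```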
That said, it is often also useful to compute a scalar value that measures the amount of miscalibration, called expected calibration error (ECE).
One way to approximate ECE is to partition predictions into equally spaced confidence bins and take a weighted average of the bins’ accuracy/confidence differences:
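In symbols, with $B_m$ the set of predictions whose confidence falls in bin $m$ (of $M$ bins) and $n$ the total number of samples:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \Bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \Bigr|$$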
These are some common techniques to calibrate ML models:
> For binary classification models:
- Histogram binning
- Isotonic regression
- Platt scaling
> For multiclass classification models:
- Binning methods
- Matrix and vector scaling
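For instance, Platt scaling and isotonic regression are both one-liners in scikit-learn via CalibratedClassifierCV (a minimal sketch; method="sigmoid" is Platt scaling, "isotonic" is isotonic regression):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrap an uncalibrated classifier; cross-validated Platt scaling fits a
# sigmoid mapping from decision scores to calibrated probabilities.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X_tr, y_tr)
calibrated_probs = clf.predict_proba(X_te)[:, 1]
```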
👉 If you care about probabilities and both models are operationally similar, which model would you prefer?
____
Find me →
@_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.