
Category: The Art of Vibe Coding

11 min read The Art of Vibe Coding

How to Run Firecrawl for Free in the Cloud (No API Key Needed)

Run the full Firecrawl stack on a free GitHub Codespaces 16 GB cloud machine — no API keys, 5-minute setup, and wired into Claude Code via a single tunnel command.

The Hardware Problem Nobody Warns You About

I have an M1 Pro MacBook Pro. Base model. 16 GB of RAM.

I figured that was plenty.

So I read a tutorial about Firecrawl — the open-source tool that turns messy web pages into clean, LLM-ready markdown — and ran docker compose up without a second thought.

Then I opened Activity Monitor.

14+ GB of RAM. 3 GB spilling into swap.

Memory pressure glowing yellow — the macOS equivalent of a check-engine light.

What I’d forgotten (ferpetesake) was that VS Code, Chrome, and a dev server were already running. Firecrawl’s Docker stack — five services simultaneously: the API server, a Playwright browser cluster, Redis, RabbitMQ, and PostgreSQL — landed on top of my normal development tools like a brick on a soufflé.

Here’s what most tutorials skip: they say “just install Docker” and assume you have unlimited RAM under your desk. Firecrawl can allocate up to 12 GB of RAM on its own. If your machine is already breathing hard from your regular workflow, adding Firecrawl is the thing that tips it over.

But — and stay with me here — GitHub Codespaces gives you a 16 GB RAM, 4-core cloud machine for free. The free tier includes 30 hours of runtime per month. For development, tutorials, and on-demand scraping sessions, that’s more than enough.

I’ve packaged the entire setup into a template repo you can fork: firecrawl-codespaces. Five minutes from zero to a working Firecrawl instance, connected to Claude Code on your local machine.

Let me show you exactly how.

.

.

.

What You’re Actually Getting (Honest Assessment First)

Before we touch a single command, let’s set expectations.

I’d rather you know the trade-offs now than feel surprised after investing 5 minutes.

| | GitHub Codespaces (Free) | Local Machine |
| --- | --- | --- |
| RAM | 16 GB (4-core machine) | Whatever you have |
| CPU | 4 cores | Whatever you have |
| Cost | 30 hrs/month free | Free (hardware cost) |
| Setup time | ~5 minutes | ~10 minutes |
| Always-on? | No — auto-stops after inactivity | Yes |
| API keys needed? | None | None |

Two honest limitations:

1. Not always-on. Codespaces auto-stops after 30 minutes of inactivity. There are workarounds (covered later), but if you need Firecrawl running 24/7 without any interaction — Codespaces is the wrong fit.

2. No anti-bot bypass. The self-hosted version of Firecrawl doesn’t include Fire-engine — the component that handles IP rotation and bot detection circumvention. For scraping documentation sites, GitHub repos, and public content (the 95% use case for Claude Code), you don’t need it. For scraping LinkedIn or heavily Cloudflare-protected sites, you do.

The verdict: Codespaces is perfect for development, learning, and on-demand scraping sessions. You spin it up when you need it, stop it when you don’t.

.

.

.

Prerequisites

Short list. Zero friction.

  • A GitHub account (free tier works; Pro gives 50% more hours)
  • The GitHub CLI (gh) installed on your local machine — install guide

That’s it.

No Docker Desktop. No Homebrew. No Node.

The entire Firecrawl stack runs inside the Codespace — the only thing your local machine needs is the gh CLI for the tunnel command.

.

.

.

The Setup — Step by Step

Step 1: Create the Codespace

Go to the firecrawl-codespaces repo on GitHub. Fork it (or use it directly).

Click the green <> Code button → select the Codespaces tab → click Create codespace on main.

Machine type matters. Select the 4-core (16 GB RAM) option. The 2-core machine only has 8 GB — Firecrawl will OOM (out of memory) on it.

GitHub "Create a new codespace" page showing the firecrawl-codespaces repo selected, main branch, "Firecrawl on Codespaces" dev container configuration, and 4-core machine type

The core-hour gotcha: Free hours are measured in core-hours, not wall-clock hours. A 4-core machine uses free hours twice as fast as a 2-core. The 120 core-hours/month free tier gives you 30 actual hours on a 4-core machine.

Step 2: Wait for the Automated Setup

The moment the Codespace provisions, it runs setup.sh automatically.

This is configured in the repo’s devcontainer.json via postStartCommand — meaning it runs every time the Codespace starts or resumes, not just on initial creation.

VS Code terminal inside the Codespace showing "Finishing up... Running postStartCommand... > bash setup.sh"

Here’s what setup.sh does behind the scenes:

  1. Clones Firecrawl from the official repo
  2. Creates a minimal .env — port 3663, no authentication, no API keys
  3. Copies a docker-compose.override.yaml that uses pre-built Docker images instead of compiling from source (cuts first-run startup from 5-15 minutes down to ~90 seconds)
  4. Starts the Docker stack with docker compose up -d
  5. Waits for the health check to confirm Firecrawl is responding

First run takes ~2-5 minutes (image pull). After that, resuming a stopped Codespace takes ~30 seconds.
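
For orientation, here's a rough sketch of the flow that script automates. Treat it as pseudocode: the repo URL, the exact .env variable names (beyond the port and USE_DB_AUTHENTICATION), and the file paths are assumptions, and the real setup.sh in the template repo is the source of truth.

#!/usr/bin/env bash
# Sketch of the setup flow, not the actual setup.sh from the template repo.
set -euo pipefail

# 1. Clone Firecrawl (repo URL is an assumption; skip if already cloned)
[ -d firecrawl ] || git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl

# 2. Minimal .env: port 3663, no authentication, no API keys
printf 'PORT=3663\nUSE_DB_AUTHENTICATION=false\n' > .env

# 3. Use pre-built images instead of compiling from source
cp ../docker-compose.override.yaml .

# 4. Start the stack detached
docker compose up -d

# 5. Wait for the health check
until curl -sf http://localhost:3663 > /dev/null; do sleep 5; done
echo "Firecrawl is responding on port 3663"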

Once the setup completes, the Ports tab shows Firecrawl API on port 3663 with a green indicator:

VS Code Ports tab showing "Firecrawl API (3663)" with a green status dot and forwarded address

Expected warning: You’ll see WARN — You're bypassing authentication in the Docker logs. Completely normal. USE_DB_AUTHENTICATION=false is the correct setting for self-hosted Firecrawl. Safe to ignore.

Step 3: Verify the Stack

Run docker ps inside the Codespace terminal. You should see all five containers running:

Terminal showing docker ps output with five containers running: firecrawl-api-1 (port 3663), rabbitmq, redis, postgres, and playwright-service — all showing "Up 4 minutes"

Five containers. All healthy. Firecrawl is running inside your Codespace.
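
If you'd rather confirm it from the terminal than from the Ports tab, two standard commands cover it:

# Inside the Codespace terminal
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'   # all five containers, with ports
curl -s http://localhost:3663                                    # Firecrawl's JSON greeting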

Now you need to get it to your local machine.

Step 4: Connect From Your Local Machine

I’ll spare you the detour I took.

I spent an embarrassing amount of time messing with public port URLs and GitHub token authentication before discovering that gh codespace ports forward does everything in one command. Learn from my shenanigans.

Switch to your local machine’s terminal (not the Codespace). Run:

gh codespace list
MacBook terminal showing gh codespace list with one Codespace available on the main branch

Copy the Codespace name from the output, then forward port 3663:

gh codespace ports forward 3663:3663 -c <your-codespace-name>
MacBook terminal showing gh codespace ports forward command with output "Forwarding ports: remote 3663 <=> local 3663"

One command.

Firecrawl is now at http://localhost:3663 on your machine — exactly as if it were running locally. No public exposure. No authentication tokens. And here’s the bonus: the tunnel keeps the Codespace alive as long as it’s running. More on that later.

Verify by opening http://localhost:3663 in your browser:

Browser showing localhost:3663 with JSON response: {"message":"Firecrawl API","documentation_url":"https://docs.firecrawl.dev"}

Firecrawl API. Running. Accessible. Free.

Other connection methods: The tunnel is the recommended approach. Two alternatives exist — a public port URL and a private port with GitHub token auth — but both reset on every Codespace restart. The tunnel is simplest and has the bonus keep-alive benefit. See the repo README for details on the alternatives.

.

.

.

Wire It Into Claude Code

Firecrawl is running.

Now let’s make Claude Code actually use it. This is the firecrawl Claude Code setup that turns your coding assistant into a web-aware research agent — and it’s three steps.

Install the Firecrawl CLI

On your local machine:

npm install -g firecrawl-cli

Install Firecrawl Skills

firecrawl setup skills --agent claude-code

This clones 8 markdown skill files from the official Firecrawl CLI repo and installs them into Claude Code’s skills directory. Each skill teaches Claude Code how to use a different Firecrawl capability: search, scrape, crawl, map, interact, download, and agent-powered extraction.

Firecrawl CLI skills installer showing ASCII art "SKILLS" header, repository cloned from github.com/firecrawl/cli.git, "Found 8 skills" with a selection list including firecrawl, firecrawl-agent, firecrawl-crawl, firecrawl-download, firecrawl-interact, firecrawl-map, firecrawl-scrape, and firecrawl-search

After installation, type /firecrawl in Claude Code. You should see all available Firecrawl slash commands:

Claude Code prompt showing /firecrawl typed with autocomplete dropdown listing all Firecrawl slash commands and their descriptions

Add Firecrawl Instructions to CLAUDE.md

This is the critical step.

Without this, Claude Code won’t know to prefer Firecrawl over its built-in (and more limited) web tools.

Add this block to your project’s CLAUDE.md:

## Firecrawl

- **Always use Firecrawl skills** (firecrawl, firecrawl-scrape, firecrawl-search, etc.) for web searches and scraping. Avoid the built-in WebFetch/WebSearch tools.
- We are using the localhost version of Firecrawl. Use `firecrawl` command to interact with the service.
- **Always prefix `firecrawl` CLI commands with `FIRECRAWL_API_URL=http://localhost:3663`** so the CLI targets the localhost service instead of prompting for cloud authentication. Example: `FIRECRAWL_API_URL=http://localhost:3663 firecrawl scrape "<url>" -o .firecrawl/page.md`.
- **NEVER run `firecrawl --status`** — it checks cloud API auth and always shows "Not authenticated" for localhost. Instead, check if Firecrawl is running with: `curl -s http://localhost:3663 > /dev/null 2>&1` (requires `dangerouslyDisableSandbox: true`).
- All Firecrawl-related commands (including server health checks) must run with `dangerouslyDisableSandbox: true`.
- **Sub-agents**: When spawning agents that may need web access, include these Firecrawl rules in the agent prompt so they use Firecrawl instead of built-in web tools.

Why each line matters:

  • The FIRECRAWL_API_URL prefix is essential. Without it, the Firecrawl CLI defaults to cloud authentication and prompts for an API key you don’t have. The environment variable tells it “talk to localhost instead.”
  • The --status trap — and I say this from personal experience — will burn you. I ran firecrawl --status and it said “Not authenticated.” I spent 20 minutes trying to generate an API key I didn’t need. My self-hosted instance was running perfectly the entire time. The command only checks cloud auth. It has no localhost awareness. Use the curl health check instead.
  • The dangerouslyDisableSandbox note is necessary because Claude Code’s sandbox blocks localhost network calls by default. Firecrawl commands need to reach port 3663.
  • The sub-agent rule prevents a common gotcha: you spawn a research sub-agent, and it uses built-in WebFetch instead of Firecrawl because it didn’t inherit the instructions.

.

.

.

See It In Action

Theory is nice. Let’s see it work.

I asked Claude Code to scrape a FluentCart REST API documentation page — the kind of task you’d do when building an integration and need to understand an endpoint’s parameters before writing any code.

Claude invoked the /firecrawl-scrape skill.

It first checked whether Firecrawl was running at localhost:3663, confirmed the health check passed, then ran the scrape with the FIRECRAWL_API_URL prefix:

Claude Code terminal showing the /firecrawl-scrape skill in action — Claude checks if Firecrawl is running at localhost:3663, then scrapes the FluentCart REST API docs page with the FIRECRAWL_API_URL prefix

The result?

Clean, structured markdown. Endpoint names, URL patterns, parameter tables with types and descriptions, and complete curl examples — all formatted and ready for Claude to work with:

Claude Code displaying scraped FluentCart API documentation showing "Bulk Insert Products" endpoint details, a parameter table with columns for Parameter, Type, Required, and Description, plus a formatted curl example with JSON body

Claude then saved the scraped content as a .md file in a .firecrawl/ folder for future reference:

VS Code showing the scraped content saved as fluentcart-products.md with clean markdown formatting — Products API documentation with headings, links, base URL, and structured endpoint listings

Compare that to what a raw HTTP fetch returns: the same page’s HTML would be 10x larger, stuffed with navigation menus, footers, tracking scripts, and CSS class names. Firecrawl strips all of that away and returns only the content that matters — clean markdown that fits neatly into Claude’s context window instead of bloating it.
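
You can also run this kind of scrape by hand, without Claude in the loop, using the exact CLI pattern from the CLAUDE.md block; the URL and output path below are placeholders:

FIRECRAWL_API_URL=http://localhost:3663 \
  firecrawl scrape "https://example.com/docs/endpoint" -o .firecrawl/endpoint.md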

.

.

.

The One Gotcha That Will Catch You: Idle Timeout

I learned this one the hard way.

I set up Firecrawl in a Codespace, walked away to make coffee, came back 40 minutes later — and everything was gone. The Codespace had stopped itself.

Here’s what happens:

  1. You start Firecrawl with docker compose up -d (detached mode)
  2. You close the Codespace browser tab
  3. Thirty minutes later, the Codespace auto-stops
  4. Firecrawl is gone

Why?

Codespaces measures inactivity as “lack of terminal input or output.” A detached Docker daemon running in the background produces no terminal output. From Codespaces’ perspective, nobody’s home.

Three fixes:

  • Fix 1: The tunnel keeps it alive (you’re already doing this). The gh codespace ports forward command counts as active interaction. As long as that tunnel is running on your local machine, the Codespace stays alive.
  • Fix 2: Stream logs. Inside the Codespace, run docker compose logs -f in the Firecrawl directory. Each log line resets the idle timer.
  • Fix 3: Extend the timeout. In GitHub Settings → Codespaces → Default idle timeout, set it to 240 minutes (the maximum).

And if it does stop? The postStartCommand in devcontainer.json auto-starts Firecrawl on every resume. Just re-run the tunnel command on your local machine and you’re back.
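
In practice, recovery is the same two commands from Step 4, run again on your local machine:

gh codespace list
gh codespace ports forward 3663:3663 -c <your-codespace-name>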

.

.

.

Free Tier Math and Alternatives

GitHub Free accounts get 120 core-hours/month.

| Machine | RAM | Free wall-clock hours |
| --- | --- | --- |
| 2-core | 8 GB | 60 hrs (not enough CPU & RAM for Firecrawl) |
| 4-core | 16 GB | 30 hrs |
| 8-core | 32 GB | 15 hrs |

What 30 hours gets you: roughly 4 full work days of active Firecrawl sessions. Enough for a serious project sprint, a full tutorial walkthrough, or hundreds of documentation page scrapes.

The storage caveat: Storage is billed even while the Codespace is stopped — $0.07/GB/month. With the Firecrawl repo and Docker layers, expect ~5-10 GB total. That’s ~$0.35-$0.70/month.

Pro tip: GitHub Pro ($4/month) bumps you to 180 core-hours — 45 hours on a 4-core machine. And add a $5 spending cap in GitHub Settings → Billing → Budgets to prevent surprise charges if you forget to stop the Codespace.

Where Codespaces Fits Among Your Options

| Option | Setup | Cost | Always-on | Anti-bot |
| --- | --- | --- | --- | --- |
| Local machine | ~10 min | Free | Yes | No |
| Codespaces (this repo) | ~5 min | Free (30 hrs/mo) | No | No |
| Railway | ~2 min | $5+/mo | Yes | No |
| Firecrawl Cloud | 0 min | $16+/mo | Yes | Yes |

Local is best if your machine can handle it — no time limits, always-on. Codespaces is best for tutorials, learning, and on-demand sessions (what you just set up). Railway has an official Firecrawl deploy template — one click and $5/month for always-on hosting. Firecrawl Cloud is pay-per-use with the anti-bot bypass engine included.

.

.

.

The Bigger Picture

That yellow memory pressure warning on my MacBook Pro? Gone from the equation entirely.

The complete firecrawl Claude Code setup now runs on a 16 GB cloud machine that costs nothing — while my laptop handles what laptops should handle: VS Code, Chrome, and the dev server. No RAM fights. No swap memory. No jet engine fans.

But the real takeaway goes beyond Firecrawl.

The pattern is this: instead of installing powerful tools on every machine you own, run them once in a cloud environment and tunnel to them from wherever you are. GitHub built the tunneling right into the CLI. The free tier covers 30 hours a month. And the template repo makes setup a 5-minute operation.

Your AI coding assistant now has live web access. Documentation pages, API references, technical articles — anything Claude Code needs to read before writing code, Firecrawl can fetch.

Ready to set it up?

  1. Fork the repo: firecrawl-codespaces
  2. Create a 4-core Codespace
  3. Wait for the automated setup (~3 minutes)
  4. Run gh codespace ports forward 3663:3663 on your local machine
  5. Install the CLI and skills: npm install -g firecrawl-cli && firecrawl setup skills --agent claude-code
  6. Add the Firecrawl block to your CLAUDE.md

Six steps. Five minutes. Zero API keys.

Go build something with it.

12 min read The Art of Vibe Coding

The “Real” Context Engineering with Claude Code, Explained

I’ve written 40+ posts about Claude Code.

Sub-agents. CLAUDE.md files. Skills. Workflow engineering. Testing loops. Spec-driven development. Memory. Self-evolving rules.

I was outlining a post last week — when I stopped mid-sentence and stared at my screen. I had the outline open on one side, my published posts list on the other. And for the first time, I saw it.

Every single post was about the same thing.

Not “AI coding tips.” Not “Claude Code tricks.” Something deeper — a discipline I’d been teaching without realizing I was teaching it. I’d been circling the same idea for almost a year, approaching it from forty-four different angles, and I just didn’t have a name for it.

(That’s the annoying thing about patterns. They’re invisible until suddenly they’re not.)

.

.

.

The Name Drop

The name is context engineering.

Tobi Lütke (Shopify CEO) tweeted in June 2025 that he preferred “context engineering” over “prompt engineering.” Karpathy co-signed it. Anthropic published an official guide. The term stuck.

But here’s what nobody’s saying: if you’ve been following this newsletter, you’ve been a context engineer. You just didn’t know it yet.

Let me show you what I mean.

.

.

.

What Happens Without Context Engineering

Let me tell you about an afternoon that changed how I think about context windows.

I was adding a chat interface to a Next.js app using Vercel’s AI Elements library. Simple task — wire up useChat with <Conversation> and <Prompt>. Maybe thirty minutes of work.

So I did what felt responsible: I dumped the entire AI Elements documentation into Claude’s context. Every hook, every provider, every component. Thorough. Comprehensive. Professional.

And then Claude started… hedging.

Vague suggestions instead of concrete code. Recommendations that contradicted themselves across responses. Instructions I’d given three messages ago — forgotten entirely. I watched Claude’s quality degrade in real time, like a student cramming so hard for an exam they forgot how to spell their own name.

That’s context rot — when irrelevant information degrades the AI’s ability to focus on what matters.

I closed the session. Started fresh. This time I gave Claude only the docs for the two components I actually needed. It nailed the implementation on the first try.

Less context. Better results.

(I know. Counterintuitive.)

And here’s the part that really bakes your noodle: bigger context windows don’t make AI smarter. Past about 50% fill, performance actually degrades.

A senior engineer working on an 80k-line codebase posted on Reddit calling the 1M context window “a noob trap”.

They aggressively keep their context under 250k tokens. And before you even type a word, 45,000 tokens are already loaded (system prompt, tool schemas, agent descriptions, memory files, MCP schemas). On the standard 200k window, that's more than 20% gone at session start.

Context engineering is how you fight this.

.

.

.

What Context Engineering Actually Is

Here’s the definition I’ve landed on after (almost) a year of teaching these techniques:

Context engineering is the discipline of designing what information reaches your AI — the right knowledge, the right constraints, the right tools, at the right time — so it can actually do what you need.

The key distinction:

Your CLAUDE.md file isn’t a prompt. Your sub-agents aren’t just parallelism. Your skills aren’t just shortcuts. They’re all components of a context system that assembles the right information before the model ever sees a token.

Prompt engineering is choosing the right words. Context engineering is building the right world around the AI so it barely needs prompting at all.

.

.

.

The Context Engineering Stack

Here’s the framework I wish I had when I started.

Every context engineering Claude Code technique I’ve taught maps to one of six layers — each solving a different problem, each building on the one below it.

Let me walk you through each layer — bottom up.

.

.

.

Layer 1: Static Context (CLAUDE.md)

The “hello world” of context engineering.

A CLAUDE.md file loads automatically into every Claude Code session. It pre-loads project knowledge — your stack, conventions, patterns, gotchas — so every conversation starts with the essentials instead of from zero.

Without it, every session is amnesia.

Claude doesn’t know your project uses Tailwind, your team prefers functional components, or that your API has a weird auth flow. You spend the first five minutes of every conversation re-explaining things you explained yesterday.

(Sound familiar? Yeah.)

But — and stay with me here — there’s a paradox.

CLAUDE.md is incredible because it’s always loaded into context. And terrible for the exact same reason. Always-on context isn’t dynamic. Once your CLAUDE.md passes a few hundred lines, Claude starts ignoring nuances. The very file that’s supposed to help starts contributing to context rot.

The fix: keep CLAUDE.md lean — around 100 lines of essential universals. Load additional context dynamically with skills or custom commands. Prime, don’t hoard.

Deep dives: CLAUDE.md Guide → The Single File

.

.

.

Layer 2: Behavioral Context (Rules & Constraints)

Here’s a scenario that’ll make you wince.

Claude can’t get an API working, so it silently inserts a try/catch that returns sample data. Everything looks correct. All your tests pass. The UI renders beautifully. You demo it to your client on Thursday.

Three days later, you discover nothing was ever real.

(I’ll let that sink in for a moment.)

That’s what happens without behavioral context — instructions that shape HOW the AI behaves, not just what it knows. Knowledge without constraints is a liability.

The fix is a rule in your CLAUDE.md: “Never silently replace real functionality with mocked data. If something fails, fail loud.”

One sentence.

Prevents an entire category of mistakes.

Context engineering goes beyond feeding information in. It constrains behavior through instructions. Think of CLAUDE.md as a behavior contract:

  • “Always write tests before implementation” (TDD constraint)
  • “Never modify files outside /src without asking” (scope constraint)
  • “Use TypeScript strict mode” (quality constraint)

Every rule you add is a piece of behavioral context. And unlike knowledge — which can get stale — good behavioral rules compound. They prevent the same mistake from happening across every future session.

Deep dives: Project Rules → Self-Evolving Rules

.

.

.

Layer 3: Context Persistence (Memory & Evolution)

Every Claude Code session starts with amnesia. The AI doesn’t remember what it learned yesterday — that brilliant debugging approach it discovered at 2 AM, the edge case it finally cracked after four attempts, the architectural decision you both agreed on.

Gone. Every time.

Your CLAUDE.md handles project-level knowledge, but what about session-to-session learnings? That’s what this layer solves:

  • Memory skills that log discoveries, decisions, and patterns
  • Self-evolving rules that update themselves based on what the AI encounters
  • Compaction that snapshots state when a context window fills up

The progression looks like this:

When Claude Code’s context window fills up, it automatically summarizes the conversation — preserving architectural decisions and unresolved bugs while discarding redundant output. That’s automated context engineering built into the tool itself.

But the real power — the thing that still kind of amazes me — is when your rules evolve on their own. A memory skill logs what the AI discovers. Self-evolving rules incorporate those learnings. The next session starts smarter than the last.

Your context system learns while you sleep.

Deep dives: Memory Skill → Self-Evolving Rules

.

.

.

Layer 4: Context Modules (Skills)

If CLAUDE.md is your operating system’s default settings, skills are apps you install for specific tasks.

A skill is a packaged, reusable context bundle.

When you invoke one, you inject a curated set of instructions, examples, and constraints into the model’s context.

When you’re done, you unload it. Clean.

This matters because the alternative is cramming everything into CLAUDE.md — bloating your static context with domain knowledge that’s only relevant 10% of the time. Skills let you modularize. Load the right context for the right task. Unload it when done.

(Think of it like this: you wouldn’t keep every cookbook you own open on your kitchen counter while making scrambled eggs. You’d grab the one recipe you need.)

Even the creator of Claude Code, Boris Cherny, warns: “Too many skills and agents inflate context massively — be selective per project.”

Skills enable both sides of the equation: they reduce what goes into your default context, and they inject domain expertise exactly when you need it.

Context engineering in miniature.

Deep dives: Skills Part 1 → Part 2 → Part 3

.

.

.

Layer 5: Context Delegation (Sub-Agents)

This is where context engineering gets spatial.

Instead of cramming everything into one context window, you split work across focused agents — each with its own tailored context. Each agent sees only what it needs. Nothing more.

Here’s the difference:

A focused agent with limited, relevant context outperforms a bloated one with everything.

Every time.

Read-only sub-agents are especially powerful — context scouts that gather information and report back without polluting the main agent’s context window.

The progression: sub-agents (partially forked context) → background agents (fully independent) → agent experts (single-purpose specialists with one tool, one job, one context window).

Deep dives: Sub-Agents → Read-Only Sub-Agent

.

.

.

Layer 6: Context Orchestration (Workflow Engineering)

This is the top of the stack — and it’s where everything comes together.

Context orchestration is designing how context flows through multi-step processes. Not “what context does the AI need?” but “what context does it need at each step, and how does each step’s output become the next step’s input?”

Every workflow step is a context handoff.

Research produces context for spec-writing. Specs produce context for implementation. Tests produce context for debugging. Each step refines raw information into the precise context the next step needs.

This is why process matters more than prompts.

A well-designed workflow ensures the right context reaches the right agent at the right time — automatically. You’re not just prompting anymore. You’re building a context pipeline.

Deep dives: Workflow Engineering → In Action

.

.

.

The Bonus Layer: Runtime Context

Here’s one most people miss entirely.

Claude builds a perfect admin panel. All unit tests pass. You feel great about it. Ship it.

But when you open two browser tabs, log out in one, and try to delete a user in the other — it works. The session is still active in Tab 2. You just let an unauthenticated user delete accounts.

(That’s… not ideal.)

Why did this happen?

Without browser testing, Claude’s context is just the code it wrote and the passing unit tests, and everything in view says the feature works.

With browser testing, Claude’s context expands to include what the product actually does at runtime.

Context engineering goes beyond text files and prompts.

Screenshots, console output, browser state — these are all forms of context that close the gap between “the code works” and “the product works.”

Most agent failures aren’t model failures.

They’re context failures.

The admin bug above wasn’t a coding mistake — the AI simply didn’t have the runtime context to know about cross-tab state.

Give it that context, and it catches the bug immediately.

Deep dives: Debugging Visibility → The Ralph Loop

.

.

.

The Decision Framework

When you hit a problem, which context engineering lever do you pull?

When I first started with Claude Code — way back in the early days — I treated it like a magic box.

Dump everything in, get magic out. Ask more detailed questions, get better answers.

It took me an embarrassingly long time to realize that’s backward.

The AI is more like a brilliant intern on their first day. They’ve read every textbook. They can code circles around most juniors. But they know absolutely nothing about your project, your codebase, your conventions — and they forget everything after each conversation.

Context engineering is deciding which sticky notes to put on their desk each morning.

Too few, and they’re lost. Too many, and they’re overwhelmed. Just right, and they look like a genius.

(The intern metaphor isn’t perfect — no metaphor is — but it’s the closest thing I’ve found to describing why some people get incredible results from AI coding tools while others keep complaining “it doesn’t work.”)

.

.

.

You’ve Been Doing This All Along

If you’ve been following this newsletter, you ARE a context engineer.

  • When you wrote your first CLAUDE.md — you were engineering static context.
  • When you added “never mock data silently” — you were engineering behavioral context.
  • When you set up memory skills — you were engineering persistent context.
  • When you created your first skill — you were engineering modular context.
  • When you delegated to sub-agents — you were engineering context isolation.
  • When you designed a research → spec → build workflow — you were engineering context pipelines.

Context engineering isn’t a new skill you need to learn.

It’s a name for the discipline you’ve been developing, one technique at a time, for almost a year.

Just like DevOps unified existing practices — CI/CD, infrastructure-as-code, monitoring — under one discipline, context engineering unifies everything we’ve been doing with AI coding tools. People were already doing it.

The name just made it official.

.

.

.

What Changes Now

Now that you have the framework, you can be deliberate about it.

Instead of reaching for techniques randomly, you diagnose which layer needs attention. Instead of asking “how do I write a better prompt?” you ask a better question:

BEFORE:  "How do I write a better prompt?"

AFTER:   "What does this agent need in its context to succeed?"

That’s the mindset shift. That’s context engineering Claude Code in one sentence.

Pick one layer of the stack you haven’t explored yet and go deeper.

You’re not a prompt engineer. You’re a context engineer.

Start acting like one.

9 min read The Art of Vibe Coding

Talk Like a Caveman, Save > 75% on Claude Code Usage (I Tested It)

A Reddit post hit r/ClaudeAI on April 3rd and absolutely exploded.

The title: “Taught Claude to talk like a caveman to use 75% less tokens.”

10,000 upvotes. Hundreds of comments. Half the thread was laughing. The other half was already adding it to their projects.

Here’s what claude code caveman mode looks like in practice:

Normal Claude: “Added validation with: blur validates when focus leaves, input re-validates as user types, submit validates all fields. Each field uses the existing .error / .valid CSS hooks already in the file, so no style changes were needed.”

Caveman Claude: “Done.”

Same task. Same code quality. Wildly different token bills.

And here’s the thing — I’d been scrolling past Claude’s explanations for weeks without realizing it. Helpful bullets explaining what the code does. Notes about CSS hooks. Context I already understood. The code was always fine. Everything around it was for an audience of nobody.

Credit where it’s due: Reddit user flatty kicked this off, and Drona Gangarapu (3.3k stars on GitHub) took the concept and productized it into a polished, drop-in CLAUDE.md file with actual benchmarks.

I wanted to test it myself. On a real coding task. With real results.

The original Reddit post by flatty about caveman mode with 10K upvotes on r/ClaudeAI

.

.

.

What Is Caveman Mode?

Caveman mode is a prompt instruction that tells Claude to strip all filler from its output.

No articles (“the”, “a”). No pleasantries (“Great question!”). No restating your problem back to you. No unsolicited explanations. No “Let me know if you’d like me to adjust anything!” sign-offs.

Just the answer.

Here’s the instruction I used:

Respond like a caveman. No articles, no filler words, no pleasantries.
Short. Direct. Grunt-level brevity. Code speaks for itself.
If me ask for code, give code. No explain unless me ask.

Why does this work?

Claude’s output tokens are the expensive part of any API call — output tokens cost roughly 4x what input tokens cost on most models. Every “Let me walk you through this…” and “That’s a great approach!” is burning tokens on words that carry zero information for the developer reading them.

I learned this the hard way. I hit my usage limit on a Tuesday afternoon — right in the middle of a productive streak. When I looked at what had actually consumed those tokens, a depressing amount was Claude being polite. Greetings I never read. Summaries of things I’d just asked. Sign-offs I scrolled past. The code itself was maybe 40% of the output.

You already know what you asked. You don’t need Claude to repeat it back to you. You don’t need a greeting. You don’t need encouragement.

You need the code.

Caveman mode eliminates the social performance.

.

.

.

The Test: Normal vs. Caveman on a Real Coding Task

Time to put caveman mode through a real task.

I have a styled contact form — four fields (name, email, subject, message), a submit button, and clean UI. No JavaScript yet. The form looks great but does absolutely nothing when you hit “Send Message.”

The styled contact form before any validation — four fields, a submit button, and nothing else

Here’s the exact prompt I gave Claude, identical in both runs:

Add JavaScript input validation to this contact form. Validate name
(required, 2+ chars), email (required, valid format), subject (required),
and message (required, 10+ chars). Show inline error messages. Validate
on blur and on submit.

Normal Mode Response

Claude Code terminal showing normal mode response — code plus bulleted explanation of validation behavior and CSS hooks note

Both modes wrote the code. Both modes got the validation right. But look at what comes after the code in normal mode — a bulleted explanation of the validation behavior, a note about CSS hooks, context about what each event does. Helpful? Sure. But all of it wraps the exact same validation code, and that code speaks for itself.

Caveman Mode Response

Claude Code terminal showing caveman mode response — just code and a single word: Done.

Same prompt. Same form. Same working validation code.

One word: “Done.”

The Numbers

Here’s where it gets real.

The non-code explanation in normal mode? 377 characters. The caveman equivalent? 5 characters. That’s “Done.” — period included.

377 to 5. A 99% reduction in the explanation wrapper.

Now multiply that across a full coding session. If you send Claude 30 prompts in an afternoon — and you probably do — that’s 30 explanations you didn’t ask for, 30 sign-offs you never read, 30 “here’s what this does” summaries for code you wrote the prompt for. Those tokens add up fast.

Drona Gangarapu’s benchmarks across five different prompts showed a consistent ~63% total word reduction when you factor in both code and explanation. But the explanation wrapper — the part that caveman mode actually targets — is where nearly all the savings come from.

The code is virtually identical in both cases. Same validateField function. Same event listeners. Same submit handler. Caveman mode cuts the wrapper, not the work product.

.

.

.

Me Tell You Where Put Caveman Words

You’ve seen the results. Now — where should you actually put the caveman instruction?

Three options, ranked from best to worst.

1. In Your Project’s CLAUDE.md (Best)

This is the best place for your claude code caveman mode instruction. It persists across sessions, applies to every prompt in that project, and you set it once and forget it.

Add this to the top of your CLAUDE.md:

## Communication Style

Respond like a caveman. No articles, no filler words, no pleasantries.
Short. Direct. Code speaks for itself.
If asked for code, give code. No explain unless asked.
No sycophancy. No restating the question. No sign-offs.

Why the top? Claude processes CLAUDE.md instructions in order, and primacy effects matter. Communication style should be established before anything else.

The project-level approach also gives you control — caveman mode on your personal projects, normal mode on client work. Different projects, different communication styles. One file each.

If you’re not using CLAUDE.md yet, start with The Single File That Makes or Breaks Your Claude Code Workflow. It covers why this file matters and how to structure it.

2. In Your Prompt Directly (Good for Testing)

Prepend the instruction to any prompt:

[caveman mode: no filler, no pleasantries, code only] Add JavaScript
input validation to this contact form...

This is how most people start — and it works fine for a test drive.

The downside: you’re typing it every time, and it adds input tokens to every single message. If you like the results, move it to CLAUDE.md and stop paying the repeated input cost.

3. In ~/.claude/CLAUDE.md (Global — Use Carefully)

This applies caveman mode to every project on your machine. Only do this if you want terse output everywhere. Most people should keep it project-level.

Bonus: As a Claude Code Skill

If you want cleaner separation of concerns — keeping your CLAUDE.md focused on project instructions while communication style lives in its own toggleable unit — Thomas Schlossmacher’s caveman-mode skill packages the whole thing as a drop-in .claude/skills/ file.

Worth a look if you manage multiple communication styles across projects.

Best Practices

  • Put it at the top of CLAUDE.md. Primacy effect means early instructions carry more weight.
  • Combine with other token-saving rules. “Don’t restate my question” and “Skip the sign-off” stack well with caveman mode.
  • Be specific about what to keep. If you still want code comments, say so: “Keep code comments. Skip everything else.”
  • Test first. Run a few prompts before committing it to CLAUDE.md permanently. (You’ll know within two prompts whether you love it or hate it.)

.

.

.

When Caveman Mode Bad. When Skip.

Caveman mode has real tradeoffs. And being honest about them is what makes the technique actually useful — instead of just another internet hack you try once and forget.

Skip it when you’re learning something new.

If you’re asking Claude to explain async/await, database indexes, or CSS grid — you want the verbose explanation. Those filler words become teaching words when you’re building a mental model. Caveman mode strips the pedagogy, and that’s a real loss when pedagogy is the whole point.

Skip it for complex architecture discussions.

“Use microservices” is a caveman answer. But what you actually need is: “Here’s why microservices fit your use case, here are the tradeoffs, and here’s what will break if your team is under five people.” When you need Claude to reason through options with you, let it reason.

Skip it when sharing outputs with collaborators.

If teammates read your Claude Code outputs or review AI-generated code, they need the context that caveman mode strips. Readability matters when the audience hasn’t seen the original prompt. (I learned this one the slightly awkward way.)

Skip it for debugging unfamiliar errors.

When you’re stuck on a cryptic error and need Claude to walk you through what’s happening, the detailed explanation is the value. “Fix: change line 42” doesn’t help if you don’t understand why line 42 was wrong in the first place.

The rule of thumb: use caveman mode when you know what you want and just need Claude to produce it. Skip it when you need Claude to think with you.

And here’s the good part — you can switch freely.

Caveman mode in your CLAUDE.md for daily coding. Remove it (or override it in the prompt) when you need the full Claude experience. Per-project. Per-session. Per-prompt. No commitment necessary.

.

.

.

Me Save Tokens. You Save Tokens. Community Win.

Caveman mode is funny. A developer on Reddit taught an AI to grunt, and thousands of people immediately started saving money.

That’s the internet at its best.

But zoom out and there’s something real underneath the joke. As AI coding tools move to per-token billing, being intentional about output verbosity becomes a genuine skill. And not just for your wallet — less fluff means faster responses, less scrolling, and more signal per screen.

The community drove this one. flatty on Reddit who made everyone laugh while solving a real problem. Drona Gangarapu who turned the concept into a benchmarked, production-ready CLAUDE.md file. The thousands of developers riffing on their own variations. Good ideas have a way of finding their people.

If you want to go further on the token-saving front — front-loading your prompts, optimizing your CLAUDE.md structure, getting more done within your usage limits — check out How to Double Your Claude Code Usage Limits. Caveman mode is one lever. There are more.

Me done. You go try. Report back.

11 min read The Art of Vibe Coding

Codex Reviews My Code Inside Claude Code — But I Don’t Trust It Blindly

I’ve been building something I can’t fully show you yet.

It’s a Chrome extension called PinFlow. The idea: you browse a page, click on any element, attach an instruction to it, and those instructions get routed straight into a local Claude Code session. No tab switching. No copy-pasting selectors. You pick, you describe, Claude edits your code.

The original PinFlow extension sidebar open on a Google page, showing the element picker with a "Pick an element first" prompt

I’ll cover how PinFlow works in a dedicated post in the future. (Subscribe if you don’t want to miss that one.)

But today’s story starts after I finished a major UI redesign of that extension.

The code had gotten complex. Multi-step wizard flows, state management across views, permission handling, concurrent request logic. The kind of complexity where you know bugs are hiding somewhere — you just can’t see them yet.

I needed a second pair of eyes.

Normally, that meant switching over to Codex in a separate terminal, running a review there, then hauling the results back to Claude Code. I’ve done this workflow dozens of times — I even wrote about it back in October 2025.

This time, I didn’t have to switch at all.

There’s a plugin for that now.

.

.

.

What Is the Claude Code Codex Plugin?

On March 30, 2026, OpenAI shipped an official Claude Code Codex plugin (openai/codex-plugin-cc). It lets you run Codex code reviews, adversarial reviews, and delegate tasks to Codex — all from inside your Claude Code session.

A few things worth knowing:

  • Free to use with any ChatGPT subscription, including the Free tier
  • Uses your local Codex CLI — same auth, same config, same models
  • Runs as a Claude Code plugin — the new plugin system, so it lives inside your session
  • 2,500+ GitHub stars in one day — the community noticed fast

If you’ve been following along, you’ll recognize the workflow this replaces.

Back in September 2025, I wrote about using Claude Code and Codex as separate tools in separate terminal windows. In October, I refined that into a structured handoff: Codex plans → Claude builds → Codex reviews.

The plugin collapses all of that into slash commands. No window switching. No copy-pasting context between tools. The review happens right where the code was written.

Install and Setup

Four commands. Under 2 minutes. That’s it.

/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup

If you don’t have Codex installed yet, /codex:setup handles that for you. If Codex isn’t logged in, run !codex login from within Claude Code — the ! prefix executes shell commands in your session.

After installation, you’ll see the new slash commands and the codex:codex-rescue subagent ready to go.

What Commands Do You Get?

The plugin ships with 7 commands:

| Command | What it does | Read-only? |
| --- | --- | --- |
| /codex:review | Standard code review of uncommitted changes or branch diff | Yes |
| /codex:adversarial-review | Steerable challenge review — questions design, tradeoffs, assumptions | Yes |
| /codex:rescue | Delegate a task to Codex (bug investigation, fixes, cheaper model pass) | No |
| /codex:status | Check progress on background Codex jobs | — |
| /codex:result | Show final output of a finished job | — |
| /codex:cancel | Cancel an active background job | — |
| /codex:setup | Check/install Codex, manage review gate | — |

The two I reach for most:

/codex:review — the bread and butter. Point it at your current changes and get a review. Supports --base main for branch diffs and --background for long-running reviews. Or --wait if you want to stay in the session until the review finishes.

/codex:adversarial-review — the pressure test. Unlike the standard review, you can steer it: “look for race conditions,” “challenge whether this caching approach is right.” I pull this one out before shipping anything risky.

There’s also /codex:rescue, which is the only command that can change code. It hands a task to Codex and supports different models (--model gpt-5.4-mini for quick passes). Think of it as delegating grunt work to a cheaper model while you stay focused.
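
For quick reference, here are the variants covered above in one place. The flags come straight from the plugin's command list; how the steering and task text gets passed (inline versus in a follow-up message) may differ from this sketch, so check the repo README:

/codex:review --wait
/codex:review --base main --background
/codex:adversarial-review "look for race conditions in the submit flow"
/codex:rescue --model gpt-5.4-mini "chase down the flaky build and propose a fix"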

.

.

.

The Demo: Reviewing a Real Redesign

Here’s where it gets concrete.

I was redesigning PinFlow’s sidebar UI — moving from a single-element component to a full multi-step wizard with Pick → Write → Review steps, multi-pin support, and shared/per-pin instruction modes. A big change.

I gave Claude Code the task with my redesign notes:

Claude Code prompt: "I want to redesign the sidebar UI for the chrome extension based on the redesign notes here: @notes/new_ui_redesign.md"

Claude explored the codebase, reviewed the wireframes, and came back with clarifying questions — architecture decisions, multi-pin picking strategy, scope for the reference mode, how far to go with the running and done states.

Claude exploring the codebase and asking its first architecture question about keeping vanilla HTML strings vs. introducing a lightweight component library
Round 1 of clarifying questions covering architecture, multi-pin support, reference mode, and implementation scope
Round 2 of clarifying questions on send mode and reference UI placement

Two rounds of questions later, Claude had enough context to plan.

It created a task list — three-step wizard, multi-pin picking, write step with shared/per-pin modes, review step, running and done states, activity view, submit flow, and prompt builder.

Claude creating implementation tasks: update types, rewrite render shell, implement Pick step, Write step, Review step, Running/Done states, Activity view, submit flow, and wire up event handlers

8 minutes and 45 seconds later, the redesign was complete.

Implementation summary showing the new architecture: 3-step wizard with Pick/Write/Review steps, 4 views, multi-pin support, shared and per-pin instruction modes, running and done states, activity view, and updated submit flow

A full sidebar redesign. New wizard flow. New state management. New views. All from a single Claude Code session.

But here’s the thing — when that much code changes at once, edge cases don’t announce themselves. They hide in the seams between states, waiting for a user to stumble into them.

I could feel it. Time for Codex.

Triggering the Review

/codex:review --wait
Triggering /codex:review --wait in Claude Code

The --wait flag keeps the session active until the review finishes. Behind the scenes, the plugin spins up a Codex review thread against your uncommitted changes.

Codex starting the review thread, showing the bash command running the codex-companion script, thread ID, and "Reviewer started: current changes" with a 1-minute timeout

6 minutes 35 seconds later, the results came back.

The Review Results

Codex Review Results showing 4 issues found — all edge-case correctness issues. A table lists: P1 Submit silently no-ops when no project configured, P2 Step bar allows jumping to Review without instructions, P2 Back from Activity strands running requests, P2 Concurrent requests can overwrite lastResult and currentView. Ends with "Want me to fix these issues?"

4 issues found. All related to edge-case correctness rather than the core redesign:

| Priority | Issue |
| --- | --- |
| P1 | Submit silently no-ops when no project is configured (no user feedback) |
| P2 | Step bar allows jumping to Review without writing instructions |
| P2 | “Back” from Activity always goes to wizard, even if a request is running |
| P2 | Concurrent requests can overwrite each other’s lastResult and currentView |

Every single one of these is the kind of bug that slips through during a big redesign. You’re focused on the main flow — the happy path — and the edge cases hide in the seams between states.

At the bottom of the review: “Want me to fix these issues?”

I could have said yes. Let Claude apply all 4 fixes and move on with my day.

I’ve been on the other side of that decision. Said “yes, fix everything” on a review once, walked away, came back to a diff full of renamed variables and reshuffled imports that had nothing to do with the actual bugs. Took longer to untangle than the original review would have.

So no. I didn’t say yes.

.

.

.

The Validation Prompt: Where the Real Value Lives

Here’s what I do instead — and honestly, this is the part I want you to steal.

After receiving Codex’s review comments, I paste this prompt:

let's address the code review comments provided.

Follow the steps below to effectively address the code review comments:

1/ First, you should analyze the code review comments carefully and understand the feedback given.
2/ Then, determine if the comments given are valid and we should make changes to the code based on the feedback.
3/ If the comments are valid, you should make the necessary changes to the code to address the comments. If you believe the comments are not valid, you should provide a clear explanation to justify why you think the comments are not valid.

Use the AskUserQuestion tool to ask me clarifying questions until you are 95% confident you can complete this task successfully. For each question, add your recommendation (with reason why) below the options. This would help me in making a better decision.
The validation prompt pasted into Claude Code after receiving Codex's review comments

What happens next is the key insight.

Claude reads the Codex review. It analyzes each comment against the actual codebase — the code it just wrote, with full context of why things are structured that way. And instead of blindly applying everything, it comes back with a verdict and clarifying questions.

Claude's response: "All 4 comments are valid. Most fixes are straightforward, but two have design decisions worth confirming." Shows AskUserQuestion with two questions — P1: how to handle the no-project case (recommends Disable Send + inline hint) and P2: how to handle concurrent requests (recommends Prevent new submissions)

Look at what Claude did here:

“All 4 comments are valid. Most fixes are straightforward, but two have design decisions worth confirming.”

For the straightforward fixes, Claude proceeds. For the ones with judgment calls, it asks — with a recommendation and reasoning for each option:

  • P1 — No project configured: When no project is set, the Send button silently does nothing. How should we handle this? Claude recommends: Disable Send + inline hint.
  • P2 — Concurrent requests: A second request can start while one is already running. Should we prevent it or handle the overlap? Claude recommends: Prevent new submissions.

Each question comes with Claude’s recommendation and the reasoning behind it. I pick the recommended options for both.

This is the part that matters.

Claude becomes a filter between the review and your code. It validates each comment in context, surfaces the ones that need your judgment, and handles the rest. You stay in control without having to re-read every line yourself.

Watching the Fixes Go In

With the design decisions confirmed, Claude applies all 4 fixes.

Fix 1 (P1): Disable the Send button on the Review step when no project is configured. Show an inline hint: “No project configured — open Settings.”

Fix 2 (P2): Block step-bar navigation to Review when no instructions have been written. Add a hasInstruction() guard to the click handler.

Claude applying Fix 1 — disabling Send + showing hint for no project — and Fix 2 — adding hasInstruction guard to block premature Review navigation. Shows actual code diffs in src/content/index.ts

Fix 3 (P2): Back from Activity now checks if a request is currently running. If it is, the view returns to running instead of the wizard — preventing the user from stranding an in-flight request.

Fix 2 continued — updating the step-bar click handler condition — and Fix 3: updating the Back from Activity handler to check records.some for running state and return to the running view instead of wizard

Fix 4 (P2): Scope lastResult and currentView updates so they only fire when the completing request matches the current one. Prevents a finishing background request from overwriting the state of a newer request.

Fix 4: Scoping lastResult/currentView updates behind a currentRequestId check in both success and error paths, preventing stale responses from overwriting current state

Then Claude verifies the build still compiles. All clean.

Build verification passes. Summary table showing all 4 fixes: P1 Send button disabled with inline hint, P2 hasInstruction guard added, P2 Back from Activity returns to running view, P2 Send disabled during active request with scoped state updates. "Brewed for 1m 54s"

4 issues identified. 4 fixes applied. 2 design decisions confirmed. 1 minute 54 seconds.

And here’s the part worth sitting with: if any of those review comments had been a false positive — a stylistic preference that didn’t match the codebase, or a “problem” that was actually intentional — Claude would have flagged it. It would have said “this comment suggests X, but the current approach is correct because Y” and asked whether to skip it.

That filtering step is the difference between a code review you can act on and a code review that introduces churn.

The Before and After

Remember the original PinFlow UI from the top of this post? Here’s what it looks like after the redesign and the review fixes:

New wizard flow. Clean state management. And four edge-case bugs caught before they ever reached a user.

I’ll go deep on the extension itself in a future post.

(Stay tuned for that one.)

.

.

.

The Review Gate: The Automated Alternative

The plugin also includes a review gate — a built-in hook that automatically runs a Codex review before Claude finishes a task:

/codex:setup --enable-review-gate

When enabled, every response Claude is about to complete gets intercepted for a Codex review first. If issues are found, the stop is blocked so Claude can address them.

I prefer the manual approach.

The review gate can create long-running Claude/Codex loops that drain usage limits, and it doesn’t give you the chance to filter false positives before they get fed back in. For long autonomous runs where you want a safety net, though, the gate has its place.

Think of the manual prompt as the scalpel and the review gate as the safety net — choose based on how much control you want.

.

.

.

The Bigger Picture: Claude and Codex, Integrated

Let me zoom out for a second. My Claude-Codex workflow has gone through three distinct phases:

1. Side by side (Sept 2025) — Separate tools, separate terminal windows, separate contexts. I used to keep two terminals open — Claude Code on the left, Codex on the right. Copy a file path from the review, switch windows, find the line, switch back. By the third comment I’d lost track of what I was even fixing.

2. Manual handoff (Oct 2025) — Structured workflow with Codex planning and reviewing, Claude building. Better. But still separate tools with separate contexts.

3. Integrated (now) — Codex commands running inside Claude Code. Shared context. No switching. The review happens where the code lives.

Each evolution removed friction. The Claude Code Codex plugin removes the last meaningful barrier: context loss between tools.

And when I pair that with the validation prompt — having Claude critically evaluate Codex’s feedback before acting on it — I get a review workflow that catches real bugs without drowning me in noise.

Between the Codex plugin and the Chrome extension I teased at the top, the direction feels clear. The tools are converging. The best workflow is the one where you never have to leave.

.

.

.

Your Next Steps

The plugin takes 2 minutes to install. The validation prompt is 6 lines you can copy-paste.

Together, they give you a code review workflow that catches real issues — and lets you skip the noise.

Here’s what to do:

  1. Install the plugin (4 commands above)
  2. Run /codex:review on whatever you’re working on right now
  3. Paste the validation prompt and let Claude filter the results
  4. Fix what matters. Skip what doesn’t.

Try it on your next session. You’ll be surprised how many review comments are noise — and how valuable the ones that survive the filter actually are.

Plugin repo: openai/codex-plugin-cc

12 min read The Art of Vibe Coding

Workflow Engineering in Action: Building a Reddit Summarizer From Scratch With Claude Code

Here’s a confession.

I follow about a dozen subreddit threads. AI tooling, Claude Code tips, local LLM experiments, dev workflows. And every single morning, I open Reddit fully intending to spend five minutes catching up.

Forty-five minutes later, I’m still scrolling.

Ninety percent of it is noise. Reposts, complaints (like the weekly usage and rate-limit rants in r/ClaudeCode), low-effort memes, questions that got answered three threads ago. But buried somewhere in there — a workflow trick someone discovered at 2am, a Claude Code hack that actually works in production, a case study with real numbers — that stuff is gold.

I just couldn’t find it fast enough.

So I decided to build something. A simple Express server that would connect to the Reddit API, pull posts and comments from my favorite subreddits, store them locally as JSON files, and let me point Claude at the data to surface only what matters.

And here’s the part that matters for you: I built it using the Claude Code Workflow Engineering process I described in the previous issue. Start to finish. No shortcuts. No “eh, I’ll just wing this part.”

(Okay, I was tempted. But I didn’t.)

What follows is every step of that process applied to a real project — from a blank folder to a working app with full tests passing on the first attempt. Every screenshot. Every command.

Stay with me.

.

.

.

The Starting Point: One Idea, Zero Code

Here’s what my project folder looked like when I started: an idea.md file describing what I wanted, and the Workflow Engineering slash commands from the previous issue.

That’s it. No boilerplate. No template repo. No starter code. Just an idea and a process.

Project folder showing only the idea.md file and workflow engineering commands

The idea itself was pretty straightforward: an Express server that fetches posts and comments from configured subreddits within the last 24 hours, then saves everything as JSON files organized by subreddit and date. No database — just files on disk. Once the data is collected, I can ask Claude to read it and find the good stuff for me.

The one wrinkle? Reddit’s API now requires OAuth 2.0. So the app needs to handle the full authorization flow — token exchange, refresh tokens, the whole dance — before it can fetch anything.
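To give a sense of what that dance involves, here is a minimal sketch of the refresh-token exchange against Reddit's token endpoint (the env var names and User-Agent string are placeholders, not the project's actual code):

```typescript
// Minimal sketch of the refresh-token leg of Reddit's OAuth 2.0 flow.
async function refreshAccessToken(): Promise<string> {
  const basic = Buffer.from(
    `${process.env.REDDIT_CLIENT_ID}:${process.env.REDDIT_CLIENT_SECRET}`
  ).toString("base64");

  const res = await fetch("https://www.reddit.com/api/v1/access_token", {
    method: "POST",
    headers: {
      Authorization: `Basic ${basic}`,
      "Content-Type": "application/x-www-form-urlencoded",
      // Reddit rejects requests without a descriptive User-Agent
      "User-Agent": "RdSummarizer/1.0 (by /u/your_reddit_username)",
    },
    body: new URLSearchParams({
      grant_type: "refresh_token",
      refresh_token: process.env.REDDIT_REFRESH_TOKEN ?? "",
    }),
  });

  if (!res.ok) throw new Error(`Token refresh failed: ${res.status}`);
  const data = (await res.json()) as { access_token: string };
  return data.access_token;
}
```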

With a clear idea written down, I handed it to the workflow.

Let’s walk through what happened.

.

.

.

Step 1: Brainstorm the Specs

I triggered the /spec_brainstorm command and pointed Claude at my idea file.

Claude Code terminal showing /spec_brainstorm command being triggered with the idea.md file

Now, I’ve tried building apps like this before — dumping everything into one prompt and letting Claude run. It got through maybe 60% before the code started contradicting itself. Requirements from the top of the conversation were ghosted by the bottom.

The Claude Code Workflow Engineering approach is different. Instead of jumping into code, Claude started asking clarifying questions. Real ones. With options, explanations, and a recommendation for each.

The first round covered core architecture decisions: How should data collection be triggered? (Manual API endpoint, cron scheduler, or both?) What kind of frontend does this need beyond the OAuth setup page? How should filtering work?

Claude Code presenting multiple-choice question about data collection trigger method with options for manual API endpoint, built-in cron scheduler, or both
Claude Code asking about frontend scope with options for minimal OAuth-only page or full dashboard UI
Summary of first round answers covering Summarizer, Trigger, Frontend, and Filtering categories

The second round went deeper: How should subreddits be configured? (Config file, hardcoded, or environment variables?) What data should go into the JSON files? (Posts only, posts + all comments, or posts + top comments?) Language preference?

Summary of second round answers covering Config, Data scope, and Language with TypeScript selected

Once Claude had enough context from both rounds, it wrote the full specification document.

Claude Code writing the complete specs.md file based on all the answers provided

Two rounds of questions. Clear decisions documented in a file. The specs existed as an artifact on disk — ready to be read by a completely fresh session with zero memory of this conversation.

That last part matters more than you might think.

(We’re about to see why.)

.

.

.

Step 2: Review the Specs

Here’s where most people go wrong. And I know this because I was most people.

On an earlier project, I skipped the review step. The specs looked fine to me. Three hours into implementation, I found a conflict that would have taken a reviewer two minutes to flag. Two minutes.

So now I don’t skip it.

Here’s the thing: the agent that wrote the specs is the worst possible agent to review them. It already “knows” what it meant. It won’t catch ambiguity because it can fill in the gaps from memory. A fresh agent reading the same file cold? It has no such luxury.

New session. /spec_review command.

New Claude Code session showing /clear followed by /spec_review command

A fresh Claude instance — with zero memory of the brainstorming conversation — read the specs and started poking holes.

And it found real problems. Using GET for state-changing operations (a REST convention violation and a security risk — someone could trigger data collection just by visiting a URL). Writing refresh tokens directly to .env at runtime (which, ferpetesake, doesn’t work the way the spec assumed). Vague OAuth state storage. And more.

Claude Code presenting spec review findings including P1 GET for state-changing operations, P2 writing refresh token to .env, and P3 vague OAuth state storage

Now, here’s where your judgment comes in. Claude surfaced a long list of potential issues — some critical, some nice-to-have. You don’t have to fix everything. You get to choose what matters.

I went through them in three tiers.

First — the spec-breaking issues:

Multi-select interface showing spec fixes with options like REST endpoint methods, OAuth state storage, error strategy, and collect-all endpoint behavior

Second — important improvements:

Second page of issue selection showing P5 Collect-all timeout, P6 Pagination limit, P7 Partial failure, and P8 Date boundary

Third — lower priority fixes (I picked the ones with real consequences):

Third page showing lower priority issues including P10 More Reddit error codes, P11 User-Agent source, and P13 Path traversal security fix

Before applying fixes, Claude asked clarifying questions to make sure the solutions would be solid. How should /collect-all handle job tracking? What should the filename date represent — when data was collected or when the post was created? Where should the Reddit username for the User-Agent header come from?

Claude asking about collect-all endpoint job tracking approach with sequential vs parallel options
Claude asking about date boundary logic for filename dates with a visual example showing collection date mapping
Claude asking about User-Agent source with env var recommended, showing example .env configuration

With answers in hand, Claude updated the specs — changing GET to POST for state-changing endpoints, adding proper error handling, fixing the OAuth storage approach, adding pagination limits, and patching a path traversal vulnerability.

Claude modifying the specs file with red/green diff showing changes to API endpoints and OAuth flow
Summary table of all spec changes applied, showing 10 fixes across REST methods, token storage, OAuth state, pagination, error handling, and path traversal
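To make one of those fixes concrete: the path traversal patch boils down to refusing any subreddit name that could escape the data directory before it ever touches the filesystem. A sketch under my own assumptions (Claude hadn't written any project code yet at this point; this is just an illustration of the guard the spec change implies):

```typescript
import path from "node:path";

// Only allow characters that can legally appear in a subreddit name,
// then confirm the resolved path stays inside the data directory.
const DATA_DIR = path.resolve("logs");

function storagePathFor(subreddit: string, date: string): string {
  if (!/^[A-Za-z0-9_]{1,21}$/.test(subreddit)) {
    throw new Error(`Invalid subreddit name: ${subreddit}`);
  }
  const target = path.resolve(DATA_DIR, subreddit, `${date}.json`);
  if (!target.startsWith(DATA_DIR + path.sep)) {
    throw new Error("Resolved path escapes the data directory");
  }
  return target;
}
```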

Ten issues addressed. Specs refined. The artifact on disk now reflected a far more robust design than what the brainstorming session produced alone.

And we still haven’t written a single line of code.

(On purpose.)

.

.

.

Step 3: Write the Test Plan

I’ll be honest with you — this step almost didn’t happen.

Writing tests for code that doesn’t exist yet? It felt ceremonial. Like filling out a form nobody would read. I almost skipped it.

Then the test plan revealed two requirements I’d completely glossed over in the spec.

So now I never skip it.

Fresh session. /write_test_plan command.

New session with /clear followed by /write_test_plan command

Claude read the specs and produced a structured test plan: 33 test cases organized by priority. 8 Critical, 14 High, 10 Medium. Each one with preconditions, specific steps, and expected outcomes.

Claude writing test_plan.md with 33 test cases organized into sections covering config, OAuth, collection, filtering, storage, API, errors, frontend, and date handling

Why does this matter so much?

Because writing test cases forces deep analysis of every requirement. Turning “handle pagination limits” into a specific test case — with exact inputs, steps, and expected outputs — requires genuine understanding. Shallow understanding produces shallow tests, and you’d catch that now rather than three hours into debugging.
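To make that concrete, here is the shape of a single test case, written in the same preconditions/steps/expected format. It's my own illustration, not one lifted from the generated test_plan.md:

```markdown
### TC-0XX: Pagination stops at the configured limit (High)

Preconditions: valid refresh token in .env; subreddit configured with default settings
Steps:
1. Point the collector at a subreddit with more new posts than one Reddit API page returns
2. Call the collection endpoint for the last 24 hours
3. Count the API requests made and the posts written to disk
Expected: the collector follows pagination only up to the spec's page limit,
and every stored post falls within the 24-hour window
```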

And there’s a second benefit: the test plan gives implementation a concrete target. Every task will map to specific test cases. “Done” stops being a gut feeling and starts being a checkmark.

.

.

.

Step 4: Write the Implementation Plan

Another fresh session. /write_impl_plan command.

New session with /clear followed by /write_impl_plan command

Claude read both the specs and the test plan, then generated an implementation plan — 10 tasks, each explicitly linked to the test cases it would satisfy.

Claude writing implementation plan showing project overview and task structure with dependencies

The plan organized tasks into execution waves based on dependencies. Every task mapped to specific test case IDs.

Implementation plan summary table showing 10 tasks with their key test case mappings and execution order across 5 waves

This is the last thinking step. After this, every design decision has been made. Every task has a defined scope. Every success criterion sits in a file on disk.

Now — and only now — we build.

.

.

.

Step 5: Execute the Implementation

Fresh session. /do_impl_plan command.

New session with /clear followed by /do_impl_plan command

Here’s where the Claude Code Workflow Engineering approach earns its keep.

Instead of running all 10 tasks in a single session (which would cause context degradation as the window fills up — I’ve been there, remember?), Claude created each task and processed them in waves using sub-agents. Each sub-agent got a fresh context window. It read the implementation plan from disk, found its assigned task, and executed with laser focus.

Wave 1 started with the foundation — project scaffolding.

Claude creating all tasks with dependencies and executing Wave 1 Task 1 for project scaffolding

Then the waves rolled forward, with parallel tasks running wherever dependencies allowed:

Wave progression showing tasks running in parallel — config loading, OAuth routes, collection logic, and filtering all executing concurrently across waves
Later waves handling comment fetching, error handling, and storage management with sub-agents completing tasks
Final implementation waves covering API routes and the frontend OAuth setup page

Implementation done. 16 files created. All 10 tasks completed across multiple waves.

Implementation summary showing all files created with their purposes — from package.json to the OAuth frontend page

Every sub-agent worked from the same artifact — the implementation plan on disk. No context bleeding between tasks. No “forgetting” early requirements while working on later ones.

Fresh context, every single wave.

.

.

.

Step 6: Setup Before Testing

Before running the test plan, I needed to set up the actual Reddit integration. Three things:

A config file defining which subreddits to monitor (I chose ClaudeCode and ClaudeAI — for obvious reasons):

config.json file showing two subreddits configured — ClaudeCode with minScore 10 and minComments 5, and ClaudeAI with defaults
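On the TypeScript side, that config boils down to a couple of optional fields per subreddit. The interface below is my own sketch; the field names come from the config shown above:

```typescript
// Sketch of how the app might type its config. The interface is illustrative,
// not copied from the repo; the values mirror the config.json above.
interface SubredditConfig {
  name: string;
  minScore?: number;    // filter out posts below this score
  minComments?: number; // filter out posts with fewer comments
}

const config: { subreddits: SubredditConfig[] } = {
  subreddits: [
    { name: "ClaudeCode", minScore: 10, minComments: 5 },
    { name: "ClaudeAI" }, // falls back to defaults
  ],
};
```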

A Reddit app registration to get OAuth credentials:

Reddit's create application page with RdSummarizer as the app name, web app type selected, and localhost redirect URI configured

And a .env file with the credentials:

.env file showing REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_REDIRECT_URI, REDDIT_USERNAME, REDDIT_REFRESH_TOKEN (empty), and PORT=5566

Straightforward stuff. Let’s get to the good part.

.

.

.

Step 7: Run the Test Plan

The final step. Fresh session. /run_test_plan command.

(Deep breath.)

New session with /clear followed by /run_test_plan command

Claude read the test plan, explored the codebase, and confirmed this was a fresh test run with 33 test cases ready to execute.

Claude reading the test plan and exploring the codebase structure, confirming 33 test cases for a fresh test run

It created tasks for each test case, set up tracking files, and organized execution by dependencies and priority.

Claude creating 33 test case tasks with dependencies, setting up test-status.json and test-results.md tracking files

I asked Claude to skip TC-003 (environment variable validation) since that one needed manual testing with specific env states.

User asking Claude to skip TC-003 env validation test, Claude acknowledging and marking it as skipped

Then the tests ran. One sub-agent per test case. Each with fresh context.

Test execution Phase 2 running TC-001 through TC-006 with sub-agents, showing config loading, invalid config, OAuth redirect, and callback tests passing
Mid-test execution showing TC-007 through TC-010 passing — token refresh, collection happy path, subreddit validation, and hours parameter tests
Continued test execution with TC-011 through TC-019 — collection, filtering, storage, and error handling tests all passing with sub-agents
Test execution TC-020 through TC-025 — storage directory creation, merge deduplication, Reddit API pagination, comments fetching, and rate limit handling all passing
Final batch of tests TC-026 through TC-033 including error codes 401 403 404 429 5xx, frontend OAuth page, User-Agent header, and collection date naming all passing

All automated tests passed on the first attempt.

Zero code fixes required.

Test completion summary showing all 32 automated tests passed on first attempt with zero code fixes needed, and implementation matched the specification

Here’s the full results table:

Final score: 32/33 passed. 0 failed. 1 skipped (TC-003 — manual user testing). 0 known issues. 0 total fix attempts.

Test execution summary showing 32/33 passed, 0 failed, 1 skipped TC-003 for manual user testing, and all automated test cases passed on first attempt with zero code fixes

Let that sit for a second.

Every automated test passed on the first try. No code fixes needed. The implementation matched the specification because the specification had been thoroughly brainstormed, independently reviewed, and turned into a test plan before any code was built.

The ceremony I almost skipped? Turns out it was doing the heavy lifting all along.

.

.

.

Putting the App to Work

With all tests green, I could actually use the thing.

First up: the OAuth flow. I started the server and opened the setup page — a simple “Connect to Reddit” button.

Reddit Summarizer Setup page showing Not Connected status with an orange Connect to Reddit button

One click, and Reddit’s authorization page appeared.

Reddit OAuth authorization page asking to allow RdSummarizer to access posts and comments and maintain access indefinitely

After approving, the app received a refresh token and displayed it with clear instructions to add it to .env.

Reddit Summarizer success page showing the refresh token with a Copied button and instructions to add REDDIT_REFRESH_TOKEN to the .env file
Note: The refresh token is fake.

Token saved. Now I asked Claude to hit the /api/collect-all endpoint and pull data from both configured subreddits.

Claude Code running the collect-all endpoint with hours=24, showing successful collection from ClaudeCode and ClaudeAI subreddits with post counts
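If you'd rather trigger it yourself than ask Claude, it's a single request. The port comes from my .env and the POST method from the spec review; treat the exact route shape as illustrative rather than copied from the repo:

```typescript
// Trigger a collection run for the last 24 hours.
const res = await fetch("http://localhost:5566/api/collect-all?hours=24", {
  method: "POST",
});
console.log(await res.json());
```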

The data landed exactly where the specs said it would — JSON files organized by subreddit and date.

File explorer showing collected JSON data in logs folder organized by subreddit, with actual Reddit post data visible including titles, scores, and timestamps from the ClaudeCode subreddit

Now for the payoff.

I asked Claude to read the collected data and surface the latest Claude Code tips, workflows, and real-world case studies.

User prompt asking Claude to find and summarize the latest Claude Code tips, workflows, and case studies from the collected posts

The collected data was large — 64k tokens. Claude spawned 6 sub-agents to process it in parallel, each analyzing a chunk.

Claude processing the large data file with 6 parallel sub-agents, each analyzing a chunk of posts — ranging from 23.8k to 82.3k tokens

And here’s what came out — a synthesized summary of everything worth knowing from the last 24 hours across both subreddits:

Claude's synthesized insights showing top hacks like Force Opus sub-agents, hook-based context injection, notification sounds on Mac, and workflow optimizations including must-have settings and measure twice cut once workflows

Two subreddits. Hundreds of posts and comments. Distilled into actionable insights in under a minute.

I could never consume that volume of data and extract insights that fast by scrolling Reddit manually. The app collects and organizes. Claude analyzes and summarizes. And because all of this runs through my Claude subscription, there’s no separate API cost for the summarization part.

My morning Reddit scroll just went from 45 minutes to about 2.

.

.

.

Why the Workflow Made This Possible

You might be thinking: “Okay, but couldn’t you have built this without all the workflow steps? It’s just an Express server with some API calls.”

Honestly? Probably. This project is small enough that a skilled developer could prompt their way through it in one session.

But here’s what would have been different.

1. The spec review caught 10 issues before any code existed. 

Using GET for state-changing operations. Writing tokens to .env at runtime. Missing pagination limits. A path traversal vulnerability. Any one of these would have meant debugging sessions after implementation — or worse, shipping a security hole you never noticed.

2. The test plan gave implementation a concrete target. 

33 test cases, defined before Claude wrote a single line of code. When every task maps to specific success criteria, you don’t end up with “it seems to work” confidence. You end up with “every test passed on the first attempt” confidence. There’s a world of difference between those two.

3. Fresh sessions prevented context rot. 

The brainstorm session accumulated context from two rounds of Q&A. The review session started clean — and immediately found problems the brainstorming agent was blind to. The implementation used sub-agents in waves, each with its own fresh context window. No degradation. No forgotten requirements.

4. The artifacts served as shared memory. 

Every step read from the previous step’s output file. Specs fed the review. Reviewed specs fed the test plan. Test plan fed the implementation plan. Implementation plan fed the sub-agents. Nothing lived “in context.” Everything lived on disk, where any fresh session could pick it up.

And here’s the part I keep coming back to: the workflow scales. 

This project happened to be small.

The next one might not be.

And the exact same six commands:

  • /spec_brainstorm
  • /spec_review
  • /write_test_plan
  • /write_impl_plan
  • /do_impl_plan
  • /run_test_plan 

…will work the same way regardless of what you’re building.

You design the process once. You refine it over time. Then you apply it to everything.

That’s the whole promise of Claude Code Workflow Engineering. And I think this little Reddit project makes a decent case for it.

.

.

.

Your Turn

The full source code is on GitHub: reddit-summarizer

If you want to use the same workflow for your own projects, grab the Workflow Engineering Starter Kit — all six command files, ready to drop into your .claude/commands/ folder.

Here’s what I’d suggest:

  1. Pick a project idea you’ve been sitting on
  2. Write it down in an idea.md file — even a rough paragraph works
  3. Run the six-step workflow end to end
  4. Pay attention to what the spec review catches — that’s usually where the biggest surprise shows up

What are you going to build with it?

Go engineer it.

17 min read The Art of Vibe Coding

Workflow Engineering: Why Your AI Development Process Matters More Than Your Prompts

You open Claude Code.

You’ve got a feature to build — a complex one. Payment integration, subscription handling, admin dashboard, the works.

So you write the most detailed prompt you’ve ever crafted. 1000+ words. Every requirement listed. Edge cases mentioned. You even throw in a few “make sure you handle X” reminders for good measure.

(You’re being thorough. You’re being responsible. You’re practically writing documentation before the code even exists.)

You hit enter.

Claude gets to work.

Files appear. Functions materialize. Code flows like water.

Thirty minutes later, you look at the output.

Half your edge cases? Missing. The subscription lifecycle you described in exquisite detail? Partially implemented. That race condition you specifically warned about? Acknowledged in a code comment — a lovely, well-formatted code comment — but never actually handled.

So you do what every developer does.

You rewrite the prompt.

Make it longer. More specific. Add bold text for emphasis. Paste in code examples. Maybe underline something, just to really drive the point home.

Same result. Different gaps.

.

.

.

The Prompt Optimization Trap

Here’s the cycle most developers are stuck in right now:

The prompt keeps getting bigger. The results don’t keep getting better.

You’ve probably watched this happen in real-time.

The AI starts strong — the first few hundred lines look great. Then quality dips. Functions get shallower. Edge cases receive “TODO” comments instead of actual handling. By the end, Claude is running on fumes, juggling so much context that it’s forgetting what you said at the beginning of your very thorough, very responsible prompt.

Everyone’s response?

Write a better prompt. A clearer prompt. A more detailed prompt. I did this too. For longer than I’d like to admit.

Here’s what I learned after months of building complex features with Claude Code: the answer has nothing to do with writing better prompts.

The answer is designing better workflows.

.

.

.

From Prompts to Workflows

Stay with me here — because this is the shift that changed everything about how I work with AI.

Think about how you’d approach a complex feature without AI.

You wouldn’t sit down, write everything you know into one document, hand it to a junior developer, and say “build all of this.” That’s a recipe for disaster.

(And possibly a resignation letter.)

Instead, you’d break the work into phases.

Write specs first. Review them. Plan the implementation. Assign tasks. Verify the results. Each phase produces something concrete — a document, a plan, a test report — that feeds into the next phase.

The same principle applies to AI-assisted development. And it has a name.

Workflow Engineering is the practice of designing multi-step, artifact-driven processes where each step produces a concrete output that becomes the input for the next step — and where the process itself is reusable across projects.

Read that again.

Two words matter most:

Artifact-driven. Every step creates something tangible. A spec file. A test plan. An implementation plan. Not vibes. Not “context.” Actual files that exist on disk and can be read by a fresh session.

Reusable. The workflow works regardless of what feature you’re building. Payment integrations, admin dashboards, API endpoints, plugin architecture — the same sequence of steps applies every time.

Here’s the mental model shift:

With prompt thinking, you’re optimizing the message.

With workflow thinking, you’re optimizing the process.

One is fragile, project-specific, and impossible to debug when things go sideways. The other is robust, reusable, and traceable — meaning when something does go wrong (and it will, because software), you can trace exactly where the chain broke.

The question stops being “how do I write the perfect prompt to implement this feature?” and becomes something far more interesting: “what sequence of focused steps will reliably produce a working feature — regardless of what that feature is?”

That second question? That’s workflow engineering.

.

.

.

The Four Principles of Workflow Engineering

After months of building and refining workflows for Claude Code, I’ve distilled what makes them work down to four principles.

(Four! A reasonable number. I considered making it seven because odd numbers feel more authoritative, but that felt dishonest. Four is what I’ve got. Four is what works.)

These apply to any AI coding tool — Claude Code, Cursor, Copilot, Codex, whatever ships next quarter.

The tools will change.

These principles won’t.

Principle 1: Separate Thinking from Doing

When Claude is brainstorming specs, it shouldn’t be writing code. When it’s implementing, it shouldn’t be redesigning architecture. Mixing planning and execution causes both to suffer.

Here’s why.

Planning gets shallow when the agent is eager to start building.

It rushes through decisions because there’s code to write — ferpetesake, there are functions to create, endpoints to scaffold. Meanwhile, the code gets sloppy because the agent is still making design decisions mid-stream — changing its mind about architecture while simultaneously trying to implement it.

You’ve seen this happen.

Claude starts building a feature, realizes halfway through that the data model needs restructuring, pivots the architecture, and now half the code it already wrote doesn’t match the new approach.

The result? A Frankenstein codebase where the first half follows one pattern and the second half follows another.

Every step in a well-engineered workflow should be either a thinking step or a doing step.

The artifact that comes out of the thinking phase — the spec, the plan, the test cases — becomes the wall between them. By the time Claude starts coding, every design decision has already been made and documented.

No more mid-implementation architecture pivots. No more shallow plans that crumble at the first edge case.

Principle 2: Fresh Context, Always

Here’s something most developers learn the hard way. (I certainly did.)

AI performance degrades as context accumulates. The longer a session runs, the worse the output gets. Claude starts “forgetting” early instructions. It takes shortcuts. Details slip through the cracks like sand through fingers.

We call this context rot — and it’s the silent killer of ambitious AI projects.

Think of it like a multi-day hiking trip.

Day one, your backpack is light. You’re sharp, focused, covering ground fast. By day five — if you’ve been packing on top of yesterday’s gear without clearing anything out — you’re hauling 40 pounds of stuff you don’t need. Yesterday’s rain jacket (it’s sunny now). Tuesday’s extra water bottles (you passed a stream an hour ago). Your pace drops. Your attention narrows. You start missing trail markers because you’re too busy adjusting your shoulder straps.

That’s what happens to an AI agent running in a single session across a dozen tasks.

Workflow engineering forces natural context boundaries:

Each step runs in its own session. Each sub-agent gets a clean slate. The file carries knowledge forward. The context resets every time.

Fresh backpack. Every single morning.

Principle 3: Artifacts Over Memory

Don’t trust the AI to “remember” what you decided three steps ago.

(Don’t trust yourself to remember, either. I once forgot a critical API decision I made that same morning. Before coffee, but still.)

Every decision, every requirement, every edge case — externalized into a file.

Why? Three reasons.

  • A file can be read by a fresh session. This enables Principle 2. When a new session starts, it reads specs.md from disk — it doesn’t need to “recall” a conversation that happened two hours ago in a completely different context window.
  • A file can be reviewed by a different agent — or by you. This is how you catch mistakes before they compound. The spec review step? That’s a fresh agent reading the brainstorm agent’s output and poking holes in it. Adversarial quality control, built right into the workflow.
  • A file creates a traceable chain. If something breaks in implementation, you can walk the chain backwards to find exactly where things went wrong:

Without artifacts, every failure means starting from scratch.

With artifacts, every failure is traceable to a specific step. You fix that step. You re-run from that point. Everything downstream updates accordingly.

That’s the difference between “something broke” and “I know exactly where it broke.”

Principle 4: Define Success Before Starting Work

Write the test plan before the implementation plan.

(I know. I can feel you resisting this one through the screen.)

Most developers want to start building immediately.

Writing test cases for code that doesn’t exist yet feels like… paperwork. Busywork. The kind of thing a project manager suggests in a meeting you didn’t want to attend.

But for AI-driven development, it changes the entire outcome. Here’s why.

1/ Deep requirement analysis.

When Claude has to turn “handle race conditions during renewal processing” into a specific test case — with preconditions, exact steps, and expected outcomes — it has to deeply understand what that requirement actually means.

Shallow understanding produces shallow tests.

If the test plan looks thorough, the requirements were thoroughly analyzed.

2/ Gap detection before code exists.

A missing test case reveals a missing requirement. And finding a gap in your spec is a hundred times cheaper before implementation than after.

(Ask me how I know.)

3/ Clear implementation targets.

Every task in the implementation plan maps to specific test cases.

The developer — or AI agent — knows exactly what “done” means for each piece of work. No ambiguity. No interpretation. No “I thought you meant…”

You’re building toward a defined target instead of discovering the target while building.

Which sounds obvious when I write it out like that — but go look at your last three AI-assisted features and tell me you had a test plan before you started coding.

(No judgment. I didn’t either. Until I did.)

.

.

.

The System: Workflow Engineering in Practice

Principles are great.

Principles are necessary.

But at some point, you need to see them actually working — not just sounding wise on a page.

So let me show you the complete workflow engineering pipeline I’ve built and refined over the past several months for Claude Code. Six steps, four phases, every principle encoded into the process.

I’ve written deep-dives on each phase of this system:

This article is the why behind those hows.

Here’s the complete system at a glance:

Let me walk you through each step — what it does, why it exists, and what artifact it produces.

Step 1: Spec Brainstorm

Principle served: Artifacts Over Memory

You describe the feature you want.

But instead of Claude immediately starting to code, you trigger a question-asking mode: “Ask me clarifying questions until you are 95% confident you can complete this task successfully.”

That line is the key.

It tells Claude to stop assuming. Stop guessing. Stop filling in blanks with whatever seems reasonable.

Claude explores your codebase first — reading your existing patterns, your database schema, your current architecture. Then it starts asking questions, with options, explanations, and its own recommendation for each.

In my WooCommerce integration project, Claude asked 15 questions covering everything from subscription plugin choice to refund handling to email notifications. Edge cases I hadn’t thought about. Architectural decisions that would have bitten me weeks later.

Every answer gets compiled into a comprehensive specification document.

Artifact produced: notes/specs.md

👉 Deep-dive: The 3-Phase Method for Bulletproof Specs

Step 2: Spec Review

Principle served: Fresh Context

Start a new session. Fresh context. Then ask Claude to critique its own work.

Why a new session?

Because the brainstorming session’s context is bloated with 15 rounds of Q&A. A fresh agent reading the specs with skeptical eyes catches things the original agent — who was busy building the specs — overlooked.

In my project, the review found 14 potential issues, including a race condition that would have caused double charges (ferpetesake, the payments!), a token deletion scenario that would silently break renewals, and a mode-switching conflict that would have confused billing for every active subscriber.

You pick which issues matter. Claude fixes them — with another round of clarifying questions to make sure the fixes are solid.

Artifact produced: refined notes/specs.md

👉 Deep-dive: The 3-Phase Method for Bulletproof Specs

Step 3: Test Plan

Principle served: Define Success Before Starting Work

Before writing any implementation code, Claude reads the specs and generates a structured test plan. Every requirement becomes a test case with preconditions, specific steps, expected outcomes, and priority levels.

For my WooCommerce project: 38 test cases organized into 12 sections. 7 Critical, 20 High, 11 Medium.

This serves a dual purpose.

It verifies Claude deeply understood every requirement — shallow understanding produces shallow test cases, so thorough tests mean thorough comprehension. And it creates the success criteria that will drive everything that follows.

Artifact produced: notes/test_plan.md

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 4: Implementation Plan

Principle served: Separate Thinking from Doing

Claude reads both the specs and test plan, then generates an implementation plan. Tasks are grouped logically, dependencies are identified, and every task maps to the specific test cases it will satisfy.

For my project: 4 phases, 12 tasks, each explicitly linked to test cases. Phase 1 handles foundation (TC-001 to TC-007). Phase 2 tackles checkout and lifecycle (TC-008 to TC-014). Phase 3 addresses the critical renewal processing (TC-015 to TC-021). Phase 4 covers remaining features (TC-022 to TC-037).

This is the last thinking step.

After this, the wall goes up. Every design decision has been made. Every task has a clear target. Now — and only now — we build.

Artifact produced: notes/impl_plan.md

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 5: Execute Implementation

Principle served: Fresh Context

Here’s where sub-agents change everything.

Instead of running all 12 tasks in one session (guaranteed context rot), Claude creates each task using the built-in task management system, identifies dependencies, and processes them in waves. Each task runs in its own sub-agent with fresh context.

Fresh backpack. Laser focus.

For my project:

  • Wave 1: 2 sub-agents (foundation tasks, no dependencies)
  • Wave 2: 2 sub-agents (checkout + lifecycle, depend on Wave 1)
  • Wave 3: 3 sub-agents (critical renewal processing)
  • Wave 4: 6 sub-agents (remaining features)

Total time: 52 minutes. 13 tasks completed. 38 test cases worth of functionality implemented. Each sub-agent used ~18% context — compared to ~56% if everything had run in a single session.

Artifact produced: working code across all specified files

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 6: Run Test Plan

Principle served: All four principles working together

The final step.

Claude reads the test plan, creates one task per test case, analyzes dependencies between tests, and executes them sequentially — one sub-agent per test, each with fresh context.

If a test fails, the sub-agent analyzes the root cause, implements a fix, and re-runs the test. Up to 3 attempts. If it still fails after 3 tries, it gets marked as a known issue with reproduction steps and a suggested fix.

For my project: 30 tests. 2 hours 12 minutes. All passed. One bug found and autonomously fixed during TC-002 — a settings save handler that wasn’t persisting color options. Found, diagnosed, fixed, re-verified. All without me touching the keyboard.

Results get logged in two places: test-status.json for machine parsing, test-results.md for human review.

Artifact produced: notes/test-results.md and notes/test-status.json

👉 Deep-dive: Claude Code Testing: The Task Management Approach That Actually Works

The Complete Artifact Chain

Look at how everything connects:

Nothing lives in memory.

Everything lives in files. Every step reads from the previous step’s artifact. If something goes wrong at Step 5, you trace backwards through the chain to find exactly which artifact — which decision — needs fixing.

No more “something broke somewhere, guess we start over.” Just: “the impl plan missed a dependency — let me fix Step 4 and re-run from there.”

I’ve packaged all six prompt files into a Workflow Engineering Starter Kit — drop them into your .claude/commands/ folder and the entire pipeline is ready to go. Download the Starter Kit here →
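If you've never set up custom slash commands before: each one is just a markdown file in .claude/commands/, and the filename becomes the command name. A heavily condensed spec_review.md might look roughly like this (my own sketch, not the Starter Kit's actual file):

```markdown
<!-- .claude/commands/spec_review.md -->
Read notes/specs.md with fresh eyes. You have no memory of how it was written.

1. List every ambiguity, conflict, security risk, and missing requirement you find.
2. Rank each finding by priority (P1, P2, P3) and explain the consequence of ignoring it.
3. Ask me which findings to fix before changing anything.
```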

.

.

.

Design Your Own Workflows

The six-step system above is one example — a specific workflow I’ve built for feature implementation with Claude Code.

But the principles behind it apply to any multi-step AI task.

Writing a technical article, planning a product launch, migrating a database, refactoring a legacy codebase. Same principles. Different steps.

The specific prompts change. The tools change. The principles stay constant.

Here’s a checklist you can run before starting any complex AI task — five questions that reveal whether your process has gaps:

Five questions. If any answer is “no,” your workflow has a gap.

  • The artifact test catches phantom steps — work that happens “in context” but produces nothing concrete. Those are the steps where information vanishes between sessions.
  • The thinking/doing test catches the most common mistake in AI-assisted development: asking an AI to plan and build in the same breath. Every time you let that happen, both the plan and the build suffer.
  • The context boundary test catches rot before it starts. If you can’t point to where sessions should reset, you’ll end up with one massive session that degrades across every task.
  • The success definition test catches the “just build it and we’ll see” trap. Without defined success criteria, you have no way to verify the output — and no target for the AI to aim at.
  • The traceability test catches broken chains. If you can’t walk backwards from a failure to its root cause, your artifacts aren’t detailed enough to serve as the connective tissue between steps.

.

.

.

The Skill That Compounds

Here’s what I want you to take away from all of this.

The six prompt files in the Starter Kit will be outdated eventually.

Claude Code will add new features. The task management API might change. New AI tools will emerge that handle things we can’t even imagine yet.

The workflow engineering thinking behind those prompts won’t age.

Separate thinking from doing. Reset context at natural boundaries. Externalize decisions into artifacts. Define success before you start building. These principles work today with Claude Code. They’ll work next year with whatever comes next.

And here’s the compounding part — the part that makes this a skill and not just a technique: every workflow you design teaches you to design better workflows.

You start noticing patterns. Where context rot creeps in. Where planning and execution get tangled. Where artifacts need more detail. Your workflows get tighter with each project. Your instincts sharpen.

The developers who will thrive in AI-assisted development over the next few years won’t be the ones who write the best prompts.

They’ll be the ones who engineer the best workflows.

.

.

.

Your Next Steps

  1. Download the Workflow Engineering Starter Kit → — All six prompt files, ready to drop into .claude/commands/
  2. Run the checklist against your current process — find the gaps
  3. Try the full pipeline on your next feature — specs through testing
  4. Refine what works, replace what doesn’t

What feature are you going to build with this workflow?

Pick one.

Run the pipeline. See what happens when Claude has a structured process to follow instead of a single prompt to interpret.

Go engineer it.


P.S. — For the deep-dives on each phase, start here:

8 min read The Art of Vibe Coding

The Claude Code Skill Creator Now Has Evals (And My Skills Finally Have Proof They Work)

Watch the video walkthrough, or read the full written guide below.

Here’s a confession.

For months, I’ve been building Claude Code skills with what I can only describe as the “hope and pray” methodology. Write the SKILL.md. Test it once. Ship it. Whisper a small prayer to the LLM gods. Move on with my life.

Did the skill actually trigger when it should? ¯\_(ツ)_/¯

Did it make Claude’s output better? Honestly… no idea.

I’ve been using skills since they were added to Claude Code — and until last week, I had zero way to answer either of those questions.

(Stay with me. This story has a happy ending.)

.

.

.

The Problem With Skills (That Nobody Wants to Admit)

Here’s the thing about Claude Code skills: they’re just text prompts. Fancy, well-organized text prompts — but text prompts nonetheless.

And text prompts don’t come with test suites.

I’ve built dozens of skills over the past few months. Frontend design patterns. WordPress security checklists. Newsletter writing styles. Documentation generators. Each one followed the same ritual:

  • Write a SKILL.md file
  • Test it manually (once, maybe twice if I’m feeling thorough)
  • Hope it works
  • Wonder — weeks later — if it’s actually triggering
  • Wonder — with increasing anxiety — if it’s helping when it does trigger
  • Have absolutely no data to know either way

The old skill-creator plugin could generate skills for you, which was genuinely useful. But it had no evals. No testing. No benchmarks. You’d create a skill, and then… that was it. Cross your fingers, close the terminal, pretend everything was fine.

I kept using skills because they felt useful. But I couldn’t prove it. I couldn’t point to a number and say “this skill improves output quality by 9.5%.”

Every skill I created was a guess. A lovingly crafted, well-intentioned guess — but a guess.


The Upgrade That Changes Everything

The Claude Code skill creator plugin just got a massive upgrade. And honestly? It solves the exact problem I’ve been complaining about for months.

The new version adds something skills have never had: a testing and benchmarking layer.

Claude Code plugin discovery interface showing the skill-creator plugin by claude-plugins-official with 19.1K installs and description "Create new skills, improve existing skills, and measure s..."

Here’s what the updated skill creator can do:

  • Create skills from your requirements (same as before)
  • Generate evals — actual test cases — automatically
  • Run parallel A/B benchmarks comparing skill vs. baseline Claude
  • Optimize trigger descriptions so your skill activates when it should
  • Iterate until the skill measurably improves output

That last part bears repeating: measurably improves output. With numbers. And charts. And side-by-side comparisons.

Let me show you how this works with a real skill I built last week.

.

.

.

Building a WordPress Security Review Skill (The Whole Process)

I’ve built several WooCommerce plugins — which means security reviews are part of my regular workflow. But Claude’s baseline security reviews felt… inconsistent. Sometimes thorough, sometimes surface-level. No predictable structure.

Perfect candidate for a skill.

Step 1: Describe What You Want

I asked Claude Code to create a skill using the skill-creator plugin:

Claude Code terminal showing user prompt requesting a skill called "wp-security-review" that reviews WordPress plugin PHP code for security vulnerabilities including SQL injection, XSS, CSRF, insecure direct object references, missing capability checks, unsafe file operations, insecure superglobal usage, and hardcoded secrets.

My prompt included the specific vulnerability types I wanted covered: SQL injection, XSS, CSRF, missing nonce verification, insecure $_GET/$_POST usage, and more.

Step 2: The Skill Creator Explores Your Codebase

Here’s where things get interesting.

Claude loaded the skill-creator skill and immediately started exploring my project:

Claude Code terminal showing skill-creator successfully loaded, then searching for 2 patterns and reading files to understand the project structure, existing security references, and PHP patterns before creating the skill.

The skill-creator looked at my existing code, found security patterns already in the project, and used that context to build a skill tailored to my codebase. (Not a generic one-size-fits-all approach.)

Step 3: The Generated Skill

Claude wrote 330 lines to .claude/skills/wp-security-review/SKILL.md:

Claude Code terminal showing the created wp-security-review skill with 330 lines written, including a description covering SQL injection, XSS, CSRF, missing capability checks, unsafe file operations, and hardcoded secrets. Also shows 3 test prompts: reviewing CartHandler.php, checking BulkActions.php, and doing a full plugin security audit.
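If you haven't looked inside one before, a SKILL.md is markdown with a small YAML frontmatter block, and the description is what Claude uses to decide when to load the skill. A heavily trimmed skeleton of this skill might look roughly like so (the section names are my paraphrase of what's listed below, not the generated file):

```markdown
---
name: wp-security-review
description: Review WordPress plugin PHP code for security vulnerabilities
  (SQL injection, XSS, CSRF, missing capability checks, unsafe file operations,
  insecure superglobal usage, hardcoded secrets). Use when asked to audit or
  security-review plugin code.
---

## Vulnerability checklist
- SQL injection, XSS, CSRF, IDOR, capability checks, file operations, superglobals, secrets

## Output format
- One finding per issue, tagged [CRITICAL] / [HIGH] / [MEDIUM] / [LOW] / [INFO]
- End with a "Passed checks" section
```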

The skill included:

  • A detailed trigger description (optimized for when Claude should activate it)
  • A vulnerability checklist with 8 categories
  • WooCommerce-specific nuances — like wc_price() double-escaping and WC Settings API nonce delegation
  • Structured output format with severity ratings

All good stuff. But here’s the thing: a skill is only as good as its results.

And until now, I had no way to measure those results.

.

.

.

The Part That Made Me Actually Stop and Stare: Evals

After creating the skill, Claude immediately said: “Now let me set up test cases and run them.”

Wait, what?

Claude Code terminal showing creation of evals.json file with test cases including prompts like "Review the CartHandler.php for security issues" with expected outputs describing structured security reports identifying $_POST sanitization issues, nonce verification patterns, and price manipulation risks.

The skill-creator generated an evals.json file with:

  • 3 test prompts targeting different aspects of my plugin
  • Expected outputs for each test
  • Specific files to review
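In other words, a small structured file along these lines. The field names here are my guesses at the shape based on what's visible in the screenshots, not the plugin's documented schema:

```json
{
  "evals": [
    {
      "prompt": "Review the CartHandler.php for security issues",
      "files": ["CartHandler.php"],
      "expected": "A structured security report that flags unsanitized $_POST usage, checks nonce verification, and calls out price manipulation risks"
    }
  ]
}
```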

And then — and I genuinely did not expect this — it launched parallel agents.

Running 6 Agents Simultaneously

Claude Code terminal showing 6 agents launched in parallel - 3 "with skill" runs and 3 "without skill" baseline runs for CartHandler, BulkActions, and Full audit test cases, all running in the background simultaneously.

Claude launched 6 parallel agents:

  • 3 running the tests with the skill
  • 3 running the same tests without the skill (baseline Claude)

While those ran in the background, Claude drafted the evaluation assertions:

Claude Code terminal displaying detailed evaluation assertions for each test - Eval 1 (CartHandler) checking for severity ratings, identifying unsanitized $_POST, nonce patterns, structured reports, and no false positives on WC hooks. Eval 2 (BulkActions) checking for $_GET without nonce, capability checks, severity levels. Eval 3 (Full audit) checking coverage of all files, summary counts, cross-file patterns, and passed checks sections.

Each test case had specific pass/fail criteria:

  • has_severity_ratings — Findings must be categorized by severity
  • identifies_unsanitized_post — Must flag $_POST['wcpwyw_price'] at line 126
  • no_false_positives_on_wc_hooks — Must NOT flag WC Settings API as missing nonces
  • includes_passed_checks — Must list what passed, not just failures

(Real criteria. Measurable criteria. I could have cried.)

.

.

.

Results Rolling In

As the agents completed, results started appearing:

Claude Code terminal showing Eval 2 BulkActions with skill completed - security review output with summary showing 0 Critical, 0 High, 1 Medium, 2 Low, 1 Info findings, plus timing data showing 38,976 tokens used in 227 seconds.
Claude Code terminal showing Eval 2 BulkActions without skill (baseline) completed - 135 lines of output with 31,386 tokens used in 236 seconds. Notes that both with-skill and without-skill produced similar findings for this relatively simple file, but the with-skill version used structured severity format and checklist notation while baseline used freeform numbered lists.
Claude Code terminal showing all 6 agents (Eval 1, 2, 3 with and without skill) completed, now extracting remaining reports from transcripts. Shows Eval 3 Full audit with skill reviewing 23 files with findings categorized as 0 Critical, 2 High, 4 Medium, 5 Low, 3 Info.
Claude Code terminal showing all 6 reports saved, now generating the eval viewer for side-by-side comparison. Writing benchmark.json with skill_name "wp-security-review" and configurations for "with_skill" and "without_skill" runs.

All 6 agents finished. Claude compiled the results and generated something I’ve never seen in skill development before.

.

.

.

The Eval Review Playground

Claude generated an HTML-based eval viewer and opened it in my browser:

Browser-based eval review interface titled "Eval Review: wp-security-review" showing 1 of 6 test cases. Displays "WITH SKILL" tag, prompt "Review the CartHandler.php for security issues", and output showing a structured security review with Summary (0 Critical, 2 High, 2 Medium, 2 Low, 2 Info) and Findings section with severity-tagged issues like "[HIGH] Price Manipulation via Cart Session - Missing Server-Side Re-validation in applyCartItemPrice".
Browser-based eval review interface showing same prompt but with "WITHOUT SKILL" tag (baseline). Output shows a different format - plain "Security Review: CartHandler.php" header with file path, date, and "Claude Opus 4.6 (automated review)" as reviewer, followed by Executive Summary section in prose format rather than structured findings list.

Side-by-side comparison. Same prompt, same file, two different approaches.

The difference was immediately visible:

  • With skill: [HIGH] Price Manipulation via Cart Session — structured, scannable, severity-tagged
  • Without skill: Prose-style Executive Summary, harder to scan

But subjective impressions only get you so far. Here’s where the numbers come in.

.

.

.

The Benchmark Results (This Is the Good Part)

Claude Code terminal showing eval viewer opened in browser with benchmark comparison table. Metrics show: Pass rate 100% (21/21) with skill vs 90.5% (19/21) baseline (+9.5% delta). Avg tokens 74,427 with skill vs 69,734 baseline (+6.7%). Avg time 276s with skill vs 307s baseline (9.9% faster). Key differences noted: skill version elevated price cap bypass to HIGH severity, avoided false positives on WC nonces, produced more structured passed-checks sections.
| Metric | With Skill | Baseline | Delta |
| --- | --- | --- | --- |
| Pass rate | 100% (21/21) | 90.5% (19/21) | +9.5% |
| Avg tokens | 74,427 | 69,734 | +6.7% |
| Avg time | 276s | 307s | 9.9% faster |

👉 The skill achieved 100% pass rate on all 21 assertions.

Baseline Claude hit 90.5% — missing structured passed-checks sections and some WooCommerce-specific nuances.

And here’s the kicker: the skill was actually faster despite being more thorough.

Key differences Claude identified:

  • Skill version elevated price cap bypass and session revalidation to HIGH severity (correct — these affect money)
  • Skill version correctly noted that current_user_can checks are unnecessary on frontend AJAX handlers
  • Skill version had more structured passed-checks sections
  • Baseline missed nuances like WC Settings API nonce delegation

For the first time ever, I had proof my skill was helping.

.

.

.

Iteration: When The Evals Teach You What to Fix

Here’s where it gets even better.

Looking at the results, Claude identified areas where even the improved skill could be better:

Claude Code terminal showing user saying "please proceed" and Claude responding with 3 specific improvements to make based on analyzing all 6 reports: 1. Add guidance on reviewing JavaScript files (baseline caught DOM XSS in JS that skill missed), 2. Clarify severity calibration for wp_unslash on numeric fields (inconsistent HIGH/MEDIUM ratings), 3. Add note about checking business logic consistency across code paths (price cap bypass was the most valuable finding)

Claude proposed 3 refinements:

  1. Add JavaScript file guidance — The baseline actually caught DOM XSS in JS that the skill version missed (oops)
  2. Calibrate severity for wp_unslash — Numeric field sanitization was rated inconsistently between HIGH and MEDIUM
  3. Add business logic consistency checks — The price cap bypass (validation in one path but not another) was the most valuable finding type

Claude Code terminal showing detailed diff of SKILL.md updates - adding JavaScript file scanning guidance for DOM-based XSS, adding notes about WooCommerce-specific escaping patterns for wc_price(), calibrating wp_unslash severity on numeric fields to Low instead of High, and adding business logic consistency checking to catch when validation exists in one code path but not another.

The skill-creator modified SKILL.md with surgical precision:

  • Added DOM-based XSS scanning for JavaScript files
  • Added guidance on wc_price() double-escaping
  • Calibrated wp_unslash on numeric fields to Low severity
  • Added business logic consistency patterns

The evals taught me where my skill was weak. Then Claude fixed it.

.

.

.

Trigger Description Optimization

A brilliant skill that never activates is… well, useless.

Claude ran a trigger optimization loop:

Claude Code terminal showing skill updated with 3 improvements, now running description optimization. Creating trigger-eval.json with test queries like "I just wrote a new AJAX handler for my WooCommerce plugin that lets admins delete orders. Can you check if it's secure?" with should_trigger: true. Running optimization loop in background (up to 5 iterations).

The skill-creator generated test prompts that should trigger the skill, and prompts that shouldn’t. Then it ran those prompts against Claude to see if the skill activates correctly.

Claude Code terminal showing optimization running (36+ API calls to Opus). Summarizes current state: skill at ~350 lines covering 8 vulnerability categories, structured report format with severity ratings, WooCommerce-specific nuances. Benchmark results: 100% pass rate with skill vs 90.5% baseline, skill was 9.9% faster. Lists 4 improvements applied: JS/DOM scanning, cross-path consistency checking, wp_unslash calibration, and WP bulk actions nuances.

After multiple iterations, trigger accuracy improved. The skill now fires when it should — and stays quiet when it shouldn’t.

.

.

.

The Final Skill

Claude Code terminal showing "The skill is complete" with final summary. Skill created: wp-security-review at .claude/skills/wp-security-review/SKILL.md. Reviews WordPress plugin PHP and JS code for 8 categories of vulnerabilities including SQL injection, XSS (including DOM XSS), CSRF, IDOR, missing capability checks, unsafe file operations, insecure superglobals, and hardcoded secrets. Lists unique value over baseline: structured [SEVERITY] format, comprehensive passed checks section, WooCommerce-specific nuances, cross-path consistency checking, and correct severity calibration.
VS Code file explorer showing the wp-security-review skill folder structure with evals subfolder containing evals.json and SKILL.md file.

The completed skill:

  • Reviews WordPress plugin PHP and JS code
  • Covers 8 vulnerability categories
  • Produces structured [SEVERITY] tagged output
  • Includes WooCommerce-specific nuances (nonce delegation, wc_price() escaping, frontend vs admin hooks)
  • Catches business logic inconsistencies (validation in one path but not another)
  • Benchmarks at 100% pass rate vs 90.5% baseline
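
If you haven’t seen the report format, a finding in that structured style reads roughly like this (the specific findings, files, and line numbers are invented for illustration):

[HIGH] SQL injection in includes/class-orders.php:142
  User input from $_GET['order_id'] is concatenated into the query. Use $wpdb->prepare() with a placeholder.

[LOW] wp_unslash() on numeric field in admin/settings.php:88
  Value is cast to (int) immediately after; low practical risk.

Passed checks: nonce verification on AJAX handlers, capability checks on admin actions, no hardcoded secrets.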

And I have the data to prove it works.

.

.

.

Why This Matters For Your Skills

The Claude Code skill creator fundamentally changes what’s possible.

👉 Before: Skills were art. Intuition. Trial and error. Hope and prayer.

👉 After: Skills are engineering. Testable. Measurable. Improvable.

Here’s what becomes possible:

1. A/B Test Every Skill You Build

Every skill you create can be benchmarked against baseline Claude. If your skill doesn’t measurably improve output, you know immediately — before you ship it, not three weeks later.

2. Catch Regressions When Models Update

When Claude Opus 5.0 ships, run your benchmarks again. If baseline now matches or exceeds your skill’s performance, the skill may be locking in outdated patterns. Time to retire it — or improve it.

3. Tune Your Trigger Descriptions

A skill that triggers 50% of the time is only half as valuable. The description optimizer catches false positives (triggering when it shouldn’t) and false negatives (not triggering when it should).

4. Run Continuous Improvement Loops

Each eval run produces actionable feedback. Claude identifies gaps, proposes fixes, and re-benchmarks — all without you manually debugging SKILL.md files at midnight.

.

.

.

Your Next Steps

  1. Open Claude Code
  2. Type /plugin and search for skill-creator
  3. Install the official Anthropic plugin (19,100+ installs and counting)
  4. Pick one skill you’ve already built — or a new one you’ve been meaning to create
  5. Ask Claude to create evals and benchmark it
  6. Watch the data tell you exactly where to improve

What skill are you going to benchmark first?

The developers who run evals will build better skills than those who don’t. That’s just… math.

Go build yours.

Now.

11 min read The Art of Vibe Coding

The Single File That Makes or Breaks Your Claude Code Workflow

Watch the video walkthrough, or read the full written guide below.

I thought I was being thorough.

My CLAUDE.md file had grown to over 1,500 lines. Every coding convention I’d ever learned. Every edge case I’d encountered. Code snippets for common patterns. Integration examples for every third-party service we used. Database schema references. The works.

I was so proud of that file. Look at all this context I’m giving Claude! Surely this would make it understand my project perfectly.

(Narrator voice: It did not.)

Here’s what actually happened: Claude started missing obvious things. Instructions I knew were in there—ignored. Conventions I’d spelled out clearly—forgotten. The more I added to my CLAUDE.md, the worse Claude performed.

I’d accidentally discovered something that changed how I approach AI-assisted development entirely.

.

.

.

The problem wasn’t that Claude couldn’t follow instructions. The problem was that I’d given it too many to follow.

What Even Is a CLAUDE.md File? (And Why Should You Care?)

Let’s back up for a second—because if you’re new to Claude Code, you might be wondering what I’m talking about.

CLAUDE.md is a markdown file that lives at the root of your project. Claude Code reads it automatically at the start of every session. Think of it as your project’s instruction manual for the AI—persistent memory across what would otherwise be completely stateless conversations.

And here’s the thing: after your choice of model, your CLAUDE.md file is the single biggest point of leverage you have in Claude Code.

One bad line in there? It cascades into everything downstream.

Every decision Claude makes flows from that initial context. A vague instruction becomes a vague spec, becomes vague research, becomes a vague plan, becomes… well, you know how this story ends.

.

.

.

The “Instruction Budget” I Wish Someone Had Told Me About

Here’s where I learned my lesson the hard way.

LLMs have a finite number of instructions they can reliably follow at once. This sounds obvious when I say it out loud, but I’d never really internalized it until I watched my 1,500-line CLAUDE.md file turn Claude into a confused mess.

The counterintuitive part? Adding more instructions doesn’t just risk the new ones being ignored. It degrades performance uniformly across all your instructions—including the ones that worked perfectly before.

Research from Chroma on “context rot” backs this up: as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases. Your beautiful, comprehensive CLAUDE.md file might actually be making Claude worse at remembering what’s in it.

Let me do the math for you: Claude Code’s system prompt already uses around 50 instructions. If the model handles roughly 250 total, you’ve got about 200 left for your CLAUDE.md plus your plan plus your task prompt.

And here’s the part that really stung when I realized it: a bloated CLAUDE.md means you’re filling up the context window before you even send your first instruction. Every session starts with that massive file loaded. Every message you send has to fit alongside it.

If you’ve been struggling with Claude Code eating through your weekly usage too fast, the first place to cut is your CLAUDE.md file. Seriously. Smaller context = fewer tokens consumed = more runway for actual work.

I went digging through public CLAUDE.md files on GitHub recently. About 10% of them exceed 500 lines.

That’s almost certainly too large. (Ask me how I know.)

👉 Aim for under 300 lines. Ideally much shorter.

.

.

.

The Framework That Finally Made Sense

After my bloated-file disaster, I needed a new approach. I landed on something simple: think of CLAUDE.md as an onboarding document.

Imagine you’re bringing a brilliant new hire up to speed on day one. What would you tell them? Three things, really:

WHAT — The tech stack, project structure, key files. “This is a Next.js 14 app with App Router, Prisma, and Stripe.”

WHY — The purpose of the project and its parts. “We’re building an e-commerce platform for artisan sellers.”

HOW — Commands, workflows, conventions. “Use npm, not pnpm. Run tests before commit.”

Everything else? Details that can live elsewhere or get loaded on-demand. (More on that “on-demand” part in a minute—it’s a game-changer.)

.

.

.

Where You Put Things Actually Matters

Here’s something I didn’t appreciate until embarrassingly recently: models pay more attention to the top and bottom of a document than the middle. Primacy and recency effects—same cognitive biases humans have.

So structure your CLAUDE.md accordingly:

At the top (highest weight):

  1. Project description (1-3 lines)
  2. Key commands (dev, test, build, lint, deploy)
  3. Tech stack and architecture overview

In the middle (lower relative weight):

  4. Code style and conventions
  5. File/folder structure map
  6. Important gotchas and warnings
  7. Git and commit conventions

At the bottom (high weight again):

  8. Explicit DO NOTs
  9. References and @imports to deeper docs

Let me walk through the ones that matter most.

Project Description

A concise summary that orients Claude to the big picture. Every decision should tie back to purpose.

# Project: ShopFront
Next.js 14 e-commerce application with App Router, 
Stripe payments, and Prisma ORM. Built for artisan 
sellers to manage inventory and process orders.

Three lines. Claude now knows what you’re building, who it’s for, and the core architecture. That’s it.

Key Commands

Be explicit here. Don’t assume Claude knows your setup—it doesn’t.

## Commands
- `npm run dev`      — Start dev server (port 3000)
- `npm run test`     — Run Jest unit tests
- `npm run test:e2e` — Run Playwright E2E tests  
- `npm run lint`     — ESLint check
- `npm run build`    — Production build
- `npm run db:migrate` — Run Prisma migrations

And include the non-obvious choices! “Use npm not pnpm or bun” saves you from Claude randomly picking bun because it read about it somewhere and thought it’d be helpful. (Thanks, Claude. Very helpful.)

Code Style—Where Most People Go Wrong

This is where I see CLAUDE.md files bloat into monsters. Vague rules that waste your precious instruction budget.

Don’t write this:

  • “Use good coding practices”
  • “Write clean code”
  • “Follow best practices”

These instructions accomplish nothing. They’re the equivalent of telling a new hire “do good work.” Thanks, very actionable.

Write this instead:

  • “TypeScript strict mode, no any types”
  • “Use named exports, not default exports”
  • “Prefer const over let”
  • “Use import type {} for type-only imports”

Every instruction should produce a measurable difference in output.

And here’s a secret that took me way too long to figure out: don’t send an LLM to do a linter’s job. If a rule can be enforced by ESLint or Prettier, enforce it there. LLMs are slow and expensive linters. Claude learns from your existing code patterns anyway—it doesn’t need to be told every formatting convention.
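
For instance, two of the rules above can move straight into your ESLint config and never cost you a single instruction (a minimal sketch that assumes the typescript-eslint plugin is already set up in your project):

{
  "rules": {
    "prefer-const": "error",
    "@typescript-eslint/no-explicit-any": "error"
  }
}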

One more thing: resist the urge to stuff code snippets, integration examples, and schema references into your CLAUDE.md. I know it feels helpful. I did it too. But all those “handy references” are just bloating your context window and triggering context rot. If Claude needs to see code, it can read your actual codebase.

The DO NOTs

Put these at the bottom (recency effect) and be specific:

## DO NOT
- Do not modify files in `/generated/` — they are auto-generated
- Do not use `console.log` — use the project logger
- Do not run `prisma db push` — always use migrations

Use emphasis sparingly. If everything is IMPORTANT, nothing is.

.

.

.

The Technique That Changed Everything: Lazy Loading Your Context

Okay, here’s where things get really interesting.

Instead of cramming everything into one giant root file, you can distribute smaller CLAUDE.md files across subfolders. The magic? They only load when Claude actually reads files in that folder.

Think about what this means: your Supabase migration instructions only consume tokens when you’re actually working on Supabase. During frontend work? Those instructions don’t exist. They’re not cluttering up Claude’s context. They’re not eating into your instruction budget.

It’s lazy loading for your AI context.

How to Split Things Up

Your root CLAUDE.md stays small—maybe 50-100 lines. Project description, global commands, universal rules. The stuff that applies everywhere.

Then each subfolder gets its own focused file:

  • src/CLAUDE.md — Component patterns, state management approach, import rules
  • api/CLAUDE.md — Endpoint conventions, auth rules, error format, validation patterns
  • supabase/CLAUDE.md — Migration flow, schema rules, dangerous commands to avoid

These load automatically when Claude reads files in those directories. No extra work on your part. Just organize your instructions where they logically belong.
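
As a rough sketch, an api/CLAUDE.md might be nothing more than this (contents are illustrative; yours will reflect your own conventions):

# API Conventions
- Endpoints: one route handler per resource under /app/api
- Auth: verify the session before touching the database
- Errors: return { error: { code, message } } with the matching HTTP status
- Validation: validate every request body before use; never trust client input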

Progressive Disclosure with @imports and Rules

Here’s another trick: reference detailed docs instead of inlining everything.

## References
See @README.md for project overview
See @docs/api-patterns.md for API conventions  
See @docs/auth-flow.md for authentication details
See @package.json for available scripts

Or use the .claude/rules/ directory:

All markdown files in .claude/rules/ load automatically alongside your main CLAUDE.md. Modular. Organized. Maintainable.
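
In practice that often ends up as a handful of small, topic-scoped files (the file names here are hypothetical):

.claude/rules/
  testing.md      (how to run and structure tests)
  database.md     (migration workflow, naming conventions)
  git.md          (branch naming, commit message format)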

If you want to take this further—and I mean really further—I wrote a deep dive on building self-evolving Claude Code rules that keep all your guidelines, code snippets, and best practices organized in a system that actually grows smarter over time: How to Build Evolving Claude Code Rules.

It’s the natural next step once you’ve got the basics of CLAUDE.md structure down.

.

.

.

A Real Example: What This Looks Like in Practice

Here’s a complete root CLAUDE.md that actually works:

# Project: ShopFront
Next.js 14 e-commerce app with App Router, Stripe, Prisma ORM.
Built for artisan sellers to manage inventory and process orders.

## Commands
- `npm run dev`: Start dev server (port 3000)
- `npm run test`: Run Jest tests
- `npm run test:e2e`: Run Playwright E2E tests
- `npm run lint`: ESLint check
- `npm run build`: Production build
- `npm run db:migrate`: Run Prisma migrations

## Tech Stack
- TypeScript (strict mode)
- Next.js 14 (App Router)
- Prisma ORM + PostgreSQL
- Stripe for payments
- Tailwind CSS + Radix UI
- Jest + Playwright for testing

## Architecture
- `/app` — Pages, layouts, API routes
- `/components/ui` — Shared UI components
- `/lib` — Utilities, helpers, shared logic
- `/prisma` — Schema and migrations

## Code Conventions
- Named exports only (no default exports)
- Use `import type {}` for type-only imports
- No `any` types — use branded types for IDs
- Functional components with hooks
- Tailwind classes only, no custom CSS files

## Important
- NEVER commit .env files
- Stripe webhook must validate signatures
- Images stored in Cloudinary, not locally
- Do not modify files in `/generated/`
- Use project logger, not console.log

See @docs/auth-flow.md for authentication details
See @docs/api-patterns.md for API route conventions

That’s roughly 45 lines. Clean. Scannable. Universally applicable.

.

.

.

The Growth Strategy (Or: How to Not Repeat My Mistakes)

Please, I’m begging you: do not start with a giant template or auto-generated file.

Start with the absolute minimum. Project description, key commands. Maybe 20 lines.

Then use Claude Code on your project. When Claude makes a repeated mistake—and it will—add ONE targeted instruction to fix it. Commit that change to Git so you can trace it later.

Here’s the counterintuitive part: with every model release, look at what you can remove from your CLAUDE.md. Not what you can add. Remove.

Newer models have better built-in behaviors. Old workarounds can actively hinder them. Your CLAUDE.md should shrink over time, not grow.

(This was hard for me. I’m a collector by nature. But trust me—less really is more here.)

.

.

.

The Maintenance Checklist I Actually Use

Every few weeks—or after any model upgrade—I run through this:

Remove:

  • Redundant rules the model handles naturally now
  • Old workarounds for previous model versions
  • Vague instructions that don’t change output
  • Rules that should be enforced by linters instead
  • Code snippets and examples that bloat context

Relocate:

  • Domain-specific rules → move to subfolder CLAUDE.md files
  • Detailed docs → convert to @imported files
  • Rarely-used conventions → put in .claude/rules/

Simplify:

  • Merge overlapping instructions
  • Replace verbose paragraphs with bullet points
  • Make every line earn its place

The Complete Picture

Here’s how all the pieces fit together: a small root CLAUDE.md for universal context, subfolder CLAUDE.md files for domain-specific rules, and @imports plus .claude/rules/ for everything that belongs elsewhere.

It looks like a lot. But you don’t need all of it. Start with the root file. Add lazy loading when your root gets crowded. Grow organically.

.

.

.

Your Next Step

Here’s my challenge for you:

Open your most active Claude Code project. Look at your CLAUDE.md file (or create one if it doesn’t exist). And ask yourself one question:

What instruction am I going to remove today?

Not add. Remove.

Find the vague rule that accomplishes nothing. Find the workaround for a model behavior that’s been fixed. Find the code snippet that’s just bloating your context. Find the formatting instruction that ESLint already handles.

Delete it.

Every line should earn its place.

10 min read The Art of Vibe Coding

I Found a Better Way to Design Pages in Claude Code (And I’m a Little Mad I Didn’t Know Sooner)

Watch the video walkthrough, or read the full written guide below.

You’re tweaking a landing page in Claude Code.

“Make the layout more balanced,” you type—feeling pretty clever about your prompt, if we’re being honest.

Claude adjusts the CSS. You refresh the browser.

Hmm. Not quite right.

“Actually, make the image section wider.”

Claude dutifully changes the grid. You refresh again.

Still off. (Why is this so hard?)

“Can you try pill-shaped buttons instead?”

And now you’re three iterations deep, squinting at your screen, no longer entirely sure what “right” even looks like anymore.

Here’s the thing: I spent an embarrassing amount of time in this loop before discovering there’s a plugin that makes the whole rigmarole unnecessary.

It’s called the Claude Code Playground skill. And it changes everything about how you approach design work.

Stay with me.

.

.

.

The Describe-Refresh-Despair Loop (A Love Story Gone Wrong)

Let’s be real about what’s happening here.

You have a vision in your head. A feeling of what the page should look like. Something about the proportions, the spacing, the way elements breathe together.

But translating that feeling into words?

That’s where the wheels come off the wagon.

The loop goes something like this:

  1. Describe a change in words (that you hope Claude interprets correctly)
  2. Wait for Claude to apply it
  3. Refresh the page
  4. Realize it’s not quite what you meant
  5. Try different words—maybe “airy” instead of “spacious”?
  6. Repeat 8-12 times
  7. Eventually settle for “close enough” while muttering under your breath

I burned way too many hours in this loop last week, redesigning a page. Every iteration felt like playing telephone with my own design instincts.

(Spoiler: I was the one garbling the message.)

The problem isn’t Claude. The problem is that words are a terrible interface for visual decisions.

.

.

.

Enter the Claude Code Playground Skill: Your New Best Friend

Anthropic built an official plugin—the Playground skill—that adds something radical to Claude Code:

A visual layer between your brain and your codebase.

Here’s how it works: Claude analyzes your existing page, then generates a self-contained HTML file with sliders, dropdowns, and presets that let you see different design directions instantly.
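
If “self-contained HTML file” sounds abstract, the underlying pattern is simple: a control wired directly to a live preview, with no build step and no server. The toy version below is my own illustration of that pattern, not Anthropic’s actual template:

<!-- One slider, one preview block, updated live -->
<label>Gallery radius <input type="range" id="radius" min="0" max="32" value="12"></label>
<div id="preview" style="border: 1px solid #ccc; padding: 24px; border-radius: 12px;">
  Product gallery
</div>
<script>
  document.getElementById('radius').addEventListener('input', (e) => {
    document.getElementById('preview').style.borderRadius = e.target.value + 'px';
  });
</script>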

No code changes. No refreshing. No “what I said vs. what I meant” shenanigans.

Just a live preview that updates as you click.

Once you’ve dialed in the exact design you want—with your actual eyeballs—the playground generates a natural language prompt describing your choices.

Copy. Paste. Execute.

One pass. Done.

👉 The key insight: You’re no longer translating visual intuition into words. The playground does that translation for you.

.

.

.

The Real-World Test: Redesigning a WooCommerce Product Page

Enough theory. Let me show you exactly how this worked on my LicenseWP product page.

The existing page was… fine. Functional. The kind of “fine” that makes you wince slightly every time you look at it.

Alt: The original LicenseWP product page showing Theme Pro v2 with a gradient image placeholder on the left, three pricing tier cards (Agency, Business, Personal) stacked vertically on the right, and description tabs below

The image area felt cramped. The pricing cards looked squished together. The “Add to cart” button was doing its best impression of a wallflower at a party.

I wanted to experiment with different proportions, card styles, and CTA treatments.

The old me would have spent an hour going back-and-forth with Claude in text prompts.

The new me? Installed the Playground skill.

.

.

.

Step 0: Install the Plugin (30 Seconds, Tops)

First things first—you need the Claude Code Playground skill installed.

Run /plugin in Claude Code, switch to the Discover tab, and search for “playground.”

Alt: Claude Code's plugin discovery interface showing a search for "playground" with the official plugin by claude-plugins-official listed below, showing 11.4K installs and description "Creates interactive HTML playgrounds"

Hit space to toggle it on.

That’s it. Claude now knows how to build interactive design playgrounds. (11.4K installs can’t be wrong, right?)

.

.

.

Step 1: Send Claude to Do Reconnaissance

Here’s where the magic starts.

I gave Claude a prompt telling it to visit my product page, study the layout like an art critic at a museum, and then build a Design Layout Playground based on what it found.

Alt: The detailed prompt entered into Claude Code instructing it to visit the product page URL, study the layout structure, and build an interactive Design Layout Playground using the playground skill with specifications for presets, preview panel, output location, and prompt generation

Claude opens Chrome and starts analyzing—like a very thorough house inspector, but for web pages.

Alt: Claude Code using Chrome browser tools to analyze the product page, showing multiple tool calls: tabs_context, read_page, get_page_text, and several javascript_tool executions to extract HTML structure

It reads the DOM, extracts text content, and runs JavaScript to pull the full HTML structure.

Alt: Claude Code taking a screenshot of the page and scrolling down to capture the tabs section and footer, building a complete mental model of the layout

Screenshots. Scrolling. Full page analysis from header to footer.

(Claude is nothing if not thorough.)

.

.

.

Step 2: Claude Builds You a Custom Design Tool

Once Claude understands your page, it loads the Playground skill and reads the design template.

Alt: Claude Code loading the playground skill successfully and reading the design-playground template file from the plugins cache directory

Then—and here’s the part that made me do a little happy dance—it creates a single self-contained HTML file with everything baked in.

Alt: Claude Code creating the notes/playground directory and writing 1253 lines to product-page-design.html, showing the beginning of the HTML file with doctype, head section, and CSS reset

1,253 lines. A complete interactive design tool, built specifically for your page, in about 30 seconds.

Alt: VS Code file explorer showing the notes folder containing the playground subfolder with product-page-design.html file marked as untracked

Here’s what Claude built for me:

Alt: Claude Code's summary showing the playground includes controls for Page Layout, Gallery, Typography, Tier Cards, CTA Button, Tabs, and Color Theme, plus 5 presets: Current Design, Clean Editorial, Bold SaaS, Compact & Dense, and Premium Showcase
  • 5 presets — each one a cohesive design direction
  • 7 control groups — layout, gallery, typography, cards, buttons, tabs, colors
  • Live preview — using my actual page content (real product names, real prices, real structure)

Ferpetesake, this is exactly what I needed.

.

.

.

Step 3: Play With Designs Like a Kid With New LEGOs

Open the HTML file in your browser.

And then—I’m not going to lie—prepare to lose 20 minutes just playing.

Alt: The Product Page Design Playground interface showing a left panel with controls for Page Layout (max width, grid split, gap, spacing), Gallery (style, radius, aspect ratio), and Typography (title size, weight, description style), with a live preview of the product page on the right

The left panel has all the controls. The right panel shows a live preview that updates instantly as you change anything.

(I may have spent an unreasonable amount of time just clicking the “Gallery Style” options back and forth. Don’t judge me.)

Alt: Scrolled down view of the playground showing additional controls for Tier Cards (layout, card style, radius, highlight, badge), CTA Button (style, width, size), Tabs Section (style, alignment), and Color Theme (primary accent swatches and surface treatment)

I started by clicking through the presets to find a direction:

  • Bold SaaS — too aggressive for this product
  • Compact & Dense — too cramped (we’re selling premium themes, not packing a suitcase)
  • Clean Editorial — closer! But needed tweaks

Then I fine-tuned individual controls:

  • Grid split: 50/50 (equal width for image and details)
  • Gallery style: Framed (light background with border instead of that gradient)
  • Gallery radius: 24px (rounder corners, friendlier vibe)
  • CTA button: Pill-shaped, full-width, large
  • Tabs: Pill style, stretched across the full width

Every single change reflected instantly in the preview.

👉 Here’s what hit me: I wasn’t describing what I wanted anymore. I was seeing it. And clicking until it looked right.

.

.

.

Step 4: Copy the Magic Prompt

Once the design felt right, I scrolled to the bottom of the playground.

And there it was.

Alt: The playground with controls adjusted showing the generated prompt output at the bottom in a highlighted box listing all design changes: product grid split 50/50, section spacing 64px, gallery style framed, border radius 24px, title size 36px, etc., with a Copy Prompt button on the right

The Prompt Output panel had already written a clear instruction describing exactly what I chose—and only what differed from the defaults.

Redesign the product single page at http://localhost:8107/product/theme-pro-tc011/ 
with the following design changes:

- product grid split: 50/50
- section spacing: 64px
- gallery style: framed
- gallery border radius: 24px
- product title size: 36px
- tier card border radius: 8px
- selected tier highlight: border-only
- CTA button style: pill-shaped
- CTA button width: full-width
- CTA button size: large
- tab style: pill tabs
- tab alignment: stretch

No ambiguity. No “make it more modern” nonsense.

Just precise specifications that Claude can execute without guessing.

One click on Copy Prompt.

.

.

.

Step 5: Let Claude Do Its Thing

Back in Claude Code. Paste.

Alt: Claude Code receiving the pasted prompt and entering plan mode, exploring the product page CSS and templates by running find commands in the woocommerce directory

Claude immediately recognizes the design instructions and enters plan mode to explore the codebase.

Alt: Claude Code reading CSS files, checking CSS variables, and examining the single product template structure, then entering a "thinking" state for 2 minutes 17 seconds to write the plan

33 tool uses. 101k tokens. 2 minutes of thinking.

Claude reads every relevant CSS file, understands the variable system, and maps out exactly what needs to change.

Then it presents the plan:

Alt: Claude Code's detailed plan titled "Redesign Product Single Page" showing the context, file to modify (assets/css/frontend.css), and specific CSS changes including grid-template-columns changing from 2fr 3fr to 1fr 1fr, margin-bottom adjustments, and gallery style changes from gradient to framed with exact code snippets

Every change mapped out. Exact line numbers. Before/after CSS.

(Is it weird that I find this deeply satisfying? Don’t answer that.)

Alt: Continuation of the plan showing dark mode considerations (existing CSS variables handle it automatically), verification steps for testing the changes visually, and the execution prompt asking whether to proceed with options to clear context, auto-accept edits, manually approve, or modify the plan

Verification steps included. Dark mode considerations. Mobile responsiveness checks.

I approved. Claude started executing.

Alt: Claude Code applying CSS edits to frontend.css, showing a diff view with red lines indicating removed code (grid-template-columns: 2fr 3fr, old margin-bottom, linear-gradient background) and green lines showing the new values (1fr 1fr grid, updated spacing, solid background with border)

9 CSS edits. All applied.
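
To give you a feel for the edits, the headline change from that diff boils down to something like this (simplified; the selector names are placeholders, not the actual frontend.css):

/* Before: 40/60 split, gradient gallery background */
.product-grid { grid-template-columns: 2fr 3fr; }

/* After: balanced 50/50 split, framed gallery */
.product-grid { grid-template-columns: 1fr 1fr; }
.product-gallery {
  background: #fff;
  border: 1px solid #e5e5e5;
  border-radius: 24px;
}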

Alt: Claude Code's completion summary confirming all 9 CSS changes were applied correctly, listing each change: grid split 1fr 1fr, section spacing, gallery styling with border and radius, title size, tier cards, CTA button, and tabs, with a note that dark mode works automatically

And… done.

.

.

.

The Result (AKA: The Part Where I Do a Victory Lap)

Here’s the final redesigned product page:

Alt: The redesigned LicenseWP product page showing the new 50/50 grid layout with a framed gallery area on the left, equal-width pricing section on the right with the same three tiers, a full-width pill-shaped purple Add to cart button, and stretched pill-style tabs below for Description, Additional information, and Reviews
  • Equal-width layout. The image and product details now have balanced visual weight.
  • Framed gallery. Clean border instead of that dated gradient background.
  • Full-width CTA. The “Add to cart” button finally commands the attention it deserves.
  • Pill tabs. Stretched across the full width with a modern, cohesive feel.

The design matches what I saw in the playground preview—applied in a single pass.

HECK YES.

.

.

.

The Before & After (Because We All Love a Good Transformation)

| Before | After |
|---|---|
| 40/60 grid split (cramped image) | 50/50 split (balanced) |
| Gradient gallery background | Framed with border |
| Small inline CTA button | Full-width pill CTA |
| Underline tabs | Stretched pill tabs |
| 10+ prompt iterations | 1 pass |

The old workflow would’ve taken an hour of back-and-forth (and probably some mild frustration-snacking).

This took 15 minutes—and honestly, most of that was me playing with the controls because it was genuinely fun.

.

.

.

When the Claude Code Playground Skill Really Shines

👉 Redesigning existing pages. You already have something. You want to explore variations without breaking it.

👉 Client projects. Preview before you commit. Show options before you build. (Clients LOVE this, by the way.)

👉 Design indecision. When you don’t know what you want—and let’s be honest, that’s more often than we’d like to admit—clicking through presets beats describing in words.

👉 Reducing the prompt iteration loop. One visual session replaces 10+ text-based rounds of “no, more like… actually less like that… wait, go back.”

The playground acts as a translation layer between your visual intuition and Claude’s execution capabilities. You figure out what “right” looks like with your eyes, then communicate that with precision.

The Prompt Template (Steal This)

Here’s the full prompt I use. Copy it. Adapt it. Make it yours.

First, use the browser to visit and read this page: [YOUR_PAGE_URL]

Study the page's current layout structure, section hierarchy, component patterns, 
and overall visual design. Take note of how content is organized and what 
elements are present.

Then, use the "playground" skill (design-playground template) to build an 
interactive Design Layout Playground based on what you found on that page.

The playground should let me visually explore different layout and component 
style combinations for that page.

## Presets
Include 3–5 named presets that snap all controls to a cohesive combination, 
inspired by what would work well for the page's content. For example:
- "Clean Editorial" — airy spacing, narrow content width, minimal components
- "Bold & Modern" — full-width hero, elevated cards, bold CTAs
- "Compact Dashboard" — tight spacing, grid cards, minimal chrome
- Adapt these to fit the actual content and purpose of the page

## Preview
- Single live preview panel that updates instantly on every control change
- The preview should use a simplified but recognizable representation of the 
  actual page content (use real section names, headings, and placeholder text 
  that matches the page structure)
- Use raw CSS (no Tailwind or frameworks)

## Output Location
- Save the playground HTML file to `notes/playground/` folder (create it if 
  it doesn't exist)

## Prompt Output
- Generate a natural language instruction at the bottom that I can copy and 
  paste back into Claude to implement the chosen design
- The prompt should describe the layout and component decisions in enough 
  detail to be actionable without the playground
- Only mention choices that differ from the defaults
- Frame it as a direction, e.g.: "Redesign the page with a full-width hero 
  section, 3-column card grid with elevated shadows and 16px gap, airy 
  section spacing (64px), pill-shaped CTAs positioned inline..."
- Include the source page URL in the generated prompt for context

Replace [YOUR_PAGE_URL] with whatever page you want to redesign.

.

.

.

Your Turn

Next time you’re about to type “make it more modern” or “adjust the spacing” or “try a different card style”—stop.

Build a playground first.

Let your eyes make the decisions. Let the Claude Code Playground skill translate those decisions into words. Let Claude execute them precisely.

What page are you going to redesign with this workflow?

Go install the plugin. Run /plugin, search “Playground”, toggle it on.

Now.

(And maybe clear your schedule. Because once you start playing with those sliders, you might lose track of time.)

9 min read The Art of Vibe Coding

I Showed You the Wrong Way to Do Claude Code Testing. Let Me Fix That.

Last week, I walked you through browser testing with Claude Code using the Ralph loop plugin.

I was pretty proud of it, actually.

Here’s the thing: I was wrong.

Well, not entirely wrong. The tests ran. Things got verified. But what I showed you? That wasn’t a true Ralph loop—not the way Geoffrey Huntley originally designed it. And the difference matters more than I realized at the time.

(Stay with me here. This confession has a happy ending.)

.

.

.

The Problem I Didn’t See Coming

The real Ralph loop is supposed to wipe memory clean at the start of each iteration. No leftover context. No accumulated baggage. Just a fresh, focused agent tackling one task at a time.

The Ralph loop plugin from Claude Code’s official marketplace? It preserves context from the previous loop. The plugin relies on a stop hook to end and restart each iteration—but the conversation history tags along for the ride.

And that’s where everything quietly falls apart.

Here’s what this actually looks like in practice:

Imagine you’re setting out on a multi-day hiking trip. Every morning, you pack your backpack for that day’s trail.

Now imagine that instead of emptying your pack each night, you just… keep adding to it. Day one’s water bottles. Day two’s snacks. Day three’s rain gear (even though it’s sunny now). By day five, you’re hauling 40 pounds of stuff you don’t need, and you can barely focus on the trail in front of you.

That’s context rot.

It happens when an AI model’s performance degrades because its context window gets bloated with accumulated information from previous tasks. The more history your agent carries forward, the harder it becomes for the model to stay sharp on what actually matters right now.

👉 The takeaway: Fresh context isn’t a nice-to-have. It’s the whole point.

.

.

.

What Context Rot Actually Looks Like

Let me make this concrete with Claude Code testing:

Iteration 1: Claude runs test TC-001. Context is clean. Performance is sharp. The backpack is light.

Iteration 5: Claude runs test TC-005. But it’s also dragging along memories of TC-001 through TC-004. The pack is getting heavy.

Iteration 15: Claude runs test TC-015. The model is now swimming through accumulated history, trying to find what actually matters among all the gear from previous days.

Iteration 25: Claude runs test TC-025. Performance has degraded. The model makes weird mistakes. It forgets what it was supposed to verify—because it’s exhausted from carrying everyone else’s context.

Same trail. Same agent. Completely different performance.

And here’s the frustrating part: you might not even notice it happening. The tests still run. They just run… worse. Slower. Less reliably. With occasional bizarre failures that make you question your own test plan.

.

.

.

The Solution That Was Already There

So I went looking for a better approach to Claude Code testing—something that would give me the clean-slate benefits of a proper Ralph loop without the context accumulation problem.

And I found it in a tool I’d been using for something else entirely: Claude Code’s task management system.

Here’s where it gets interesting.

The task management system gives you the same effect as a properly implemented Ralph loop—but with something the Ralph loop never had: dependency management.

Think back to the hiking metaphor.

Each sub-agent is like a fresh hiker starting a new day with an empty pack. They get their assignment, they complete their section of trail, they report back. Then the next hiker takes over with their own empty pack.

  • No accumulated gear.
  • No context rot.
  • No performance degradation over time.

But here’s the bonus: the task management system also handles situations where “Day 3’s trail can’t start until Day 2’s bridge gets built.” Dependencies get tracked automatically. Tests that need prerequisites don’t run until those prerequisites pass.

(Is that two features in one? Well, is a package of two Reese’s Peanut Butter Cups two candy bars? I say it counts as one delicious solution.)

.

.

.

How Claude Code Testing Actually Works With Task Management

Let me show you exactly how to set this up.

Fair warning: there are a lot of screenshots coming. But I promise each one shows something important about the workflow.

Step 1: Put the Prompt as a Command

First, store the entire testing prompt as a command file. This makes triggering your Claude Code testing workflow trivially easy—just a slash command away.

Claude Code project structure showing .claude/commands folder containing run_test_plan.md file, with a skills folder below it

The full prompt (I’ll include it at the end—it’s long but worth having) tells Claude exactly how to read your test plan, create tasks, set dependencies, execute tests sequentially, and track results.

Step 2: Trigger the Command

With the command saved, execution is just:

Claude Code terminal showing the /run_test_plan command being typed, ready for execution

That’s it. Type /run_test_plan and let the system take over.

Step 3: Claude Reads Your Specs

Since we’re starting fresh—no memory of previous execution—Claude first reads your original specs, test plan, and implementation plan to understand the context.

Claude Code output showing it checking for existing test runs and reading 3 files including the test plan

(Remember: empty backpack. The agent needs to load up on just what it needs for this journey.)

Step 4: Claude Creates the Tasks

After understanding the context, Claude creates one task per test case. Watch how it automatically detects dependencies:

Claude Code creating 30 test tasks with dependency analysis showing TC-004/005/006/008 depend on TC-003, TC-016/017 depend on TC-014, and other dependency chains

See that dependency analysis?

  • TC-004, TC-005, TC-006, TC-008 depend on TC-003 (password field must exist first)
  • TC-016, TC-017 depend on TC-014 (categories must exist first)
  • TC-019 through TC-023 depend on TC-018 (priority dropdown must exist first)
  • TC-029 depends on TC-027 (accent color must be saved first)

The system figured this out by reading the test plan. No manual configuration required.
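
Conceptually, each of those chains just becomes a blocked_by reference on the task (a sketch of the resulting relationships, not the exact tool syntax):

TC-003  blocked_by: (none)     creates the password field
TC-004  blocked_by: TC-003     needs the password field to exist
TC-016  blocked_by: TC-014     needs categories to exist
TC-029  blocked_by: TC-027     needs the accent color saved first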

Step 5: Dependencies Get Locked In

All 30 tasks created.

Now Claude sets up the dependencies and verifies everything:

Claude Code showing all 30 tasks created with dependencies being set up, updating test-status.json with start timestamp

Step 6: Test Status File Created

Claude creates a test-status.json file to track everything—machine-readable, resumable, and audit-friendly:

Claude Code writing 260 lines to notes/test-status.json, showing metadata structure with testPlanSource, totalIterations, maxIterations, startedAt, and summary counts

The execution order is now crystal clear:

  1. Unblocked tasks first: TC-001, TC-002, TC-003, TC-007, TC-009, TC-010, TC-011, TC-012, TC-013, TC-014, TC-015, TC-018, TC-024
  2. Tasks blocked by TC-003: TC-004, TC-005, TC-006, TC-008
  3. Tasks blocked by TC-014: TC-016, TC-017
  4. Tasks blocked by TC-018: TC-019, TC-020, TC-021, TC-022, TC-023
  5. Tasks blocked by TC-027: TC-029

Step 7: First Task Begins

Here’s where the magic happens.

Claude spawns a sub-agent—with fresh context—to execute TC-001:

Claude Code starting execution with TC-001, spawning a Task sub-agent with the instruction "You are a test execution sub-agent. You have ONE job: execute and verify ONE test case."

“You are a test execution sub-agent. You have ONE job: execute and verify ONE test case.”

That’s the key instruction. Fresh hiker. Empty backpack. Single trail.

Step 8: Browser Automation for Testing

The sub-agent uses Claude Code’s browser automation to test like a real user would:

Claude Code showing Chrome browser automation (javascript_tool) with "View Tab" option, indicating 52+ tool uses for the testing process

It navigates to URLs, clicks buttons, fills forms, takes screenshots at verification points, and checks the DOM state against expected outcomes.

Real browser. Real interactions. Real Claude Code testing.

Step 9: Test Status Gets Updated

After completing a test, the sub-agent updates the status file:

Claude Code updating notes/test-status.json after TC-001 execution, showing 30 tasks with 1 done and 29 open, with VS Code diff view

Step 10: Human-Readable Results Too

The results also get appended to a markdown log for human review:

Claude Code writing to notes/test-results.md after updating test-status.json, showing the dual logging system for machine and human readability

Every test gets logged in two places:

  • test-status.json for machine parsing
  • test-results.md for human review

(Because sometimes you want to query the data programmatically, and sometimes you just want to read what happened over coffee. Both are valid.)

Step 11: Automatic Progression to Next Task

Once TC-001 completes, Claude automatically moves to TC-002:

Claude Code showing TC-001 passed (Done with 36 tool uses, 48.1k tokens, 4m 19s), then spawning a new sub-agent for TC-002 with fresh context

Look at those stats: 36 tool uses, 48.1k tokens, 4 minutes 19 seconds for TC-001.

Then a completely fresh sub-agent spawns for TC-002. New hiker. New backpack. No accumulated context from TC-001.

Step 12: Bugs Found? Claude Fixes Them.

TC-002 found a bug. Here’s what happened:

Claude Code showing TC-002 passed after 1 fix (Quick Setup wasn't saving color options - fixed in WizardAjax.php), then moving to TC-003 which is Critical and unblocks 4 other tests

“TC-002 passed after 1 fix (Quick Setup wasn’t saving color options — now fixed in WizardAjax.php).”

The sub-agent detected the failure, analyzed the root cause, implemented a fix, and re-ran the test. All autonomously. All within the same fresh context.

Step 13: Dependencies Unlock Automatically

Now watch the dependency system in action.

Once TC-003 passes:

Claude Code showing TC-003 passed, announcing that TC-004, TC-005, TC-006, TC-008 are now unblocked, then moving to TC-007

“TC-003 passed. Now TC-004, TC-005, TC-006, TC-008 are unblocked.”

The password field exists now. All the tests that depend on it can finally run.

👉 This is why dependencies matter: They prevent tests from running before their preconditions are met—avoiding the exact conflicts where one agent messes with something another agent needs.

Steps 14-15: The Marathon Continues

It keeps going. Test after test. Each sub-agent fresh and focused:

Claude Code showing a sequence of passed tests: TC-007, TC-004, TC-006, TC-005, TC-008, TC-009, TC-010, each with tool usage stats and completion times
Claude Code showing later tests completing: TC-023 through TC-030, including TC-029 being unblocked after TC-027, with all tests passing

Every test runs sequentially. Every sub-agent gets clean context. Every dependency is respected. No context rot in sight.

Step 16: All Tests Complete

After 2 hours and 12 minutes:

Claude Code showing final test results being written to test-results.md, displaying a summary table with all 30 tests passed, including TC-001 through TC-012 with their priorities and fix attempts

30 tests. All passed. Zero known issues.

Step 17: The Full Summary

The orchestrator writes a comprehensive summary:

Here’s what got verified:

  • All 6 Critical tests passed (password handling, priority validation, accent color persistence)
  • Server-side validation confirmed working (urgent priority rejected, invalid hex rejected, password never stored in wp_options)
  • UI behaviors verified (notice dismiss, auto-hide, field error clearing, color sync)
  • Accessibility attributes verified on priority dropdown

And that bug that got fixed? handleQuickSetup() in WizardAjax.php wasn’t saving desq_primary_color or desq_accent_color options. Found during TC-002. Fixed autonomously.

.

.

.

Why This Actually Works Better

Let me be direct about the comparison:

| Aspect | Ralph Loop Plugin | Task Management System |
|---|---|---|
| Context | Preserves across iterations | Fresh per sub-agent |
| Dependencies | None | Built-in blocking |
| Parallel safety | Risky | Sequential by default |
| State tracking | Basic stop hook | JSON + Markdown logs |
| Bug fixing | Manual | Automatic (up to 3 attempts) |
| Resumability | Limited | Full state recovery |

The Ralph loop was supposed to start each iteration with a clean slate. The task management system actually delivers on that promise—and adds dependency management that prevents tests from stepping on each other.

.

.

.

The Full Prompt (Copy This)

Here’s the complete command file to drop into .claude/commands/run_test_plan.md:

PROMPT: Execute Test Plan Using Claude Code Task Management System
We are executing the test plan. All implementation is complete. Now we verify it works.

## Reference Documents

- **Test Plan:** `notes/test_plan.md`
- **Implementation Plan:** `notes/impl_plan.md`
- **Specs:** `notes/specs.md`
- **Test Status JSON:** `notes/test-status.json`
- **Test Results Log:** `notes/test-results.md`

---

## Phase 1: Initialize

### Step 1: Check for Existing Run (Resumption)

Before creating anything, check if a previous test run exists:

1. Check if `notes/test-status.json` exists
2. Check if there are existing tasks via `TaskList`

**If both exist and tasks have results:**
- This is a **resumed run** — skip to Phase 2 (Step 7)
- Announce: "Resuming previous test run. Skipping already-passed tests."
- Only execute tasks that are still `pending` or `fail` (with fixAttempts < 3)

**If no previous run exists (or files are missing):**
- Continue with fresh initialization below

### Step 2: Read the Test Plan

Read `notes/test_plan.md` and extract ALL test cases. Auto-detect the TC-ID pattern used (e.g., `TC-001`, `TC-101`, `TC-5A`, etc.).

For each test case, note:

- TC ID
- Name
- Priority (Critical / High / Medium / Low — default to Medium if not stated)
- Preconditions
- Test steps and expected outcomes
- Test data (if any)
- Dependencies on other test cases (if any)

### Step 3: Analyze Test Dependencies

Determine which test cases depend on others. Common dependency patterns:

- A "saves data" test may depend on a "displays default" test
- A "form submission" test may depend on "form validation" tests
- An "end-to-end" test may depend on individual component tests

If no clear dependencies exist between test cases, treat them all as independent.

### Step 4: Create Tasks

Use `TaskCreate` to create one task per test case. Set `blocked_by` based on the dependency analysis.

**Task description format:**

```
Test [TC-ID]: [Test Name]
Priority: [Priority]

Preconditions:
- [Required state before test]

Steps:
| Step | Action | Expected Result |
|------|--------|-----------------|
| 1 | [Action] | [Result] |
| 2 | [Action] | [Result] |

Test Data:
- [Field]: [Value]

Expected Outcome: [Final verification]

Environment:
- Refer to CLAUDE.md for wp-env details, URLs, and credentials
- WordPress site: http://localhost:8105
- Admin: http://localhost:8105/wp-admin (admin/password)

---
fixAttempts: 0
result: pending
lastTestedAt: null
notes:
```

### Step 5: Generate Test Status JSON

Create `notes/test-status.json`:

```json
{
    "metadata": {
        "testPlanSource": "notes/test_plan.md",
        "totalIterations": 0,
        "maxIterations": 50,
        "startedAt": null,
        "lastUpdatedAt": null,
        "summary": {
            "total": "<count>",
            "pending": "<count>",
            "pass": 0,
            "fail": 0,
            "knownIssue": 0
        }
    },
    "testCases": {
        "<TC-ID>": {
            "name": "Test case name",
            "priority": "Critical|High|Medium|Low",
            "status": "pending",
            "fixAttempts": 0,
            "notes": "",
            "lastTestedAt": null
        }
    },
    "knownIssues": []
}
```

### Step 6: Initialize Test Results Log

Create `notes/test-results.md`:

```markdown
# Test Results

**Test Plan:** notes/test_plan.md
**Started:** [CURRENT_TIMESTAMP]

## Execution Log
```

### Verify Initialization

Use `TaskList` to confirm:
- All TC-IDs from the test plan have a corresponding task
- Dependencies are correctly set via `blocked_by`
- All tasks show `result: pending`

Cross-check task count matches `summary.total` in `notes/test-status.json`.

---

## Phase 2: Execute Tests

### Step 7: Determine Execution Order

Use `TaskList` to read all tasks and their `blocked_by` fields. Determine sequential execution order:

1. Tasks with no `blocked_by` (or all dependencies resolved) come first
2. Tasks whose dependencies are resolved come next
3. Continue until all tasks are ordered

**For resumed runs:** Skip tasks where `result` is already `pass` or `known_issue`.

### Step 8: Execute One Task at a Time

For the next eligible task, spawn ONE sub-agent with the instructions below.

**One sub-agent at a time. Do NOT spawn multiple sub-agents in parallel.**

---

#### Sub-Agent Instructions

**You are a test execution sub-agent. You have ONE job: execute and verify ONE test case.**

1. **Read your task** using `TaskGet` to get the full description
2. **Parse the test steps** from the description (everything above the `---` separator)
3. **Parse the metadata** from below the `---` separator
4. **Read CLAUDE.md** for environment details, URLs, and credentials

5. **Execute the test:**

    Using browser automation:
    - Navigate to URLs specified in the test steps
    - Click buttons/links as described
    - Fill form inputs with the test data provided
    - Take screenshots at key verification points
    - Read console logs for errors
    - Verify DOM state matches expected outcomes

    Follow the test plan steps EXACTLY. Do not skip steps.

6. **Determine the result:**

    **PASS** if:
    - All expected outcomes verified
    - No unexpected console errors
    - UI state matches test plan

    **FAIL** if:
    - Any expected outcome not met
    - Unexpected errors
    - UI state doesn't match

7. **If PASS:** Update the task description metadata via `TaskUpdate`:

    ```
    ---
    fixAttempts: 0
    result: pass
    lastTestedAt: [CURRENT_TIMESTAMP]
    notes: [Brief description of what was verified]
    ```

    Mark the task as `completed`.

8. **If FAIL and fixAttempts < 3:**

    a. Analyze the root cause
    b. Implement a fix in the codebase
    c. Increment fixAttempts and update via `TaskUpdate`:

    ```
    ---
    fixAttempts: [previous + 1]
    result: fail
    lastTestedAt: [CURRENT_TIMESTAMP]
    notes: [What failed, root cause, what fix was applied]
    ```

    d. Re-run the test steps to verify the fix
    e. If now passing, set `result: pass` and mark task as `completed`
    f. If still failing and fixAttempts < 3, repeat from (a)

9. **If FAIL and fixAttempts >= 3:** Mark as known issue via `TaskUpdate`:

    ```
    ---
    fixAttempts: 3
    result: known_issue
    lastTestedAt: [CURRENT_TIMESTAMP]
    notes: KI — [Description of the issue, steps to reproduce, severity, suggested fix]
    ```

    Mark the task as `completed`.

10. **Update Test Status JSON** — Read `notes/test-status.json`, update the test case entry and recalculate summary counts, then write back:

    - Set `status` to `pass`, `fail`, or `known_issue`
    - Update `fixAttempts`, `notes`, `lastTestedAt`
    - Increment `metadata.totalIterations`
    - Update `metadata.lastUpdatedAt`
    - Recalculate `metadata.summary` counts
    - If known_issue, add entry to `knownIssues` array

11. **Append to test results log** (`notes/test-results.md`):

    ```markdown
    ## [TC-ID] — [Test Name]

    **Result:** PASS | FAIL | KNOWN ISSUE
    **Tested At:** [TIMESTAMP]
    **Fix Attempts:** [N]

    **What happened:**
    [Brief description of test execution]

    **Notes:**
    [Observations, errors, or fixes attempted]

    ---
    ```

**CRITICAL: Before finishing, verify you have updated ALL THREE locations:**

1. Task description (metadata below `---` separator) via `TaskUpdate`
2. `notes/test-status.json` (test case entry + summary counts)
3. `notes/test-results.md` (appended human-readable entry)

Missing ANY of these = incomplete iteration.

---

### Step 9: Verify and Continue

After each sub-agent finishes, the orchestrator:

1. Uses `TaskGet` to verify the task description metadata was updated
2. Reads `notes/test-status.json` to confirm JSON was updated and summary counts are correct
3. Reads `notes/test-results.md` to confirm a new entry was appended
4. **If any location was NOT updated**, update it before proceeding
5. Determines the next eligible task (unresolved, dependencies met)
6. Spawns the next sub-agent (back to Step 8)

### Step 10: Repeat Until All Resolved

Continue until ALL tasks have `result: pass` or `result: known_issue`.

```
Completion check:
  - result: pass         → resolved
  - result: known_issue  → resolved
  - result: fail         → needs re-test (if fixAttempts < 3)
  - result: pending      → not yet tested

ALL resolved? → Phase 3 (Summary)
Otherwise?    → Next task
```

---

## Phase 3: Summary

### Step 11: Generate Final Summary

When all tasks are resolved, append a final summary to `notes/test-results.md`:

```markdown
# Final Summary

**Completed:** [TIMESTAMP]
**Total Test Cases:** [N]
**Passed:** [N]
**Known Issues:** [N]

## Results

| TC | Name | Priority | Result | Fix Attempts |
|----|------|----------|--------|--------------|
| TC-XXX | [Name] | High | PASS | 0 |
| TC-YYY | [Name] | Medium | KNOWN ISSUE | 3 |

## Known Issues Detail

### KI-001: [TC-ID] — [Issue Title]

**Severity:** [low|medium|high|critical]
**Steps to Reproduce:** [How to see the bug]
**Suggested Fix:** [Potential solution if known]

## Recommendations

[Any follow-up actions needed]
```

---

## Rules Summary

| Rule | Description |
|------|-------------|
| 1:1 Mapping | One task per test case — no grouping |
| Dependencies | Use `blocked_by` to enforce test execution order |
| Sequential | One sub-agent at a time — do NOT spawn multiple in parallel |
| Sub-Agents | One sub-agent per task — fresh context, focused execution |
| Max 3 Attempts | After 3 fix attempts → mark as `known_issue` |
| Metadata in Description | Track `fixAttempts`, `result`, `lastTestedAt`, `notes` below `---` separator |
| Test Status JSON | Always update `notes/test-status.json` after each test |
| Log Everything | Append results to `notes/test-results.md` for human review |
| Resumable | Detect existing run state and continue from where it left off |
| Completion | All tasks resolved = all results are `pass` or `known_issue` |

## Do NOT

- Spawn multiple sub-agents in parallel — execute ONE at a time
- Leave tasks in `fail` state without either retrying or escalating to `known_issue`
- Modify test plan steps — execute them exactly as written
- Forget to update `notes/test-status.json` after each test
- Forget to append to the test results log after each test
- Skip the dependency analysis
- Use `alert()` or `confirm()` in any fix (see CLAUDE.md)

.

.

.

Your Turn

If you’ve been frustrated with AI-generated code that “works” but doesn’t actually work, give this a shot.

Define your success criteria upfront with a solid test plan. Let the Claude Code testing workflow handle execution and verification through task management. Walk away while it iterates.

The test-fix-retest loop is boring. Tedious. The kind of thing every developer has always done manually.

Now you don’t have to.

What feature are you going to test with this workflow?

Set it up. Let it run. Come back to green checkmarks.

(And maybe grab a coffee while you wait. Your backpack is empty now—you’ve earned the rest.)