
Category: The Art of Vibe Coding

9 min read The Art of Vibe Coding

Talk Like a Caveman, Save > 75% on Claude Code Usage (I Tested It)

A Reddit post hit r/ClaudeAI on April 3rd and absolutely exploded.

The title: “Taught Claude to talk like a caveman to use 75% less tokens.”

10,000 upvotes. Hundreds of comments. Half the thread was laughing. The other half was already adding it to their projects.

Here’s what Claude Code caveman mode looks like in practice:

Normal Claude: “Added validation with: blur validates when focus leaves, input re-validates as user types, submit validates all fields. Each field uses the existing .error / .valid CSS hooks already in the file, so no style changes were needed.”

Caveman Claude: “Done.”

Same task. Same code quality. Wildly different token bills.

And here’s the thing — I’d been scrolling past Claude’s explanations for weeks without realizing it. Helpful bullets explaining what the code does. Notes about CSS hooks. Context I already understood. The code was always fine. Everything around it was for an audience of nobody.

Credit where it’s due: Reddit user flatty kicked this off, and Drona Gangarapu (3.3k stars on GitHub) took the concept and productized it into a polished, drop-in CLAUDE.md file with actual benchmarks.

I wanted to test it myself. On a real coding task. With real results.

The original Reddit post by flatty about caveman mode with 10K upvotes on r/ClaudeAI

.

.

.

What Is Caveman Mode?

Caveman mode is a prompt instruction that tells Claude to strip all filler from its output.

No articles (“the”, “a”). No pleasantries (“Great question!”). No restating your problem back to you. No unsolicited explanations. No “Let me know if you’d like me to adjust anything!” sign-offs.

Just the answer.

Here’s the instruction I used:

Respond like a caveman. No articles, no filler words, no pleasantries.
Short. Direct. Grunt-level brevity. Code speaks for itself.
If me ask for code, give code. No explain unless me ask.

Why does this work?

Claude’s output tokens are the expensive part of any API call: on Anthropic’s published pricing, output tokens cost roughly 5x what input tokens cost. Every “Let me walk you through this…” and “That’s a great approach!” is burning tokens on words that carry zero information for the developer reading them.

I learned this the hard way. I hit my usage limit on a Tuesday afternoon — right in the middle of a productive streak. When I looked at what had actually consumed those tokens, a depressing amount was Claude being polite. Greetings I never read. Summaries of things I’d just asked. Sign-offs I scrolled past. The code itself was maybe 40% of the output.

You already know what you asked. You don’t need Claude to repeat it back to you. You don’t need a greeting. You don’t need encouragement.

You need the code.

Caveman mode eliminates the social performance.

.

.

.

The Test: Normal vs. Caveman on a Real Coding Task

Time to put caveman mode through a real task.

I have a styled contact form — four fields (name, email, subject, message), a submit button, and clean UI. No JavaScript yet. The form looks great but does absolutely nothing when you hit “Send Message.”

The styled contact form before any validation — four fields, a submit button, and nothing else

Here’s the exact prompt I gave Claude, identical in both runs:

Add JavaScript input validation to this contact form. Validate name
(required, 2+ chars), email (required, valid format), subject (required),
and message (required, 10+ chars). Show inline error messages. Validate
on blur and on submit.

Normal Mode Response

Claude Code terminal showing normal mode response — code plus bulleted explanation of validation behavior and CSS hooks note

Both modes wrote the code. Both modes got the validation right. But look at what comes after the code in normal mode: a bulleted explanation of the validation behavior, a note about CSS hooks, context about what each event does. Helpful? Sure. But all of it wraps the exact same validation code, and that code speaks for itself.

Caveman Mode Response

Claude Code terminal showing caveman mode response — just code and a single word: Done.

Same prompt. Same form. Same working validation code.

One word: “Done.”

The Numbers

Here’s where it gets real.

The non-code explanation in normal mode? 377 characters. The caveman equivalent? 5 characters. That’s “Done.” — period included.

377 to 5. A 99% reduction in the explanation wrapper.

Now multiply that across a full coding session. If you send Claude 30 prompts in an afternoon — and you probably do — that’s 30 explanations you didn’t ask for, 30 sign-offs you never read, 30 “here’s what this does” summaries for code you wrote the prompt for. Those tokens add up fast.
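The arithmetic is easy to check. Here is a minimal sketch using the 377 and 5 character counts from this test. The 30-prompt session is a rough assumption, and characters are only a proxy for tokens:

```typescript
// Character counts come from the test above; the prompt count is an assumption.
const normalChars = 377;
const cavemanChars = 5;
const promptsPerSession = 30;

// Percentage by which `after` is smaller than `before`.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const reduction = percentReduction(normalChars, cavemanChars);
const charsSavedPerSession = (normalChars - cavemanChars) * promptsPerSession;

console.log(`${reduction.toFixed(1)}% smaller wrapper`); // 98.7% smaller wrapper
console.log(`${charsSavedPerSession} characters saved per session`); // 11160 characters saved per session
```

Real explanations vary in length from prompt to prompt, so treat the per-session figure as an order of magnitude, not a measurement.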

Drona Gangarapu’s benchmarks across five different prompts showed a consistent ~63% total word reduction when you factor in both code and explanation. But the explanation wrapper — the part that caveman mode actually targets — is where nearly all the savings come from.

The code is virtually identical in both cases. Same validateField function. Same event listeners. Same submit handler. Caveman mode cuts the wrapper, not the work product.
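For reference, the rules in that prompt fit in a few lines. This is my own sketch of a validateField-style function, not the code Claude produced in either run. The error messages, the email pattern, and the validateAll helper are assumptions, and the DOM wiring for blur and submit is left out so it runs anywhere:

```typescript
type FieldName = "name" | "email" | "subject" | "message";

// Deliberately simple email check (an assumption): something@something.tld, no spaces.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Returns an error message for the field, or null when the value is valid.
function validateField(field: FieldName, rawValue: string): string | null {
  const value = rawValue.trim();
  if (value === "") return "This field is required.";
  if (field === "name" && value.length < 2) return "Name must be at least 2 characters.";
  if (field === "email" && !EMAIL_PATTERN.test(value)) return "Enter a valid email address.";
  if (field === "message" && value.length < 10) return "Message must be at least 10 characters.";
  return null; // subject only has to be non-empty
}

// Validates every field at once, the way a submit handler would.
function validateAll(values: Record<FieldName, string>): Partial<Record<FieldName, string>> {
  const errors: Partial<Record<FieldName, string>> = {};
  for (const field of Object.keys(values) as FieldName[]) {
    const error = validateField(field, values[field]);
    if (error !== null) errors[field] = error;
  }
  return errors;
}
```

In the browser, a blur listener would call validateField for one input and the submit listener would call validateAll before letting the form through.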

.

.

.

Me Tell You Where Put Caveman Words

You’ve seen the results. Now — where should you actually put the caveman instruction?

Three options, ranked from best to worst.

1. In Your Project’s CLAUDE.md (Best)

This is the best place for your Claude Code caveman mode instruction. It persists across sessions and applies to every prompt in that project: set it once and forget it.

Add this to the top of your CLAUDE.md:

## Communication Style

Respond like a caveman. No articles, no filler words, no pleasantries.
Short. Direct. Code speaks for itself.
If asked for code, give code. No explain unless asked.
No sycophancy. No restating the question. No sign-offs.

Why the top? Claude processes CLAUDE.md instructions in order, and primacy effects matter. Communication style should be established before anything else.

The project-level approach also gives you control — caveman mode on your personal projects, normal mode on client work. Different projects, different communication styles. One file each.

If you’re not using CLAUDE.md yet, start with The Single File That Makes or Breaks Your Claude Code Workflow. It covers why this file matters and how to structure it.

2. In Your Prompt Directly (Good for Testing)

Prepend the instruction to any prompt:

[caveman mode: no filler, no pleasantries, code only] Add JavaScript
input validation to this contact form...

This is how most people start — and it works fine for a test drive.

The downside: you’re typing it every time, and it adds input tokens to every single message. If you like the results, move it to CLAUDE.md and stop paying the repeated input cost.

3. In ~/.claude/CLAUDE.md (Global — Use Carefully)

This applies caveman mode to every project on your machine. Only do this if you want terse output everywhere. Most people should keep it project-level.

Bonus: As a Claude Code Skill

If you want cleaner separation of concerns — keeping your CLAUDE.md focused on project instructions while communication style lives in its own toggleable unit — Thomas Schlossmacher’s caveman-mode skill packages the whole thing as a drop-in .claude/skills/ file.

Worth a look if you manage multiple communication styles across projects.

Best Practices

  • Put it at the top of CLAUDE.md. Primacy effect means early instructions carry more weight.
  • Combine with other token-saving rules. “Don’t restate my question” and “Skip the sign-off” stack well with caveman mode.
  • Be specific about what to keep. If you still want code comments, say so: “Keep code comments. Skip everything else.”
  • Test first. Run a few prompts before committing it to CLAUDE.md permanently. (You’ll know within two prompts whether you love it or hate it.)

.

.

.

When Caveman Mode Bad. When Skip.

Caveman mode has real tradeoffs. And being honest about them is what makes the technique actually useful — instead of just another internet hack you try once and forget.

Skip it when you’re learning something new.

If you’re asking Claude to explain async/await, database indexes, or CSS grid — you want the verbose explanation. Those filler words become teaching words when you’re building a mental model. Caveman mode strips the pedagogy, and that’s a real loss when pedagogy is the whole point.

Skip it for complex architecture discussions.

“Use microservices” is a caveman answer. But what you actually need is: “Here’s why microservices fit your use case, here are the tradeoffs, and here’s what will break if your team is under five people.” When you need Claude to reason through options with you, let it reason.

Skip it when sharing outputs with collaborators.

If teammates read your Claude Code outputs or review AI-generated code, they need the context that caveman mode strips. Readability matters when the audience hasn’t seen the original prompt. (I learned this one the slightly awkward way.)

Skip it for debugging unfamiliar errors.

When you’re stuck on a cryptic error and need Claude to walk you through what’s happening, the detailed explanation is the value. “Fix: change line 42” doesn’t help if you don’t understand why line 42 was wrong in the first place.

The rule of thumb: use caveman mode when you know what you want and just need Claude to produce it. Skip it when you need Claude to think with you.

And here’s the good part — you can switch freely.

Caveman mode in your CLAUDE.md for daily coding. Remove it (or override it in the prompt) when you need the full Claude experience. Per-project. Per-session. Per-prompt. No commitment necessary.

.

.

.

Me Save Tokens. You Save Tokens. Community Win.

Caveman mode is funny. A developer on Reddit taught an AI to grunt, and thousands of people immediately started saving money.

That’s the internet at its best.

But zoom out and there’s something real underneath the joke. As AI coding tools move to per-token billing, being intentional about output verbosity becomes a genuine skill. And not just for your wallet — less fluff means faster responses, less scrolling, and more signal per screen.

The community drove this one. flatty on Reddit who made everyone laugh while solving a real problem. Drona Gangarapu who turned the concept into a benchmarked, production-ready CLAUDE.md file. The thousands of developers riffing on their own variations. Good ideas have a way of finding their people.

If you want to go further on the token-saving front — front-loading your prompts, optimizing your CLAUDE.md structure, getting more done within your usage limits — check out How to Double Your Claude Code Usage Limits. Caveman mode is one lever. There are more.

Me done. You go try. Report back.

11 min read The Art of Vibe Coding

Codex Reviews My Code Inside Claude Code — But I Don’t Trust It Blindly

I’ve been building something I can’t fully show you yet.

It’s a Chrome extension called PinFlow. The idea: you browse a page, click on any element, attach an instruction to it, and those instructions get routed straight into a local Claude Code session. No tab switching. No copy-pasting selectors. You pick, you describe, Claude edits your code.

The original PinFlow extension sidebar open on a Google page, showing the element picker with a "Pick an element first" prompt

I’ll cover how PinFlow works in a dedicated post in the future. (Subscribe if you don’t want to miss that one.)

But today’s story starts after I finished a major UI redesign of that extension.

The code had gotten complex. Multi-step wizard flows, state management across views, permission handling, concurrent request logic. The kind of complexity where you know bugs are hiding somewhere — you just can’t see them yet.

I needed a second pair of eyes.

Normally, that meant switching over to Codex in a separate terminal, running a review there, then hauling the results back to Claude Code. I’ve done this workflow dozens of times — I even wrote about it back in October 2025.

This time, I didn’t have to switch at all.

There’s a plugin for that now.

.

.

.

What Is the Claude Code Codex Plugin?

On March 30, 2026, OpenAI shipped an official Claude Code Codex plugin (openai/codex-plugin-cc). It lets you run Codex code reviews, adversarial reviews, and delegate tasks to Codex — all from inside your Claude Code session.

A few things worth knowing:

  • Free to use with any ChatGPT subscription, including the Free tier
  • Uses your local Codex CLI — same auth, same config, same models
  • Runs as a Claude Code plugin — the new plugin system, so it lives inside your session
  • 2,500+ GitHub stars in one day — the community noticed fast

If you’ve been following along, you’ll recognize the workflow this replaces.

Back in September 2025, I wrote about using Claude Code and Codex as separate tools in separate terminal windows. In October, I refined that into a structured handoff: Codex plans → Claude builds → Codex reviews.

The plugin collapses all of that into slash commands. No window switching. No copy-pasting context between tools. The review happens right where the code was written.

Install and Setup

Four commands. Under 2 minutes. That’s it.

/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup

If you don’t have Codex installed yet, /codex:setup handles that for you. If Codex isn’t logged in, run !codex login from within Claude Code — the ! prefix executes shell commands in your session.

After installation, you’ll see the new slash commands and the codex:codex-rescue subagent ready to go.

What Commands Do You Get?

The plugin ships with 7 commands:

| Command | What it does | Read-only? |
| --- | --- | --- |
| /codex:review | Standard code review of uncommitted changes or branch diff | Yes |
| /codex:adversarial-review | Steerable challenge review: questions design, tradeoffs, assumptions | Yes |
| /codex:rescue | Delegate a task to Codex (bug investigation, fixes, cheaper model pass) | No |
| /codex:status | Check progress on background Codex jobs | |
| /codex:result | Show final output of a finished job | |
| /codex:cancel | Cancel an active background job | |
| /codex:setup | Check/install Codex, manage review gate | |

The two I reach for most:

/codex:review — the bread and butter. Point it at your current changes and get a review. Supports --base main for branch diffs and --background for long-running reviews. Or --wait if you want to stay in the session until the review finishes.

/codex:adversarial-review — the pressure test. Unlike the standard review, you can steer it: “look for race conditions,” “challenge whether this caching approach is right.” I pull this one out before shipping anything risky.

There’s also /codex:rescue, which is the only command that can change code. It hands a task to Codex and supports different models (--model gpt-5.4-mini for quick passes). Think of it as delegating grunt work to a cheaper model while you stay focused.

.

.

.

The Demo: Reviewing a Real Redesign

Here’s where it gets concrete.

I was redesigning PinFlow’s sidebar UI — moving from a single-element component to a full multi-step wizard with Pick → Write → Review steps, multi-pin support, and shared/per-pin instruction modes. A big change.

I gave Claude Code the task with my redesign notes:

Claude Code prompt: "I want to redesign the sidebar UI for the chrome extension based on the redesign notes here: @notes/new_ui_redesign.md"

Claude explored the codebase, reviewed the wireframes, and came back with clarifying questions — architecture decisions, multi-pin picking strategy, scope for the reference mode, how far to go with the running and done states.

Claude exploring the codebase and asking its first architecture question about keeping vanilla HTML strings vs. introducing a lightweight component library
Round 1 of clarifying questions covering architecture, multi-pin support, reference mode, and implementation scope
Round 2 of clarifying questions on send mode and reference UI placement

Two rounds of questions later, Claude had enough context to plan.

It created a task list — three-step wizard, multi-pin picking, write step with shared/per-pin modes, review step, running and done states, activity view, submit flow, and prompt builder.

Claude creating implementation tasks: update types, rewrite render shell, implement Pick step, Write step, Review step, Running/Done states, Activity view, submit flow, and wire up event handlers

8 minutes and 45 seconds later, the redesign was complete.

Implementation summary showing the new architecture: 3-step wizard with Pick/Write/Review steps, 4 views, multi-pin support, shared and per-pin instruction modes, running and done states, activity view, and updated submit flow

A full sidebar redesign. New wizard flow. New state management. New views. All from a single Claude Code session.

But here’s the thing — when that much code changes at once, edge cases don’t announce themselves. They hide in the seams between states, waiting for a user to stumble into them.

I could feel it. Time for Codex.

Triggering the Review

/codex:review --wait
Triggering /codex:review --wait in Claude Code

The --wait flag keeps the session active until the review finishes. Behind the scenes, the plugin spins up a Codex review thread against your uncommitted changes.

Codex starting the review thread, showing the bash command running the codex-companion script, thread ID, and "Reviewer started: current changes" with a 1-minute timeout

6 minutes 35 seconds later, the results came back.

The Review Results

Codex Review Results showing 4 issues found — all edge-case correctness issues. A table lists: P1 Submit silently no-ops when no project configured, P2 Step bar allows jumping to Review without instructions, P2 Back from Activity strands running requests, P2 Concurrent requests can overwrite lastResult and currentView. Ends with "Want me to fix these issues?"

4 issues found. All related to edge-case correctness rather than the core redesign:

| Priority | Issue |
| --- | --- |
| P1 | Submit silently no-ops when no project is configured (no user feedback) |
| P2 | Step bar allows jumping to Review without writing instructions |
| P2 | “Back” from Activity always goes to wizard, even if a request is running |
| P2 | Concurrent requests can overwrite each other’s lastResult and currentView |

Every single one of these is the kind of bug that slips through during a big redesign. You’re focused on the main flow — the happy path — and the edge cases hide in the seams between states.

At the bottom of the review: “Want me to fix these issues?”

I could have said yes. Let Claude apply all 4 fixes and move on with my day.

I’ve been on the other side of that decision. Said “yes, fix everything” on a review once, walked away, came back to a diff full of renamed variables and reshuffled imports that had nothing to do with the actual bugs. Took longer to untangle than the original review would have.

So no. I didn’t say yes.

.

.

.

The Validation Prompt: Where the Real Value Lives

Here’s what I do instead — and honestly, this is the part I want you to steal.

After receiving Codex’s review comments, I paste this prompt:

let's address the code review comments provided.

Follow the steps below to effectively address the code review comments:

1/ First, you should analyze the code review comments carefully and understand the feedback given.
2/ Then, determine if the comments given are valid and we should make changes to the code based on the feedback.
3/ If the comments are valid, you should make the necessary changes to the code to address the comments. If you believe the comments are not valid, you should provide a clear explanation to justify why you think the comments are not valid.

Use the AskUserQuestion tool to ask me clarifying questions until you are 95% confident you can complete this task successfully. For each question, add your recommendation (with reason why) below the options. This would help me in making a better decision.
The validation prompt pasted into Claude Code after receiving Codex's review comments

What happens next is the key insight.

Claude reads the Codex review. It analyzes each comment against the actual codebase — the code it just wrote, with full context of why things are structured that way. And instead of blindly applying everything, it comes back with a verdict and clarifying questions.

Claude's response: "All 4 comments are valid. Most fixes are straightforward, but two have design decisions worth confirming." Shows AskUserQuestion with two questions — P1: how to handle the no-project case (recommends Disable Send + inline hint) and P2: how to handle concurrent requests (recommends Prevent new submissions)

Look at what Claude did here:

“All 4 comments are valid. Most fixes are straightforward, but two have design decisions worth confirming.”

For the straightforward fixes, Claude proceeds. For the ones with judgment calls, it asks — with a recommendation and reasoning for each option:

  • P1 — No project configured: When no project is set, the Send button silently does nothing. How should we handle this? Claude recommends: Disable Send + inline hint.
  • P2 — Concurrent requests: A second request can start while one is already running. Should we prevent it or handle the overlap? Claude recommends: Prevent new submissions.

Each question comes with Claude’s recommendation and the reasoning behind it. I pick the recommended options for both.

This is the part that matters.

Claude becomes a filter between the review and your code. It validates each comment in context, surfaces the ones that need your judgment, and handles the rest. You stay in control without having to re-read every line yourself.

Watching the Fixes Go In

With the design decisions confirmed, Claude applies all 4 fixes.

Fix 1 (P1): Disable the Send button on the Review step when no project is configured. Show an inline hint: “No project configured — open Settings.”

Fix 2 (P2): Block step-bar navigation to Review when no instructions have been written. Add a hasInstruction() guard to the click handler.

Claude applying Fix 1 — disabling Send + showing hint for no project — and Fix 2 — adding hasInstruction guard to block premature Review navigation. Shows actual code diffs in src/content/index.ts

Fix 3 (P2): Back from Activity now checks if a request is currently running. If it is, the view returns to running instead of the wizard — preventing the user from stranding an in-flight request.

Fix 2 continued — updating the step-bar click handler condition — and Fix 3: updating the Back from Activity handler to check records.some for running state and return to the running view instead of wizard
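To make Fixes 2 and 3 concrete, here is a rough sketch of the two guards as pure functions. Only the hasInstruction name and the records.some check come from the diffs. The state shapes, the view names, canOpenStep, viewAfterActivityBack, and the rule that every pin needs text in per-pin mode are my assumptions, not PinFlow’s actual code:

```typescript
// Hypothetical shapes for illustration; PinFlow's real types are not shown in this post.
type View = "wizard" | "running" | "done" | "activity";

interface Pin { selector: string; instruction: string; }
interface WizardState {
  mode: "shared" | "per-pin";
  sharedInstruction: string;
  pins: Pin[];
}
interface RequestRecord { id: string; status: "running" | "done" | "error"; }

// Fix 2: Review is reachable only once there is something to review.
function hasInstruction(state: WizardState): boolean {
  if (state.pins.length === 0) return false;
  if (state.mode === "shared") return state.sharedInstruction.trim() !== "";
  return state.pins.every((pin) => pin.instruction.trim() !== "");
}

// Guard for the step-bar click handler.
function canOpenStep(step: "pick" | "write" | "review", state: WizardState): boolean {
  return step !== "review" || hasInstruction(state);
}

// Fix 3: Back from Activity returns to the running view while a request is in flight.
function viewAfterActivityBack(records: RequestRecord[]): View {
  return records.some((record) => record.status === "running") ? "running" : "wizard";
}
```

Keeping the guards free of DOM access makes each edge case from the review a one-line unit test.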

Fix 4 (P2): Scope lastResult and currentView updates so they only fire when the completing request matches the current one. Prevents a finishing background request from overwriting the state of a newer request.

Fix 4: Scoping lastResult/currentView updates behind a currentRequestId check in both success and error paths, preventing stale responses from overwriting current state
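The pattern behind Fix 4 is worth stealing for any UI with overlapping async work: tag each request, then ignore completions that no longer match. A minimal sketch follows. The currentRequestId check is what the diff describes, while the state shape and function names are hypothetical:

```typescript
// Hypothetical state shape; only the currentRequestId comparison mirrors the diff.
interface SidebarState {
  currentRequestId: string | null;
  currentView: "wizard" | "running" | "done";
  lastResult: string | null;
}

// Marks a new request as the current one.
function startRequest(state: SidebarState, requestId: string): void {
  state.currentRequestId = requestId;
  state.currentView = "running";
}

// Applies a finished request's result only if that request is still current.
// Returns true when state was updated, false when the response was stale.
function completeRequest(state: SidebarState, requestId: string, result: string): boolean {
  if (state.currentRequestId !== requestId) return false;
  state.lastResult = result;
  state.currentView = "done";
  return true;
}
```

The error path needs the same comparison before it touches the view, otherwise a late failure from an old request can still clobber a newer one.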

Then Claude verifies the build still compiles. All clean.

Build verification passes. Summary table showing all 4 fixes: P1 Send button disabled with inline hint, P2 hasInstruction guard added, P2 Back from Activity returns to running view, P2 Send disabled during active request with scoped state updates. "Brewed for 1m 54s"

4 issues identified. 4 fixes applied. 2 design decisions confirmed. 1 minute 54 seconds.

And here’s the part worth sitting with: if any of those review comments had been a false positive — a stylistic preference that didn’t match the codebase, or a “problem” that was actually intentional — Claude would have flagged it. It would have said “this comment suggests X, but the current approach is correct because Y” and asked whether to skip it.

That filtering step is the difference between a code review you can act on and a code review that introduces churn.

The Before and After

Remember the original PinFlow UI from the top of this post? Here’s what it looks like after the redesign and the review fixes:

New wizard flow. Clean state management. And four edge-case bugs caught before they ever reached a user.

I’ll go deep on the extension itself in a future post.

(Stay tuned for that one.)

.

.

.

The Review Gate: The Automated Alternative

The plugin also includes a review gate — a built-in hook that automatically runs a Codex review before Claude finishes a task:

/codex:setup --enable-review-gate

When enabled, every response Claude is about to complete gets intercepted for a Codex review first. If issues are found, the stop is blocked so Claude can address them.

I prefer the manual approach.

The review gate can create long-running Claude/Codex loops that drain usage limits, and it doesn’t give you the chance to filter false positives before they get fed back in. For long autonomous runs where you want a safety net, though, the gate has its place.

Think of the manual prompt as the scalpel and the review gate as the safety net — choose based on how much control you want.

.

.

.

The Bigger Picture: Claude and Codex, Integrated

Let me zoom out for a second. My Claude-Codex workflow has gone through three distinct phases:

1. Side by side (Sept 2025) — Separate tools, separate terminal windows, separate contexts. I used to keep two terminals open — Claude Code on the left, Codex on the right. Copy a file path from the review, switch windows, find the line, switch back. By the third comment I’d lost track of what I was even fixing.

2. Manual handoff (Oct 2025) — Structured workflow with Codex planning and reviewing, Claude building. Better. But still separate tools with separate contexts.

3. Integrated (now) — Codex commands running inside Claude Code. Shared context. No switching. The review happens where the code lives.

Each evolution removed friction. The Claude Code Codex plugin removes the last meaningful barrier: context loss between tools.

And when I pair that with the validation prompt — having Claude critically evaluate Codex’s feedback before acting on it — I get a review workflow that catches real bugs without drowning me in noise.

Between the Codex plugin and the Chrome extension I teased at the top, the direction feels clear. The tools are converging. The best workflow is the one where you never have to leave.

.

.

.

Your Next Steps

The plugin takes 2 minutes to install. The validation prompt is 6 lines you can copy-paste.

Together, they give you a code review workflow that catches real issues — and lets you skip the noise.

Here’s what to do:

  1. Install the plugin (4 commands above)
  2. Run /codex:review on whatever you’re working on right now
  3. Paste the validation prompt and let Claude filter the results
  4. Fix what matters. Skip what doesn’t.

Try it on your next session. You’ll be surprised how many review comments are noise — and how valuable the ones that survive the filter actually are.

Plugin repo: openai/codex-plugin-cc

12 min read The Art of Vibe Coding

Workflow Engineering in Action: Building a Reddit Summarizer From Scratch With Claude Code

Here’s a confession.

I follow about a dozen subreddits. AI tooling, Claude Code tips, local LLM experiments, dev workflows. And every single morning, I open Reddit fully intending to spend five minutes catching up.

Forty-five minutes later, I’m still scrolling.

Ninety percent of it is noise. Reposts, complaints (like those weekly usage/rate limits rants in r/ClaudeCode), low-effort memes, questions that got answered three threads ago. But buried somewhere in there — a workflow trick someone discovered at 2am, a Claude Code hack that actually works in production, a case study with real numbers — that stuff is gold.

I just couldn’t find it fast enough.

So I decided to build something. A simple Express server that would connect to the Reddit API, pull posts and comments from my favorite subreddits, store them locally as JSON files, and let me point Claude at the data to surface only what matters.

And here’s the part that matters for you: I built it using the Claude Code Workflow Engineering process I described in the previous issue. Start to finish. No shortcuts. No “eh, I’ll just wing this part.”

(Okay, I was tempted. But I didn’t.)

What follows is every step of that process applied to a real project — from a blank folder to a working app with full tests passing on the first attempt. Every screenshot. Every command.

Stay with me.

.

.

.

The Starting Point: One Idea, Zero Code

Here’s what my project folder looked like when I started: an idea.md file describing what I wanted, and the Workflow Engineering slash commands from the previous issue.

That’s it. No boilerplate. No template repo. No starter code. Just an idea and a process.

Project folder showing only the idea.md file and workflow engineering commands

The idea itself was pretty straightforward: an Express server that fetches posts and comments from configured subreddits within the last 24 hours, then saves everything as JSON files organized by subreddit and date. No database — just files on disk. Once the data is collected, I can ask Claude to read it and find the good stuff for me.
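The core of that idea is two small decisions: which posts count as “last 24 hours” and where each file lands. Here is a sketch under assumptions. created_utc is the epoch-seconds timestamp Reddit attaches to posts, while the data/ folder layout, the function names, and using the UTC date for the filename are mine, not the final spec:

```typescript
// Minimal post shape; created_utc is Reddit's creation time in epoch seconds.
interface RedditPost {
  id: string;
  title: string;
  created_utc: number;
}

const DAY_IN_SECONDS = 24 * 60 * 60;

// Keeps only posts created in the 24 hours before `now`.
function postsFromLastDay(posts: RedditPost[], now: Date): RedditPost[] {
  const cutoff = now.getTime() / 1000 - DAY_IN_SECONDS;
  return posts.filter((post) => post.created_utc >= cutoff);
}

// Builds a path like data/ClaudeAI/2026-04-03.json (the layout is an assumption).
function outputPath(subreddit: string, date: Date, baseDir = "data"): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `${baseDir}/${subreddit}/${day}.json`;
}
```

Passing `now` in as a parameter, instead of reading the clock inside the function, keeps the date boundary testable.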

The one wrinkle? Reddit’s API now requires OAuth 2.0. So the app needs to handle the full authorization flow — token exchange, refresh tokens, the whole dance — before it can fetch anything.
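The first step of that dance is sending the user to Reddit’s authorization page. A minimal sketch of building that URL follows. The endpoint and query parameters follow Reddit’s OAuth2 documentation as I understand it, while the option names and the function itself are hypothetical:

```typescript
interface AuthorizeOptions {
  clientId: string;
  redirectUri: string;
  state: string;    // random value, compared again when Reddit redirects back
  scopes: string[]; // e.g. ["read"]
}

// Builds the URL the user visits to grant access.
// duration=permanent asks Reddit to issue a refresh token as well.
function buildAuthorizeUrl(options: AuthorizeOptions): string {
  const params = new URLSearchParams({
    client_id: options.clientId,
    response_type: "code",
    state: options.state,
    redirect_uri: options.redirectUri,
    duration: "permanent",
    scope: options.scopes.join(" "),
  });
  return `https://www.reddit.com/api/v1/authorize?${params.toString()}`;
}
```

Reddit redirects back with a code that the server then trades for access and refresh tokens. This sketch stops before that exchange.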

With a clear idea written down, I handed it to the workflow.

Let’s walk through what happened.

.

.

.

Step 1: Brainstorm the Specs

I triggered the /spec_brainstorm command and pointed Claude at my idea file.

Claude Code terminal showing /spec_brainstorm command being triggered with the idea.md file

Now, I’ve tried building apps like this before — dumping everything into one prompt and letting Claude run. It got through maybe 60% before the code started contradicting itself. Requirements from the top of the conversation were ghosted by the bottom.

The Claude Code Workflow Engineering approach is different. Instead of jumping into code, Claude started asking clarifying questions. Real ones. With options, explanations, and a recommendation for each.

The first round covered core architecture decisions: How should data collection be triggered? (Manual API endpoint, cron scheduler, or both?) What kind of frontend does this need beyond the OAuth setup page? How should filtering work?

Claude Code presenting multiple-choice question about data collection trigger method with options for manual API endpoint, built-in cron scheduler, or both
Claude Code asking about frontend scope with options for minimal OAuth-only page or full dashboard UI
Summary of first round answers covering Summarizer, Trigger, Frontend, and Filtering categories

The second round went deeper: How should subreddits be configured? (Config file, hardcoded, or environment variables?) What data should go into the JSON files? (Posts only, posts + all comments, or posts + top comments?) Language preference?

Summary of second round answers covering Config, Data scope, and Language with TypeScript selected

Once Claude had enough context from both rounds, it wrote the full specification document.

Claude Code writing the complete specs.md file based on all the answers provided

Two rounds of questions. Clear decisions documented in a file. The specs existed as an artifact on disk — ready to be read by a completely fresh session with zero memory of this conversation.

That last part matters more than you might think.

(We’re about to see why.)

.

.

.

Step 2: Review the Specs

Here’s where most people go wrong. And I know this because I was most people.

On an earlier project, I skipped the review step. The specs looked fine to me. Three hours into implementation, I found a conflict that would have taken a reviewer two minutes to flag. Two minutes.

So now I don’t skip it.

Here’s the thing: the agent that wrote the specs is the worst possible agent to review them. It already “knows” what it meant. It won’t catch ambiguity because it can fill in the gaps from memory. A fresh agent reading the same file cold? It has no such luxury.

New session. /spec_review command.

New Claude Code session showing /clear followed by /spec_review command

A fresh Claude instance — with zero memory of the brainstorming conversation — read the specs and started poking holes.

And it found real problems. Using GET for state-changing operations (a REST convention violation and a security risk — someone could trigger data collection just by visiting a URL). Writing refresh tokens directly to .env at runtime (which, ferpetesake, doesn’t work the way the spec assumed). Vague OAuth state storage. And more.

Claude Code presenting spec review findings including P1 GET for state-changing operations, P2 writing refresh token to .env, and P3 vague OAuth state storage
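To make the GET finding concrete, here is a minimal sketch of method gating. It assumes a hypothetical route table: `/api/status` and the `dispatch` helper are made up for illustration, and only `/api/collect-all` appears in this project. It is not the app's actual routing code.

```python
# Hypothetical route table: state-changing routes accept POST only.
ROUTES = {
    "/api/status": {"GET"},        # read-only, safe to fetch or prefetch
    "/api/collect-all": {"POST"},  # triggers collection, so GET is refused
}


def dispatch(method: str, path: str) -> int:
    """Return the HTTP status code a request would receive."""
    allowed = ROUTES.get(path)
    if allowed is None:
        return 404  # unknown route
    if method.upper() not in allowed:
        return 405  # Method Not Allowed, e.g. someone just visiting the URL
    return 200


print(dispatch("GET", "/api/collect-all"))   # 405
print(dispatch("POST", "/api/collect-all"))  # 200
```

With GET refused, a pasted link or a browser prefetch can no longer kick off a collection run by accident.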

Now, here’s where your judgment comes in. Claude surfaced a long list of potential issues — some critical, some nice-to-have. You don’t have to fix everything. You get to choose what matters.

I went through them in three tiers.

First — the spec-breaking issues:

Multi-select interface showing spec fixes with options like REST endpoint methods, OAuth state storage, error strategy, and collect-all endpoint behavior

Second — important improvements:

Second page of issue selection showing P5 Collect-all timeout, P6 Pagination limit, P7 Partial failure, and P8 Date boundary

Third — lower priority fixes (I picked the ones with real consequences):

Third page showing lower priority issues including P10 More Reddit error codes, P11 User-Agent source, and P13 Path traversal security fix

Before applying fixes, Claude asked clarifying questions to make sure the solutions would be solid. How should /collect-all handle job tracking? What should the filename date represent — when data was collected or when the post was created? Where should the Reddit username for the User-Agent header come from?

Claude asking about collect-all endpoint job tracking approach with sequential vs parallel options
Claude asking about date boundary logic for filename dates with a visual example showing collection date mapping
Claude asking about User-Agent source with env var recommended, showing example .env configuration
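The date-boundary question is easier to see in code. This is a rough sketch that assumes the filename follows the collection date and that dates are taken in UTC. The `log_filename` helper and the `YYYY-MM-DD.json` pattern are my assumptions, not something lifted from the specs.

```python
from datetime import datetime, timedelta, timezone


def log_filename(collected_at: datetime) -> str:
    """Name the file after the UTC date on which collection ran."""
    return collected_at.astimezone(timezone.utc).strftime("%Y-%m-%d") + ".json"


# Hypothetical run: collection happens at 00:30 UTC on 2 March.
collected_at = datetime(2025, 3, 2, 0, 30, tzinfo=timezone.utc)
# A post created an hour earlier (1 March) still lands in the 2 March file,
# because the post's own timestamp never enters the filename.
post_created_at = collected_at - timedelta(hours=1)

print(post_created_at.date())      # 2025-03-01
print(log_filename(collected_at))  # 2025-03-02.json
```

Naming by post creation date instead would scatter a single run across several files, which is exactly the kind of ambiguity the question was meant to settle.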

With answers in hand, Claude updated the specs — changing GET to POST for state-changing endpoints, adding proper error handling, fixing the OAuth storage approach, adding pagination limits, and patching a path traversal vulnerability.

Claude modifying the specs file with red/green diff showing changes to API endpoints and OAuth flow
Summary table of all spec changes applied, showing 10 fixes across REST methods, token storage, OAuth state, pagination, error handling, and path traversal
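For the path traversal fix, the general idea can be sketched in a few lines. This assumes files live under a `logs` folder keyed by subreddit name. `LOGS_DIR` and `safe_log_path` are hypothetical names, and the real fix in the project may look different.

```python
from pathlib import Path

LOGS_DIR = Path("logs")  # hypothetical storage root


def safe_log_path(subreddit: str, filename: str) -> Path:
    """Resolve a log path and refuse anything that escapes LOGS_DIR."""
    root = LOGS_DIR.resolve()
    candidate = (root / subreddit / filename).resolve()
    # After resolving ".." segments, the root must still be an ancestor.
    if root not in candidate.parents:
        raise ValueError(f"path escapes logs directory: {subreddit!r}")
    return candidate


print(safe_log_path("ClaudeCode", "2025-03-02.json").name)  # 2025-03-02.json

try:
    safe_log_path("../../etc", "passwd")
except ValueError as err:
    print("rejected:", err)
```

The check runs on the resolved path, so a subreddit value full of `..` segments is caught no matter how it is dressed up.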

Ten issues addressed. Specs refined. The artifact on disk now reflected a far more robust design than what the brainstorming session produced alone.

And we still haven’t written a single line of code.

(On purpose.)

.

.

.

Step 3: Write the Test Plan

I’ll be honest with you — this step almost didn’t happen.

Writing tests for code that doesn’t exist yet? It felt ceremonial. Like filling out a form nobody would read. I almost skipped it.

Then the test plan revealed two requirements I’d completely glossed over in the spec.

So now I never skip it.

Fresh session. /write_test_plan command.

New session with /clear followed by /write_test_plan command

Claude read the specs and produced a structured test plan: 33 test cases organized by priority. 8 Critical, 14 High, 10 Medium. Each one with preconditions, specific steps, and expected outcomes.

Claude writing test_plan.md with 33 test cases organized into sections covering config, OAuth, collection, filtering, storage, API, errors, frontend, and date handling

Why does this matter so much?

Because writing test cases forces deep analysis of every requirement. Turning “handle pagination limits” into a specific test case — with exact inputs, steps, and expected outputs — requires genuine understanding. Shallow understanding produces shallow tests, and you’d catch that now rather than three hours into debugging.

And there’s a second benefit: the test plan gives implementation a concrete target. Every task will map to specific test cases. “Done” stops being a gut feeling and starts being a checkmark.
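Here is roughly what "turning a requirement into a test case" means, as a small sketch. The `PlannedTest` structure mirrors the sections described above (preconditions, steps, expected outcomes, priority), but the field names, the IDs, and the example wording are all hypothetical rather than copied from the real test_plan.md.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedTest:
    case_id: str
    priority: str  # "Critical", "High", or "Medium"
    requirement: str
    preconditions: List[str]
    steps: List[str]
    expected: List[str]


def is_concrete(tc: PlannedTest) -> bool:
    """A case only counts if every section has at least one entry."""
    return all([tc.preconditions, tc.steps, tc.expected])


# Hypothetical example: the ID, priority, and wording are made up.
pagination = PlannedTest(
    case_id="TC-X",
    priority="High",
    requirement="handle pagination limits",
    preconditions=["Subreddit has more posts than fit in the page limit"],
    steps=["Trigger a collection for that subreddit"],
    expected=["Collector stops after the configured number of pages"],
)

vague = PlannedTest("TC-Y", "High", "handle pagination limits", [], [], [])

print(is_concrete(pagination))  # True
print(is_concrete(vague))       # False
```

A requirement that cannot be filled in this way is usually a requirement nobody has fully understood yet.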

.

.

.

Step 4: Write the Implementation Plan

Another fresh session. /write_impl_plan command.

New session with /clear followed by /write_impl_plan command

Claude read both the specs and the test plan, then generated an implementation plan — 10 tasks, each explicitly linked to the test cases it would satisfy.

Claude writing implementation plan showing project overview and task structure with dependencies

The plan organized tasks into execution waves based on dependencies. Every task mapped to specific test case IDs.

Implementation plan summary table showing 10 tasks with their key test case mappings and execution order across 5 waves
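Grouping tasks into waves is essentially a layered topological sort: a task joins the next wave once everything it depends on has finished. A minimal sketch follows, with hypothetical task names. The real plan had 10 tasks, and I have not reproduced its actual dependency graph here.

```python
def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; each wave depends only on earlier waves."""
    remaining = {task: set(needs) for task, needs in deps.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(t for t, needs in remaining.items() if needs <= done)
        if not ready:
            # Nothing can start: a cycle, or a dependency that is not a task.
            raise ValueError("unresolvable dependencies")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical dependency graph, for illustration only.
deps = {
    "scaffold": set(),
    "config": {"scaffold"},
    "oauth": {"scaffold"},
    "collect": {"config", "oauth"},
    "api": {"collect"},
}
print(plan_waves(deps))
# [['scaffold'], ['config', 'oauth'], ['collect'], ['api']]
```

Tasks inside one wave have no dependencies on each other, which is what makes it safe to hand them to parallel sub-agents.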

This is the last thinking step. After this, every design decision has been made. Every task has a defined scope. Every success criterion sits in a file on disk.

Now — and only now — we build.

.

.

.

Step 5: Execute the Implementation

Fresh session. /do_impl_plan command.

New session with /clear followed by /do_impl_plan command

Here’s where the Claude Code Workflow Engineering approach earns its keep.

Instead of running all 10 tasks in a single session (which would cause context degradation as the window fills up — I’ve been there, remember?), Claude created each task and processed them in waves using sub-agents. Each sub-agent got a fresh context window. It read the implementation plan from disk, found its assigned task, and executed with laser focus.

Wave 1 started with the foundation — project scaffolding.

Claude creating all tasks with dependencies and executing Wave 1 Task 1 for project scaffolding

Then the waves rolled forward, with parallel tasks running wherever dependencies allowed:

Wave progression showing tasks running in parallel — config loading, OAuth routes, collection logic, and filtering all executing concurrently across waves
Later waves handling comment fetching, error handling, and storage management with sub-agents completing tasks
Final implementation waves covering API routes and the frontend OAuth setup page

Implementation done. 16 files created. All 10 tasks completed across 5 waves.

Implementation summary showing all files created with their purposes — from package.json to the OAuth frontend page

Every sub-agent worked from the same artifact — the implementation plan on disk. No context bleeding between tasks. No “forgetting” early requirements while working on later ones.

Fresh context, every single wave.

.

.

.

Step 6: Setup Before Testing

Before running the test plan, I needed to set up the actual Reddit integration. Three things:

A config file defining which subreddits to monitor (I chose ClaudeCode and ClaudeAI — for obvious reasons):

config.json file showing two subreddits configured — ClaudeCode with minScore 10 and minComments 5, and ClaudeAI with defaults
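To show how per-subreddit thresholds with defaults might be applied, here is a small sketch. The `minScore` and `minComments` keys come from the config above. The default values of 0, the `keep_post` helper, and the `score` / `num_comments` post fields are my assumptions for illustration.

```python
DEFAULTS = {"minScore": 0, "minComments": 0}  # assumed defaults

CONFIG = {
    "ClaudeCode": {"minScore": 10, "minComments": 5},
    "ClaudeAI": {},  # falls back to DEFAULTS
}


def keep_post(subreddit: str, post: dict) -> bool:
    """Apply the subreddit's thresholds, filling gaps from DEFAULTS."""
    rules = {**DEFAULTS, **CONFIG.get(subreddit, {})}
    return (
        post["score"] >= rules["minScore"]
        and post["num_comments"] >= rules["minComments"]
    )


print(keep_post("ClaudeCode", {"score": 12, "num_comments": 5}))  # True
print(keep_post("ClaudeCode", {"score": 12, "num_comments": 2}))  # False
print(keep_post("ClaudeAI", {"score": 1, "num_comments": 0}))     # True
```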

A Reddit app registration to get OAuth credentials:

Reddit's create application page with RdSummarizer as the app name, web app type selected, and localhost redirect URI configured

And a .env file with the credentials:

.env file showing REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_REDIRECT_URI, REDDIT_USERNAME, REDDIT_REFRESH_TOKEN (empty), and PORT=5566
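A startup check for these variables could look something like the sketch below. The variable names match the .env file above. Which ones count as required is my assumption: I have left `REDDIT_REFRESH_TOKEN` out because it stays empty until the OAuth flow has run, and `missing_env` is a hypothetical helper.

```python
REQUIRED = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_REDIRECT_URI",
    "REDDIT_USERNAME",
)


def missing_env(env: dict) -> list:
    """Return the required variables that are absent or blank."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Hypothetical environment with one value left blank.
example = {
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "shh",
    "REDDIT_REDIRECT_URI": "",
    "REDDIT_USERNAME": "someone",
    "REDDIT_REFRESH_TOKEN": "",
    "PORT": "5566",
}
print(missing_env(example))  # ['REDDIT_REDIRECT_URI']
```

In a real run you would pass `os.environ` instead of a hand-built dictionary.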

Straightforward stuff. Let’s get to the good part.

.

.

.

Step 7: Run the Test Plan

The final step. Fresh session. /run_test_plan command.

(Deep breath.)

New session with /clear followed by /run_test_plan command

Claude read the test plan, explored the codebase, and confirmed this was a fresh test run with 33 test cases ready to execute.

Claude reading the test plan and exploring the codebase structure, confirming 33 test cases for a fresh test run

It created tasks for each test case, set up tracking files, and organized execution by dependencies and priority.

Claude creating 33 test case tasks with dependencies, setting up test-status.json and test-results.md tracking files

I asked Claude to skip TC-003 (environment variable validation) since that one needed manual testing with specific env states.

User asking Claude to skip TC-003 env validation test, Claude acknowledging and marking it as skipped

Then the tests ran. One sub-agent per test case. Each with fresh context.

Test execution Phase 2 running TC-001 through TC-006 with sub-agents, showing config loading, invalid config, OAuth redirect, and callback tests passing
Mid-test execution showing TC-007 through TC-010 passing — token refresh, collection happy path, subreddit validation, and hours parameter tests
Continued test execution with TC-011 through TC-019 — collection, filtering, storage, and error handling tests all passing with sub-agents
Test execution TC-020 through TC-025 — storage directory creation, merge deduplication, Reddit API pagination, comments fetching, and rate limit handling all passing
Final batch of tests TC-026 through TC-033 including error codes 401 403 404 429 5xx, frontend OAuth page, User-Agent header, and collection date naming all passing

All automated tests passed on the first attempt.

Zero code fixes required.

Test completion summary showing all 32 automated tests passed on first attempt with zero code fixes needed, and implementation matched the specification

Here’s the full results table:

Final score: 32/33 passed. 0 failed. 1 skipped (TC-003 — manual user testing). 0 known issues. 0 total fix attempts.

Test execution summary showing 32/33 passed, 0 failed, 1 skipped TC-003 for manual user testing, and all automated test cases passed on first attempt with zero code fixes
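The final score is just a tally over per-test statuses. Here is a sketch that rebuilds it, assuming a flat mapping of test ID to status. The actual layout of test-status.json is not shown in this article, so treat the structure and the `summarize` helper as hypothetical.

```python
from collections import Counter


def summarize(statuses: dict) -> str:
    """Collapse per-test statuses into a one-line score."""
    counts = Counter(statuses.values())
    return (
        f"{counts['passed']}/{len(statuses)} passed. "
        f"{counts['failed']} failed. {counts['skipped']} skipped."
    )


# 33 test cases, all passed except TC-003, which was skipped for manual testing.
statuses = {f"TC-{n:03d}": "passed" for n in range(1, 34)}
statuses["TC-003"] = "skipped"

print(summarize(statuses))  # 32/33 passed. 0 failed. 1 skipped.
```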

Let that sit for a second.

Every automated test passed on the first try. No code fixes needed. The implementation matched the specification because the specification had been thoroughly brainstormed, independently reviewed, and turned into a test plan before any code was written.

The ceremony I almost skipped? Turns out it was doing the heavy lifting all along.

.

.

.

Putting the App to Work

With all tests green, I could actually use the thing.

First up: the OAuth flow. I started the server and opened the setup page — a simple “Connect to Reddit” button.

Reddit Summarizer Setup page showing Not Connected status with an orange Connect to Reddit button

One click, and Reddit’s authorization page appeared.

Reddit OAuth authorization page asking to allow RdSummarizer to access posts and comments and maintain access indefinitely

After approving, the app received a refresh token and displayed it with clear instructions to add it to .env.

Reddit Summarizer success page showing the refresh token with a Copied button and instructions to add REDDIT_REFRESH_TOKEN to the .env file
Note: The refresh token is fake.

Token saved. Now I asked Claude to hit the /api/collect-all endpoint and pull data from both configured subreddits.

Claude Code running the collect-all endpoint with hours=24, showing successful collection from ClaudeCode and ClaudeAI subreddits with post counts

The data landed exactly where the specs said it would — JSON files organized by subreddit and date.

File explorer showing collected JSON data in logs folder organized by subreddit, with actual Reddit post data visible including titles, scores, and timestamps from the ClaudeCode subreddit

Now for the payoff.

I asked Claude to read the collected data and surface the latest Claude Code tips, workflows, and real-world case studies.

User prompt asking Claude to find and summarize the latest Claude Code tips, workflows, and case studies from the collected posts

The collected data was large — 64k tokens. Claude spawned 6 sub-agents to process it in parallel, each analyzing a chunk.

Claude processing the large data file with 6 parallel sub-agents, each analyzing a chunk of posts — ranging from 23.8k to 82.3k tokens
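Splitting a big pile of posts across parallel workers comes down to even chunking. The sketch below shows one simple way to do it. How Claude actually divided the data is not documented here, so `split_into_chunks` and the post count of 20 are purely illustrative.

```python
def split_into_chunks(items: list, n: int) -> list:
    """Split items into n contiguous chunks whose sizes differ by at most 1."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


posts = list(range(20))  # stand-in for 20 collected posts
chunks = split_into_chunks(posts, 6)
print([len(c) for c in chunks])  # [4, 4, 3, 3, 3, 3]
```

Each chunk can then go to its own sub-agent, and the main session only has to merge six short summaries instead of reading everything itself.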

And here’s what came out — a synthesized summary of everything worth knowing from the last 24 hours across both subreddits:

Claude's synthesized insights showing top hacks like Force Opus sub-agents, hook-based context injection, notification sounds on Mac, and workflow optimizations including must-have settings and measure twice cut once workflows

Two subreddits. Hundreds of posts and comments. Distilled into actionable insights in under a minute.

I could never get through that volume of data and pull out insights that fast by scrolling Reddit manually. The app collects and organizes. Claude analyzes and summarizes. And because all of this runs through my Claude subscription, there’s no separate API cost for the summarization part.

My morning Reddit scroll just went from 45 minutes to about 2.

.

.

.

Why the Workflow Made This Possible

You might be thinking: “Okay, but couldn’t you have built this without all the workflow steps? It’s just an Express server with some API calls.”

Honestly? Probably. This project is small enough that a skilled developer could prompt their way through it in one session.

But here’s what would have been different.

1. The spec review caught 10 issues before any code existed. 

Using GET for state-changing operations. Writing tokens to .env at runtime. Missing pagination limits. A path traversal vulnerability. Any one of these would have meant debugging sessions after implementation — or worse, shipping a security hole you never noticed.

2. The test plan gave implementation a concrete target. 

33 test cases, defined before Claude wrote a single line of code. When every task maps to specific success criteria, you don’t end up with “it seems to work” confidence. You end up with “every automated test passed on the first attempt” confidence. There’s a world of difference between those two.

3. Fresh sessions prevented context rot. 

The brainstorm session accumulated context from two rounds of Q&A. The review session started clean — and immediately found problems the brainstorming agent was blind to. The implementation used sub-agents in waves, each with its own fresh context window. No degradation. No forgotten requirements.

4. The artifacts served as shared memory. 

Every step read from the previous step’s output file. Specs fed the review. Reviewed specs fed the test plan. Test plan fed the implementation plan. Implementation plan fed the sub-agents. Nothing lived “in context.” Everything lived on disk, where any fresh session could pick it up.

And here’s the part I keep coming back to: the workflow scales. 

This project happened to be small.

The next one might not be.

And the exact same six commands:

  • /spec_brainstorm
  • /spec_review
  • /write_test_plan
  • /write_impl_plan
  • /do_impl_plan
  • /run_test_plan 

…will work the same way regardless of what you’re building.

You design the process once. You refine it over time. Then you apply it to everything.

That’s the whole promise of Claude Code Workflow Engineering. And I think this little Reddit project makes a decent case for it.

.

.

.

Your Turn

The full source code is on GitHub: reddit-summarizer

If you want to use the same workflow for your own projects, grab the Workflow Engineering Starter Kit — all six command files, ready to drop into your .claude/commands/ folder.

Here’s what I’d suggest:

  1. Pick a project idea you’ve been sitting on
  2. Write it down in an idea.md file — even a rough paragraph works
  3. Run the six-step workflow end to end
  4. Pay attention to what the spec review catches — that’s usually where the biggest surprise shows up

What are you going to build with it?

Go engineer it.

17 min read The Art of Vibe Coding

Workflow Engineering: Why Your AI Development Process Matters More Than Your Prompts

You open Claude Code.

You’ve got a feature to build — a complex one. Payment integration, subscription handling, admin dashboard, the works.

So you write the most detailed prompt you’ve ever crafted. 1000+ words. Every requirement listed. Edge cases mentioned. You even throw in a few “make sure you handle X” reminders for good measure.

(You’re being thorough. You’re being responsible. You’re practically writing documentation before the code even exists.)

You hit enter.

Claude gets to work.

Files appear. Functions materialize. Code flows like water.

Thirty minutes later, you look at the output.

Half your edge cases? Missing. The subscription lifecycle you described in exquisite detail? Partially implemented. That race condition you specifically warned about? Acknowledged in a code comment — a lovely, well-formatted code comment — but never actually handled.

So you do what every developer does.

You rewrite the prompt.

Make it longer. More specific. Add bold text for emphasis. Paste in code examples. Maybe underline something, just to really drive the point home.

Same result. Different gaps.

.

.

.

The Prompt Optimization Trap

Here’s the cycle most developers are stuck in right now:

The prompt keeps getting bigger. The results don’t keep getting better.

You’ve probably watched this happen in real time.

The AI starts strong — the first few hundred lines look great. Then quality dips. Functions get shallower. Edge cases receive “TODO” comments instead of actual handling. By the end, Claude is running on fumes, juggling so much context that it’s forgetting what you said at the beginning of your very thorough, very responsible prompt.

Everyone’s response?

Write a better prompt. A clearer prompt. A more detailed prompt. I did this too. For longer than I’d like to admit.

Here’s what I learned after months of building complex features with Claude Code: the answer has nothing to do with writing better prompts.

The answer is designing better workflows.

.

.

.

From Prompts to Workflows

Stay with me here — because this is the shift that changed everything about how I work with AI.

Think about how you’d approach a complex feature without AI.

You wouldn’t sit down, write everything you know into one document, hand it to a junior developer, and say “build all of this.” That’s a recipe for disaster.

(And possibly a resignation letter.)

Instead, you’d break the work into phases.

Write specs first. Review them. Plan the implementation. Assign tasks. Verify the results. Each phase produces something concrete — a document, a plan, a test report — that feeds into the next phase.

The same principle applies to AI-assisted development. And it has a name.

Workflow Engineering is the practice of designing multi-step, artifact-driven processes where each step produces a concrete output that becomes the input for the next step — and where the process itself is reusable across projects.

Read that again.

Two words matter most:

Artifact-driven. Every step creates something tangible. A spec file. A test plan. An implementation plan. Not vibes. Not “context.” Actual files that exist on disk and can be read by a fresh session.

Reusable. The workflow works regardless of what feature you’re building. Payment integrations, admin dashboards, API endpoints, plugin architecture — the same sequence of steps applies every time.

Here’s the mental model shift:

With prompt thinking, you’re optimizing the message.

With workflow thinking, you’re optimizing the process.

One is fragile, project-specific, and impossible to debug when things go sideways. The other is robust, reusable, and traceable — meaning when something does go wrong (and it will, because software), you can trace exactly where the chain broke.

The question stops being “how do I write the perfect prompt to implement this feature?” and becomes something far more interesting: “what sequence of focused steps will reliably produce a working feature — regardless of what that feature is?”

That second question? That’s workflow engineering.

.

.

.

The Four Principles of Workflow Engineering

After months of building and refining workflows for Claude Code, I’ve distilled what makes them work down to four principles.

(Four! A reasonable number. I considered making it seven because odd numbers feel more authoritative, but that felt dishonest. Four is what I’ve got. Four is what works.)

These apply to any AI coding tool — Claude Code, Cursor, Copilot, Codex, whatever ships next quarter.

The tools will change.

These principles won’t.

Principle 1: Separate Thinking from Doing

When Claude is brainstorming specs, it shouldn’t be writing code. When it’s implementing, it shouldn’t be redesigning architecture. Mixing planning and execution causes both to suffer.

Here’s why.

Planning gets shallow when the agent is eager to start building.

It rushes through decisions because there’s code to write — ferpetesake, there are functions to create, endpoints to scaffold. Meanwhile, the code gets sloppy because the agent is still making design decisions mid-stream — changing its mind about architecture while simultaneously trying to implement it.

You’ve seen this happen.

Claude starts building a feature, realizes halfway through that the data model needs restructuring, pivots the architecture, and now half the code it already wrote doesn’t match the new approach.

The result? A Frankenstein codebase where the first half follows one pattern and the second half follows another.

Every step in a well-engineered workflow should be either a thinking step or a doing step.

The artifact that comes out of the thinking phase — the spec, the plan, the test cases — becomes the wall between them. By the time Claude starts coding, every design decision has already been made and documented.

No more mid-implementation architecture pivots. No more shallow plans that crumble at the first edge case.

Principle 2: Fresh Context, Always

Here’s something most developers learn the hard way. (I certainly did.)

AI performance degrades as context accumulates. The longer a session runs, the worse the output gets. Claude starts “forgetting” early instructions. It takes shortcuts. Details slip through the cracks like sand through fingers.

We call this context rot — and it’s the silent killer of ambitious AI projects.

Think of it like a multi-day hiking trip.

Day one, your backpack is light. You’re sharp, focused, covering ground fast. By day five — if you’ve been packing on top of yesterday’s gear without clearing anything out — you’re hauling 40 pounds of stuff you don’t need. Yesterday’s rain jacket (it’s sunny now). Tuesday’s extra water bottles (you passed a stream an hour ago). Your pace drops. Your attention narrows. You start missing trail markers because you’re too busy adjusting your shoulder straps.

That’s what happens to an AI agent running in a single session across a dozen tasks.

Workflow engineering forces natural context boundaries:

Each step runs in its own session. Each sub-agent gets a clean slate. The file carries knowledge forward. The context resets every time.

Fresh backpack. Every single morning.

Principle 3: Artifacts Over Memory

Don’t trust the AI to “remember” what you decided three steps ago.

(Don’t trust yourself to remember, either. I once forgot a critical API decision I made that same morning. Before coffee, but still.)

Every decision, every requirement, every edge case — externalized into a file.

Why? Three reasons.

  • A file can be read by a fresh session. This enables Principle 2. When a new session starts, it reads specs.md from disk — it doesn’t need to “recall” a conversation that happened two hours ago in a completely different context window.
  • A file can be reviewed by a different agent — or by you. This is how you catch mistakes before they compound. The spec review step? That’s a fresh agent reading the brainstorm agent’s output and poking holes in it. Adversarial quality control, built right into the workflow.
  • A file creates a traceable chain. If something breaks in implementation, you can walk the chain backwards to find exactly where things went wrong:

Without artifacts, every failure means starting from scratch.

With artifacts, every failure is traceable to a specific step. You fix that step. You re-run from that point. Everything downstream updates accordingly.

That’s the difference between “something broke” and “I know exactly where it broke.”

Principle 4: Define Success Before Starting Work

Write the test plan before the implementation plan.

(I know. I can feel you resisting this one through the screen.)

Most developers want to start building immediately.

Writing test cases for code that doesn’t exist yet feels like… paperwork. Busywork. The kind of thing a project manager suggests in a meeting you didn’t want to attend.

But for AI-driven development, it changes the entire outcome. Here’s why.

1/ Deep requirement analysis.

When Claude has to turn “handle race conditions during renewal processing” into a specific test case — with preconditions, exact steps, and expected outcomes — it has to deeply understand what that requirement actually means.

Shallow understanding produces shallow tests.

If the test plan looks thorough, the requirements were thoroughly analyzed.

2/ Gap detection before code exists.

A missing test case reveals a missing requirement. And finding a gap in your spec is a hundred times cheaper before implementation than after.

(Ask me how I know.)

3/ Clear implementation targets.

Every task in the implementation plan maps to specific test cases.

The developer — or AI agent — knows exactly what “done” means for each piece of work. No ambiguity. No interpretation. No “I thought you meant…”

You’re building toward a defined target instead of discovering the target while building.

Which sounds obvious when I write it out like that — but go look at your last three AI-assisted features and tell me you had a test plan before you started coding.

(No judgment. I didn’t either. Until I did.)

.

.

.

The System: Workflow Engineering in Practice

Principles are great.

Principles are necessary.

But at some point, you need to see them actually working — not just sounding wise on a page.

So let me show you the complete workflow engineering pipeline I’ve built and refined over the past several months for Claude Code. Six steps, four phases, every principle encoded into the process.

I’ve written deep-dives on each phase of this system:

This article is the why behind those hows.

Here’s the complete system at a glance:

Let me walk you through each step — what it does, why it exists, and what artifact it produces.

Step 1: Spec Brainstorm

Principle served: Artifacts Over Memory

You describe the feature you want.

But instead of Claude immediately starting to code, you trigger a question-asking mode: “Ask me clarifying questions until you are 95% confident you can complete this task successfully.”

That line is the key.

It tells Claude to stop assuming. Stop guessing. Stop filling in blanks with whatever seems reasonable.

Claude explores your codebase first — reading your existing patterns, your database schema, your current architecture. Then it starts asking questions, with options, explanations, and its own recommendation for each.

In my WooCommerce integration project, Claude asked 15 questions covering everything from subscription plugin choice to refund handling to email notifications. Edge cases I hadn’t thought about. Architectural decisions that would have bitten me weeks later.

Every answer gets compiled into a comprehensive specification document.

Artifact produced: notes/specs.md

👉 Deep-dive: The 3-Phase Method for Bulletproof Specs

Step 2: Spec Review

Principle served: Fresh Context

Start a new session. Fresh context. Then ask Claude to critique the specs from the previous step.

Why a new session?

Because the brainstorming session’s context is bloated with the back-and-forth from all 15 questions. A fresh agent reading the specs with skeptical eyes catches things the original agent, busy building the specs, overlooked.

In my project, the review found 14 potential issues, including a race condition that would have caused double charges (ferpetesake, the payments!), a token deletion scenario that would silently break renewals, and a mode-switching conflict that would have confused billing for every active subscriber.

You pick which issues matter. Claude fixes them — with another round of clarifying questions to make sure the fixes are solid.

Artifact produced: refined notes/specs.md

👉 Deep-dive: The 3-Phase Method for Bulletproof Specs

Step 3: Test Plan

Principle served: Define Success Before Starting Work

Before writing any implementation code, Claude reads the specs and generates a structured test plan. Every requirement becomes a test case with preconditions, specific steps, expected outcomes, and priority levels.

For my WooCommerce project: 38 test cases organized into 12 sections. 7 Critical, 20 High, 11 Medium.

This serves a dual purpose.

It verifies Claude deeply understood every requirement — shallow understanding produces shallow test cases, so thorough tests mean thorough comprehension. And it creates the success criteria that will drive everything that follows.

Artifact produced: notes/test_plan.md

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 4: Implementation Plan

Principle served: Separate Thinking from Doing

Claude reads both the specs and test plan, then generates an implementation plan. Tasks are grouped logically, dependencies are identified, and every task maps to the specific test cases it will satisfy.
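One cheap sanity check on a plan like this is to confirm that every test case is claimed by at least one task. A minimal sketch, using hypothetical task names and a tiny made-up set of test IDs rather than the real plan:

```python
def uncovered(test_ids: list, task_map: dict) -> list:
    """Return test IDs that no task in the plan claims to satisfy."""
    covered = set().union(*task_map.values())
    return sorted(set(test_ids) - covered)


# Hypothetical plan: three tasks, five test cases.
test_ids = ["TC-001", "TC-002", "TC-003", "TC-004", "TC-005"]
task_map = {
    "foundation": {"TC-001", "TC-002"},
    "checkout": {"TC-003"},
    "renewals": {"TC-004"},
}
print(uncovered(test_ids, task_map))  # ['TC-005']
```

An empty result means every success criterion has an owner. Anything else is a gap worth fixing before the build starts.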

For my project: 4 phases, 12 tasks, each explicitly linked to test cases. Phase 1 handles foundation (TC-001 to TC-007). Phase 2 tackles checkout and lifecycle (TC-008 to TC-014). Phase 3 addresses the critical renewal processing (TC-015 to TC-021). Phase 4 covers remaining features (TC-022 to TC-037).

This is the last thinking step.

After this, the wall goes up. Every design decision has been made. Every task has a clear target. Now — and only now — we build.

Artifact produced: notes/impl_plan.md

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 5: Execute Implementation

Principle served: Fresh Context

Here’s where sub-agents change everything.

Instead of running all 12 tasks in one session (guaranteed context rot), Claude creates each task using the built-in task management system, identifies dependencies, and processes them in waves. Each task runs in its own sub-agent with fresh context.

Fresh backpack. Laser focus.

For my project:

  • Wave 1: 2 sub-agents (foundation tasks, no dependencies)
  • Wave 2: 2 sub-agents (checkout + lifecycle, depend on Wave 1)
  • Wave 3: 3 sub-agents (critical renewal processing)
  • Wave 4: 6 sub-agents (remaining features)

Total time: 52 minutes. 13 tasks completed. 38 test cases worth of functionality implemented. Each sub-agent used ~18% context — compared to ~56% if everything had run in a single session.

Artifact produced: working code across all specified files

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 6: Run Test Plan

Principle served: All four principles working together

The final step.

Claude reads the test plan, creates one task per test case, analyzes dependencies between tests, and executes them sequentially — one sub-agent per test, each with fresh context.

If a test fails, the sub-agent analyzes the root cause, implements a fix, and re-runs the test. Up to 3 attempts. If it still fails after 3 tries, it gets marked as a known issue with reproduction steps and a suggested fix.
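The retry rule can be sketched as a small loop. I am reading "up to 3 attempts" as one initial run followed by at most three fix-and-rerun cycles, which is my interpretation and not something the command file spells out here. `run_with_fixes` and the result keys are hypothetical.

```python
def run_with_fixes(run_test, apply_fix, max_fix_attempts: int = 3) -> dict:
    """Run a test; on failure, fix and re-run up to max_fix_attempts times."""
    if run_test():
        return {"status": "passed", "fix_attempts": 0}
    for attempt in range(1, max_fix_attempts + 1):
        apply_fix()
        if run_test():
            return {"status": "passed", "fix_attempts": attempt}
    # Still failing: record it instead of looping forever.
    return {"status": "known_issue", "fix_attempts": max_fix_attempts}


# Hypothetical test that starts passing once a single fix has been applied.
state = {"fixed": False}
result = run_with_fixes(
    run_test=lambda: state["fixed"],
    apply_fix=lambda: state.update(fixed=True),
)
print(result)  # {'status': 'passed', 'fix_attempts': 1}
```

The hard cap matters: without it, a sub-agent could burn its whole context window chasing one stubborn failure.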

For my project: 30 tests. 2 hours 12 minutes. All passed. One bug found and autonomously fixed during TC-002 — a settings save handler that wasn’t persisting color options. Found, diagnosed, fixed, re-verified. All without me touching the keyboard.

Results get logged in two places: test-status.json for machine parsing, test-results.md for human review.

Artifact produced: notes/test-results.md and notes/test-status.json

👉 Deep-dive: Claude Code Testing: The Task Management Approach That Actually Works

The Complete Artifact Chain

Look at how everything connects:

Nothing lives in memory.

Everything lives in files. Every step reads from the previous step’s artifact. If something goes wrong at Step 5, you trace backwards through the chain to find exactly which artifact — which decision — needs fixing.

No more “something broke somewhere, guess we start over.” Just: “the impl plan missed a dependency — let me fix Step 4 and re-run from there.”

I’ve packaged all six prompt files into a Workflow Engineering Starter Kit — drop them into your .claude/commands/ folder and the entire pipeline is ready to go. Download the Starter Kit here →

.

.

.

Design Your Own Workflows

The six-step system above is one example — a specific workflow I’ve built for feature implementation with Claude Code.

But the principles behind it apply to any multi-step AI task.

Writing a technical article, planning a product launch, migrating a database, refactoring a legacy codebase. Same principles. Different steps.

The specific prompts change. The tools change. The principles stay constant.

Here’s a checklist you can run before starting any complex AI task — five questions that reveal whether your process has gaps:

Five questions. If any answer is “no,” your workflow has a gap.

  • The artifact test catches phantom steps — work that happens “in context” but produces nothing concrete. Those are the steps where information vanishes between sessions.
  • The thinking/doing test catches the most common mistake in AI-assisted development: asking an AI to plan and build in the same breath. Every time you let that happen, both the plan and the build suffer.
  • The context boundary test catches rot before it starts. If you can’t point to where sessions should reset, you’ll end up with one massive session that degrades across every task.
  • The success definition test catches the “just build it and we’ll see” trap. Without defined success criteria, you have no way to verify the output — and no target for the AI to aim at.
  • The traceability test catches broken chains. If you can’t walk backwards from a failure to its root cause, your artifacts aren’t detailed enough to serve as the connective tissue between steps.

.

.

.

The Skill That Compounds

Here’s what I want you to take away from all of this.

The six prompt files in the Starter Kit will be outdated eventually.

Claude Code will add new features. The task management API might change. New AI tools will emerge that handle things we can’t even imagine yet.

The workflow engineering thinking behind those prompts won’t age.

Separate thinking from doing. Reset context at natural boundaries. Externalize decisions into artifacts. Define success before you start building. These principles work today with Claude Code. They’ll work next year with whatever comes next.

And here’s the compounding part — the part that makes this a skill and not just a technique: every workflow you design teaches you to design better workflows.

You start noticing patterns. Where context rot creeps in. Where planning and execution get tangled. Where artifacts need more detail. Your workflows get tighter with each project. Your instincts sharpen.

The developers who will thrive in AI-assisted development over the next few years won’t be the ones who write the best prompts.

They’ll be the ones who engineer the best workflows.

.

.

.

Your Next Steps

  1. Download the Workflow Engineering Starter Kit → — All six prompt files, ready to drop into .claude/commands/
  2. Run the checklist against your current process — find the gaps
  3. Try the full pipeline on your next feature — specs through testing
  4. Refine what works, replace what doesn’t

What feature are you going to build with this workflow?

Pick one.

Run the pipeline. See what happens when Claude has a structured process to follow instead of a single prompt to interpret.

Go engineer it.


P.S. — For the deep-dives on each phase, start here:

8 min read The Art of Vibe Coding

The Claude Code Skill Creator Now Has Evals (And My Skills Finally Have Proof They Work)

Watch the video walkthrough, or read the full written guide below.

Here’s a confession.

For months, I’ve been building Claude Code skills with what I can only describe as the “hope and pray” methodology. Write the SKILL.md. Test it once. Ship it. Whisper a small prayer to the LLM gods. Move on with my life.

Did the skill actually trigger when it should? ¯\_(ツ)_/¯

Did it make Claude’s output better? Honestly… no idea.

I’ve been using skills since they were added to Claude Code — and until last week, I had zero way to answer either of those questions.

(Stay with me. This story has a happy ending.)

.

.

.

The Problem With Skills (That Nobody Wants to Admit)

Here’s the thing about Claude Code skills: they’re just text prompts. Fancy, well-organized text prompts — but text prompts nonetheless.

And text prompts don’t come with test suites.

I’ve built dozens of skills over the past few months. Frontend design patterns. WordPress security checklists. Newsletter writing styles. Documentation generators. Each one followed the same ritual:

  • Write a SKILL.md file
  • Test it manually (once, maybe twice if I’m feeling thorough)
  • Hope it works
  • Wonder — weeks later — if it’s actually triggering
  • Wonder — with increasing anxiety — if it’s helping when it does trigger
  • Have absolutely no data to know either way

The old skill-creator plugin could generate skills for you, which was genuinely useful. But it had no evals. No testing. No benchmarks. You’d create a skill, and then… that was it. Cross your fingers, close the terminal, pretend everything was fine.

I kept using skills because they felt useful. But I couldn’t prove it. I couldn’t point to a number and say “this skill improves output quality by 9.5%.”

Every skill I created was a guess. A lovingly crafted, well-intentioned guess — but a guess.


The Upgrade That Changes Everything

The Claude Code skill creator plugin just got a massive upgrade. And honestly? It solves the exact problem I’ve been complaining about for months.

The new version adds something skills have never had: a testing and benchmarking layer.

Claude Code plugin discovery interface showing the skill-creator plugin by claude-plugins-official with 19.1K installs and description "Create new skills, improve existing skills, and measure s..."

Here’s what the updated skill creator can do:

  • Create skills from your requirements (same as before)
  • Generate evals — actual test cases — automatically
  • Run parallel A/B benchmarks comparing skill vs. baseline Claude
  • Optimize trigger descriptions so your skill activates when it should
  • Iterate until the skill measurably improves output

That last part bears repeating: measurably improves output. With numbers. And charts. And side-by-side comparisons.

Let me show you how this works with a real skill I built last week.

.

.

.

Building a WordPress Security Review Skill (The Whole Process)

I’ve built several WooCommerce plugins, which means security reviews are part of my regular workflow. But Claude’s baseline security reviews felt… inconsistent. Sometimes thorough, sometimes surface-level. No predictable structure.

Perfect candidate for a skill.

Step 1: Describe What You Want

I asked Claude Code to create a skill using the skill-creator plugin:

Claude Code terminal showing user prompt requesting a skill called "wp-security-review" that reviews WordPress plugin PHP code for security vulnerabilities including SQL injection, XSS, CSRF, insecure direct object references, missing capability checks, unsafe file operations, insecure superglobal usage, and hardcoded secrets.

My prompt included the specific vulnerability types I wanted covered: SQL injection, XSS, CSRF, missing nonce verification, insecure $_GET/$_POST usage, and more.

Step 2: The Skill Creator Explores Your Codebase

Here’s where things get interesting.

Claude loaded the skill-creator skill and immediately started exploring my project:

Claude Code terminal showing skill-creator successfully loaded, then searching for 2 patterns and reading files to understand the project structure, existing security references, and PHP patterns before creating the skill.

The skill-creator looked at my existing code, found security patterns already in the project, and used that context to build a skill tailored to my codebase. (Not a generic one-size-fits-all approach.)

Step 3: The Generated Skill

Claude wrote 330 lines to .claude/skills/wp-security-review/SKILL.md:

Claude Code terminal showing the created wp-security-review skill with 330 lines written, including a description covering SQL injection, XSS, CSRF, missing capability checks, unsafe file operations, and hardcoded secrets. Also shows 3 test prompts: reviewing CartHandler.php, checking BulkActions.php, and doing a full plugin security audit.

The skill included:

  • A detailed trigger description (optimized for when Claude should activate it)
  • A vulnerability checklist with 8 categories
  • WooCommerce-specific nuances — like wc_price() double-escaping and WC Settings API nonce delegation
  • Structured output format with severity ratings

All good stuff. But here’s the thing: a skill is only as good as its results.

And until now, I had no way to measure those results.

.

.

.

The Part That Made Me Actually Stop and Stare: Evals

After creating the skill, Claude immediately said: “Now let me set up test cases and run them.”

Wait, what?

Claude Code terminal showing creation of evals.json file with test cases including prompts like "Review the CartHandler.php for security issues" with expected outputs describing structured security reports identifying $_POST sanitization issues, nonce verification patterns, and price manipulation risks.

The skill-creator generated an evals.json file with:

  • 3 test prompts targeting different aspects of my plugin
  • Expected outputs for each test
  • Specific files to review
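For a feel of what such a file might contain, here is a sketch built in Python and dumped as JSON. The field names (`id`, `prompt`, `expected_output`, `files`) are assumptions, not the plugin's documented schema, and only the first prompt is quoted from the run above; the other two are paraphrased.

```python
import json

# Hypothetical shape for an evals file; the real schema may differ.
evals = {
    "skill_name": "wp-security-review",
    "evals": [
        {
            "id": 1,
            "prompt": "Review the CartHandler.php for security issues",
            "expected_output": "Structured security report with severity ratings",
            "files": ["CartHandler.php"],
        },
        {
            "id": 2,
            "prompt": "Check BulkActions.php for security problems",
            "expected_output": "Findings on $_GET usage, nonces, and capability checks",
            "files": ["BulkActions.php"],
        },
        {
            "id": 3,
            "prompt": "Do a full security audit of the plugin",
            "expected_output": "Per-file findings plus summary counts",
            "files": [],  # empty list used here to mean "whole plugin"
        },
    ],
}

REQUIRED_KEYS = {"id", "prompt", "expected_output", "files"}


def missing_keys(eval_file: dict) -> dict[int, set[str]]:
    """Return {case id: missing keys} for any incomplete test case."""
    problems: dict[int, set[str]] = {}
    for case in eval_file["evals"]:
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems[case.get("id", -1)] = missing
    return problems


print(json.dumps(evals, indent=2))
print("Incomplete cases:", missing_keys(evals))
```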

And then — and I genuinely did not expect this — it launched parallel agents.

Running 6 Agents Simultaneously

Claude Code terminal showing 6 agents launched in parallel - 3 "with skill" runs and 3 "without skill" baseline runs for CartHandler, BulkActions, and Full audit test cases, all running in the background simultaneously.

Claude launched 6 parallel agents:

  • 3 running the tests with the skill
  • 3 running the same tests without the skill (baseline Claude)
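The 3 x 2 layout is easy to picture in code. In this sketch, `run_agent` is a placeholder that returns a fake record; a real run would launch a sub-agent, which this example does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

CASES = ["CartHandler", "BulkActions", "Full audit"]
CONFIGS = ["with_skill", "without_skill"]


def run_agent(case: str, config: str) -> dict:
    """Placeholder for a real sub-agent run; returns a fake report record."""
    return {"case": case, "config": config, "report": f"{config} report for {case}"}


# Every test case is paired with both configurations: 3 x 2 = 6 jobs.
jobs = list(product(CASES, CONFIGS))
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    runs = list(pool.map(lambda job: run_agent(*job), jobs))

for run in runs:
    print(run["case"], "/", run["config"])
```

Running the same prompt under both configurations is what turns "the skill feels better" into a comparison you can actually score.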

While those ran in the background, Claude drafted the evaluation assertions:

Claude Code terminal displaying detailed evaluation assertions for each test - Eval 1 (CartHandler) checking for severity ratings, identifying unsanitized $_POST, nonce patterns, structured reports, and no false positives on WC hooks. Eval 2 (BulkActions) checking for $_GET without nonce, capability checks, severity levels. Eval 3 (Full audit) checking coverage of all files, summary counts, cross-file patterns, and passed checks sections.

Each test case had specific pass/fail criteria:

  • has_severity_ratings — Findings must be categorized by severity
  • identifies_unsanitized_post — Must flag $_POST['wcpwyw_price'] at line 126
  • no_false_positives_on_wc_hooks — Must NOT flag WC Settings API as missing nonces
  • includes_passed_checks — Must list what passed, not just failures

(Real criteria. Measurable criteria. I could have cried.)
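Here is a simplified sketch of what pass/fail checks like these could look like as plain string tests. It is an assumption: the actual grading is probably done by a model reading the report, and the sample report below is invented. The `no_false_positives_on_wc_hooks` check is left out because it needs real judgment, not a substring match.

```python
import re

SEVERITY_TAG = re.compile(r"\[(CRITICAL|HIGH|MEDIUM|LOW|INFO)\]")


def has_severity_ratings(report: str) -> bool:
    return bool(SEVERITY_TAG.search(report))


def identifies_unsanitized_post(report: str) -> bool:
    return "$_POST['wcpwyw_price']" in report


def includes_passed_checks(report: str) -> bool:
    return "passed checks" in report.lower()


ASSERTIONS = {
    "has_severity_ratings": has_severity_ratings,
    "identifies_unsanitized_post": identifies_unsanitized_post,
    "includes_passed_checks": includes_passed_checks,
}


def grade(report: str) -> dict[str, bool]:
    """Run every assertion against one report."""
    return {name: check(report) for name, check in ASSERTIONS.items()}


# Invented report text, used only to exercise the checks.
sample_report = """\
## Findings
[HIGH] Unsanitized $_POST['wcpwyw_price'] at line 126
## Passed Checks
- Nonce verification present on admin form
"""
print(grade(sample_report))
```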

.

.

.

Results Rolling In

As the agents completed, results started appearing:

Claude Code terminal showing Eval 2 BulkActions with skill completed - security review output with summary showing 0 Critical, 0 High, 1 Medium, 2 Low, 1 Info findings, plus timing data showing 38,976 tokens used in 227 seconds.
Claude Code terminal showing Eval 2 BulkActions without skill (baseline) completed - 135 lines of output with 31,386 tokens used in 236 seconds. Notes that both with-skill and without-skill produced similar findings for this relatively simple file, but the with-skill version used structured severity format and checklist notation while baseline used freeform numbered lists.
Claude Code terminal showing all 6 agents (Eval 1, 2, 3 with and without skill) completed, now extracting remaining reports from transcripts. Shows Eval 3 Full audit with skill reviewing 23 files with findings categorized as 0 Critical, 2 High, 4 Medium, 5 Low, 3 Info.
Claude Code terminal showing all 6 reports saved, now generating the eval viewer for side-by-side comparison. Writing benchmark.json with skill_name "wp-security-review" and configurations for "with_skill" and "without_skill" runs.

All 6 agents finished. Claude compiled the results and generated something I’ve never seen in skill development before.

.

.

.

The Eval Review Playground

Claude generated an HTML-based eval viewer and opened it in my browser:

Browser-based eval review interface titled "Eval Review: wp-security-review" showing 1 of 6 test cases. Displays "WITH SKILL" tag, prompt "Review the CartHandler.php for security issues", and output showing a structured security review with Summary (0 Critical, 2 High, 2 Medium, 2 Low, 2 Info) and Findings section with severity-tagged issues like "[HIGH] Price Manipulation via Cart Session - Missing Server-Side Re-validation in applyCartItemPrice".
Browser-based eval review interface showing same prompt but with "WITHOUT SKILL" tag (baseline). Output shows a different format - plain "Security Review: CartHandler.php" header with file path, date, and "Claude Opus 4.6 (automated review)" as reviewer, followed by Executive Summary section in prose format rather than structured findings list.

Side-by-side comparison. Same prompt, same file, two different approaches.

The difference was immediately visible:

  • With skill: [HIGH] Price Manipulation via Cart Session — structured, scannable, severity-tagged
  • Without skill: Prose-style Executive Summary, harder to scan

But subjective impressions only get you so far. Here’s where the numbers come in.

.

.

.

The Benchmark Results (This Is the Good Part)

Claude Code terminal showing eval viewer opened in browser with benchmark comparison table. Metrics show: Pass rate 100% (21/21) with skill vs 90.5% (19/21) baseline (+9.5% delta). Avg tokens 74,427 with skill vs 69,734 baseline (+6.7%). Avg time 276s with skill vs 307s baseline (9.9% faster). Key differences noted: skill version elevated price cap bypass to HIGH severity, avoided false positives on WC nonces, produced more structured passed-checks sections.
| Metric | With Skill | Baseline | Delta |
| --- | --- | --- | --- |
| Pass rate | 100% (21/21) | 90.5% (19/21) | +9.5% |
| Avg tokens | 74,427 | 69,734 | +6.7% |
| Avg time | 276s | 307s | 9.9% faster |

👉 The skill achieved 100% pass rate on all 21 assertions.

Baseline Claude hit 90.5% — missing structured passed-checks sections and some WooCommerce-specific nuances.

And here’s the kicker: the skill was actually faster despite being more thorough.
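If you want to sanity-check the pass-rate and token deltas yourself, the arithmetic is short. The `pct` helper is just for this example. (The rounded time averages work out to roughly 10%, in line with the reported 9.9% faster.)

```python
def pct(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


with_skill_pass = pct(21, 21)   # 100.0
baseline_pass = pct(19, 21)     # 90.5
pass_delta = round(with_skill_pass - baseline_pass, 1)  # 9.5 percentage points

token_overhead = pct(74_427 - 69_734, 69_734)  # 6.7

print(f"Pass rate: {with_skill_pass}% vs {baseline_pass}% (+{pass_delta} points)")
print(f"Token overhead with skill: +{token_overhead}%")
```

Note that the pass-rate delta is a difference in percentage points, while the token delta is a relative increase over the baseline.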

Key differences Claude identified:

  • Skill version elevated price cap bypass and session revalidation to HIGH severity (correct — these affect money)
  • Skill version correctly noted that current_user_can checks are unnecessary on frontend AJAX handlers
  • Skill version had more structured passed-checks sections
  • Baseline missed nuances like WC Settings API nonce delegation

For the first time ever, I had proof my skill was helping.

.

.

.

Iteration: When The Evals Teach You What to Fix

Here’s where it gets even better.

Looking at the results, Claude identified areas where even the improved skill could be better:

Claude Code terminal showing user saying "please proceed" and Claude responding with 3 specific improvements to make based on analyzing all 6 reports: 1. Add guidance on reviewing JavaScript files (baseline caught DOM XSS in JS that skill missed), 2. Clarify severity calibration for wp_unslash on numeric fields (inconsistent HIGH/MEDIUM ratings), 3. Add note about checking business logic consistency across code paths (price cap bypass was the most valuable finding)

Claude proposed 3 refinements:

  1. Add JavaScript file guidance — The baseline actually caught DOM XSS in JS that the skill version missed (oops)
  2. Calibrate severity for wp_unslash — Numeric field sanitization was rated inconsistently between HIGH and MEDIUM
  3. Add business logic consistency checks — The price cap bypass (validation in one path but not another) was the most valuable finding type
Claude Code terminal showing detailed diff of SKILL.md updates - adding JavaScript file scanning guidance for DOM-based XSS, adding notes about WooCommerce-specific escaping patterns for wc_price(), calibrating wp_unslash severity on numeric fields to Low instead of High, and adding business logic consistency checking to catch when validation exists in one code path but not another.

The skill-creator modified SKILL.md with surgical precision:

  • Added DOM-based XSS scanning for JavaScript files
  • Added guidance on wc_price() double-escaping
  • Calibrated wp_unslash on numeric fields to Low severity
  • Added business logic consistency patterns

The evals taught me where my skill was weak. Then Claude fixed it.

.

.

.

Trigger Description Optimization

A brilliant skill that never activates is… well, useless.

Claude ran a trigger optimization loop:

Claude Code terminal showing skill updated with 3 improvements, now running description optimization. Creating trigger-eval.json with test queries like "I just wrote a new AJAX handler for my WooCommerce plugin that lets admins delete orders. Can you check if it's secure?" with should_trigger: true. Running optimization loop in background (up to 5 iterations).

The skill-creator generated test prompts that should trigger the skill, and prompts that shouldn’t. Then it ran those prompts against Claude to see if the skill activates correctly.

Claude Code terminal showing optimization running (36+ API calls to Opus). Summarizes current state: skill at ~350 lines covering 8 vulnerability categories, structured report format with severity ratings, WooCommerce-specific nuances. Benchmark results: 100% pass rate with skill vs 90.5% baseline, skill was 9.9% faster. Lists 4 improvements applied: JS/DOM scanning, cross-path consistency checking, wp_unslash calibration, and WP bulk actions nuances.

After multiple iterations, trigger accuracy improved. The skill now fires when it should — and stays quiet when it shouldn’t.
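A minimal sketch of how trigger accuracy could be scored, assuming each test query carries a `should_trigger` flag like the ones in trigger-eval.json. The first query comes from the run above; the other three, and the deliberately naive `naive_trigger` rule, are invented for illustration.

```python
from typing import Callable


def score_triggers(cases: list[dict], did_trigger: Callable[[str], bool]) -> dict:
    """Compare actual triggering against the expected should_trigger flag."""
    false_positives = [
        c["query"] for c in cases if did_trigger(c["query"]) and not c["should_trigger"]
    ]
    false_negatives = [
        c["query"] for c in cases if not did_trigger(c["query"]) and c["should_trigger"]
    ]
    correct = len(cases) - len(false_positives) - len(false_negatives)
    return {
        "accuracy": correct / len(cases),
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


cases = [
    {
        "query": "I just wrote a new AJAX handler for my WooCommerce plugin "
        "that lets admins delete orders. Can you check if it's secure?",
        "should_trigger": True,
    },
    {"query": "Is this SQL query safe from injection?", "should_trigger": True},
    {"query": "Write a changelog entry for version 2.1", "should_trigger": False},
    {"query": "How do I secure a good hosting deal?", "should_trigger": False},
]


def naive_trigger(query: str) -> bool:
    """Stand-in for 'did the skill activate?'; a crude keyword match."""
    return "secur" in query.lower()


result = score_triggers(cases, naive_trigger)
print(result)
```

The naive rule misses the injection question and fires on the hosting one, which is exactly the kind of gap a better description is meant to close.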

.

.

.

The Final Skill

Claude Code terminal showing "The skill is complete" with final summary. Skill created: wp-security-review at .claude/skills/wp-security-review/SKILL.md. Reviews WordPress plugin PHP and JS code for 8 categories of vulnerabilities including SQL injection, XSS (including DOM XSS), CSRF, IDOR, missing capability checks, unsafe file operations, insecure superglobals, and hardcoded secrets. Lists unique value over baseline: structured [SEVERITY] format, comprehensive passed checks section, WooCommerce-specific nuances, cross-path consistency checking, and correct severity calibration.
VS Code file explorer showing the wp-security-review skill folder structure with evals subfolder containing evals.json and SKILL.md file.

The completed skill:

  • Reviews WordPress plugin PHP and JS code
  • Covers 8 vulnerability categories
  • Produces structured [SEVERITY] tagged output
  • Includes WooCommerce-specific nuances (nonce delegation, wc_price() escaping, frontend vs admin hooks)
  • Catches business logic inconsistencies (validation in one path but not another)
  • Benchmarks at 100% pass rate vs 90.5% baseline

And I have the data to prove it works.

.

.

.

Why This Matters For Your Skills

The Claude Code skill creator fundamentally changes what’s possible.

👉 Before: Skills were art. Intuition. Trial and error. Hope and prayer.

👉 After: Skills are engineering. Testable. Measurable. Improvable.

Here’s what becomes possible:

1. A/B Test Every Skill You Build

Every skill you create can be benchmarked against baseline Claude. If your skill doesn’t measurably improve output, you know immediately — before you ship it, not three weeks later.

2. Catch Regressions When Models Update

When Claude Opus 5.0 ships, run your benchmarks again. If baseline now matches or exceeds your skill’s performance, the skill may be locking in outdated patterns. Time to retire it — or improve it.

3. Tune Your Trigger Descriptions

A skill that triggers 50% of the time is only half as valuable. The description optimizer catches false positives (triggering when it shouldn’t) and false negatives (not triggering when it should).

4. Run Continuous Improvement Loops

Each eval run produces actionable feedback. Claude identifies gaps, proposes fixes, and re-benchmarks — all without you manually debugging SKILL.md files at midnight.

.

.

.

Your Next Steps

  1. Open Claude Code
  2. Type /plugin and search for skill-creator
  3. Install the official Anthropic plugin (19,100+ installs and counting)
  4. Pick one skill you’ve already built — or a new one you’ve been meaning to create
  5. Ask Claude to create evals and benchmark it
  6. Watch the data tell you exactly where to improve

What skill are you going to benchmark first?

The developers who run evals will build better skills than those who don’t. That’s just… math.

Go build yours.

Now.

11 min read The Art of Vibe Coding

The Single File That Makes or Breaks Your Claude Code Workflow

Watch the video walkthrough, or read the full written guide below.

I thought I was being thorough.

My CLAUDE.md file had grown to over 1,500 lines. Every coding convention I’d ever learned. Every edge case I’d encountered. Code snippets for common patterns. Integration examples for every third-party service we used. Database schema references. The works.

I was so proud of that file. Look at all this context I’m giving Claude! Surely this would make it understand my project perfectly.

(Narrator voice: It did not.)

Here’s what actually happened: Claude started missing obvious things. Instructions I knew were in there—ignored. Conventions I’d spelled out clearly—forgotten. The more I added to my CLAUDE.md, the worse Claude performed.

I’d accidentally discovered something that changed how I approach AI-assisted development entirely.

.

.

.

The problem wasn’t that Claude couldn’t follow instructions. The problem was that I’d given it too many to follow.

What Even Is a CLAUDE.md File? (And Why Should You Care?)

Let’s back up for a second—because if you’re new to Claude Code, you might be wondering what I’m talking about.

CLAUDE.md is a markdown file that lives at the root of your project. Claude Code reads it automatically at the start of every session. Think of it as your project’s instruction manual for the AI—persistent memory across what would otherwise be completely stateless conversations.

And here’s the thing: after your choice of model, your CLAUDE.md file is the single biggest point of leverage you have in Claude Code.

One bad line in there? It cascades into everything downstream.

Every decision Claude makes flows from that initial context. A vague instruction becomes a vague spec, becomes vague research, becomes a vague plan, becomes… well, you know how this story ends.

.

.

.

The “Instruction Budget” I Wish Someone Had Told Me About

Here’s where I learned my lesson the hard way.

LLMs have a finite number of instructions they can reliably follow at once. This sounds obvious when I say it out loud, but I’d never really internalized it until I watched my 1,500-line CLAUDE.md file turn Claude into a confused mess.

The counterintuitive part? Adding more instructions doesn’t just risk the new ones being ignored. It degrades performance uniformly across all your instructions—including the ones that worked perfectly before.

Research from Chroma on “context rot” backs this up: as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases. Your beautiful, comprehensive CLAUDE.md file might actually be making Claude worse at remembering what’s in it.

Let me do the math for you: Claude Code’s system prompt already uses around 50 instructions. If the model handles roughly 250 total, you’ve got about 200 left for your CLAUDE.md plus your plan plus your task prompt.

And here’s the part that really stung when I realized it: a bloated CLAUDE.md means you’re filling up the context window before you even send your first instruction. Every session starts with that massive file loaded. Every message you send has to fit alongside it.

If you’ve been struggling with Claude Code eating through your weekly usage too fast, the first place to cut is your CLAUDE.md file. Seriously. Smaller context = fewer tokens consumed = more runway for actual work.

I went digging through public CLAUDE.md files on GitHub recently. About 10% of them exceed 500 lines.

That’s almost certainly too large. (Ask me how I know.)

👉 Aim for under 300 lines. Ideally much shorter.
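A quick way to keep yourself honest is a tiny line-count check. This sketch assumes the 300-line target above; `check_claude_md` is a hypothetical helper, and the demo writes a throwaway file to a temp folder so it never touches a real project.

```python
import tempfile
from pathlib import Path

LINE_LIMIT = 300  # the "under 300 lines" target from the text


def check_claude_md(path: Path, limit: int = LINE_LIMIT) -> tuple[int, bool]:
    """Return (line_count, under_limit) for a CLAUDE.md file."""
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    return line_count, line_count < limit


# Demo file: one heading plus 1,500 filler rules.
demo_file = Path(tempfile.mkdtemp()) / "CLAUDE.md"
demo_file.write_text("# Project: ShopFront\n" + "- rule\n" * 1500, encoding="utf-8")

count, under_limit = check_claude_md(demo_file)
print(f"{count} lines, under limit: {under_limit}")
```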

.

.

.

The Framework That Finally Made Sense

After my bloated-file disaster, I needed a new approach. I landed on something simple: think of CLAUDE.md as an onboarding document.

Imagine you’re bringing a brilliant new hire up to speed on day one. What would you tell them? Three things, really:

WHAT — The tech stack, project structure, key files. “This is a Next.js 14 app with App Router, Prisma, and Stripe.”

WHY — The purpose of the project and its parts. “We’re building an e-commerce platform for artisan sellers.”

HOW — Commands, workflows, conventions. “Use npm, not pnpm. Run tests before commit.”

Everything else? Details that can live elsewhere or get loaded on-demand. (More on that “on-demand” part in a minute—it’s a game-changer.)

.

.

.

Where You Put Things Actually Matters

Here’s something I didn’t appreciate until embarrassingly recently: models pay more attention to the top and bottom of a document than the middle. Primacy and recency effects—same cognitive biases humans have.

So structure your CLAUDE.md accordingly:

At the top (highest weight):

  1. Project description (1-3 lines)
  2. Key commands (dev, test, build, lint, deploy)
  3. Tech stack and architecture overview

In the middle (lower relative weight):

  4. Code style and conventions
  5. File/folder structure map
  6. Important gotchas and warnings
  7. Git and commit conventions

At the bottom (high weight again):

  8. Explicit DO NOTs
  9. References and @imports to deeper docs

Let me walk through the ones that matter most.

Project Description

A concise summary that orients Claude to the big picture. Every decision should tie back to purpose.

# Project: ShopFront
Next.js 14 e-commerce application with App Router, 
Stripe payments, and Prisma ORM. Built for artisan 
sellers to manage inventory and process orders.

Three lines. Claude now knows what you’re building, who it’s for, and the core architecture. That’s it.

Key Commands

Be explicit here. Don’t assume Claude knows your setup—it doesn’t.

## Commands
- `npm run dev`      — Start dev server (port 3000)
- `npm run test`     — Run Jest unit tests
- `npm run test:e2e` — Run Playwright E2E tests  
- `npm run lint`     — ESLint check
- `npm run build`    — Production build
- `npm run db:migrate` — Run Prisma migrations

And include the non-obvious choices! “Use npm not pnpm or bun” saves you from Claude randomly picking bun because it read about it somewhere and thought it’d be helpful. (Thanks, Claude. Very helpful.)

Code Style—Where Most People Go Wrong

This is where I see CLAUDE.md files bloat into monsters. Vague rules that waste your precious instruction budget.

Don’t write this:

  • “Use good coding practices”
  • “Write clean code”
  • “Follow best practices”

These instructions accomplish nothing. They’re the equivalent of telling a new hire “do good work.” Thanks, very actionable.

Write this instead:

  • “TypeScript strict mode, no any types”
  • “Use named exports, not default exports”
  • “Prefer const over let”
  • “Use import type {} for type-only imports”

Every instruction should produce a measurable difference in output.
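One rough way to hunt for the vague kind is a phrase scan. This is only a sketch: the phrase list is taken from the "don't write this" examples above, `find_vague_rules` is a hypothetical helper, and a substring match will obviously miss vague rules worded differently.

```python
VAGUE_PHRASES = ("good coding practices", "clean code", "best practices")


def find_vague_rules(claude_md_text: str) -> list[str]:
    """Return lines that contain one of the known vague phrases."""
    flagged = []
    for line in claude_md_text.splitlines():
        if any(phrase in line.lower() for phrase in VAGUE_PHRASES):
            flagged.append(line.strip())
    return flagged


# Invented CLAUDE.md excerpt mixing vague and specific rules.
sample = """## Code Conventions
- Follow best practices
- Named exports only (no default exports)
- Write clean code
- No `any` types
"""
flagged = find_vague_rules(sample)
print(flagged)
```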

And here’s a secret that took me way too long to figure out: don’t send an LLM to do a linter’s job. If a rule can be enforced by ESLint or Prettier, enforce it there. LLMs are slow and expensive linters. Claude learns from your existing code patterns anyway—it doesn’t need to be told every formatting convention.

One more thing: resist the urge to stuff code snippets, integration examples, and schema references into your CLAUDE.md. I know it feels helpful. I did it too. But all those “handy references” are just bloating your context window and triggering context rot. If Claude needs to see code, it can read your actual codebase.

The DO NOTs

Put these at the bottom (recency effect) and be specific:

## DO NOT
- Do not modify files in `/generated/` — they are auto-generated
- Do not use `console.log` — use the project logger
- Do not run `prisma db push` — always use migrations

Use emphasis sparingly. If everything is IMPORTANT, nothing is.

.

.

.

The Technique That Changed Everything: Lazy Loading Your Context

Okay, here’s where things get really interesting.

Instead of cramming everything into one giant root file, you can distribute smaller CLAUDE.md files across subfolders. The magic? They only load when Claude actually reads files in that folder.

Think about what this means: your Supabase migration instructions only consume tokens when you’re actually working on Supabase. During frontend work? Those instructions don’t exist. They’re not cluttering up Claude’s context. They’re not eating into your instruction budget.

It’s lazy loading for your AI context.

How to Split Things Up

Your root CLAUDE.md stays small—maybe 50-100 lines. Project description, global commands, universal rules. The stuff that applies everywhere.

Then each subfolder gets its own focused file:

  • src/CLAUDE.md — Component patterns, state management approach, import rules
  • api/CLAUDE.md — Endpoint conventions, auth rules, error format, validation patterns
  • supabase/CLAUDE.md — Migration flow, schema rules, dangerous commands to avoid

These load automatically when Claude reads files in those directories. No extra work on your part. Just organize your instructions where they logically belong.
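As a mental model (not Claude Code's actual loading logic, which may differ), you can think of the candidate files as every CLAUDE.md on the path from the project root down to the file being read. The `claude_md_candidates` helper and the example paths below are hypothetical; candidates that don't exist on disk would simply be skipped.

```python
from pathlib import PurePosixPath


def claude_md_candidates(file_path: str) -> list[str]:
    """List CLAUDE.md paths from the project root down to the file's folder."""
    parts = PurePosixPath(file_path).parent.parts
    candidates = ["CLAUDE.md"]  # the root file always applies
    for depth in range(1, len(parts) + 1):
        candidates.append(str(PurePosixPath(*parts[:depth]) / "CLAUDE.md"))
    return candidates


print(claude_md_candidates("supabase/migrations/001_init.sql"))
print(claude_md_candidates("src/components/Button.tsx"))
```

Notice that the frontend file never pulls in supabase/CLAUDE.md, which is the whole point of splitting things up.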

Progressive Disclosure with @imports and Rules

Here’s another trick: reference detailed docs instead of inlining everything.

## References
See @README.md for project overview
See @docs/api-patterns.md for API conventions  
See @docs/auth-flow.md for authentication details
See @package.json for available scripts

Or use the .claude/rules/ directory:

All markdown files in .claude/rules/ load automatically alongside your main CLAUDE.md. Modular. Organized. Maintainable.

If you want to take this further—and I mean really further—I wrote a deep dive on building self-evolving Claude Code rules that keep all your guidelines, code snippets, and best practices organized in a system that actually grows smarter over time: How to Build Evolving Claude Code Rules.

It’s the natural next step once you’ve got the basics of CLAUDE.md structure down.

.

.

.

A Real Example: What This Looks Like in Practice

Here’s a complete root CLAUDE.md that actually works:

# Project: ShopFront
Next.js 14 e-commerce app with App Router, Stripe, Prisma ORM.
Built for artisan sellers to manage inventory and process orders.

## Commands
- `npm run dev`: Start dev server (port 3000)
- `npm run test`: Run Jest tests
- `npm run test:e2e`: Run Playwright E2E tests
- `npm run lint`: ESLint check
- `npm run build`: Production build
- `npm run db:migrate`: Run Prisma migrations

## Tech Stack
- TypeScript (strict mode)
- Next.js 14 (App Router)
- Prisma ORM + PostgreSQL
- Stripe for payments
- Tailwind CSS + Radix UI
- Jest + Playwright for testing

## Architecture
- `/app` — Pages, layouts, API routes
- `/components/ui` — Shared UI components
- `/lib` — Utilities, helpers, shared logic
- `/prisma` — Schema and migrations

## Code Conventions
- Named exports only (no default exports)
- Use `import type {}` for type-only imports
- No `any` types — use branded types for IDs
- Functional components with hooks
- Tailwind classes only, no custom CSS files

## Important
- NEVER commit .env files
- Stripe webhook must validate signatures
- Images stored in Cloudinary, not locally
- Do not modify files in `/generated/`
- Use project logger, not console.log

See @docs/auth-flow.md for authentication details
See @docs/api-patterns.md for API route conventions

That’s roughly 45 lines. Clean. Scannable. Universally applicable.

.

.

.

The Growth Strategy (Or: How to Not Repeat My Mistakes)

Please, I’m begging you: do not start with a giant template or auto-generated file.

Start with the absolute minimum. Project description, key commands. Maybe 20 lines.

Then use Claude Code on your project. When Claude makes a repeated mistake—and it will—add ONE targeted instruction to fix it. Commit that change to Git so you can trace it later.

Here’s the counterintuitive part: with every model release, look at what you can remove from your CLAUDE.md. Not what you can add. Remove.

Newer models have better built-in behaviors. Old workarounds can actively hinder them. Your CLAUDE.md should shrink over time, not grow.

(This was hard for me. I’m a collector by nature. But trust me—less really is more here.)

.

.

.

The Maintenance Checklist I Actually Use

Every few weeks—or after any model upgrade—I run through this:

Remove:

  • Redundant rules the model handles naturally now
  • Old workarounds for previous model versions
  • Vague instructions that don’t change output
  • Rules that should be enforced by linters instead
  • Code snippets and examples that bloat context

Relocate:

  • Domain-specific rules → move to subfolder CLAUDE.md files
  • Detailed docs → convert to @imported files
  • Rarely-used conventions → put in .claude/rules/

Simplify:

  • Merge overlapping instructions
  • Replace verbose paragraphs with bullet points
  • Make every line earn its place
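Parts of this checklist can be roughed out in a script. The sketch below is a heuristic of my own, not an official tool: the vague-phrase list is made up for illustration, and the 100-line budget simply reuses the upper end of the 50-100 line range suggested earlier.

```python
FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable

# Hypothetical examples of phrases that rarely change model output.
VAGUE_PHRASES = ("clean code", "best practices", "be careful", "high quality")


def audit(markdown: str, max_lines: int = 100) -> list[str]:
    """Return human-readable findings for a CLAUDE.md file's text."""
    findings: list[str] = []
    lines = markdown.splitlines()

    if len(lines) > max_lines:
        findings.append(f"{len(lines)} lines (budget {max_lines})")

    # Every code block has an opening and a closing fence.
    fences = sum(1 for line in lines if line.lstrip().startswith(FENCE))
    if fences >= 2:
        findings.append(f"{fences // 2} fenced code block(s): consider an @imported doc")

    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                findings.append(f"line {number}: vague phrase '{phrase}'")
    return findings


sample = "\n".join(["# Project", "Write clean code.", FENCE, "x = 1", FENCE])
for finding in audit(sample):
    print(finding)
```

It will not tell you which rules the model already handles on its own. That part still takes judgment.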

The Complete Picture

Here’s how all the pieces fit together:

It looks like a lot. But you don’t need all of it. Start with the root file. Add lazy loading when your root gets crowded. Grow organically.

.

.

.

Your Next Step

Here’s my challenge for you:

Open your most active Claude Code project. Look at your CLAUDE.md file (or create one if it doesn’t exist). And ask yourself one question:

What instruction am I going to remove today?

Not add. Remove.

Find the vague rule that accomplishes nothing. Find the workaround for a model behavior that’s been fixed. Find the code snippet that’s just bloating your context. Find the formatting instruction that ESLint already handles.

Delete it.

Every line should earn its place.

10 min read The Art of Vibe Coding

I Found a Better Way to Design Pages in Claude Code (And I’m a Little Mad I Didn’t Know Sooner)

Watch the video walkthrough, or read the full written guide below.

You’re tweaking a landing page in Claude Code.

“Make the layout more balanced,” you type—feeling pretty clever about your prompt, if we’re being honest.

Claude adjusts the CSS. You refresh the browser.

Hmm. Not quite right.

“Actually, make the image section wider.”

Claude dutifully changes the grid. You refresh again.

Still off. (Why is this so hard?)

“Can you try pill-shaped buttons instead?”

And now you’re three iterations deep, squinting at your screen, no longer entirely sure what “right” even looks like.

Here’s the thing: I spent an embarrassing amount of time in this loop before discovering there’s a plugin that makes the whole rigmarole unnecessary.

It’s called the Claude Code Playground skill. And it changes everything about how you approach design work.

Stay with me.

.

.

.

The Describe-Refresh-Despair Loop (A Love Story Gone Wrong)

Let’s be real about what’s happening here.

You have a vision in your head. A feeling of what the page should look like. Something about the proportions, the spacing, the way elements breathe together.

But translating that feeling into words?

That’s where the wheels come off the wagon.

The loop goes something like this:

  1. Describe a change in words (that you hope Claude interprets correctly)
  2. Wait for Claude to apply it
  3. Refresh the page
  4. Realize it’s not quite what you meant
  5. Try different words—maybe “airy” instead of “spacious”?
  6. Repeat 8-12 times
  7. Eventually settle for “close enough” while muttering under your breath

I burned way too many hours in this loop last week, redesigning a page. Every iteration felt like playing telephone with my own design instincts.

(Spoiler: I was the one garbling the message.)

The problem isn’t Claude. The problem is that words are a terrible interface for visual decisions.

.

.

.

Enter the Claude Code Playground Skill: Your New Best Friend

Anthropic built an official plugin—the Playground skill—that adds something radical to Claude Code:

A visual layer between your brain and your codebase.

Here’s how it works: Claude analyzes your existing page, then generates a self-contained HTML file with sliders, dropdowns, and presets that let you see different design directions instantly.

No code changes. No refreshing. No “what I said vs. what I meant” shenanigans.

Just a live preview that updates as you click.

Once you’ve dialed in the exact design you want—with your actual eyeballs—the playground generates a natural language prompt describing your choices.

Copy. Paste. Execute.

One pass. Done.

👉 The key insight: You’re no longer translating visual intuition into words. The playground does that translation for you.

.

.

.

The Real-World Test: Redesigning a WooCommerce Product Page

Enough theory. Let me show you exactly how this worked on my LicenseWP product page.

The existing page was… fine. Functional. The kind of “fine” that makes you wince slightly every time you look at it.

Alt: The original LicenseWP product page showing Theme Pro v2 with a gradient image placeholder on the left, three pricing tier cards (Agency, Business, Personal) stacked vertically on the right, and description tabs below

The image area felt cramped. The pricing cards looked squished together. The “Add to cart” button was doing its best impression of a wallflower at a party.

I wanted to experiment with different proportions, card styles, and CTA treatments.

The old me would have spent an hour going back-and-forth with Claude in text prompts.

The new me? Installed the Playground skill.

.

.

.

Step 0: Install the Plugin (30 Seconds, Tops)

First things first—you need the Claude Code Playground skill installed.

Run /plugin in Claude Code, switch to the Discover tab, and search for “playground.”

Alt: Claude Code's plugin discovery interface showing a search for "playground" with the official plugin by claude-plugins-official listed below, showing 11.4K installs and description "Creates interactive HTML playgrounds"

Hit space to toggle it on.

That’s it. Claude now knows how to build interactive design playgrounds. (11.4K installs can’t be wrong, right?)

.

.

.

Step 1: Send Claude to Do Reconnaissance

Here’s where the magic starts.

I gave Claude a prompt telling it to visit my product page, study the layout like an art critic at a museum, and then build a Design Layout Playground based on what it found.

Alt: The detailed prompt entered into Claude Code instructing it to visit the product page URL, study the layout structure, and build an interactive Design Layout Playground using the playground skill with specifications for presets, preview panel, output location, and prompt generation

Claude opens Chrome and starts analyzing—like a very thorough house inspector, but for web pages.

Alt: Claude Code using Chrome browser tools to analyze the product page, showing multiple tool calls: tabs_context, read_page, get_page_text, and several javascript_tool executions to extract HTML structure

It reads the DOM, extracts text content, and runs JavaScript to pull the full HTML structure.

Alt: Claude Code taking a screenshot of the page and scrolling down to capture the tabs section and footer, building a complete mental model of the layout

Screenshots. Scrolling. Full page analysis from header to footer.

(Claude is nothing if not thorough.)

.

.

.

Step 2: Claude Builds You a Custom Design Tool

Once Claude understands your page, it loads the Playground skill and reads the design template.

Alt: Claude Code loading the playground skill successfully and reading the design-playground template file from the plugins cache directory

Then—and here’s the part that made me do a little happy dance—it creates a single self-contained HTML file with everything baked in.

Alt: Claude Code creating the notes/playground directory and writing 1253 lines to product-page-design.html, showing the beginning of the HTML file with doctype, head section, and CSS reset

1,253 lines. A complete interactive design tool, built specifically for your page, in about 30 seconds.

Alt: VS Code file explorer showing the notes folder containing the playground subfolder with product-page-design.html file marked as untracked

Here’s what Claude built for me:

Alt: Claude Code's summary showing the playground includes controls for Page Layout, Gallery, Typography, Tier Cards, CTA Button, Tabs, and Color Theme, plus 5 presets: Current Design, Clean Editorial, Bold SaaS, Compact & Dense, and Premium Showcase
  • 5 presets — each one a cohesive design direction
  • 7 control groups — layout, gallery, typography, cards, buttons, tabs, colors
  • Live preview — using my actual page content (real product names, real prices, real structure)

Ferpetesake, this is exactly what I needed.

.

.

.

Step 3: Play With Designs Like a Kid With New LEGOs

Open the HTML file in your browser.

And then—I’m not going to lie—prepare to lose 20 minutes just playing.

Alt: The Product Page Design Playground interface showing a left panel with controls for Page Layout (max width, grid split, gap, spacing), Gallery (style, radius, aspect ratio), and Typography (title size, weight, description style), with a live preview of the product page on the right

The left panel has all the controls. The right panel shows a live preview that updates instantly as you change anything.

(I may have spent an unreasonable amount of time just clicking the “Gallery Style” options back and forth. Don’t judge me.)

Alt: Scrolled down view of the playground showing additional controls for Tier Cards (layout, card style, radius, highlight, badge), CTA Button (style, width, size), Tabs Section (style, alignment), and Color Theme (primary accent swatches and surface treatment)

I started by clicking through the presets to find a direction:

  • Bold SaaS — too aggressive for this product
  • Compact & Dense — too cramped (we’re selling premium themes, not packing a suitcase)
  • Clean Editorial — closer! But needed tweaks

Then I fine-tuned individual controls:

  • Grid split: 50/50 (equal width for image and details)
  • Gallery style: Framed (light background with border instead of that gradient)
  • Gallery radius: 24px (rounder corners, friendlier vibe)
  • CTA button: Pill-shaped, full-width, large
  • Tabs: Pill style, stretched across the full width

Every single change reflected instantly in the preview.

👉 Here’s what hit me: I wasn’t describing what I wanted anymore. I was seeing it. And clicking until it looked right.

.

.

.

Step 4: Copy the Magic Prompt

Once the design felt right, I scrolled to the bottom of the playground.

And there it was.

Alt: The playground with controls adjusted showing the generated prompt output at the bottom in a highlighted box listing all design changes: product grid split 50/50, section spacing 64px, gallery style framed, border radius 24px, title size 36px, etc., with a Copy Prompt button on the right

The Prompt Output panel had already written a clear instruction describing exactly what I chose—and only what differed from the defaults.

Redesign the product single page at http://localhost:8107/product/theme-pro-tc011/ 
with the following design changes:

- product grid split: 50/50
- section spacing: 64px
- gallery style: framed
- gallery border radius: 24px
- product title size: 36px
- tier card border radius: 8px
- selected tier highlight: border-only
- CTA button style: pill-shaped
- CTA button width: full-width
- CTA button size: large
- tab style: pill tabs
- tab alignment: stretch

No ambiguity. No “make it more modern” nonsense.

Just precise specifications that Claude can execute without guessing.
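The "only what differed from the defaults" behavior is simple to reason about. Here is a minimal sketch of the idea. It is my own illustration, not the playground's actual code: `build_prompt` is a hypothetical name, and the default values are assumptions based on the original page (a 40/60 grid, gradient gallery, underline tabs) plus one made-up unchanged setting.

```python
def build_prompt(page_url: str, defaults: dict[str, str], chosen: dict[str, str]) -> str:
    """Describe only the settings whose chosen value differs from the default."""
    changes = [
        f"- {name}: {value}"
        for name, value in chosen.items()
        if defaults.get(name) != value
    ]
    header = (
        f"Redesign the product single page at {page_url}\n"
        "with the following design changes:\n"
    )
    return header + "\n" + "\n".join(changes)


# Assumed defaults; "page max width" is a hypothetical control left untouched.
defaults = {
    "product grid split": "40/60",
    "gallery style": "gradient",
    "tab style": "underline",
    "page max width": "1200px",
}
chosen = {
    "product grid split": "50/50",
    "gallery style": "framed",
    "tab style": "pill tabs",
    "page max width": "1200px",
}

print(build_prompt("http://localhost:8107/product/theme-pro-tc011/", defaults, chosen))
```

Because the untouched setting never reaches the prompt, Claude has no reason to rewrite CSS that was already fine.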

One click on Copy Prompt.

.

.

.

Step 5: Let Claude Do Its Thing

Back in Claude Code. Paste.

Alt: Claude Code receiving the pasted prompt and entering plan mode, exploring the product page CSS and templates by running find commands in the woocommerce directory

Claude immediately recognizes the design instructions and enters plan mode to explore the codebase.

Alt: Claude Code reading CSS files, checking CSS variables, and examining the single product template structure, then entering a "thinking" state for 2 minutes 17 seconds to write the plan

33 tool uses. 101k tokens. 2 minutes of thinking.

Claude reads every relevant CSS file, understands the variable system, and maps out exactly what needs to change.

Then it presents the plan:

Alt: Claude Code's detailed plan titled "Redesign Product Single Page" showing the context, file to modify (assets/css/frontend.css), and specific CSS changes including grid-template-columns changing from 2fr 3fr to 1fr 1fr, margin-bottom adjustments, and gallery style changes from gradient to framed with exact code snippets

Every change mapped out. Exact line numbers. Before/after CSS.

(Is it weird that I find this deeply satisfying? Don’t answer that.)

Alt: Continuation of the plan showing dark mode considerations (existing CSS variables handle it automatically), verification steps for testing the changes visually, and the execution prompt asking whether to proceed with options to clear context, auto-accept edits, manually approve, or modify the plan

Verification steps included. Dark mode considerations. Mobile responsiveness checks.

I approved. Claude started executing.

Alt: Claude Code applying CSS edits to frontend.css, showing a diff view with red lines indicating removed code (grid-template-columns: 2fr 3fr, old margin-bottom, linear-gradient background) and green lines showing the new values (1fr 1fr grid, updated spacing, solid background with border)

9 CSS edits. All applied.

Alt: Claude Code's completion summary confirming all 9 CSS changes were applied correctly, listing each change: grid split 1fr 1fr, section spacing, gallery styling with border and radius, title size, tier cards, CTA button, and tabs, with a note that dark mode works automatically

And… done.

.

.

.

The Result (AKA: The Part Where I Do a Victory Lap)

Here’s the final redesigned product page:

Alt: The redesigned LicenseWP product page showing the new 50/50 grid layout with a framed gallery area on the left, equal-width pricing section on the right with the same three tiers, a full-width pill-shaped purple Add to cart button, and stretched pill-style tabs below for Description, Additional information, and Reviews

  • Equal-width layout. The image and product details now have balanced visual weight.
  • Framed gallery. Clean border instead of that dated gradient background.
  • Full-width CTA. The “Add to cart” button finally commands the attention it deserves.
  • Pill tabs. Stretched across the full width with a modern, cohesive feel.

The design matches what I saw in the playground preview—applied in a single pass.

HECK YES.

.

.

.

The Before & After (Because We All Love a Good Transformation)

| Before | After |
| --- | --- |
| 40/60 grid split (cramped image) | 50/50 split (balanced) |
| Gradient gallery background | Framed with border |
| Small inline CTA button | Full-width pill CTA |
| Underline tabs | Stretched pill tabs |
| 10+ prompt iterations | 1 pass |
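If the "40/60" and "50/50" labels feel disconnected from the CSS, the arithmetic is short. Claude's plan changed grid-template-columns from 2fr 3fr to 1fr 1fr, and each fr value is a share of the total. A small sketch (the helper name `fr_to_percent` is mine, and it ignores the grid gap for simplicity):

```python
def fr_to_percent(*fractions: float) -> list[float]:
    """Convert CSS grid fr values into each column's share of the row, in percent."""
    total = sum(fractions)
    return [round(100 * fraction / total, 1) for fraction in fractions]


print(fr_to_percent(2, 3))  # [40.0, 60.0]  the original layout
print(fr_to_percent(1, 1))  # [50.0, 50.0]  the redesigned layout
```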

The old workflow would’ve taken an hour of back-and-forth (and probably some mild frustration-snacking).

This took 15 minutes—and honestly, most of that was me playing with the controls because it was genuinely fun.

.

.

.

When the Claude Code Playground Skill Really Shines

👉 Redesigning existing pages. You already have something. You want to explore variations without breaking it.

👉 Client projects. Preview before you commit. Show options before you build. (Clients LOVE this, by the way.)

👉 Design indecision. When you don’t know what you want—and let’s be honest, that’s more often than we’d like to admit—clicking through presets beats describing in words.

👉 Reducing the prompt iteration loop. One visual session replaces 10+ text-based rounds of “no, more like… actually less like that… wait, go back.”

The playground acts as a translation layer between your visual intuition and Claude’s execution capabilities. You figure out what “right” looks like with your eyes, then communicate that with precision.

The Prompt Template (Steal This)

Here’s the full prompt I use. Copy it. Adapt it. Make it yours.

First, use the browser to visit and read this page: [YOUR_PAGE_URL]

Study the page's current layout structure, section hierarchy, component patterns, 
and overall visual design. Take note of how content is organized and what 
elements are present.

Then, use the "playground" skill (design-playground template) to build an 
interactive Design Layout Playground based on what you found on that page.

The playground should let me visually explore different layout and component 
style combinations for that page.

## Presets
Include 3–5 named presets that snap all controls to a cohesive combination, 
inspired by what would work well for the page's content. For example:
- "Clean Editorial" — airy spacing, narrow content width, minimal components
- "Bold & Modern" — full-width hero, elevated cards, bold CTAs
- "Compact Dashboard" — tight spacing, grid cards, minimal chrome
- Adapt these to fit the actual content and purpose of the page

## Preview
- Single live preview panel that updates instantly on every control change
- The preview should use a simplified but recognizable representation of the 
  actual page content (use real section names, headings, and placeholder text 
  that matches the page structure)
- Use raw CSS (no Tailwind or frameworks)

## Output Location
- Save the playground HTML file to `notes/playground/` folder (create it if 
  it doesn't exist)

## Prompt Output
- Generate a natural language instruction at the bottom that I can copy and 
  paste back into Claude to implement the chosen design
- The prompt should describe the layout and component decisions in enough 
  detail to be actionable without the playground
- Only mention choices that differ from the defaults
- Frame it as a direction, e.g.: "Redesign the page with a full-width hero 
  section, 3-column card grid with elevated shadows and 16px gap, airy 
  section spacing (64px), pill-shaped CTAs positioned inline..."
- Include the source page URL in the generated prompt for context

Replace [YOUR_PAGE_URL] with whatever page you want to redesign.

.

.

.

Your Turn

Next time you’re about to type “make it more modern” or “adjust the spacing” or “try a different card style”—stop.

Build a playground first.

Let your eyes make the decisions. Let the Claude Code Playground skill translate those decisions into words. Let Claude execute them precisely.

What page are you going to redesign with this workflow?

Go install the plugin. Run /plugin, search “Playground”, toggle it on.

Now.

(And maybe clear your schedule. Because once you start playing with those sliders, you might lose track of time.)

9 min read The Art of Vibe Coding

I Showed You the Wrong Way to Do Claude Code Testing. Let Me Fix That.

Last week, I walked you through browser testing with Claude Code using the Ralph loop plugin.

I was pretty proud of it, actually.

Here’s the thing: I was wrong.

Well, not entirely wrong. The tests ran. Things got verified. But what I showed you? That wasn’t a true Ralph loop—not the way Geoffrey Huntley originally designed it. And the difference matters more than I realized at the time.

(Stay with me here. This confession has a happy ending.)

.

.

.

The Problem I Didn’t See Coming

The real Ralph loop is supposed to wipe memory clean at the start of each iteration. No leftover context. No accumulated baggage. Just a fresh, focused agent tackling one task at a time.

The Ralph loop plugin from Claude Code’s official marketplace? It preserves context from the previous loop. The plugin relies on a stop hook to end and restart each iteration—but the conversation history tags along for the ride.

And that’s where everything quietly falls apart.

Here’s what this actually looks like in practice:

Imagine you’re setting out on a multi-day hiking trip. Every morning, you pack your backpack for that day’s trail.

Now imagine that instead of emptying your pack each night, you just… keep adding to it. Day one’s water bottles. Day two’s snacks. Day three’s rain gear (even though it’s sunny now). By day five, you’re hauling 40 pounds of stuff you don’t need, and you can barely focus on the trail in front of you.

That’s context rot.

It happens when an AI model’s performance degrades because its context window gets bloated with accumulated information from previous tasks. The more history your agent carries forward, the harder it becomes for the model to stay sharp on what actually matters right now.

👉 The takeaway: Fresh context isn’t a nice-to-have. It’s the whole point.

.

.

.

What Context Rot Actually Looks Like

Let me make this concrete with Claude Code testing:

Iteration 1: Claude runs test TC-001. Context is clean. Performance is sharp. The backpack is light.

Iteration 5: Claude runs test TC-005. But it’s also dragging along memories of TC-001 through TC-004. The pack is getting heavy.

Iteration 15: Claude runs test TC-015. The model is now swimming through accumulated history, trying to find what actually matters among all the gear from previous days.

Iteration 25: Claude runs test TC-025. Performance has degraded. The model makes weird mistakes. It forgets what it was supposed to verify—because it’s exhausted from carrying everyone else’s context.

Same trail. Same agent. Completely different performance.

And here’s the frustrating part: you might not even notice it happening. The tests still run. They just run… worse. Slower. Less reliably. With occasional bizarre failures that make you question your own test plan.
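To put rough numbers on the backpack, here is a toy model. It is illustrative only: the 5,000 tokens per test is a made-up figure, and real context growth depends on tool output, compaction, and how the loop is implemented.

```python
def context_sizes(iterations: int, tokens_per_test: int, fresh: bool) -> list[int]:
    """Context size at each iteration, with or without carried-over history."""
    sizes: list[int] = []
    carried = 0
    for _ in range(iterations):
        # A fresh agent starts from zero; otherwise history piles up.
        current = tokens_per_test if fresh else carried + tokens_per_test
        sizes.append(current)
        carried = current
    return sizes


accumulated = context_sizes(25, tokens_per_test=5_000, fresh=False)
clean = context_sizes(25, tokens_per_test=5_000, fresh=True)

print(accumulated[0], accumulated[24])  # 5000 125000
print(clean[0], clean[24])              # 5000 5000
```

Under these assumptions, iteration 25 works with 25 times the context of iteration 1 when history is carried forward, and the same context when it is not.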

.

.

.

The Solution That Was Already There

So I went looking for a better approach to Claude Code testing—something that would give me the clean-slate benefits of a proper Ralph loop without the context accumulation problem.

And I found it in a tool I’d been using for something else entirely: Claude Code’s task management system.

Here’s where it gets interesting.

The task management system gives you the same effect as a properly implemented Ralph loop—but with something the Ralph loop never had: dependency management.

Think back to the hiking metaphor.

Each sub-agent is like a fresh hiker starting a new day with an empty pack. They get their assignment, they complete their section of trail, they report back. Then the next hiker takes over with their own empty pack.

  • No accumulated gear.
  • No context rot.
  • No performance degradation over time.

But here’s the bonus: the task management system also handles situations where “Day 3’s trail can’t start until Day 2’s bridge gets built.” Dependencies get tracked automatically. Tests that need prerequisites don’t run until those prerequisites pass.

(Is that two features in one? Well, is a package of two Reese’s Peanut Butter Cups two candy bars? I say it counts as one delicious solution.)

.

.

.

How Claude Code Testing Actually Works With Task Management

Let me show you exactly how to set this up.

Fair warning: there are a lot of screenshots coming. But I promise each one shows something important about the workflow.

Step 1: Save the Prompt as a Command

First, store the entire testing prompt as a command file. This makes triggering your Claude Code testing workflow trivially easy—just a slash command away.

Claude Code project structure showing .claude/commands folder containing run_test_plan.md file, with a skills folder below it

The full prompt (I’ll include it at the end—it’s long but worth having) tells Claude exactly how to read your test plan, create tasks, set dependencies, execute tests sequentially, and track results.

Step 2: Trigger the Command

With the command saved, execution is just:

Claude Code terminal showing the /run_test_plan command being typed, ready for execution

That’s it. Type /run_test_plan and let the system take over.

Step 3: Claude Reads Your Specs

Since we’re starting fresh—no memory of previous execution—Claude first reads your original specs, test plan, and implementation plan to understand the context.

Claude Code output showing it checking for existing test runs and reading 3 files including the test plan

(Remember: empty backpack. The agent needs to load up on just what it needs for this journey.)

Step 4: Claude Creates the Tasks

After understanding the context, Claude creates one task per test case. Watch how it automatically detects dependencies:

Claude Code creating 30 test tasks with dependency analysis showing TC-004/005/006/008 depend on TC-003, TC-016/017 depend on TC-014, and other dependency chains

See that dependency analysis?

  • TC-004, TC-005, TC-006, TC-008 depend on TC-003 (password field must exist first)
  • TC-016, TC-017 depend on TC-014 (categories must exist first)
  • TC-019 through TC-023 depend on TC-018 (priority dropdown must exist first)
  • TC-029 depends on TC-027 (accent color must be saved first)

The system figured this out by reading the test plan. No manual configuration required.

Step 5: Dependencies Get Locked In

All 30 tasks created.

Now Claude sets up the dependencies and verifies everything:

Claude Code showing all 30 tasks created with dependencies being set up, updating test-status.json with start timestamp

Step 6: Test Status File Created

Claude creates a test-status.json file to track everything—machine-readable, resumable, and audit-friendly:

Claude Code writing 260 lines to notes/test-status.json, showing metadata structure with testPlanSource, totalIterations, maxIterations, startedAt, and summary counts

The execution order is now crystal clear:

  1. Unblocked tasks first: TC-001, TC-002, TC-003, TC-007, TC-009, TC-010, TC-011, TC-012, TC-013, TC-014, TC-015, TC-018, TC-024
  2. Tasks blocked by TC-003: TC-004, TC-005, TC-006, TC-008
  3. Tasks blocked by TC-014: TC-016, TC-017
  4. Tasks blocked by TC-018: TC-019, TC-020, TC-021, TC-022, TC-023
  5. Tasks blocked by TC-027: TC-029
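The same ordering can be reproduced with a plain dependency graph. The sketch below uses Python's standard graphlib module and the dependencies listed above. It illustrates the concept and is not how Claude Code's task system is implemented; numbering the 30 tests TC-001 through TC-030 is my assumption.

```python
from graphlib import TopologicalSorter

# Each test maps to the set of tests that must pass first.
deps: dict[str, set[str]] = {
    "TC-004": {"TC-003"}, "TC-005": {"TC-003"},
    "TC-006": {"TC-003"}, "TC-008": {"TC-003"},
    "TC-016": {"TC-014"}, "TC-017": {"TC-014"},
    **{f"TC-{n:03d}": {"TC-018"} for n in range(19, 24)},
    "TC-029": {"TC-027"},
}
all_ids = [f"TC-{n:03d}" for n in range(1, 31)]  # assumed numbering

sorter = TopologicalSorter({tc: deps.get(tc, set()) for tc in all_ids})
sorter.prepare()

# Tests with no prerequisites can run straight away.
initially_ready = sorted(sorter.get_ready())
print(len(initially_ready), "tests unblocked at the start")

# Marking TC-003 as passed releases the tests that were waiting on it.
sorter.done("TC-003")
newly_ready = sorted(sorter.get_ready())
print(newly_ready)
# ['TC-004', 'TC-005', 'TC-006', 'TC-008']
```

A blocked test never appears in the ready set until its prerequisite is marked done.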

Step 7: First Task Begins

Here’s where the magic happens.

Claude spawns a sub-agent—with fresh context—to execute TC-001:

Claude Code starting execution with TC-001, spawning a Task sub-agent with the instruction "You are a test execution sub-agent. You have ONE job: execute and verify ONE test case."

“You are a test execution sub-agent. You have ONE job: execute and verify ONE test case.”

That’s the key instruction. Fresh hiker. Empty backpack. Single trail.

Step 8: Browser Automation for Testing

The sub-agent uses Claude Code’s browser automation to test like a real user would:

Claude Code showing Chrome browser automation (javascript_tool) with "View Tab" option, indicating 52+ tool uses for the testing process

It navigates to URLs, clicks buttons, fills forms, takes screenshots at verification points, and checks the DOM state against expected outcomes.

Real browser. Real interactions. Real Claude Code testing.

Step 9: Test Status Gets Updated

After completing a test, the sub-agent updates the status file:

Claude Code updating notes/test-status.json after TC-001 execution, showing 30 tasks with 1 done and 29 open, with VS Code diff view

Step 10: Human-Readable Results Too

The results also get appended to a markdown log for human review:

Claude Code writing to notes/test-results.md after updating test-status.json, showing the dual logging system for machine and human readability

Every test gets logged in two places:

  • test-status.json for machine parsing
  • test-results.md for human review

(Because sometimes you want to query the data programmatically, and sometimes you just want to read what happened over coffee. Both are valid.)
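For the programmatic side, here is a minimal sketch of querying the status file. The schema is an assumption: I use a top-level "tests" list with "id", "status", and "fixAttempts" fields, and the value "pass" for a passed test. Check your own test-status.json for the real field names before relying on it.

```python
import json
from collections import Counter


def summarize(status_json: str) -> dict[str, int]:
    """Count tests per status in a (hypothetically shaped) status document."""
    data = json.loads(status_json)
    return dict(Counter(test["status"] for test in data["tests"]))


def needs_attention(status_json: str, max_attempts: int = 3) -> list[str]:
    """Return IDs of failed tests that have used up their fix attempts."""
    data = json.loads(status_json)
    return [
        test["id"]
        for test in data["tests"]
        if test["status"] == "fail" and test.get("fixAttempts", 0) >= max_attempts
    ]


# Made-up sample data in the assumed shape.
sample = json.dumps({
    "metadata": {"testPlanSource": "notes/test_plan.md"},
    "tests": [
        {"id": "TC-001", "status": "pass", "fixAttempts": 0},
        {"id": "TC-002", "status": "pass", "fixAttempts": 1},
        {"id": "TC-003", "status": "pending", "fixAttempts": 0},
        {"id": "TC-004", "status": "fail", "fixAttempts": 3},
    ],
})

print(summarize(sample))        # {'pass': 2, 'pending': 1, 'fail': 1}
print(needs_attention(sample))  # ['TC-004']
```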

Step 11: Automatic Progression to Next Task

Once TC-001 completes, Claude automatically moves to TC-002:

Claude Code showing TC-001 passed (Done with 36 tool uses, 48.1k tokens, 4m 19s), then spawning a new sub-agent for TC-002 with fresh context

Look at those stats: 36 tool uses, 48.1k tokens, 4 minutes 19 seconds for TC-001.

Then a completely fresh sub-agent spawns for TC-002. New hiker. New backpack. No accumulated context from TC-001.

Step 12: Bugs Found? Claude Fixes Them.

TC-002 found a bug. Here’s what happened:

Claude Code showing TC-002 passed after 1 fix (Quick Setup wasn't saving color options - fixed in WizardAjax.php), then moving to TC-003 which is Critical and unblocks 4 other tests

“TC-002 passed after 1 fix (Quick Setup wasn’t saving color options — now fixed in WizardAjax.php).”

The sub-agent detected the failure, analyzed the root cause, implemented a fix, and re-ran the test. All autonomously. All within the same fresh context.
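The test, fix, re-test cycle boils down to a bounded retry loop. Here is a sketch of that control flow with a cap of three fix attempts. The two callables are hypothetical stand-ins for what the sub-agent does in the browser and in the codebase, not real Claude Code APIs.

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 3


def run_with_fixes(
    run_test: Callable[[], bool],
    attempt_fix: Callable[[], None],
) -> tuple[str, int]:
    """Run a test, fixing and re-running until it passes or the cap is hit."""
    fix_attempts = 0
    while True:
        if run_test():
            return "pass", fix_attempts
        if fix_attempts >= MAX_FIX_ATTEMPTS:
            return "fail", fix_attempts
        attempt_fix()
        fix_attempts += 1


# Simulate TC-002: fails once, passes after a single fix.
state = {"fixed": False}


def fake_test() -> bool:
    return state["fixed"]


def fake_fix() -> None:
    state["fixed"] = True


print(run_with_fixes(fake_test, fake_fix))  # ('pass', 1)
```

The cap matters: without it, a test that can never pass would keep a sub-agent busy forever.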

Step 13: Dependencies Unlock Automatically

Now watch the dependency system in action.

Once TC-003 passes:

Claude Code showing TC-003 passed, announcing that TC-004, TC-005, TC-006, TC-008 are now unblocked, then moving to TC-007

“TC-003 passed. Now TC-004, TC-005, TC-006, TC-008 are unblocked.”

The password field exists now. All the tests that depend on it can finally run.

👉 This is why dependencies matter: They prevent tests from running before their preconditions are met—avoiding the exact conflicts where one agent messes with something another agent needs.

Steps 14-15: The Marathon Continues

It keeps going. Test after test. Each sub-agent fresh and focused:

Claude Code showing a sequence of passed tests: TC-007, TC-004, TC-006, TC-005, TC-008, TC-009, TC-010, each with tool usage stats and completion times
Claude Code showing later tests completing: TC-023 through TC-030, including TC-029 being unblocked after TC-027, with all tests passing

Every test runs sequentially. Every sub-agent gets clean context. Every dependency is respected. No context rot in sight.

Step 16: All Tests Complete

After 2 hours and 12 minutes:

Claude Code showing final test results being written to test-results.md, displaying a summary table with all 30 tests passed, including TC-001 through TC-012 with their priorities and fix attempts

30 tests. All passed. Zero known issues.

Step 17: The Full Summary

The orchestrator writes a comprehensive summary:

Here’s what got verified:

  • All 6 Critical tests passed (password handling, priority validation, accent color persistence)
  • Server-side validation confirmed working (urgent priority rejected, invalid hex rejected, password never stored in wp_options)
  • UI behaviors verified (notice dismiss, auto-hide, field error clearing, color sync)
  • Accessibility attributes verified on priority dropdown

And that bug that got fixed? handleQuickSetup() in WizardAjax.php wasn’t saving desq_primary_color or desq_accent_color options. Found during TC-002. Fixed autonomously.

.

.

.

Why This Actually Works Better

Let me be direct about the comparison:

| Aspect | Ralph Loop Plugin | Task Management System |
|--------|-------------------|------------------------|
| Context | Preserves across iterations | Fresh per sub-agent |
| Dependencies | None | Built-in blocking |
| Parallel Safety | Risky | Sequential by default |
| State Tracking | Basic stop hook | JSON + Markdown logs |
| Bug Fixing | Manual | Automatic (up to 3 attempts) |
| Resumability | Limited | Full state recovery |

The Ralph loop was supposed to start each iteration with a clean slate. The task management system actually delivers on that promise—and adds dependency management that prevents tests from stepping on each other.
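
The "up to 3 attempts" row is the easiest one to reason about in code. Here is a rough sketch of that test-fix-retest cycle, assuming two hypothetical callables: `run_test` returns `True` on a pass, and `apply_fix` stands in for whatever the sub-agent changes in the codebase. The result values follow the `pass` / `known_issue` naming used in the prompt below.

```python
MAX_FIX_ATTEMPTS = 3


def resolve_test(run_test, apply_fix):
    """Run a test, fixing and re-running until it passes or attempts run out."""
    fix_attempts = 0
    while True:
        if run_test():
            return {"result": "pass", "fixAttempts": fix_attempts}
        if fix_attempts >= MAX_FIX_ATTEMPTS:
            # Out of attempts: stop fixing and record it for a human.
            return {"result": "known_issue", "fixAttempts": fix_attempts}
        apply_fix()
        fix_attempts += 1


# Toy example: the "bug" disappears after two fixes.
state = {"fixes": 0}


def run_test():
    return state["fixes"] >= 2


def apply_fix():
    state["fixes"] += 1


outcome = resolve_test(run_test, apply_fix)
print(outcome)  # {'result': 'pass', 'fixAttempts': 2}
```

A test that never passes comes back as `known_issue` with `fixAttempts: 3`, so the loop always terminates.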

.

.

.

The Full Prompt (Copy This)

Here’s the complete command file to drop into .claude/commands/run_test_plan.md:

PROMPT: Execute Test Plan Using Claude Code Task Management System
We are executing the test plan. All implementation is complete. Now we verify it works.

## Reference Documents

- **Test Plan:** `notes/test_plan.md`
- **Implementation Plan:** `notes/impl_plan.md`
- **Specs:** `notes/specs.md`
- **Test Status JSON:** `notes/test-status.json`
- **Test Results Log:** `notes/test-results.md`

---

## Phase 1: Initialize

### Step 1: Check for Existing Run (Resumption)

Before creating anything, check if a previous test run exists:

1. Check if `notes/test-status.json` exists
2. Check if there are existing tasks via `TaskList`

**If both exist and tasks have results:**
- This is a **resumed run** — skip to Phase 2 (Step 7)
- Announce: "Resuming previous test run. Skipping already-passed tests."
- Only execute tasks that are still `pending` or `fail` (with fixAttempts < 3)

**If no previous run exists (or files are missing):**
- Continue with fresh initialization below

### Step 2: Read the Test Plan

Read `notes/test_plan.md` and extract ALL test cases. Auto-detect the TC-ID pattern used (e.g., `TC-001`, `TC-101`, `TC-5A`, etc.).

For each test case, note:

- TC ID
- Name
- Priority (Critical / High / Medium / Low — default to Medium if not stated)
- Preconditions
- Test steps and expected outcomes
- Test data (if any)
- Dependencies on other test cases (if any)

### Step 3: Analyze Test Dependencies

Determine which test cases depend on others. Common dependency patterns:

- A "saves data" test may depend on a "displays default" test
- A "form submission" test may depend on "form validation" tests
- An "end-to-end" test may depend on individual component tests

If no clear dependencies exist between test cases, treat them all as independent.

### Step 4: Create Tasks

Use `TaskCreate` to create one task per test case. Set `blocked_by` based on the dependency analysis.

**Task description format:**

```
Test [TC-ID]: [Test Name]
Priority: [Priority]

Preconditions:
- [Required state before test]

Steps:
| Step | Action | Expected Result |
|------|--------|-----------------|
| 1 | [Action] | [Result] |
| 2 | [Action] | [Result] |

Test Data:
- [Field]: [Value]

Expected Outcome: [Final verification]

Environment:
- Refer to CLAUDE.md for wp-env details, URLs, and credentials
- WordPress site: http://localhost:8105
- Admin: http://localhost:8105/wp-admin (admin/password)

---
fixAttempts: 0
result: pending
lastTestedAt: null
notes:
```

### Step 5: Generate Test Status JSON

Create `notes/test-status.json`:

```json
{
    "metadata": {
        "testPlanSource": "notes/test_plan.md",
        "totalIterations": 0,
        "maxIterations": 50,
        "startedAt": null,
        "lastUpdatedAt": null,
        "summary": {
            "total": "<count>",
            "pending": "<count>",
            "pass": 0,
            "fail": 0,
            "knownIssue": 0
        }
    },
    "testCases": {
        "<TC-ID>": {
            "name": "Test case name",
            "priority": "Critical|High|Medium|Low",
            "status": "pending",
            "fixAttempts": 0,
            "notes": "",
            "lastTestedAt": null
        }
    },
    "knownIssues": []
}
```

### Step 6: Initialize Test Results Log

Create `notes/test-results.md`:

```markdown
# Test Results

**Test Plan:** notes/test_plan.md
**Started:** [CURRENT_TIMESTAMP]

## Execution Log
```

### Verify Initialization

Use `TaskList` to confirm:
- All TC-IDs from the test plan have a corresponding task
- Dependencies are correctly set via `blocked_by`
- All tasks show `result: pending`

Cross-check task count matches `summary.total` in `notes/test-status.json`.

---

## Phase 2: Execute Tests

### Step 7: Determine Execution Order

Use `TaskList` to read all tasks and their `blocked_by` fields. Determine sequential execution order:

1. Tasks with no `blocked_by` entries (or whose dependencies are already resolved) come first
2. Tasks that depend on those come next, once everything in their `blocked_by` appears earlier in the order
3. Continue until all tasks are ordered

**For resumed runs:** Skip tasks where `result` is already `pass` or `known_issue`.

### Step 8: Execute One Task at a Time

For the next eligible task, spawn ONE sub-agent with the instructions below.

**One sub-agent at a time. Do NOT spawn multiple sub-agents in parallel.**

---

#### Sub-Agent Instructions

**You are a test execution sub-agent. You have ONE job: execute and verify ONE test case.**

1. **Read your task** using `TaskGet` to get the full description
2. **Parse the test steps** from the description (everything above the `---` separator)
3. **Parse the metadata** from below the `---` separator
4. **Read CLAUDE.md** for environment details, URLs, and credentials

5. **Execute the test:**

    Using browser automation:
    - Navigate to URLs specified in the test steps
    - Click buttons/links as described
    - Fill form inputs with the test data provided
    - Take screenshots at key verification points
    - Read console logs for errors
    - Verify DOM state matches expected outcomes

    Follow the test plan steps EXACTLY. Do not skip steps.

6. **Determine the result:**

    **PASS** if:
    - All expected outcomes verified
    - No unexpected console errors
    - UI state matches test plan

    **FAIL** if:
    - Any expected outcome not met
    - Unexpected errors
    - UI state doesn't match

7. **If PASS:** Update the task description metadata via `TaskUpdate`:

    ```
    ---
    fixAttempts: 0
    result: pass
    lastTestedAt: [CURRENT_TIMESTAMP]
    notes: [Brief description of what was verified]
    ```

    Mark the task as `completed`.

8. **If FAIL and fixAttempts < 3:**

    a. Analyze the root cause
    b. Implement a fix in the codebase
    c. Increment fixAttempts and update via `TaskUpdate`:

    ```
    ---
    fixAttempts: [previous + 1]
    result: fail
    lastTestedAt: [CURRENT_TIMESTAMP]
    notes: [What failed, root cause, what fix was applied]
    ```

    d. Re-run the test steps to verify the fix
    e. If now passing, set `result: pass` and mark task as `completed`
    f. If still failing and fixAttempts < 3, repeat from (a)

9. **If FAIL and fixAttempts >= 3:** Mark as known issue via `TaskUpdate`:

    ```
    ---
    fixAttempts: 3
    result: known_issue
    lastTestedAt: [CURRENT_TIMESTAMP]
    notes: KI — [Description of the issue, steps to reproduce, severity, suggested fix]
    ```

    Mark the task as `completed`.

10. **Update Test Status JSON** — Read `notes/test-status.json`, update the test case entry and recalculate summary counts, then write back:

    - Set `status` to `pass`, `fail`, or `known_issue`
    - Update `fixAttempts`, `notes`, `lastTestedAt`
    - Increment `metadata.totalIterations`
    - Update `metadata.lastUpdatedAt`
    - Recalculate `metadata.summary` counts
    - If known_issue, add entry to `knownIssues` array

11. **Append to test results log** (`notes/test-results.md`):

    ```markdown
    ## [TC-ID] — [Test Name]

    **Result:** PASS | FAIL | KNOWN ISSUE
    **Tested At:** [TIMESTAMP]
    **Fix Attempts:** [N]

    **What happened:**
    [Brief description of test execution]

    **Notes:**
    [Observations, errors, or fixes attempted]

    ---
    ```

**CRITICAL: Before finishing, verify you have updated ALL THREE locations:**

1. Task description (metadata below `---` separator) via `TaskUpdate`
2. `notes/test-status.json` (test case entry + summary counts)
3. `notes/test-results.md` (appended human-readable entry)

Missing ANY of these = incomplete iteration.

---

### Step 9: Verify and Continue

After each sub-agent finishes, the orchestrator:

1. Uses `TaskGet` to verify the task description metadata was updated
2. Reads `notes/test-status.json` to confirm JSON was updated and summary counts are correct
3. Reads `notes/test-results.md` to confirm a new entry was appended
4. **If any location was NOT updated**, update it before proceeding
5. Determines the next eligible task (unresolved, dependencies met)
6. Spawns the next sub-agent (back to Step 8)

### Step 10: Repeat Until All Resolved

Continue until ALL tasks have `result: pass` or `result: known_issue`.

```
Completion check:
  - result: pass         → resolved
  - result: known_issue  → resolved
  - result: fail         → needs re-test (if fixAttempts < 3)
  - result: pending      → not yet tested

ALL resolved? → Phase 3 (Summary)
Otherwise?    → Next task
```

---

## Phase 3: Summary

### Step 11: Generate Final Summary

When all tasks are resolved, append a final summary to `notes/test-results.md`:

```markdown
# Final Summary

**Completed:** [TIMESTAMP]
**Total Test Cases:** [N]
**Passed:** [N]
**Known Issues:** [N]

## Results

| TC | Name | Priority | Result | Fix Attempts |
|----|------|----------|--------|--------------|
| TC-XXX | [Name] | High | PASS | 0 |
| TC-YYY | [Name] | Medium | KNOWN ISSUE | 3 |

## Known Issues Detail

### KI-001: [TC-ID] — [Issue Title]

**Severity:** [low|medium|high|critical]
**Steps to Reproduce:** [How to see the bug]
**Suggested Fix:** [Potential solution if known]

## Recommendations

[Any follow-up actions needed]
```

---

## Rules Summary

| Rule | Description |
|------|-------------|
| 1:1 Mapping | One task per test case — no grouping |
| Dependencies | Use `blocked_by` to enforce test execution order |
| Sequential | One sub-agent at a time — do NOT spawn multiple in parallel |
| Sub-Agents | One sub-agent per task — fresh context, focused execution |
| Max 3 Attempts | After 3 fix attempts → mark as `known_issue` |
| Metadata in Description | Track `fixAttempts`, `result`, `lastTestedAt`, `notes` below `---` separator |
| Test Status JSON | Always update `notes/test-status.json` after each test |
| Log Everything | Append results to `notes/test-results.md` for human review |
| Resumable | Detect existing run state and continue from where it left off |
| Completion | All tasks resolved = all results are `pass` or `known_issue` |

## Do NOT

- Spawn multiple sub-agents in parallel — execute ONE at a time
- Leave tasks in `fail` state without either retrying or escalating to `known_issue`
- Modify test plan steps — execute them exactly as written
- Forget to update `notes/test-status.json` after each test
- Forget to append to the test results log after each test
- Skip the dependency analysis
- Use `alert()` or `confirm()` in any fix (see CLAUDE.md)
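
The command file ends here. What follows is not part of the prompt to copy. To spot-check a run yourself, the summary bookkeeping the prompt asks for can be sketched in a few lines of Python. The function names are hypothetical. The field names come from the `notes/test-status.json` structure above.

```python
from collections import Counter


def recalculate_summary(status):
    """Rebuild metadata.summary from the per-test statuses."""
    counts = Counter(tc["status"] for tc in status["testCases"].values())
    status["metadata"]["summary"] = {
        "total": len(status["testCases"]),
        "pending": counts["pending"],
        "pass": counts["pass"],
        "fail": counts["fail"],
        "knownIssue": counts["known_issue"],
    }
    return status


def all_resolved(status):
    """True when every test case is either pass or known_issue."""
    return all(
        tc["status"] in ("pass", "known_issue")
        for tc in status["testCases"].values()
    )


# Hypothetical snapshot of a run in progress.
status = {
    "metadata": {"summary": {}},
    "testCases": {
        "TC-001": {"status": "pass"},
        "TC-002": {"status": "known_issue"},
        "TC-003": {"status": "pending"},
    },
}

recalculate_summary(status)
print(status["metadata"]["summary"])
print(all_resolved(status))
```

If the recalculated summary ever disagrees with what is in the file, a sub-agent skipped one of its three updates.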

.

.

.

Your Turn

If you’ve been frustrated with AI-generated code that “works” but doesn’t actually work, give this a shot.

Define your success criteria upfront with a solid test plan. Let Claude Code handle execution and verification through the task management system. Walk away while it iterates.

The test-fix-retest loop is boring. Tedious. The kind of thing every developer has always done manually.

Now you don’t have to.

What feature are you going to test with this workflow?

Set it up. Let it run. Come back to green checkmarks.

(And maybe grab a coffee while you wait. Your backpack is empty now—you’ve earned the rest.)

11 min read The Art of Vibe Coding

How to Make Claude Code Test and Fix Its Own Work (The Ralph Loop Method)

Last week, I showed you my Claude Code implementation workflow.

52 minutes. 13 tasks. 38 test cases worth of functionality. All built by sub-agents running in parallel.

Here’s what I didn’t tell you.

Half of it didn’t work.

(I know. I KNOW.)

.

.

.

The Part Where I Discover My “Complete” Implementation Is… Not

Let me show you what happened when I actually tested the WooCommerce integration Claude built for me.

Quick context: I have a WordPress theme for coworking spaces. Originally, it used direct Stripe integration for payments. But here’s the thing—not everyone wants Stripe. Some coworking spaces prefer PayPal. Others need local payment gateways. (And some, bless their hearts, are still figuring out what a payment gateway even is.)

The solution? Let WooCommerce handle payments. Hundreds of gateway integrations, tax calculations, order management—all built-in.

Claude followed my implementation workflow perfectly.

PERFECTLY.

The settings page looked gorgeous:

WordPress admin settings page showing Payment Gateway Configuration with two card options: Direct Stripe Integration (marked Recommended) on the left and WooCommerce on the right, each with icons, descriptions, and feature bullet points

There’s even a Product Sync panel showing 3 published plans synced to WooCommerce at 100% progress. One hundred percent!

Product Sync panel displaying 3 Published Plans, 3 Synced to WooCommerce, 100% Sync Progress, with explanation of how sync works and two buttons: Sync Plans to WooCommerce and View Products

My plans, all published and ready to go:

  • Hot Desk ($199/month)
  • Dedicated Desk ($399/month)
  • Private Office ($799/month)

Plans list showing three rows: Hot Desk at $199/month, Dedicated Desk at $399/month, and Private Office at $799/month, all with Published status and 0 subscribers

And look!

They synced perfectly to WooCommerce products:

WooCommerce Products page showing Hot Desk, Dedicated Desk, and Private Office as variable products with price ranges and In Stock status

Everything looked GREAT.

So I clicked “Get Started” on the Hot Desk plan to test the checkout flow. You know, like a responsible developer would do. (Stop laughing.)

And here’s what I saw:

Checkout page showing the old direct Stripe integration with Card Number field, "Your card will be charged securely via Stripe" message, and Order Summary showing Hot Desk at $199/month—despite WooCommerce mode being enabled

The old Stripe checkout.

The direct integration I was trying to REPLACE.

I switched the payment mode to WooCommerce. I synced the products. Everything in the admin looked correct.

But the frontend? Still using the old Stripe integration.

Ferpetesake.

.

.

.

Why Claude Thinks “Done” When It’s Really “Done-ish”

Here’s where I went full detective mode.

I checked the codebase. The WooCommerce checkout code exists. Functions written. Hooks registered. File paths correct. All present and accounted for.

So why wasn’t it working?

The code was never connected to the rest of the system.

(Stay with me here.)

Claude wrote the WooCommerce checkout handler. Beautiful code. But the pricing page? Still calling the old Stripe checkout function. The new code sat there—perfectly written, completely unused—like a fancy espresso machine you forgot to plug in.

And here’s the thing: this happens ALL THE TIME with AI-generated code.

Claude writes features.

It creates files. It generates functions. And in its summary, it reports “Task complete.”

But “code exists” and “code works”?

Two very different things.

You’ve probably experienced this.

Claude builds a feature. You test it. Something’s broken. You point out the bug. Claude apologizes (so polite!), fixes that specific issue, and introduces two new ones.

The Reddit crowd calls this “nerfed” or “lazy.”

They’re wrong.

👉 Claude lacks visibility into whether its code actually runs correctly in your system.

It can’t see the browser. It can’t watch a user click through your checkout flow. It can’t verify that function A actually calls function B in production.

The fix? Give Claude the ability to test its own work.

(This is where it gets good.)

.

.

.

The Most Important Testing? Not What You Think

You might be thinking: “Just write unit tests. Problem solved.”

And look—unit tests help. Integration tests help more.

But here’s what nobody talks about:

Perfect code doesn’t mean a working product.

The WooCommerce checkout code passed every logical check. Functions syntactically correct. Hooks properly registered. A unit test would have given it a gold star and a pat on the head.

But the pricing page template still pointed to the old Stripe checkout URL.

That’s a wiring problem. Not a code problem.

The test that catches this? User acceptance testing.

Actual users (or something simulating actual users) verifying the end product meets their needs. Clicking buttons. Filling forms. Going through the whole dang flow.

This is exactly why my implementation workflow generates a test plan BEFORE the implementation plan. The test plan represents success criteria from the user’s perspective:

  • Can a user switch payment modes?
  • Does the checkout redirect to WooCommerce?
  • Does the order confirmation show correct details?

These questions can’t be answered by reading code. They require clicking through the actual interface.

Which brings us to Ralph Loop.

.

.

.

Meet Ralph Loop: Your Autonomous Claude Code Testing Loop

Here’s the workflow I use to make Claude test its own work:

This is an autonomous loop that picks up a test case, executes it in an actual browser, checks against acceptance criteria, logs results, and repeats. If a test fails? Claude fixes the code and retests.

(Yes, really. It fixes its own bugs. I’ll show you.)

The idea comes from Ryan Carson’s video “Ralph Wiggum” AI Agent will 10x Claude Code/Amp.

The core insight: you can’t throw a vague prompt at an autonomous loop and expect magic. The loop needs structure.

Specifically, it needs:

  • A test plan defining every test case upfront
  • A status.json tracking pass/fail for each case
  • A results.md where Claude logs learnings after each iteration
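
To make the role of status.json concrete, here is a small Python sketch of how a loop could choose its next test case. The selection policy (file order, a retry cap of 3, stopping at `maxIterations`) is an assumption for illustration, and `next_test_case` is a hypothetical helper. Only the field names follow the status.json structure used in this post.

```python
def next_test_case(status, max_fix_attempts=3):
    """Return the ID of the next test to run, or None when the loop should stop."""
    meta = status["metadata"]
    if meta["totalIterations"] >= meta["maxIterations"]:
        return None  # safety valve: never loop forever
    for tc_id, tc in status["testCases"].items():
        if tc["status"] == "pending":
            return tc_id
        if tc["status"] == "fail" and tc["fixAttempts"] < max_fix_attempts:
            return tc_id
    return None


# Hypothetical snapshot of status.json partway through a run.
status = {
    "metadata": {"totalIterations": 4, "maxIterations": 50},
    "testCases": {
        "TC-501": {"status": "pass", "fixAttempts": 0},
        "TC-502": {"status": "fail", "fixAttempts": 3},
        "TC-503": {"status": "fail", "fixAttempts": 1},
        "TC-504": {"status": "pending", "fixAttempts": 0},
    },
}

print(next_test_case(status))  # TC-503
```

Because the state lives in a file rather than in the conversation, each iteration can start cold and still know exactly where the run stands.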

Let me show you exactly how I set this up for Claude Code testing.

.

.

.

1. Create the Ralph Test Folder

First, create a folder to store all your Ralph loop files:

VS Code file explorer showing ralph_test folder containing four files: prepare.md, prompt.md, results.md, and status.json

Four files. That’s it.

  • prepare.md — Instructions for generating the status.json from your test plan
  • prompt.md — The loop instructions Claude follows each iteration
  • status.json — Tracks the state of all test cases (starts empty)
  • results.md — Human-readable log of each iteration (starts empty)
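
If you would rather script the setup than click through the file explorer, a minimal sketch in Python follows. The `notes/ralph_test` location matches the status file path used in the prepare prompt. The `scaffold_ralph_folder` helper is hypothetical, and it only creates empty placeholders, so you still paste the two prompts in yourself.

```python
from pathlib import Path

RALPH_FILES = ("prepare.md", "prompt.md", "status.json", "results.md")


def scaffold_ralph_folder(root="notes/ralph_test"):
    """Create the folder and any missing files, leaving existing files untouched."""
    folder = Path(root)
    folder.mkdir(parents=True, exist_ok=True)
    for name in RALPH_FILES:
        (folder / name).touch(exist_ok=True)
    return folder


# Usage (run from the project root):
#   scaffold_ralph_folder()
```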

2. The Prepare Prompt

The prepare prompt tells Claude how to read your test plan and initialize the status file:

prepare.md file showing instructions to read test plan, extract all TC-XXX test cases with ID, name, and priority, then generate a JSON file with metadata including testPlanSource, totalIterations, maxIterations, and summary counts

PROMPT: Ralph Loop Testing Agent (Prepare prompt)
Read the test plan file and generate a `status.json` file with all test cases initialized.

## Input

- **Test Plan:** `notes/test_plan.md`

## Output

- **Status File:** `notes/ralph_test/status.json`

## Instructions

1. Read the test plan markdown file
2. Extract ALL test cases (format: TC-XXX)
3. For each test case, extract:
    - TC ID (e.g., "TC-501")
    - Name (the test case title after the TC ID)
    - Priority (Critical/High/Medium/Low)
4. Generate a JSON file with this exact structure:

```json
{
  "metadata": {
    "testPlanSource": "notes/test_plan.md",
    "totalIterations": 0,
    "maxIterations": 50,
    "startedAt": null,
    "lastUpdatedAt": null,
    "summary": {
      "total": <count>,
      "pending": <count>,
      "pass": 0,
      "fail": 0,
      "knownIssue": 0
    }
  },
  "testCases": {
    "TC-XXX": {
      "name": "Test case name from plan",
      "priority": "Critical|High|Medium|Low",
      "status": "pending",
      "fixAttempts": 0,
      "notes": "",
      "lastTestedAt": null
    }
  },
  "knownIssues": []
}
```

````

5. Save the file to the output path

## Extraction Rules

- Test case IDs follow pattern: `TC-NNN` (e.g., TC-501, TC-522)
- Test case names are in headers like: `#### TC-501: Checkout Header Display`
- Priority is usually listed in the test case details or status tracker table
- If priority not found, default to "Medium"

## Example

Input (from test plan):

```markdown
#### TC-501: Checkout Header Display

**Priority:** High
...

#### TC-502: Checkout Elements - Step Progress Display

**Priority:** High
...
```

Output (status.json):

```json
{
    "metadata": {
        "testPlanSource": "notes/test_plan.md",
        "totalIterations": 0,
        "maxIterations": 50,
        "startedAt": null,
        "lastUpdatedAt": null,
        "summary": {
            "total": 2,
            "pending": 2,
            "pass": 0,
            "fail": 0,
            "knownIssue": 0
        }
    },
    "testCases": {
        "TC-501": {
            "name": "Checkout Header Display",
            "priority": "High",
            "status": "pending",
            "fixAttempts": 0,
            "notes": "",
            "lastTestedAt": null
        },
        "TC-502": {
            "name": "Checkout Elements - Step Progress Display",
            "priority": "High",
            "status": "pending",
            "fixAttempts": 0,
            "notes": "",
            "lastTestedAt": null
        }
    },
    "knownIssues": []
}
```

## Validation

After generating, verify:

- [ ] All TC-XXX IDs from the test plan are included
- [ ] No duplicate TC IDs
- [ ] Summary.total matches count of testCases
- [ ] JSON is valid (no syntax errors)
- [ ] File saved to correct path
````

Key elements:

  • Points to your test plan location (notes/test_plan.md)
  • Specifies the output file (notes/ralph_test/status.json)
  • Defines the JSON structure with metadata and test case tracking

Nothing fancy. Just clear instructions.
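To make the extraction rules concrete, here's a rough Python sketch of what the prepare prompt asks Claude to do. It is not part of the workflow itself (Claude does this from the prompt alone). The function name `build_status` and the two regular expressions are my own assumptions, based on the header and priority formats described in the prompt.

```python
import json
import re

# "#### TC-501: Checkout Header Display" -> ("TC-501", "Checkout Header Display")
TC_HEADER = re.compile(r"^#{1,6}\s*(TC-\d+):\s*(.+?)\s*$")
# "**Priority:** High" -> "High"
PRIORITY = re.compile(r"^\*\*Priority:\*\*\s*(Critical|High|Medium|Low)\b")


def build_status(plan_text, source="notes/test_plan.md", max_iterations=50):
    """Build the initial status.json structure from test plan markdown."""
    test_cases = {}
    current = None  # ID of the test case whose details are being read
    for raw in plan_text.splitlines():
        line = raw.strip()
        header = TC_HEADER.match(line)
        if header:
            current = header.group(1)
            if current in test_cases:
                raise ValueError(f"duplicate test case ID: {current}")
            test_cases[current] = {
                "name": header.group(2),
                "priority": "Medium",  # default when no priority is found
                "status": "pending",
                "fixAttempts": 0,
                "notes": "",
                "lastTestedAt": None,
            }
            continue
        priority = PRIORITY.match(line)
        if priority and current:
            test_cases[current]["priority"] = priority.group(1)
    total = len(test_cases)
    return {
        "metadata": {
            "testPlanSource": source,
            "totalIterations": 0,
            "maxIterations": max_iterations,
            "startedAt": None,
            "lastUpdatedAt": None,
            "summary": {"total": total, "pending": total, "pass": 0, "fail": 0, "knownIssue": 0},
        },
        "testCases": test_cases,
        "knownIssues": [],
    }


# Same two test cases as the example inside the prompt.
plan = """#### TC-501: Checkout Header Display

**Priority:** High

#### TC-502: Checkout Elements - Step Progress Display

**Priority:** High
"""
status = build_status(plan)
print(json.dumps(status["metadata"]["summary"]))
```

A real test plan may keep priorities only in the status tracker table, which this sketch does not parse. That is one reason handing the job to Claude is more forgiving than a script.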

3. The Loop Prompt

The prompt.md file contains the instructions Claude follows every single iteration:

prompt.md showing the testing agent loop: 1. Read state from files, 2. Pick ONE test case, 3. Execute or fix it, 4. Update state files, 5. Check if done. Includes warning that Claude has NO memory of previous iterations—files are the memory

PROMPT: Ralph Loop Testing Agent (Execution Prompt)
You are a testing agent in an iterative loop. Each iteration:

1. Read state from files
2. Pick ONE test case
3. Execute or fix it
4. Update state files
5. Check if done → output completion promise OR continue

**You have NO memory of previous iterations.** Files are your memory.

---

## Files to Read FIRST

| File                           | Purpose                                         |
| ------------------------------ | ----------------------------------------------- |
| `notes/ralph_test/status.json` | Current state of all test cases (JSON)          |
| `notes/test_plan.md`           | Full test plan with steps and expected outcomes |
| `notes/ralph_test/results.md`  | Human-readable log (append results here)        |

**Optional context:**

- `notes/impl_plan.md` — Implementation details
- `notes/specs.md` — Specification details

---

## Environment

wp-env is running:

- Dev site: http://localhost:8101
- Test site: http://localhost:8102
- Admin: http://localhost:8101/wp-admin

Commands:

- Run inside sandbox: Standard commands
- Run outside sandbox: npm, docker, wp-env commands

## Test Credentials

### Admin

- URL: http://localhost:8101/wp-admin
- Username: admin
- Email: wordpress@example.com
- Password: password

### Reset Password (if needed)

```bash
wp user update admin --user_pass=password
```

---

## This Iteration

### Step 1: Read State

Read the test status JSON file. Understand:

- Which test cases exist
- Status of each: `pending`, `testing`, `pass`, `fail`, `known_issue`
- Fix attempts for failing tests

### Step 2: Check Completion

**If ALL test cases are `pass` or `known_issue`:**

Output completion promise and final summary:

<promise>ALL_TESTS_RESOLVED</promise>

Summary:

- Total passed: X
- Known issues: Y
- Recommendations: ...

**Otherwise, continue to Step 3.**

### Step 3: Pick ONE Test Case

Priority order:

1. `testing` — Continue mid-test
2. `fail` with `fixAttempts < 3` — Needs fix
3. `pending` — Fresh test

### Step 4: Execute Test

Using Chrome browser automation (natural language):

- Navigate to URLs
- Click buttons/links
- Fill form inputs
- Take screenshots
- Read console logs
- Verify DOM state

**Follow the test plan click-path EXACTLY.**

### Step 5: Record Result

Update test status JSON:

**PASS:**

```json
{ "status": "pass", "notes": "What was verified", "lastTestedAt": "ISO timestamp" }
```

**FAIL:**

```json
{ "status": "fail", "fixAttempts": <increment>, "notes": "What failed", "lastTestedAt": "ISO timestamp" }
```

Update `metadata.totalIterations` and `metadata.lastUpdatedAt`.

### Step 6: Handle Failures

**If FAIL and fixAttempts < 3:**

- Analyze root cause
- Implement fix in codebase
- Next iteration will re-test

**If FAIL and fixAttempts >= 3:**

- Set status to `known_issue`
- Add to `knownIssues` array with: id, description, steps, severity

### Step 7: Update Human Log

Append to test results markdown:

```markdown
## Iteration [N] — [TIMESTAMP]

**TC:** TC-XXX — [Name]
**Status:** ✅/❌/⚠️
**Notes:** [What happened]

---
```

### Step 8: Continue or Complete

- If all TCs resolved → Output `<promise>ALL_TESTS_RESOLVED</promise>`
- Otherwise → Continue working (loop will restart)

---

## Rules

1. ONE test case per iteration
2. Update files BEFORE finishing
3. Follow test steps EXACTLY
4. Screenshot key verification points
5. Max 3 fix attempts → then known_issue
6. Output promise ONLY when truly complete

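The “files are your memory” rule is crucial for Claude Code testing to work properly: each iteration starts fresh, so all state has to live in status.json and results.md.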

Each iteration, Claude:

  1. Reads status.json to understand current state
  2. Picks one test case: an in-progress test first, then a failed test with fix attempts left, then the next pending one
  3. Executes the test in an actual browser
  4. Updates status.json and results.md
  5. Ends the iteration (which triggers the next loop)

Rinse. Repeat. Until done.
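The priority order from Step 3 of the prompt can be sketched in a few lines of Python. This is only an illustration of the selection rule, not code the loop runs. The function name `pick_next` and the tie-break by test case ID are my assumptions.

```python
MAX_FIX_ATTEMPTS = 3


def pick_next(test_cases):
    """Return the ID of the test case to work on, or None if nothing is left."""

    def rank(case):
        if case["status"] == "testing":
            return 0  # continue a test that was interrupted mid-run
        if case["status"] == "fail" and case["fixAttempts"] < MAX_FIX_ATTEMPTS:
            return 1  # failed, but still has fix attempts left
        if case["status"] == "pending":
            return 2  # never tested
        return None  # pass / known_issue: resolved

    candidates = []
    for tc_id, case in test_cases.items():
        r = rank(case)
        if r is not None:
            candidates.append((r, tc_id))
    # Lowest rank wins; ties fall back to the lowest ID (an assumption).
    return min(candidates)[1] if candidates else None


cases = {
    "TC-001": {"status": "pass", "fixAttempts": 0},
    "TC-002": {"status": "fail", "fixAttempts": 1},
    "TC-003": {"status": "pending", "fixAttempts": 0},
    "TC-004": {"status": "testing", "fixAttempts": 0},
}
print(pick_next(cases))  # TC-004
```

When `pick_next` returns `None`, every case is `pass` or `known_issue`, which is exactly the condition for outputting the completion promise.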

4. Initialize the Status File

Run the prepare prompt to generate your starting state:

Claude Code terminal showing the command to read and implement the prepare.md file

Claude reads your test plan and creates status.json with all 38 test cases initialized:

Claude Code output showing it created status.json with 38 test cases extracted, priority breakdown of 6 Critical, 18 High, 14 Medium, all cases initialized with status pending and fixAttempts 0

The generated status file looks like this:

status.json file showing metadata section with testPlanSource, totalIterations at 0, maxIterations at 50, summary counts, and testCases section with TC-001 Display Payment Mode Settings, TC-002 Prerequisites Check WooCommerce Not Active, and TC-003 Prerequisites Check No Payment Gateway, all pending

Every test case has:

  • status: “pending”, “testing”, “pass”, “fail”, or “known_issue”
  • fixAttempts: How many times Claude tried to fix this case
  • notes: What Claude observed during testing
  • lastTestedAt: Timestamp of the last test

All 38 tests. Ready to go. Pending status across the board.
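If you'd rather not eyeball the validation checklist from the prepare prompt, a small script can run most of it. This is a minimal sketch under my own assumptions: the helper names are hypothetical, and the allowed status values are taken from the loop prompt.

```python
import json

ALLOWED_STATUSES = {"pending", "testing", "pass", "fail", "known_issue"}


def _reject_duplicate_keys(pairs):
    """json hook: fail on repeated keys instead of silently keeping the last."""
    seen = set()
    for key, _ in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen.add(key)
    return dict(pairs)


def validate_status(text, expected_ids):
    """Return a list of problems found in a status.json string (empty if none)."""
    try:
        status = json.loads(text, object_pairs_hook=_reject_duplicate_keys)
    except ValueError as exc:  # also covers json.JSONDecodeError
        return [f"cannot load status file: {exc}"]
    cases = status["testCases"]
    problems = []
    missing = sorted(set(expected_ids) - set(cases))
    if missing:
        problems.append(f"missing test cases: {missing}")
    total = status["metadata"]["summary"]["total"]
    if total != len(cases):
        problems.append(f"summary.total is {total}, found {len(cases)} test cases")
    for tc_id, case in cases.items():
        if case["status"] not in ALLOWED_STATUSES:
            problems.append(f"{tc_id} has unknown status {case['status']!r}")
    return problems


good = json.dumps({
    "metadata": {"summary": {"total": 1}},
    "testCases": {"TC-001": {"status": "pending"}},
})
print(validate_status(good, ["TC-001"]))  # []
print(validate_status(good, ["TC-001", "TC-002"]))
```

The duplicate-key hook matters because a plain `json.loads` quietly keeps the last of two identical TC IDs, so the "no duplicate TC IDs" check would otherwise never fire.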

.

.

.

5. Trigger the Ralph Loop

Now the magic happens.

Trigger the Ralph loop with this command:

/ralph-loop:ralph-loop "perform this: @notes/ralph_test/prompt.md" --completion-promise "ALL_TESTS_RESOLVED" --max-iterations 100

Claude Code terminal showing the ralph-loop command with prompt.md path, completion-promise set to ALL_TESTS_RESOLVED, and max-iterations 100

  • The --completion-promise tells Ralph to keep looping until Claude outputs “ALL_TESTS_RESOLVED.”
  • The --max-iterations prevents infinite loops. (Because nobody wants that.)
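Conceptually, those two flags give the loop its two exits. Here's a toy Python model of that idea. To be clear, this is not how the Ralph loop plugin is implemented (I haven't shown its source here). `run_loop` and `fake_agent` are hypothetical names, and the promise format is copied from the prompt above.

```python
PROMISE = "<promise>ALL_TESTS_RESOLVED</promise>"


def run_loop(run_iteration, max_iterations):
    """Call run_iteration until it emits the promise or the cap is reached.

    Returns (iterations_used, completed).
    """
    for iteration in range(1, max_iterations + 1):
        output = run_iteration(iteration)
        if PROMISE in output:
            return iteration, True  # exit 1: completion promise seen
    return max_iterations, False  # exit 2: iteration cap hit


def fake_agent(iteration):
    """Stand-in for one Claude iteration: resolves everything on the third pass."""
    if iteration >= 3:
        return "Summary: all done\n" + PROMISE
    return f"Iteration {iteration}: tested one case"


print(run_loop(fake_agent, max_iterations=100))  # (3, True)
print(run_loop(fake_agent, max_iterations=2))    # (2, False)
```

The second call shows why the cap matters: if the promise never appears, the loop still stops.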

6. Watch Claude Test Its Own Work

Claude starts by reading the state files to understand the current status:

Claude Code showing iteration 1 starting, reading state files showing 38 total test cases all pending, then selecting TC-001 Display Payment Mode Settings as the first test, launching browser to navigate to admin settings page

It picks TC-001: Display Payment Mode Settings.

Then it launches a browser—an actual browser!—navigates to the settings page, and verifies each requirement:

Claude Code showing TC-001 verification results with green checkmarks for: Settings page loads, Payment Mode section visible, Radio button options displayed showing Direct Stripe Integration and WooCommerce, and Description of each mode shown in info box. Concludes with TC-001 PASS

All checks pass. TC-001: PASS ✅

(Look at all those green checkmarks. Gorgeous.)

Claude updates the status file:

Claude Code diff showing status.json updates: totalIterations changed from 0 to 1, startedAt and lastUpdatedAt timestamps added, pending count decreased from 38 to 37, pass count increased from 0 to 1, TC-001 status changed from pending to pass with detailed notes

Then updates results.md with a human-readable log:

Claude Code showing results.md being written with Iteration 1 header, TC-001 test case name, PASS status, and detailed notes about what was verified. Shows iteration complete with 1 pass 37 pending, then stop hook triggering Ralph iteration 2

Notice the stop hook at the bottom: “Ralph iteration 2.”

The loop automatically triggers the next iteration.

No manual intervention.

No babysitting.

Just… Claude Code testing itself.
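The status.json edit Claude made in that diff boils down to a small state update. Here's a hedged Python sketch of it. The function names are mine, the timestamp is made up, and counting `testing` under `pending` in the summary is an assumption, since the summary block has no slot for it.

```python
from datetime import datetime, timezone


def summarize(test_cases):
    """Recount the summary block from the individual test case statuses."""
    summary = {"total": len(test_cases), "pending": 0, "pass": 0, "fail": 0, "knownIssue": 0}
    for case in test_cases.values():
        status = case["status"]
        if status == "known_issue":
            summary["knownIssue"] += 1
        elif status == "testing":
            summary["pending"] += 1  # assumption: in-progress counts as pending
        else:
            summary[status] += 1
    return summary


def record_pass(status, tc_id, notes, now=None):
    """Mark one test case as passed and refresh the metadata block."""
    now = now or datetime.now(timezone.utc).isoformat()
    status["testCases"][tc_id].update(status="pass", notes=notes, lastTestedAt=now)
    meta = status["metadata"]
    meta["totalIterations"] += 1
    meta["startedAt"] = meta["startedAt"] or now  # set once, on the first iteration
    meta["lastUpdatedAt"] = now
    meta["summary"] = summarize(status["testCases"])
    return status


status = {
    "metadata": {"totalIterations": 0, "startedAt": None, "lastUpdatedAt": None, "summary": {}},
    "testCases": {
        "TC-001": {"status": "pending", "notes": "", "lastTestedAt": None},
        "TC-002": {"status": "pending", "notes": "", "lastTestedAt": None},
    },
}
record_pass(status, "TC-001", "Settings page loads", now="2026-01-01T00:00:00+00:00")
print(status["metadata"]["summary"])
# {'total': 2, 'pending': 1, 'pass': 1, 'fail': 0, 'knownIssue': 0}
```

Recomputing the summary from the test cases, rather than incrementing counters, keeps the two from drifting apart across iterations.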

.

.

.

7. The Loop Continues (Without You)

Iteration 2 starts.

Claude reads the state (1 pass, 37 pending), picks TC-002:

Claude Code showing Iteration 2 starting with current state 1 pass 37 pending, selecting TC-002 Prerequisites Check WooCommerce Not Active, then running bash command to deactivate WooCommerce plugin to test the prerequisite behavior

TC-002 requires WooCommerce to be deactivated.

So what does Claude do? Runs wp plugin deactivate woocommerce, then tests the settings page behavior.

The test passes—the WooCommerce option correctly shows “Setup Required” when the plugin is inactive:

Claude Code showing TC-002 verification with green checkmarks for: Settings page loads, WooCommerce option shows Setup Required badge and is disabled, Error message displayed in Prerequisites section with Install WooCommerce link, Payment Gateway prerequisite shows warning. TC-002 PASS

Claude reactivates WooCommerce and updates the status:

Claude Code diff showing status.json metadata updates: totalIterations increased to 2, lastUpdatedAt timestamp updated
Claude Code diff showing TC-002 status changed from pending to pass with notes describing the WooCommerce deactivation test results

And appends to results.md:

Claude Code showing results.md update with Iteration 2 section added, TC-002 test details, PASS status, and notes about WooCommerce prerequisite behavior verification

Iteration 2 complete.

Stop hook triggers iteration 3:

Claude Code showing Iteration 2 Complete with TC-002 PASS, total 2 pass 36 pending, stop hook message triggering Ralph iteration 3, then reading state for TC-003 Prerequisites Check No Payment Gateway

This continues automatically. Test after test after test.

You could go make coffee. Take a walk. Do your taxes.

(Okay, maybe not taxes.)

.

.

.

8. When Tests Fail, Claude Fixes Them

HERE’S where Ralph Loop really shines.

During testing, Claude encounters a failing test. The pricing page isn’t displaying plan prices correctly.

Does it give up? Does it log “FAIL” and move on?

Nope.

Claude investigates, finds the issue—the template is using old meta keys instead of the Plan model—and fixes it:

Claude Code showing investigation of pricing section template, identifying issue with meta key mismatch between template expectations and Plan model, then updating section-plans.php to replace manual WP_Query with Plan model method call

Then Claude retests to verify the fix worked:

Claude Code showing template update to use Plan::findPublished(), refreshing pricing page, verifying correct prices displayed (Hot Desk $199/month, Dedicated Desk $399/month with Most Popular badge, Private Office $799/month), then clicking Get Started to test checkout flow

The pricing page now shows correct prices. Claude clicks “Get Started” to continue testing the checkout flow.

Test. Find bug. Fix bug. Retest. Confirm fix.

All automatic.
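What keeps this fix-and-retest cycle from running forever is the three-attempt rule in Steps 5 and 6 of the prompt. A minimal Python sketch of that bookkeeping, read literally from the prompt, looks like this. The function name, the default severity, and the sample failure notes are my assumptions.

```python
MAX_FIX_ATTEMPTS = 3


def record_failure(status, tc_id, notes, steps="", severity="Medium"):
    """Record one failed run; escalate to known_issue once attempts run out."""
    case = status["testCases"][tc_id]
    case["fixAttempts"] += 1
    case["notes"] = notes
    if case["fixAttempts"] >= MAX_FIX_ATTEMPTS:
        case["status"] = "known_issue"
        # Fields follow the prompt: id, description, steps, severity.
        status["knownIssues"].append(
            {"id": tc_id, "description": notes, "steps": steps, "severity": severity}
        )
    else:
        case["status"] = "fail"  # next iteration fixes and re-tests
    return case["status"]


status = {
    "testCases": {"TC-009": {"status": "pending", "fixAttempts": 0, "notes": ""}},
    "knownIssues": [],
}
outcomes = [record_failure(status, "TC-009", "hypothetical failure note") for _ in range(3)]
print(outcomes)  # ['fail', 'fail', 'known_issue']
```

Once a case lands in `knownIssues`, it counts as resolved for the completion check, so one stubborn bug can't block the whole run.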

.

.

.

9. All Tests Pass

After 3 hours and 32 minutes, all 38 test cases resolve:

Claude Code showing final test results summary table with all categories passing: Payment Mode Configuration (5 tests), Product & Plan Sync (2 tests), Checkout Flow (6 tests), Subscription Lifecycle (6 tests), Renewal Processing (2 tests), Plan Changes (6 tests), Cancellation (2 tests), Member Portal (4 tests), Admin Features (2 tests), Emails (2 tests), Security (1 test). Key P0/P1 Critical Tests listed below, ending with ALL_TESTS_RESOLVED and total time of 3h 32m 34s

Summary of Test Results:

  • Payment Mode Configuration: 5 tests ✅
  • Product & Plan Sync: 2 tests ✅
  • Checkout Flow: 6 tests ✅
  • Subscription Lifecycle: 6 tests ✅
  • Renewal Processing: 2 tests ✅
  • Plan Changes: 6 tests ✅
  • Cancellation: 2 tests ✅
  • Member Portal: 4 tests ✅
  • Admin Features: 2 tests ✅
  • Emails: 2 tests ✅
  • Security: 1 test ✅

Total: 38 tests. All passing.
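A quick sanity check on those numbers, in Python. The category counts are copied from the list above and the priority breakdown from the prepare step's output. Only the variable names are mine.

```python
category_counts = {
    "Payment Mode Configuration": 5,
    "Product & Plan Sync": 2,
    "Checkout Flow": 6,
    "Subscription Lifecycle": 6,
    "Renewal Processing": 2,
    "Plan Changes": 6,
    "Cancellation": 2,
    "Member Portal": 4,
    "Admin Features": 2,
    "Emails": 2,
    "Security": 1,
}
priority_counts = {"Critical": 6, "High": 18, "Medium": 14}

print(sum(category_counts.values()))  # 38
print(sum(priority_counts.values()))  # 38
```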

The critical P0/P1 tests that Claude verified during the loop (TC-009 needed a fix along the way):

  • TC-004: Mode Switch Blocking ✅
  • TC-009: Guest Checkout Prevention ✅ (with fix)
  • TC-016: Race Condition Prevention ✅
  • TC-018: Pre-Renewal Token Validation ✅
  • TC-020: 3D Secure Handling ✅
  • TC-038: Token Ownership Validation ✅

HECK YES.

.

.

.

The Proof: It Actually Works Now

Remember the checkout problem from the beginning? The one that made me question my life choices?

Let’s see what happens now.

The pricing page displays correctly:

CoWorkPress pricing page showing three plan cards: Hot Desk at $199/month, Dedicated Desk at $399/month with Most Popular badge, and Private Office at $799/month, each with feature lists and Get Started buttons. Arrow pointing to Hot Desk Get Started button

Click “Get Started” on Hot Desk, and you’re redirected to the WooCommerce checkout:

WooCommerce Checkout page with Account Required for Subscription notice, Contact information section with email field, Billing address fields, and Order summary showing Hot Desk at $199.00 with Monthly billing cycle

See the difference?

This is the WooCommerce checkout page.

The order summary shows “Hot Desk” with “Billing Cycle: Monthly.” The account creation notice appears because subscriptions require accounts.

(This is the moment I did a small victory dance. Don’t judge.)

Scroll down to payment options—Stripe through WooCommerce:

Payment options section showing Stripe card payment form with test mode notice, card number field filled with test card 4242, expiration and security code fields, and optional fields for saving payment information

The Stripe integration now runs through WooCommerce. Same payment processor, but managed by WooCommerce’s subscription system. I can swap in PayPal, Square, or any other gateway without touching theme code.

Complete the purchase, and you land on the welcome page:

Welcome to CoWorkPress confirmation page with checkmark icon, Your membership is now active message, Order Confirmation card showing Hot Desk plan, Monthly subscription, $199.00 charged, next billing date March 10 2026, confirmation number, and View Receipt on Stripe link

Everything works.

The flow connects end-to-end.

The WooCommerce integration that Claude “completed” previously?

Now it’s actually complete.

.

.

.

The Complete Journey: From Idea to Working Product

Let me zoom out and show you how all three parts of this series connect:

Infographic titled "FROM IDEA TO WORKING PRODUCT" detailing a four-phase software development process. The first phase, "PHASE 1: SPECS" (with a brain emoji and document icon), involves describing the task and trigger, asking user questions until 95% confident. An arrow leads to "PHASE 2: TEST PLAN" (with eyes emoji and checklist icon), which asks "What does success look like?" and defines "38 test cases with criteria." The next phase, "PHASE 3: IMPLEMENTATION" (with keyboard emoji and code icon), addresses "What tasks map to which test cases?" and lists "13 tasks in 4 phases." An arrow descends to a central box for "PHASE 4: TESTING" (with seal emoji and browser loop icon), detailing the "RALPH LOOP": "Pick test case," "Execute in browser," "Pass? → Next" (with green check), "Fail? → Fix → Retest" (with red cross), and final results of "3h 32m" and "38/38 passing" (with green check). A final arrow points down to the "WORKING PRODUCT" box (with rocket emoji and green check), listing the outcomes: "All features implemented," "All edge cases handled," "All tests verified in actual browser," and "Bugs found and fixed during testing." The background is soft light gray, with navy text and structural lines, and safety orange accents for key metrics and the final result box.

Phase 1: Bulletproof Specs

We started by brainstorming comprehensive specifications.

Using the AskUserQuestion tool, Claude asked 12 clarifying questions covering everything from subscription handling to checkout experience to refund policies. Then Claude critiqued its own specs, finding 14 potential issues before we wrote any code.

Phase 2: Test Plan

Before implementation, we generated a test plan.

38 test cases defining exactly what success looks like—from a user’s perspective. These became our acceptance criteria.

Phase 3: Implementation Plan + Sub-Agents

We created an implementation plan mapping tasks to test cases. Then executed with sub-agents running in parallel waves, keeping context usage low while building everything in 52 minutes.

Phase 4: Claude Code Testing + Fixing with Ralph Loop

Finally, we let Ralph loose. The autonomous loop tested each case in an actual browser, found the bugs Claude missed during implementation, fixed them, and verified the fixes.

3 hours 32 minutes later: 38/38 tests passing.

.

.

.

What I’ve Learned About Building With AI

Here’s what this whole journey taught me.

We all want AI to one-shot solutions on the first try. To type a prompt, hit enter, and watch magic happen. And when it doesn’t work perfectly? We blame the AI. Call it nerfed. Call it lazy. Move on to the next shiny tool.

But here’s the thing I keep coming back to:

Even the most experienced developer can’t one-shot a complex feature.

We write code. Test it. Find bugs. Fix them. Test again. That’s just how building software works. Always has been. Probably always will be.

AI is no different.

The breakthrough—the real breakthrough—comes from giving AI the ability to verify its own work. The same way any developer does. Write the code. Test it against real user scenarios. See what breaks. Fix it. Test again.

Ralph Loop makes this autonomous.

You don’t have to manually test 38 scenarios. You don’t have to spot the bugs yourself. You don’t have to describe each fix.

You define success criteria upfront (test plan), give Claude the ability to test against those criteria (browser automation), and let it iterate until everything passes.

👉 That’s the entire secret: structured iteration with clear success criteria.

Not smarter prompts. Not better models. Not more tokens.

Just… iteration.

The same boring, unsexy process that’s always made software work.

Except now, you don’t have to do it yourself.

8 min read The Art of Vibe Coding

How to Make Claude Code Actually Build What You Designed

Here’s a thing that happened to me once.

I moved apartments.

Being the organized person I am, I created the most detailed inventory list you’ve ever seen. Every box labeled. Every item cataloged. “Kitchen – Plates (12), Bowls (8), That Weird Garlic Press I Never Use But Can’t Throw Away (1).”

I handed this masterpiece to the movers and said, “Here you go!”

They looked at me like I’d handed them a grocery list in Klingon.

Because here’s what my beautiful inventory didn’t tell them: Which boxes go to which rooms. What order to load things. Which items are fragile. What depends on what. The fact that the bookshelf needs to go in before the desk, or nothing fits.

They weren’t wrong to be confused. I’d given them a comprehensive what without any how.

This is exactly what happens when you hand Claude Code your bulletproof specs and say “implement this.”

.

.

.

The Gap Nobody Warns You About

Last week, we talked about creating bulletproof specs using the 3-phase method.

You followed the process. You answered every clarifying question. You had Claude critique its own work. Your specs are comprehensive—2,000+ lines of detailed requirements, edge cases, and architectural decisions.

Now you’re ready to build.

So you fire up Claude Code and type: “Read the specs and implement it.”

Claude starts working. Files appear. Code flows.

And thirty minutes later? Half your edge cases are missing. The checkout flow doesn’t match what you specified. That critical race condition prevention you spent three rounds of Q&A perfecting?

Nowhere to be found.

Here’s the thing: Comprehensive specs don’t automatically translate to comprehensive implementation.

Your specs might be 2,000 lines. Claude’s context window is limited. As implementation progresses, early requirements fade from memory. The AI starts taking shortcuts. Details slip through the cracks like sand through fingers.

Sound familiar?

(If you’re nodding right now, stay with me.)

The issue isn’t Claude’s capability. It’s the gap between what’s documented and what gets built. Even human developers working from perfect documentation miss things. They get tired. They make assumptions. They interpret requirements their own way.

Claude faces the same challenges—plus context limits that force it to work with only a subset of information at any given moment.

👉 The solution isn’t better prompting. It’s better process.

And that process? It’s what I’ve been calling the Claude Code implementation workflow. Let me show you what I mean.

.

.

.

The Missing Middle Layer

Here’s what most people do: hand Claude the specs and jump straight to implementation.

I call this the “hope-based development methodology.”

(That’s a joke. Please don’t actually call it that.)

Here’s what actually works:

An infographic titled "DEVELOPMENT WORKFLOW: SPECS TO IMPLEMENTATION" on a light gray background. A horizontal flowchart shows four rectangular boxes with rounded corners connected by gray arrows. The first box on the left is "SPECS" with a small pink brain icon. An arrow points from "SPECS" to the second box, "TEST PLAN," which has a clipboard with a checkmark and two eyes icon. A downward arrow from "TEST PLAN" points to a gray rounded rectangular box below with the text: "'What does success look like?'" followed by a green checkmark icon. An arrow from "TEST PLAN" points to the third horizontal box, "IMPLEMENTATION PLAN," with a keyboard icon. A downward arrow from "IMPLEMENTATION PLAN" points to a gray rounded rectangular box below with the text: "'What tasks map to which test cases?'" followed by a green checkmark icon. An arrow from "IMPLEMENTATION PLAN" points to the fourth horizontal box, "TASK-BY-TASK IMPLEMENTATION," with a seal icon. A downward arrow from "TASK-BY-TASK IMPLEMENTATION" points to a gray rounded rectangular box below with the text: "'One focused sub-agent per task'" followed by a green checkmark icon. All boxes have dark blue borders and dark blue text, except for the gray boxes below which have dark blue text and a green checkmark.
  • The test plan answers: “How will we know if each requirement is implemented correctly?”
  • The implementation plan answers: “What specific tasks need to happen, and in what order?”
  • The task management answers: “How do we keep Claude focused and prevent context overload?”

Think of it like this: your specs are the inventory list. The test plan is the “here’s how we’ll know each box arrived safely” checklist. The implementation plan is the “which room, which order, what depends on what” instruction sheet.

The movers—er, Claude—can actually do their job now.
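One way to picture the "what tasks map to which test cases" question is as a coverage check: every test case should be claimed by at least one task. Here's a tiny Python sketch of that idea. The task names and the mapping are invented for illustration, and `uncovered_test_cases` is a hypothetical helper, not something from the workflow.

```python
def uncovered_test_cases(task_map, test_case_ids):
    """Return the test case IDs that no task in the plan claims to cover."""
    covered = {tc_id for tc_ids in task_map.values() for tc_id in tc_ids}
    return sorted(set(test_case_ids) - covered)


# Hypothetical implementation plan: task -> test cases it should satisfy.
task_map = {
    "Task 1 (hypothetical): settings page": ["TC-001", "TC-002"],
    "Task 2 (hypothetical): checkout redirect": ["TC-009"],
}
print(uncovered_test_cases(task_map, ["TC-001", "TC-002", "TC-009", "TC-016"]))
# ['TC-016']
```

Any ID that shows up in the result is a requirement the plan would silently skip.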

Let me walk you through exactly how this Claude Code implementation workflow… well, works.

.

.

.

Step 1: Create a Test Plan From Your Specs

Before writing any code, Claude needs to understand what success looks like.

Now, I know what you’re thinking: “Wait—isn’t a test plan for after implementation?”

Traditionally, yes. But for AI-driven development, creating the test plan first serves a completely different purpose.

👉 It forces Claude to deeply analyze every requirement and translate it into verifiable outcomes.

When Claude creates test cases for “handle race conditions during renewal processing,” it has to think through exactly what that means. What are the preconditions? What actions trigger the behavior? What should the expected results be?

It’s like asking someone to write the exam questions before teaching the class. Suddenly, they understand the material much more deeply.
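To see why this forces precision, it helps to treat a test case as structured data: you can't fill in the fields without deciding on preconditions, actions, and expected results. Below is a minimal Python sketch that renders such a record in the same layout as the test case template in my prompt. The class name `CaseSpec` and every detail of the sample race-condition case are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseSpec:
    tc_id: str
    name: str
    description: str
    preconditions: list
    steps: list  # (action, expected result) pairs
    expected_outcome: str
    priority: str = "Medium"

    def to_markdown(self):
        """Render the case using the headings from the test plan template."""
        lines = [
            f"### {self.tc_id}: {self.name}",
            "",
            f"**Description:** {self.description}",
            "",
            "**Preconditions:**",
        ]
        lines += [f"- {item}" for item in self.preconditions]
        lines += [
            "",
            "**Steps:**",
            "",
            "| Step | Action | Expected Result |",
            "|------|--------|-----------------|",
        ]
        lines += [
            f"| {number} | {action} | {expected} |"
            for number, (action, expected) in enumerate(self.steps, start=1)
        ]
        lines += [
            "",
            f"**Expected Outcome:** {self.expected_outcome}",
            "",
            f"**Priority:** {self.priority}",
        ]
        return "\n".join(lines)


# Hypothetical example; the real test plan's wording will differ.
case = CaseSpec(
    tc_id="TC-016",
    name="Race Condition Prevention",
    description="Two renewal jobs for the same subscription must not double-charge",
    preconditions=["Active subscription due for renewal"],
    steps=[
        ("Trigger two renewal jobs at once", "Only one job acquires the lock"),
        ("Inspect the order history", "Exactly one renewal order exists"),
    ],
    expected_outcome="Customer is charged once",
    priority="Critical",
)
print(case.to_markdown())
```

Try writing the `steps` list for a vague requirement and the gaps in the spec show up immediately.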

Here’s the prompt I use:

PROMPT: Create a Comprehensive Test Plan based on Specs
Create a comprehensive test plan that will verify the implementation matches the specs at `notes/specs.md`

### Step 1: Identify Test Scenarios

Based on the specs:
- Happy path flows
- Error conditions
- Edge cases
- State transitions
- Responsive behavior
- Accessibility requirements

### Step 2: Create Test Cases

For each scenario, create detailed test cases:

```markdown
### TC-NNN: [Test Name]

**Description:** [What this test verifies]

**Preconditions:**
- [Required state before test]
- [Required data]

**Steps:**

| Step | Action | Expected Result |
|------|--------|-----------------|
| 1 | [Action to take] | [What should happen] |
| 2 | [Action to take] | [What should happen] |

**Test Data:**
- Field 1: `value`
- Field 2: `value`

**Expected Outcome:** [Final verification]

**Priority:** Critical / High / Medium / Low
```

### Step 3: Organize by Category

Group test cases:

- Functional tests
- UI/UX tests
- Validation tests
- Integration tests (if applicable)
- Edge case tests

### Step 4: Create Status Tracker

```markdown
## Status Tracker

| TC     | Test Case | Priority | Status | Remarks |
| ------ | --------- | -------- | ------ | ------- |
| TC-001 | [Name]    | High     | [ ]    |         |
| TC-002 | [Name]    | Medium   | [ ]    |         |
```

### Step 5: Add Known Issues Section

```markdown
## Known Issues

| Issue | Description | TC Affected | Steps to Reproduce | Severity |
| ----- | ----------- | ----------- | ------------------ | -------- |
|       |             |             |                    |          |
```

## Output

Save the test plan to: `notes/test_plan.md`

Include:

1. Overview and objectives
2. Prerequisites
3. Reference wireframe (if applicable)
4. Test cases (10-20 typically)
5. Status tracker
6. Known issues section

## Test Case Guidelines

- Each test should be independent
- Use specific, concrete test data
- Include both positive and negative tests
- Cover all screens from wireframe (if applicable)
- Test all states from prototype
- Consider mobile/responsive

## Do NOT

- Over-test obvious functionality
- Skip error handling tests
- Forget accessibility basics
Claude Code terminal showing the prompt to create a comprehensive test plan based on specs, with structured steps for identifying test scenarios and creating detailed test cases with preconditions, steps tables, test data, and priority levels

Claude reads through all the specs, identifies what needs to be tested, and generates a structured test plan.

Claude Code output showing it reading the specs file and three detailed spec parts, then writing a 1145-line test plan file titled "WooCommerce Integration Test Plan" with version, date, and specification reference

For my WooCommerce integration, Claude created 38 test cases organized into 12 sections:

Claude Code displaying the test plan summary with 38 test cases across 12 sections including Payment Mode Configuration, Product Sync, Checkout Flow, Subscription Lifecycle, Renewal Processing, and more, with priority distribution showing 7 Critical, 20 High, and 11 Medium tests

Notice the priority distribution:

  • Critical (P0): 7 tests — Must pass before deployment
  • High (P1): 20 tests — Essential functionality
  • Medium: 11 tests — Important but not blocking

Each test case maps directly to a requirement in my specs. Nothing ambiguous. Nothing assumed. Nothing left to interpretation.

(This is the part where past-me would have skipped ahead to coding. Don’t be past-me.)
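If you want to double-check a priority breakdown like the one above instead of trusting the summary, a few lines of Python can tally it straight from the test plan. This is a minimal sketch, assuming the plan follows the `### TC-NNN: [Test Name]` and `**Priority:**` template from the prompt. The sample test names and the helper names (`parse_priorities`, `priority_counts`) are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical excerpt in the TC-NNN template format from the prompt.
SAMPLE_PLAN = """\
### TC-001: Mode switch is blocked with active subscriptions

**Priority:** Critical

### TC-002: Product sync creates a matching product

**Priority:** High

### TC-003: Renewal email is sent

**Priority:** Medium
"""

TC_HEADING = re.compile(r"^### (TC-\d{3}): (.+)$")
PRIORITY = re.compile(r"^\*\*Priority:\*\*\s*(\w+)")


def parse_priorities(plan_text: str) -> dict[str, str]:
    """Map each test case ID to the priority listed under its heading."""
    priorities: dict[str, str] = {}
    current_id = None
    for line in plan_text.splitlines():
        heading = TC_HEADING.match(line)
        if heading:
            current_id = heading.group(1)
            continue
        priority = PRIORITY.match(line)
        if priority and current_id is not None:
            priorities[current_id] = priority.group(1)
    return priorities


def priority_counts(plan_text: str) -> Counter:
    """Count how many test cases fall under each priority label."""
    return Counter(parse_priorities(plan_text).values())


if __name__ == "__main__":
    for label, count in priority_counts(SAMPLE_PLAN).items():
        print(f"{label}: {count}")
```

Pointed at the real `notes/test_plan.md`, the same tally should line up with the distribution Claude reports, provided the plan sticks to the template.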

.

.

.

Step 2: Create an Implementation Plan That Maps to Test Cases

Now Claude knows what success looks like. Next question: how do we get there?

The implementation plan bridges test cases to actual tasks. Every task links back to specific test cases it will satisfy. It’s the “what depends on what” instruction sheet for our movers.

PROMPT: Create an Implementation Plan That Maps to Test Cases
The specs are approved and the test plan is ready. Now we need an implementation plan.

- **Specs:** `notes/specs.md`
- **Test Plan:** `notes/test_plan.md`

## Your Task

Create a detailed implementation plan that maps to the test cases.

### Step 1: Analyze Test Cases

For each test case (TC-NNN):
- What functionality must exist?
- What files need to be created/modified?
- What dependencies are needed?

### Step 2: Create Task Breakdown

Group test cases into implementation tasks:

```markdown
## Implementation Plan: [PHASE_NAME]

### Overview
[Brief description]

### Files to Create/Modify
[List all files]

### Implementation Tasks

#### Task 1: [Name]
**Mapped Test Cases:** TC-001, TC-002, TC-003
**Files:**
- `path/to/file1.php` - [description]
- `path/to/file2.js` - [description]

**Implementation Notes:**
- [Key detail 1]
- [Key detail 2]

**Acceptance Criteria:**
- [ ] TC-001 passes
- [ ] TC-002 passes
- [ ] TC-003 passes

#### Task 2: [Name]
...
```

### Step 3: Identify Dependencies

- What from previous phases is needed?
- In what order should tasks be implemented?
- Any external dependencies?

### Step 4: Estimate Complexity

- Simple: 1-2 tasks, straightforward
- Medium: 3-5 tasks, some complexity
- Complex: 6+ tasks, significant work

## Output

Save the implementation plan to: `notes/impl_plan.md`

Include:

1. Overview
2. Files to create/modify
3. Tasks with TC mappings
4. Dependencies
5. Complexity estimate

## Guidelines

- Every test case must map to a task
- Tasks should be completable in one session
- Include enough detail to guide implementation
- Reference design system patterns

## Do NOT

- Include actual code (next step)
- Over-engineer simple features

Claude Code terminal showing the prompt to create an implementation plan with task breakdowns mapped to specific test cases, including file lists and acceptance criteria linked to TC numbers

Claude analyzes both the specs and test plan, then generates a phased implementation plan:

Claude Code output showing it analyzing specs and test plan files, reading detailed spec parts, then writing an 848-line implementation plan file with version info and test plan reference
Claude Code displaying the implementation plan summary organized into 4 phases with 12 tasks total, showing a table with Phase, Focus, Tasks count, and Test Cases covered for each phase, plus critical path items and new files required

The result: 4 phases, 12 implementation tasks, each explicitly linked to the test cases that will verify them.

  • Phase 1: Foundation & Configuration (2 tasks → TC-001 to TC-007)
  • Phase 2: Checkout & Lifecycle (2 tasks → TC-008 to TC-014)
  • Phase 3: Renewal Processing (3 tasks → TC-015 to TC-021, TC-038)
  • Phase 4: Features (6 tasks → TC-022 to TC-037)

Plus a critical path of P0/P1 items that must work before deployment:

  • Mode switch blocking when active subscriptions exist
  • Race condition prevention (double-charge protection—ferpetesake, the payments!)
  • Pre-renewal token validation
  • Token ownership security

Now we have specs, a test plan, AND an implementation plan. Three documents that all reference each other. A complete picture.
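The "every test case must map to a task" rule is easy to check mechanically. Here is a minimal sketch using the phase ranges listed above, assuming each "TC-x to TC-y" range is inclusive and that there are 38 test cases in total. The function names are hypothetical.

```python
from collections import Counter

# Test case ranges per phase, copied from the plan summary above.
# Each tuple is an inclusive (start, end) range; that reading is an assumption.
PHASE_RANGES = {
    "Phase 1: Foundation & Configuration": [(1, 7)],
    "Phase 2: Checkout & Lifecycle": [(8, 14)],
    "Phase 3: Renewal Processing": [(15, 21), (38, 38)],
    "Phase 4: Features": [(22, 37)],
}
TOTAL_TEST_CASES = 38


def tc_id(number: int) -> str:
    """Format a test case number as TC-NNN."""
    return f"TC-{number:03d}"


def covered_ids(phase_ranges: dict[str, list[tuple[int, int]]]) -> list[str]:
    """Expand every phase range into the list of test case IDs it covers."""
    covered = []
    for ranges in phase_ranges.values():
        for start, end in ranges:
            covered.extend(tc_id(n) for n in range(start, end + 1))
    return covered


def coverage_report(phase_ranges, total):
    """Return (missing, duplicated) test case IDs for the given mapping."""
    covered = covered_ids(phase_ranges)
    expected = {tc_id(n) for n in range(1, total + 1)}
    missing = sorted(expected - set(covered))
    duplicated = sorted(tc for tc, n in Counter(covered).items() if n > 1)
    return missing, duplicated


if __name__ == "__main__":
    missing, duplicated = coverage_report(PHASE_RANGES, TOTAL_TEST_CASES)
    print("missing:", missing)        # missing: []
    print("duplicated:", duplicated)  # duplicated: []
```

With the ranges from this plan, nothing is missing and nothing is claimed twice: 7 + 7 + 8 + 16 adds up to all 38 test cases.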

But here’s where most people (including past-me, again) would stumble.

.

.

.

Step 3: Execute With Sub-Agents (This Is Where It Gets Fun)

Here’s something I learned the hard way.

There’s research showing that LLM performance degrades as context size increases. When Claude’s context fills up with implementation details from Task 1, it starts losing precision on Task 8. It’s like asking someone to remember the first item on a grocery list after they’ve been shopping for an hour.

The fix: run each task in its own sub-agent.

Each sub-agent gets fresh context. It focuses on one task, implements it, and reports back. The orchestrating agent manages dependencies and progress. No context pollution. No forgotten requirements.

It’s like having a team of movers where each person is responsible for exactly one room—and they all have fresh energy because they haven’t been carrying boxes all day.
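To make the "fresh context" idea concrete, here is a rough analogy in Python. It is not how Claude Code implements sub-agents. It only shows the shape: each worker receives its own small brief, and the orchestrator keeps short summaries instead of every implementation detail. `TaskBrief`, `run_sub_agent`, and `run_wave` are hypothetical names, and the worker body is a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBrief:
    """Everything one worker is allowed to see: a single task, nothing else."""
    task_id: str
    description: str
    test_cases: tuple[str, ...]


def run_sub_agent(brief: TaskBrief) -> str:
    """Stand-in for a sub-agent: works from its own brief, returns a short summary."""
    # A real sub-agent would do the implementation work here.
    return f"{brief.task_id} done ({len(brief.test_cases)} test cases targeted)"


def run_wave(briefs: list[TaskBrief]) -> list[str]:
    """Run one wave's tasks in parallel and return summaries in input order."""
    if not briefs:
        return []
    with ThreadPoolExecutor(max_workers=len(briefs)) as pool:
        return list(pool.map(run_sub_agent, briefs))


if __name__ == "__main__":
    wave_1 = [
        TaskBrief("1.1", "Payment mode configuration", ("TC-001", "TC-002")),
        TaskBrief("1.2", "Product sync", ("TC-003",)),
    ]
    for summary in run_wave(wave_1):
        print(summary)
```

The design point is the return value: the orchestrator only ever holds one line per task, so its own context stays small no matter how much work each worker did.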

Here’s the prompt that kicks off the execution step of the Claude Code implementation workflow:

PROMPT: Claude Code implementation workflow execution

We are executing the implementation plan. All design and planning is complete.

## Reference Documents
- **Specs:** @notes/specs.md
- **Implementation Plan:** @notes/impl_plan.md
- **Test Plan:** @notes/test_plan.md

## Phase 1: Task Creation

### Before Creating Tasks
1. Review @notes/impl_plan.md completely
2. Understand test case expectations from @notes/test_plan.md
3. Reference wireframe/prototype for UI (if applicable)
4. Check design system for patterns (if available)

### Create Tasks from Implementation Plan
Parse the implementation plan and use `TaskCreate` to create a task for each implementation item:

1. **Extract all tasks** from @notes/impl_plan.md
2. **Identify dependencies** between tasks (what must be done before what)
3. **Create each task** with:
   - Clear description including the specific files to create/modify
   - Mapped test cases (TCs) that verify the task
   - `blocked_by`: tasks that must complete first
   - `blocks`: tasks that depend on this one

Tasks should be granular enough to run independently but logical enough to represent complete units of work.

---

## Phase 2: Task Execution

### Execution Strategy
Execute tasks using sub-agents for parallel processing:

1. **Group tasks into waves** based on dependencies
2. **Run each task in its own sub-agent** - This keeps context usage low (~18% vs ~56% without sub-agents)
3. **Process waves sequentially** - Wave N+1 starts only after Wave N completes

### For Each Task (Sub-Agent Instructions)
1. Use `TaskGet` to read full task details
2. Create/modify specified files
3. Implement functionality to pass mapped TCs
4. **Self-verify the implementation:**
   - Check that code compiles/runs without errors
   - Verify the functionality matches test expectations
   - Ensure design consistency with existing patterns
5. Use `TaskUpdate` to mark task complete with a brief summary of what was done
6. Note any deviations, concerns, or discovered issues

### If Issues Are Discovered
- Use `TaskCreate` to add new fix/bug tasks
- Set appropriate dependencies so fixes run in correct order
- Continue with other independent tasks

---

## Phase 3: Completion Summary

After all tasks are complete, provide:

### 1. Summary of Changes
- Files created
- Files modified  
- Key functionality added

### 2. Self-Verification Results
- What works as expected
- Any concerns or edge cases noted
- Tasks that required fixes (if any)

### 3. Ready for Testing
- Confirm all tasks marked complete
- List any setup needed for testing
- Note any known limitations

---

## Important Notes

- **Do NOT run full test suite** - that's the next step
- **Use `TaskList`** periodically to check overall progress
- **Dependencies are critical** - ensure tasks don't start before their blockers complete
- **Keep sub-agent context focused** - each sub-agent only needs info for its specific task

Claude Code terminal showing the comprehensive implementation execution prompt with three phases: Task Creation with dependency tracking, Task Execution using sub-agents in waves for parallel processing, and Completion Summary requirements

I know. That’s a lot of prompt. But here’s the thing—you set this up once, and then you watch the magic happen.

Task Creation

First, Claude creates all 13 tasks from the implementation plan:

Claude Code showing the task creation process with a list of all tasks including Payment Mode Configuration, Product Sync, Plan Changes, Cancellations, Day Pass Handling, Member Portal, Admin Dashboard, Email Notifications, Checkout Flow, and Subscription Lifecycle with their blocking dependencies

Then sets up dependencies between them. (This is the “bookshelf before desk” part.)

Claude Code displaying the complete task dependency tree showing all tasks with their blocking relationships, such as Task 2.1 Checkout Flow blocked by Tasks 1 and 2, Task 3.1 Renewal Processing blocked by Tasks 3 and 4, and Phase 4 tasks blocked by multiple earlier tasks

Wave-Based Execution

Before diving into implementation, Claude explores the existing codebase to understand patterns:

Claude Code showing Phase 1 execution starting with codebase exploration using 18+ tool calls, displaying the task list with all tasks pending and their dependency relationships clearly visible

Wave 1 starts with 2 sub-agents running Tasks 1.1 and 1.2 in parallel:

Claude Code showing Wave 1 execution with 2 Task agents running in parallel, implementing Payment Mode Configuration (40 tool uses, 111.2k tokens) and Product Sync (36 tool uses, 123.2k tokens), with real-time status updates showing files being modified

Once Wave 1 completes, Wave 2 begins:

Claude Code displaying Wave 1 completion with both tasks marked Done, then starting Wave 2 with 2 new Task agents implementing Checkout Flow and Subscription Lifecycle, showing the task list with Tasks 3 and 4 now unblocked

Once Wave 2 completes, Wave 3 (the critical phase) starts with 3 sub-agents:

Claude Code showing Wave 2 completion and Wave 3 starting with 3 Task agents implementing Renewal Processing with Race Condition Prevention, Token Validation and Ownership Security, and 3D Secure Handling with Retry Logic, with completed tasks shown crossed out

Wave 3 completes, and Wave 4 launches with 6 sub-agents in parallel:

Claude Code displaying Wave 3 completion with all 3 tasks done, then starting Wave 4 with 6 Task agents running simultaneously for Plan Changes, Cancellations, Day Pass, Portal Integration, Admin Dashboard, and Email Notifications

(Six sub-agents. Running in parallel. Each with fresh context. This is the future we were promised.)
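The wave grouping itself is a small piece of dependency logic: a task joins a wave once everything blocking it is done. Below is a minimal sketch of that layering. The dependency graph is hypothetical, shaped to resemble this run (each phase blocked by the one before it), and `group_into_waves` is an illustrative name, not a Claude Code tool.

```python
def group_into_waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Layer tasks so each wave holds only tasks whose blockers are already done."""
    remaining = {task: set(deps) for task, deps in blocked_by.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            # Nothing can start: a cycle, or a blocker that is not a known task.
            raise ValueError(f"Unresolvable dependencies among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical graph: the task IDs and blocking relationships are assumptions.
PHASE_1 = {"1.1", "1.2"}
PHASE_2 = {"2.1", "2.2"}
PHASE_3 = {"3.1", "3.2", "3.3"}
BLOCKED_BY = {
    **{task: set() for task in PHASE_1},
    **{task: set(PHASE_1) for task in PHASE_2},
    **{task: set(PHASE_2) for task in PHASE_3},
    **{f"4.{i}": set(PHASE_3) for i in range(1, 7)},
}

if __name__ == "__main__":
    waves = group_into_waves(BLOCKED_BY)
    print([len(wave) for wave in waves])  # [2, 2, 3, 6]
```

Under these assumed dependencies, the layering reproduces the 2, 2, 3, 6 pattern from the run above, for 13 tasks in total.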

Implementation Complete

All 13 tasks across 4 waves:

Claude Code showing all 6 Wave 4 Task agents finished with their tool counts and token usage, followed by the Implementation Complete summary showing Phase 1 Foundation and Configuration changes including new files created and modified for Payment Mode and Product Sync
Claude Code displaying Phase 2 Checkout and Lifecycle implementation summary showing created files for Checkout Flow and Subscription Lifecycle with test case mappings, plus Phase 3 Renewal Processing Critical section with Race Prevention, Token Validation, and 3DS Retry Logic details
Claude Code showing Phase 4 Features implementation summary including Plan Changes, Cancellations, Day Passes, Portal Integration, Admin Dashboard, and Email Notifications, plus Test Coverage table showing all 38 test cases implemented across P0 Critical, P1 Critical, High, and Medium priorities
Claude Code displaying the final implementation summary with P0/P1 Critical Features Implemented table showing 8 critical features with their implementations, Ready for Testing section with setup requirements, Known Limitations, and confirmation that all 13 tasks across 4 phases are complete, with total execution time of 52 minutes 22 seconds

Total time: 52 minutes 22 seconds.

That’s 13 tasks. 38 test cases worth of functionality. Parallel execution keeping context usage low throughout.

Remember my WooCommerce integration specs? The 2,000+ lines that would have turned into a context-overloaded mess if I’d just said “implement this”?

Every requirement addressed. Every edge case accounted for. Every critical feature implemented.

.

.

.

The Complete Claude Code Implementation Workflow

Let me put this all together:

A technical infographic titled "SPECS TO IMPLEMENTATION WORKFLOW" outlines a process in five steps using a blue, gray, and orange color scheme on a light grid background. The title is at the top in large, bold dark text. Below it, the workflow begins with a box on the left titled "BULLETPROOF SPECS," accompanied by an icon of a document with a shield, stating "2,000+ lines of detailed requirements." An arrow points right to a box titled "TEST PLAN," with a clipboard and magnifying glass icon, asking "'What does success look like?'" and noting "38 test cases across 12 sections." An arrow points down to a central box titled "IMPLEMENTATION PLAN," featuring a flowchart icon, with the text "'What tasks map to which test cases?'" and "12 tasks in 4 phases with dependencies." Another arrow points down to a box titled "TASK EXECUTION WITH SUB-AGENTS," with an icon of a gear and people, listing "Wave 1: 2 agents," "Wave 2: 2 agents," "Wave 3: 3 agents," and "Wave 4: 6 agents," with a callout box highlighting "~18% context vs ~56% without." The final box at the bottom is titled "IMPLEMENTATION COMPLETE," with a checkered flag and stopwatch icon, summarizing the results as "52 minutes," "13 tasks," and "38 test cases" with a green checkmark.

Step 1: Create a test plan from your specs

  • Claude analyzes every requirement
  • Generates verifiable test cases
  • Establishes priority levels

Step 2: Create an implementation plan mapped to test cases

  • Groups requirements into logical tasks
  • Links each task to specific test cases
  • Identifies dependencies between tasks

Step 3: Execute with sub-agents

  • Create all tasks with dependency tracking
  • Process in waves based on blocking relationships
  • Each sub-agent gets fresh context
  • Parallel execution where dependencies allow

It’s the complete Claude Code implementation workflow—from inventory list to fully unpacked apartment. (Okay, I’ll stop with the moving metaphor now.)

.

.

.

But Wait. There’s a Catch.

You followed the workflow. Claude built everything in 52 minutes. Your implementation summary shows all 38 test cases “implemented.”

Here’s the uncomfortable truth I need to share with you:

“Implemented” and “working correctly” aren’t the same thing.

Claude sometimes takes shortcuts. Features get partially built. Edge cases get acknowledged in comments but not actually handled. The sub-agent said “Done”—but did it really do what the test case required?

You need a verification step. A way to systematically check that every test case actually passes. A process for catching the gaps before you discover them in production.
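As a starting point, the status tracker from the test plan can already tell you which test cases nobody has confirmed yet. This is a minimal sketch, assuming the tracker keeps the five-column table from the prompt and that a passing test is marked `[x]` in the Status column. That marker is an assumption, since the template only shows the empty `[ ]`. The sample rows are made up.

```python
def unverified_test_cases(tracker_markdown: str) -> list[str]:
    """Return IDs of tracker rows whose Status cell is not a checked box."""
    pending = []
    for line in tracker_markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 4 or not cells[0].startswith("TC-"):
            continue  # skip the header, the separator, and non-table lines
        if cells[3].lower() != "[x]":
            pending.append(cells[0])
    return pending


# Hypothetical tracker contents in the template's column order.
SAMPLE_TRACKER = """\
| TC     | Test Case            | Priority | Status | Remarks |
| ------ | -------------------- | -------- | ------ | ------- |
| TC-001 | Mode switch blocking | Critical | [x]    |         |
| TC-002 | Product sync         | High     | [ ]    |         |
| TC-003 | Renewal email        | Medium   | [ ]    | flaky   |
"""

if __name__ == "__main__":
    print(unverified_test_cases(SAMPLE_TRACKER))  # ['TC-002', 'TC-003']
```

This only reports what the tracker says. It does not run anything, so it is a checklist aid, not a substitute for actually executing each test case.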

That’s next week.

(Yes, another cliffhanger 😁)