Category: The Art of Vibe Coding

10 min read The Art of Vibe Coding

How I Chained Two Codex /goal Runs to Build a Complete CLI Tool

One paragraph describing a CLI idea.

Two /goal commands. Eighty minutes of letting the machine work.

A complete CLI tool at the end — 35 tests passing across 6 files, build green, typecheck green, every acceptance criterion mapped to evidence.

Last week’s post showed /goal building a single WordPress plugin in 28 minutes. One goal, one feature, one walk-away-and-come-back. That was the proof of concept.

This time: two goals, chained. The first goal built the MVP command surface in 32 minutes. The second added OAuth authentication in 47 minutes. Both ran fully autonomously while I was doing something else.

The vehicle for this build is one you might recognize — the same Reddit summarizer I built back in March using the 6-step Workflow Engineering process. That version was an Express server with REST endpoints and active supervision throughout. It worked. But every time I wanted an AI agent to pull Reddit data, it had to start the server, make HTTP calls, parse responses — burning tokens on ceremony. A CLI tool that an agent can invoke directly from the command line, get structured JSON back, and move on? That consumes a fraction of the tokens and saves real money on Claude and Codex subscriptions.

So I rebuilt it.

Back in March, building that summarizer required active supervision across six steps — spec brainstorm, review, test plan, implementation plan, execute, test. This time: two skill invocations and two /goal commands, with less than ten minutes of human input across the whole thing.

The Codex /goal command scales beyond single features.

When a project is too big for one goal, you slice it — and the skill can help you find the seams.

VS Code file explorer showing only .codex/skills/cli-spec-to-goal and playwright-cli folders — the starting state with skills checked in but no implementation code

.

.

.

Meet cli-spec-to-goal — The Skill That Splits

Last week’s post introduced wp-spec-to-goal, a Codex Agent skill that turns a vague paragraph into the GOAL.md / VERIFY.md / PROGRESS.md trio that Codex needs to finish a goal autonomously. That skill was designed for WordPress plugins.

cli-spec-to-goal is its counterpart for CLI tools.

Same core workflow — take a vague idea, ask focused questions, produce the goal bundle plus an optional project scaffold. One key difference sets it apart.

Automatic complexity detection.

The WP skill always produced one goal.

The CLI skill inspects the spec and judges whether the project fits in a single goal or should be split into multiple slices. When it decides to split, it writes a goals-plan.md at the project root with numbered slices and generates the first goal bundle only — leaving the rest for later invocations.

(That split detection turned out to be the most interesting part of the whole build. More on that in a moment.)

Every goal it generates includes six AI-agent-friendly patterns: --json output mode, stdout/stderr separation, TTY detection, meaningful exit codes, structured errors, and --dry-run previews. These patterns make the resulting CLI safe for AI agents to invoke directly. The repo has the full list in the skill’s reference templates.
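Here is a rough TypeScript sketch of what five of those patterns can look like in practice (--dry-run is left out to keep it short). It is not code from the repo: `render`, `Outcome`, `CliError`, and the exit code values are hypothetical names chosen for illustration, and falling back to JSON whenever stdout is not a TTY is an assumption about how TTY detection might be used.

```typescript
// Hypothetical sketch of agent-friendly CLI output; not taken from the repo.
type CliError = { code: string; message: string; exitCode: number };

type Outcome =
  | { ok: true; data: Record<string, unknown> }
  | { ok: false; error: CliError };

type RunOptions = { json: boolean; stdoutIsTTY: boolean };

type Rendered = { stdout: string; stderr: string; exitCode: number };

function render(outcome: Outcome, opts: RunOptions): Rendered {
  // TTY detection (assumed policy): emit JSON when asked for it, or when
  // stdout is piped, which is the likely case when an agent is the caller.
  const useJson = opts.json || !opts.stdoutIsTTY;

  if (!outcome.ok) {
    const { code, message, exitCode } = outcome.error;
    // Structured errors go to stderr so stdout stays parseable.
    const stderr = useJson
      ? JSON.stringify({ error: { code, message } })
      : `Error [${code}]: ${message}`;
    // Meaningful exit code: the error decides it, never a blanket 1.
    return { stdout: "", stderr, exitCode };
  }

  const data = outcome.data;
  const stdout = useJson
    ? JSON.stringify(data)
    : Object.keys(data)
        .map((key) => `${key}: ${String(data[key])}`)
        .join("\n");
  return { stdout, stderr: "", exitCode: 0 };
}
```

In a real entry point, the caller would pass `process.stdout.isTTY`, write the two strings to their respective streams, and exit with `exitCode`.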

For the invocation, I typed one paragraph.

It described what the existing Express server does, said I wanted a CLI version that’s easier for AI agents to interact with, and pointed at the server codebase as a read-only reference.

Codex terminal showing the cli-spec-to-goal skill invocation with a plain-English description of converting the Express/TypeScript server into an AI-agent-friendly CLI tool, with the server codebase path provided as reference

The skill probed both repos before asking anything.

It read the empty target repo, then explored the Express server’s source files — config, Reddit API client, storage patterns, route handlers, rate limiting, test structure. It identified this as a project broad enough to warrant splitting.

Codex exploring the empty target workspace and reading the Express server's source files, noting the project is shaped like a new CLI that can reuse the server's Reddit client and storage logic

.

.

.

The Split Decision

This is the moment the skill earned its keep.

After probing both repos — the empty target and the existing Express server — it came back with three decisions to confirm:

  1. Scope shape: Split plan + first goal (recommended) vs. one combined goal
  2. Target repo: Build the CLI in the new repo, using the server as a read-only source reference
  3. First CLI surface: MVP core commands (collect, collect-all, logs list, logs read, health) using an existing env-based refresh token. No OAuth login in the first slice.

Codex presenting three decisions with recommended defaults: split plan plus first goal for scope shape, build in the new repo for target, and MVP core commands for the first CLI surface, with reasoning for each recommendation

I replied “use recommendations” and the skill started writing.

It pulled its reference templates — goal, verify, and progress — and began generating. Two and a half minutes of file creation later, here’s what appeared:

A goals-plan.md listing three proposed goal slices (MVP, auth, config polish).

A complete goal trio for the first slice: GOAL.md with 4 user stories and 15 acceptance criteria, VERIFY.md with binary smoke checks, functional checks, exit code checks, and integration checks, and a skeleton PROGRESS.md ready for Codex to fill in during the goal run.

It also produced the exact /goal command to paste — tailored to the file paths it had just created.

Total time from invocation to handoff: 4 minutes 45 seconds.

Skill completion output listing the generated files — goals-plan.md plus goals/reddit-cli-mvp/GOAL.md, VERIFY.md, and PROGRESS.md — with the tailored /goal command ready to paste and a 4m 45s elapsed time
VS Code file explorer showing the generated structure: .codex/skills with cli-spec-to-goal and playwright-cli, goals/reddit-cli-mvp with GOAL.md, PROGRESS.md, and VERIFY.md, plus goals-plan.md at the root

.

.

.

Goal 1: The MVP (32 Minutes)

The handoff: paste the /goal command, press enter, walk away.

/goal Complete goals/reddit-cli-mvp/GOAL.md. Use goals/reddit-cli-mvp/VERIFY.md
as the verification contract. Update goals/reddit-cli-mvp/PROGRESS.md continuously.
Treat uncertainty as incomplete.

Short command.

Heavy lifting lives in the files the skill already wrote.
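Because the command only varies by goal folder, it is easy to template. The helper below is a hypothetical sketch (the skill's own implementation is not shown in this post); `buildGoalCommand` is an invented name, and the only assumption is the `goals/<slug>/` layout described above.

```typescript
// Hypothetical helper: rebuilds the /goal command for any goal slug,
// assuming the goals/<slug>/{GOAL,VERIFY,PROGRESS}.md layout.
function buildGoalCommand(slug: string): string {
  const dir = `goals/${slug}`;
  return [
    `/goal Complete ${dir}/GOAL.md. Use ${dir}/VERIFY.md`,
    `as the verification contract. Update ${dir}/PROGRESS.md continuously.`,
    "Treat uncertainty as incomplete.",
  ].join("\n");
}

console.log(buildGoalCommand("reddit-cli-mvp"));
```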

Codex terminal showing the /goal command for reddit-cli-mvp pasted into the composer, ready to execute

Codex activated the goal.

It read GOAL.md, VERIFY.md, and PROGRESS.md, inspected the current repo state, checked for any existing scaffold or generated files, and started implementing. It used the Express server source as a read-only reference — pulling the Reddit client patterns, filtering logic, and JSON log structure — while building the TypeScript CLI from scratch.

Codex goal activation showing it reading the goal trio files, exploring the repo tree, checking for existing package.json and README.md, and beginning its implementation plan

Then the black box.

I left. Codex worked.

I went to make coffee. Checked YouTube while it brewed. The Codex session kept running through file edits, test runs, and self-audits in the background. The session was busy. I was elsewhere.

32 minutes and 44 seconds later, the goal was marked complete.

What shipped:

  • The full TypeScript ESM CLI scaffold with Commander.js
  • 5 commands (collect, collect-all, logs list, logs read, health)
  • 26 source files
  • 5 test files with 15 tests, all passing

npm install, npm run build, npm run typecheck, and npm test were all green. Binary smoke checks and functional JSON/error/log checks from VERIFY.md all passed.

If you read last week’s post, this shape should look familiar. The spec defines the boundaries, the continuation prompt refuses to declare victory without evidence, and you trust the result because the audit trail is sitting right there in PROGRESS.md.

One honest note from the completion summary: no live Reddit API check was run, because no real credentials were available in the build environment. Tests used mocks, as required by the verification contract. Codex noted this explicitly — the kind of transparency you want from an autonomous run.

Goal completion summary showing 5 test files, 15 tests passed, all verification commands green including npm install, build, typecheck, and test, with binary smoke checks and functional checks from VERIFY.md passed. Goal usage 1955 seconds, worked for 32m 44s, with a "Goal achieved (32m)" badge

.

.

.

Goal 2: OAuth Auth (47 Minutes, Same Pattern)

The MVP left a deliberate gap.

It required an existing REDDIT_REFRESH_TOKEN in .env to call the Reddit API — functional for a developer who already has credentials, but no way to acquire them through the tool itself. The goals-plan.md had already named the next slice: reddit-cli-auth.

VS Code showing goals-plan.md with three proposed goal slices, the second slice (goals/reddit-cli-auth for OAuth auth commands) highlighted with a red box, indicating the natural next step

I invoked the skill again, this time with one sentence: “Now that we have completed goals/reddit-cli-mvp, I want to proceed with the next goal: goals/reddit-cli-auth.”

Codex terminal showing the cli-spec-to-goal skill invocation with the instruction to proceed with the next goal, referencing the goals-plan.md

The skill scanned the now-implemented codebase.

The repo that was empty an hour ago now had 26 source files, a working test suite, and a complete CLI structure. It read the existing Commander setup, Vitest test patterns, error handling conventions, and .env configuration. It also searched Reddit’s OAuth2 documentation to verify endpoint details and token flow specifics.

Codex reading the implemented MVP source files — cli.ts, config.ts, reddit.ts, errors.ts, types.ts — and searching Reddit OAuth documentation for authorization code flow details, redirect URI scope, and token endpoints

The GOAL.md it produced fit into the existing codebase. It referenced the same Commander program, the same test framework, the same error types, and the same .env storage approach. The skill verified there were no unresolved template placeholders, confirmed it had only generated the /goal contract files (no implementation), and cross-checked OAuth endpoint details against Reddit’s official documentation.

Skill output listing the generated goal bundle — goals/reddit-cli-auth/GOAL.md, VERIFY.md, and PROGRESS.md — with the tailored /goal command and a note that OAuth endpoints were checked against Reddit's OAuth2 documentation

Paste the second /goal command. Press enter. Walk away again.

/goal Complete goals/reddit-cli-auth/GOAL.md. Use goals/reddit-cli-auth/VERIFY.md
as the verification contract. Update goals/reddit-cli-auth/PROGRESS.md continuously.
Treat uncertainty as incomplete.

Codex terminal showing the second /goal command for reddit-cli-auth pasted into the composer

Codex activated the second goal.

Same startup pattern — read the goal trio, explore the repo, compare current implementation against the verification contract, build a plan.

Codex starting the second goal, reading goal files and exploring the existing codebase structure to understand what's already implemented before making changes

47 minutes and 15 seconds later…

Three auth commands, all wired into the existing CLI:

  • auth login: full OAuth Authorization Code flow with a temporary localhost callback server
  • auth status: machine-readable token and identity info
  • auth logout: optional --revoke flag to invalidate the token server-side

35 total tests across 6 files. Build, typecheck, and test all green.
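For readers who have not wired up an authorization code flow before, here is a minimal sketch of the first leg of auth login. It is not the code in src/auth.ts. The endpoint and query parameter names follow Reddit's OAuth2 documentation; the function names, the port 8080, and the chosen scopes are assumptions for illustration.

```typescript
// Sketch only: function names, port, and scopes are hypothetical.
type AuthorizeParams = {
  clientId: string;
  redirectUri: string; // the temporary localhost callback server
  state: string; // random value that must come back unchanged
  scopes: string[];
};

function buildAuthorizeUrl(p: AuthorizeParams): string {
  const query = new URLSearchParams({
    client_id: p.clientId,
    response_type: "code",
    state: p.state,
    redirect_uri: p.redirectUri,
    duration: "permanent", // asks Reddit for a refresh token as well
    scope: p.scopes.join(" "),
  });
  return `https://www.reddit.com/api/v1/authorize?${query.toString()}`;
}

// Validates the redirect that hits the callback server and returns the
// authorization code, which is then exchanged at the token endpoint.
function readCallback(callbackUrl: string, expectedState: string): string {
  const params = new URL(callbackUrl).searchParams;
  const error = params.get("error");
  if (error) throw new Error(`Authorization failed: ${error}`);
  if (params.get("state") !== expectedState) throw new Error("State mismatch");
  const code = params.get("code");
  if (!code) throw new Error("Missing authorization code");
  return code;
}
```

Checking `state` before touching the code is what keeps a stray or forged request to the localhost callback from being accepted.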

Second goal completion showing auth commands implemented with OAuth/auth core in src/auth.ts, auth login/status/logout wired in src/cli.ts, access-token identity verification in src/reddit.ts, test coverage in tests/auth.test.ts and tests/cli.test.ts, README updated. Worked for 47m 15s

An interesting wrinkle surfaced during this run.

The second time I pasted a /goal command and walked away, it felt… ordinary. The novelty was gone. I didn’t hover over the terminal wondering if it would work. I just left.

That’s the point. When the second walk-away feels routine, the pattern has landed.

.

.

.

Does It Actually Work?

Same instinct as the WP post: close the terminal and test it like a real user.

First, the dry run:

node bin/reddit-summarizer.js collect --subreddit ClaudeCode --dry-run --json

Terminal showing the dry run command with clean JSON output: dryRun true, command collect, wouldCallReddit false, wouldWriteLog true, subreddit ClaudeCode, minScore 10, minComments 5, hours 4, commentsPerPost 10, output log

Clean JSON showing exactly what would happen without calling Reddit or writing files. dryRun: true, wouldCallReddit: false, wouldWriteLog: true. This is the --dry-run pattern in action — an AI agent invoking this CLI can preview any command before committing to side effects.
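A dry-run handler can be as small as "resolve the flags, describe the plan, touch nothing". The sketch below reproduces the fields visible in the output above. `planCollect` and `CollectFlags` are hypothetical names, and treating the displayed values (10, 5, 4, 10, log) as the CLI's defaults is an assumption.

```typescript
// Hypothetical sketch of a --dry-run handler; not the repo's code.
type CollectFlags = {
  subreddit: string;
  minScore?: number;
  minComments?: number;
  hours?: number;
  commentsPerPost?: number;
  output?: string; // "log" and "both" are the modes shown in this post
};

// Assumed defaults, copied from the values in the dry-run output.
const DEFAULTS = { minScore: 10, minComments: 5, hours: 4, commentsPerPost: 10, output: "log" };

function planCollect(flags: CollectFlags) {
  const resolved = {
    subreddit: flags.subreddit,
    minScore: flags.minScore ?? DEFAULTS.minScore,
    minComments: flags.minComments ?? DEFAULTS.minComments,
    hours: flags.hours ?? DEFAULTS.hours,
    commentsPerPost: flags.commentsPerPost ?? DEFAULTS.commentsPerPost,
    output: flags.output ?? DEFAULTS.output,
  };
  // Pure function: no network call, no file write, just the plan.
  return {
    dryRun: true,
    command: "collect",
    wouldCallReddit: false, // value as reported in the output above
    wouldWriteLog: resolved.output === "log" || resolved.output === "both",
    ...resolved,
  };
}

console.log(JSON.stringify(planCollect({ subreddit: "ClaudeCode" }), null, 2));
```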

Then the real run:

node bin/reddit-summarizer.js collect --subreddit ClaudeCode --hours 24 \
  --min-score 10 --min-comments 5 --comments-per-post 3 --output both --json

Terminal showing the full collect command with flags for hours, score threshold, comment threshold, comments per post, and output mode

Real Reddit data came back. Posts from r/ClaudeCode with titles, authors, scores, comment counts, URLs, flairs, timestamps, and threaded comment data — machine-parseable JSON that any agent can consume directly.

Terminal showing real Reddit post data in JSON format — posts from r/ClaudeCode with titles like "Most important skill with agent coding learned so far", author names, scores, comment counts, URLs, flairs, and nested comment threads with body text and timestamps

The log file landed exactly where GOAL.md said it would — logs/ClaudeCode/2026-05-13.json — with the same structured data persisted to disk for downstream processing.
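To make the flags and the log path less abstract, here is a small sketch of the filtering and path logic they imply. It is an illustration, not the repo's implementation: `filterPosts`, `logPathFor`, and the `Post` shape are hypothetical, inclusive thresholds are an assumption, and so is deriving the date in UTC.

```typescript
// Hypothetical sketch of the collect filters and the log path convention.
type Post = { id: string; score: number; numComments: number; createdUtc: number };

type Filters = { minScore: number; minComments: number; hours: number };

function filterPosts(posts: Post[], filters: Filters, nowMs: number): Post[] {
  // Reddit reports creation time in seconds since the epoch.
  const cutoffSeconds = nowMs / 1000 - filters.hours * 3600;
  return posts.filter(
    (post) =>
      post.score >= filters.minScore &&
      post.numComments >= filters.minComments &&
      post.createdUtc >= cutoffSeconds
  );
}

function logPathFor(subreddit: string, date: Date): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
  return `logs/${subreddit}/${day}.json`;
}
```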

VS Code editor showing logs/ClaudeCode/2026-05-13.json with structured Reddit post data including post IDs, titles, authors, scores, comment counts, URLs, and nested comments, with the file tree showing the logs folder structure on the left

Two autonomous goal runs.

Zero manual coding.

The tool works against a live API.

The dry-run test proves the agent-safety patterns work.

The live run proves the Reddit integration works.

.

.

.

When to Split (And When Not To)

The skill detected the split automatically, but the heuristic is learnable.

Here’s when you should slice a project into multiple /goal runs:

Split when:

  • The project has more than ~5 acceptance criteria spanning unrelated concerns
  • There’s a natural “core first, then extensions” shape — MVP then auth, core then plugins
  • One slice needs credentials or setup that another doesn’t (OAuth login needs Reddit app registration; the MVP only needs an existing token)
  • The total scope would exhaust a single goal’s token budget

Keep as one goal when:

  • Everything shares the same test fixtures and setup
  • The feature is a single vertical slice — one user story, 3-5 acceptance criteria
  • Splitting would create artificial boundaries that increase integration risk
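If it helps to see the heuristic as a checklist, here is one way to encode it. This is my reading of the bullets above, not the skill's actual logic (the skill makes a judgment call from the spec). `SpecShape`, `splitReasons`, and the rule that any single signal is enough to split are all assumptions.

```typescript
// Hypothetical encoding of the split heuristic; the real skill judges this
// from the spec text rather than from a struct like this.
type SpecShape = {
  acceptanceCriteria: number;
  unrelatedConcerns: number; // e.g. data collection vs. OAuth login
  coreThenExtensions: boolean;
  slicesNeedDifferentSetup: boolean;
  likelyExceedsTokenBudget: boolean;
};

function splitReasons(spec: SpecShape): string[] {
  const reasons: string[] = [];
  if (spec.acceptanceCriteria > 5 && spec.unrelatedConcerns > 1) {
    reasons.push("many acceptance criteria across unrelated concerns");
  }
  if (spec.coreThenExtensions) reasons.push("natural core-first shape");
  if (spec.slicesNeedDifferentSetup) reasons.push("slices need different credentials or setup");
  if (spec.likelyExceedsTokenBudget) reasons.push("scope would exhaust one goal's token budget");
  return reasons;
}

// Assumed policy: one signal is enough to recommend a split.
function shouldSplit(spec: SpecShape): boolean {
  return splitReasons(spec).length > 0;
}
```

Returning the reasons, not just a boolean, mirrors how the skill explained its recommendation before asking for confirmation.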

Here’s how goals-plan.md ties it all together:

  1. Number your slices, give each a one-line description, and generate one goal at a time.
  2. Run them sequentially.
  3. Each /goal run inherits the codebase state from the previous one — the skill detects what already exists and generates goals that fit into the structure that’s already there.
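The chaining rule itself fits in a few lines. The sketch below models goals-plan.md as data and picks the next slice to generate; the `Slice` shape, the `done` flag, and the third slug (`reddit-cli-config`) are hypothetical, since the post only names the first two slices.

```typescript
// Hypothetical model of goals-plan.md; the real file is Markdown.
type Slice = { order: number; slug: string; description: string; done: boolean };

// Returns the lowest-numbered slice that has not been completed yet.
function nextSlice(plan: Slice[]): Slice | undefined {
  const pending = plan.filter((slice) => !slice.done);
  pending.sort((a, b) => a.order - b.order);
  return pending[0];
}

const plan: Slice[] = [
  { order: 1, slug: "reddit-cli-mvp", description: "MVP command surface", done: true },
  { order: 2, slug: "reddit-cli-auth", description: "OAuth auth commands", done: false },
  // Slug below is invented for the example; the post only says "config polish".
  { order: 3, slug: "reddit-cli-config", description: "Config polish", done: false },
];

console.log(nextSlice(plan)?.slug);
```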

.

.

.

The Bigger Picture

The pattern is repeating.

Last week showed /goal with wp-spec-to-goal for a WordPress plugin. This post shows it with cli-spec-to-goal for a CLI tool. Skill generates spec, /goal executes spec, PROGRESS.md proves it. The domain changed, the workflow stayed the same.

Goal chaining is the multiplier.

One goal proved the concept. Two goals proved it scales. Each goal run inherits the full context of what was built before, because the codebase itself is the shared state. No context window to manage between goals — just files on disk.

And here’s the full circle.

The Reddit summarizer started as a 6-step Workflow Engineering build in March — spec brainstorm, review, test plan, implementation plan, execute, test — with active supervision at every step. The same project, rebuilt as a CLI, took two skill invocations and two /goal commands. The human work was describing what to build. The machine work was everything else.

The repo is public at github.com/nathanonn/cli-reddit-summarizer. Inspect the actual GOAL.md, VERIFY.md, and PROGRESS.md files for both goals. Grab the cli-spec-to-goal skill from .codex/skills/ if you build CLI tools.

Start thinking about your next project in terms of goal slices.

19 min read The Art of Vibe Coding

How to Use Codex /goal to Build WordPress Plugins (My Spec-to-Ship Workflow)

I typed /goal.

Walked away from the keyboard. Half-expected to come back to a mess.

Twenty-eight minutes later — no mess.

A working WordPress plugin was sitting there instead. Acceptance criteria mapped to evidence. Verification commands run. Browser screenshots captured by Playwright. A PROGRESS.md audit file in git, waiting for me to read it like a report card I didn’t have to study for.

Codex completion summary showing the goal marked complete after 28 minutes elapsed, with implementation details, test results, and Playwright browser evidence

That gap — between typing the command and seeing the result — was the whole point of the experiment.

A year ago at WordCamp Johor Bahru 2025, I was on stage demoing a five-tool workflow that took fifty minutes. Today I trigger one command and leave the room.

(Progress looks a lot like laziness if you squint.)

Here’s what changed.

OpenAI shipped the /goal command in Codex CLI 0.128.0 on April 30, 2026. The official description calls it “persisted goal workflows with app-server APIs, model tools, runtime continuation, and TUI controls.”

Translated for humans: you give Codex an objective, and Codex keeps working toward it until evidence says it’s done.

That’s a different shape of AI assistance from the usual prompt-and-watch loop. Worth pausing on, because the honest caveat lands fast: /goal is not magic. Garbage spec in, garbage outcome out. The other half of the win was an Agent skill I built called wp-spec-to-goal, which turns a vague paragraph into the GOAL.md, VERIFY.md, and PROGRESS.md trio that Codex actually needs to finish.

(Last August I wrote about vibe coding a WordPress plugin in 50 minutes with Claude Code. That post was honest at the time. Fifty minutes felt fast. Looking at it now? Most of those minutes were me clicking “approve,” reading diffs, and playing air traffic controller for an AI that didn’t need one.)

This post is what happened when I removed myself from that loop entirely.

By the end of it you’ll know what /goal is, how to turn it on, how to write a goal that actually completes, and how I scaffolded the spec for a working WordPress plugin in under five minutes.

Stay with me.

.

.

.

What /goal Actually Is (And Why It Changes How You Build)

Here’s the thing about a normal Codex prompt: it says “do this task once.”

/goal says something different. It says “keep pursuing this objective until evidence says done.”

Subtle distinction. Enormous consequences.

When you start a goal, Codex attaches a persisted objective to your thread. The runtime quietly tracks what you asked for, the current status, how much time has passed, and how many tokens are gone.

Then a small loop kicks in — kind of like a dog that won’t stop fetching until you take the ball away.

Codex finishes a turn. The session goes idle. The runtime checks whether the goal still needs work. If yes, Codex gets a continuation prompt and picks the next action. The cycle repeats until completion criteria are met, the token budget runs dry, or you pause it yourself.

Four states matter:

  • active
  • paused
  • complete
  • budget_limited

The TUI summary shows them when you type /goal on its own.
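As a mental model, the loop and its four states can be written as a tiny state machine. This is a simplified sketch, not Codex's runtime code: `GoalSnapshot`, `onIdle`, and the order in which the checks run are assumptions.

```typescript
// Simplified model of the continuation loop; not Codex's actual runtime.
type GoalStatus = "active" | "paused" | "complete" | "budget_limited";

type GoalSnapshot = {
  status: GoalStatus;
  tokensUsed: number;
  tokenBudget: number;
  evidenceComplete: boolean; // every requirement mapped to concrete evidence
  pauseRequested: boolean;
};

// Decides what happens when a turn ends and the session goes idle.
function onIdle(goal: GoalSnapshot): GoalStatus {
  if (goal.status !== "active") return goal.status; // only active goals continue
  if (goal.pauseRequested) return "paused";
  if (goal.evidenceComplete) return "complete";
  if (goal.tokensUsed >= goal.tokenBudget) return "budget_limited";
  return "active"; // still work to do: the next continuation prompt goes out
}
```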

Now here’s the small (but load-bearing) detail that makes everything else in this post possible. If you peek at Codex’s open source continuation prompt template, you’ll find the model is told to map every requirement to concrete evidence — files, command output, test results — and to treat uncertainty as not-done.

Read that last part again. Treat uncertainty as not-done.

That’s what makes 28 minutes of absence possible. Codex won’t mark a goal complete on vibes. The continuation prompt forces a real audit against real artifacts every single turn.

Compare that to a Ralph-style outer loop, where you script the iteration yourself. Or to a single long prompt that just keeps going until the context window gets tired.

(I’ve watched enough Ralph loops drift past the third iteration to recognize /goal as a different beast entirely.)

With /goal, the runtime tracks the objective, decides whether to continue, and refuses to declare victory without proof. You hand Codex an objective and a definition of done — then step out of the way.

👉 That mental model is the foundation for everything else in this post.

.

.

.

Three Commands and a Restart

Before any of the autonomy stuff works, you need to flip a couple of switches. Don’t worry — it’s quick. Like, “faster than making instant noodles” quick.

Update Codex first. The version that introduced /goal is 0.128.0, so anything older won’t even show the command.

npm install -g @openai/codex@0.128.0

Or if your install supports the built-in updater:

codex update

Confirm with codex --version. You want 0.128.0 or newer.
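If you script this check (in a setup script, say), compare the version parts numerically, because a plain string comparison would rank 0.9.0 above 0.128.0. A minimal sketch follows; `parseVersion` and `isAtLeast` are hypothetical helpers, and the sample string in the usage line is an assumed shape for the `codex --version` output, not a confirmed one.

```typescript
// Hypothetical helpers for a "0.128.0 or newer" check.
function parseVersion(text: string): [number, number, number] | undefined {
  // Picks the first x.y.z found, ignoring any surrounding text.
  const match = /(\d+)\.(\d+)\.(\d+)/.exec(text);
  if (!match) return undefined;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

function isAtLeast(found: string, minimum: string): boolean {
  const a = parseVersion(found);
  const b = parseVersion(minimum);
  if (!a || !b) return false; // unknown version counts as too old
  const [aMajor, aMinor, aPatch] = a;
  const [bMajor, bMinor, bPatch] = b;
  if (aMajor !== bMajor) return aMajor > bMajor;
  if (aMinor !== bMinor) return aMinor > bMinor;
  return aPatch >= bPatch;
}

// The input string here is an assumed example of the command's output.
console.log(isAtLeast("codex-cli 0.128.0", "0.128.0"));
```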

The /goal command is gated behind a feature flag, so you have to flip it on before it appears in the TUI. Run codex features list and look for goals.

Codex feature list output showing the goals row marked under-development with a value of false, highlighted with a red box

The under-development label is honest, isn’t it? Functional, but flagged. I treat that as a reminder to actually read the PROGRESS.md output afterward instead of blindly trusting the run. You should too.

Enable it:

codex features enable goals

Codex terminal showing the command codex features enable goals returning the message Enabled feature goals in config.toml

Restart Codex inside your repo. The launch banner will warn you that under-development features are enabled, and /goal will appear in the slash-command menu the moment you type /.

Codex 0.128.0 launch screen with a warning that under-development features goals is enabled and the slash-goal command in the composer ready to autocomplete

Three commands and a restart. That’s it. The whole setup.

.

.

.

The Spec Is the Work — Meet wp-spec-to-goal

Here’s where most people trip.

The temptation with a shiny new feature like /goal is to write one sentence, press enter, and hope for the best. And honestly — for trivial tasks, that works fine.

For WordPress? It falls over fast.

There are too many quiet failure modes lurking in WordPress land — capability checks the agent forgets to add, environment gates between local and production, input sanitization before lookup, output escaping in admin HTML, hook timing, and the difference between wp-cli running on your host versus inside the wp-env Docker container.

(I’ve shipped each of those mistakes at least once. My shenanigans are your free education.)

An agent that doesn’t know about those things produces code that looks correct and behaves badly. Which — if we’re being honest — is worse than code that obviously breaks. At least broken code has the decency to announce itself.

So I built an Agent skill called wp-spec-to-goal to handle the spec layer. It lives at .codex/skills/wp-spec-to-goal/, and its only job is to take a vague paragraph and produce a Codex-ready bundle:

  • A scaffolded plugin folder (PSR-4 layout, composer.json, .wp-env.json, AGENTS.md, .gitignore, package.json) — but only the parts that don’t already exist.
  • A goals/<slug>/ directory with three files: GOAL.md, VERIFY.md, PROGRESS.md.
  • A tailored /goal command to copy and paste into Codex.

The skill follows six steps: probe the repo, judge complexity, ask clarifying questions in batches, scaffold what’s missing, write the goal trio, hand off the final command.

Here’s the starting state for this build — an empty repo with only the two relevant skills checked in.

VS Code file explorer showing a clean repo with only .codex/skills/playwright-cli and .codex/skills/wp-spec-to-goal folders, no plugin files yet, beside a Codex terminal session

I invoked the skill with one paragraph. No formatting. No structure. No acceptance criteria. Just the rough shape of what I wanted — like handing someone a napkin sketch and saying “make this real.”

The wp-spec-to-goal skill prompt typed into Codex, describing in plain English a WordPress plugin that lets an AI agent log in via a URL with a username or email parameter and switch users automatically

The skill probes the repo first. It runs ripgrep across .wp-env.json, composer.json, package.json, AGENTS.md, the goals folder, and the plugin source. It reads its own template references. It builds a picture of what already exists before asking me anything.

Codex output showing the wp-spec-to-goal skill exploring the repo with ripgrep file searches, finding only .codex and .agents folders, and noting the repo is essentially empty aside from skills and git metadata

Then comes the clarification round.

The skill judges this as a single-goal feature, flags the security boundary (a public autologin URL is intentionally dangerous outside local/dev) as the main uncertainty, proposes the slug wp-login-for-ai, and asks for confirmation before writing any files.

Codex showing the skill's analysis: a single-goal feature with security boundary as the main uncertainty, proposing the slug wp-login-for-ai and asking the user to reply yes to proceed

I replied “yes.” The skill loaded four template references, checked git status, and started writing files.

Codex output showing the wp-spec-to-goal skill confirming defaults, loading the four reference templates, running git status, and creating the wp-login-for-ai plugin and goals directories

A few minutes later, the scaffold was done.

The skill produced a summary listing the scaffolded files, the generated goal trio, the exact /goal command to paste, and a validation note. (No unresolved template placeholders, all JSON files parse, PHP lint deferred to wp-env since the host machine doesn’t have PHP installed.)

The wp-spec-to-goal skill completion summary listing scaffolded files including the plugin entry composer.json wp-env.json package.json gitignore and AGENTS.md, plus generated GOAL.md VERIFY.md PROGRESS.md, the slash-goal command to paste, and validation results

In VS Code, the new file tree showed up clean.

VS Code file explorer showing the newly scaffolded structure: goals/wp-login-for-ai with GOAL.md PROGRESS.md VERIFY.md, the wp-login-for-ai plugin folder with composer.json and the PHP entry file, plus root-level .gitignore .wp-env.json AGENTS.md and package.json

👉 The takeaway is simple: the autonomy /goal provides downstream is paid for upfront, in the spec. Five minutes here bought me 28 minutes there.

That’s not a bad trade.

.

.

.

What the Goal Trio Actually Contains

Three files, three jobs. No moonlighting.

  • GOAL.md describes what must be true when the work is done.
  • VERIFY.md describes how Codex proves it.
  • PROGRESS.md records what happened along the way.

Why three instead of one? Because /goal continues across many turns, and the continuation prompt re-reads these files every time. Mix the responsibilities and the audit gets confused — like giving one person three different job titles and hoping they remember which hat they’re wearing. Keep them separate and Codex always knows what it’s looking at.

Here’s a slice of the actual GOAL.md the skill generated for the autologin plugin:

### US-003 - Fail safely

As a site owner,
I want the shortcut constrained to local development and invalid requests
handled safely, so that the plugin cannot become a production backdoor.

Acceptance criteria:

- [ ] AC-003.1 - The shortcut only runs when `wp_get_environment_type()` is
      `local` or `development`.
- [ ] AC-003.2 - The shortcut only runs for local development hosts such as
      `localhost`, `127.0.0.1`, or `[::1]`.
- [ ] AC-003.3 - Requests for an unknown username or email fail without
      changing the current logged-in user.
- [ ] AC-003.4 - Blocked or invalid requests return a safe machine-readable
      error and do not emit PHP warnings or notices.

Notice that every acceptance criterion has an ID. Those same IDs show up later in the completion audit table that Codex fills in. That linkage is your insurance policy — it’s how you check that the run wasn’t just theatre.
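Those IDs are regular enough to pull out mechanically, which is handy if you want to double-check an audit yourself. The sketch below is a hypothetical helper (`extractCriteriaIds` is not part of the skill) and assumes the checkbox format shown in the excerpt above.

```typescript
// Hypothetical helper: collects AC IDs from GOAL.md checkbox lines such as
// "- [ ] AC-003.1 - The shortcut only runs when ...".
function extractCriteriaIds(goalMarkdown: string): string[] {
  const ids: string[] = [];
  const pattern = /^\s*- \[[ x]\] (AC-\d{3}\.\d+) - /;
  for (const line of goalMarkdown.split("\n")) {
    const match = pattern.exec(line);
    // Wrapped continuation lines have no checkbox, so they are skipped.
    if (match && match[1]) ids.push(match[1]);
  }
  return ids;
}
```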

GOAL.md also closes with a Definition of Done section:

## 13. Definition of Done

The goal is complete only when:

- [ ] Every acceptance criterion is implemented.
- [ ] Every required verification command in
      `goals/wp-login-for-ai/VERIFY.md` passes or has a documented external
      blocker.
- [ ] New or changed behavior has tests where practical.
- [ ] Existing behavior is not regressed.
- [ ] `README.md` is updated.
- [ ] `goals/wp-login-for-ai/PROGRESS.md` contains final evidence.
- [ ] /goal has performed a completion audit mapping each AC to evidence.

See that seventh bullet? The Definition of Done explicitly references the audit. Codex can’t declare victory without filling in that table. No shortcut. No “close enough.”

VERIFY.md is the verification contract — the commands Codex must run before completion, the smoke checks it must perform, and the evidence format for PROGRESS.md.

Here’s a key detail that matters more than it looks: every WordPress command routes through npx wp-env run cli rather than running native wp or composer on the host machine. Why? Because native commands target a different PHP/MySQL environment and produce results that look right but lie.

(Results that lie are — and I cannot stress this enough — the worst kind of results.)

So the skill emits this rule once, VERIFY.md enforces it again, and AGENTS.md repeats it a third time for /goal to bump into from any angle. Triple-redundant on the rules that matter.

PROGRESS.md starts as a skeleton — status: not started, empty completed list, empty commands table. By the end of the run, Codex fills it in. The most important section is the completion audit. Here’s a representative row from the final state:

| AC-001.1 | wp-login-for-ai/tests/run.php evidence-AC-001.1; npm run test:smoke; playwright-cli screenshot B-001/admin-dashboard.png | Pass |

Every acceptance criterion gets a row. Every row points to a real file, a real command output, or a real screenshot saved on disk. The evidence is the artifacts themselves, sitting right there on your filesystem. Anyone can audit them after the fact.
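You can spot-check that linkage yourself with a few lines of code. The sketch below assumes the three-column row format shown above (ID, evidence, result) and applies the same "uncertainty is not done" rule. `parseAuditRow` and `unprovenCriteria` are hypothetical names, not something Codex or the skill ships, and the parser would need extra care if an evidence cell ever contained a pipe character.

```typescript
// Hypothetical audit checker for PROGRESS.md rows of the form
// "| AC-001.1 | evidence ... | Pass |".
type AuditRow = { id: string; evidence: string; result: string };

function parseAuditRow(line: string): AuditRow | undefined {
  // "| a | b | c |" splits into ["", "a", "b", "c", ""].
  const cells = line.split("|").map((cell) => cell.trim());
  if (cells.length !== 5) return undefined;
  return { id: cells[1] ?? "", evidence: cells[2] ?? "", result: cells[3] ?? "" };
}

// Returns the IDs that cannot be considered proven.
function unprovenCriteria(expectedIds: string[], rows: AuditRow[]): string[] {
  return expectedIds.filter((id) => {
    const row = rows.find((candidate) => candidate.id === id);
    // Missing row, empty evidence, or any result other than Pass: not done.
    return !row || row.evidence === "" || row.result !== "Pass";
  });
}
```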

That’s the contract /goal operates under. Three files. One linkage. No completion without evidence.

.

.

.

Pasting the Command and Stepping Back

The handoff itself is small. Almost anticlimactic.

Open Codex inside the project, paste the tailored command from the skill output, and press enter.

Codex new session showing model gpt-5.5 high and the slash-goal command pasted into the composer: Complete goals/wp-login-for-ai/GOAL.md. Use VERIFY.md as the verification contract. Update PROGRESS.md continuously. Treat uncertainty as incomplete.

The command itself is short:

/goal Complete goals/wp-login-for-ai/GOAL.md. Use goals/wp-login-for-ai/VERIFY.md
as the verification contract. Update goals/wp-login-for-ai/PROGRESS.md
continuously. Treat uncertainty as incomplete.

Wondering why such a tiny command does so much? Because the heavy lifting already lives in the files the skill wrote. /goal just needs the contract and a few rules of engagement.

That last sentence — Treat uncertainty as incomplete — mirrors the exact wording in Codex’s own continuation prompt. Speaking the same language as the runtime is a small thing, but it helps Codex stop the right way when something blocks it.
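The skill prints this command for you, but the shape is regular enough to rebuild for any goal folder. A small sketch, assuming the folder holds the same GOAL.md, VERIFY.md, and PROGRESS.md trio; goal_command is a hypothetical helper, not part of Codex or the skill.

```python
def goal_command(goal_dir: str) -> str:
    """Build the /goal command for a folder holding GOAL.md, VERIFY.md, PROGRESS.md."""
    base = goal_dir.rstrip("/")
    return (
        f"/goal Complete {base}/GOAL.md. "
        f"Use {base}/VERIFY.md as the verification contract. "
        f"Update {base}/PROGRESS.md continuously. "
        "Treat uncertainty as incomplete."
    )


print(goal_command("goals/wp-login-for-ai"))
```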

Codex’s first turn tells you the autonomy is kicking in. Watch what it does: explores the repo, reads all three goal files, inspects the existing scaffold, then lays out a concrete plan. Implement the handler. Run PHP checks. Run browser checks with playwright-cli. Write the final PROGRESS audit before marking the goal complete.

Codex first goal turn showing it exploring the repo, reading the goal trio files, inspecting the plugin entry and composer.json, and producing an updated five-item plan with implement test verify and audit steps

That plan is the autonomy warming up.

There’s a moment when you paste the command and your finger hovers over enter — two seconds of “did I trust the spec enough?” — and then you commit.

From here forward, I stopped paying attention.

.

.

.

The 28-Minute Black Box

There are no screenshots between this section and the next.

Nothing happened on screen worth showing you.

Codex worked. I went to make coffee. Watched some YouTube while it brewed. Answered a few emails I’d been pretending didn’t exist.

The Codex session kept running through tool calls, file edits, test runs, and self-audits in the background. The session was busy. I was elsewhere.

The whole value of the codex goal command sits in this gap.

If you sit at the screen pressing approve every two minutes, you’re using Codex like a normal prompt — and missing the point entirely. The autonomy only pays off if you actually walk away. (This is harder than it sounds. The first time feels like leaving a toddler alone with a box of markers.)

So what makes the absence feel safe?

The spec, mostly. The model executes; the spec sets the boundaries.

Scope rules in GOAL.md keep Codex from refactoring random files. Stop conditions cover ambiguous architectural decisions. VERIFY.md defines proof. The continuation prompt refuses to declare victory without it.

Each layer is a guardrail, and together they let you trust a 28-minute run more than a five-minute supervised one.

(Yes, really.)

The trade-off is real, and worth naming out loud. You give up real-time control. You get back time. The honesty test is whether the spec was tight enough to let you trust the result when you come back.

Wrote the spec yourself? Your trust is calibrated by your confidence in your own writing. Used a benchmarked skill? Your trust is calibrated by how well that skill has been tested.

In my case both gates were green. So I left.

.

.

.

Reading the Receipts

When I came back twenty-eight minutes later, my first instinct wasn’t to celebrate.

It was to scroll through PROGRESS.md half-expecting Codex to have quietly lied.

It hadn’t.

The Codex TUI showed a clean completion message.

Codex showing the goal complete message after 28 minutes, listing the implemented autologwp flow, files added including verifier coverage and mounted eval artifacts, all required commands passing including npm install npx wp-env start composer install/test/lint and npm test:smoke, plus playwright browser checks B-001 through B-004 against localhost:8888

Three categories of evidence shipped together — implementation, tests, and browser proof:

Implementation. A full autologwp handler with environment gate, host gate, user lookup via WordPress APIs, cookie clearing, new auth, and a safe redirect.

Tests. PHPUnit tests inside the plugin, npm install, npx wp-env start, composer install/test/lint via wp-env, plus npm run test:smoke, npm run lint, and npm test, all passing.

Browser proof. Four playwright-cli runs labeled B-001 through B-004, with screenshots saved to goals/wp-login-for-ai/test-artifacts/ for admin login, editor switch, email login, and invalid input handling.

I asked Codex for a summary with ASCII diagrams. The answer came back as a clean specification traced through the request lifecycle.

Codex output explaining the autologwp WordPress dev login shortcut with an ASCII diagram showing request flow from /?autologwp=username-or-email through env gate local/development, host gate localhost or 127.0.0.1, get_user_by email/login lookup, clear old auth cookies, set current user with auth cookie and wp_login hook, then wp_safe_redirect to wp-admin

Read that flow slowly.

Environment gate before host gate. Host gate before user lookup. Lookup before cookie clear. Cookie clear before new auth. Hooks before redirect. Codex understood the architecture down to the order of those security gates — the kind of summary you’d write yourself after spending an hour with the source code.
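To make the ordering concrete, here is a Python illustration of the gate sequence. This is not the plugin's PHP; it only mirrors the order in the diagram, with the allowed environments and hosts taken from it. The function name first_failed_gate is made up for this sketch.

```python
from typing import Optional

ALLOWED_ENVS = {"local", "development"}
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def first_failed_gate(env: str, host: str, user_exists: bool) -> Optional[str]:
    """Return the name of the first gate that rejects the request, or None."""
    if env not in ALLOWED_ENVS:
        return "environment"  # checked first, before anything touches the request
    if host not in ALLOWED_HOSTS:
        return "host"
    if not user_exists:
        return "user lookup"
    return None  # all gates passed: clear cookies, set auth, fire hooks, redirect
```

The point of the order is that a production request is rejected before the code ever looks up a user.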

Then — unprompted — it produced the verification matrix.

Codex showing the changed files list with the plugin folder composer.json tests run.php and eval-artifacts, the goals folder PROGRESS.md and B-001 through B-004 screenshots, plus a verified table where every required VERIFY.md command and every targeted check P-001 P-002 B-001 B-002 B-003 B-004 reports PASS, with the final audit recorded in PROGRESS.md

Every required command from VERIFY.md marked PASS. Every targeted check — P-001 environment gate, P-002 invalid user, B-001 admin login, B-002 email login, B-003 editor switch, B-004 invalid JSON plus preserved session — marked PASS. Final audit recorded in PROGRESS.md.

That table is what makes me willing to trust the run.

Want to verify it yourself?

The artifacts are right there on disk. The screenshots live in goals/wp-login-for-ai/test-artifacts/. PROGRESS.md is checked into git. Anyone can re-run the commands and confirm the markings.

No theatre.

.

.

.

Trust, But Verify (The Old-Fashioned Way)

Codex saying it works and the plugin actually working are two different claims.

And the “green tests, red production” surprises stay with you long enough to make manual smoke tests a reflex. So I closed the terminal and tested the plugin like a regular human would.

The dev environment was already running at localhost:8888. Front-end, no logged-in session.

Browser at localhost:8888 in incognito mode showing the wp-login-for-ai dev site with the default WordPress Hello World blog post, no logged-in user visible

I typed the autologin URL into the address bar.

Browser address bar showing localhost:8888/?autologwp=admin being typed into a Chrome incognito window with a wp-login-for-ai tab

Hit enter. The redirect happened. The session switched. The wp-admin dashboard loaded with the admin user identity in the corner.

WordPress wp-admin dashboard at localhost:8888/wp-admin showing Howdy admin in the top right corner, the wp-login-for-ai site name, dashboard menu with Posts Media Pages Comments Appearance, and the Welcome to WordPress version 6.9.4 panel

I tried the email variant (?autologwp=wordpress@example.com) and the editor switch. Both worked. None of the edge cases I poked at suggested Codex had declared completion incorrectly.

👉 The bigger point: /goal doesn’t replace your QA.

It offloads the part of the build you don’t enjoy — writing the implementation — so you can focus on the part you should be doing anyway. Which is verifying the result.

(Turns out the most valuable developer skill in the age of AI agents is… being a good tester. Who saw that coming?)

.

.

.

When /goal Earns Its Keep (And When It Doesn’t)

So when does the codex goal command actually earn its keep?

Bounded objectives with clear acceptance criteria. That’s the sweet spot. The autologin plugin is a good example: one feature, defined inputs, defined outputs, a small set of scope boundaries, and a verification contract that fits on one screen.

Here’s where you should reach for it:

  • Bug fixes with reproducible failures and regression tests.
  • Refactors with a “behavior is preserved” success condition.
  • Single feature slices from a larger project — one user story at a time.
  • API integration work where the contract is well-specified upfront.

And here’s where you should hold back:

  • Vague objectives like “make the app better” or “refactor everything.” Codex can’t audit completion for those — they don’t have a finish line.
  • Multi-feature builds that should really be split into separate goals.
  • Anything where you can’t define “done” before you start.

(A “refactor this whole module” goal will hit budget_limited and stop, having shipped nothing you’d want. There’s no audit to run when there’s no definition of done to run it against.)

Here’s the framing that’s stuck with me: /goal works best as an inner loop. The project manager role still belongs to you. For larger work, split it into multiple goals — 001-data-model, 002-admin-ui, 003-rest-api — and run them one after the other. One coherent slice per goal.

Two practical caveats before you fire your first goal.

First: Plan Mode and /goal don’t mix. The runtime suppresses goal continuation while Codex is in Plan Mode, so if you trigger /goal from inside a plan you’ll sit there wondering why nothing’s happening. Plan first, leave Plan Mode, then start the goal.

Second: /goal still depends on the spec. Skip the wp-spec-to-goal step (or whatever the equivalent is for your stack), write a one-line objective, and you’ll get a one-line-objective result. Garbage in, garbage out — same rule as always. Ferpetesake, write the spec.

.

.

.

The Bigger Picture

Here’s what /goal actually represents — a shift toward evidence-based autonomy.

Codex doesn’t need a human in the loop because it has files in the loop. Those files define done, prove done, and capture what happened along the way.

Compare back to the 50-minute supervised Claude Code build I wrote about last year. Fifty minutes was impressive at the time. Looking at it now, most of those minutes were judgment calls — clicking approve, reading diffs, deciding whether the next step looked sane. The codex goal command moves that judgment upfront into the spec, so the same decision doesn’t get made forty times during execution.

The skill investment pays back too.

wp-spec-to-goal was real work to design and benchmark.

But after two or three uses, the math stops being subtle: five minutes turning a paragraph into a goal trio, twenty-eight minutes of nothing. Once you’ve done it, you can’t really go back to the supervised loop for bounded tasks. That’d be like going back to dial-up after you’ve tasted fiber.

The part of building software that AI is starting to get genuinely good at is executing well-specified plans without supervision.

👉 Your job is to get good at writing the plan.

If you want to try it, here’s a starting point.

Pick a small bounded task this week — a bug fix with a failing test, or a single feature slice from a project you’re already working on. Don’t reach for the big rewrite. Write a tight GOAL.md (even by hand from the templates in this post), pair it with a VERIFY.md, paste a /goal command, and walk away.

The 28 minutes only feel real when you’ve spent them yourself.

The plugin from this post is a public repo. You can clone it, inspect the actual GOAL.md, VERIFY.md, and PROGRESS.md files, look at the Playwright screenshots checked into goals/wp-login-for-ai/test-artifacts/, and grab the wp-spec-to-goal skill from .codex/skills/ if you want to use it on your own builds.

The repo is at github.com/nathanonn/wp-login-for-ai.

Go give Codex something specific to do, then leave the room.



13 min read The Art of Vibe Coding

Never Let Claude Code Auto-Compact Again

Auto-compact fires when the context is full — not when your task is at a clean boundary. Here’s how to stay in control with a status line, manual compact instructions, and a HANDOFF.md habit.

Here’s the moment that made me religious about manual compaction.

I was deep in a Claude Code session with one hard rule: no parallel sub-agents. One at a time. Always. I’d stated it clearly at the start of the session — burn one agent at a time, not five.

Auto-compact fired mid-session. And with it, that rule vanished. Gone from context.

I kept going. Claude kept going.

Then I glanced at the status line.

Ten sub-agents. Running simultaneously. My five-hour budget torched in about four minutes. The constraint wasn’t in CLAUDE.md, so the compact summary had nothing to reload it from. Just… gone.

That was my last auto-compact.

Claude Code auto compact functions like a seatbelt — there to catch you at the hard limit, but with no awareness of where you are in your task. It fires when the system decides the window is too full. No knowledge of whether you’re mid-hypothesis or mid-debugging loop. No idea whether the work has reached a clean handoff point.

The system protects itself.

You lose state.

Nate Herk raised a useful heuristic in his video How to Never Hit Your Claude Session Limit Again: the 1M context window is insurance. His argument is that resetting around 120k tokens — rather than filling the full window — keeps the model operating at full quality across a long session.

I’ve adopted a version of this as my working rule. The context window is a budget you actively manage. Stop treating it like a lap pool you’re trying to fill.

Never let the session reach the “dumb zone.”

That’s the upper range where compaction is imminent, signal-to-noise is poor, and the model is sorting through stale logs and abandoned attempts on every turn.

By then, you’ve already paid the tax.

.

.

.

What Actually Lives in the Context Window

Here’s what most people don’t realize about the context window.

Everything costs.

Before you type a single message, the session is already carrying: the root CLAUDE.md and any auto-memory blocks, MCP tool names and schema, skill descriptions, output style instructions, system prompts, and any path-scoped rules triggered on load.

That’s a meaningful slice of the context window before any work begins.

As the session runs, more piles in.

Every file read. Every command result. Every hook output. Every tool call and response. The full assistant turn history. You think “I only sent 20 messages” — but the session is carrying all of the above, in full, on every turn.

  • Stale exploration logs from an hour ago? In there.
  • Error output you resolved three steps back? In there.
  • Assistant messages full of planning that’s now completely moot? Also in there.

On every single turn, the model processes the entire window.

Every bit of it.

So yes — that 20-message conversation might be carrying the weight of forty.

/context gives you a live breakdown: how much is used, by which category, with optimization suggestions. Run it at least once per session to get a feel for where the weight is. (It’s the closest thing to a profiler the session gives you.)

The /context command output showing live token usage breakdown by category
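If you want to reason about the breakdown outside the session, a rough sketch of the same idea follows. The category names, token counts, and window size are hypothetical and chosen only to show the shape of the report; they are not /context's actual output format.

```python
def context_report(tokens_by_category: dict[str, int], window: int) -> list[str]:
    """Format categories largest-first with their share of the window."""
    lines = []
    for name, used in sorted(tokens_by_category.items(), key=lambda kv: -kv[1]):
        lines.append(f"{name}: {used} tokens ({used / window:.1%})")
    total = sum(tokens_by_category.values())
    lines.append(f"total: {total} tokens ({total / window:.1%})")
    return lines


# Hypothetical numbers, for illustration only.
sample = {"messages": 60_000, "tool results": 30_000, "memory files": 10_000}
for line in context_report(sample, window=200_000):
    print(line)
```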

.

.

.

Why Auto-Compact Is Lossy by Design

Here’s the thing: compaction isn’t broken.

The tradeoff is real and explicit.

Compaction takes a long-running session and converts it into a structured summary, then continues from that summary. The official docs are clear about it: requests and key code snippets are preserved; detailed instructions from earlier in the conversation may be lost.

The problem with Claude Code auto compact is timing.

When the system fires it, the session has no clean boundary. The compaction summarizes whatever is in the window at the moment of overflow — including partial plans, mid-hypothesis reasoning, and error threads still in flight.

That’s where my rule vanished.

That’s where your constraints vanish, too.

Stay with me, because understanding what survives is what makes the mechanism workable.

Auto-Compact Is a Lossy Filter

After compaction, these reload reliably: the root CLAUDE.md and auto-memory blocks. They come back because they’re read from disk — the filesystem is their source, not the summary.

These do not survive automatically:

  • Path-scoped rules and nested CLAUDE.md files. They existed in the session because matching files were read. After compaction, they’re gone — until those files are read again. If your project has a src/api/CLAUDE.md with API-specific rules, those rules are out of context post-compaction until Claude re-reads that file.
  • Invoked skill bodies are a middle case. They may reload with token caps applied — the full skill text might come back, or a compressed version, depending on what the token budget allows.

(The Decode Claude team has a thorough breakdown of how compaction actually works under the hood — worth reading if you want the full mechanism.)

The practical caution: compact when the work has natural shape.

  • After a feature lands and tests pass.
  • After a root cause is identified but before the fix starts.
  • Before switching from implementation to review.

Never compact mid-plan without first writing down the state you need preserved.

The mechanism works when you give it a retention policy. When auto-compact fires without one, you get whatever the summarizer decided mattered. /compact [instructions] gives you that control.

Use it.

.

.

.

Install a Context Meter in Your Status Line

The goal: always-visible context usage in the terminal status line.

Custom Claude Code status line showing model name, effort level, context percentage, and rate limit usage

Without it, you find out the window is at 78% when you run /context — which means you checked too late.

Here’s the script. Save it as ~/.claude/statusline.sh:

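#!/bin/bash
# Claude Code passes session data to this script as JSON on stdin. Requires jq.

input=$(cat)

# Pull the display fields, falling back to a default when a key is missing.
MODEL=$(echo "$input" | jq -r '.model.display_name // "Claude"')
EFFORT=$(echo "$input" | jq -r '.effort.level // "n/a"')
# Keep only the integer part of the context percentage.
PCT=$(echo "$input" | jq -r '.context_window.used_percentage // 0' | cut -d. -f1)

# Rate-limit fields may be absent; "// empty" yields an empty string then.
FIVE_H=$(echo "$input" | jq -r '.rate_limits.five_hour.used_percentage // empty')
WEEK=$(echo "$input" | jq -r '.rate_limits.seven_day.used_percentage // empty')

# Append each rate limit only when it is present.
LIMITS=""
[ -n "$FIVE_H" ] && LIMITS=" | 5h:$(printf '%.0f' "$FIVE_H")%"
[ -n "$WEEK" ] && LIMITS="$LIMITS | 7d:$(printf '%.0f' "$WEEK")%"

echo "[$MODEL] effort:$EFFORT | ctx:${PCT}%$LIMITS"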

Make it executable:

chmod +x ~/.claude/statusline.sh

Add this block to ~/.claude/settings.json:

{
  "statusLine": {
    "type": "command",
    "command": "~/.claude/statusline.sh",
    "padding": 2
  }
}

Here’s why that matters.

ctx:47% is the core signal.

Once it’s in your status line, context management becomes part of your session loop — you glance at it the same way you glance at a battery indicator. You stop waiting for it to reach critical before acting.

5h:71% prevents starting a heavy refactor when rate limits are already burning.

If you’re at 71% of your five-hour budget, a full session of parallel tool calls might hit the ceiling before the task finishes.

Better to know before you start.
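If you want to test the formatting without a live session, here is a hedged Python port of the same logic. The field names come straight from the script above; the sample payload values in the usage lines are invented, and a real payload may carry more fields than this sketch reads.

```python
import json


def status_line(payload: str) -> str:
    """Render the same status text as statusline.sh from a JSON payload."""
    data = json.loads(payload)
    model = (data.get("model") or {}).get("display_name") or "Claude"
    effort = (data.get("effort") or {}).get("level") or "n/a"
    # int() truncates, matching the script's `cut -d. -f1`.
    pct = int((data.get("context_window") or {}).get("used_percentage") or 0)
    limits = data.get("rate_limits") or {}
    text = f"[{model}] effort:{effort} | ctx:{pct}%"
    for label, key in (("5h", "five_hour"), ("7d", "seven_day")):
        used = (limits.get(key) or {}).get("used_percentage")
        if used is not None:  # skip limits the payload does not report
            text += f" | {label}:{used:.0f}%"
    return text


# Invented sample payload, for illustration only.
sample_payload = json.dumps({
    "model": {"display_name": "Opus"},
    "effort": {"level": "high"},
    "context_window": {"used_percentage": 47.6},
    "rate_limits": {"five_hour": {"used_percentage": 71.2}},
})
print(status_line(sample_payload))
```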

.

.

.

The Operating Rule — Compact at Boundaries, Not at Panic

These are the zones I use as working heuristics.

Hard-won across real sessions — not universal thresholds from published research. If they feel arbitrary, calibrate them against your own work.

But start here:

  • Green (0–30%): Keep working. Avoid dumping unrelated files or running broad research in the main session — keep the window clean while you have room. The session is young. Let it breathe.
  • Yellow (30–50%): Start watching for task boundaries. Note where the natural stopping points are in your current work. You have runway. Use it intentionally.
  • Orange (50–60%): Finish the current micro-task, then compact or hand off. Do not start a new major branch. The window is narrowing faster than it feels.
  • Red (above 60%): The threshold I use before a major context reset. Do not start a new feature, refactor, or research thread without resetting context first. This is the zone where sessions start producing work you’ll have to redo.
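The zones above, written as a function. The listed ranges share their endpoints, so this sketch makes one assumption: a boundary value belongs to the higher zone (30 counts as yellow, 50 as orange), and red starts strictly above 60. context_zone is a hypothetical name.

```python
def context_zone(used_pct: float) -> str:
    """Map context usage (0-100) to the working zones described above."""
    if used_pct < 30:
        return "green"   # keep working
    if used_pct < 50:
        return "yellow"  # start watching for task boundaries
    if used_pct <= 60:
        return "orange"  # finish the micro-task, then compact or hand off
    return "red"         # reset before starting anything new
```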

1M Opus exception: If token budget matters, treat 15–20% as the practical reset band. Twenty percent of 1M tokens is already a large session with substantial context weight. The math changes when the window is enormous.

One important pushback worth stating clearly: do not interrupt a working implementation mid-flight just because the number crossed a threshold. If you’re in the middle of a function, a migration, or a debugging loop that’s producing real signal — finish the micro-task first.

Compact at a boundary.

Never mid-sentence.

The threshold is a trigger to watch for the next natural stopping point. The task shapes where compaction makes sense:

Choose the Reset at a Clean Boundary

.

.

.

Write /compact Like a Handoff Prompt

The biggest leverage point in this workflow is how you write the compact instruction.

Bad:

/compact summarize everything so far

That hands the retention decision back to the model. You get whatever the summarizer determined was important.

Ferpetesake.

Better:

/compact Preserve only what a fresh coding agent needs to continue safely: current goal, files changed, decisions, errors, tests, pending tasks, and exact next step. Drop stale exploration and repeated logs.

Even that is improvable.

The reusable KEEP/SUMMARIZE/DROP template:

/compact
KEEP:
- Current goal and acceptance criteria
- Exact files changed and why
- Important code decisions and rejected alternatives
- Open bugs, failing tests, console errors, and commands already tried
- The last 5 user/assistant turns in detail

SUMMARIZE:
- Earlier exploration
- Completed debugging paths
- General discussion

DROP:
- Repeated test output
- Long logs that no longer matter
- Dead-end ideas already ruled out

This format works because it makes compaction a retention policy.

You’re telling the model exactly what to keep verbatim, what to compress, and what to discard. The output is shaped by your instructions — not the summarizer’s defaults.
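If you keep your retention lists somewhere reusable, the template is easy to assemble on demand. A minimal sketch, assuming you paste the result into the session yourself; compact_instruction is a hypothetical helper, not a Claude Code feature.

```python
def compact_instruction(keep: list[str], summarize: list[str], drop: list[str]) -> str:
    """Assemble a /compact prompt with KEEP, SUMMARIZE, and DROP sections."""
    parts = ["/compact"]
    for title, items in (("KEEP", keep), ("SUMMARIZE", summarize), ("DROP", drop)):
        if not items:
            continue  # omit empty sections rather than emitting a bare heading
        parts.append(f"{title}:")
        parts.extend(f"- {item}" for item in items)
        parts.append("")  # blank line between sections
    return "\n".join(parts).rstrip()


print(compact_instruction(
    keep=["Current goal and acceptance criteria", "Exact files changed and why"],
    summarize=["Earlier exploration"],
    drop=["Repeated test output"],
))
```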

Three task-specific variants ready to copy:

Feature implementation:

/compact Preserve feature implementation state: goal, acceptance criteria, files changed, functions/components touched, business rules, test results, unresolved bugs, and exact next step. Summarize old exploration and drop repeated logs.

Debugging:

/compact Preserve debugging state: original bug, reproduction steps, exact error messages, hypotheses tested, files inspected, fixes attempted, current most likely root cause, and next verification command. Drop dead-end logs unless they explain a rejected approach.

Refactor:

/compact Preserve refactor state: target architecture, files already refactored, files not yet touched, compatibility constraints, naming conventions, migration risks, test status, and next file to inspect.

.

.

.

Add Compact Instructions to CLAUDE.md

One-off compact prompts work.

But if you’re running long sessions regularly, encoding the retention policy in CLAUDE.md means you stop writing it from scratch every time. The official docs support this directly: add a “Compact Instructions” section to CLAUDE.md, and Claude Code uses it during compaction.

Here’s the copy-paste version:

## Compact Instructions

When compacting, preserve working state for continuation, not chat history.

Always keep:

- Current goal and acceptance criteria
- Exact files changed, created, deleted, or inspected and why
- Important hooks, functions, classes, routes, settings, commands, and config keys
- Business rules and architectural decisions
- Rejected approaches and why they were rejected
- Errors, failed tests, commands run, and fixes attempted
- Pending tasks and the exact next step

Summarize:

- Completed exploration
- Older discussion
- Repeated command output

Drop:

- Verbose logs unless they contain unresolved errors
- Duplicate explanations
- Abandoned ideas that are no longer relevant

After compaction, re-read PLAN.md or HANDOFF.md if present before continuing.

One caution: if CLAUDE.md grows too large, it becomes its own context tax — loaded on every session, occupying window space before any work begins. The Compact Instructions block is worth including if it replaces ad-hoc reminders you’d otherwise type in long sessions.

Keep the file disciplined.
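One way to keep an eye on that discipline is to measure how much of the file the block takes up. A rough sketch, assuming the section is introduced by a level-2 heading exactly as in the block above; compact_section_lines is a hypothetical helper.

```python
def compact_section_lines(claude_md: str) -> int:
    """Count non-blank lines in the '## Compact Instructions' section; 0 if absent."""
    count = 0
    inside = False
    for line in claude_md.splitlines():
        if line.startswith("## "):
            # A level-2 heading either opens our section or closes it.
            inside = line.strip() == "## Compact Instructions"
            continue
        if inside and line.strip():
            count += 1
    return count
```

A result of 0 means the section is missing; a large number is a hint that the block has started to become its own context tax.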

If you haven’t read The Single File That Makes or Breaks Your Claude Code Workflow, that’s the foundation. The Compact Instructions block sits inside it.

.

.

.

Write a Handoff File Before Compaction or Clear

The compact instruction controls the summary.

The handoff file makes that summary durable — written to disk so it survives the reset.

A Hacker News commenter put it well: you get better results by asking Claude to write the important parts into a Markdown file, reviewing it, clearing context, and continuing from that file. The session state becomes explicit and inspectable. You can read it. You can verify it. You can hand it to a fresh session and have it pick up exactly where you left off.

That’s worth framing as a habit.

The moment before compaction is the moment to make your state visible.

The prompt to generate HANDOFF.md:

Create HANDOFF.md. Include:
- Goal
- Current branch/state
- Files changed
- Decisions made
- Commands run
- What failed
- What remains
- Exact next step

Make it complete enough that a fresh Claude Code session can continue without reading this chat.

Then compact with a focused instruction:

/compact Focus on HANDOFF.md, current git diff, unresolved errors, and next step. Drop old exploration.

Or for a truly fresh start:

/clear
Read HANDOFF.md, then continue from the exact next step.

The distinction between the three reset mechanisms:

  • /compact — same session continues. Old context is summarized, not erased. The compressed history is still present.
  • /clear + HANDOFF.md — clean slate. The continuation document is the only thread back to prior work.
  • /rewind — last branch was wrong. Use this to remove polluted attempts rather than summarizing them into the session memory.

HANDOFF.md works with all three.

Write it before the reset. Verify it’s complete. Then pick the right reset for the situation.
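"Verify it's complete" can be partly automated. This is a crude sketch that checks whether each item from the HANDOFF.md prompt is mentioned at all. It uses a case-insensitive substring match, so it can miss a thin section or be fooled by a stray word; treat it as a first pass before you read the file yourself. missing_sections is a hypothetical name.

```python
REQUIRED_SECTIONS = [
    "Goal", "Current branch/state", "Files changed", "Decisions made",
    "Commands run", "What failed", "What remains", "Exact next step",
]


def missing_sections(handoff_text: str) -> list[str]:
    """Return required section names that never appear in the handoff text."""
    lowered = handoff_text.lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lowered]
```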

.

.

.

Rehydrate After Compaction

Compaction has a subtle failure mode: after the summary is generated, Claude may know a plan exists — but not have the plan’s actual content in context.

Let that land for a second.

The root CLAUDE.md and auto-memory reload — they’re read from disk. PLAN.md, HANDOFF.md, and path-scoped rules loaded from specific subdirectories are not automatically restored. The compacted summary might reference them. The content is not there.

The fix is explicit.

Tell Claude to re-read them.

After-compaction checklist:

After any compaction:
1. Re-read PLAN.md if it exists.
2. Re-read HANDOFF.md if it exists.
3. Re-check git diff --stat.
4. Confirm the current goal, unresolved risks, and exact next step.
5. Continue from the latest pending task, not from memory alone.

This checklist can live in CLAUDE.md under “Compact Instructions”; the last line of the block shown earlier already covers the first two steps. But it’s worth stating explicitly so the habit is clear: after compaction, re-read the durable files before continuing.

Do not assume the summary captured everything they contained.

The official docs confirm it: root CLAUDE.md reloads after compaction; path-scoped and nested instructions may need trigger files re-read. An explicit rehydrate checklist is more reliable than hoping the summary preserved those details.
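The checklist is short enough to generate from whatever durable files the project actually has. A minimal sketch, assuming you pass in the set of filenames present in the project root; rehydrate_steps is a hypothetical helper.

```python
def rehydrate_steps(existing_files: set[str]) -> list[str]:
    """List the post-compaction steps, skipping files that are not present."""
    steps = [
        f"Re-read {name}"
        for name in ("PLAN.md", "HANDOFF.md")
        if name in existing_files
    ]
    steps.append("Run git diff --stat")
    steps.append("Confirm current goal, unresolved risks, and exact next step")
    return steps


print(rehydrate_steps({"HANDOFF.md", "README.md"}))
```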

.

.

.

The Decision Table — Continue, Compact, Clear, Rewind, or Subagent

Here’s where it all comes together.

When the context meter signals it’s time to act, this is the call:

| Situation | Use | Why |
|---|---|---|
| Same task, clean milestone reached | /compact [instructions] | Preserve state, drop noise |
| New unrelated task | /clear | Avoid dragging old context into new work |
| Last branch was wrong | /rewind | Remove polluted attempts instead of summarizing them |
| Heavy research or file exploration | Subagent | Keep raw reading out of main context |
| Long implementation needs continuity | HANDOFF.md + /compact | Make continuation state durable |
| Session feels confused far below the limit | HANDOFF.md + /clear | Reset quality before wasting more turns |
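The same table works as a plain lookup. The short situation labels below are hypothetical keys I made up for this sketch; the actions are the ones from the table, and anything unlisted falls through to "continue".

```python
RESET_ACTIONS = {
    "clean milestone": "/compact [instructions]",
    "new unrelated task": "/clear",
    "wrong branch": "/rewind",
    "heavy research": "subagent",
    "long implementation": "HANDOFF.md + /compact",
    "confused session": "HANDOFF.md + /clear",
}


def reset_action(situation: str) -> str:
    """Look up the action for a situation; unknown situations mean keep working."""
    return RESET_ACTIONS.get(situation, "continue")
```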

The new habit is six steps:

  • Watch ctx%.
  • Finish the current micro-task.
  • Write down state if needed.
  • Compact with a retention policy.
  • Re-read the durable plan.
  • Continue.

Never letting Claude Code auto compact means staying in charge of what survives.

The best sessions — the ones that actually ship — are the ones where context stays intentional. Build the status line. Write the compact instruction. Keep HANDOFF.md in the habit. Re-read the durable files every time.

That’s the playbook.

Own it.

If you want the full framework this sits inside, start with Context Engineering with Claude Code, Explained.

11 min read The Art of Vibe Coding

How to Run Firecrawl for Free in the Cloud (No API Key Needed)

Run the full Firecrawl stack on a free GitHub Codespaces 16 GB cloud machine — no API keys, 5-minute setup, and wired into Claude Code via a single tunnel command.

The Hardware Problem Nobody Warns You About

I have an M1 Pro MacBook Pro. Base model. 16 GB of RAM.

I figured that was plenty.

So I read a tutorial about Firecrawl — the open-source tool that turns messy web pages into clean, LLM-ready markdown — and ran docker compose up without a second thought.

Then I opened Activity Monitor.

14+ GB of RAM. 3 GB spilling into swap.

Memory pressure glowing yellow — the macOS equivalent of a check-engine light.

What I’d forgotten (ferpetesake) was that VS Code, Chrome, and a dev server were already running. Firecrawl’s Docker stack — five services simultaneously: the API server, a Playwright browser cluster, Redis, RabbitMQ, and PostgreSQL — landed on top of my normal development tools like a brick on a soufflé.

Here’s what most tutorials skip: they say “just install Docker” and assume you have unlimited RAM under your desk. Firecrawl can allocate up to 12 GB of RAM on its own. If your machine is already breathing hard from your regular workflow, adding Firecrawl is the thing that tips it over.

But — and stay with me here — GitHub Codespaces gives you a 16 GB RAM, 4-core cloud machine for free. The free tier includes 30 hours of runtime per month. For development, tutorials, and on-demand scraping sessions, that’s more than enough.

I’ve packaged the entire setup into a template repo you can fork: firecrawl-codespaces. Five minutes from zero to a working Firecrawl instance, connected to Claude Code on your local machine.

Let me show you exactly how.

.

.

.

What You’re Actually Getting (Honest Assessment First)

Before we touch a single command, let’s set expectations.

I’d rather you know the trade-offs now than feel surprised after investing 5 minutes.

| | GitHub Codespaces (Free) | Local Machine |
|---|---|---|
| RAM | 16 GB (4-core machine) | Whatever you have |
| CPU | 4 cores | Whatever you have |
| Cost | 30 hrs/month free | Free (hardware cost) |
| Setup time | ~5 minutes | ~10 minutes |
| Always-on? | No (auto-stops after inactivity) | Yes |
| API keys needed? | None | None |

Two honest limitations:

1. Not always-on. Codespaces auto-stops after 30 minutes of inactivity. There are workarounds (covered later), but if you need Firecrawl running 24/7 without any interaction — Codespaces is the wrong fit.

2. No anti-bot bypass. The self-hosted version of Firecrawl doesn’t include Fire-engine — the component that handles IP rotation and bot detection circumvention. For scraping documentation sites, GitHub repos, and public content (the 95% use case for Claude Code), you don’t need it. For scraping LinkedIn or heavily Cloudflare-protected sites, you do.

The verdict: Codespaces is perfect for development, learning, and on-demand scraping sessions. You spin it up when you need it, stop it when you don’t.

.

.

.

Prerequisites

Short list. Zero friction.

  • A GitHub account (free tier works; Pro gives 50% more hours)
  • The GitHub CLI (gh) installed on your local machine — install guide

That’s it.

No Docker Desktop. No Homebrew. No Node.

The entire Firecrawl stack runs inside the Codespace — the only thing your local machine needs is the gh CLI for the tunnel command.

.

.

.

The Setup — Step by Step

Step 1: Create the Codespace

Go to the firecrawl-codespaces repo on GitHub. Fork it (or use it directly).

Click the green <> Code button → select the Codespaces tab → click Create codespace on main.

Machine type matters. Select the 4-core (16 GB RAM) option. The 2-core machine only has 8 GB — Firecrawl will OOM (out of memory) on it.

GitHub "Create a new codespace" page showing the firecrawl-codespaces repo selected, main branch, "Firecrawl on Codespaces" dev container configuration, and 4-core machine type

The core-hour gotcha: Free hours are measured in core-hours, not wall-clock hours. A 4-core machine uses free hours 4x faster than a 2-core. The 120 core-hours/month free tier gives you 30 actual hours on a 4-core machine.
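The arithmetic behind that gotcha is a single division. Here is a minimal sketch in Python using the 120 core-hour figure quoted above; the function name is my own and not part of any GitHub tooling.

```python
def wall_clock_hours(core_hours: float, cores: int) -> float:
    """Convert a monthly core-hour allowance into real hours for one machine size."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return core_hours / cores


FREE_CORE_HOURS = 120  # GitHub Free allowance quoted in this post

for cores in (2, 4):
    hours = wall_clock_hours(FREE_CORE_HOURS, cores)
    print(f"{cores}-core machine: {hours:.0f} hours/month")
```

Swap in a different allowance (for example a paid plan's core-hours) to see how far it stretches on the same machine.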

Step 2: Wait for the Automated Setup

The moment the Codespace provisions, it runs setup.sh automatically.

This is configured in the repo’s devcontainer.json via postStartCommand — meaning it runs every time the Codespace starts or resumes, not just on initial creation.

VS Code terminal inside the Codespace showing "Finishing up... Running postStartCommand... > bash setup.sh"

Here’s what setup.sh does behind the scenes:

  1. Clones Firecrawl from the official repo
  2. Creates a minimal .env — port 3663, no authentication, no API keys
  3. Copies a docker-compose.override.yaml that uses pre-built Docker images instead of compiling from source (cuts first-run startup from 5-15 minutes down to ~90 seconds)
  4. Starts the Docker stack with docker compose up -d
  5. Waits for the health check to confirm Firecrawl is responding
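The real setup.sh is a shell script and I am not reproducing it here. As an illustration of step 5 only, this is roughly what a "wait until the API answers" loop looks like, sketched in Python. The function names, retry count, and delay are assumptions of mine, not values taken from the repo.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a half-started server: not healthy yet.
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 60,       # assumed retry budget
    delay: float = 5.0,       # assumed pause between probes, in seconds
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:3663")` inside the Codespace would block until the API responds or the retry budget is spent. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.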

First run takes ~2-5 minutes (image pull). After that, resuming a stopped Codespace takes ~30 seconds.

Once the setup completes, the Ports tab shows Firecrawl API on port 3663 with a green indicator:

VS Code Ports tab showing "Firecrawl API (3663)" with a green status dot and forwarded address

Expected warning: You’ll see WARN — You're bypassing authentication in the Docker logs. Completely normal. USE_DB_AUTHENTICATION=false is the correct setting for self-hosted Firecrawl. Safe to ignore.

Step 3: Verify the Stack

Run docker ps inside the Codespace terminal. You should see all five containers running:

Terminal showing docker ps output with five containers running: firecrawl-api-1 (port 3663), rabbitmq, redis, postgres, and playwright-service — all showing "Up 4 minutes"

Five containers. All healthy. Firecrawl is running inside your Codespace.
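If you would rather script that check than eyeball it, here is a small Python sketch. It assumes you feed it the names printed by `docker ps --format "{{.Names}}"`, and the service keywords are my reading of the container names in the screenshot, so adjust them if yours differ.

```python
# Keywords expected to appear in the five container names (an assumption
# based on the screenshot: api, playwright-service, redis, rabbitmq, postgres).
EXPECTED_SERVICES = ("api", "playwright", "redis", "rabbitmq", "postgres")


def missing_services(container_names: list[str], expected=EXPECTED_SERVICES) -> list[str]:
    """Return the expected services that have no matching running container."""
    return [svc for svc in expected if not any(svc in name for name in container_names)]


# Made-up sample output, one container name per line, with postgres absent.
sample = """firecrawl-api-1
firecrawl-playwright-service-1
firecrawl-redis-1
firecrawl-rabbitmq-1
"""
print(missing_services(sample.split()))  # ['postgres']
```

An empty list means all five services are up.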

Now you need to get it to your local machine.

Step 4: Connect From Your Local Machine

I’ll spare you the detour I took.

I spent an embarrassing amount of time messing with public port URLs and GitHub token authentication before discovering that gh codespace ports forward does everything in one command. Learn from my shenanigans.

Switch to your local machine’s terminal (not the Codespace). Run:

gh codespace list
MacBook terminal showing gh codespace list with one Codespace available on the main branch

Copy the Codespace name from the output, then forward port 3663:

gh codespace ports forward 3663:3663 -c <your-codespace-name>
MacBook terminal showing gh codespace ports forward command with output "Forwarding ports: remote 3663 <=> local 3663"

One command.

Firecrawl is now at http://localhost:3663 on your machine — exactly as if it were running locally. No public exposure. No authentication tokens. And here’s the bonus: the tunnel keeps the Codespace alive as long as it’s running. More on that later.

Verify by opening http://localhost:3663 in your browser:

Browser showing localhost:3663 with JSON response: {"message":"Firecrawl API","documentation_url":"https://docs.firecrawl.dev"}

Firecrawl API. Running. Accessible. Free.

Other connection methods: The tunnel is the recommended approach. Two alternatives exist — a public port URL and a private port with GitHub token auth — but both reset on every Codespace restart. The tunnel is simplest and has the bonus keep-alive benefit. See the repo README for details on the alternatives.

.

.

.

Wire It Into Claude Code

Firecrawl is running.

Now let’s make Claude Code actually use it. This is the Firecrawl setup for Claude Code that turns your coding assistant into a web-aware research agent, and it takes three steps.

Install the Firecrawl CLI

On your local machine:

npm install -g firecrawl-cli

Install Firecrawl Skills

firecrawl setup skills --agent claude-code

This clones the official Firecrawl CLI repo and installs its 8 markdown skill files into Claude Code’s skills directory. Seven of them each teach Claude Code a different Firecrawl capability (search, scrape, crawl, map, interact, download, and agent-powered extraction); the eighth is the top-level firecrawl skill.

Firecrawl CLI skills installer showing ASCII art "SKILLS" header, repository cloned from github.com/firecrawl/cli.git, "Found 8 skills" with a selection list including firecrawl, firecrawl-agent, firecrawl-crawl, firecrawl-download, firecrawl-interact, firecrawl-map, firecrawl-scrape, and firecrawl-search

After installation, type /firecrawl in Claude Code. You should see all available Firecrawl slash commands:

Claude Code prompt showing /firecrawl typed with autocomplete dropdown listing all Firecrawl slash commands and their descriptions

Add Firecrawl Instructions to CLAUDE.md

This is the critical step.

Without this, Claude Code won’t know to prefer Firecrawl over its built-in (and more limited) web tools.

Add this block to your project’s CLAUDE.md:

## Firecrawl

- **Always use Firecrawl skills** (firecrawl, firecrawl-scrape, firecrawl-search, etc.) for web searches and scraping. Avoid the built-in WebFetch/WebSearch tools.
- We are using the localhost version of Firecrawl. Use `firecrawl` command to interact with the service.
- **Always prefix `firecrawl` CLI commands with `FIRECRAWL_API_URL=http://localhost:3663`** so the CLI targets the localhost service instead of prompting for cloud authentication. Example: `FIRECRAWL_API_URL=http://localhost:3663 firecrawl scrape "<url>" -o .firecrawl/page.md`.
- **NEVER run `firecrawl --status`** — it checks cloud API auth and always shows "Not authenticated" for localhost. Instead, check if Firecrawl is running with: `curl -s http://localhost:3663 > /dev/null 2>&1` (requires `dangerouslyDisableSandbox: true`).
- All Firecrawl-related commands (including server health checks) must run with `dangerouslyDisableSandbox: true`.
- **Sub-agents**: When spawning agents that may need web access, include these Firecrawl rules in the agent prompt so they use Firecrawl instead of built-in web tools.

Why each line matters:

  • The FIRECRAWL_API_URL prefix is essential. Without it, the Firecrawl CLI defaults to cloud authentication and prompts for an API key you don’t have. The environment variable tells it “talk to localhost instead.”
  • The --status trap — and I say this from personal experience — will burn you. I ran firecrawl --status and it said “Not authenticated.” I spent 20 minutes trying to generate an API key I didn’t need. My self-hosted instance was running perfectly the entire time. The command only checks cloud auth. It has no localhost awareness. Use the curl health check instead.
  • The dangerouslyDisableSandbox note is necessary because Claude Code’s sandbox blocks localhost network calls by default. Firecrawl commands need to reach port 3663.
  • The sub-agent rule prevents a common gotcha: you spawn a research sub-agent, and it uses built-in WebFetch instead of Firecrawl because it didn’t inherit the instructions.

.

.

.

See It In Action

Theory is nice. Let’s see it work.

I asked Claude Code to scrape a FluentCart REST API documentation page — the kind of task you’d do when building an integration and need to understand an endpoint’s parameters before writing any code.

Claude invoked the /firecrawl-scrape skill.

It first checked whether Firecrawl was running at localhost:3663, confirmed the health check passed, then ran the scrape with the FIRECRAWL_API_URL prefix:

Claude Code terminal showing the /firecrawl-scrape skill in action — Claude checks if Firecrawl is running at localhost:3663, then scrapes the FluentCart REST API docs page with the FIRECRAWL_API_URL prefix

The result?

Clean, structured markdown. Endpoint names, URL patterns, parameter tables with types and descriptions, and complete curl examples — all formatted and ready for Claude to work with:

Claude Code displaying scraped FluentCart API documentation showing "Bulk Insert Products" endpoint details, a parameter table with columns for Parameter, Type, Required, and Description, plus a formatted curl example with JSON body

Claude then saved the scraped content as a .md file in a .firecrawl/ folder for future reference:

VS Code showing the scraped content saved as fluentcart-products.md with clean markdown formatting — Products API documentation with headings, links, base URL, and structured endpoint listings

Compare that to what a raw HTTP fetch returns: the same page’s HTML would be 10x larger, stuffed with navigation menus, footers, tracking scripts, and CSS class names. Firecrawl strips all of that away and returns only the content that matters — clean markdown that fits neatly into Claude’s context window instead of bloating it.
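To make the size difference concrete, here is a toy illustration in Python. It is not how Firecrawl works internally (I am not claiming to know its pipeline); it only uses the standard library's HTML parser to drop navigation, footers, scripts, and styles from a made-up page and compares the character counts.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "nav", "footer", "header"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's main text, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page for illustration; the endpoint is hypothetical.
page = (
    "<html><head><style>.nav{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>"
    "<main><h1>Products API</h1><p>POST /products creates a product.</p></main>"
    "<footer>Copyright</footer><script>track()</script></body></html>"
)
text = extract_text(page)
print(text)
print(f"{len(page)} chars of HTML -> {len(text)} chars of text")
```

Even on a page this small, most of the characters are wrapper. Real documentation pages carry far more chrome than this sample does.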

.

.

.

The One Gotcha That Will Catch You: Idle Timeout

I learned this one the hard way.

I set up Firecrawl in a Codespace, walked away to make coffee, came back 40 minutes later — and everything was gone. The Codespace had stopped itself.

Here’s what happens:

  1. You start Firecrawl with docker compose up -d (detached mode)
  2. You close the Codespace browser tab
  3. Thirty minutes later, the Codespace auto-stops
  4. Firecrawl is gone

Why?

Codespaces measures inactivity as “lack of terminal input or output.” A detached Docker daemon running in the background produces no terminal output. From Codespaces’ perspective, nobody’s home.

Three fixes:

  • Fix 1: The tunnel keeps it alive (you’re already doing this). The gh codespace ports forward command counts as active interaction. As long as that tunnel is running on your local machine, the Codespace stays alive.
  • Fix 2: Stream logs. Inside the Codespace, run docker compose logs -f in the Firecrawl directory. Each log line resets the idle timer.
  • Fix 3: Extend the timeout. In GitHub Settings → Codespaces → Default idle timeout, set it to 240 minutes (the maximum).

And if it does stop? The postStartCommand in devcontainer.json auto-starts Firecrawl on every resume. Just re-run the tunnel command on your local machine and you’re back.

.

.

.

Free Tier Math and Alternatives

GitHub Free accounts get 120 core-hours/month.

| Machine | RAM | Free wall-clock hours |
|---|---|---|
| 2-core | 8 GB | 60 hrs (not enough CPU & RAM for Firecrawl) |
| 4-core | 16 GB | 30 hrs |
| 8-core | 32 GB | 15 hrs |

What 30 hours gets you: roughly 4 full work days of active Firecrawl sessions. Enough for a serious project sprint, a full tutorial walkthrough, or hundreds of documentation page scrapes.

The storage caveat: Storage is billed even while the Codespace is stopped — $0.07/GB/month. With the Firecrawl repo and Docker layers, expect ~5-10 GB total. That’s ~$0.35-$0.70/month.

Pro tip: GitHub Pro ($4/month) bumps you to 180 core-hours — 45 hours on a 4-core machine. And add a $5 spending cap in GitHub Settings → Billing → Budgets to prevent surprise charges if you forget to stop the Codespace.

Where Codespaces Fits Among Your Options

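| Option | Setup | Cost | Always-on | Anti-bot |
|---|---|---|---|---|
| Local machine | ~10 min | Free | Yes | No |
| Codespaces (this repo) | ~5 min | Free (30 hrs/mo) | No | No |
| Railway | ~2 min | $5+/mo | Yes | No |
| Firecrawl Cloud | 0 min | $16+/mo | Yes | Yes |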

Local is best if your machine can handle it — no time limits, always-on. Codespaces is best for tutorials, learning, and on-demand sessions (what you just set up). Railway has an official Firecrawl deploy template — one click and $5/month for always-on hosting. Firecrawl Cloud is pay-per-use with the anti-bot bypass engine included.

.

.

.

The Bigger Picture

That yellow memory pressure warning on my MacBook Pro? Gone from the equation entirely.

The complete Firecrawl setup for Claude Code now runs on a 16 GB cloud machine that costs nothing, while my laptop handles what laptops should handle: VS Code, Chrome, and the dev server. No RAM fights. No swap memory. No jet engine fans.

But the real takeaway goes beyond Firecrawl.

The pattern is this: instead of installing powerful tools on every machine you own, run them once in a cloud environment and tunnel to them from wherever you are. GitHub built the tunneling right into the CLI. The free tier covers 30 hours a month. And the template repo makes setup a 5-minute operation.

Your AI coding assistant now has live web access. Documentation pages, API references, technical articles — anything Claude Code needs to read before writing code, Firecrawl can fetch.

Ready to set it up?

  1. Fork the repo: firecrawl-codespaces
  2. Create a 4-core Codespace
  3. Wait for the automated setup (~3 minutes)
  4. Run gh codespace ports forward 3663:3663 on your local machine
  5. Install the CLI and skills: npm install -g firecrawl-cli && firecrawl setup skills --agent claude-code
  6. Add the Firecrawl block to your CLAUDE.md

Six steps. Five minutes. Zero API keys.

Go build something with it.

12 min read The Art of Vibe Coding

The “Real” Context Engineering with Claude Code, Explained

I’ve written 40+ posts about Claude Code.

Sub-agents. CLAUDE.md files. Skills. Workflow engineering. Testing loops. Spec-driven development. Memory. Self-evolving rules.

I was outlining a post last week — when I stopped mid-sentence and stared at my screen. I had the outline open on one side, my published posts list on the other. And for the first time, I saw it.

Every single post was about the same thing.

Not “AI coding tips.” Not “Claude Code tricks.” Something deeper — a discipline I’d been teaching without realizing I was teaching it. I’d been circling the same idea for almost a year, approaching it from forty-four different angles, and I just didn’t have a name for it.

(That’s the annoying thing about patterns. They’re invisible until suddenly they’re not.)

.

.

.

The Name Drop

The name is context engineering.

Tobi Lütke (Shopify CEO) tweeted in June 2025 that he preferred “context engineering” over “prompt engineering.” Karpathy co-signed it. Anthropic published an official guide. The term stuck.

But here’s what nobody’s saying: if you’ve been following this newsletter, you’ve been a context engineer. You just didn’t know it yet.

Let me show you what I mean.

.

.

.

What Happens Without Context Engineering

Let me tell you about an afternoon that changed how I think about context windows.

I was adding a chat interface to a Next.js app using Vercel’s AI Elements library. Simple task — wire up useChat with <Conversation> and <Prompt>. Maybe thirty minutes of work.

So I did what felt responsible: I dumped the entire AI Elements documentation into Claude’s context. Every hook, every provider, every component. Thorough. Comprehensive. Professional.

And then Claude started… hedging.

Vague suggestions instead of concrete code. Recommendations that contradicted themselves across responses. Instructions I’d given three messages ago — forgotten entirely. I watched Claude’s quality degrade in real time, like a student cramming so hard for an exam they forgot how to spell their own name.

That’s context rot — when irrelevant information degrades the AI’s ability to focus on what matters.

I closed the session. Started fresh. This time I gave Claude only the docs for the two components I actually needed. It nailed the implementation on the first try.

Less context. Better results.

(I know. Counterintuitive.)

And here’s the part that really bakes your noodle: bigger context windows don’t make AI smarter. Past about 50% fill, performance actually degrades.

A senior engineer working on an 80k-line codebase posted on Reddit calling the 1M context window “a noob trap”.

They aggressively keep under 250k. And before you even type a word, 45,000 tokens are already loaded (system prompt, tool schemas, agent descriptions, memory files, MCP schemas). On the standard 200k window, that’s roughly 22% gone at session start.
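The overhead is easy to check for your own setup. Here is a quick Python sketch using the two figures above (45,000 preloaded tokens, 200k window); the function name is mine.

```python
def context_overhead(preloaded_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window consumed before the first user message."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return preloaded_tokens / window_tokens


PRELOADED = 45_000   # system prompt, tool schemas, memory files, etc.
WINDOW = 200_000     # standard context window

share = context_overhead(PRELOADED, WINDOW)
print(f"{share:.1%} used at session start, {WINDOW - PRELOADED:,} tokens left")
```

Plug in your own preloaded count to see how much room each added MCP server or memory file really costs.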

Context engineering is how you fight this.

.

.

.

What Context Engineering Actually Is

Here’s the definition I’ve landed on after (almost) a year of teaching these techniques:

Context engineering is the discipline of designing what information reaches your AI — the right knowledge, the right constraints, the right tools, at the right time — so it can actually do what you need.

The key distinction:

Your CLAUDE.md file isn’t a prompt. Your sub-agents aren’t just parallelism. Your skills aren’t just shortcuts. They’re all components of a context system that assembles the right information before the model ever sees a token.

Prompt engineering is choosing the right words. Context engineering is building the right world around the AI so it barely needs prompting at all.

.

.

.

The Context Engineering Stack

Here’s the framework I wish I had when I started.

Every Claude Code context engineering technique I’ve taught maps to one of six layers, each solving a different problem, each building on the one below it.

Let me walk you through each layer — bottom up.

.

.

.

Layer 1: Static Context (CLAUDE.md)

The “hello world” of context engineering.

A CLAUDE.md file loads automatically into every Claude Code session. It pre-loads project knowledge — your stack, conventions, patterns, gotchas — so every conversation starts with the essentials instead of from zero.

Without it, every session is amnesia.

Claude doesn’t know your project uses Tailwind, your team prefers functional components, or that your API has a weird auth flow. You spend the first five minutes of every conversation re-explaining things you explained yesterday.

(Sound familiar? Yeah.)

But — and stay with me here — there’s a paradox.

CLAUDE.md is incredible because it’s always loaded into context. And terrible for the exact same reason. Always-on context isn’t dynamic. Once your CLAUDE.md passes a few hundred lines, Claude starts ignoring nuances. The very file that’s supposed to help starts contributing to context rot.

The fix: keep CLAUDE.md lean — around 100 lines of essential universals. Load additional context dynamically with skills or custom commands. Prime, don’t hoard.
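One low-tech way to hold yourself to that budget is a tiny check you can run before committing. This is a sketch in Python, not an official tool. The 100-line default simply mirrors the rule of thumb above, and counting only non-blank lines is my own choice.

```python
def claude_md_report(text: str, max_lines: int = 100) -> str:
    """Count non-blank lines and flag a file that has outgrown the budget."""
    lines = [line for line in text.splitlines() if line.strip()]
    status = "ok" if len(lines) <= max_lines else "too long"
    return f"{len(lines)} non-blank lines ({status}, budget {max_lines})"


# Tiny inline example; in practice you would pass the contents of CLAUDE.md.
example = "## Stack\n\n- Next.js\n- Tailwind\n"
print(claude_md_report(example))
```

To run it on a real file, read the text first with `pathlib.Path("CLAUDE.md").read_text()` and pass it in.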

Deep dives: CLAUDE.md Guide → The Single File

.

.

.

Layer 2: Behavioral Context (Rules & Constraints)

Here’s a scenario that’ll make you wince.

Claude can’t get an API working, so it silently inserts a try/catch that returns sample data. Everything looks correct. All your tests pass. The UI renders beautifully. You demo it to your client on Thursday.

Three days later, you discover nothing was ever real.

(I’ll let that sink in for a moment.)

That’s what happens without behavioral context — instructions that shape HOW the AI behaves, not just what it knows. Knowledge without constraints is a liability.

The fix is a rule in your CLAUDE.md: “Never silently replace real functionality with mocked data. If something fails, fail loud.”

One sentence.

Prevents an entire category of mistakes.
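Here is what the two behaviors look like side by side, as a minimal Python sketch. The function names and the sample data are invented for illustration. The first version is the silent fallback described above, and the second is what "fail loud" means in practice.

```python
SAMPLE_ORDERS = [{"id": 1, "total": 9.99}]  # placeholder data, illustration only


def fetch_orders_silent(fetch):
    """Anti-pattern: swallow the failure and return fake data that looks real."""
    try:
        return fetch()
    except Exception:
        return SAMPLE_ORDERS


def fetch_orders_loud(fetch):
    """Fail loud: re-raise with context so the breakage is visible immediately."""
    try:
        return fetch()
    except Exception as exc:
        raise RuntimeError(
            "orders API call failed; refusing to fall back to sample data"
        ) from exc


def broken_api():
    """Stand-in for an API call that is currently failing."""
    raise ConnectionError("upstream returned 503")


print(fetch_orders_silent(broken_api))  # looks like real data, but is not
```

With the silent version, every caller downstream sees plausible orders and nothing in the UI or the tests hints at the outage. The loud version stops the demo before it starts.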

Context engineering goes beyond feeding information in. It constrains behavior through instructions. Think of CLAUDE.md as a behavior contract:

  • “Always write tests before implementation” (TDD constraint)
  • “Never modify files outside /src without asking” (scope constraint)
  • “Use TypeScript strict mode” (quality constraint)

Every rule you add is a piece of behavioral context. And unlike knowledge — which can get stale — good behavioral rules compound. They prevent the same mistake from happening across every future session.

Deep dives: Project Rules → Self-Evolving Rules

.

.

.

Layer 3: Context Persistence (Memory & Evolution)

Every Claude Code session starts with amnesia. The AI doesn’t remember what it learned yesterday — that brilliant debugging approach it discovered at 2 AM, the edge case it finally cracked after four attempts, the architectural decision you both agreed on.

Gone. Every time.

Your CLAUDE.md handles project-level knowledge, but what about session-to-session learnings? That’s what this layer solves:

  • Memory skills that log discoveries, decisions, and patterns
  • Self-evolving rules that update themselves based on what the AI encounters
  • Compaction that snapshots state when a context window fills up

The progression looks like this:

When Claude Code’s context window fills up, it automatically summarizes the conversation — preserving architectural decisions and unresolved bugs while discarding redundant output. That’s automated context engineering built into the tool itself.

But the real power — the thing that still kind of amazes me — is when your rules evolve on their own. A memory skill logs what the AI discovers. Self-evolving rules incorporate those learnings. The next session starts smarter than the last.

Your context system learns while you sleep.

Deep dives: Memory Skill → Self-Evolving Rules

.

.

.

Layer 4: Context Modules (Skills)

If CLAUDE.md is your operating system’s default settings, skills are apps you install for specific tasks.

A skill is a packaged, reusable context bundle.

When you invoke one, you inject a curated set of instructions, examples, and constraints into the model’s context.

When you’re done, you unload it. Clean.

This matters because the alternative is cramming everything into CLAUDE.md — bloating your static context with domain knowledge that’s only relevant 10% of the time. Skills let you modularize. Load the right context for the right task. Unload it when done.

(Think of it like this: you wouldn’t keep every cookbook you own open on your kitchen counter while making scrambled eggs. You’d grab the one recipe you need.)

Even the creator of Claude Code, Boris Cherny, warns: “Too many skills and agents inflate context massively — be selective per project.”

Skills enable both sides of the equation: they reduce what goes into your default context, and they inject domain expertise exactly when you need it.

Context engineering in miniature.

Deep dives: Skills Part 1 → Part 2 → Part 3

.

.

.

Layer 5: Context Delegation (Sub-Agents)

This is where context engineering gets spatial.

Instead of cramming everything into one context window, you split work across focused agents — each with its own tailored context. Each agent sees only what it needs. Nothing more.

Here’s the difference:

A focused agent with limited, relevant context outperforms a bloated one with everything.

Every time.

Read-only sub-agents are especially powerful — context scouts that gather information and report back without polluting the main agent’s context window.

The progression: sub-agents (partially forked context) → background agents (fully independent) → agent experts (single-purpose specialists with one tool, one job, one context window).

Deep dives: Sub-Agents → Read-Only Sub-Agent

.

.

.

Layer 6: Context Orchestration (Workflow Engineering)

This is the top of the stack — and it’s where everything comes together.

Context orchestration is designing how context flows through multi-step processes. Not “what context does the AI need?” but “what context does it need at each step, and how does each step’s output become the next step’s input?”

Every workflow step is a context handoff.

Research produces context for spec-writing. Specs produce context for implementation. Tests produce context for debugging. Each step refines raw information into the precise context the next step needs.

This is why process matters more than prompts.

A well-designed workflow ensures the right context reaches the right agent at the right time — automatically. You’re not just prompting anymore. You’re building a context pipeline.
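Stripped to its skeleton, a context pipeline is function composition: each step receives only what the previous step produced. The Python sketch below uses stand-in string functions where a real workflow would invoke an agent or a skill; every name in it is hypothetical.

```python
from typing import Callable

Step = Callable[[str], str]


def research(topic: str) -> str:
    """Stand-in for a research step that returns condensed notes."""
    return f"notes on {topic}"


def write_spec(notes: str) -> str:
    """Stand-in for a spec-writing step that sees only the notes."""
    return f"spec derived from [{notes}]"


def implement(spec: str) -> str:
    """Stand-in for an implementation step that sees only the spec."""
    return f"code implementing [{spec}]"


def run_pipeline(initial: str, steps: list[Step]) -> str:
    """Feed each step only the previous step's output, never the whole history."""
    context = initial
    for step in steps:
        context = step(context)
    return context


print(run_pipeline("rate limiting", [research, write_spec, implement]))
```

The point of the shape is what each step does not see: the implementation step gets the spec, not the raw research transcript.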

Deep dives: Workflow Engineering → In Action

.

.

.

The Bonus Layer: Runtime Context

Here’s one most people miss entirely.

Claude builds a perfect admin panel. All unit tests pass. You feel great about it. Ship it.

But when you open two browser tabs, log out in one, and try to delete a user in the other — it works. The session is still active in Tab 2. You just let an unauthenticated user delete accounts.

(That’s… not ideal.)

Why did this happen?

Without browser testing, Claude’s context looks like this:

With browser testing, Claude’s context expands:

Context engineering goes beyond text files and prompts.

Screenshots, console output, browser state — these are all forms of context that close the gap between “the code works” and “the product works.”

Most agent failures aren’t model failures.

They’re context failures.

The admin bug above wasn’t a coding mistake — the AI simply didn’t have the runtime context to know about cross-tab state.

Give it that context, and it catches the bug immediately.

Deep dives: Debugging Visibility → The Ralph Loop

.

.

.

The Decision Framework

When you hit a problem, which context engineering lever do you pull?

When I first started with Claude Code — way back in the early days — I treated it like a magic box.

Dump everything in, get magic out. Ask more detailed questions, get better answers.

It took me an embarrassingly long time to realize that’s backward.

The AI is more like a brilliant intern on their first day. They’ve read every textbook. They can code circles around most juniors. But they know absolutely nothing about your project, your codebase, your conventions — and they forget everything after each conversation.

Context engineering is deciding which sticky notes to put on their desk each morning.

Too few, and they’re lost. Too many, and they’re overwhelmed. Just right, and they look like a genius.

(The intern metaphor isn’t perfect — no metaphor is — but it’s the closest thing I’ve found to describing why some people get incredible results from AI coding tools while others keep complaining “it doesn’t work.”)

.

.

.

You’ve Been Doing This All Along

If you’ve been following this newsletter, you ARE a context engineer.

  • When you wrote your first CLAUDE.md — you were engineering static context.
  • When you added “never mock data silently” — you were engineering behavioral context.
  • When you set up memory skills — you were engineering persistent context.
  • When you created your first skill — you were engineering modular context.
  • When you delegated to sub-agents — you were engineering context isolation.
  • When you designed a research → spec → build workflow — you were engineering context pipelines.

Context engineering isn’t a new skill you need to learn.

It’s a name for the discipline you’ve been developing, one technique at a time, for almost a year.

Just like DevOps unified existing practices — CI/CD, infrastructure-as-code, monitoring — under one discipline, context engineering unifies everything we’ve been doing with AI coding tools. People were already doing it.

The name just made it official.

.

.

.

What Changes Now

Now that you have the framework, you can be deliberate about it.

Instead of reaching for techniques randomly, you diagnose which layer needs attention. Instead of asking “how do I write a better prompt?” you ask a better question:

BEFORE:  "How do I write a better prompt?"

AFTER:   "What does this agent need in its context to succeed?"

That’s the mindset shift. That’s context engineering with Claude Code in one sentence.

Pick one layer of the stack you haven’t explored yet:

You’re not a prompt engineer. You’re a context engineer.

Start acting like one.

9 min read The Art of Vibe Coding

Talk Like a Caveman, Save > 75% on Claude Code Usage (I Tested It)

Reddit post hit r/ClaudeAI on April 3rd and absolutely exploded.

The title: “Taught Claude to talk like a caveman to use 75% less tokens.”

10,000 upvotes. Hundreds of comments. Half the thread was laughing. The other half was already adding it to their projects.

Here’s what caveman mode in Claude Code looks like in practice:

Normal Claude: “Added validation with: blur validates when focus leaves, input re-validates as user types, submit validates all fields. Each field uses the existing .error / .valid CSS hooks already in the file, so no style changes were needed.”

Caveman Claude: “Done.”

Same task. Same code quality. Wildly different token bills.

And here’s the thing — I’d been scrolling past Claude’s explanations for weeks without realizing it. Helpful bullets explaining what the code does. Notes about CSS hooks. Context I already understood. The code was always fine. Everything around it was for an audience of nobody.

Credit where it’s due: Reddit user flatty kicked this off, and Drona Gangarapu (3.3k stars on GitHub) took the concept and productized it into a polished, drop-in CLAUDE.md file with actual benchmarks.

I wanted to test it myself. On a real coding task. With real results.

The original Reddit post by flatty about caveman mode with 10K upvotes on r/ClaudeAI

.

.

.

What Is Caveman Mode?

Caveman mode is a prompt instruction that tells Claude to strip all filler from its output.

No articles (“the”, “a”). No pleasantries (“Great question!”). No restating your problem back to you. No unsolicited explanations. No “Let me know if you’d like me to adjust anything!” sign-offs.

Just the answer.

Here’s the instruction I used:

Respond like a caveman. No articles, no filler words, no pleasantries.
Short. Direct. Grunt-level brevity. Code speaks for itself.
If me ask for code, give code. No explain unless me ask.

Why does this work?

Claude’s output tokens are the expensive part of any API call: at Anthropic’s list prices, output tokens cost roughly 5x what input tokens cost on current Claude models. Every “Let me walk you through this…” and “That’s a great approach!” is burning tokens on words that carry zero information for the developer reading them.

I learned this the hard way. I hit my usage limit on a Tuesday afternoon — right in the middle of a productive streak. When I looked at what had actually consumed those tokens, a depressing amount was Claude being polite. Greetings I never read. Summaries of things I’d just asked. Sign-offs I scrolled past. The code itself was maybe 40% of the output.

You already know what you asked. You don’t need Claude to repeat it back to you. You don’t need a greeting. You don’t need encouragement.

You need the code.

Caveman mode eliminates the social performance.

.

.

.

The Test: Normal vs. Caveman on a Real Coding Task

Time to put caveman mode through a real task.

I have a styled contact form — four fields (name, email, subject, message), a submit button, and clean UI. No JavaScript yet. The form looks great but does absolutely nothing when you hit “Send Message.”

The styled contact form before any validation — four fields, a submit button, and nothing else

Here’s the exact prompt I gave Claude, identical in both runs:

Add JavaScript input validation to this contact form. Validate name
(required, 2+ chars), email (required, valid format), subject (required),
and message (required, 10+ chars). Show inline error messages. Validate
on blur and on submit.

Normal Mode Response

Claude Code terminal showing normal mode response — code plus bulleted explanation of validation behavior and CSS hooks note

Both modes wrote the code. Both modes got the validation right. But look at what comes after the code in normal mode: a bulleted explanation of the validation behavior, a note about CSS hooks, context about what each event does. Helpful? Sure. But it’s all wrapped around the exact same validation code, and that code speaks for itself.

Caveman Mode Response

Claude Code terminal showing caveman mode response — just code and a single word: Done.

Same prompt. Same form. Same working validation code.

One word: “Done.”

The Numbers

Here’s where it gets real.

The non-code explanation in normal mode? 377 characters. The caveman equivalent? 5 characters. That’s “Done.” — period included.

377 to 5. A 99% reduction in the explanation wrapper.
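If you want to check the arithmetic (or run it on your own sessions), here is a tiny sketch. The function name `percentReduction` is made up for illustration; the two numbers are the character counts from this test.

```typescript
// Percentage reduction between two sizes (characters, words, or tokens).
function percentReduction(before: number, after: number): number {
  if (before <= 0) {
    throw new Error("before must be positive");
  }
  return ((before - after) / before) * 100;
}

// Figures from the comparison above: 377 characters vs. 5 characters.
const wrapperReduction = percentReduction(377, 5);
console.log(wrapperReduction.toFixed(1) + "%"); // prints "98.7%"
```

98.7% rounds to the 99% quoted above. Keep in mind this measures only the explanation wrapper, not the code.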

Now multiply that across a full coding session. If you send Claude 30 prompts in an afternoon — and you probably do — that’s 30 explanations you didn’t ask for, 30 sign-offs you never read, 30 “here’s what this does” summaries for code you wrote the prompt for. Those tokens add up fast.

Drona Gangarapu’s benchmarks across five different prompts showed a consistent ~63% total word reduction when you factor in both code and explanation. But the explanation wrapper — the part that caveman mode actually targets — is where nearly all the savings come from.

The code is virtually identical in both cases. Same validateField function. Same event listeners. Same submit handler. Caveman mode cuts the wrapper, not the work product.
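For reference, the rules from the prompt can be written as a pure function. This is my own minimal sketch of the validation logic, not the code Claude produced in either run: the error messages, the `ContactField` type, and the simple email pattern are all assumptions, and the blur and submit wiring is left out.

```typescript
type ContactField = "name" | "email" | "subject" | "message";

// Intentionally simple format check (something@something.tld); an assumption, not a full RFC parser.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Returns an error message for the field, or null when the value passes.
function validateField(field: ContactField, rawValue: string): string | null {
  const value = rawValue.trim();
  if (value.length === 0) {
    return "This field is required.";
  }
  if (field === "name" && value.length < 2) {
    return "Name must be at least 2 characters.";
  }
  if (field === "email" && !EMAIL_PATTERN.test(value)) {
    return "Enter a valid email address.";
  }
  if (field === "message" && value.length < 10) {
    return "Message must be at least 10 characters.";
  }
  return null;
}

// Validates every field at once, the way a submit handler would.
function validateForm(values: Record<ContactField, string>): Partial<Record<ContactField, string>> {
  const errors: Partial<Record<ContactField, string>> = {};
  const fields: ContactField[] = ["name", "email", "subject", "message"];
  for (const field of fields) {
    const error = validateField(field, values[field]);
    if (error !== null) {
      errors[field] = error;
    }
  }
  return errors;
}
```

In the browser you would call `validateField` from a blur listener on each input and `validateForm` from the submit listener, then render the returned messages inline.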

.

.

.

Me Tell You Where Put Caveman Words

You’ve seen the results. Now — where should you actually put the caveman instruction?

Three options, ranked from best to worst.

1. In Your Project’s CLAUDE.md (Best)

This is the best place for your Claude Code caveman mode instruction. It persists across sessions, applies to every prompt in that project, and you set it once and forget it.

Add this to the top of your CLAUDE.md:

## Communication Style

Respond like a caveman. No articles, no filler words, no pleasantries.
Short. Direct. Code speaks for itself.
If asked for code, give code. No explain unless asked.
No sycophancy. No restating the question. No sign-offs.

Why the top? Claude processes CLAUDE.md instructions in order, and primacy effects matter. Communication style should be established before anything else.
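If you manage several projects and would rather script this than paste it by hand, a small helper could look like the sketch below. `withCavemanSection` is a hypothetical name, and the function only transforms a string; reading and writing the actual CLAUDE.md file is left to you.

```typescript
// The section text from above, kept as one constant.
const CAVEMAN_SECTION = [
  "## Communication Style",
  "",
  "Respond like a caveman. No articles, no filler words, no pleasantries.",
  "Short. Direct. Code speaks for itself.",
  "If asked for code, give code. No explain unless asked.",
  "No sycophancy. No restating the question. No sign-offs.",
].join("\n");

// Returns the file contents with the section placed at the top.
// If a "## Communication Style" heading already exists, the input is returned unchanged.
function withCavemanSection(claudeMd: string): string {
  if (claudeMd.indexOf("## Communication Style") !== -1) {
    return claudeMd;
  }
  const rest = claudeMd.replace(/^\s+/, "");
  if (rest.length === 0) {
    return CAVEMAN_SECTION + "\n";
  }
  return CAVEMAN_SECTION + "\n\n" + rest;
}
```

Running it twice gives the same result as running it once, so it is safe to rerun across projects.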

The project-level approach also gives you control — caveman mode on your personal projects, normal mode on client work. Different projects, different communication styles. One file each.

If you’re not using CLAUDE.md yet, start with The Single File That Makes or Breaks Your Claude Code Workflow. It covers why this file matters and how to structure it.

2. In Your Prompt Directly (Good for Testing)

Prepend the instruction to any prompt:

[caveman mode: no filler, no pleasantries, code only] Add JavaScript
input validation to this contact form...

This is how most people start — and it works fine for a test drive.

The downside: you’re typing it every time, and it adds input tokens to every single message. If you like the results, move it to CLAUDE.md and stop paying the repeated input cost.

3. In ~/.claude/CLAUDE.md (Global — Use Carefully)

This applies caveman mode to every project on your machine. Only do this if you want terse output everywhere. Most people should keep it project-level.

Bonus: As a Claude Code Skill

If you want cleaner separation of concerns — keeping your CLAUDE.md focused on project instructions while communication style lives in its own toggleable unit — Thomas Schlossmacher’s caveman-mode skill packages the whole thing as a drop-in .claude/skills/ file.

Worth a look if you manage multiple communication styles across projects.

Best Practices

  • Put it at the top of CLAUDE.md. Primacy effect means early instructions carry more weight.
  • Combine with other token-saving rules. “Don’t restate my question” and “Skip the sign-off” stack well with caveman mode.
  • Be specific about what to keep. If you still want code comments, say so: “Keep code comments. Skip everything else.”
  • Test first. Run a few prompts before committing it to CLAUDE.md permanently. (You’ll know within two prompts whether you love it or hate it.)

.

.

.

When Caveman Mode Bad. When Skip.

Caveman mode has real tradeoffs. And being honest about them is what makes the technique actually useful — instead of just another internet hack you try once and forget.

Skip it when you’re learning something new.

If you’re asking Claude to explain async/await, database indexes, or CSS grid — you want the verbose explanation. Those filler words become teaching words when you’re building a mental model. Caveman mode strips the pedagogy, and that’s a real loss when pedagogy is the whole point.

Skip it for complex architecture discussions.

“Use microservices” is a caveman answer. But what you actually need is: “Here’s why microservices fit your use case, here are the tradeoffs, and here’s what will break if your team is under five people.” When you need Claude to reason through options with you, let it reason.

Skip it when sharing outputs with collaborators.

If teammates read your Claude Code outputs or review AI-generated code, they need the context that caveman mode strips. Readability matters when the audience hasn’t seen the original prompt. (I learned this one the slightly awkward way.)

Skip it for debugging unfamiliar errors.

When you’re stuck on a cryptic error and need Claude to walk you through what’s happening, the detailed explanation is the value. “Fix: change line 42” doesn’t help if you don’t understand why line 42 was wrong in the first place.

The rule of thumb: use caveman mode when you know what you want and just need Claude to produce it. Skip it when you need Claude to think with you.

And here’s the good part — you can switch freely.

Caveman mode in your CLAUDE.md for daily coding. Remove it (or override it in the prompt) when you need the full Claude experience. Per-project. Per-session. Per-prompt. No commitment necessary.

.

.

.

Me Save Tokens. You Save Tokens. Community Win.

Caveman mode is funny. A developer on Reddit taught an AI to grunt, and thousands of people immediately started saving money.

That’s the internet at its best.

But zoom out and there’s something real underneath the joke. As AI coding tools move to per-token billing, being intentional about output verbosity becomes a genuine skill. And not just for your wallet — less fluff means faster responses, less scrolling, and more signal per screen.

The community drove this one. flatty on Reddit who made everyone laugh while solving a real problem. Drona Gangarapu who turned the concept into a benchmarked, production-ready CLAUDE.md file. The thousands of developers riffing on their own variations. Good ideas have a way of finding their people.

If you want to go further on the token-saving front — front-loading your prompts, optimizing your CLAUDE.md structure, getting more done within your usage limits — check out How to Double Your Claude Code Usage Limits. Caveman mode is one lever. There are more.

Me done. You go try. Report back.

11 min read The Art of Vibe Coding

Codex Reviews My Code Inside Claude Code — But I Don’t Trust It Blindly

I’ve been building something I can’t fully show you yet.

It’s a Chrome extension called PinFlow. The idea: you browse a page, click on any element, attach an instruction to it, and those instructions get routed straight into a local Claude Code session. No tab switching. No copy-pasting selectors. You pick, you describe, Claude edits your code.

The original PinFlow extension sidebar open on a Google page, showing the element picker with a "Pick an element first" prompt

I’ll cover how PinFlow works in a dedicated post in the future. (Subscribe if you don’t want to miss that one.)

But today’s story starts after I finished a major UI redesign of that extension.

The code had gotten complex. Multi-step wizard flows, state management across views, permission handling, concurrent request logic. The kind of complexity where you know bugs are hiding somewhere — you just can’t see them yet.

I needed a second pair of eyes.

Normally, that meant switching over to Codex in a separate terminal, running a review there, then hauling the results back to Claude Code. I’ve done this workflow dozens of times — I even wrote about it back in October 2025.

This time, I didn’t have to switch at all.

There’s a plugin for that now.

.

.

.

What Is the Claude Code Codex Plugin?

On March 30, 2026, OpenAI shipped an official Claude Code Codex plugin (openai/codex-plugin-cc). It lets you run Codex code reviews, adversarial reviews, and delegate tasks to Codex — all from inside your Claude Code session.

A few things worth knowing:

  • Free to use with any ChatGPT subscription, including the Free tier
  • Uses your local Codex CLI — same auth, same config, same models
  • Runs as a Claude Code plugin — the new plugin system, so it lives inside your session
  • 2,500+ GitHub stars in one day — the community noticed fast

If you’ve been following along, you’ll recognize the workflow this replaces.

Back in September 2025, I wrote about using Claude Code and Codex as separate tools in separate terminal windows. In October, I refined that into a structured handoff: Codex plans → Claude builds → Codex reviews.

The plugin collapses all of that into slash commands. No window switching. No copy-pasting context between tools. The review happens right where the code was written.

Install and Setup

Four commands. Under 2 minutes. That’s it.

/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup

If you don’t have Codex installed yet, /codex:setup handles that for you. If Codex isn’t logged in, run !codex login from within Claude Code — the ! prefix executes shell commands in your session.

After installation, you’ll see the new slash commands and the codex:codex-rescue subagent ready to go.

What Commands Do You Get?

The plugin ships with 7 commands:

| Command | What it does | Read-only? |
| --- | --- | --- |
| /codex:review | Standard code review of uncommitted changes or branch diff | Yes |
| /codex:adversarial-review | Steerable challenge review: questions design, tradeoffs, assumptions | Yes |
| /codex:rescue | Delegate a task to Codex (bug investigation, fixes, cheaper model pass) | No |
| /codex:status | Check progress on background Codex jobs | |
| /codex:result | Show final output of a finished job | |
| /codex:cancel | Cancel an active background job | |
| /codex:setup | Check/install Codex, manage review gate | |

The two I reach for most:

/codex:review — the bread and butter. Point it at your current changes and get a review. Supports --base main for branch diffs and --background for long-running reviews. Or --wait if you want to stay in the session until the review finishes.

/codex:adversarial-review — the pressure test. Unlike the standard review, you can steer it: “look for race conditions,” “challenge whether this caching approach is right.” I pull this one out before shipping anything risky.

There’s also /codex:rescue, which is the only command that can change code. It hands a task to Codex and supports different models (--model gpt-5.4-mini for quick passes). Think of it as delegating grunt work to a cheaper model while you stay focused.

.

.

.

The Demo: Reviewing a Real Redesign

Here’s where it gets concrete.

I was redesigning PinFlow’s sidebar UI — moving from a single-element component to a full multi-step wizard with Pick → Write → Review steps, multi-pin support, and shared/per-pin instruction modes. A big change.

I gave Claude Code the task with my redesign notes:

Claude Code prompt: "I want to redesign the sidebar UI for the chrome extension based on the redesign notes here: @notes/new_ui_redesign.md"

Claude explored the codebase, reviewed the wireframes, and came back with clarifying questions — architecture decisions, multi-pin picking strategy, scope for the reference mode, how far to go with the running and done states.

Claude exploring the codebase and asking its first architecture question about keeping vanilla HTML strings vs. introducing a lightweight component library
Round 1 of clarifying questions covering architecture, multi-pin support, reference mode, and implementation scope
Round 2 of clarifying questions on send mode and reference UI placement

Two rounds of questions later, Claude had enough context to plan.

It created a task list — three-step wizard, multi-pin picking, write step with shared/per-pin modes, review step, running and done states, activity view, submit flow, and prompt builder.

Claude creating implementation tasks: update types, rewrite render shell, implement Pick step, Write step, Review step, Running/Done states, Activity view, submit flow, and wire up event handlers

8 minutes and 45 seconds later, the redesign was complete.

Implementation summary showing the new architecture: 3-step wizard with Pick/Write/Review steps, 4 views, multi-pin support, shared and per-pin instruction modes, running and done states, activity view, and updated submit flow

A full sidebar redesign. New wizard flow. New state management. New views. All from a single Claude Code session.

But here’s the thing — when that much code changes at once, edge cases don’t announce themselves. They hide in the seams between states, waiting for a user to stumble into them.

I could feel it. Time for Codex.

Triggering the Review

/codex:review --wait
Triggering /codex:review --wait in Claude Code

The --wait flag keeps the session active until the review finishes. Behind the scenes, the plugin spins up a Codex review thread against your uncommitted changes.

Codex starting the review thread, showing the bash command running the codex-companion script, thread ID, and "Reviewer started: current changes" with a 1-minute timeout

6 minutes 35 seconds later, the results came back.

The Review Results

Codex Review Results showing 4 issues found — all edge-case correctness issues. A table lists: P1 Submit silently no-ops when no project configured, P2 Step bar allows jumping to Review without instructions, P2 Back from Activity strands running requests, P2 Concurrent requests can overwrite lastResult and currentView. Ends with "Want me to fix these issues?"

4 issues found. All related to edge-case correctness rather than the core redesign:

| Priority | Issue |
| --- | --- |
| P1 | Submit silently no-ops when no project is configured (no user feedback) |
| P2 | Step bar allows jumping to Review without writing instructions |
| P2 | “Back” from Activity always goes to wizard, even if a request is running |
| P2 | Concurrent requests can overwrite each other’s lastResult and currentView |

Every single one of these is the kind of bug that slips through during a big redesign. You’re focused on the main flow — the happy path — and the edge cases hide in the seams between states.

At the bottom of the review: “Want me to fix these issues?”

I could have said yes. Let Claude apply all 4 fixes and move on with my day.

I’ve been on the other side of that decision. Said “yes, fix everything” on a review once, walked away, came back to a diff full of renamed variables and reshuffled imports that had nothing to do with the actual bugs. Took longer to untangle than the original review would have.

So no. I didn’t say yes.

.

.

.

The Validation Prompt: Where the Real Value Lives

Here’s what I do instead — and honestly, this is the part I want you to steal.

After receiving Codex’s review comments, I paste this prompt:

let's address the code review comments provided.

Follow the steps below to effectively address the code review comments:

1/ First, you should analyze the code review comments carefully and understand the feedback given.
2/ Then, determine if the comments given are valid and we should make changes to the code based on the feedback.
3/ If the comments are valid, you should make the necessary changes to the code to address the comments. If you believe the comments are not valid, you should provide a clear explanation to justify why you think the comments are not valid.

Use the AskUserQuestion tool to ask me clarifying questions until you are 95% confident you can complete this task successfully. For each question, add your recommendation (with reason why) below the options. This would help me in making a better decision.
The validation prompt pasted into Claude Code after receiving Codex's review comments

What happens next is the key insight.

Claude reads the Codex review. It analyzes each comment against the actual codebase — the code it just wrote, with full context of why things are structured that way. And instead of blindly applying everything, it comes back with a verdict and clarifying questions.

Claude's response: "All 4 comments are valid. Most fixes are straightforward, but two have design decisions worth confirming." Shows AskUserQuestion with two questions — P1: how to handle the no-project case (recommends Disable Send + inline hint) and P2: how to handle concurrent requests (recommends Prevent new submissions)

Look at what Claude did here:

“All 4 comments are valid. Most fixes are straightforward, but two have design decisions worth confirming.”

For the straightforward fixes, Claude proceeds. For the ones with judgment calls, it asks — with a recommendation and reasoning for each option:

  • P1 — No project configured: When no project is set, the Send button silently does nothing. How should we handle this? Claude recommends: Disable Send + inline hint.
  • P2 — Concurrent requests: A second request can start while one is already running. Should we prevent it or handle the overlap? Claude recommends: Prevent new submissions.

Each question comes with Claude’s recommendation and the reasoning behind it. I pick the recommended options for both.

This is the part that matters.

Claude becomes a filter between the review and your code. It validates each comment in context, surfaces the ones that need your judgment, and handles the rest. You stay in control without having to re-read every line yourself.
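One way to picture that filter is as a three-way sort. The sketch below is only a mental model, not something Claude or the plugin runs: the `ReviewedComment` shape, the ids, and the bucket names are all invented for illustration.

```typescript
interface ReviewedComment {
  id: string;             // hypothetical label for the review comment
  valid: boolean;         // did the comment hold up against the actual code?
  needsDecision: boolean; // does the fix involve a design choice only the developer can make?
}

interface TriageResult {
  applyNow: ReviewedComment[];
  askUser: ReviewedComment[];
  skipWithExplanation: ReviewedComment[];
}

// Sorts validated comments into: fix directly, ask first, or push back with a reason.
function triage(comments: ReviewedComment[]): TriageResult {
  return {
    applyNow: comments.filter(c => c.valid && !c.needsDecision),
    askUser: comments.filter(c => c.valid && c.needsDecision),
    skipWithExplanation: comments.filter(c => !c.valid),
  };
}

// The four comments from this review: all valid, two with design decisions.
const thisReview: ReviewedComment[] = [
  { id: "P1-no-project", valid: true, needsDecision: true },
  { id: "P2-step-bar", valid: true, needsDecision: false },
  { id: "P2-back-from-activity", valid: true, needsDecision: false },
  { id: "P2-concurrent-requests", valid: true, needsDecision: true },
];
const buckets = triage(thisReview);
console.log(buckets.applyNow.length, buckets.askUser.length, buckets.skipWithExplanation.length); // prints "2 2 0"
```

The third bucket stayed empty in this run, but it is the one that protects you from churn when a comment turns out to be a false positive.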

Watching the Fixes Go In

With the design decisions confirmed, Claude applies all 4 fixes.

Fix 1 (P1): Disable the Send button on the Review step when no project is configured. Show an inline hint: “No project configured — open Settings.”

Fix 2 (P2): Block step-bar navigation to Review when no instructions have been written. Add a hasInstruction() guard to the click handler.

Claude applying Fix 1 — disabling Send + showing hint for no project — and Fix 2 — adding hasInstruction guard to block premature Review navigation. Shows actual code diffs in src/content/index.ts
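In spirit, the first two fixes are small guards. Here is a stripped-down sketch with the DOM wiring removed. Only `hasInstruction` is a name from the actual fix; the `WizardState` shape, `canNavigateTo`, and `sendGate` are my assumptions about how such guards could be written, not PinFlow’s real code.

```typescript
type WizardStep = "pick" | "write" | "review";

interface WizardState {
  projectPath: string | null;   // null when no project is configured (assumed field)
  sharedInstruction: string;    // text used in shared mode (assumed field)
  perPinInstructions: string[]; // one entry per pin in per-pin mode (assumed field)
}

// True when at least one non-blank instruction has been written.
function hasInstruction(state: WizardState): boolean {
  return state.sharedInstruction.trim().length > 0 ||
    state.perPinInstructions.some(text => text.trim().length > 0);
}

// Fix 2: the step bar may only jump to Review once an instruction exists.
function canNavigateTo(step: WizardStep, state: WizardState): boolean {
  return step !== "review" || hasInstruction(state);
}

// Fix 1: Send is disabled when no project is configured.
// The UI would show the inline hint whenever reason is "no-project".
function sendGate(state: WizardState): { disabled: boolean; reason: "no-project" | null } {
  if (state.projectPath === null) {
    return { disabled: true, reason: "no-project" };
  }
  return { disabled: false, reason: null };
}
```

Keeping the guards as pure functions means the click handlers only have to ask a yes/no question before acting.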

Fix 3 (P2): Back from Activity now checks if a request is currently running. If it is, the view returns to running instead of the wizard — preventing the user from stranding an in-flight request.

Fix 2 continued — updating the step-bar click handler condition — and Fix 3: updating the Back from Activity handler to check records.some for running state and return to the running view instead of wizard
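The Fix 3 logic boils down to one question: is anything still running? A minimal sketch, assuming a `records` array with a `status` field; the type names and view names here are hypothetical stand-ins for whatever the extension actually uses.

```typescript
type RequestStatus = "running" | "done" | "error";

interface RequestRecord {
  id: string;
  status: RequestStatus;
}

type SidebarView = "wizard" | "running" | "done" | "activity";

// Decides where "Back" from the Activity view should land.
// If any request is still in flight, return to the running view instead of the wizard.
function viewAfterBackFromActivity(records: RequestRecord[]): SidebarView {
  return records.some(r => r.status === "running") ? "running" : "wizard";
}
```

With no records, or only finished ones, Back behaves exactly as it did before the fix.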

Fix 4 (P2): Scope lastResult and currentView updates so they only fire when the completing request matches the current one. Prevents a finishing background request from overwriting the state of a newer request.

Fix 4: Scoping lastResult/currentView updates behind a currentRequestId check in both success and error paths, preventing stale responses from overwriting current state
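This is the classic stale-response guard. A rough sketch of the idea, assuming the state is a plain object: `currentRequestId`, `lastResult`, and `currentView` are the names from the fix, while `SidebarState` and `applyCompletion` are hypothetical.

```typescript
interface SidebarState {
  currentRequestId: string | null;
  currentView: string;
  lastResult: string | null;
}

// Applies a finished request's outcome only if it is still the current request.
// The same check would sit in both the success path and the error path.
function applyCompletion(
  state: SidebarState,
  requestId: string,
  result: string,
  nextView: string
): SidebarState {
  if (state.currentRequestId !== requestId) {
    // A newer request owns the UI now; ignore the stale completion.
    return state;
  }
  return {
    currentRequestId: state.currentRequestId,
    currentView: nextView,
    lastResult: result,
  };
}
```

The older request still finishes; it just no longer gets to repaint the sidebar.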

Then Claude verifies the build still compiles. All clean.

Build verification passes. Summary table showing all 4 fixes: P1 Send button disabled with inline hint, P2 hasInstruction guard added, P2 Back from Activity returns to running view, P2 Send disabled during active request with scoped state updates. "Brewed for 1m 54s"

4 issues identified. 4 fixes applied. 2 design decisions confirmed. 1 minute 54 seconds.

And here’s the part worth sitting with: if any of those review comments had been a false positive — a stylistic preference that didn’t match the codebase, or a “problem” that was actually intentional — Claude would have flagged it. It would have said “this comment suggests X, but the current approach is correct because Y” and asked whether to skip it.

That filtering step is the difference between a code review you can act on and a code review that introduces churn.

The Before and After

Remember the original PinFlow UI from the top of this post? Here’s what it looks like after the redesign and the review fixes:

New wizard flow. Clean state management. And four edge-case bugs caught before they ever reached a user.

I’ll go deep on the extension itself in a future post.

(Stay tuned for that one.)

.

.

.

The Review Gate: The Automated Alternative

The plugin also includes a review gate — a built-in hook that automatically runs a Codex review before Claude finishes a task:

/codex:setup --enable-review-gate

When enabled, the gate intercepts Claude just before it finishes a response and runs a Codex review first. If the review finds issues, the stop is blocked so Claude can address them.

I prefer the manual approach.

The review gate can create long-running Claude/Codex loops that drain usage limits, and it doesn’t give you the chance to filter false positives before they get fed back in. For long autonomous runs where you want a safety net, though, the gate has its place.
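To see why the loop can get expensive, here is a toy model of a stop gate. It is not the plugin’s implementation. In particular, `maxRounds` is a cap I added so the sketch terminates; nothing in this post says the real gate has one.

```typescript
interface GateReview {
  issuesFound: number;
}

// Toy model of a stop gate: Claude may only stop once a review comes back clean.
// Every blocked stop means one more review plus one more round of fixes.
function runGate(
  review: (round: number) => GateReview,
  maxRounds: number
): { reviewsRun: number; allowedToStop: boolean } {
  for (let round = 1; round <= maxRounds; round++) {
    if (review(round).issuesFound === 0) {
      return { reviewsRun: round, allowedToStop: true };
    }
    // Issues found: the stop is blocked and the findings go back to Claude.
  }
  return { reviewsRun: maxRounds, allowedToStop: false };
}

// A reviewer that keeps finding something for the first two rounds.
const outcome = runGate(round => ({ issuesFound: round < 3 ? 1 : 0 }), 10);
console.log(outcome.reviewsRun, outcome.allowedToStop); // prints "3 true"
```

Every round here stands for a full Codex review plus a Claude fix pass, and none of the findings get filtered for false positives along the way.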

Think of the manual prompt as the scalpel and the review gate as the safety net — choose based on how much control you want.

.

.

.

The Bigger Picture: Claude and Codex, Integrated

Let me zoom out for a second. My Claude-Codex workflow has gone through three distinct phases:

1. Side by side (Sept 2025) — Separate tools, separate terminal windows, separate contexts. I used to keep two terminals open — Claude Code on the left, Codex on the right. Copy a file path from the review, switch windows, find the line, switch back. By the third comment I’d lost track of what I was even fixing.

2. Manual handoff (Oct 2025) — Structured workflow with Codex planning and reviewing, Claude building. Better. But still separate tools with separate contexts.

3. Integrated (now) — Codex commands running inside Claude Code. Shared context. No switching. The review happens where the code lives.

Each evolution removed friction. The Claude Code Codex plugin removes the last meaningful barrier: context loss between tools.

And when I pair that with the validation prompt — having Claude critically evaluate Codex’s feedback before acting on it — I get a review workflow that catches real bugs without drowning me in noise.

Between the Codex plugin and the Chrome extension I teased at the top, the direction feels clear. The tools are converging. The best workflow is the one where you never have to leave.

.

.

.

Your Next Steps

The plugin takes 2 minutes to install. The validation prompt is 6 lines you can copy-paste.

Together, they give you a code review workflow that catches real issues — and lets you skip the noise.

Here’s what to do:

  1. Install the plugin (4 commands above)
  2. Run /codex:review on whatever you’re working on right now
  3. Paste the validation prompt and let Claude filter the results
  4. Fix what matters. Skip what doesn’t.

Try it on your next session. You’ll be surprised how many review comments are noise — and how valuable the ones that survive the filter actually are.

Plugin repo: openai/codex-plugin-cc

12 min read The Art of Vibe Coding

Workflow Engineering in Action: Building a Reddit Summarizer From Scratch With Claude Code

Here’s a confession.

I follow about a dozen subreddits. AI tooling, Claude Code tips, local LLM experiments, dev workflows. And every single morning, I open Reddit fully intending to spend five minutes catching up.

Forty-five minutes later, I’m still scrolling.

Ninety percent of it is noise. Reposts, complaints (like those weekly usage/rate limits rants in r/ClaudeCode), low-effort memes, questions that got answered three threads ago. But buried somewhere in there — a workflow trick someone discovered at 2am, a Claude Code hack that actually works in production, a case study with real numbers — that stuff is gold.

I just couldn’t find it fast enough.

So I decided to build something. A simple Express server that would connect to the Reddit API, pull posts and comments from my favorite subreddits, store them locally as JSON files, and let me point Claude at the data to surface only what matters.

And here’s the part that matters for you: I built it using the Claude Code Workflow Engineering process I described in the previous issue. Start to finish. No shortcuts. No “eh, I’ll just wing this part.”

(Okay, I was tempted. But I didn’t.)

What follows is every step of that process applied to a real project — from a blank folder to a working app with full tests passing on the first attempt. Every screenshot. Every command.

Stay with me.

.

.

.

The Starting Point: One Idea, Zero Code

Here’s what my project folder looked like when I started: an idea.md file describing what I wanted, and the Workflow Engineering slash commands from the previous issue.

That’s it. No boilerplate. No template repo. No starter code. Just an idea and a process.

Project folder showing only the idea.md file and workflow engineering commands

The idea itself was pretty straightforward: an Express server that fetches posts and comments from configured subreddits within the last 24 hours, then saves everything as JSON files organized by subreddit and date. No database — just files on disk. Once the data is collected, I can ask Claude to read it and find the good stuff for me.
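To make the idea concrete, here is a rough sketch of the two core pieces: the 24-hour filter and the file layout. Reddit reports `created_utc` in seconds, which is why the cutoff is computed in seconds. Everything else (the `RedditPost` shape, the function names, the `data/<subreddit>/<date>.json` layout) is an assumption for illustration, not the final app’s code.

```typescript
interface RedditPost {
  id: string;
  title: string;
  createdUtc: number; // seconds since the Unix epoch
}

const DAY_IN_SECONDS = 24 * 60 * 60;

// Keeps only posts created within the 24 hours before nowMs.
function postsFromLastDay(posts: RedditPost[], nowMs: number): RedditPost[] {
  const cutoff = Math.floor(nowMs / 1000) - DAY_IN_SECONDS;
  return posts.filter(p => p.createdUtc >= cutoff);
}

// Builds a path like data/ClaudeCode/2026-01-02.json (date in UTC).
function outputPath(dataDir: string, subreddit: string, collectedAt: Date): string {
  const day = collectedAt.toISOString().slice(0, 10);
  return dataDir + "/" + subreddit + "/" + day + ".json";
}
```

Passing the current time in as a parameter, instead of calling `Date.now()` inside, keeps both functions easy to test.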

The one wrinkle? Reddit’s API now requires OAuth 2.0. So the app needs to handle the full authorization flow — token exchange, refresh tokens, the whole dance — before it can fetch anything.
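The part of that dance the app touches most often is deciding when to refresh. A minimal sketch, assuming the token response carries the standard OAuth 2.0 `expires_in` field (a lifetime in seconds); the `StoredToken` shape, the function names, and the 60-second safety margin are my own choices for illustration.

```typescript
interface StoredToken {
  accessToken: string;
  refreshToken: string;
  expiresAtMs: number; // absolute expiry time in milliseconds
}

// Converts the relative expires_in (seconds) into an absolute timestamp at issue time.
function toStoredToken(
  accessToken: string,
  refreshToken: string,
  expiresInSec: number,
  issuedAtMs: number
): StoredToken {
  return {
    accessToken: accessToken,
    refreshToken: refreshToken,
    expiresAtMs: issuedAtMs + expiresInSec * 1000,
  };
}

// True when the access token has expired or will expire within the safety margin.
function needsRefresh(token: StoredToken, nowMs: number, safetyMarginMs: number = 60000): boolean {
  return nowMs >= token.expiresAtMs - safetyMarginMs;
}
```

Checking `needsRefresh` before every fetch, and exchanging the refresh token when it returns true, avoids failing halfway through a collection run.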

With a clear idea written down, I handed it to the workflow.

Let’s walk through what happened.

.

.

.

Step 1: Brainstorm the Specs

I triggered the /spec_brainstorm command and pointed Claude at my idea file.

Claude Code terminal showing /spec_brainstorm command being triggered with the idea.md file

Now, I’ve tried building apps like this before — dumping everything into one prompt and letting Claude run. It got through maybe 60% before the code started contradicting itself. Requirements from the top of the conversation were ghosted by the bottom.

The Claude Code Workflow Engineering approach is different. Instead of jumping into code, Claude started asking clarifying questions. Real ones. With options, explanations, and a recommendation for each.

The first round covered core architecture decisions: How should data collection be triggered? (Manual API endpoint, cron scheduler, or both?) What kind of frontend does this need beyond the OAuth setup page? How should filtering work?

Claude Code presenting multiple-choice question about data collection trigger method with options for manual API endpoint, built-in cron scheduler, or both
Claude Code asking about frontend scope with options for minimal OAuth-only page or full dashboard UI
Summary of first round answers covering Summarizer, Trigger, Frontend, and Filtering categories

The second round went deeper: How should subreddits be configured? (Config file, hardcoded, or environment variables?) What data should go into the JSON files? (Posts only, posts + all comments, or posts + top comments?) Language preference?

Summary of second round answers covering Config, Data scope, and Language with TypeScript selected

Once Claude had enough context from both rounds, it wrote the full specification document.

Claude Code writing the complete specs.md file based on all the answers provided

Two rounds of questions. Clear decisions documented in a file. The specs existed as an artifact on disk — ready to be read by a completely fresh session with zero memory of this conversation.

That last part matters more than you might think.

(We’re about to see why.)

.

.

.

Step 2: Review the Specs

Here’s where most people go wrong. And I know this because I was most people.

On an earlier project, I skipped the review step. The specs looked fine to me. Three hours into implementation, I found a conflict that would have taken a reviewer two minutes to flag. Two minutes.

So now I don’t skip it.

Here’s the thing: the agent that wrote the specs is the worst possible agent to review them. It already “knows” what it meant. It won’t catch ambiguity because it can fill in the gaps from memory. A fresh agent reading the same file cold? It has no such luxury.

New session. /spec_review command.

New Claude Code session showing /clear followed by /spec_review command

A fresh Claude instance — with zero memory of the brainstorming conversation — read the specs and started poking holes.

And it found real problems. Using GET for state-changing operations (a REST convention violation and a security risk — someone could trigger data collection just by visiting a URL). Writing refresh tokens directly to .env at runtime (which, ferpetesake, doesn’t work the way the spec assumed). Vague OAuth state storage. And more.

Claude Code presenting spec review findings including P1 GET for state-changing operations, P2 writing refresh token to .env, and P3 vague OAuth state storage

Now, here’s where your judgment comes in. Claude surfaced a long list of potential issues — some critical, some nice-to-have. You don’t have to fix everything. You get to choose what matters.

I went through them in three tiers.

First — the spec-breaking issues:

Multi-select interface showing spec fixes with options like REST endpoint methods, OAuth state storage, error strategy, and collect-all endpoint behavior

Second — important improvements:

Second page of issue selection showing P5 Collect-all timeout, P6 Pagination limit, P7 Partial failure, and P8 Date boundary

Third — lower priority fixes (I picked the ones with real consequences):

Third page showing lower priority issues including P10 More Reddit error codes, P11 User-Agent source, and P13 Path traversal security fix

Before applying fixes, Claude asked clarifying questions to make sure the solutions would be solid. How should /collect-all handle job tracking? What should the filename date represent — when data was collected or when the post was created? Where should the Reddit username for the User-Agent header come from?

Claude asking about collect-all endpoint job tracking approach with sequential vs parallel options
Claude asking about date boundary logic for filename dates with a visual example showing collection date mapping
Claude asking about User-Agent source with env var recommended, showing example .env configuration

With answers in hand, Claude updated the specs — changing GET to POST for state-changing endpoints, adding proper error handling, fixing the OAuth storage approach, adding pagination limits, and patching a path traversal vulnerability.

Claude modifying the specs file with red/green diff showing changes to API endpoints and OAuth flow
Summary table of all spec changes applied, showing 10 fixes across REST methods, token storage, OAuth state, pagination, error handling, and path traversal
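The path traversal fix is worth a closer look, because the subreddit name ends up inside a file path. One common way to close that hole is an allow-list check before building any path. The sketch below is my own illustration, not the spec’s wording; the pattern (letters, digits, underscores, up to 21 characters) is an assumption about what a valid subreddit name looks like.

```typescript
// Assumed allow-list for subreddit names: letters, digits, underscore, 1 to 21 characters.
const SUBREDDIT_NAME = /^[A-Za-z0-9_]{1,21}$/;

// Returns the directory for a subreddit, rejecting anything that could escape dataDir.
function safeSubredditDir(dataDir: string, subreddit: string): string {
  if (!SUBREDDIT_NAME.test(subreddit)) {
    throw new Error("Invalid subreddit name: " + JSON.stringify(subreddit));
  }
  return dataDir + "/" + subreddit;
}
```

Because the pattern has no room for dots or slashes, inputs like `../secrets` never reach the file system.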

Ten issues addressed. Specs refined. The artifact on disk now reflected a far more robust design than what the brainstorming session produced alone.

And we still haven’t written a single line of code.

(On purpose.)

.

.

.

Step 3: Write the Test Plan

I’ll be honest with you — this step almost didn’t happen.

Writing tests for code that doesn’t exist yet? It felt ceremonial. Like filling out a form nobody would read. I almost skipped it.

Then the test plan revealed two requirements I’d completely glossed over in the spec.

So now I never skip it.

Fresh session. /write_test_plan command.

New session with /clear followed by /write_test_plan command

Claude read the specs and produced a structured test plan: 33 test cases organized by priority. 8 Critical, 14 High, 10 Medium. Each one with preconditions, specific steps, and expected outcomes.

Claude writing test_plan.md with 33 test cases organized into sections covering config, OAuth, collection, filtering, storage, API, errors, frontend, and date handling

Why does this matter so much?

Because writing test cases forces deep analysis of every requirement. Turning “handle pagination limits” into a specific test case — with exact inputs, steps, and expected outputs — requires genuine understanding. Shallow understanding produces shallow tests, and you’d catch that now rather than three hours into debugging.
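Here is what “handle pagination limits” might look like once it is forced into a concrete case. Reddit listings page with an `after` cursor; the `collectAll` helper, the `ListingPage` shape, and the limit of 3 are hypothetical, chosen only to show a precondition, a step, and an expected outcome.

```typescript
interface ListingPage {
  items: string[];
  after: string | null; // cursor for the next page, or null on the last page
}

// Follows the cursor until the listing ends or maxPages is reached.
function collectAll(fetchPage: (after: string | null) => ListingPage, maxPages: number): string[] {
  const items: string[] = [];
  let after: string | null = null;
  for (let page = 0; page < maxPages; page++) {
    const result: ListingPage = fetchPage(after);
    for (const item of result.items) {
      items.push(item);
    }
    after = result.after;
    if (after === null) {
      break;
    }
  }
  return items;
}

// Precondition: a source that always claims there is another page.
let fetchCalls = 0;
const endlessSource = (after: string | null): ListingPage => {
  fetchCalls++;
  return { items: ["post-" + fetchCalls], after: "cursor-" + fetchCalls };
};

// Step: collect with a limit of 3 pages. Expected: exactly 3 fetches, 3 items.
const collected = collectAll(endlessSource, 3);
console.log(collected.length, fetchCalls); // prints "3 3"
```

Writing the case this way surfaces questions the one-line requirement hides, like whether the limit counts pages or posts.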

And there’s a second benefit: the test plan gives implementation a concrete target. Every task will map to specific test cases. “Done” stops being a gut feeling and starts being a checkmark.
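To make "specific test case" concrete, here is a minimal Python sketch of what one entry in a test plan might hold. The field names, the example ID, and the example wording are my own illustration, not the format Claude actually wrote to test_plan.md.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PlanCase:
    """One entry in a test plan. Field names are illustrative."""
    case_id: str            # e.g. "TC-001"
    title: str
    priority: str           # "Critical", "High", or "Medium"
    preconditions: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    expected: list[str] = field(default_factory=list)


def is_concrete(case: PlanCase) -> bool:
    """A case is only a usable target if it has steps and expected outcomes."""
    return bool(case.steps) and bool(case.expected)


def count_by_priority(cases: list[PlanCase]) -> dict[str, int]:
    """Tally cases per priority level, for the plan's summary line."""
    return dict(Counter(case.priority for case in cases))


pagination = PlanCase(
    case_id="TC-EXAMPLE",  # hypothetical ID, not from the real plan
    title="Collection stops at the pagination limit",
    priority="High",
    preconditions=["Subreddit has more posts than the configured page limit"],
    steps=["Trigger a collection run", "Count the pages requested"],
    expected=["No more than the configured number of pages are fetched"],
)
# The same requirement, left as a one-line wish instead of a test.
vague = PlanCase(case_id="TC-VAGUE", title="Handle pagination limits", priority="High")
```

The `vague` entry is the spec sentence copied into a test plan. It has a title and nothing to execute, which is exactly the gap this step is meant to expose.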

.

.

.

Step 4: Write the Implementation Plan

Another fresh session. /write_impl_plan command.

New session with /clear followed by /write_impl_plan command

Claude read both the specs and the test plan, then generated an implementation plan — 10 tasks, each explicitly linked to the test cases it would satisfy.

Claude writing implementation plan showing project overview and task structure with dependencies

The plan organized tasks into execution waves based on dependencies. Every task mapped to specific test case IDs.

Implementation plan summary table showing 10 tasks with their key test case mappings and execution order across 5 waves
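If you are curious how tasks fall into waves, the grouping can be sketched in a few lines of Python. This is my own illustration of dependency-based layering, not the logic Claude runs internally, and the task names and dependencies are hypothetical.

```python
def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves: a task runs once all its dependencies are done."""
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # Every task whose dependencies are all finished can start now.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"Circular or unknown dependency among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical task names and dependencies, loosely modeled on this build.
tasks = {
    "scaffolding": set(),
    "config": {"scaffolding"},
    "oauth": {"scaffolding"},
    "collection": {"config", "oauth"},
    "storage": {"collection"},
    "api_routes": {"collection", "storage"},
}
for number, wave in enumerate(plan_waves(tasks), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Tasks that share a wave have no dependency on each other, which is what makes it safe to hand them to parallel sub-agents.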

This is the last thinking step. After this, every design decision has been made. Every task has a defined scope. Every success criterion sits in a file on disk.

Now — and only now — we build.

.

.

.

Step 5: Execute the Implementation

Fresh session. /do_impl_plan command.

New session with /clear followed by /do_impl_plan command

Here’s where the Claude Code Workflow Engineering approach earns its keep.

Instead of running all 10 tasks in a single session (which would cause context degradation as the window fills up — I’ve been there, remember?), Claude created each task and processed them in waves using sub-agents. Each sub-agent got a fresh context window. It read the implementation plan from disk, found its assigned task, and executed with laser focus.

Wave 1 started with the foundation — project scaffolding.

Claude creating all tasks with dependencies and executing Wave 1 Task 1 for project scaffolding

Then the waves rolled forward, with parallel tasks running wherever dependencies allowed:

Wave progression showing tasks running in parallel — config loading, OAuth routes, collection logic, and filtering all executing concurrently across waves
Later waves handling comment fetching, error handling, and storage management with sub-agents completing tasks
Final implementation waves covering API routes and the frontend OAuth setup page

Implementation done. 16 files created. All 10 tasks completed across multiple waves.

Implementation summary showing all files created with their purposes — from package.json to the OAuth frontend page

Every sub-agent worked from the same artifact — the implementation plan on disk. No context bleeding between tasks. No “forgetting” early requirements while working on later ones.

Fresh context, every single wave.
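Here is a rough Python analogy for that pattern, with threads standing in for sub-agents. Everything in it is an assumption for illustration: the real plan is a Markdown file, the task IDs and test case mappings are made up, and a real sub-agent writes code rather than returning a string.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_task(plan_path: Path, task_id: str) -> str:
    """Stand-in for one sub-agent: start with nothing, read the plan, do one task."""
    plan = json.loads(plan_path.read_text(encoding="utf-8"))  # shared artifact on disk
    task = plan["tasks"][task_id]
    # A real sub-agent would implement the task here; this sketch just reports it.
    return f"{task_id}: {task['title']} (covers {', '.join(task['test_cases'])})"


def run_wave(plan_path: Path, task_ids: list[str]) -> list[str]:
    """Run every task in a wave concurrently, each worker reading the plan itself."""
    with ThreadPoolExecutor(max_workers=len(task_ids)) as pool:
        return list(pool.map(lambda task_id: run_task(plan_path, task_id), task_ids))


# Hypothetical plan contents, stored as JSON only to keep the sketch short.
example_plan = {
    "tasks": {
        "T2": {"title": "Config loading", "test_cases": ["TC-001", "TC-002"]},
        "T3": {"title": "OAuth routes", "test_cases": ["TC-004", "TC-005"]},
    }
}
with tempfile.TemporaryDirectory() as folder:
    plan_file = Path(folder) / "impl_plan.json"
    plan_file.write_text(json.dumps(example_plan), encoding="utf-8")
    results = run_wave(plan_file, ["T2", "T3"])
print("\n".join(results))
```

The point of the shape is that `run_task` receives only a file path and a task ID. Nothing from the orchestrating session leaks in, so each worker knows exactly what the plan on disk tells it and no more.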

.

.

.

Step 6: Setup Before Testing

Before running the test plan, I needed to set up the actual Reddit integration. Three things:

A config file defining which subreddits to monitor (I chose ClaudeCode and ClaudeAI — for obvious reasons):

config.json file showing two subreddits configured — ClaudeCode with minScore 10 and minComments 5, and ClaudeAI with defaults

A Reddit app registration to get OAuth credentials:

Reddit's create application page with RdSummarizer as the app name, web app type selected, and localhost redirect URI configured

And a .env file with the credentials:

.env file showing REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_REDIRECT_URI, REDDIT_USERNAME, REDDIT_REFRESH_TOKEN (empty), and PORT=5566
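For readers without the screenshots, here is a small Python sketch of how a config like this could be loaded and checked. The real app is an Express server, so treat this as an illustration only. The `subreddits` and `name` keys and the default filter values are my assumptions; `minScore`, `minComments`, and the environment variable names come from the files shown above.

```python
import json

# Variables that must be set before the server starts. REDDIT_REFRESH_TOKEN is
# left out on purpose: it stays empty until the OAuth flow has been completed.
REQUIRED_ENV = [
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_REDIRECT_URI",
    "REDDIT_USERNAME",
]
# Assumed defaults: the config only tells us ClaudeAI runs "with defaults".
DEFAULT_FILTERS = {"minScore": 0, "minComments": 0}


def load_subreddits(config_text: str) -> list[dict]:
    """Parse config.json text and fill in missing filter values with defaults."""
    config = json.loads(config_text)
    return [{**DEFAULT_FILTERS, **entry} for entry in config["subreddits"]]


def missing_env(env: dict[str, str]) -> list[str]:
    """Return required variables that are absent or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


example_config = """
{
  "subreddits": [
    {"name": "ClaudeCode", "minScore": 10, "minComments": 5},
    {"name": "ClaudeAI"}
  ]
}
"""
subreddits = load_subreddits(example_config)
print(subreddits)
```

At startup you would call something like `missing_env(dict(os.environ))` and refuse to run if the list is not empty.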

Straightforward stuff. Let’s get to the good part.

.

.

.

Step 7: Run the Test Plan

The final step. Fresh session. /run_test_plan command.

(Deep breath.)

New session with /clear followed by /run_test_plan command

Claude read the test plan, explored the codebase, and confirmed this was a fresh test run with 33 test cases ready to execute.

Claude reading the test plan and exploring the codebase structure, confirming 33 test cases for a fresh test run

It created tasks for each test case, set up tracking files, and organized execution by dependencies and priority.

Claude creating 33 test case tasks with dependencies, setting up test-status.json and test-results.md tracking files

I asked Claude to skip TC-003 (environment variable validation) since that one needed manual testing with specific env states.

User asking Claude to skip TC-003 env validation test, Claude acknowledging and marking it as skipped

Then the tests ran. One sub-agent per test case. Each with fresh context.

Test execution Phase 2 running TC-001 through TC-006 with sub-agents, showing config loading, invalid config, OAuth redirect, and callback tests passing
Mid-test execution showing TC-007 through TC-010 passing — token refresh, collection happy path, subreddit validation, and hours parameter tests
Continued test execution with TC-011 through TC-019 — collection, filtering, storage, and error handling tests all passing with sub-agents
Test execution TC-020 through TC-025 — storage directory creation, merge deduplication, Reddit API pagination, comments fetching, and rate limit handling all passing
Final batch of tests TC-026 through TC-033 including error codes 401 403 404 429 5xx, frontend OAuth page, User-Agent header, and collection date naming all passing

All automated tests passed on the first attempt.

Zero code fixes required.

Test completion summary showing all 32 automated tests passed on first attempt with zero code fixes needed, and implementation matched the specification

Here’s the full results table:

Final score: 32/33 passed. 0 failed. 1 skipped (TC-003 — manual user testing). 0 known issues. 0 total fix attempts.

Test execution summary showing 32/33 passed, 0 failed, 1 skipped TC-003 for manual user testing, and all automated test cases passed on first attempt with zero code fixes

Let that sit for a second.

Every automated test passed on the first try. No code fixes needed. The implementation matched the specification because the specification had been thoroughly brainstormed, independently reviewed, and backed by a test plan before any code was written.

The ceremony I almost skipped? Turns out it was doing the heavy lifting all along.

.

.

.

Putting the App to Work

With all tests green, I could actually use the thing.

First up: the OAuth flow. I started the server and opened the setup page — a simple “Connect to Reddit” button.

Reddit Summarizer Setup page showing Not Connected status with an orange Connect to Reddit button

One click, and Reddit’s authorization page appeared.

Reddit OAuth authorization page asking to allow RdSummarizer to access posts and comments and maintain access indefinitely

After approving, the app received a refresh token and displayed it with clear instructions to add it to .env.

Reddit Summarizer success page showing the refresh token with a Copied button and instructions to add REDDIT_REFRESH_TOKEN to the .env file
Note: The refresh token is fake.

Token saved. Now I asked Claude to hit the /api/collect-all endpoint and pull data from both configured subreddits.

Claude Code running the collect-all endpoint with hours=24, showing successful collection from ClaudeCode and ClaudeAI subreddits with post counts

The data landed exactly where the specs said it would — JSON files organized by subreddit and date.

File explorer showing collected JSON data in logs folder organized by subreddit, with actual Reddit post data visible including titles, scores, and timestamps from the ClaudeCode subreddit
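Since the spec review flagged a path traversal issue, it is worth showing what a guard around this kind of storage path can look like. This is a Python sketch under my own assumptions: the `<subreddit>/<date>.json` layout and the function name are illustrative, and it is not the fix the app actually ships.

```python
from datetime import date
from pathlib import Path


def log_path(base_dir: Path, subreddit: str, day: date) -> Path:
    """Build <base_dir>/<subreddit>/<YYYY-MM-DD>.json, refusing to leave base_dir."""
    base = base_dir.resolve()
    target = (base / subreddit / f"{day.isoformat()}.json").resolve()
    # After resolving ".." segments, the target must still sit under base.
    if not target.is_relative_to(base):
        raise ValueError(f"Unsafe subreddit name: {subreddit!r}")
    return target


print(log_path(Path("logs"), "ClaudeCode", date(2026, 2, 1)))
```

A subreddit value such as `../../etc` resolves to a location outside `logs`, so the function raises instead of writing there. `Path.is_relative_to` needs Python 3.9 or newer.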

Now for the payoff.

I asked Claude to read the collected data and surface the latest Claude Code tips, workflows, and real-world case studies.

User prompt asking Claude to find and summarize the latest Claude Code tips, workflows, and case studies from the collected posts

The collected data was large — 64k tokens. Claude spawned 6 sub-agents to process it in parallel, each analyzing a chunk.

Claude processing the large data file with 6 parallel sub-agents, each analyzing a chunk of posts — ranging from 23.8k to 82.3k tokens
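I don't know exactly how Claude decided where to cut the file, but the general idea of fanning a large list out to a fixed number of workers can be sketched like this. The sketch splits by item count, not by tokens, and the post data is a placeholder.

```python
def split_into_chunks(items: list, chunk_count: int) -> list[list]:
    """Split items into chunk_count contiguous chunks whose sizes differ by at most one."""
    if chunk_count < 1:
        raise ValueError("chunk_count must be at least 1")
    base_size, extra = divmod(len(items), chunk_count)
    chunks, start = [], 0
    for index in range(chunk_count):
        # The first `extra` chunks each take one additional item.
        size = base_size + (1 if index < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


posts = [f"post-{n}" for n in range(20)]  # placeholder data
chunks = split_into_chunks(posts, 6)
print([len(chunk) for chunk in chunks])
```

Keeping the chunks contiguous preserves the original post order, so the summaries can be stitched back together in sequence.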

And here’s what came out — a synthesized summary of everything worth knowing from the last 24 hours across both subreddits:

Claude's synthesized insights showing top hacks like Force Opus sub-agents, hook-based context injection, notification sounds on Mac, and workflow optimizations including must-have settings and measure twice cut once workflows

Two subreddits. Hundreds of posts and comments. Distilled into actionable insights in under a minute.

I would never consume that volume of data and extract insights that fast by scrolling Reddit manually. The app collects and organizes. Claude analyzes and summarizes. And because all of this runs through my Claude subscription, there’s no separate API cost for the summarization part.

My morning Reddit scroll just went from 45 minutes to about 2.

.

.

.

Why the Workflow Made This Possible

You might be thinking: “Okay, but couldn’t you have built this without all the workflow steps? It’s just an Express server with some API calls.”

Honestly? Probably. This project is small enough that a skilled developer could prompt their way through it in one session.

But here’s what would have been different.

1. The spec review caught 10 issues before any code existed. 

Using GET for state-changing operations. Writing tokens to .env at runtime. Missing pagination limits. A path traversal vulnerability. Any one of these would have meant debugging sessions after implementation — or worse, shipping a security hole you never noticed.

2. The test plan gave implementation a concrete target. 

33 test cases, defined before Claude wrote a single line of code. When every task maps to specific success criteria, you don’t end up with “it seems to work” confidence. You end up with “every automated test passed on the first attempt” confidence. There’s a world of difference between those two.

3. Fresh sessions prevented context rot. 

The brainstorm session accumulated context from two rounds of Q&A. The review session started clean — and immediately found problems the brainstorming agent was blind to. The implementation used sub-agents in waves, each with its own fresh context window. No degradation. No forgotten requirements.

4. The artifacts served as shared memory. 

Every step read from the previous step’s output file. Specs fed the review. Reviewed specs fed the test plan. Test plan fed the implementation plan. Implementation plan fed the sub-agents. Nothing lived “in context.” Everything lived on disk, where any fresh session could pick it up.

And here’s the part I keep coming back to: the workflow scales. 

This project happened to be small.

The next one might not be.

And the exact same six commands:

  • /spec_brainstorm
  • /spec_review
  • /write_test_plan
  • /write_impl_plan
  • /do_impl_plan
  • /run_test_plan 

…will work the same way regardless of what you’re building.

You design the process once. You refine it over time. Then you apply it to everything.

That’s the whole promise of Claude Code Workflow Engineering. And I think this little Reddit project makes a decent case for it.

.

.

.

Your Turn

The full source code is on GitHub: reddit-summarizer

If you want to use the same workflow for your own projects, grab the Workflow Engineering Starter Kit — all six command files, ready to drop into your .claude/commands/ folder.

Here’s what I’d suggest:

  1. Pick a project idea you’ve been sitting on
  2. Write it down in an idea.md file — even a rough paragraph works
  3. Run the six-step workflow end to end
  4. Pay attention to what the spec review catches — that’s usually where the biggest surprise shows up

What are you going to build with it?

Go engineer it.

17 min read The Art of Vibe Coding

Workflow Engineering: Why Your AI Development Process Matters More Than Your Prompts

You open Claude Code.

You’ve got a feature to build — a complex one. Payment integration, subscription handling, admin dashboard, the works.

So you write the most detailed prompt you’ve ever crafted. 1000+ words. Every requirement listed. Edge cases mentioned. You even throw in a few “make sure you handle X” reminders for good measure.

(You’re being thorough. You’re being responsible. You’re practically writing documentation before the code even exists.)

You hit enter.

Claude gets to work.

Files appear. Functions materialize. Code flows like water.

Thirty minutes later, you look at the output.

Half your edge cases? Missing. The subscription lifecycle you described in exquisite detail? Partially implemented. That race condition you specifically warned about? Acknowledged in a code comment — a lovely, well-formatted code comment — but never actually handled.

So you do what every developer does.

You rewrite the prompt.

Make it longer. More specific. Add bold text for emphasis. Paste in code examples. Maybe underline something, just to really drive the point home.

Same result. Different gaps.

.

.

.

The Prompt Optimization Trap

Here’s the cycle most developers are stuck in right now:

The prompt keeps getting bigger. The results don’t keep getting better.

You’ve probably watched this happen in real-time.

The AI starts strong — the first few hundred lines look great. Then quality dips. Functions get shallower. Edge cases receive “TODO” comments instead of actual handling. By the end, Claude is running on fumes, juggling so much context that it’s forgetting what you said at the beginning of your very thorough, very responsible prompt.

Everyone’s response?

Write a better prompt. A clearer prompt. A more detailed prompt. I did this too. For longer than I’d like to admit.

Here’s what I learned after months of building complex features with Claude Code: the answer has nothing to do with writing better prompts.

The answer is designing better workflows.

.

.

.

From Prompts to Workflows

Stay with me here — because this is the shift that changed everything about how I work with AI.

Think about how you’d approach a complex feature without AI.

You wouldn’t sit down, write everything you know into one document, hand it to a junior developer, and say “build all of this.” That’s a recipe for disaster.

(And possibly a resignation letter.)

Instead, you’d break the work into phases.

Write specs first. Review them. Plan the implementation. Assign tasks. Verify the results. Each phase produces something concrete — a document, a plan, a test report — that feeds into the next phase.

The same principle applies to AI-assisted development. And it has a name.

Workflow Engineering is the practice of designing multi-step, artifact-driven processes where each step produces a concrete output that becomes the input for the next step — and where the process itself is reusable across projects.

Read that again.

Two words matter most:

Artifact-driven. Every step creates something tangible. A spec file. A test plan. An implementation plan. Not vibes. Not “context.” Actual files that exist on disk and can be read by a fresh session.

Reusable. The workflow works regardless of what feature you’re building. Payment integrations, admin dashboards, API endpoints, plugin architecture — the same sequence of steps applies every time.

Here’s the mental model shift:

With prompt thinking, you’re optimizing the message.

With workflow thinking, you’re optimizing the process.

One is fragile, project-specific, and impossible to debug when things go sideways. The other is robust, reusable, and traceable — meaning when something does go wrong (and it will, because software), you can trace exactly where the chain broke.

The question stops being “how do I write the perfect prompt to implement this feature?” and becomes something far more interesting: “what sequence of focused steps will reliably produce a working feature — regardless of what that feature is?”

That second question? That’s workflow engineering.

.

.

.

The Four Principles of Workflow Engineering

After months of building and refining workflows for Claude Code, I’ve distilled what makes them work down to four principles.

(Four! A reasonable number. I considered making it seven because odd numbers feel more authoritative, but that felt dishonest. Four is what I’ve got. Four is what works.)

These apply to any AI coding tool — Claude Code, Cursor, Copilot, Codex, whatever ships next quarter.

The tools will change.

These principles won’t.

Principle 1: Separate Thinking from Doing

When Claude is brainstorming specs, it shouldn’t be writing code. When it’s implementing, it shouldn’t be redesigning architecture. Mixing planning and execution causes both to suffer.

Here’s why.

Planning gets shallow when the agent is eager to start building.

It rushes through decisions because there’s code to write — ferpetesake, there are functions to create, endpoints to scaffold. Meanwhile, the code gets sloppy because the agent is still making design decisions mid-stream — changing its mind about architecture while simultaneously trying to implement it.

You’ve seen this happen.

Claude starts building a feature, realizes halfway through that the data model needs restructuring, pivots the architecture, and now half the code it already wrote doesn’t match the new approach.

The result? A Frankenstein codebase where the first half follows one pattern and the second half follows another.

Every step in a well-engineered workflow should be either a thinking step or a doing step.

The artifact that comes out of the thinking phase — the spec, the plan, the test cases — becomes the wall between them. By the time Claude starts coding, every design decision has already been made and documented.

No more mid-implementation architecture pivots. No more shallow plans that crumble at the first edge case.

Principle 2: Fresh Context, Always

Here’s something most developers learn the hard way. (I certainly did.)

AI performance degrades as context accumulates. The longer a session runs, the worse the output gets. Claude starts “forgetting” early instructions. It takes shortcuts. Details slip through the cracks like sand through fingers.

We call this context rot — and it’s the silent killer of ambitious AI projects.

Think of it like a multi-day hiking trip.

Day one, your backpack is light. You’re sharp, focused, covering ground fast. By day five — if you’ve been packing on top of yesterday’s gear without clearing anything out — you’re hauling 40 pounds of stuff you don’t need. Yesterday’s rain jacket (it’s sunny now). Tuesday’s extra water bottles (you passed a stream an hour ago). Your pace drops. Your attention narrows. You start missing trail markers because you’re too busy adjusting your shoulder straps.

That’s what happens to an AI agent running in a single session across a dozen tasks.

Workflow engineering forces natural context boundaries:

Each step runs in its own session. Each sub-agent gets a clean slate. The file carries knowledge forward. The context resets every time.

Fresh backpack. Every single morning.

Principle 3: Artifacts Over Memory

Don’t trust the AI to “remember” what you decided three steps ago.

(Don’t trust yourself to remember, either. I once forgot a critical API decision I made that same morning. Before coffee, but still.)

Every decision, every requirement, every edge case — externalized into a file.

Why? Three reasons.

  • A file can be read by a fresh session. This enables Principle 2. When a new session starts, it reads specs.md from disk — it doesn’t need to “recall” a conversation that happened two hours ago in a completely different context window.
  • A file can be reviewed by a different agent — or by you. This is how you catch mistakes before they compound. The spec review step? That’s a fresh agent reading the brainstorm agent’s output and poking holes in it. Adversarial quality control, built right into the workflow.
  • A file creates a traceable chain. If something breaks in implementation, you can walk the chain backwards to find exactly where things went wrong:

Without artifacts, every failure means starting from scratch.

With artifacts, every failure is traceable to a specific step. You fix that step. You re-run from that point. Everything downstream updates accordingly.

That’s the difference between “something broke” and “I know exactly where it broke.”

Principle 4: Define Success Before Starting Work

Write the test plan before the implementation plan.

(I know. I can feel you resisting this one through the screen.)

Most developers want to start building immediately.

Writing test cases for code that doesn’t exist yet feels like… paperwork. Busywork. The kind of thing a project manager suggests in a meeting you didn’t want to attend.

But for AI-driven development, it changes the entire outcome. Here’s why.

1/ Deep requirement analysis.

When Claude has to turn “handle race conditions during renewal processing” into a specific test case — with preconditions, exact steps, and expected outcomes — it has to deeply understand what that requirement actually means.

Shallow understanding produces shallow tests.

If the test plan looks thorough, the requirements were thoroughly analyzed.

2/ Gap detection before code exists.

A missing test case reveals a missing requirement. And finding a gap in your spec is a hundred times cheaper before implementation than after.

(Ask me how I know.)

3/ Clear implementation targets.

Every task in the implementation plan maps to specific test cases.

The developer — or AI agent — knows exactly what “done” means for each piece of work. No ambiguity. No interpretation. No “I thought you meant…”

You’re building toward a defined target instead of discovering the target while building.

Which sounds obvious when I write it out like that — but go look at your last three AI-assisted features and tell me you had a test plan before you started coding.

(No judgment. I didn’t either. Until I did.)
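The "every task maps to test cases" idea is easy to check mechanically. Here is a minimal Python sketch of that check, with hypothetical task names and test case IDs. It is my own illustration, not something the Starter Kit commands run.

```python
def uncovered_cases(test_case_ids: list[str], task_map: dict[str, list[str]]) -> list[str]:
    """Return test cases that no implementation task claims to satisfy."""
    claimed = {case_id for cases in task_map.values() for case_id in cases}
    return [case_id for case_id in test_case_ids if case_id not in claimed]


# Hypothetical plan: five test cases, two tasks.
all_cases = [f"TC-{n:03d}" for n in range(1, 6)]
task_map = {
    "Task 1: foundation": ["TC-001", "TC-002"],
    "Task 2: checkout": ["TC-003", "TC-004"],
}
print(uncovered_cases(all_cases, task_map))
```

Anything this returns is a test case with no owner, which means either a task is missing from the plan or the test case describes a requirement nobody intends to build.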

.

.

.

The System: Workflow Engineering in Practice

Principles are great.

Principles are necessary.

But at some point, you need to see them actually working — not just sounding wise on a page.

So let me show you the complete workflow engineering pipeline I’ve built and refined over the past several months for Claude Code. Six steps, four phases, every principle encoded into the process.

I’ve written deep-dives on each phase of this system:

This article is the why behind those hows.

Here’s the complete system at a glance:

Let me walk you through each step — what it does, why it exists, and what artifact it produces.

Step 1: Spec Brainstorm

Principle served: Artifacts Over Memory

You describe the feature you want.

But instead of Claude immediately starting to code, you trigger a question-asking mode: “Ask me clarifying questions until you are 95% confident you can complete this task successfully.”

That line is the key.

It tells Claude to stop assuming. Stop guessing. Stop filling in blanks with whatever seems reasonable.

Claude explores your codebase first — reading your existing patterns, your database schema, your current architecture. Then it starts asking questions, with options, explanations, and its own recommendation for each.

In my WooCommerce integration project, Claude asked 15 questions covering everything from subscription plugin choice to refund handling to email notifications. Edge cases I hadn’t thought about. Architectural decisions that would have bitten me weeks later.

Every answer gets compiled into a comprehensive specification document.

Artifact produced: notes/specs.md

👉 Deep-dive: The 3-Phase Method for Bulletproof Specs

Step 2: Spec Review

Principle served: Fresh Context

Start a new session. Fresh context. Then ask Claude to critique its own work.

Why a new session?

Because the brainstorming session’s context is bloated with 15 rounds of Q&A. A fresh agent reading the specs with skeptical eyes catches things the original agent — who was busy building the specs — overlooked.

In my project, the review found 14 potential issues, including a race condition that would have caused double charges (ferpetesake, the payments!), a token deletion scenario that would silently break renewals, and a mode-switching conflict that would have confused billing for every active subscriber.

You pick which issues matter. Claude fixes them — with another round of clarifying questions to make sure the fixes are solid.

Artifact produced: refined notes/specs.md

👉 Deep-dive: The 3-Phase Method for Bulletproof Specs

Step 3: Test Plan

Principle served: Define Success Before Starting Work

Before writing any implementation code, Claude reads the specs and generates a structured test plan. Every requirement becomes a test case with preconditions, specific steps, expected outcomes, and priority levels.

For my WooCommerce project: 38 test cases organized into 12 sections. 7 Critical, 20 High, 11 Medium.

This serves a dual purpose.

It verifies Claude deeply understood every requirement — shallow understanding produces shallow test cases, so thorough tests mean thorough comprehension. And it creates the success criteria that will drive everything that follows.

Artifact produced: notes/test_plan.md

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 4: Implementation Plan

Principle served: Separate Thinking from Doing

Claude reads both the specs and test plan, then generates an implementation plan. Tasks are grouped logically, dependencies are identified, and every task maps to the specific test cases it will satisfy.

For my project: 4 phases, 12 tasks, each explicitly linked to test cases. Phase 1 handles foundation (TC-001 to TC-007). Phase 2 tackles checkout and lifecycle (TC-008 to TC-014). Phase 3 addresses the critical renewal processing (TC-015 to TC-021). Phase 4 covers remaining features (TC-022 to TC-037).

This is the last thinking step.

After this, the wall goes up. Every design decision has been made. Every task has a clear target. Now — and only now — we build.

Artifact produced: notes/impl_plan.md

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 5: Execute Implementation

Principle served: Fresh Context

Here’s where sub-agents change everything.

Instead of running all 12 tasks in one session (guaranteed context rot), Claude creates each task using the built-in task management system, identifies dependencies, and processes them in waves. Each task runs in its own sub-agent with fresh context.

Fresh backpack. Laser focus.

For my project:

  • Wave 1: 2 sub-agents (foundation tasks, no dependencies)
  • Wave 2: 2 sub-agents (checkout + lifecycle, depend on Wave 1)
  • Wave 3: 3 sub-agents (critical renewal processing)
  • Wave 4: 6 sub-agents (remaining features)

Total time: 52 minutes. 13 tasks completed. 38 test cases worth of functionality implemented. Each sub-agent used ~18% context — compared to ~56% if everything had run in a single session.

Artifact produced: working code across all specified files

👉 Deep-dive: How to Make Claude Code Actually Build What You Designed

Step 6: Run Test Plan

Principle served: All four principles working together

The final step.

Claude reads the test plan, creates one task per test case, analyzes dependencies between tests, and executes them sequentially — one sub-agent per test, each with fresh context.

If a test fails, the sub-agent analyzes the root cause, implements a fix, and re-runs the test. Up to 3 attempts. If it still fails after 3 tries, it gets marked as a known issue with reproduction steps and a suggested fix.

For my project: 30 tests. 2 hours 12 minutes. All passed. One bug found and autonomously fixed during TC-002 — a settings save handler that wasn’t persisting color options. Found, diagnosed, fixed, re-verified. All without me touching the keyboard.

Results get logged in two places: test-status.json for machine parsing, test-results.md for human review.
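The retry-then-record loop can be sketched in Python as follows. I am assuming "up to 3 attempts" means three runs of the test in total, and the status names and report layout are my own, not the actual schema of test-status.json.

```python
import json
from typing import Callable

MAX_ATTEMPTS = 3  # assumed to mean three runs of the test in total


def run_with_fixes(case_id: str, run_test: Callable[[], bool],
                   attempt_fix: Callable[[], None]) -> dict:
    """Run a test, attempting a fix after each failure, up to MAX_ATTEMPTS runs."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if run_test():
            return {"id": case_id, "status": "passed", "attempts": attempt}
        if attempt < MAX_ATTEMPTS:
            attempt_fix()
    return {"id": case_id, "status": "known_issue", "attempts": MAX_ATTEMPTS}


def render_reports(results: list[dict]) -> tuple[str, str]:
    """Produce the machine-readable JSON and a human-readable Markdown table."""
    status_json = json.dumps({"results": results}, indent=2)
    lines = ["| Test | Status | Attempts |", "| --- | --- | --- |"]
    lines += [f"| {r['id']} | {r['status']} | {r['attempts']} |" for r in results]
    return status_json, "\n".join(lines)


# Simulated test that fails once, then passes after a "fix".
state = {"fixed": False}
result = run_with_fixes(
    "TC-002",
    run_test=lambda: state["fixed"],
    attempt_fix=lambda: state.update(fixed=True),
)
status_json, results_md = render_reports([result])
print(results_md)
```

Recording the attempt count alongside the status is what lets you tell "passed cleanly" apart from "passed after a fix" when you read the report later.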

Artifact produced: notes/test-results.md and notes/test-status.json

👉 Deep-dive: Claude Code Testing: The Task Management Approach That Actually Works

The Complete Artifact Chain

Look at how everything connects:

Nothing lives in memory.

Everything lives in files. Every step reads from the previous step’s artifact. If something goes wrong at Step 5, you trace backwards through the chain to find exactly which artifact — which decision — needs fixing.

No more “something broke somewhere, guess we start over.” Just: “the impl plan missed a dependency — let me fix Step 4 and re-run from there.”
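As a small Python sketch of that "fix one step, re-run from there" rule: the step names and artifact paths below are taken from the walkthrough above, while the function itself is my own illustration.

```python
# Step order and artifact names as listed in the walkthrough.
PIPELINE = [
    ("Spec Brainstorm", "notes/specs.md"),
    ("Spec Review", "notes/specs.md"),
    ("Test Plan", "notes/test_plan.md"),
    ("Implementation Plan", "notes/impl_plan.md"),
    ("Execute Implementation", "source files"),
    ("Run Test Plan", "notes/test-results.md"),
]


def steps_to_rerun(fixed_step: str) -> list[str]:
    """Everything from the repaired step onward is stale and must run again."""
    names = [name for name, _artifact in PIPELINE]
    return names[names.index(fixed_step):]


print(steps_to_rerun("Implementation Plan"))
```

Steps before the repaired one keep their artifacts untouched, which is the practical payoff of having each decision in its own file.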

I’ve packaged all six prompt files into a Workflow Engineering Starter Kit. Drop them into your .claude/commands/ folder and the entire pipeline is ready to go.

.

.

.

Design Your Own Workflows

The six-step system above is one example — a specific workflow I’ve built for feature implementation with Claude Code.

But the principles behind it apply to any multi-step AI task.

Writing a technical article, planning a product launch, migrating a database, refactoring a legacy codebase. Same principles. Different steps.

The specific prompts change. The tools change. The principles stay constant.

Here’s a checklist you can run before starting any complex AI task — five questions that reveal whether your process has gaps:

Five questions. If any answer is “no,” your workflow has a gap.

  • The artifact test catches phantom steps — work that happens “in context” but produces nothing concrete. Those are the steps where information vanishes between sessions.
  • The thinking/doing test catches the most common mistake in AI-assisted development: asking an AI to plan and build in the same breath. Every time you let that happen, both the plan and the build suffer.
  • The context boundary test catches rot before it starts. If you can’t point to where sessions should reset, you’ll end up with one massive session that degrades across every task.
  • The success definition test catches the “just build it and we’ll see” trap. Without defined success criteria, you have no way to verify the output — and no target for the AI to aim at.
  • The traceability test catches broken chains. If you can’t walk backwards from a failure to its root cause, your artifacts aren’t detailed enough to serve as the connective tissue between steps.
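If you like your checklists executable, four of those five tests can be approximated in Python. This is a loose sketch with field names I made up, and the example workflow is hypothetical. The traceability test is left out because it depends on how detailed the artifacts are, which needs a human read.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One workflow step. Field names are illustrative, not a fixed schema."""
    name: str
    kind: str                   # "thinking", "doing", or "mixed"
    artifact: str = ""          # file the step writes, empty if none
    fresh_session: bool = False
    success_criteria: str = ""  # how you will know the step worked


def find_gaps(steps: list[Step]) -> list[str]:
    """Apply the artifact, thinking/doing, context boundary, and success checks."""
    gaps = []
    for step in steps:
        if not step.artifact:
            gaps.append(f"{step.name}: no artifact (phantom step)")
        if step.kind not in ("thinking", "doing"):
            gaps.append(f"{step.name}: mixes thinking and doing")
        if not step.fresh_session:
            gaps.append(f"{step.name}: no context boundary")
        if not step.success_criteria:
            gaps.append(f"{step.name}: success is undefined")
    return gaps


workflow = [
    Step("Outline article", "thinking", "outline.md", True, "Outline covers every section"),
    Step("Draft and restructure", "mixed", "", False, ""),
]
for gap in find_gaps(workflow):
    print(gap)
```

The first step passes cleanly. The second trips all four checks, which is usually a sign it should be split into a planning step and an execution step.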

.

.

.

The Skill That Compounds

Here’s what I want you to take away from all of this.

The six prompt files in the Starter Kit will be outdated eventually.

Claude Code will add new features. The task management API might change. New AI tools will emerge that handle things we can’t even imagine yet.

The workflow engineering thinking behind those prompts won’t age.

Separate thinking from doing. Reset context at natural boundaries. Externalize decisions into artifacts. Define success before you start building. These principles work today with Claude Code. They’ll work next year with whatever comes next.

And here’s the compounding part — the part that makes this a skill and not just a technique: every workflow you design teaches you to design better workflows.

You start noticing patterns. Where context rot creeps in. Where planning and execution get tangled. Where artifacts need more detail. Your workflows get tighter with each project. Your instincts sharpen.

The developers who will thrive in AI-assisted development over the next few years won’t be the ones who write the best prompts.

They’ll be the ones who engineer the best workflows.

.

.

.

Your Next Steps

  1. Download the Workflow Engineering Starter Kit: all six prompt files, ready to drop into .claude/commands/
  2. Run the checklist against your current process — find the gaps
  3. Try the full pipeline on your next feature — specs through testing
  4. Refine what works, replace what doesn’t

What feature are you going to build with this workflow?

Pick one.

Run the pipeline. See what happens when Claude has a structured process to follow instead of a single prompt to interpret.

Go engineer it.


P.S. — For the deep-dives on each phase, start here:

8 min read The Art of Vibe Coding

The Claude Code Skill Creator Now Has Evals (And My Skills Finally Have Proof They Work)

Watch the video walkthrough, or read the full written guide below.

Here’s a confession.

For months, I’ve been building Claude Code skills with what I can only describe as the “hope and pray” methodology. Write the SKILL.md. Test it once. Ship it. Whisper a small prayer to the LLM gods. Move on with my life.

Did the skill actually trigger when it should? ¯\_(ツ)_/¯

Did it make Claude’s output better? Honestly… no idea.

I’ve been using skills since they were added to Claude Code — and until last week, I had zero way to answer either of those questions.

(Stay with me. This story has a happy ending.)

.

.

.

The Problem With Skills (That Nobody Wants to Admit)

Here’s the thing about Claude Code skills: they’re just text prompts. Fancy, well-organized text prompts — but text prompts nonetheless.

And text prompts don’t come with test suites.

I’ve built dozens of skills over the past few months. Frontend design patterns. WordPress security checklists. Newsletter writing styles. Documentation generators. Each one followed the same ritual:

  • Write a SKILL.md file
  • Test it manually (once, maybe twice if I’m feeling thorough)
  • Hope it works
  • Wonder — weeks later — if it’s actually triggering
  • Wonder — with increasing anxiety — if it’s helping when it does trigger
  • Have absolutely no data to know either way

The old skill-creator plugin could generate skills for you, which was genuinely useful. But it had no evals. No testing. No benchmarks. You’d create a skill, and then… that was it. Cross your fingers, close the terminal, pretend everything was fine.

I kept using skills because they felt useful. But I couldn’t prove it. I couldn’t point to a number and say “this skill improves output quality by 9.5%.”

Every skill I created was a guess. A lovingly crafted, well-intentioned guess — but a guess.


The Upgrade That Changes Everything

The Claude Code skill creator plugin just got a massive upgrade. And honestly? It solves the exact problem I’ve been complaining about for months.

The new version adds something skills have never had: a testing and benchmarking layer.

Claude Code plugin discovery interface showing the skill-creator plugin by claude-plugins-official with 19.1K installs and description "Create new skills, improve existing skills, and measure s..."

Here’s what the updated skill creator can do:

  • Create skills from your requirements (same as before)
  • Generate evals — actual test cases — automatically
  • Run parallel A/B benchmarks comparing skill vs. baseline Claude
  • Optimize trigger descriptions so your skill activates when it should
  • Iterate until the skill measurably improves output

That last part bears repeating: measurably improves output. With numbers. And charts. And side-by-side comparisons.

Let me show you how this works with a real skill I built last week.

.

.

.

Building a WordPress Security Review Skill (The Whole Process)

I’ve built several WooCommerce plugins, which means security reviews are part of my regular workflow. But Claude’s baseline security reviews felt… inconsistent. Sometimes thorough, sometimes surface-level. No predictable structure.

Perfect candidate for a skill.

Step 1: Describe What You Want

I asked Claude Code to create a skill using the skill-creator plugin:

Claude Code terminal showing user prompt requesting a skill called "wp-security-review" that reviews WordPress plugin PHP code for security vulnerabilities including SQL injection, XSS, CSRF, insecure direct object references, missing capability checks, unsafe file operations, insecure superglobal usage, and hardcoded secrets.

My prompt included the specific vulnerability types I wanted covered: SQL injection, XSS, CSRF, missing nonce verification, insecure $_GET/$_POST usage, and more.

Step 2: The Skill Creator Explores Your Codebase

Here’s where things get interesting.

Claude loaded the skill-creator skill and immediately started exploring my project:

Claude Code terminal showing skill-creator successfully loaded, then searching for 2 patterns and reading files to understand the project structure, existing security references, and PHP patterns before creating the skill.

The skill-creator looked at my existing code, found security patterns already in the project, and used that context to build a skill tailored to my codebase. (Not a generic one-size-fits-all approach.)

Step 3: The Generated Skill

Claude wrote 330 lines to .claude/skills/wp-security-review/SKILL.md:

Claude Code terminal showing the created wp-security-review skill with 330 lines written, including a description covering SQL injection, XSS, CSRF, missing capability checks, unsafe file operations, and hardcoded secrets. Also shows 3 test prompts: reviewing CartHandler.php, checking BulkActions.php, and doing a full plugin security audit.

The skill included:

  • A detailed trigger description (optimized for when Claude should activate it)
  • A vulnerability checklist with 8 categories
  • WooCommerce-specific nuances — like wc_price() double-escaping and WC Settings API nonce delegation
  • Structured output format with severity ratings

All good stuff. But here’s the thing: a skill is only as good as its results.

And until now, I had no way to measure those results.

.

.

.

The Part That Made Me Actually Stop and Stare: Evals

After creating the skill, Claude immediately said: “Now let me set up test cases and run them.”

Wait, what?

Claude Code terminal showing creation of evals.json file with test cases including prompts like "Review the CartHandler.php for security issues" with expected outputs describing structured security reports identifying $_POST sanitization issues, nonce verification patterns, and price manipulation risks.

The skill-creator generated an evals.json file with:

  • 3 test prompts targeting different aspects of my plugin
  • Expected outputs for each test
  • Specific files to review
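
To make the shape of that file concrete, here is a rough sketch of what such an eval set could look like when built from Python. The field names (`skill_name`, `evals`, `prompt`, `expected_output`, `files`) and the bare file names are my assumptions for illustration, not the plugin's documented schema, and the second and third prompts are paraphrased from the test descriptions rather than copied from the real file.

```python
import json

# Hypothetical layout: field names are assumptions, not the plugin's schema.
evals = {
    "skill_name": "wp-security-review",
    "evals": [
        {
            "prompt": "Review the CartHandler.php for security issues",
            "expected_output": "Structured report flagging $_POST sanitization, "
                               "nonce verification, and price manipulation risks",
            "files": ["CartHandler.php"],  # placeholder path
        },
        {
            "prompt": "Check BulkActions.php for security problems",  # paraphrased
            "expected_output": "Findings with severity levels, covering $_GET "
                               "handling and capability checks",
            "files": ["BulkActions.php"],  # placeholder path
        },
        {
            "prompt": "Do a full security audit of this plugin",  # paraphrased
            "expected_output": "Coverage of all files with summary counts and "
                               "a passed-checks section",
            "files": [],  # assumption: empty list means the whole plugin
        },
    ],
}


def missing_fields(case: dict) -> list[str]:
    """Return the required keys that a test case does not define."""
    required = ("prompt", "expected_output", "files")
    return [name for name in required if name not in case]


for case in evals["evals"]:
    assert missing_fields(case) == [], missing_fields(case)

print(json.dumps(evals, indent=2))
```

The small `missing_fields` check is only there to show that an eval file is plain data you can lint before spending tokens on a run.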

And then — and I genuinely did not expect this — it launched parallel agents.

Running 6 Agents Simultaneously

Claude Code terminal showing 6 agents launched in parallel - 3 "with skill" runs and 3 "without skill" baseline runs for CartHandler, BulkActions, and Full audit test cases, all running in the background simultaneously.

Claude launched 6 parallel agents:

  • 3 running the tests with the skill
  • 3 running the same tests without the skill (baseline Claude)
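
The fan-out is just a 3 × 2 matrix: every test case runs once per configuration. A minimal sketch of that pattern, assuming a stand-in `run_agent` function (hypothetical; the real plugin launches Claude Code agents, which this does not attempt to reproduce):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

EVALS = ["CartHandler", "BulkActions", "Full audit"]
CONFIGS = ["with_skill", "without_skill"]


def run_agent(eval_name: str, config: str) -> dict:
    """Stand-in for one agent run; a real runner would call the model here."""
    return {
        "eval": eval_name,
        "config": config,
        "report": f"(report for {eval_name}, {config})",
    }


def run_matrix() -> list[dict]:
    """Run every (eval, config) pair concurrently and collect the results."""
    jobs = list(product(EVALS, CONFIGS))
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(run_agent, name, config) for name, config in jobs]
        return [future.result() for future in futures]


results = run_matrix()
print(len(results), "runs completed")
```

Pairing each with-skill run with a baseline run on the identical prompt is what makes the later comparison fair.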

While those ran in the background, Claude drafted the evaluation assertions:

Claude Code terminal displaying detailed evaluation assertions for each test - Eval 1 (CartHandler) checking for severity ratings, identifying unsanitized $_POST, nonce patterns, structured reports, and no false positives on WC hooks. Eval 2 (BulkActions) checking for $_GET without nonce, capability checks, severity levels. Eval 3 (Full audit) checking coverage of all files, summary counts, cross-file patterns, and passed checks sections.

Each test case had specific pass/fail criteria:

  • has_severity_ratings — Findings must be categorized by severity
  • identifies_unsanitized_post — Must flag $_POST['wcpwyw_price'] at line 126
  • no_false_positives_on_wc_hooks — Must NOT flag WC Settings API as missing nonces
  • includes_passed_checks — Must list what passed, not just failures
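
To show what "pass/fail" means in practice, here is a simplified sketch that treats three of those assertions as plain string checks over a report. This is my own approximation: the actual grading is presumably done by the model reading the report, and a check like no_false_positives_on_wc_hooks needs judgment that a string match cannot provide, so it is left out. The sample report text is invented for the example.

```python
import re

SEVERITY_TAG = re.compile(r"\[(CRITICAL|HIGH|MEDIUM|LOW|INFO)\]")


def has_severity_ratings(report: str) -> bool:
    """Pass if at least one finding carries a [SEVERITY] tag."""
    return bool(SEVERITY_TAG.search(report))


def identifies_unsanitized_post(report: str) -> bool:
    """Pass if the report names the unsanitized superglobal read."""
    return "$_POST['wcpwyw_price']" in report


def includes_passed_checks(report: str) -> bool:
    """Pass if the report has a section listing what passed."""
    return "passed checks" in report.lower()


ASSERTIONS = {
    "has_severity_ratings": has_severity_ratings,
    "identifies_unsanitized_post": identifies_unsanitized_post,
    "includes_passed_checks": includes_passed_checks,
}


def grade(report: str) -> dict[str, bool]:
    """Evaluate every assertion against one report."""
    return {name: check(report) for name, check in ASSERTIONS.items()}


# Invented sample text, not output from the real skill.
sample_report = (
    "[HIGH] Unsanitized $_POST['wcpwyw_price'] read in the cart handler\n"
    "Passed Checks\n"
    "- Nonce verified on the settings form\n"
)

print(grade(sample_report))
```

Each assertion returns a boolean, which is what lets the benchmark report a pass rate like 21/21 instead of a vague impression.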

(Real criteria. Measurable criteria. I could have cried.)

.

.

.

Results Rolling In

As the agents completed, results started appearing:

Claude Code terminal showing Eval 2 BulkActions with skill completed - security review output with summary showing 0 Critical, 0 High, 1 Medium, 2 Low, 1 Info findings, plus timing data showing 38,976 tokens used in 227 seconds.
Claude Code terminal showing Eval 2 BulkActions without skill (baseline) completed - 135 lines of output with 31,386 tokens used in 236 seconds. Notes that both with-skill and without-skill produced similar findings for this relatively simple file, but the with-skill version used structured severity format and checklist notation while baseline used freeform numbered lists.
Claude Code terminal showing all 6 agents (Eval 1, 2, 3 with and without skill) completed, now extracting remaining reports from transcripts. Shows Eval 3 Full audit with skill reviewing 23 files with findings categorized as 0 Critical, 2 High, 4 Medium, 5 Low, 3 Info.
Claude Code terminal showing all 6 reports saved, now generating the eval viewer for side-by-side comparison. Writing benchmark.json with skill_name "wp-security-review" and configurations for "with_skill" and "without_skill" runs.

All 6 agents finished. Claude compiled the results and generated something I’ve never seen in skill development before.

.

.

.

The Eval Review Playground

Claude generated an HTML-based eval viewer and opened it in my browser:

Browser-based eval review interface titled "Eval Review: wp-security-review" showing 1 of 6 test cases. Displays "WITH SKILL" tag, prompt "Review the CartHandler.php for security issues", and output showing a structured security review with Summary (0 Critical, 2 High, 2 Medium, 2 Low, 2 Info) and Findings section with severity-tagged issues like "[HIGH] Price Manipulation via Cart Session - Missing Server-Side Re-validation in applyCartItemPrice".
Browser-based eval review interface showing same prompt but with "WITHOUT SKILL" tag (baseline). Output shows a different format - plain "Security Review: CartHandler.php" header with file path, date, and "Claude Opus 4.6 (automated review)" as reviewer, followed by Executive Summary section in prose format rather than structured findings list.

Side-by-side comparison. Same prompt, same file, two different approaches.

The difference was immediately visible:

  • With skill: [HIGH] Price Manipulation via Cart Session — structured, scannable, severity-tagged
  • Without skill: Prose-style Executive Summary, harder to scan

But subjective impressions only get you so far. Here’s where the numbers come in.

.

.

.

The Benchmark Results (This Is the Good Part)

Claude Code terminal showing eval viewer opened in browser with benchmark comparison table. Metrics show: Pass rate 100% (21/21) with skill vs 90.5% (19/21) baseline (+9.5% delta). Avg tokens 74,427 with skill vs 69,734 baseline (+6.7%). Avg time 276s with skill vs 307s baseline (9.9% faster). Key differences noted: skill version elevated price cap bypass to HIGH severity, avoided false positives on WC nonces, produced more structured passed-checks sections.
Metric     | With Skill   | Baseline      | Delta
Pass rate  | 100% (21/21) | 90.5% (19/21) | +9.5%
Avg tokens | 74,427       | 69,734        | +6.7%
Avg time   | 276s         | 307s          | 9.9% faster

👉 The skill achieved a 100% pass rate across all 21 assertions.

Baseline Claude hit 90.5% — missing structured passed-checks sections and some WooCommerce-specific nuances.

And here’s the kicker: the skill was actually faster (276s vs. 307s on average) despite being more thorough, at a cost of about 6.7% more tokens.
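
If you want to sanity-check the headline numbers yourself, the arithmetic is short. The sketch below recomputes the pass rates and the token overhead from the figures reported above; note that the +9.5% pass-rate delta is a difference in percentage points, while the token figure is a relative change.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


with_skill = pass_rate(21, 21)
baseline = pass_rate(19, 21)
delta_points = with_skill - baseline  # percentage points, not relative change

# Relative increase in average tokens for the with-skill runs.
token_overhead = 100.0 * (74_427 - 69_734) / 69_734

print(f"{with_skill:.1f}% vs {baseline:.1f}% ({delta_points:+.1f} points)")
print(f"token overhead: {token_overhead:+.1f}%")
# prints:
# 100.0% vs 90.5% (+9.5 points)
# token overhead: +6.7%
```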

Key differences Claude identified:

  • Skill version elevated price cap bypass and session revalidation to HIGH severity (correct — these affect money)
  • Skill version correctly noted that current_user_can checks are unnecessary on frontend AJAX handlers
  • Skill version had more structured passed-checks sections
  • Baseline missed nuances like WC Settings API nonce delegation

For the first time ever, I had proof my skill was helping.

.

.

.

Iteration: When The Evals Teach You What to Fix

Here’s where it gets even better.

Looking at the results, Claude identified areas where even the improved skill could be better:

Claude Code terminal showing user saying "please proceed" and Claude responding with 3 specific improvements to make based on analyzing all 6 reports: 1. Add guidance on reviewing JavaScript files (baseline caught DOM XSS in JS that skill missed), 2. Clarify severity calibration for wp_unslash on numeric fields (inconsistent HIGH/MEDIUM ratings), 3. Add note about checking business logic consistency across code paths (price cap bypass was the most valuable finding)

Claude proposed 3 refinements:

  1. Add JavaScript file guidance — The baseline actually caught DOM XSS in JS that the skill version missed (oops)
  2. Calibrate severity for wp_unslash — Numeric field sanitization was rated inconsistently between HIGH and MEDIUM
  3. Add business logic consistency checks — The price cap bypass (validation in one path but not another) was the most valuable finding type
Claude Code terminal showing detailed diff of SKILL.md updates - adding JavaScript file scanning guidance for DOM-based XSS, adding notes about WooCommerce-specific escaping patterns for wc_price(), calibrating wp_unslash severity on numeric fields to Low instead of High, and adding business logic consistency checking to catch when validation exists in one code path but not another.

The skill-creator modified SKILL.md with surgical precision:

  • Added DOM-based XSS scanning for JavaScript files
  • Added guidance on wc_price() double-escaping
  • Calibrated wp_unslash on numeric fields to Low severity
  • Added business logic consistency patterns

The evals taught me where my skill was weak. Then Claude fixed it.

.

.

.

Trigger Description Optimization

A brilliant skill that never activates is… well, useless.

Claude ran a trigger optimization loop:

Claude Code terminal showing skill updated with 3 improvements, now running description optimization. Creating trigger-eval.json with test queries like "I just wrote a new AJAX handler for my WooCommerce plugin that lets admins delete orders. Can you check if it's secure?" with should_trigger: true. Running optimization loop in background (up to 5 iterations).

The skill-creator generated test prompts that should trigger the skill, and prompts that shouldn’t. Then it ran those prompts against Claude to see if the skill activates correctly.
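
Scoring a trigger eval comes down to a confusion matrix. Here is a minimal sketch, assuming each case has a `query` and a `should_trigger` flag (the flag name appears in the screenshot; `query` is my guess). The first query is the one from the screenshot; the other three are invented. The `keyword_stub` function is a deliberately naive stand-in for "did the skill activate?", so it produces one miss and one false alarm.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TriggerCase:
    query: str
    should_trigger: bool


def score(cases: list[TriggerCase], did_trigger: Callable[[str], bool]) -> dict[str, int]:
    """Count true/false positives and negatives for a trigger predicate."""
    counts = {"true_positive": 0, "false_positive": 0,
              "false_negative": 0, "true_negative": 0}
    for case in cases:
        fired = did_trigger(case.query)
        if fired and case.should_trigger:
            counts["true_positive"] += 1
        elif fired and not case.should_trigger:
            counts["false_positive"] += 1
        elif not fired and case.should_trigger:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


def keyword_stub(query: str) -> bool:
    """Naive stand-in: fires whenever the query mentions 'secure'."""
    return "secure" in query.lower()


cases = [
    TriggerCase("I just wrote a new AJAX handler for my WooCommerce plugin that "
                "lets admins delete orders. Can you check if it's secure?", True),
    TriggerCase("Write a changelog entry for version 2.1 of my plugin.", False),
    TriggerCase("Can you look for SQL injection in my plugin's export code?", True),
    TriggerCase("How do I secure a good price on hosting?", False),
]

print(score(cases, keyword_stub))
```

The two error buckets map directly onto the two failure modes: a false negative is a skill that stays silent when it should fire, and a false positive is one that fires when it shouldn't.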

Claude Code terminal showing optimization running (36+ API calls to Opus). Summarizes current state: skill at ~350 lines covering 8 vulnerability categories, structured report format with severity ratings, WooCommerce-specific nuances. Benchmark results: 100% pass rate with skill vs 90.5% baseline, skill was 9.9% faster. Lists 4 improvements applied: JS/DOM scanning, cross-path consistency checking, wp_unslash calibration, and WP bulk actions nuances.

After multiple iterations, trigger accuracy improved. The skill now fires when it should — and stays quiet when it shouldn’t.

.

.

.

The Final Skill

Claude Code terminal showing "The skill is complete" with final summary. Skill created: wp-security-review at .claude/skills/wp-security-review/SKILL.md. Reviews WordPress plugin PHP and JS code for 8 categories of vulnerabilities including SQL injection, XSS (including DOM XSS), CSRF, IDOR, missing capability checks, unsafe file operations, insecure superglobals, and hardcoded secrets. Lists unique value over baseline: structured [SEVERITY] format, comprehensive passed checks section, WooCommerce-specific nuances, cross-path consistency checking, and correct severity calibration.
VS Code file explorer showing the wp-security-review skill folder structure with evals subfolder containing evals.json and SKILL.md file.

The completed skill:

  • Reviews WordPress plugin PHP and JS code
  • Covers 8 vulnerability categories
  • Produces structured [SEVERITY] tagged output
  • Includes WooCommerce-specific nuances (nonce delegation, wc_price() escaping, frontend vs admin hooks)
  • Catches business logic inconsistencies (validation in one path but not another)
  • Benchmarks at 100% pass rate vs 90.5% baseline

And I have the data to prove it works.

.

.

.

Why This Matters For Your Skills

The Claude Code skill creator fundamentally changes what’s possible.

👉 Before: Skills were art. Intuition. Trial and error. Hope and prayer.

👉 After: Skills are engineering. Testable. Measurable. Improvable.

Here’s what becomes possible:

1. A/B Test Every Skill You Build

Every skill you create can be benchmarked against baseline Claude. If your skill doesn’t measurably improve output, you know immediately — before you ship it, not three weeks later.

2. Catch Regressions When Models Update

When Claude Opus 5.0 ships, run your benchmarks again. If baseline now matches or exceeds your skill’s performance, the skill may be locking in outdated patterns. Time to retire it — or improve it.
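
That decision rule is simple enough to write down. A small sketch, with a function name and return labels I made up for illustration:

```python
def recommend(with_skill_rate: float, baseline_rate: float) -> str:
    """Suggest what to do with a skill after re-running its benchmark.

    Rates are pass rates in percent. If the baseline matches or beats
    the skill, the skill is no longer adding measurable value.
    """
    if baseline_rate >= with_skill_rate:
        return "retire or improve"
    return "keep"


# Today's numbers from the benchmark above.
print(recommend(100.0, 90.5))   # prints "keep"

# Hypothetical re-run after a model update where the baseline caught up.
print(recommend(100.0, 100.0))  # prints "retire or improve"
```

In practice you would probably want a margin rather than a strict comparison, since a single run of 21 assertions is a small sample.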

3. Tune Your Trigger Descriptions

A skill that triggers 50% of the time is only half as valuable. The description optimizer catches false positives (triggering when it shouldn’t) and false negatives (not triggering when it should).

4. Run Continuous Improvement Loops

Each eval run produces actionable feedback. Claude identifies gaps, proposes fixes, and re-benchmarks — all without you manually debugging SKILL.md files at midnight.

.

.

.

Your Next Steps

  1. Open Claude Code
  2. Type /plugin and search for skill-creator
  3. Install the official Anthropic plugin (19,100+ installs and counting)
  4. Pick one skill you’ve already built — or a new one you’ve been meaning to create
  5. Ask Claude to create evals and benchmark it
  6. Watch the data tell you exactly where to improve

What skill are you going to benchmark first?

The developers who run evals will build better skills than those who don’t. That’s just… math.

Go build yours.

Now.