

11 min read The Art of Vibe Coding

Generate Images in Claude Code (using Codex plugin)

Generate Images in Claude Code — Without Even Asking for a Command
Watch the video walkthrough, or read the full written guide below.

You’re three hours into a Claude Code session.

The feature works. The tests pass. All that’s left is the picture at the top — a featured image, a hero illustration, something to make the thing look finished.

So you ask for it. Plainly, the way you’d ask a teammate sitting next to you:

“Generate an image of a minimalist line-art lighthouse on a dark navy background.”

And Claude Code apologizes.

Claude Code responding to a plain image request with "The image generation isn't something I can do — I don't have an image generation tool available in this environment," then offering SVG line-art or ASCII art instead

It offers to write you an SVG. Or some ASCII art.

Helpful — in the way a hardware store is helpful when you walked in for a sandwich.

I’ve done this more times than I want to admit — alt-tabbed to Codex for a quick image, got sidetracked tweaking the prompt, and came back to Claude Code ten minutes later having forgotten what I was about to commit. That little shrug is the whole problem in one frame. The capability you want lives one tool over, in Codex. The usual move is to alt-tab away to go get it — and the flow you spent three hours building quietly evaporates the moment you leave the window.

By the end of this post, that same sentence — no command, no tool-switch — produces a real PNG sitting neatly inside your project.

And the way it gets there opens a door bigger than any single image.

.

.

.

Why Claude Code Can’t Make Images (And Codex Can)

Let’s be fair to Claude Code first.

The limitation is a deliberate product boundary.

Claude Code is a coding harness — it edits files, runs commands, reasons about your codebase, and wires things together. Picture generation was simply never wired into that toolset. The model underneath can reason about images perfectly well; the harness around it just has no tool to make one.

So Claude does the honest thing and tells you.

You can feel how badly people want that gap filled.

Search around and you’ll find a small industry of “how to generate images in Claude Code” guides — an MCP server here, an external CLI there, a paid wrapper somewhere else. When that many workarounds exist for one missing capability, the demand is obvious. And so is the native answer, which is still no.

Codex took the other road.

In April 2026, OpenAI shipped gpt-image-2, a purpose-built image model, and made it the default for image work in Codex — a clear step up from the version it replaced, with sharper output and the ability to reason about a layout before it draws. Within weeks the older DALL-E models were retired from the API entirely, which left the new model as the whole story.

Codex exposes it through a built-in skill you trigger with $imagegen.

You describe what you want, Codex generates it, sizes it, and saves it.

Native.

So here’s the state of Claude Code image generation in mid-2026: the thing you want exists, it’s excellent, and it’s sitting in the tool next door. Keeping Codex open in a second terminal works — but it drags back the exact switching tax I wrote about all the way back in Claude Code vs Codex: Why I Use Both (And You Should Too).

Copy a path, switch windows, lose your place, switch back.

There had to be a way to borrow the capability without leaving home.

.

.

.

The Bridge Already Exists — And It Already Works

There is.

I’ve written about it before.

Back in April I covered Codex Reviews My Code Inside Claude Code — But I Don’t Trust It Blindly — OpenAI’s official Codex plugin that runs Codex inside a Claude Code session. That post used it for code review. But the plugin reaches well past review — it’s a general bridge to Codex, and one of its commands, /codex:rescue, can hand an arbitrary task to Codex and let it run.

Which left me poking at an obvious question: if the plugin gives me Codex inside Claude Code, and Codex has $imagegen… can I get a real image generated without ever leaving Claude Code?

So I tried it.

I called /codex:rescue and told it — in so many words — to use $imagegen for the lighthouse.

Claude Code prompt invoking /codex:rescue with the instruction "use the $imagegen to Generate an image of a minimalist line-art lighthouse on a dark navy background"

Codex woke up, took the task, and got to work.

The codex:codex-rescue subagent running inside Claude Code and reporting "Done" — 2 tool uses, 10.0k tokens, 2m 11s

A couple of minutes of waiting while Codex did its thing in the background, and a finished image existed — without a single window switch. The terminal I was working in never lost focus. The session I’d spent three hours building never broke.

It worked. Here’s what came out:

The generated lighthouse — minimalist white line-art on a flat dark navy background, beam lines radiating from the lantern room, gentle waves at the base

A real image, generated by gpt-image-2, without leaving Claude Code.

Proof of concept — done.

.

.

.

But You Have To Know The Magic Words

Here’s where the proof-of-concept stops being something you’d actually want to use day to day. Two frictions — and they’re the whole reason this post has a part two.

Friction one: it only fires when you summon it by name.

/codex:rescue is a command you have to remember and type, phrased just so. Ask for the image the way a human naturally asks — the plain sentence from the very top of this post — and nothing happens. Claude Code tells you it can’t make images and offers you that SVG again. The capability is reachable, but only if you already know the secret handshake. Forget the handshake and you’re back at the shrug.

Friction two: the file lands wherever Codex feels like dropping it.

Run the rescue route and the image shows up loose at the root of your project, sitting right alongside your config and your docs.

Project root file tree showing the generated lighthouse-line-art.png dropped in alongside AGENTS.md, CLAUDE.md, JOT.md, and MANIFEST.md

One stray PNG next to your project files is no crisis.

But picture a real working afternoon: a hero image for the landing page, two illustrations for a docs page, a placeholder avatar, a quick thumbnail to test a card layout. Five generations, five files scattered across your project root. By the third experiment I had four stray PNGs sitting next to my CLAUDE.md. I caught myself doing a manual cleanup and thought: this is exactly the kind of chore a script should handle.

There’s a sharper trap hiding in that route, too.

The first time I tried scripting this myself, I stared at a blank output for a solid ten minutes before I realized the shell had eaten the dollar sign. The $imagegen token starts with $, so the shell quietly expands it to nothing before Codex ever sees it — the command runs fine, no image appears, and nothing tells you why.
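The trap is easy to reproduce in any POSIX shell without touching Codex at all. Here is a minimal sketch, assuming no variable named imagegen is set in your session. The prompt text is only an illustration.

```shell
# Make sure no variable named "imagegen" exists, as in a typical session.
unset imagegen

# Double quotes: the shell expands $imagegen (to nothing) before any
# program ever sees the string.
expanded="use $imagegen to draw a lighthouse"

# Single quotes: the text passes through literally, dollar sign included.
literal='use $imagegen to draw a lighthouse'

printf '%s\n' "$expanded"   # the token is gone
printf '%s\n' "$literal"    # the token survives
```

The second form is the one a script has to use whenever the token needs to reach Codex intact.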

So here’s the scoreboard after the experiment: the capability is real and it runs in-session, but reaching it means knowing a command most people won’t, and using it means cleaning up after it. That’s exactly the kind of rough seam worth wrapping once — so you never feel it again.

.

.

.

The Fix: Say It In Plain English, Get A Tidy File

This is the part I want you to steal.

I built a small skill — codex-imagegen — that wraps the whole messy path into something that responds to how you’d actually ask.

Install it, then type the same natural sentence you’d have typed anyway.

No /codex:rescue.

No command to memorize:

Generate an image of a minimalist line-art lighthouse on a dark navy background.

I half-expected to have to type /codex-imagegen the first time. Instead I just asked for the image the way I normally would — and watched Claude load the skill on its own. That was the moment I knew the wrapper was worth building.

Claude Code automatically loading the codex-imagegen skill from the same plain-English prompt, running the bundled script with the codex plugin runtime, and reporting the image saved to .codex-image/lighthouse.png

Same words that earned a shrug two sections ago. This time: “Successfully loaded skill,” the plugin runtime picked automatically, and a finished PNG — 1200×1200, flat navy #0d1b2a — saved to .codex-image/lighthouse.png.

No handshake.

And look where the file went.

Project file tree showing the image neatly inside a dedicated .codex-image folder rather than loose at the project root

Instead of cluttering your project root, the image lands in .codex-image/ — its own corner, out of the way, easy to find later.
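If you want a feel for the tidy-up half on its own, here is a rough sketch. It is not the skill's actual script. The function name tidy_images is hypothetical, and it only handles the simple case of loose PNGs sitting at a project root.

```shell
# Hypothetical helper, not the skill's real code: move loose PNGs from a
# project root into a .codex-image/ folder inside it.
tidy_images() {
  root=${1:-.}
  mkdir -p "$root/.codex-image"
  for f in "$root"/*.png; do
    # If nothing matched, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    mv -- "$f" "$root/.codex-image/"
  done
}
```

Run it as tidy_images . from the project root. Files that are not PNGs stay where they are.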

For readers new to skills: a skill is a small, reusable capability you teach Claude Code once and trigger by describing what you want. (I wrote a whole series on them, starting with Claude Skills: Your “I Know Kung Fu” Moment Has Arrived (Part 1 of 3).)

Here’s what the skill handles so you don’t have to:

Without the skill | With codex-imagegen
Remember and type /codex:rescue with the right phrasing | Ask in plain English; the skill triggers on intent
Fish the file out of the project root | Lands organized in a .codex-image/ folder
Works only when the plugin is set up just so | Uses the plugin if present, falls back to the Codex CLI if not

If you’d rather be deliberate, the explicit slash form /codex-imagegen "your prompt" is there too.

Most of the time you won’t reach for it — plain language is the point.

That fallback matters more than it sounds. It means the skill works on a teammate’s machine that only has the Codex CLI, or on a server with no plugin installed, with zero changes on your end. And the shell-eats-the-dollar-sign trap from earlier? Never reaches you. The script handles that token safely every single time.
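The fallback logic can be pictured roughly like this. Treat it as a sketch under assumptions, not the skill's real detection code: the function name pick_runtime is mine, and the plugin directory is passed in as an argument because I am not documenting where the plugin actually lives. The only real command here is the CLI install line, which matches the prerequisite in the install section.

```shell
# Hypothetical sketch of "plugin if present, CLI otherwise".
# $1: directory where the Codex plugin would live (an assumed location)
# $2: name of the CLI binary to look for (defaults to codex)
pick_runtime() {
  plugin_dir=$1
  cli_name=${2:-codex}
  if [ -n "$plugin_dir" ] && [ -d "$plugin_dir" ]; then
    echo plugin
  elif command -v "$cli_name" >/dev/null 2>&1; then
    echo cli
  else
    # Say what to install instead of failing silently.
    echo "No Codex runtime found. Install the Codex plugin, or run: npm i -g @openai/codex" >&2
    return 1
  fi
}
```

The ordering is the design choice that matters: prefer the plugin, accept the CLI, and fail loudly with an install hint when neither is there.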

One quiet bonus: Codex keeps a local cache of every image it generates and never cleans it up. The skill tidies that cache behind the scenes after each run, so the folder doesn’t quietly balloon to hundreds of megabytes while you’re not looking.

.

.

.

What This Unlocks: An Image Generator Other Skills Can Call

Here’s where it gets interesting — and why I think the natural-language trigger matters more than the convenience.

Because the skill responds to plain intent instead of a hard-coded command, it stops being something only you invoke. It becomes a building block other skills and workflows can lean on.

Picture a publishing workflow that notices a finished post has no featured image, writes a prompt from the post’s own title, calls codex-imagegen, and drops the result in the right folder — no human in the loop. Or a project scaffolder that fills in placeholder icons and hero art as it sets up a new repo, instead of leaving you a wall of empty boxes to fill later. Or a slide-deck skill that generates a custom illustration for each section as it builds the outline, so the deck arrives already looking like someone designed it.

Whiteboard-style hub-and-spoke diagram: three skills (Blog Publisher, Project Scaffolder, Slide Deck Builder) each pointing inward to a central codex-imagegen box, which outputs a cluster of images into the project

In every one of those, image generation is no longer a thing you stop and do by hand. It’s a step another skill takes on your behalf, mid-flow, because the door is now wide enough for a machine to walk through.

The principle underneath: a capability wrapped as a natural-language skill becomes composable.

One skill can hand off to another, and image generation turns into a primitive your automations reach for — rather than a manual detour you take by hand. None of that works if the only way in is a command a human has to remember to type.

I’ve been building exactly this kind of skill-calls-skill workflow, and it deserves its own post to do it justice. I’ll walk through a real one soon — if you don’t want to miss it, subscribe.

.

.

.

Install It (It’s Open Source)

The skill is open-source, in my agent-skills collection. Two ways in, depending on your setup.

For Claude Code, install the packaged plugin:

/plugin marketplace add nathanonn/agent-skills
/plugin install codex-imagegen@nathanonn-agent-skills

For any other agent — Codex, Cursor, Copilot, or Claude Code itself — use the open Agent Skills CLI:

npx skills add nathanonn/agent-skills --skill codex-imagegen

One prerequisite: the skill rides on Codex, so you need either the Codex plugin in Claude Code or the Codex CLI installed locally (npm i -g @openai/codex, then codex login). Auth runs off your OpenAI account. If neither runtime is present, the skill tells you exactly what to install rather than failing silently.

And because it ships in the cross-agent collection, the same skill works outside Claude Code too. Same plain-language trigger, in whichever agent you happen to be living in that day.

.

.

.

The Harness Gap Keeps Closing

Step back and the trend line is hard to miss.

A year ago, the honest answer to “Claude Code or Codex?” was both, in two terminals. Then the Codex plugin let one tool review the other’s code in a single session. Now a small skill lets Codex generate images inside Claude Code — and not as a command you summon, but as something Claude reaches for the moment you simply ask.

The move worth internalizing: when your main tool can’t do something, you don’t always have to switch tools or wait for a feature to ship.

Sometimes you borrow the capability from the tool next door and wrap the seam so cleanly that you — and your other skills — stop noticing it was ever a seam. The Codex plugin opened that door. A skill that triggers on plain language walks the rest of the way through it.

Claude Code image generation went from “impossible here” to “just ask” in the span of one small skill.

That’s the whole shift — a feature that used to require leaving the room now happens without you breaking stride.

So install it, open your next project, and ask for the image you need — in plain English, the way you’d ask anyone. Watch it land in your project a couple of minutes later. Then go build the rest of whatever you were making, and tell me what came out.

11 min read The Art of Vibe Coding

The Three Files That Made Codex /goal Reliable Enough to Walk Away From

Person wondering "Can I walk away?" beside three documents — GOAL (what to build), VERIFY (how to prove it), and PROGRESS (what happened) — feeding into Codex /goal, producing a completed app. Caption: "Trust comes from evidence, not vibes."

The Hard Part Was Walking Away

The first time I ran a codex goal command on something that mattered, I sat there for twenty-eight minutes pretending to check email while the terminal scrolled.

I wasn’t doing other work. I was watching.

Making Codex write code has never been the hard part.

The hard part is walking away — stepping out of the room while an autonomous agent builds something you actually care about, with no way to know if it’s going sideways until it’s done.

Over five weeks and five builds, the same question kept surfacing: How do I know it did the right thing while I was gone?

Here’s what I learned.

A good /goal skill gives Codex a clear system to follow — and gives you something concrete to audit after the run ends. That system comes down to three files: GOAL, VERIFY, and PROGRESS.

  • GOAL defines what “done” means.
  • VERIFY maps each requirement to an actual check.
  • PROGRESS records what happened during the run so you can review the evidence instead of guessing.

Together, they turn a hands-off Codex run from “hope it works” into “review the receipts.”


.

.

.

Five Builds, One Pattern

This pattern came from building things — running the codex goal command across five projects and watching what actually made the difference between a run I could trust and one I couldn’t.

Build | Scale | Trust Lesson
WordPress plugin | 1 goal | Codex needs a clear finish line
CLI tool | 2 goals | Connected goals need clear verification
Browser game | 8 goals | Sequencing matters
Expanded game | 7 goals | Seam checks catch hidden bugs
WooCommerce plugin | 10 goals | Long runs need receipts

Each build gave Codex a bigger slice of autonomous work. Different stacks, different scales, anywhere from 28 minutes to nearly five hours of runtime. The trust structure underneath stayed the same.

Here’s the thing. The first build — one goal, 28 minutes — was small enough to verify by hand. By the fifth — ten goals, nearly five hours — manual verification would have taken longer than the build itself. (Stay with me on that: a five-hour autonomous run where you come back and check the receipts instead of babysitting. That’s the payoff these three files unlock.)

The full walkthrough for each build is linked at the end of this post.

.

.

.

Does Codex Know What Done Means?

GOAL answers the first trust question: What exactly are we building?

Without a clear definition of done, Codex invents its own finish line. It wanders, second-guesses scope, and eventually gives up, having shipped only half the requirements. Every long run that went sideways in my five weeks traced back to one thing: the spec was too vague for Codex to grade its own work.

And Codex re-reads the goal text constantly. It uses that same document as both the to-do list and the test for “done.” A vague spec gives it nothing to grade against — so it either keeps going in circles or declares victory on a hunch.

A good /goal skill solves this by writing GOAL from evidence. It reads your project first — the language, the framework, the folder layout, the naming patterns already in place. Then it asks targeted questions, each carrying a recommended answer and a one-line reason. By the time it generates GOAL.md, the document is grounded in your actual codebase.

GOAL.md should include:

  • Objective — one sentence describing what this goal produces
  • Repo context — what the skill learned by reading the project
  • Requirements — the specific features or behaviors to build
  • Assumptions — what the skill inferred and confirmed with you
  • Boundaries — what Codex should not touch
  • Definition of done — the outcomes that constitute “finished”
  • Stop conditions — when Codex should stop and ask instead of guessing
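A quick way to catch a thin GOAL.md before a run starts is to check that every section in that list is present. The sketch below is a hypothetical helper, not part of the skill. It assumes each section name appears somewhere in the file, for example as a heading, and the function name missing_goal_sections is mine.

```shell
# Hypothetical check: print every expected section that GOAL.md lacks.
# Assumes each section name appears somewhere in the file, e.g. as a heading.
missing_goal_sections() {
  goal_file=$1
  for section in "Objective" "Repo context" "Requirements" "Assumptions" \
                 "Boundaries" "Definition of done" "Stop conditions"; do
    # -i ignores case, -F treats the name as plain text.
    grep -qiF -- "$section" "$goal_file" || echo "$section"
  done
}
```

Empty output means all seven sections were found. Anything printed is a section to write before handing the goal to Codex.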

That last one matters more than it looks.

Stop conditions are the guardrail that prevents Codex from filling in gaps with assumptions. When the spec runs out of detail, a good skill tells Codex to pause rather than improvise.

In practice, the skill handles most of this for you. It shows up with a draft of what “done” looks like and asks you to confirm or edit — which is always faster than writing it from scratch. (The whole exchange feels like confirming a restaurant reservation. “Table for one? Near the window? 7 PM?” Yes, yes, yes.) Most of the time, the recommended answers are right. When they’re wrong, editing one line is cheaper than discovering the gap mid-run.

A messy feature request passes through a funnel (read repo, clarify assumptions) and becomes GOAL.md — with objective, requirements, boundaries, definition of done, and stop conditions — pointing toward a checkered finish flag. Caption: "GOAL tells Codex what done means."

.

.

.

Proof, or Just a Green Terminal?

VERIFY answers the second trust question: How do we know the work is correct?

GOAL defines what done looks like. VERIFY maps each requirement to an actual check that proves it was built correctly. Those checks have to use real commands from your project — real test runners, real build steps, real linters. Invented checks are worse than no checks, because they hand you false confidence.

A green terminal is comforting.

But if the check doesn’t trace back to a specific requirement in GOAL, it proves nothing useful. The WooCommerce build had ten goals running across nearly five hours. Without explicit traceability from each requirement to its proof, a clean summary could easily mask three missing features — and you’d only discover them after deploying.

VERIFY.md should include:

  • Requirement-to-check mapping — each GOAL requirement paired with its verification method
  • Real commands — test, build, lint commands that exist in the project
  • Manual checks — for anything that can’t be automated (UI polish, UX flow), explicitly listed as manual
  • Expected results — what a passing check looks like
  • Environment notes — anything the checks depend on (ports, services, seed data)
  • Seam checks — for multi-goal runs, checks that test the boundaries between goals
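To make the requirement-to-check mapping concrete, here is one way it could be executed. This is a sketch under assumptions: the tab-separated mapping file and the function name run_checks are hypothetical, and a real VERIFY.md may lay out its checks quite differently. Each command would be a real test, build, or lint command from your project.

```shell
# Hypothetical runner: each line of the mapping file is
#   <requirement id><TAB><command that proves it>
# The file format is an assumption for illustration.
run_checks() {
  mapping=$1
  failed=0
  tab=$(printf '\t')
  while IFS=$tab read -r id cmd; do
    [ -n "$id" ] || continue
    # Detach stdin so the command cannot swallow the rest of the mapping.
    if sh -c "$cmd" </dev/null >/dev/null 2>&1; then
      echo "$id PASS"
    else
      echo "$id FAIL"
      failed=1
    fi
  done < "$mapping"
  return "$failed"
}
```

Every result line starts with a requirement ID, so a pass or a fail always traces back to something in GOAL. The function returns non-zero if any check failed.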

Seam checks earn a highlight.

I almost skipped the full-loop check in the Ion Viper build — every individual goal had passed its own tests. Why bother? Four hidden bugs at the transitions between goals is why. Stale state on restart, timing conflicts between systems, projectiles accumulating silently across scene boundaries. Each goal looked fine in isolation.

The full loop revealed the assumptions that no single goal’s tests could catch.

👉 A green terminal is not proof if the check doesn’t map back to the requirement.

Three requirements linked by chains to VERIFY.md, which maps each to a proof step. An arrow leads to Codex running verification with a "No fake checks" callout, producing a PASS stamp. Caption: "VERIFY turns done into proof."

.

.

.

What Happened While You Were Gone

PROGRESS answers the third trust question: What happened while I was gone?

GOAL gives the standard. VERIFY gives the proof plan. PROGRESS is the running record of what Codex actually did — what it changed, what it checked, what passed, what broke, and how it responded.

For a quick, single-goal run, you might glance at the terminal and move on. For a ten-goal build that ran for nearly five hours, terminal output is useless as an audit tool. PROGRESS.md is the structured receipt that lets you review the entire run without scrolling through hundreds of lines of terminal history.

PROGRESS.md should include:

  • Goals started and completed — with timestamps
  • Files changed — what Codex touched in each goal
  • Checks run — which verification steps executed
  • Results — pass/fail for each check
  • Errors found and fixes made — what broke and how Codex handled it
  • Evidence paths — where to find the artifacts (test output, screenshots, logs)
  • Remaining issues — anything Codex flagged but couldn’t resolve
  • Resume point — where to pick up if the run is interrupted
  • Final summary — the overall status in one paragraph

Resume points matter for long runs.

If a build fails at goal 7 of 10, you don’t want to re-run goals 1 through 6. PROGRESS records exactly where the work stopped and what state it was in, so the next run picks up cleanly.
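Finding the resume point can be a one-liner if the record is structured. The sketch below assumes a hypothetical status file with one "goal status" line per goal, in run order. That format and the function name resume_point are mine, not something PROGRESS.md is guaranteed to contain.

```shell
# Hypothetical format: one "<goal> <status>" line per goal, in run order.
# Prints the first goal whose status is not PASS; prints nothing if all passed.
resume_point() {
  awk 'NF >= 2 && $2 != "PASS" { print $1; exit }' "$1"
}
```

For a run that stopped at goal 7 of 10, this would print the seventh goal's name, and goals 1 through 6 would be left alone.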

The WooCommerce build was the first time I actually left the desk. Nearly five hours. I came back, opened PROGRESS.md, and had the full story in under three minutes — what passed, what broke, what Codex fixed on its own, and what it flagged for me to look at. A few minutes of structured review instead of hours of babysitting. That’s the trade these files offer.

A developer walks away from a desk. A robot works on code. PROGRESS.md records: Goal 1 PASS, tests run, bug fixed, evidence saved, resume point. A human returns and reviews the document with a magnifying glass. Caption: "PROGRESS is why walking away is not blind trust."

.

.

.

The Chain That Holds It Together

The three files work because they connect.

  • GOAL defines what done means.
  • VERIFY maps those requirements to proof.
  • PROGRESS records whether the proof held up during the actual run.

Any important requirement should be traceable across all three.

Here’s the traceability test: pick any requirement from GOAL. You should be able to find its matching check in VERIFY, the expected result, the recorded outcome in PROGRESS, and the evidence a human can review. If the chain breaks at any link — and I’ve had it break — you catch it during review, before anything ships.

Let me show you what that looks like.

A GOAL requirement says “admin users can generate auto-login links.” VERIFY maps that to a specific test command plus a manual browser check. PROGRESS records that both passed, with the test output saved to a file path you can open. From that single thread, you or anyone reviewing the build can verify the claim without re-running anything.
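The same thread can be checked mechanically. Here is a minimal sketch, assuming requirements carry IDs like US-01 and that the same IDs are repeated in VERIFY.md and PROGRESS.md. Both the ID convention across all three files and the function name trace_gaps are assumptions for illustration.

```shell
# Hypothetical traceability check. Assumes requirements carry IDs such as
# US-01 and that the same IDs appear in VERIFY.md and PROGRESS.md.
trace_gaps() {
  dir=$1
  for id in $(grep -oE 'US-[0-9]+' "$dir/GOAL.md" | sort -u); do
    for f in VERIFY.md PROGRESS.md; do
      # -w keeps US-01 from matching inside a longer ID.
      grep -qwF -- "$id" "$dir/$f" || echo "$id missing from $f"
    done
  done
}
```

It only proves the ID shows up in each file, not that the check was meaningful, so it narrows the human review rather than replacing it.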

Each file fills a role the others can’t cover. GOAL without verification is a wish list — requirements with no accountability. A proof plan with no execution record can’t confirm the tests actually ran. And an execution log with no standard to measure against is a diary that tells you what happened but can’t tell you whether it was right.

The traceability between them is what turns three documents into a system you can rely on. Hand a reviewer the three files from any goal, and they can reconstruct the full story: what was supposed to happen, how it was supposed to be tested, and what actually happened. The evidence sits on disk.

.

.

.

Where Trust Still Breaks

The biggest risk is an unclear spec.

I wrote “build a wishlist plugin” once: three words, no detail. The result had wishlist features, technically. Just not the ones I needed. (If you’ve ever gotten back exactly what you asked for and realized the problem was what you asked for, you know the feeling.)

That costs real time and money.

The WooCommerce plugin build used roughly $131 worth of subscription usage across ten goals and nearly five hours. A vague spec that forces a second attempt doubles that cost. The three files pay for themselves by making the first run more likely to be the only run.

Codex also doesn’t do design polish. It builds things that work, but the look and feel comes out plain. Admin interfaces are functional and ugly. That’s fine — the hours the three files freed up from implementation are hours you can spend on refinement instead.

And I still review everything.

The three files don’t replace human judgment. They move it to better places: before the run, you define the right goal. After the run, you audit the receipts and polish what needs a human eye.

The codex goal command handles the middle part.

Your job is the beginning and the end.

.

.

.

Trust Is Engineered

The goal was always the same: give Codex a system that leaves evidence you can audit.

GOAL sets the finish line, VERIFY maps the proof plan, and PROGRESS keeps the receipts.

Together, they turn a hands-off run from “hope it works” into “review the evidence.” That’s what made walking away possible — across five builds, five different stacks, and run times stretching from 28 minutes to nearly five hours.

The skills keep getting better, too. Every gotcha you teach one is a gotcha it handles next time. The second project on a given stack goes smoother than the first, and the fifth smoother still. The investment compounds.

The specific stack doesn’t matter, either. Whether you’re building WordPress plugins, CLI tools, browser games, or something else entirely, the three-file structure translates. Build the skill once for your domain, teach it your stack’s hard-won rules, and future projects start ahead.

Build the system once. Document the evidence. Review the receipts.

All four skills from this series are open source:

npx skills add nathanonn/agent-skills --agent codex

The repo is at github.com/nathanonn/agent-skills.

The series, if you’re catching up:


More workflows like this — AI-assisted development with Claude Code, Codex, and the tools between them — land in The Art of Vibe Coding newsletter every week. If this one was useful, the next one probably will be too.

14 min read The Art of Vibe Coding

I Gave Codex a Requirements Doc and Got a CodeCanyon-Grade Plugin Back

Assembly line illustration for turning a requirements doc into a CodeCanyon-grade WordPress plugin with Codex /goal

The first time I used Codex /goal, I sat at my desk for twenty-eight minutes pretending to do other work while an autologin plugin built itself from a one-paragraph spec.

That was How to Use Codex /goal to Build WordPress Plugins (My Spec-to-Ship Workflow). One feature. One goal. The kind of experiment where you peek at the terminal every 90 seconds and try to look casual about it.

This time, my input was a full requirements document and this single line:

./run-goals.sh

Then I walked away. For nearly five hours.

When I came back, a complete WooCommerce plugin was sitting in the repo — an admin grid for bulk-editing stock quantities across products, including per-variation stock for variable products. That’s the exact kind of WooCommerce complexity that breaks naive implementations. The genre of plugin that sells on CodeCanyon for $30–60.

All built while I made dinner, watched half a movie, and checked the terminal exactly once. (More on that later.)

VS Code file explorer showing the starting state with only wp-requirements-to-goals skill, playwright-cli skill, and requirements.md — the entire human input is one requirements file plus two skills

Everything I’ve built with the codex goal command up to this point has fit inside a demo. The autologin plugin took twenty-eight minutes. How I Chained Two Codex /goal Runs to Build a Complete CLI Tool scaled the pattern to two linked goals. How I Used 8 Codex /goal Runs to Build a Browser Game From Scratch pushed it to eight.

The question I’ve been carrying — and maybe you have too — is whether /goal survives contact with real software. Multi-feature. Edge cases. Settings pages. The kind of product someone would actually pay for.

This post is where I find out.

The honest caveat lands early, same as always: /goal produced the code, but the requirements produced the outcome. And this time the spec was a full requirements document, decomposed by a skill into a layered tree of goals — each with its own contract, its own verification, its own proof.

(If you’re new to the series, the autologin post covers what /goal is and how the goal trio works. Everything here builds on that foundation.)

.

.

.

The Requirements Are the Real Work

A paragraph was enough for an autologin plugin.

A full product needs a full brief.

I learned this the hard way on a previous build. The requirements were loose enough that the agent met every acceptance criterion — and still missed what I actually wanted. (If you’ve ever written a Jira ticket and gotten back something that was technically correct and completely wrong, you know the feeling.)

That gap is where I started treating the requirements doc as the real product.

(Full requirements: https://github.com/nathanonn/wc-bulk-edit-stock/blob/main/requirements.md)

The requirements for this build carried tagged user stories with explicit acceptance criteria, edge cases around out-of-stock states and variable-product handling, and cross-cutting concerns like validation and save resilience:

  • US-01: Quickly update a single product’s stock from a filterable admin grid
  • US-02: Set a group of products to out-of-stock at once (bulk action)
  • US-03: Edit per-variation stock for variable products inline
  • Edge cases: WooCommerce inactive, concurrent edits, deleted staged products, 100+ variations
  • Cross-cutting: Save/validation resilience, filtering/search, batch selection

That doc is the product brief, the architecture, and the test plan — all in one file. The better it is, the less you touch the build.

I wrote about the upstream discipline in How to Write Better Requirements with Claude (Stop Letting AI Assume). That post produces the input this post consumes. If you’re going to try this workflow, start there.

Here’s the thing: the codex goal command runs on evidence, and the requirements doc is where that evidence gets defined. Every acceptance criterion becomes a checkbox the machine has to satisfy before declaring a goal complete. Write the criteria well, and you’ve written the test plan. Write them vaguely, and the build reflects that vagueness right back at you.

The leverage point from the autologin post still holds — the autonomy /goal provides downstream is paid for upfront, in the spec. Here the spec is bigger, so the downstream autonomy stretches wider too.

.

.

.

Meet wp-requirements-to-goals — The Skill That Decomposes

The autologin post introduced a skill that turns a vague paragraph into one goal trio. One input, one output.

This post’s counterpart is wp-requirements-to-goals.

Same family, different scale. It takes a structured requirements doc and produces an entire project — a goals plan, a root scaffold, and a layered tree of goals ready to execute. When I first ran it against the bulk stock manager requirements, the decomposition it produced was almost exactly what I would have designed myself — except it took minutes instead of an afternoon of whiteboarding.

The layering follows a consistent pattern:

Layer | What it builds
00-foundation | Walking skeleton: plugin activates, settings register, one artifact renders
Per-US goals | One goal per user story, acceptance criteria copied verbatim from requirements
Non-US feature goals | Cross-cutting concerns that don’t map to a single story
Integration goal | Re-verifies every prior goal + cross-cutting edge cases

Each goal carries its own GOAL.md, VERIFY.md, and PROGRESS.md — the same trio from the autologin post, repeated across the full tree. Acceptance criteria are copied verbatim from the requirements document. Never paraphrased. That’s what keeps the machine’s definition of “done” identical to yours.

The integration goal at the end re-runs every previous verification — the same QC checkpoint idea readers of Your Codex Skills Should Evolve With Your Project (Ion Viper Part 2) will recognize, now baked into the WordPress skill instead of manually authored.

And before asking any questions, the skill probes the repo. It checks for existing config files, reads the slug, namespace, WordPress version, and PHP target from whatever’s already on disk. The clarification rounds stay short because the filesystem already answered most of the questions.

(Smart enough to look before it asks — which, let’s be honest, puts it ahead of a lot of people I’ve worked with.)

Codex terminal showing the wp-requirements-to-goals skill invoked against the requirements file

.

.

.

One-Shot or Phased — and the Q&A That Sets the Plan

The skill’s first question is a mode decision: generate goals phased or one-shot?

Phased writes the plan first, pauses so you can review and edit, then generates the goal files and scaffold. Safer for a first run — because the plan decomposition is the highest-risk decision. If the skill slices the requirements poorly, every downstream goal inherits the mistake.

One-shot generates the plan, scaffold, and all goal folders in a single pass. Faster, and what I chose here. The requirements doc was clean enough that I trusted the decomposition, and I wanted to see how far the unattended pipeline could stretch.

Codex asking whether to generate goals phased or one-shot, with three options: Phased recommended, One-shot, and None of the above
Selecting One-shot option to generate all goals and scaffold in one pass

After the mode decision, the skill ran through a handful of clarification rounds. I went with the recommended option on every one — the repo probe had already answered the identity questions, so these were mostly confirming sensible defaults.

(The whole exchange felt like confirming a restaurant reservation. “Table for one? Near the window? 7 PM?” Yes, yes, yes.)

First Q&A round with scaffold questions answered using recommended defaults — project identity, WordPress baseline, goal slicing, edge-case ownership
Second Q&A round covering test seeding method, derived acceptance criteria, and integration verification policy — all answered with recommended options

Then Codex laid out its five-step generation plan and started working.

Codex updated plan showing five generation steps: Phase 1 config, scaffold, foundation goal, per-US and non-US goals, integration goal

About 19 minutes later, the scaffold was done. Ten goal folders sitting in the goals directory. A root config, a plugin bootstrap folder, a verification protocol, and the bash script to run them all. Every contract written. Nothing implemented yet.

The project was runnable.

VS Code showing the finished scaffold — 10 goal folders from 00-foundation through 09-integration in the goals directory, plus root config files, ready to run

.

.

.

The Part That’s New: One Bash Command Runs Every Goal

Here’s what changed between this post and every previous one in the series.

In every prior build, I pasted each /goal command by hand. Copy the command, swap the folder name, press enter, wait, repeat. The build was autonomous within each goal, but the handoff between goals was manual. ME, copying and pasting. Every. Single. Time.

run-goals.sh removes that last handoff.

It chains every goal in order — starts the WordPress environment, runs the first goal, and when that one completes it auto-proceeds to the next, all the way through the integration goal at the end. One trigger, then leave.

Two pre-flight steps first. Install the local WordPress tooling:

Terminal showing npm install output — 404 packages installed for wp-env

Start the local WordPress environment:

wp-env start output with WordPress dev site at localhost:8888 and test site at localhost:8889

Then the trigger:

./run-goals.sh
Running ./run-goals.sh — the script starts wp-env, then launches Goal 00-foundation with danger-full-access sandbox and never approval

A practical note on plan tiers: on a ChatGPT Pro (x5) plan, the full unattended run fits inside usage limits. On a lower plan like Plus, you’d run goals in chunks to stay within limits — and the script supports exactly that:

./run-goals.sh --from 00 --to 02   # run goals 00, 01, 02
./run-goals.sh --only 03           # run a single goal
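
The range flags imply that the script resolves a list of goal folders before it runs anything. Here is a minimal sketch of that selection step, assuming goal folders carry a two-digit prefix such as 00-foundation, as in the scaffold above. select_goals is a hypothetical helper for illustration and is not taken from the actual run-goals.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: list the goal folders whose two-digit prefix falls
# inside an inclusive range. Not taken from the real run-goals.sh.

# Usage: select_goals <goals_dir> <from> <to>
select_goals() {
  local goals_dir="$1" from="$2" to="$3"
  local dir name prefix
  # The glob expands in sorted order, so goals come out in run order.
  for dir in "$goals_dir"/[0-9][0-9]-*/; do
    [ -d "$dir" ] || continue        # skip the unexpanded pattern when nothing matches
    name=$(basename "$dir")
    prefix=${name%%-*}               # "02-some-goal" -> "02"
    # test compares the operands as decimal integers, so "08" is safe here.
    if [ "$prefix" -ge "$from" ] && [ "$prefix" -le "$to" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Example: the equivalent of --from 00 --to 02 against a local goals directory.
# select_goals ./goals 00 02
```

Under this sketch, --only 03 is the same call with both bounds set to 03.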

The foundation goal finished in about 14 minutes. The script committed the result and moved straight to the next goal without pausing.

Goal 00-foundation completed in 14 minutes 17 seconds, auto-proceeding to Goal 01-access-control with no human input

That auto-proceed is the whole point. The autologin post removed the per-step approvals. This one removes the per-goal handoffs. You are now outside the loop for the entire multi-goal build.

.

.

.

The Nearly-Five-Hour Black Box

The first time I left a single /goal run alone, the gap was 28 minutes. That felt long.

Nearly five hours is a different animal entirely. Ten goals. The entire implementation of a multi-feature WooCommerce plugin, start to finish, with nobody at the keyboard.

I won’t pretend the first time you let a run that long go feels comfortable. The trust window is ten times wider than in the autologin post (283 minutes against 28), and the stakes are proportionally bigger: more goals means more surface area for things to go wrong.

About two hours in, I opened the terminal tab. Just a glance — the kind where you tell yourself you’re checking “out of curiosity,” not because you’re nervous. Goal 05 was running. I closed the tab and made dinner.

Here’s what made the absence workable:

Each goal’s VERIFY.md defines what counts as proof. The continuation prompt refuses to declare a goal complete without mapping every acceptance criterion to evidence. Scope boundaries in each GOAL.md keep Codex from wandering into unrelated files. And the integration goal at the end — which alone took 91 minutes, about a third of the total runtime — ran a full regression sweep three times, re-verifying every prior goal’s work against the live WordPress environment.

Let me say that again. A third of the total build time was pure verification.

That regression discipline carries through the whole chain. Each goal re-checks the ones that came before it. A late goal breaking an early one would surface in that goal’s own verification pass, long before the integration sweep catches it again. The tests compound across the chain, and what you’re left with is a result you can audit from the artifacts alone.

283 minutes, 9 seconds. Ten goals completed, zero skipped.

Terminal showing 10 goals completed in 283 minutes 9 seconds with 0 skipped, followed by wp-env shutdown

.

.

.

What It Cost

I’ve been writing this series for months without ever putting a dollar figure on the autonomy.

This one does.

Before this experiment, I’d browsed CodeCanyon for bulk stock managers. The $40–60 listings had mixed reviews and half of them hadn’t been updated in a year. I wanted to know whether a clean spec and under five hours of machine time could land in the same category — so I built a bash script that totals input and output tokens across the full run and applies current GPT-5.5 API pricing.

Here’s what the 10-goal build cost:

Cost calculation output showing GPT-5.5 pricing: 10 completed goals, 4.71 hours, 208M input tokens with 206M cached, 0.43M output, Short Cost $131.40, Long Cost $254.46

How to think about that number:

Hiring a freelance WordPress developer to build a multi-feature WooCommerce admin plugin from a requirements doc would cost anywhere from $500 to several thousand dollars, depending on the complexity and the developer’s rate. Buying an existing CodeCanyon plugin and customizing it runs $30–60 for the license, plus hours of adaptation time to make it fit your exact spec.

$131 for a working, tested, multi-feature plugin built from your exact requirements — with zero hands-on coding time — lands in a genuinely interesting spot.
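
If you want to run the same tally on your own builds, the arithmetic is a weighted sum of token counts. Below is a minimal sketch, with token counts in millions and per-million prices passed in as arguments. run_cost is a hypothetical helper, not the script used for this post, and no prices are hard-coded because the rates behind the figures above are not listed here:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the cost tally: tokens (in millions) times a
# per-million price, summed over uncached input, cached input, and output.

# Usage: run_cost <uncached_in_M> <cached_in_M> <out_M> \
#                 <usd_per_M_uncached> <usd_per_M_cached> <usd_per_M_out>
run_cost() {
  awk -v u="$1" -v c="$2" -v o="$3" -v ru="$4" -v rc="$5" -v ro="$6" \
    'BEGIN { printf "%.2f\n", u * ru + c * rc + o * ro }'
}

# Example with this run's approximate counts (208M input, of which 206M was
# cached, leaves about 2M uncached; 0.43M output). Supply your own rates:
# run_cost 2 206 0.43 "$RATE_UNCACHED" "$RATE_CACHED" "$RATE_OUTPUT"
```

Splitting cached from uncached input matters because almost all of the input in this run was cached, and cached tokens are typically billed at a lower rate.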

.

.

.

Does It Actually Work? (And the UI Taste Caveat)

Closed the terminal. Opened the browser. Tested the plugin like a regular human would.

The honest caveat first: the generated admin UI is functional but plain. GPT-5.5 builds things that work, but its visual design sense is weaker than that of Claude models. The admin page has the right columns, the right filters, the right controls: everything the requirements specified. The layout and styling are just… adequate. Functional without any flair.

The generated Bulk Edit Stock admin page showing a product table with search, category filter, stock status filter, and columns for product name, type, stock managed, stock quantity, and stock status — functional but visually plain

A day of CSS polish from a human — or a Claude session focused on UI — would bring it up to marketplace standard. The functionality, though, is the part the requirements controlled. And the functionality held up.

Here’s the test that matters most.

I edited stock for a simple product (set quantity to 20) and for a variable product’s “Small” variation (set quantity to 19), then hit Save Changes.

Bulk editing stock quantities — WC BES G09 Seasonal Two changed to 20, Small variation changed to 19, with Save Changes button and 2 products modified indicator

Then I opened the actual WooCommerce product edit screens to check whether the values persisted. The simple product showed 20.

WooCommerce product edit page for WC BES G09 Seasonal Two showing stock quantity of 20 persisted correctly after bulk edit, with red arrow pointing to the quantity field

The variation showed 19.

WooCommerce variation edit page for Small variation showing stock quantity of 19 persisted correctly after bulk edit, with red arrow pointing to the stock quantity field

Per-variation stock on variable products is exactly where a lazy plugin implementation falls apart — WooCommerce stores variation stock separately from the parent product, and the save path requires hitting variation-specific meta fields.

That complexity is the reason I chose this plugin as the test case. And it held up.

👉 What this series keeps landing on: /goal offloads the implementation so you can spend your time being a good tester. Hours of machine work freed me to focus entirely on verification. Opening the browser, clicking through the plugin, checking that values persisted — that’s where my time belongs now.

.

.

.

Grab the Plugin

The full project is on GitHub: wc-bulk-edit-stock. Every goal folder, the bash script, the complete Codex run history — all of it. You can walk through the entire build, goal by goal, in the commit log. (It’s one of those repos where the journey is the documentation.)

If you just want the finished plugin, the releases page has a downloadable zip. Drop it into any WooCommerce site and you’ve got yourself a working bulk stock manager.

.

.

.

Use the Skill for Your Own Plugin

Install the skill:

npx skills add nathanonn/agent-skills --skill wp-requirements-to-goals --agent codex

The repo is at github.com/nathanonn/agent-skills.

One prerequisite to know about: the verification step in each goal uses playwright-cli for browser-based tests against the running WordPress environment. If you want the full workflow — including automated verification — you’ll need it installed. The playwright-cli README covers the setup.

Decomposition, scaffolding, and goal generation — that’s what the skill handles. Execution is on the bash script. But both are only as good as the requirements doc you feed in. Vague requirements produce vague goals, and the build reflects that.

The real prerequisite — ferpetesake — is learning to write requirements well. Start with How to Write Better Requirements with Claude if you haven’t already.

.

.

.

The Bigger Picture

Five entries in this series. One pattern. An input that keeps shrinking.

The autologin post started with a paragraph and a pasted command — one feature. This one started with a requirements doc and one bash command — a complete, multi-feature product.

The skill carries the domain knowledge. /goal runs the execution loop. PROGRESS.md proves the work. What changed is the ceiling — the scope of what you can build without writing code or babysitting the build.

The human’s job has compressed to two things: writing the requirements well and verifying the result. Everything between those two — decomposition, scaffolding, implementation, testing, regression — is now machine work you can trigger and walk away from. Like leaving a slow cooker on and coming back to a finished meal. (Except the meal is a WooCommerce plugin, and the slow cooker cost $131.)

The codex goal command reaches marketplace-grade complexity here, and that’s the claim this post earns. A bulk stock manager with per-variation editing, cross-cutting validation, and a full integration sweep is the kind of plugin people actually sell. The build handled it.

The honest forward edge: the UI taste gap is real, the $131 cost is real, and “marketplace-grade functionality” still needs a human’s polish and judgment before it’s ready for paying customers. Functional code and a shippable product are different things — the gap between them is taste, branding, documentation, and support. All human work.

But the part AI is getting genuinely good at — executing a well-specified plan, unattended, across an entire multi-feature build — just took another visible step.

Your job is to get good at writing the plan.


More workflows like this — AI-assisted development with Claude Code, Codex, and the tools between them — land in The Art of Vibe Coding newsletter every week. If this one was useful, the next one probably will be too.


Your Codex Skills Should Evolve With Your Project (Ion Viper Part 2)

How I Used 8 Codex /goal Runs to Build a Browser Game From Scratch ended with a line I’d been sitting on for the whole post: “Part 2 is where it gets interesting.”

Let me deliver on that.

Seven more /goal runs. ~133 minutes of autonomous Codex time. Boss fights, three enemy archetypes, a power-up system, randomized waves, and a New Game Plus loop that keeps escalating difficulty. The game that was fun for 90 seconds now has a real ending — and a reason to keep playing past it.

Here’s the thing, though.

The more important story is what had to change in the skill before any of that was possible.

webg-spec-to-goal from Part 1 could scaffold a brand-new project and generate all its goals in one pass. Powerful for a greenfield build. Useless for extending a game that already had a working codebase and dozens of passing tests.

It needed to learn a second mode.

VS Code showing SKILL.md with new/extend mode dual flow beside the existing goals folder

.

.

.

Where Part 1 Left Off

Quick recap for context:

Metric | Part 1
Goals | 8
Autonomous time | ~78 minutes
Tests | 39
Skill | webg-spec-to-goal: scaffold + generate all goals at once
Result | Playable, fun for ~90 seconds

The previous post covers the full foundation — what /goal is, how goals are structured, and how each goal builds on the last. Read it here if you haven’t.

This post assumes you have.

.

.

.

You Can’t Plan Everything Upfront

The original webg-spec-to-goal skill had one mode: start from scratch.

Write a paragraph describing a game. Get a starter project, a plan, and all goals. Run them in sequence. Done.

That works for an MVP.

It breaks the moment you want to add features to an existing codebase.

The first time I tried to add features to Ion Viper after Part 1, I instinctively reached for the same skill. It took about ten seconds to realize the problem — the skill’s first step was scaffolding, which would overwrite the entire project I’d just built.

After Part 1, Ion Viper was a working project — code, tests, game systems all in place. The codex goal command had built it, and the skill had no way to continue from it.

Three concrete problems:

  1. No project awareness. The skill always scaffolded a new project. Running it again would overwrite everything — scenes, tests, config, all of it.
  2. No goal continuation. Goal numbering started at 00. There was no mechanism to pick up from 08 and continue the sequence.
  3. No awareness of existing game state. New goals needed to know what systems already existed. Without that awareness, they’d duplicate work or introduce conflicts with existing tests.

If you use autonomous coding tools long enough, every project hits this moment. The initial plan is exhausted, the codebase is substantial, and the tool needs to extend what’s there rather than start over.

The solution: teach the skill a second mode.

.

.

.

Teaching the Skill to Extend

Two changes turned webg-spec-to-goal from a one-shot scaffolder into a tool that grows with the project.

Extend mode. When the user asks to add features to an existing game, the skill switches behavior:

  • Reads the existing project to understand what’s already built
  • Skips scaffolding entirely (the project already exists)
  • Continues goal numbering from the highest existing goal
  • Appends new goals to the existing plan
  • Generates only the new goal folders
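
The numbering step in that list can be sketched as a scan for the highest existing prefix. This is a minimal sketch, assuming goal folders are named with a two-digit prefix followed by a hyphen. next_goal_number is a hypothetical helper and does not come from the skill itself:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: find the highest two-digit goal prefix in a goals
# directory and print the next number, zero-padded.

# Usage: next_goal_number <goals_dir>
next_goal_number() {
  local goals_dir="$1" highest=-1
  local dir name prefix
  for dir in "$goals_dir"/[0-9][0-9]-*/; do
    [ -d "$dir" ] || continue        # skip the unexpanded pattern when nothing matches
    name=$(basename "$dir")
    prefix=${name%%-*}               # "07-polish" -> "07"
    prefix=${prefix#0}               # "07" -> "7", so arithmetic never sees an octal literal
    if [ "$prefix" -gt "$highest" ]; then
      highest=$prefix
    fi
  done
  printf '%02d\n' "$((highest + 1))"
}

# Example: with folders 00 through 07 present, this prints 08.
# next_goal_number ./goals
```

An empty goals directory yields 00, so the same helper would cover a fresh project as well as an extension.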

QC Checkpoint. A new goal type that always lands last in any extension sequence.

It adds no features. Instead, it validates the full gameplay loop from start to finish and catches integration bugs that individual goals might miss.

(More on this one later — it earned its own section.)

Under the hood, the skill now branches early.

It detects whether you’re starting fresh or extending, and adjusts accordingly — skipping the scaffolding step and generating only the new goals while leaving everything else intact.

Here’s what makes extend mode reliable: it reads the existing plan before generating anything. That document captures the genre, the progression so far, and what’s already been built. The skill doesn’t guess what’s already there — it reads the evidence.

Codex recognizing the existing Phaser project and entering extend mode — reading goals-plan.md, state-bridge.ts, listing existing goals

.

.

.

The Enhancement Prompt

The enhancement description was six features in plain English:

  1. Rebrand from “Raiden Shooter” to “Ion Viper”
  2. Ion Blast power-up — timed multi-projectile firing
  3. Boss fight — multi-phase, health bar, victory screen
  4. New Game Plus — difficulty escalation on restart
  5. Enemy archetypes — at least 3 types with unique behaviors
  6. Randomized wave positioning
The six-feature enhancement description typed into the Codex composer with the $webg-spec-to-goal skill

Compare that to Part 1’s four-sentence game description.

The enhancement prompt is longer because it describes specific features to add to a game that already exists. Adding to a working system demands more precision than describing one from scratch.

The skill took those six features, confirmed scope in one clarification round (“Go with your recommendations” — same as Part 1), and generated 7 new goals (08–14) in a single 14-minute invocation.

Here’s what it produced:

# | Goal | What it adds
08 | Rebrand to Ion Viper | Game identity and metadata
09 | Ion Blast Power-Up | Timed multi-projectile pickup
10 | Enemy Archetypes | Drifter, shooter, charger: three distinct behaviors
11 | Randomized Waves | Balanced random spawn positions and timing
12 | Boss Fight | Multi-phase boss, health bar, victory screen
13 | New Game Plus | Difficulty loop, multipliers, clean restart
14 | QC Checkpoint | Full-loop validation and integration testing
Codex output showing all 7 new goal folders generated, goals-plan.md updated, ~14 minutes

.

.

.

Goals 08–13: The Enhancement Assembly Line

Same pattern as Part 1.

Paste the /goal command, swap the folder name, press enter, walk away.

Goal 08 — Rebrand to Ion Viper (7m 15s, 42 tests). A metadata pass — updated the game title, browser tab, and package info. Lightweight by design. Confirms the codebase is stable before real feature work begins.

Goal 09 — Ion Blast Power-Up (19m 36s, 47 tests). The first real feature addition. A timed pickup grants multi-projectile firing for a limited window. At 19 minutes, this was the longest non-QC goal — adding a brand-new game system on top of an existing one takes more work than extending a familiar pattern.

Goal 09 completion — Ion Blast implemented, 47 tests passing, 19m 36s

Goal 10 — Enemy Archetypes (14m 30s, 53 tests). Three enemy types replaced the single drone from Part 1. Basic drifters float downward predictably. Shooters fire projectiles back at the player. Chargers telegraph with a flash, then rush. Each type has its own health, speed, score value, and behavior — and the system is designed so adding more types later is a config change, not a rewrite.

Goal 11 — Randomized Waves (15m 30s, 57 tests). Spawn positions, timing, and lane spacing are now randomized within fair bounds. Players can’t memorize patterns anymore. The randomization layer plugs into the existing wave and spawning systems without replacing them.

Goal 12 — Boss Fight (13m 57s, 62 tests). After all waves clear, a multi-phase boss spawns with a visible health bar. Three attack phases, each with different patterns. Defeat it and you reach the victory screen. The game finally has a real ending.

Goal 12 completion — boss fight implemented, VictoryScene created, 62 tests passing, 13m 57s

Goal 13 — New Game Plus (16m 58s, 65 tests). The victory screen now offers a restart into a harder loop. Enemies get faster, tougher, and the boss scales up — all through multipliers that layer on top of the base difficulty. Loop 2 is harder than loop 1. Loop 3 escalates further.

The test count across the six goals tells the story: 42 → 47 → 53 → 57 → 62 → 65. Each goal adds its own automated tests and runs every previous test too. By Goal 13, every system from foundation to New Game Plus is covered.

.

.

.

Goal 14: The QC Checkpoint — Where the Bugs Were Hiding

Stay with me on this one, because it changed how I think about autonomous goal chains.

The QC checkpoint is a different kind of goal.

Goal 14 adds no mechanics, no art, no content.

It runs the full gameplay loop — boot, menu, play, waves, boss, victory, New Game Plus, restart, lose, game over, restart — and checks that everything reports correctly through every transition.

I expected Goal 14 to be a formality. A quick pass, green tests, done in five minutes. Instead it ran for 31 minutes and surfaced four bugs I hadn’t noticed during any of the previous six goal runs.

~31 minutes.

The longest goal in the entire project, longer than Part 1’s 26-minute polish goal.

The /goal command for the QC checkpoint — same paste-and-go format as every other goal

What the QC checkpoint caught and fixed:

  • Stale game-over state. Game data wasn’t clearing properly on restart. A player who died and restarted carried ghost data from the previous run.
  • Ion Blast firing edge case. The SPACE key needed hardening to prevent firing during pickup collection — a timing conflict between the weapon system and the power-up system.
  • Wave projectile cleanup. Enemy projectiles from cleared waves weren’t cleaning up on scene transition. They accumulated silently across wave boundaries.
  • Final wave timing. Timing adjustments were needed to ensure wave completion triggers the boss reliably instead of leaving the player in a dead state.

Every one of these is an integration bug. Each individual goal passed its own tests. But when the full loop ran — menu to waves to boss to victory to restart to game over to restart again — transitions between systems exposed assumptions that no single goal’s tests would catch.

👉 The QC checkpoint tests the seams between goals.

67 tests at the end. All passing. Including a new end-to-end test that exercises both the full win path and the full loss path.

When seven autonomous goal runs build on each other without human review between them, compound integration risk is real. Catching it automatically — before manual playtesting — makes the entire chain trustworthy.

Goal 14 completion — regressions found and fixed, 67 tests passing, integration spec added, ~31 minutes

.

.

.

Does It Actually Work?

Same instinct as every post in the series: close the terminal and play it.

Part 1’s version was fun for 90 seconds.

Part 2’s version kept me playing for longer than I’d admit.

Open the browser. Menu screen, now branded “Ion Viper.” Press SPACE.

Enemies descend, but they’re not uniform anymore. Drifters float down predictably while shooters fire back. Chargers telegraph with a flash, then rush straight at you. Spawn positions are randomized, so every playthrough plays out differently.

A glowing pickup drops. Collect it — Ion Blast activates. Your ship fires a spread of projectiles for a limited window. Clear the waves fast.

And then the boss arrives. Health bar at the top of the screen. Three attack phases. Dodge patterns change as the boss takes damage. Beat it — victory screen. “Continue to New Game Plus.” Accept.

Loop 2 starts.

Enemies are faster and tougher. The boss has more health. Die — game over, final score, restart option.

(I restarted four times before I remembered I was supposed to be writing this post.)

What Part 2 added: Ion Blast power-up, 3 enemy archetypes, randomized wave positioning, boss fight with phases, victory screen, New Game Plus difficulty loops.

What’s still absent: multiple power-up types, more than one boss, mobile controls, persistent leaderboards.

The game is playable.

It has a real ending, replayability through difficulty escalation, and it delivers what the enhancement prompt asked for.

Play it yourself: https://stunning-paprenjak-6fc3dd.netlify.app/

.

.

.

The Numbers

Part 2 build summary:

  • Skill invocation: 1 (extend mode — generated 7 new goals)
  • Goal runs: 7 (Goals 08–14)
  • Total autonomous time: ~133 minutes (~14 min skill + ~119 min goals)
  • Total human input: one enhancement paragraph + “Go with your recommendations” + 7 /goal pastes
  • Tests: 67 (up from 39 at end of Part 1)
  • Source files: 28 (up from 18)
  • Lines of game code: ~3,265 (up from 1,463)
  • Assets: 19 files (images + audio + data)

Combined totals (Part 1 + Part 2):

Metric | Part 1 | Part 2 | Combined
Goals | 8 | 7 | 15
Autonomous time | ~78 min | ~133 min | ~211 min
Tests | 39 | 67 | 67 (cumulative)
Lines of code | 1,463 | 3,265 | 3,265 (cumulative)

Series comparison (all four posts):

Post | Domain | Goals | Autonomous time | Tests
1 | WordPress plugin | 1 | 28 min | browser checks
2 | CLI tool | 2 | 80 min | 35
3 | Browser game (Part 1) | 8 | 78 min | 39
4 | Browser game (Part 2) | 7 | 133 min | 67

Worth noting: Part 2’s per-goal average (~17 minutes) is higher than Part 1’s (~10 minutes). Feature goals on an existing codebase are more complex than greenfield goals — more existing code to understand, more systems to coordinate, more tests to run. The codex goal command does the same thing in both cases, but the surrounding context is heavier.
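
Those averages can be reproduced from the numbers in the summary. The Part 2 figure uses the ~119 minutes of goal runtime, leaving out the ~14-minute skill invocation, and the Part 1 figure takes the ~78 minutes as reported. per_goal_average is a hypothetical helper name used only for this illustration:

```shell
#!/usr/bin/env bash
# Worked arithmetic for the per-goal averages quoted above.

# Usage: per_goal_average <total_minutes> <goal_count>
per_goal_average() {
  awk -v total="$1" -v goals="$2" 'BEGIN { printf "%.2f\n", total / goals }'
}

per_goal_average 119 7   # Part 2, goal runtime only: prints 17.00
per_goal_average 78 8    # Part 1: prints 9.75, which rounds to ~10
```

Dividing the full ~133 minutes by 7 would give 19 instead, so the ~17 figure only holds when the skill invocation is counted separately from the goal runs.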

.

.

.

Skills Are Living Documents

Three posts ago, wp-spec-to-goal generated one goal. Two posts ago, cli-spec-to-goal detected that a project needed splitting and generated goals one at a time. One post ago, webg-spec-to-goal generated all goals at once from a single paragraph.

This post: the same skill learned a second mode.

The pattern across the series is consistent.

Each skill starts simple and gets smarter based on what the project demands. Domain knowledge accumulates in the skill — in its instructions, its templates, and its logic for deciding when to scaffold versus when to extend.

What webg-spec-to-goal knows now that Part 1’s version didn’t:

  • How to read an existing project’s state before generating goals
  • How to continue goal numbering from an existing sequence
  • How to inherit existing game state without duplicating it
  • That a QC checkpoint at the end of a long chain catches integration bugs that individual goals miss

None of this was planned in advance.

The skill evolved because the project demanded it.

Here’s what changed in how I think about this: the investment in the skill compounds. Build webg-spec-to-goal once for Part 1, extend it for Part 2, reuse the extend mode and QC checkpoint for the next game after that. The codex goal command executes what the skill produces — and the skill keeps getting better at producing the right thing.

The skill is open source at github.com/nathanonn/agent-skills. The game repo is at github.com/nathanonn/ion-viper. Clone it, run npm install && npm run dev, and play the version that 15 autonomous goal runs built.

What’s next: skills for other domains.

The three sibling skills (wp-spec-to-goal, cli-spec-to-goal, webg-spec-to-goal) share the same structural patterns but each one carries domain knowledge the others lack.

The question is where that pattern goes next.


How I Used 8 Codex /goal Runs to Build a Browser Game From Scratch

One paragraph describing a game idea.

Eight /goal commands. Seventy-eight minutes of autonomous Codex time.

A playable vertical shooter in the browser.

The WordPress plugin post was one goal, 28 minutes. The CLI tool post was two goals, 80 minutes. This time: eight goals, ~78 minutes of autonomous Codex time, and the output is something you can play with a keyboard.

Here’s the paragraph that started it:

Build me a Raiden-type vertical shooter web game. The player ship is at the bottom of the screen, enemies come down from the top, and the player shoots bullets upward to destroy them. Standard 800×600 resolution. Use pixel art style.

That paragraph took three tries.

My first version was “make me a fun shooter game” — too vague for a skill that needs to generate 8 goals from it. The final version named the genre, the perspective, the resolution, and the art style.

Four sentences. Five minutes of thinking to save 78 minutes of supervision. What came out the other end is Ion Viper — a Phaser 3 game with player movement, shooting, pooled enemies, wave progression, scoring, health, a HUD, pixel art, particles, screen shake, and sound effects.

The honest caveat, same as always: the codex goal command produced the code, but the spec produced the outcome. This time the spec wasn’t a single goal trio — it was eight of them, generated all at once by a skill that understands how games decompose.

If you’re new to the series, the WordPress plugin post covers what /goal is, the goal trio (GOAL.md, VERIFY.md, PROGRESS.md), and the continuation prompt.

This post assumes you’ve read at least one of the previous two.

VS Code file explorer showing a clean repo with only .codex/skills containing webg-spec-to-goal and playwright-cli folders, beside a Codex 0.138.0 terminal session with gpt-5.5 xhigh model pointed at ~/Dev/raiden-shooter
Codex terminal showing the $webg-spec-to-goal skill invocation with the plain-English game description typed into the composer — gpt-5.5 xhigh model, 0% context used

.

.

.

Meet webg-spec-to-goal — The Skill That Generates Every Goal at Once

The WordPress plugin post introduced wp-spec-to-goal, a Codex Agent skill that turns a paragraph into the goal trio. That skill produced one goal.

The CLI tool post introduced cli-spec-to-goal, which detected the project needed splitting and generated one goal at a time — with a plan for the rest.

webg-spec-to-goal generates ALL goals in a single pass. All three skills from this series are open source at github.com/nathanonn/agent-skills.

Why?

Because games have a predictable build order that the skill can exploit:

# | Goal layer | What it builds
00 | Foundation | Verify scaffold boots, state bridge works
01 | Core Mechanic | The thing that makes this game this game (shooting)
02 | Content | Opposition (enemies)
03–04 | Progression + Systems | Scoring, health, HUD
05–06 | Depth | Waves, difficulty
07 | Polish | Art, sound, particles, juice

The skill inspects the game description, auto-detects the genre (shoot-em-up, card game, platformer, tower defense, puzzle), tailors the goal decomposition to that genre, and writes everything — the scaffold, the plan, and all 8 goal trios.

One concept needs explanation: the state bridge.

Every goal adds typed fields to window.__GAME_STATE__, and Playwright tests read that object to verify game behavior without screenshot pixel comparison. Goal 00 adds scene and ready. Goal 01 adds playerPosition and playerAlive. By Goal 06, the state bridge carries 13 fields. Verification tests assert against these fields — meaning an automated test can confirm “the player is alive and at position (400, 500)” without trying to parse pixels from a rendered canvas.
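To make the state bridge concrete, here is a minimal sketch of how a scene might publish typed fields and how a check might read them back. Only `window.__GAME_STATE__` and the four field names come from this post; the `GameState` shape, the `publishState` helper, and the `host` object (standing in for `window` so the sketch runs outside a browser) are assumptions for illustration, not the skill's actual code.

```typescript
// Fields named in the post for Goals 00 and 01; the exact types are assumed.
interface GameState {
  scene: string;
  ready: boolean;
  playerPosition?: { x: number; y: number };
  playerAlive?: boolean;
}

// Anything that can carry the bridge object. In a browser this would be `window`.
interface BridgeHost {
  __GAME_STATE__?: GameState;
}

// Hypothetical helper: merge new fields into the bridge without dropping old ones.
function publishState(host: BridgeHost, patch: Partial<GameState>): GameState {
  const previous: GameState = host.__GAME_STATE__ ?? { scene: "boot", ready: false };
  const next: GameState = { ...previous, ...patch };
  host.__GAME_STATE__ = next;
  return next;
}

// Stand-in for `window` so the sketch runs in plain Node.
const host: BridgeHost = {};
publishState(host, { scene: "game", ready: true }); // Goal 00 fields
publishState(host, { playerPosition: { x: 400, y: 500 }, playerAlive: true }); // Goal 01 fields
```

In a Playwright test, the same object would typically be read with something like `page.evaluate(() => window.__GAME_STATE__)` and compared against expected values, which is how a test can check the player position without inspecting pixels.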

That’s what makes automated testing possible for a visual, interactive medium.

From the prompt, the skill detected “shoot-em-up” and proposed defaults:

  • name/slug (raiden-shooter, matching the repo name),
  • resolution (800×600),
  • mechanics (single weapon, no boss, no power-ups for a tight MVP),
  • and art style (pixel art with a later polish goal for audio).

I confirmed with “Go with your recommendations.”

Phaser domain knowledge also lands in AGENTS.md — scene lifecycle, object pooling, delta-time movement, state initialization in init() instead of constructors — so Codex has it available during every /goal run.
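Two of those ideas, object pooling and delta-time movement, are easy to show without Phaser itself. The sketch below is a framework-free illustration; the `Bullet` and `BulletPool` names, the pool size, and the speed value are made up for this example and are not taken from the generated AGENTS.md.

```typescript
interface Bullet {
  x: number;
  y: number;
  active: boolean;
}

class BulletPool {
  private readonly bullets: Bullet[];

  constructor(size: number) {
    // Allocate every bullet once, up front, so firing never allocates.
    this.bullets = Array.from({ length: size }, () => ({ x: 0, y: 0, active: false }));
  }

  // Reuse the first inactive bullet; return undefined when the pool is exhausted.
  fire(x: number, y: number): Bullet | undefined {
    const bullet = this.bullets.find((b) => !b.active);
    if (!bullet) return undefined;
    bullet.x = x;
    bullet.y = y;
    bullet.active = true;
    return bullet;
  }

  // Move by speed * elapsed time so motion does not depend on frame rate.
  update(deltaMs: number, speedPxPerSec: number): void {
    for (const bullet of this.bullets) {
      if (!bullet.active) continue;
      bullet.y -= speedPxPerSec * (deltaMs / 1000);
      if (bullet.y < 0) bullet.active = false; // off the top: back into the pool
    }
  }

  activeCount(): number {
    return this.bullets.filter((b) => b.active).length;
  }
}

const pool = new BulletPool(2);
const first = pool.fire(400, 500);
pool.update(500, 400); // half a second at 400 px/s moves the bullet up 200 px
```

In a Phaser scene, the `delta` argument passed to `update(time, delta)` would play the role of `deltaMs` here.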

Codex output showing the webg-spec-to-goal skill running pwd, rg --files, and ls -la to explore the empty repo before asking any scoping questions
Codex presenting 5 scoping decisions with recommended defaults — name/slug, genre detection (shoot-em-up), scope (player ship, bullets, enemies, health, score), mechanics (single weapon, no boss, no power-ups), and audio (include in later polish goal) — with the user replying "Go with your recommendations"

.

.

.

The Scaffold and the Plan

The skill writes two things.

The scaffold is a runnable Phaser 3 project — TypeScript, Vite, Playwright, scene shells, state bridge — that boots at localhost:8080 and passes 3 tests out of the box.

The plan (goals-plan.md) maps all 8 goals in sequence with dependencies, acceptance criteria, and the state bridge growth table.

Here’s what the skill produced — and what each goal actually took:

# | Goal | Time | Tests
00 | Foundation | 7m 40s | 3
01 | Player Ship | 6m 17s | 9
02 | Player Weapons | 6m 06s | 14
03 | Enemies | 8m 57s | 19
04 | Scoring & Health | 8m 54s | 24
05 | HUD | 6m 22s | 29
06 | Wave System | 7m 16s | 34
07 | Polish | 26m 32s | 39

Those are the real numbers from the actual run.

The test count column tells a story on its own. Each goal adds its own Playwright tests AND runs all previous ones. By Goal 06, the 34-test regression suite covers every system built so far. The “fields accumulate, never remove” rule in the state bridge contract is what keeps those regressions honest.
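As a quick sanity check on the table, the short script below sums the per-goal times, works out how many tests each goal added, and compares the polish run with the average of the other seven. The numbers are copied from the table; the variable names are just for this example.

```typescript
// [goal, minutes, seconds, cumulative test count], copied from the table.
const runs: Array<[string, number, number, number]> = [
  ["00 Foundation", 7, 40, 3],
  ["01 Player Ship", 6, 17, 9],
  ["02 Player Weapons", 6, 6, 14],
  ["03 Enemies", 8, 57, 19],
  ["04 Scoring & Health", 8, 54, 24],
  ["05 HUD", 6, 22, 29],
  ["06 Wave System", 7, 16, 34],
  ["07 Polish", 26, 32, 39],
];

// Total autonomous time: 4684 seconds, i.e. 78 minutes and 4 seconds.
const totalSeconds = runs.reduce((sum, [, min, sec]) => sum + min * 60 + sec, 0);
const totalMinutes = Math.floor(totalSeconds / 60);
const leftoverSeconds = totalSeconds % 60;

// Tests added by each goal = its cumulative count minus the previous goal's count.
const testsAdded = runs.map(([, , , tests], i) => tests - (i === 0 ? 0 : runs[i - 1][3]));

// Polish compared with the average of the other seven runs (roughly 3.6x).
const polishSeconds = 26 * 60 + 32;
const otherAverage = (totalSeconds - polishSeconds) / (runs.length - 1);
const polishRatio = polishSeconds / otherAverage;
```

The per-goal deltas come out as 3, 6, 5, 5, 5, 5, 5, 5, which matches the "5 new tests plus the full regression" pattern reported for the later goals.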

Codex output showing the skill reading scaffold and goal templates, confirming the Phaser 3 + Vite + Arcade Physics + state bridge + Playwright scaffold, and beginning to write files
Split view — left: VS Code file explorer showing the full scaffolded structure with goals/00-foundation through 07-polish, src/, tests/, public/. Right: Codex completion summary listing all created files, verification results (npm install, npm run typecheck passed, npm test 3/3 boot tests passed), and the dev server running at http://127.0.0.1:8080

.

.

.

Goal 00: Foundation — The Sanity Check

The /goal command is identical in shape to the WP and CLI versions:

/goal Complete goals/00-foundation/GOAL.md. Use goals/00-foundation/VERIFY.md
as the verification contract. Update goals/00-foundation/PROGRESS.md continuously.
Treat uncertainty as incomplete.

Paste. Press enter. Walk away.

Codex terminal showing the /goal command for 00-foundation pasted into the composer, ready to execute

Codex customized the constants, wired the menu-to-game transition, set up the state bridge base fields, and wrote Playwright tests covering canvas size, console errors, and scene transitions.

7 minutes and 40 seconds. 3 tests passing. Menu screen rendering.

A lightweight sanity check — confirm the scaffold actually works before building on top of it. Every game starts here.

Goal 00 completion summary showing changed files (constants.ts, MenuScene.ts, helpers.ts, boot.spec.ts, PROGRESS.md), verification results (npx tsc --noEmit, Playwright tests, npm test), screenshot artifacts (menu.png, game-scene.png), and final time of 7m 40s

.

.

.

Goals 01–06: The Assembly Line

Six goals, each one building on the last, each one autonomous, each one inheriting the full codebase state from the run before it.

Goal 01 — Player Ship (6m 17s): Player sprite at bottom-center, WASD and arrow keys, clamped to bounds. 9 tests.

Goal 02 — Player Weapons (6m 06s): SPACE fires pooled bullets upward with fire-rate limiting. 14 tests.

Goal 03 — Enemies (8m 57s): Enemies spawn from the top, move downward, and are destroyed by bullet overlap. 19 tests. This is the moment the game becomes a game: there is finally something on screen to absorb your shots.

Goal 03 completion showing pooled enemies, timed spawning, downward movement, bullet-enemy overlap destruction, offscreen recycling, enemy state bridge reporting — 19/19 tests passed, including full regression, in 8m 57s

Goal 04 — Scoring & Health (8m 54s): Score on kill, health on contact, invulnerability frames, game over on death. 24 tests.

Goal 05 — HUD (6m 22s): Parallel HUD scene showing score, health, and wave number without blocking gameplay input. 29 tests.

Goal 06 — Wave System (7m 16s): Data-driven waves replace endless spawns. Difficulty escalates. Clearing the final wave triggers a win state. 34 tests.

Goal 06 completion showing data-driven WaveSystem.ts, wave configuration in waves.ts, win-state implementation, 5 new tests plus full 34-test regression — all passed. Screenshots captured. 7m 16s

Here’s what makes the assembly line work:

Each codex goal command run starts by reading the entire codebase that previous goals built. Codex understands the existing scene structure, the existing test patterns, the existing state bridge fields. Goal 06 extends what Goals 00 through 05 established — it doesn’t start from scratch.

Somewhere around Goal 05…

Pasting the /goal command stopped feeling like an experiment and started feeling like filling out a form. Copy the command, swap the folder name, press enter. The novelty was gone by the HUD goal. That’s the point — when the eighth paste feels boring, the pattern has landed.
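Since only the folder name changes between runs, the command is easy to template. A small sketch: the `goalCommand` helper is hypothetical and simply reproduces the command text shown earlier in this post.

```typescript
// Builds the /goal prompt for one goal folder, e.g. "00-foundation".
function goalCommand(folder: string): string {
  const dir = `goals/${folder}`;
  return [
    `/goal Complete ${dir}/GOAL.md. Use ${dir}/VERIFY.md`,
    `as the verification contract. Update ${dir}/PROGRESS.md continuously.`,
    `Treat uncertainty as incomplete.`,
  ].join("\n");
}

// The text to paste for the final goal in this build.
const polishCommand = goalCommand("07-polish");
```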

And the test regression discipline kept compounding.

Goal 06 runs 34 tests — 5 new ones for the wave system plus all 29 inherited from previous goals.

Nothing broke.

That’s the state bridge contract at work: fields accumulate, never get removed, and every prior test still finds what it expects.
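One way to picture that contract is as a superset check: every field an earlier goal exposed must still exist after a later goal. The sketch below assumes the state is a flat object; the `removedFields` function and the `score` field are illustrative and not taken from the generated project.

```typescript
type StateSnapshot = Record<string, unknown>;

// Returns the names of fields that existed before but are missing now.
function removedFields(before: StateSnapshot, after: StateSnapshot): string[] {
  return Object.keys(before).filter((key) => !(key in after));
}

const afterGoal01: StateSnapshot = {
  scene: "game",
  ready: true,
  playerPosition: { x: 400, y: 500 },
  playerAlive: true,
};

// A later goal that only adds a field (hypothetical name) keeps the contract.
const afterLaterGoal: StateSnapshot = { ...afterGoal01, score: 0 };

// A state that drops fields would break earlier tests.
const broken: StateSnapshot = { scene: "game", ready: true };

const okDiff = removedFields(afterGoal01, afterLaterGoal);
const badDiff = removedFields(afterGoal01, broken);
```

An empty result means earlier tests can still find everything they read; a non-empty one names exactly which fields a regression would trip over.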

.

.

.

Goal 07: Polish — Where the Game Comes Alive

The final goal is the longest — 26 minutes and 32 seconds — and the most visually dramatic. Placeholder rectangles become pixel art. Silence becomes sound.

Codex terminal showing the /goal command for 07-polish pasted into the composer

What Codex did in this goal:

  • Generated pixel art assets using $imagegen — player ship, enemy drone, bullets, explosion particles, space background, parallax stars — all following the magenta chromakey pipeline documented in AGENTS.md.
  • Created sound effects (fire, hit, explosion, player damage) and background music.
  • Built a FeedbackSystem: particle emitters on enemy destruction, camera shake on player damage.
  • Added parallax scrolling background with a tiling star layer.
  • Polished MenuScene and GameOverScene presentation.
  • Ran all 39 tests — 5 new polish tests plus the full 34-test regression. All green.

One honest limitation Codex flagged: it can verify that audio assets load and trigger correctly, but whether they actually sound good is a human judgment.

26 minutes and 32 seconds. More than three times the average of the other seven goals.

Polish is where the token budget earns its keep — art generation, audio integration, particle tuning, and full regression across the entire test suite.

Goal 07 completion showing pixel art assets generated, WAV sound and music integrated, BootScene loading, parallax stars, particles, camera shake, FeedbackSystem wiring, state bridge fields unchanged, 39/39 tests passed including full regression, with the audio balance caveat noted. 26m 32s

.

.

.

Does It Actually Work?

Same instinct as the previous two posts: close the terminal and test it like a real player.

The first time I opened localhost:8080 after Goal 07 finished, I expected colored rectangles with sound effects layered on top. Pixel art loaded instead — a ship, a background with parallax stars, enemies with actual sprites. The gap between what I typed into the composer and what appeared in the browser was wider than any of the previous posts.

Open the browser. Menu screen with the game title. Press SPACE.

Player ship at the bottom, pixel art sprites, parallax background scrolling. Enemies descend in waves. Bullets fire upward with sound effects. Enemies explode with particles and a satisfying pop. Get hit — screen shakes, health drops, the ship flashes with invulnerability frames. Clear all waves — win state. Die — game over screen with the final score and a prompt to restart.

Here’s the gameplay:

What’s there: movement, shooting, enemies, waves, scoring, health, HUD, pixel art, sound, music, particles, screen shake. All built by Codex across 8 autonomous runs.

What’s not there: power-ups, boss fights, multiple enemy types beyond the basic drone, weapon upgrades, mobile controls.

The game is playable.

It’s fun for about 90 seconds.

It delivers exactly what the spec asked for.

(You’ll notice some screenshots show “Raiden Shooter” in file paths and title screens — that was the working name during development. The published game is Ion Viper.)

.

.

.

The Numbers

Concrete summary of the build:

  • Skill invocation: 1 (generated all 8 goals at once)
  • Goal runs: 8
  • Total autonomous time: ~78 minutes across all goals
  • Total human input: one paragraph + “Go with your recommendations”
  • Source files: 18 TypeScript files across scenes, objects, systems, and configs
  • Test files: 8 Playwright spec files
  • Tests: 39 (all passing)
  • Assets: 9 image files, 5 audio files
  • Lines of game code: 1,463

Series comparison:

Post | Domain | Goals | Autonomous time | Tests
1 | WordPress plugin | 1 | 28 min | browser checks
2 | CLI tool | 2 | 80 min | 35
3 | Browser game | 8 | 78 min | 39

The trend: more goals, same pattern. Within this build, the time per goal stayed roughly constant (~6–9 minutes each, except polish at 26 minutes).

.

.

.

What’s Left (And Why That’s the Point)

The game is complete as specified. All 8 goals passed. All 39 tests are green. Every verification contract is satisfied.

And it’s clearly not done.

No power-ups. No boss fights. One enemy type. No weapon upgrades. Three waves. No difficulty curve beyond wave configuration. No mobile controls. No deployment.

This is intentional.

The MVP proves the pattern works for games.

The next post (Part 2) will:

  1. Add new goal slices — power-ups, boss fights, more enemy types, weapon upgrades, difficulty tuning.
  2. Improve the webg-spec-to-goal skill based on what we learned.

The split mirrors how real game development works.

You build the core loop, validate it, then layer features on top. Goal chaining makes that natural — each new slice inherits the full state of what was built before.

If you want to play with it now, the repo is public at github.com/nathanonn/ion-viper. Clone it, run npm install && npm run dev, open localhost:8080, and shoot some drones.

Part 2 is where it gets interesting.

.

.

.

The Bigger Picture

Three posts, three domains, one pattern.

The codex goal command works the same way whether you’re building a WordPress plugin, a CLI tool, or a browser game. Skill generates spec, /goal executes spec, PROGRESS.md proves it.

The skill is the variable.

wp-spec-to-goal knows WordPress hooks and wp-env. cli-spec-to-goal knows Commander.js and exit codes. webg-spec-to-goal knows Phaser scenes, object pooling, and the state bridge pattern. Domain knowledge lives in the skill. The execution loop stays the same.

What scales: each new skill is a one-time investment.

Build webg-spec-to-goal once, use it for every Phaser game from now on. Five minutes turning a paragraph into 8 goals — the same investment whether you’re building your first game or your tenth.

That is what changed between March, when the Reddit summarizer needed six supervised steps, and now, when the same kind of project needs one skill invocation and a paste.

Your job is still writing the plan. The plan just got more structured.

10 min read The Art of Vibe Coding

How I Chained Two Codex /goal Runs to Build a Complete CLI Tool

One paragraph describing a CLI idea.

Two /goal commands. Eighty minutes of letting the machine work.

A complete CLI tool at the end — 35 tests passing across 6 files, build green, typecheck green, every acceptance criterion mapped to evidence.

Last week’s post showed /goal building a single WordPress plugin in 28 minutes. One goal, one feature, one walk-away-and-come-back. That was the proof of concept.

This time: two goals, chained. The first goal built the MVP command surface in 32 minutes. The second added OAuth authentication in 47 minutes. Both ran fully autonomously while I was doing something else.

The vehicle for this build is one you might recognize — the same Reddit summarizer I built back in March using the 6-step Workflow Engineering process. That version was an Express server with REST endpoints and active supervision throughout. It worked. But every time I wanted an AI agent to pull Reddit data, it had to start the server, make HTTP calls, parse responses — burning tokens on ceremony. A CLI tool that an agent can invoke directly from the command line, get structured JSON back, and move on? That consumes a fraction of the tokens and saves real money on Claude and Codex subscriptions.

So I rebuilt it.

Back in March, building that summarizer required active supervision across six steps — spec brainstorm, review, test plan, implementation plan, execute, test. This time: two skill invocations and two /goal commands, with less than ten minutes of human input across the whole thing.

The codex goal command scales beyond single features.

When a project is too big for one goal, you slice it — and the skill can help you find the seams.

VS Code file explorer showing only .codex/skills/cli-spec-to-goal and playwright-cli folders — the starting state with skills checked in but no implementation code

.

.

.

Meet cli-spec-to-goal — The Skill That Splits

Last week’s post introduced wp-spec-to-goal, a Codex Agent skill that turns a vague paragraph into the GOAL.md / VERIFY.md / PROGRESS.md trio that Codex needs to finish a goal autonomously. That skill was designed for WordPress plugins.

cli-spec-to-goal is its counterpart for CLI tools.

Same core workflow — take a vague idea, ask focused questions, produce the goal bundle plus an optional project scaffold. One key difference sets it apart.

Automatic complexity detection.

The WP skill always produced one goal.

The CLI skill inspects the spec and judges whether the project fits in a single goal or should be split into multiple slices. When it decides to split, it writes a goals-plan.md at the project root with numbered slices and generates the first goal bundle only — leaving the rest for later invocations.

(That split detection turned out to be the most interesting part of the whole build. More on that in a moment.)

Every goal it generates includes six AI-agent-friendly patterns: --json output mode, stdout/stderr separation, TTY detection, meaningful exit codes, structured errors, and --dry-run previews. These patterns make the resulting CLI safe for AI agents to invoke directly. The repo has the full list in the skill’s reference templates.
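A rough sketch of how a few of these patterns fit together: structured errors, meaningful exit codes, stdout/stderr separation, and a JSON mode that is also used when stdout is not a TTY. The type names, the exit code values, the TTY fallback rule, and the `render` function are assumptions for illustration; the generated CLI may organize this differently.

```typescript
// Hypothetical exit codes; the post does not list the real ones.
const EXIT = { ok: 0, usage: 2, auth: 3, network: 4 } as const;

interface CliError {
  code: Exclude<keyof typeof EXIT, "ok">;
  message: string;
}

type CliResult<T> = { ok: true; data: T } | { ok: false; error: CliError };

interface Rendered {
  stdout: string; // results only, so agents can parse it safely
  stderr: string; // diagnostics and errors
  exitCode: number;
}

function render<T>(result: CliResult<T>, opts: { json: boolean; isTTY: boolean }): Rendered {
  // Machine-readable output when asked for, or when output is piped (assumed rule).
  const asJson = opts.json || !opts.isTTY;
  if (result.ok) {
    const stdout = asJson ? JSON.stringify(result.data) : String(result.data);
    return { stdout, stderr: "", exitCode: EXIT.ok };
  }
  const stderr = asJson
    ? JSON.stringify({ error: result.error })
    : `Error: ${result.error.message}`;
  return { stdout: "", stderr, exitCode: EXIT[result.error.code] };
}
```

In a Node CLI, `isTTY` would usually come from `process.stdout.isTTY`, and the returned `exitCode` would be assigned to `process.exitCode` after the two streams are written.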

For the invocation, I typed one paragraph.

It described what the existing Express server does, said I wanted a CLI version that’s easier for AI agents to interact with, and pointed at the server codebase as a read-only reference.

Codex terminal showing the cli-spec-to-goal skill invocation with a plain-English description of converting the Express/TypeScript server into an AI-agent-friendly CLI tool, with the server codebase path provided as reference

The skill probed both repos before asking anything.

It read the empty target repo, then explored the Express server’s source files — config, Reddit API client, storage patterns, route handlers, rate limiting, test structure. It identified this as a project broad enough to warrant splitting.

Codex exploring the empty target workspace and reading the Express server's source files, noting the project is shaped like a new CLI that can reuse the server's Reddit client and storage logic

.

.

.

The Split Decision

This is the moment the skill earned its keep.

After probing both repos — the empty target and the existing Express server — it came back with three decisions to confirm:

  1. Scope shape: Split plan + first goal (recommended) vs. one combined goal
  2. Target repo: Build the CLI in the new repo, using the server as a read-only source reference
  3. First CLI surface: MVP core commands (collect, collect-all, logs list, logs read, health) using an existing env-based refresh token. No OAuth login in the first slice.

Codex presenting three decisions with recommended defaults: split plan plus first goal for scope shape, build in the new repo for target, and MVP core commands for the first CLI surface, with reasoning for each recommendation

I replied “use recommendations” and the skill started writing.

It pulled its reference templates — goal, verify, and progress — and began generating. Two and a half minutes of file creation later, here’s what appeared:

A goals-plan.md listing three proposed goal slices (MVP, auth, config polish).

A complete goal trio for the first slice: GOAL.md with 4 user stories and 15 acceptance criteria, VERIFY.md with binary smoke checks, functional checks, exit code checks, and integration checks, and a skeleton PROGRESS.md ready for Codex to fill in during the goal run.

It also produced the exact /goal command to paste — tailored to the file paths it had just created.

Total time from invocation to handoff: 4 minutes 45 seconds.

Skill completion output listing the generated files — goals-plan.md plus goals/reddit-cli-mvp/GOAL.md, VERIFY.md, and PROGRESS.md — with the tailored /goal command ready to paste and a 4m 45s elapsed time
VS Code file explorer showing the generated structure: .codex/skills with cli-spec-to-goal and playwright-cli, goals/reddit-cli-mvp with GOAL.md, PROGRESS.md, and VERIFY.md, plus goals-plan.md at the root

.

.

.

Goal 1: The MVP (32 Minutes)

The handoff: paste the /goal command, press enter, walk away.

/goal Complete goals/reddit-cli-mvp/GOAL.md. Use goals/reddit-cli-mvp/VERIFY.md
as the verification contract. Update goals/reddit-cli-mvp/PROGRESS.md continuously.
Treat uncertainty as incomplete.

Short command.

Heavy lifting lives in the files the skill already wrote.

Codex terminal showing the /goal command for reddit-cli-mvp pasted into the composer, ready to execute

Codex activated the goal.

It read GOAL.md, VERIFY.md, and PROGRESS.md, inspected the current repo state, checked for any existing scaffold or generated files, and started implementing. It used the Express server source as a read-only reference — pulling the Reddit client patterns, filtering logic, and JSON log structure — while building the TypeScript CLI from scratch.

Codex goal activation showing it reading the goal trio files, exploring the repo tree, checking for existing package.json and README.md, and beginning its implementation plan

Then the black box.

I left. Codex worked.

I went to make coffee. Checked YouTube while it brewed. The Codex session kept running through file edits, test runs, and self-audits in the background. The session was busy. I was elsewhere.

32 minutes and 44 seconds later, the goal was marked complete.

What shipped:

  • The full TypeScript ESM CLI scaffold with Commander.js,
  • 5 commands (collect, collect-all, logs list, logs read, health),
  • 26 source files,
  • 5 test files with 15 tests, all passing.

npm install, npm run build, npm run typecheck, and npm test were all green. Binary smoke checks and functional JSON/error/log checks from VERIFY.md all passed.

If you read last week’s post, this shape should look familiar. The spec defines the boundaries, the continuation prompt refuses to declare victory without evidence, and you trust the result because the audit trail is sitting right there in PROGRESS.md.

One honest note from the completion summary: no live Reddit API check was run, because no real credentials were available in the build environment. Tests used mocks, as required by the verification contract. Codex noted this explicitly — the kind of transparency you want from an autonomous run.

Goal completion summary showing 5 test files, 15 tests passed, all verification commands green including npm install, build, typecheck, and test, with binary smoke checks and functional checks from VERIFY.md passed. Goal usage 1955 seconds, worked for 32m 44s, with a "Goal achieved (32m)" badge

.

.

.

Goal 2: OAuth Auth (47 Minutes, Same Pattern)

The MVP left a deliberate gap.

It required an existing REDDIT_REFRESH_TOKEN in .env to call the Reddit API — functional for a developer who already has credentials, but no way to acquire them through the tool itself. The goals-plan.md had already named the next slice: reddit-cli-auth.

VS Code showing goals-plan.md with three proposed goal slices, the second slice (goals/reddit-cli-auth for OAuth auth commands) highlighted with a red box, indicating the natural next step

I invoked the skill again, this time with one sentence: “Now that we have completed goals/reddit-cli-mvp, I want to proceed with the next goal: goals/reddit-cli-auth.”

Codex terminal showing the cli-spec-to-goal skill invocation with the instruction to proceed with the next goal, referencing the goals-plan.md

The skill scanned the now-implemented codebase.

The repo that was empty an hour ago now had 26 source files, a working test suite, and a complete CLI structure. It read the existing Commander setup, Vitest test patterns, error handling conventions, and .env configuration. It also searched Reddit’s OAuth2 documentation to verify endpoint details and token flow specifics.

Codex reading the implemented MVP source files — cli.ts, config.ts, reddit.ts, errors.ts, types.ts — and searching Reddit OAuth documentation for authorization code flow details, redirect URI scope, and token endpoints

The GOAL.md it produced fit into the existing codebase. It referenced the same Commander program, the same test framework, the same error types, and the same .env storage approach. The skill verified there were no unresolved template placeholders, confirmed it had only generated the /goal contract files (no implementation), and cross-checked OAuth endpoint details against Reddit’s official documentation.

Skill output listing the generated goal bundle — goals/reddit-cli-auth/GOAL.md, VERIFY.md, and PROGRESS.md — with the tailored /goal command and a note that OAuth endpoints were checked against Reddit's OAuth2 documentation

Paste the second /goal command. Press enter. Walk away again.

/goal Complete goals/reddit-cli-auth/GOAL.md. Use goals/reddit-cli-auth/VERIFY.md
as the verification contract. Update goals/reddit-cli-auth/PROGRESS.md continuously.
Treat uncertainty as incomplete.

Codex terminal showing the second /goal command for reddit-cli-auth pasted into the composer

Codex activated the second goal.

Same startup pattern — read the goal trio, explore the repo, compare current implementation against the verification contract, build a plan.

Codex starting the second goal, reading goal files and exploring the existing codebase structure to understand what's already implemented before making changes

47 minutes and 15 seconds later…

auth login (full OAuth Authorization Code flow with a temporary localhost callback server), auth status (machine-readable token and identity info), and auth logout (with optional --revoke flag to invalidate the token server-side) — all wired into the existing CLI. 35 total tests across 6 files. Build, typecheck, and test all green.
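The first step of an authorization code flow is building the URL the user opens in a browser. Here is a minimal sketch of that step using Reddit's authorize endpoint and its standard query parameters. The client ID, port, callback path, and scopes are placeholder assumptions, and this is not the code Codex generated.

```typescript
interface AuthorizeOptions {
  clientId: string;
  redirectUri: string; // must match the redirect URI registered for the Reddit app
  state: string;       // random value, checked again when the callback arrives
  scopes: string[];
}

function buildAuthorizeUrl(opts: AuthorizeOptions): string {
  const params: Array<[string, string]> = [
    ["client_id", opts.clientId],
    ["response_type", "code"],
    ["state", opts.state],
    ["redirect_uri", opts.redirectUri],
    ["duration", "permanent"], // request a refresh token as well
    ["scope", opts.scopes.join(" ")],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://www.reddit.com/api/v1/authorize?${query}`;
}

// Placeholder values for illustration only.
const authorizeUrl = buildAuthorizeUrl({
  clientId: "YOUR_CLIENT_ID",
  redirectUri: "http://localhost:8765/callback",
  state: "random-state-123",
  scopes: ["identity", "read"],
});
```

After the user approves, Reddit redirects to the redirect URI with `code` and `state` query parameters. A temporary localhost server like the one described above would capture the code, compare the state value, and exchange the code for tokens.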

Second goal completion showing auth commands implemented with OAuth/auth core in src/auth.ts, auth login/status/logout wired in src/cli.ts, access-token identity verification in src/reddit.ts, test coverage in tests/auth.test.ts and tests/cli.test.ts, README updated. Worked for 47m 15s

An interesting wrinkle surfaced during this run.

The second time I pasted a /goal command and walked away, it felt… ordinary. The novelty was gone. I didn’t hover over the terminal wondering if it would work. I just left.

That’s the point. When the second walk-away feels routine, the pattern has landed.

.

.

.

Does It Actually Work?

Same instinct as the WP post: close the terminal and test it like a real user.

First, the dry run:

node bin/reddit-summarizer.js collect --subreddit ClaudeCode --dry-run --json

Terminal showing the dry run command with clean JSON output: dryRun true, command collect, wouldCallReddit false, wouldWriteLog true, subreddit ClaudeCode, minScore 10, minComments 5, hours 4, commentsPerPost 10, output log

Clean JSON showing exactly what would happen without calling Reddit or writing files. dryRun: true, wouldCallReddit: false, wouldWriteLog: true. This is the --dry-run pattern in action — an AI agent invoking this CLI can preview any command before committing to side effects.
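A dry-run mode usually comes down to describing the work instead of doing it. The sketch below builds a preview object with the same field names as the output above. The `CollectOptions` type, the `planCollect` function, the allowed `output` values other than "log" and "both", and the rule used for `wouldWriteLog` are all assumptions; the real CLI may compute these differently.

```typescript
interface CollectOptions {
  subreddit: string;
  minScore: number;
  minComments: number;
  hours: number;
  commentsPerPost: number;
  output: "log" | "stdout" | "both"; // "stdout" is an assumed value
}

interface DryRunPreview extends CollectOptions {
  dryRun: true;
  command: "collect";
  wouldCallReddit: boolean;
  wouldWriteLog: boolean;
}

// Describe the command without performing any side effects.
function planCollect(opts: CollectOptions): DryRunPreview {
  return {
    dryRun: true,
    command: "collect",
    wouldCallReddit: false, // mirrors the value shown in the dry-run output
    wouldWriteLog: opts.output !== "stdout", // assumed rule: log unless stdout-only
    ...opts,
  };
}

const preview = planCollect({
  subreddit: "ClaudeCode",
  minScore: 10,
  minComments: 5,
  hours: 4,
  commentsPerPost: 10,
  output: "log",
});
```

Because the function returns plain data, the preview can be printed with `JSON.stringify` in `--json` mode, and an agent can inspect it before deciding to run the real command.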

Then the real run:

node bin/reddit-summarizer.js collect --subreddit ClaudeCode --hours 24 \
  --min-score 10 --min-comments 5 --comments-per-post 3 --output both --json

Terminal showing the full collect command with flags for hours, score threshold, comment threshold, comments per post, and output mode

Real Reddit data came back. Posts from r/ClaudeCode with titles, authors, scores, comment counts, URLs, flairs, timestamps, and threaded comment data — machine-parseable JSON that any agent can consume directly.

Terminal showing real Reddit post data in JSON format — posts from r/ClaudeCode with titles like "Most important skill with agent coding learned so far", author names, scores, comment counts, URLs, flairs, and nested comment threads with body text and timestamps

The log file landed exactly where GOAL.md said it would — logs/ClaudeCode/2026-05-13.json — with the same structured data persisted to disk for downstream processing.

VS Code editor showing logs/ClaudeCode/2026-05-13.json with structured Reddit post data including post IDs, titles, authors, scores, comment counts, URLs, and nested comments, with the file tree showing the logs folder structure on the left

Two autonomous goal runs.

Zero manual coding.

The tool works against a live API.

The dry-run test proves the agent-safety patterns work.

The live run proves the Reddit integration works.

.

.

.

When to Split (And When Not To)

The skill detected the split automatically, but the heuristic is learnable.

Here’s when you should slice a project into multiple codex goal command runs:

Split when:

  • The project has more than ~5 acceptance criteria spanning unrelated concerns
  • There’s a natural “core first, then extensions” shape — MVP then auth, core then plugins
  • One slice needs credentials or setup that another doesn’t (OAuth login needs Reddit app registration; the MVP only needs an existing token)
  • The total scope would exhaust a single goal’s token budget

Keep as one goal when:

  • Everything shares the same test fixtures and setup
  • The feature is a single vertical slice: one user story, 3–5 acceptance criteria
  • Splitting would create artificial boundaries that increase integration risk
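The checklist can be turned into a rough decision function. This is only a sketch of the heuristic as described in the two lists; the `SpecShape` fields and the `shouldSplit` function are invented for illustration and are not how cli-spec-to-goal decides internally.

```typescript
interface SpecShape {
  acceptanceCriteria: number;
  unrelatedConcerns: boolean;      // criteria span unrelated areas
  coreThenExtensions: boolean;     // e.g. MVP first, auth later
  differingSetupPerSlice: boolean; // e.g. only one slice needs app registration
  exceedsTokenBudget: boolean;     // scope too large for a single goal run
}

// Split when any of the "split when" signals is present; otherwise keep one goal.
function shouldSplit(spec: SpecShape): boolean {
  const signals = [
    spec.acceptanceCriteria > 5 && spec.unrelatedConcerns,
    spec.coreThenExtensions,
    spec.differingSetupPerSlice,
    spec.exceedsTokenBudget,
  ];
  return signals.some(Boolean);
}

// Roughly the shape of the Reddit CLI project described in this post.
const redditCli: SpecShape = {
  acceptanceCriteria: 15,
  unrelatedConcerns: true,
  coreThenExtensions: true,
  differingSetupPerSlice: true,
  exceedsTokenBudget: false,
};

// A small single-slice feature with a handful of criteria.
const smallFeature: SpecShape = {
  acceptanceCriteria: 4,
  unrelatedConcerns: false,
  coreThenExtensions: false,
  differingSetupPerSlice: false,
  exceedsTokenBudget: false,
};
```

Here `shouldSplit(redditCli)` is true and `shouldSplit(smallFeature)` is false. A real decision would also weigh the integration risk of artificial boundaries, which a boolean checklist cannot capture.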

Here’s how goals-plan.md ties it all together:

  1. Number your slices, give each a one-line description, and generate one goal at a time.
  2. Run them sequentially.
  3. Each /goal run inherits the codebase state from the previous one — the skill detects what already exists and generates goals that fit into the structure that’s already there.

.

.

.

The Bigger Picture

The pattern is repeating.

Last week showed the codex goal command with wp-spec-to-goal for a WordPress plugin. This post shows it with cli-spec-to-goal for a CLI tool. Skill generates spec, /goal executes spec, PROGRESS.md proves it. The domain changed, the workflow stayed the same.

Goal chaining is the multiplier.

One goal proved the concept. Two goals proved it scales. Each goal run inherits the full context of what was built before, because the codebase itself is the shared state. No context window to manage between goals — just files on disk.

And here’s the full circle.

The Reddit summarizer started as a 6-step Workflow Engineering build in March — spec brainstorm, review, test plan, implementation plan, execute, test — with active supervision at every step. The same project, rebuilt as a CLI, took two skill invocations and two /goal commands. The human work was describing what to build. The machine work was everything else.

The repo is public at github.com/nathanonn/cli-reddit-summarizer. Inspect the actual GOAL.md, VERIFY.md, and PROGRESS.md files for both goals. Grab the cli-spec-to-goal skill from .codex/skills/ if you build CLI tools.

Start thinking about your next project in terms of goal slices.

19 min read The Art of Vibe Coding

How to Use Codex /goal to Build WordPress Plugins (My Spec-to-Ship Workflow)

I typed /goal.

Walked away from the keyboard. Half-expected to come back to a mess.

Twenty-eight minutes later — no mess.

A working WordPress plugin was sitting there instead. Acceptance criteria mapped to evidence. Verification commands run. Browser screenshots captured by Playwright. A PROGRESS.md audit file in git, waiting for me to read it like a report card I didn’t have to study for.

Codex completion summary showing the goal marked complete after 28 minutes elapsed, with implementation details, test results, and Playwright browser evidence

That gap — between typing the command and seeing the result — was the whole point of the experiment.

A year ago at WordCamp Johor Bahru 2025, I was on stage demoing a five-tool workflow that took fifty minutes. Today I trigger one command and leave the room.

(Progress looks a lot like laziness if you squint.)

Here’s what changed.

OpenAI shipped the /goal command in Codex CLI 0.128.0 on April 30, 2026. The official description calls it “persisted goal workflows with app-server APIs, model tools, runtime continuation, and TUI controls.”

Translated for humans: you give Codex an objective, and Codex keeps working toward it until evidence says it’s done.

That’s a different shape of AI assistance from the usual prompt-and-watch loop. Worth pausing on — because the honest caveat lands fast. The codex goal command is not magic. Garbage spec in, garbage outcome out. The other half of the win was an Agent skill I built called wp-spec-to-goal, which turns a vague paragraph into the GOAL.md, VERIFY.md, and PROGRESS.md trio that Codex actually needs to finish.

(Last August I wrote about vibe coding a WordPress plugin in 50 minutes with Claude Code. That post was honest at the time. Fifty minutes felt fast. Looking at it now? Most of those minutes were me clicking “approve,” reading diffs, and playing air traffic controller for an AI that didn’t need one.)

This post is what happened when I removed myself from that loop entirely.

By the end of it you’ll know what /goal is, how to turn it on, how to write a goal that actually completes, and how I scaffolded the spec for a working WordPress plugin in under five minutes.

Stay with me.

.

.

.

What /goal Actually Is (And Why It Changes How You Build)

Here’s the thing about a normal Codex prompt: it says “do this task once.”

/goal says something different. It says “keep pursuing this objective until evidence says done.”

Subtle distinction. Enormous consequences.

When you start a goal, Codex attaches a persisted objective to your thread. The runtime quietly tracks what you asked for, the current status, how much time has passed, and how many tokens are gone.

Then a small loop kicks in — kind of like a dog that won’t stop fetching until you take the ball away.

Codex finishes a turn. The session goes idle. The runtime checks whether the goal still needs work. If yes, Codex gets a continuation prompt and picks the next action. The cycle repeats until completion criteria are met, the token budget runs dry, or you pause it yourself.

Four states matter:

  • active
  • paused
  • complete
  • budget_limited

The TUI summary shows them when you type /goal on its own.
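To make that loop concrete, here is a minimal Python sketch of it. It is a toy model, not Codex's implementation: the `Goal` class, `run_goal`, and the two callbacks are hypothetical names I made up, and only the four state names come from the TUI.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class GoalState(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    COMPLETE = "complete"
    BUDGET_LIMITED = "budget_limited"


@dataclass
class Goal:
    objective: str
    token_budget: int
    tokens_used: int = 0
    state: GoalState = GoalState.ACTIVE


def run_goal(
    goal: Goal,
    run_turn: Callable[[Goal], int],
    is_done: Callable[[Goal], bool],
) -> GoalState:
    """Repeat turns until the goal is complete, paused, or out of budget."""
    while goal.state is GoalState.ACTIVE:
        # One turn of work; the callback reports how many tokens it spent.
        goal.tokens_used += run_turn(goal)
        if goal.state is not GoalState.ACTIVE:
            break  # something paused the goal during the turn
        if is_done(goal):
            goal.state = GoalState.COMPLETE
        elif goal.tokens_used >= goal.token_budget:
            goal.state = GoalState.BUDGET_LIMITED
    return goal.state
```

The part worth noticing is the exit conditions: the loop only stops on completion, a pause, or an empty budget. There is no "ran out of ideas" exit.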

Now here’s the small (but load-bearing) detail that makes everything else in this post possible. If you peek at Codex’s open source continuation prompt template, you’ll find the model is told to map every requirement to concrete evidence — files, command output, test results — and to treat uncertainty as not-done.

Read that last part again. Treat uncertainty as not-done.

That’s what makes 28 minutes of absence possible. Codex won’t mark a goal complete on vibes. The continuation prompt forces a real audit against real artifacts every single turn.
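Here is what "treat uncertainty as not-done" looks like as a rule, in a small Python sketch. The function names and the evidence mapping are hypothetical; this illustrates the audit rule and is not taken from Codex's prompt or source.

```python
from typing import Mapping, Optional


def audit(
    requirements: list[str],
    evidence: Mapping[str, Optional[str]],
) -> dict[str, bool]:
    """Mark a requirement done only when concrete evidence is recorded for it."""
    return {
        req: bool((evidence.get(req) or "").strip())
        for req in requirements
    }


def is_complete(
    requirements: list[str],
    evidence: Mapping[str, Optional[str]],
) -> bool:
    """A goal with no requirements, or any unproven requirement, is not complete."""
    results = audit(requirements, evidence)
    return bool(results) and all(results.values())
```

A missing entry, a `None`, and a blank string all count the same way: not done.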

Compare that to a Ralph-style outer loop, where you script the iteration yourself. Or to a single long prompt that just keeps going until the context window gets tired.

(I’ve watched enough Ralph loops drift past the third iteration to recognize /goal as a different beast entirely.)

With /goal, the runtime tracks the objective, decides whether to continue, and refuses to declare victory without proof. You hand Codex an objective and a definition of done — then step out of the way.

👉 That mental model is the foundation for everything else in this post.

.

.

.

Three Commands and a Restart

Before any of the autonomy stuff works, you need to flip a couple of switches. Don’t worry — it’s quick. Like, “faster than making instant noodles” quick.

Update Codex first. The version that introduced /goal is 0.128.0, so anything older won’t even show the command.

npm install -g @openai/codex@0.128.0

Or if your install supports the built-in updater:

codex update

Confirm with codex --version. You want 0.128.0 or newer.
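If you want to script that check (say, in a setup script for a team), a small Python sketch could look like this. I'm assuming the output of `codex --version` contains a plain `x.y.z` number somewhere in it; `supports_goal` is a hypothetical helper name.

```python
import re

MIN_VERSION = (0, 128, 0)  # first release with /goal, per this post


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first x.y.z version number found in the text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_goal(version_output: str) -> bool:
    """Compare numerically, so 0.99.0 correctly sorts below 0.128.0."""
    return parse_version(version_output) >= MIN_VERSION
```

Comparing tuples of integers avoids the classic trap where a string comparison decides that "0.99.0" is newer than "0.128.0".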

The /goal command is gated behind a feature flag, so you have to flip it on before it appears in the TUI. Run codex features list and look for goals.

Codex feature list output showing the goals row marked under-development with a value of false, highlighted with a red box

The under-development label is honest, isn’t it? Functional, but flagged. I treat that as a reminder to actually read the PROGRESS.md output afterward instead of blindly trusting the run. You should too.

Enable it:

codex features enable goals

Codex terminal showing the command codex features enable goals returning the message Enabled feature goals in config.toml

Restart Codex inside your repo. The launch banner will warn you that under-development features are enabled, and /goal will appear in the slash-command menu the moment you type /.

Codex 0.128.0 launch screen with a warning that under-development features goals is enabled and the slash-goal command in the composer ready to autocomplete

Three commands and a restart. That’s it. The whole setup.

.

.

.

The Spec Is the Work — Meet wp-spec-to-goal

Here’s where most people trip.

The temptation with a shiny new feature like /goal is to write one sentence, press enter, and hope for the best. And honestly — for trivial tasks, that works fine.

For WordPress? It falls over fast.

There are too many quiet failure modes lurking in WordPress land — capability checks the agent forgets to add, environment gates between local and production, input sanitization before lookup, output escaping in admin HTML, hook timing, and the difference between wp-cli running on your host versus inside the wp-env Docker container.

(I’ve shipped each of those mistakes at least once. My shenanigans are your free education.)

An agent that doesn’t know about those things produces code that looks correct and behaves badly. Which — if we’re being honest — is worse than code that obviously breaks. At least broken code has the decency to announce itself.

So I built an Agent skill called wp-spec-to-goal to handle the spec layer. It lives at .codex/skills/wp-spec-to-goal/, and its only job is to take a vague paragraph and produce a Codex-ready bundle:

  • A scaffolded plugin folder (PSR-4 layout, composer.json, .wp-env.json, AGENTS.md, .gitignore, package.json) — but only the parts that don’t already exist.
  • A goals/<slug>/ directory with three files: GOAL.md, VERIFY.md, PROGRESS.md.
  • A tailored /goal command to copy and paste into Codex.

The skill follows six steps: probe the repo, judge complexity, ask clarifying questions in batches, scaffold what’s missing, write the goal trio, hand off the final command.
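Before pasting any /goal command, it is cheap to confirm the trio actually landed on disk. A minimal Python sketch, assuming the goals/<slug>/ layout described above (`missing_goal_files` is a hypothetical helper, not part of the skill):

```python
from pathlib import Path

TRIO = ("GOAL.md", "VERIFY.md", "PROGRESS.md")


def missing_goal_files(repo_root: Path, slug: str) -> list[str]:
    """Return the trio files that are absent from goals/<slug>/."""
    goal_dir = repo_root / "goals" / slug
    return [name for name in TRIO if not (goal_dir / name).is_file()]
```

An empty list means the bundle is complete. Anything else is a reason to re-run the skill before starting the goal.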

Here’s the starting state for this build — an empty repo with only the two relevant skills checked in.

VS Code file explorer showing a clean repo with only .codex/skills/playwright-cli and .codex/skills/wp-spec-to-goal folders, no plugin files yet, beside a Codex terminal session

I invoked the skill with one paragraph. No formatting. No structure. No acceptance criteria. Just the rough shape of what I wanted — like handing someone a napkin sketch and saying “make this real.”

The wp-spec-to-goal skill prompt typed into Codex, describing in plain English a WordPress plugin that lets an AI agent log in via a URL with a username or email parameter and switch users automatically

The skill probes the repo first. It runs ripgrep across .wp-env.json, composer.json, package.json, AGENTS.md, the goals folder, and the plugin source. It reads its own template references. It builds a picture of what already exists before asking me anything.

Codex output showing the wp-spec-to-goal skill exploring the repo with ripgrep file searches, finding only .codex and .agents folders, and noting the repo is essentially empty aside from skills and git metadata

Then comes the clarification round.

The skill judges this as a single-goal feature, flags the security boundary (a public autologin URL is intentionally dangerous outside local/dev) as the main uncertainty, proposes the slug wp-login-for-ai, and asks for confirmation before writing any files.

Codex showing the skill's analysis: a single-goal feature with security boundary as the main uncertainty, proposing the slug wp-login-for-ai and asking the user to reply yes to proceed

I replied “yes.” The skill loaded four template references, checked git status, and started writing files.

Codex output showing the wp-spec-to-goal skill confirming defaults, loading the four reference templates, running git status, and creating the wp-login-for-ai plugin and goals directories

A few minutes later, the scaffold was done.

The skill produced a summary listing the scaffolded files, the generated goal trio, the exact /goal command to paste, and a validation note. (No unresolved template placeholders, all JSON files parse, PHP lint deferred to wp-env since the host machine doesn’t have PHP installed.)

The wp-spec-to-goal skill completion summary listing scaffolded files including the plugin entry composer.json wp-env.json package.json gitignore and AGENTS.md, plus generated GOAL.md VERIFY.md PROGRESS.md, the slash-goal command to paste, and validation results

In VS Code, the new file tree showed up clean.

VS Code file explorer showing the newly scaffolded structure: goals/wp-login-for-ai with GOAL.md PROGRESS.md VERIFY.md, the wp-login-for-ai plugin folder with composer.json and the PHP entry file, plus root-level .gitignore .wp-env.json AGENTS.md and package.json

👉 The takeaway is simple: the autonomy /goal provides downstream is paid for upfront, in the spec. Five minutes here bought me 28 minutes there.

That’s not a bad trade.

.

.

.

What the Goal Trio Actually Contains

Three files, three jobs. No moonlighting.

  • GOAL.md describes what must be true when the work is done.
  • VERIFY.md describes how Codex proves it.
  • PROGRESS.md records what happened along the way.

Why three instead of one? Because /goal continues across many turns, and the continuation prompt re-reads these files every time. Mix the responsibilities and the audit gets confused — like giving one person three different job titles and hoping they remember which hat they’re wearing. Keep them separate and Codex always knows what it’s looking at.

Here’s a slice of the actual GOAL.md the skill generated for the autologin plugin:

### US-003 - Fail safely

As a site owner,
I want the shortcut constrained to local development and invalid requests
handled safely, so that the plugin cannot become a production backdoor.

Acceptance criteria:

- [ ] AC-003.1 - The shortcut only runs when `wp_get_environment_type()` is
      `local` or `development`.
- [ ] AC-003.2 - The shortcut only runs for local development hosts such as
      `localhost`, `127.0.0.1`, or `[::1]`.
- [ ] AC-003.3 - Requests for an unknown username or email fail without
      changing the current logged-in user.
- [ ] AC-003.4 - Blocked or invalid requests return a safe machine-readable
      error and do not emit PHP warnings or notices.

Notice that every acceptance criterion has an ID. Those same IDs show up later in the completion audit table that Codex fills in. That linkage is your insurance policy — it’s how you check that the run wasn’t just theatre.
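Because the IDs follow a fixed pattern, they are easy to pull out with a script. Here is a Python sketch that assumes the checkbox format shown in the slice above (`- [ ] AC-003.1 - ...`); the function name is hypothetical.

```python
import re

# Matches lines like "- [ ] AC-003.1 - The shortcut only runs when ..."
AC_LINE = re.compile(r"^- \[([ xX])\] (AC-\d{3}\.\d+) - ", re.MULTILINE)


def acceptance_criteria(goal_md: str) -> dict[str, bool]:
    """Map each acceptance-criterion ID to whether its checkbox is ticked."""
    return {
        ac_id: mark.lower() == "x"
        for mark, ac_id in AC_LINE.findall(goal_md)
    }
```

Wrapped continuation lines are ignored, since only the line that starts a criterion carries the checkbox and the ID.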

GOAL.md also closes with a Definition of Done section:

## 13. Definition of Done

The goal is complete only when:

- [ ] Every acceptance criterion is implemented.
- [ ] Every required verification command in
      `goals/wp-login-for-ai/VERIFY.md` passes or has a documented external
      blocker.
- [ ] New or changed behavior has tests where practical.
- [ ] Existing behavior is not regressed.
- [ ] `README.md` is updated.
- [ ] `goals/wp-login-for-ai/PROGRESS.md` contains final evidence.
- [ ] /goal has performed a completion audit mapping each AC to evidence.

See that seventh bullet? The Definition of Done explicitly references the audit. Codex can’t declare victory without filling in that table. No shortcut. No “close enough.”

VERIFY.md is the verification contract — the commands Codex must run before completion, the smoke checks it must perform, and the evidence format for PROGRESS.md.

Here’s a key detail that matters more than it looks: every WordPress command routes through npx wp-env run cli rather than running native wp or composer on the host machine. Why? Because native commands target a different PHP/MySQL environment and produce results that look right but lie.

(Results that lie are — and I cannot stress this enough — the worst kind of results.)

So the skill emits this rule once, VERIFY.md enforces it again, and AGENTS.md repeats it a third time for /goal to bump into from any angle. Triple-redundant on the rules that matter.
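If you drive these commands from your own scripts, the same rule can be enforced in one place. A minimal Python sketch (`in_wp_env` is a hypothetical helper; the `npx wp-env run cli` prefix is the one described above):

```python
WP_ENV_PREFIX = ["npx", "wp-env", "run", "cli"]


def in_wp_env(*command: str) -> list[str]:
    """Build an argv list that runs the command inside the wp-env cli container."""
    if not command:
        raise ValueError("expected a command to run inside wp-env")
    return [*WP_ENV_PREFIX, *command]


# Example use with the standard library (left commented so the sketch stays inert):
# import subprocess
# subprocess.run(in_wp_env("wp", "plugin", "list"), check=True)
```

Funnelling every call through one helper makes it hard to run a bare `wp` or `composer` against the host by accident.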

PROGRESS.md starts as a skeleton — status: not started, empty completed list, empty commands table. By the end of the run, Codex fills it in. The most important section is the completion audit. Here’s a representative row from the final state:

| AC-001.1 | wp-login-for-ai/tests/run.php evidence-AC-001.1; npm run test:smoke; playwright-cli screenshot B-001/admin-dashboard.png | Pass |

Every acceptance criterion gets a row. Every row points to a real file, a real command output, or a real screenshot saved on disk. The evidence is the artifacts themselves, sitting right there on your filesystem. Anyone can audit them after the fact.
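That audit is mechanical enough to script. The Python sketch below assumes a three-column table (ID, evidence, status) with each row on a single line and no pipe characters inside the evidence cell; both function names are hypothetical.

```python
def parse_audit_rows(progress_md: str) -> dict[str, tuple[str, str]]:
    """Collect rows shaped like "| AC-001.1 | evidence | Pass |"."""
    rows: dict[str, tuple[str, str]] = {}
    for line in progress_md.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 3 and cells[0].startswith("AC-"):
            rows[cells[0]] = (cells[1], cells[2])
    return rows


def unproven(ac_ids: list[str], progress_md: str) -> list[str]:
    """List the criteria with no row, no evidence, or a non-Pass status."""
    rows = parse_audit_rows(progress_md)
    return [
        ac for ac in ac_ids
        if ac not in rows
        or not rows[ac][0]
        or rows[ac][1].lower() != "pass"
    ]
```

An empty result is necessary but not sufficient: it shows every criterion points at something, not that the something is true. You still open the artifacts.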

That’s the contract /goal operates under. Three files. One linkage. No completion without evidence.

.

.

.

Pasting the Command and Stepping Back

The handoff itself is small. Almost anticlimactic.

Open Codex inside the project, paste the tailored command from the skill output, and press enter.

Codex new session showing model gpt-5.5 high and the slash-goal command pasted into the composer: Complete goals/wp-login-for-ai/GOAL.md. Use VERIFY.md as the verification contract. Update PROGRESS.md continuously. Treat uncertainty as incomplete.

The command itself is short:

/goal Complete goals/wp-login-for-ai/GOAL.md. Use goals/wp-login-for-ai/VERIFY.md
as the verification contract. Update goals/wp-login-for-ai/PROGRESS.md
continuously. Treat uncertainty as incomplete.

Wondering why such a tiny command does so much? Because the heavy lifting already lives in the files the skill wrote. /goal just needs the contract and a few rules of engagement.

That last sentence — Treat uncertainty as incomplete — mirrors the exact wording in Codex’s own continuation prompt. Speaking the same language as the runtime is a small thing, but it helps Codex stop the right way when something blocks it.
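Since the command only varies by slug, it templates cleanly. A small Python sketch (`goal_command` is a hypothetical helper; the wording is the command shown above):

```python
def goal_command(slug: str) -> str:
    """Build the /goal command for the goal trio stored under goals/<slug>/."""
    base = f"goals/{slug}"
    return (
        f"/goal Complete {base}/GOAL.md. "
        f"Use {base}/VERIFY.md as the verification contract. "
        f"Update {base}/PROGRESS.md continuously. "
        "Treat uncertainty as incomplete."
    )
```

Calling `goal_command("wp-login-for-ai")` reproduces the command from this build as a single line.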

Codex’s first turn tells you the autonomy is kicking in. Watch what it does: explores the repo, reads all three goal files, inspects the existing scaffold, then lays out a concrete plan. Implement the handler. Run PHP checks. Run browser checks with playwright-cli. Write the final PROGRESS audit before marking the goal complete.

Codex first goal turn showing it exploring the repo, reading the goal trio files, inspecting the plugin entry and composer.json, and producing an updated five-item plan with implement test verify and audit steps

That plan is the autonomy warming up.

There’s a moment when you paste the command and your finger hovers over enter — two seconds of “did I trust the spec enough?” — and then you commit.

From here forward, I stopped paying attention.

.

.

.

The 28-Minute Black Box

There are no screenshots between this section and the next.

Nothing happened on screen worth showing you.

Codex worked. I went to make coffee. Watched some YouTube while it brewed. Answered a few emails I’d been pretending didn’t exist.

The Codex session kept running through tool calls, file edits, test runs, and self-audits in the background. The session was busy. I was elsewhere.

The whole value of the codex goal command sits in this gap.

If you sit at the screen pressing approve every two minutes, you’re using Codex like a normal prompt — and missing the point entirely. The autonomy only pays off if you actually walk away. (This is harder than it sounds. The first time feels like leaving a toddler alone with a box of markers.)

So what makes the absence feel safe?

The spec, mostly. The model executes; the spec sets the boundaries.

Scope rules in GOAL.md keep Codex from refactoring random files. Stop conditions cover ambiguous architectural decisions. VERIFY.md defines proof. The continuation prompt refuses to declare victory without it.

Each layer is a guardrail, and together they let you trust a 28-minute run more than a five-minute supervised one.

(Yes, really.)

The trade-off is real, and worth naming out loud. You give up real-time control. You get back time. The honesty test is whether the spec was tight enough to let you trust the result when you come back.

Wrote the spec yourself? Your trust is calibrated by your confidence in your own writing. Used a benchmarked skill? Your trust is calibrated by how well that skill has been tested.

In my case both gates were green. So I left.

.

.

.

Reading the Receipts

When I came back twenty-eight minutes later, my first instinct wasn’t to celebrate.

It was to scroll through PROGRESS.md half-expecting Codex to have quietly lied.

It hadn’t.

The Codex TUI showed a clean completion message.

Codex showing the goal complete message after 28 minutes, listing the implemented autologwp flow, files added including verifier coverage and mounted eval artifacts, all required commands passing including npm install npx wp-env start composer install/test/lint and npm test:smoke, plus playwright browser checks B-001 through B-004 against localhost:8888

Three categories of evidence shipped together — implementation, tests, and browser proof:

Implementation. A full autologwp handler with environment gate, host gate, user lookup via WordPress APIs, cookie clearing, new auth, and a safe redirect.

Tests. PHPUnit tests inside the plugin, npm install, npx wp-env start, composer install/test/lint via wp-env, plus npm run test:smoke, npm run lint, and npm test, all passing.

Browser proof. Four playwright-cli runs labeled B-001 through B-004, with screenshots saved to goals/wp-login-for-ai/test-artifacts/ for admin login, editor switch, email login, and invalid input handling.

I asked Codex for a summary with ASCII diagrams. The answer came back as a clean specification traced through the request lifecycle.

Codex output explaining the autologwp WordPress dev login shortcut with an ASCII diagram showing request flow from /?autologwp=username-or-email through env gate local/development, host gate localhost or 127.0.0.1, get_user_by email/login lookup, clear old auth cookies, set current user with auth cookie and wp_login hook, then wp_safe_redirect to wp-admin

Read that flow slowly.

Environment gate before host gate. Host gate before user lookup. Lookup before cookie clear. Cookie clear before new auth. Hooks before redirect. Codex understood the architecture down to the order of those security gates — the kind of summary you’d write yourself after spending an hour with the source code.
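To show why the order matters, here is the gate sequence as a Python sketch. This is not the plugin's PHP code: the function, the result strings, and the user mapping are hypothetical, and I'm assuming the host arrives with any port already stripped. Cookie handling and the redirect are left out.

```python
from typing import Mapping

ALLOWED_ENVIRONMENTS = {"local", "development"}
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "[::1]"}


def resolve_autologin(
    env_type: str,
    host: str,
    identifier: str,
    users: Mapping[str, str],
) -> tuple[bool, str]:
    """Run the gates in order; the first failure stops the request.

    `users` maps a login or email to a username. Returns (allowed, detail).
    """
    if env_type not in ALLOWED_ENVIRONMENTS:
        return (False, "blocked_environment")
    if host not in ALLOWED_HOSTS:
        return (False, "blocked_host")
    username = users.get(identifier.strip().lower())
    if username is None:
        return (False, "unknown_user")
    # Only past this point would a real handler touch cookies or redirect.
    return (True, username)
```

Putting the cheapest, broadest refusals first means a production request never reaches the user lookup, so it cannot leak whether an account exists.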

Then — unprompted — it produced the verification matrix.

Codex showing the changed files list with the plugin folder composer.json tests run.php and eval-artifacts, the goals folder PROGRESS.md and B-001 through B-004 screenshots, plus a verified table where every required VERIFY.md command and every targeted check P-001 P-002 B-001 B-002 B-003 B-004 reports PASS, with the final audit recorded in PROGRESS.md

Every required command from VERIFY.md marked PASS. Every targeted check — P-001 environment gate, P-002 invalid user, B-001 admin login, B-002 email login, B-003 editor switch, B-004 invalid JSON plus preserved session — marked PASS. Final audit recorded in PROGRESS.md.

That table is what makes me willing to trust the run.

Want to verify it yourself?

The artifacts are right there on disk. The screenshots live in goals/wp-login-for-ai/test-artifacts/. PROGRESS.md is checked into git. Anyone can re-run the commands and confirm the markings.

No theatre.

.

.

.

Trust, But Verify (The Old-Fashioned Way)

Codex saying it works and the plugin actually working are two different claims.

And the “green tests, red production” surprises stay with you long enough to make manual smoke tests a reflex. So I closed the terminal and tested the plugin like a regular human would.

The dev environment was already running at localhost:8888. Front-end, no logged-in session.

Browser at localhost:8888 in incognito mode showing the wp-login-for-ai dev site with the default WordPress Hello World blog post, no logged-in user visible

I typed the autologin URL into the address bar.

Browser address bar showing localhost:8888/?autologwp=admin being typed into a Chrome incognito window with a wp-login-for-ai tab

Hit enter. The redirect happened. The session switched. The wp-admin dashboard loaded with the admin user identity in the corner.

WordPress wp-admin dashboard at localhost:8888/wp-admin showing Howdy admin in the top right corner, the wp-login-for-ai site name, dashboard menu with Posts Media Pages Comments Appearance, and the Welcome to WordPress version 6.9.4 panel

I tried the email variant (?autologwp=wordpress@example.com) and the editor switch. Both worked. None of the edge cases I poked at suggested Codex had declared completion incorrectly.

👉 The bigger point: /goal doesn’t replace your QA.

It offloads the part of the build you don’t enjoy — writing the implementation — so you can focus on the part you should be doing anyway. Which is verifying the result.

(Turns out the most valuable developer skill in the age of AI agents is… being a good tester. Who saw that coming?)

.

.

.

When /goal Earns Its Keep (And When It Doesn’t)

So when does the codex goal command actually earn its keep?

Bounded objectives with clear acceptance criteria. That’s the sweet spot. The autologin plugin is a good example: one feature, defined inputs, defined outputs, a small set of scope boundaries, and a verification contract that fits on one screen.

Here’s where you should reach for it:

  • Bug fixes with reproducible failures and regression tests.
  • Refactors with a “behavior is preserved” success condition.
  • Single feature slices from a larger project — one user story at a time.
  • API integration work where the contract is well-specified upfront.

And here’s where you should hold back:

  • Vague objectives like “make the app better” or “refactor everything.” Codex can’t audit completion for those — they don’t have a finish line.
  • Multi-feature builds that should really be split into separate goals.
  • Anything where you can’t define “done” before you start.

(A “refactor this whole module” goal will hit budget_limited and stop, having shipped nothing you’d want. There’s no audit to run when there’s no definition of done to run it against.)

Here’s the framing that’s stuck with me: /goal works best as an inner loop. The project manager role still belongs to you. For larger work, split it into multiple goals — 001-data-model, 002-admin-ui, 003-rest-api — and run them one after the other. One coherent slice per goal.
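If you keep slices in numbered folders like that, picking the next one to run is a short function. A Python sketch, assuming the three-digit prefix naming from the examples above (both helper names are hypothetical):

```python
import re
from typing import Optional

SLICE = re.compile(r"^(\d{3})-(.+)$")


def ordered_slices(names: list[str]) -> list[str]:
    """Sort goal folders by their numeric prefix, ignoring anything unnumbered."""
    numbered = [
        (int(match.group(1)), name)
        for name in names
        if (match := SLICE.match(name))
    ]
    return [name for _, name in sorted(numbered)]


def next_slice(names: list[str], completed: set[str]) -> Optional[str]:
    """Return the first slice that has not been completed yet, if any."""
    for name in ordered_slices(names):
        if name not in completed:
            return name
    return None
```

Deciding what counts as completed stays with you, which fits the project manager role described above.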

Two practical caveats before you fire your first goal.

First: Plan Mode and /goal don’t mix. The runtime suppresses goal continuation while Codex is in Plan Mode, so if you trigger /goal from inside a plan you’ll sit there wondering why nothing’s happening. Plan first, leave Plan Mode, then start the goal.

Second: /goal still depends on the spec. Skip the wp-spec-to-goal step (or whatever the equivalent is for your stack), write a one-line objective, and you’ll get a one-line-objective result. Garbage in, garbage out — same rule as always. Ferpetesake, write the spec.

.

.

.

The Bigger Picture

Here’s what /goal actually represents — a shift toward evidence-based autonomy.

Codex doesn’t need a human in the loop because it has files in the loop. Those files define done, prove done, and capture what happened along the way.

Compare back to the 50-minute supervised Claude Code build I wrote about last year. Fifty minutes was impressive at the time. Looking at it now, most of those minutes were judgment calls — clicking approve, reading diffs, deciding whether the next step looked sane. The codex goal command moves that judgment upfront into the spec, so the same decision doesn’t get made forty times during execution.

The skill investment pays back too.

wp-spec-to-goal was real work to design and benchmark.

But after two or three uses, the math stops being subtle: five minutes turning a paragraph into a goal trio, twenty-eight minutes of nothing. Once you’ve done it, you can’t really go back to the supervised loop for bounded tasks. That’d be like going back to dial-up after you’ve tasted fiber.

The part of building software that AI is starting to get genuinely good at is executing well-specified plans without supervision.

👉 Your job is to get good at writing the plan.

If you want to try it, here’s a starting point.

Pick a small bounded task this week — a bug fix with a failing test, or a single feature slice from a project you’re already working on. Don’t reach for the big rewrite. Write a tight GOAL.md (even by hand from the templates in this post), pair it with a VERIFY.md, paste a /goal command, and walk away.

The 28 minutes only feel real when you’ve spent them yourself.

The plugin from this post is a public repo. You can clone it, inspect the actual GOAL.md, VERIFY.md, and PROGRESS.md files, look at the Playwright screenshots checked into goals/wp-login-for-ai/test-artifacts/, and grab the wp-spec-to-goal skill from .codex/skills/ if you want to use it on your own builds.

The repo is at github.com/nathanonn/wp-login-for-ai.

Go give Codex something specific to do, then leave the room.



13 min read The Art of Vibe Coding

Never Let Claude Code Auto-Compact Again

Auto-compact fires when the context is full — not when your task is at a clean boundary. Here’s how to stay in control with a status line, manual compact instructions, and a HANDOFF.md habit.

Here’s the moment that made me religious about manual compaction.

I was deep in a Claude Code session with one hard rule: no parallel sub-agents. One at a time. Always. I’d stated it clearly at the start of the session — burn one agent at a time, not five.

Auto-compact fired mid-session. And with it, that rule vanished. Gone from context.

I kept going. Claude kept going.

Then I glanced at the status line.

Ten sub-agents. Running simultaneously. My five-hour budget torched in about four minutes. The constraint wasn’t in CLAUDE.md, so the compact summary had nothing to reload it from. Just… gone.

That was my last auto-compact.

Claude Code auto compact functions like a seatbelt — there to catch you at the hard limit, but with no awareness of where you are in your task. It fires when the system decides the window is too full. No knowledge of whether you’re mid-hypothesis or mid-debugging loop. No idea whether the work has reached a clean handoff point.

The system protects itself.

You lose state.

Nate Herk raised a useful heuristic in his video How to Never Hit Your Claude Session Limit Again: the 1M context window is insurance. His argument is that resetting around 120k tokens — rather than filling the full window — keeps the model operating at full quality across a long session.

I’ve adopted a version of this as my working rule. The context window is a budget you actively manage. Stop treating it like a lap pool you’re trying to fill.

Never let the session reach the “dumb zone.”

That’s the upper range where compaction is imminent, signal-to-noise is poor, and the model is sorting through stale logs and abandoned attempts on every turn.

By then, you’ve already paid the tax.

.

.

.

What Actually Lives in the Context Window

Here’s what most people don’t realize about the context window.

Everything costs.

Before you type a single message, the session is already carrying: the root CLAUDE.md and any auto-memory blocks, MCP tool names and schemas, skill descriptions, output style instructions, system prompts, and any path-scoped rules triggered on load.

That’s a meaningful slice of the context window before any work begins.

As the session runs, more piles in.

Every file read. Every command result. Every hook output. Every tool call and response. The full assistant turn history. You think “I only sent 20 messages” — but the session is carrying all of the above, in full, on every turn.

  • Stale exploration logs from an hour ago? In there.
  • Error output you resolved three steps back? In there.
  • Assistant messages full of planning that’s now completely moot? Also in there.

On every single turn, the model processes the entire window.

Every bit of it.

So yes — that 20-message conversation might be carrying the weight of forty.
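The arithmetic is worth seeing once. In this Python sketch the token counts are made-up illustrative numbers, not measurements, and both function names are hypothetical.

```python
def window_sizes(baseline: int, additions: list[int]) -> list[int]:
    """Window size at each turn: the baseline plus everything added so far."""
    sizes: list[int] = []
    current = baseline
    for added in additions:
        current += added
        sizes.append(current)
    return sizes


def total_processed(baseline: int, additions: list[int]) -> int:
    """Tokens processed across the session, since every turn reads the whole window."""
    return sum(window_sizes(baseline, additions))


# Illustrative only: a 20k baseline and four turns that each add 5k tokens.
sizes = window_sizes(20_000, [5_000] * 4)      # [25000, 30000, 35000, 40000]
total = total_processed(20_000, [5_000] * 4)   # 130000
```

Only 20k tokens of new material arrived across those four turns, yet the model read 130k. The window grows linearly while the total work grows much faster.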

/context gives you a live breakdown: how much is used, by which category, with optimization suggestions. Run it at least once per session to get a feel for where the weight is. (It’s the closest thing to a profiler the session gives you.)

The /context command output showing live token usage breakdown by category

.

.

.

Why Auto-Compact Is Lossy by Design

Here’s the thing: compaction isn’t broken.

The tradeoff is real and explicit.

Compaction takes a long-running session and converts it into a structured summary, then continues from that summary. The official docs are clear about it: requests and key code snippets are preserved; detailed instructions from earlier in the conversation may be lost.

The problem with Claude Code auto compact is timing.

When the system fires it, the session has no clean boundary. The compaction summarizes whatever is in the window at the moment of overflow — including partial plans, mid-hypothesis reasoning, and error threads still in flight.

That’s where my rule vanished.

That’s where your constraints vanish, too.

Stay with me, because understanding what survives is what makes the mechanism workable.

Auto-Compact Is a Lossy Filter

After compaction, these reload reliably: the root CLAUDE.md and auto-memory blocks. They come back because they’re read from disk — the filesystem is their source, not the summary.

These are not guaranteed to come back on their own:

  • Path-scoped rules and nested CLAUDE.md files. They existed in the session because matching files were read. After compaction, they’re gone — until those files are read again. If your project has a src/api/CLAUDE.md with API-specific rules, those rules are out of context post-compaction until Claude re-reads that file.
  • Invoked skill bodies are a middle case. They may reload with token caps applied — the full skill text might come back, or a compressed version, depending on what the token budget allows.

(The Decode Claude team has a thorough breakdown of how compaction actually works under the hood — worth reading if you want the full mechanism.)

The practical caution: compact when the work has natural shape.

  • After a feature lands and tests pass.
  • After a root cause is identified but before the fix starts.
  • Before switching from implementation to review.

Never compact mid-plan without first writing down the state you need preserved.

The mechanism works when you give it a retention policy. When auto-compact fires without one, you get whatever the summarizer decided mattered. /compact [instructions] gives you that control.

Use it.

.

.

.

Install a Context Meter in Your Status Line

The goal: always-visible context usage in the terminal status line.

Custom Claude Code status line showing model name, effort level, context percentage, and rate limit usage

Without it, you find out the window is at 78% when you run /context — which means you checked too late.

Here’s the script. Save it as ~/.claude/statusline.sh:

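#!/bin/bash

# Claude Code pipes session data to this script as JSON on stdin.
input=$(cat)

# Pull the fields to display, with fallbacks when a field is missing.
MODEL=$(echo "$input" | jq -r '.model.display_name // "Claude"')
EFFORT=$(echo "$input" | jq -r '.effort.level // "n/a"')
# Keep only the whole-number part of the context percentage.
PCT=$(echo "$input" | jq -r '.context_window.used_percentage // 0' | cut -d. -f1)

# Rate-limit fields are optional; "// empty" yields an empty string when absent.
FIVE_H=$(echo "$input" | jq -r '.rate_limits.five_hour.used_percentage // empty')
WEEK=$(echo "$input" | jq -r '.rate_limits.seven_day.used_percentage // empty')

# Append each rate-limit segment only when its value is present.
LIMITS=""
[ -n "$FIVE_H" ] && LIMITS=" | 5h:$(printf '%.0f' "$FIVE_H")%"
[ -n "$WEEK" ] && LIMITS="$LIMITS | 7d:$(printf '%.0f' "$WEEK")%"

echo "[$MODEL] effort:$EFFORT | ctx:${PCT}%$LIMITS"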

Make it executable:

chmod +x ~/.claude/statusline.sh

Add this block to ~/.claude/settings.json:

{
  "statusLine": {
    "type": "command",
    "command": "~/.claude/statusline.sh",
    "padding": 2
  }
}

Here’s why that matters.

ctx:47% is the core signal.

Once it’s in your status line, context management becomes part of your session loop — you glance at it the same way you glance at a battery indicator. You stop waiting for it to reach critical before acting.

5h:71% prevents starting a heavy refactor when rate limits are already burning.

If you’re at 71% of your five-hour budget, a full session of parallel tool calls might hit the ceiling before the task finishes.

Better to know before you start.

.

.

.

The Operating Rule — Compact at Boundaries, Not at Panic

These are the zones I use as working heuristics.

Hard-won across real sessions — not universal thresholds from published research. If they feel arbitrary, calibrate them against your own work.

But start here:

  • Green (0–30%): Keep working. Avoid dumping unrelated files or running broad research in the main session — keep the window clean while you have room. The session is young. Let it breathe.
  • Yellow (30–50%): Start watching for task boundaries. Note where the natural stopping points are in your current work. You have runway. Use it intentionally.
  • Orange (50–60%): Finish the current micro-task, then compact or hand off. Do not start a new major branch. The window is narrowing faster than it feels.
  • Red (above 60%): The threshold I use before a major context reset. Do not start a new feature, refactor, or research thread without resetting context first. This is the zone where sessions start producing work you’ll have to redo.

1M Opus exception: If token budget matters, treat 15–20% as the practical reset band. Twenty percent of 1M tokens is already a large session with substantial context weight. The math changes when the window is enormous.
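If you want the zones in a script (say, next to the status line script above), here is a minimal sketch. The function name ctx_zone is made up for illustration, and the boundary handling is my assumption, since the bands above overlap at 30, 50, and 60: here 30 counts as yellow, and both 50 and 60 count as orange. It covers the standard-window zones only, not the 1M exception.

```shell
#!/bin/bash

# ctx_zone is a hypothetical helper: it maps a whole-number context
# percentage to one of the four zone names used in this post.
# Assumption: 30 counts as yellow, 50 and 60 count as orange.
ctx_zone() {
  local pct="$1"
  if [ "$pct" -lt 30 ]; then
    echo "green"
  elif [ "$pct" -lt 50 ]; then
    echo "yellow"
  elif [ "$pct" -le 60 ]; then
    echo "orange"
  else
    echo "red"
  fi
}

# Example: label the value the status line already shows.
PCT=47
echo "ctx:${PCT}% ($(ctx_zone "$PCT"))"
```

Calibrate the cutoffs against your own sessions; the numbers are heuristics, and the function only makes them easy to change in one place.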

One important pushback worth stating clearly: do not interrupt a working implementation mid-flight just because the number crossed a threshold. If you’re in the middle of a function, a migration, or a debugging loop that’s producing real signal — finish the micro-task first.

Compact at a boundary.

Never mid-sentence.

The threshold is a trigger to watch for the next natural stopping point. The task shapes where compaction makes sense:

Choose the Reset at a Clean Boundary

.

.

.

Write /compact Like a Handoff Prompt

The biggest leverage point in this workflow is how you write the compact instruction.

Bad:

/compact summarize everything so far

That hands the retention decision back to the model. You get whatever the summarizer determined was important.

Ferpetesake.

Better:

/compact Preserve only what a fresh coding agent needs to continue safely: current goal, files changed, decisions, errors, tests, pending tasks, and exact next step. Drop stale exploration and repeated logs.

Even that is improvable.

The reusable KEEP/SUMMARIZE/DROP template:

/compact
KEEP:
- Current goal and acceptance criteria
- Exact files changed and why
- Important code decisions and rejected alternatives
- Open bugs, failing tests, console errors, and commands already tried
- The last 5 user/assistant turns in detail

SUMMARIZE:
- Earlier exploration
- Completed debugging paths
- General discussion

DROP:
- Repeated test output
- Long logs that no longer matter
- Dead-end ideas already ruled out

This format works because it makes compaction a retention policy.

You’re telling the model exactly what to keep verbatim, what to compress, and what to discard. The output is shaped by your instructions — not the summarizer’s defaults.

Three task-specific variants ready to copy:

Feature implementation:

/compact Preserve feature implementation state: goal, acceptance criteria, files changed, functions/components touched, business rules, test results, unresolved bugs, and exact next step. Summarize old exploration and drop repeated logs.

Debugging:

/compact Preserve debugging state: original bug, reproduction steps, exact error messages, hypotheses tested, files inspected, fixes attempted, current most likely root cause, and next verification command. Drop dead-end logs unless they explain a rejected approach.

Refactor:

/compact Preserve refactor state: target architecture, files already refactored, files not yet touched, compatibility constraints, naming conventions, migration risks, test status, and next file to inspect.

.

.

.

Add Compact Instructions to CLAUDE.md

One-off compact prompts work.

But if you’re running long sessions regularly, encoding the retention policy in CLAUDE.md means you stop writing it from scratch every time. The official docs support this directly: add a “Compact Instructions” section to CLAUDE.md, and Claude Code uses it during compaction.

Here’s the copy-paste version:

## Compact Instructions

When compacting, preserve working state for continuation, not chat history.

Always keep:

- Current goal and acceptance criteria
- Exact files changed, created, deleted, or inspected and why
- Important hooks, functions, classes, routes, settings, commands, and config keys
- Business rules and architectural decisions
- Rejected approaches and why they were rejected
- Errors, failed tests, commands run, and fixes attempted
- Pending tasks and the exact next step

Summarize:

- Completed exploration
- Older discussion
- Repeated command output

Drop:

- Verbose logs unless they contain unresolved errors
- Duplicate explanations
- Abandoned ideas that are no longer relevant

After compaction, re-read PLAN.md or HANDOFF.md if present before continuing.

One caution: if CLAUDE.md grows too large, it becomes its own context tax — loaded on every session, occupying window space before any work begins. The Compact Instructions block is worth including if it replaces ad-hoc reminders you’d otherwise type in long sessions.

Keep the file disciplined.

If you haven’t read The Single File That Makes or Breaks Your Claude Code Workflow, that’s the foundation. The Compact Instructions block sits inside it.

.

.

.

Write a Handoff File Before Compaction or Clear

The compact instruction controls the summary.

The handoff file makes that summary durable — written to disk so it survives the reset.

A Hacker News commenter put it well, and I’m passing this on the way a mentor would: get better results by asking Claude to write the important parts into a Markdown file, reviewing it, clearing context, and continuing from that file. The session state becomes explicit and inspectable. You can read it. You can verify it. You can hand it to a fresh session and have it pick up exactly where you left off.

That’s worth framing as a habit.

The moment before compaction is the moment to make your state visible.

The prompt to generate HANDOFF.md:

Create HANDOFF.md. Include:
- Goal
- Current branch/state
- Files changed
- Decisions made
- Commands run
- What failed
- What remains
- Exact next step

Make it complete enough that a fresh Claude Code session can continue without reading this chat.

Then compact with a focused instruction:

/compact Focus on HANDOFF.md, current git diff, unresolved errors, and next step. Drop old exploration.

Or for a truly fresh start:

/clear
Read HANDOFF.md, then continue from the exact next step.

The distinction between the three reset mechanisms:

  • /compact — same session continues. Old context is summarized, not erased. The compressed history is still present.
  • /clear + HANDOFF.md — clean slate. The continuation document is the only thread back to prior work.
  • /rewind — last branch was wrong. Use this to remove polluted attempts rather than summarizing them into the session memory.

HANDOFF.md works with all three.

Write it before the reset. Verify it’s complete. Then pick the right reset for the situation.
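Part of "verify it's complete" can be checked mechanically. The sketch below assumes the handoff file mentions each section name from the prompt above somewhere in its text (matched case-insensitively); the function name handoff_missing is hypothetical. It only confirms a section is mentioned, not that what's written under it is any good, so you still read the file yourself.

```shell
#!/bin/bash

# handoff_missing is a hypothetical helper: it prints every expected section
# name that does not appear anywhere in the given file (case-insensitive).
# Assumption: the handoff file uses the section names from the prompt verbatim.
handoff_missing() {
  local file="$1"
  local section
  for section in "Goal" "Current branch/state" "Files changed" \
                 "Decisions made" "Commands run" "What failed" \
                 "What remains" "Exact next step"; do
    grep -qiF -- "$section" "$file" || echo "$section"
  done
}

# Example: report gaps before running /compact or /clear.
if [ -f HANDOFF.md ]; then
  missing=$(handoff_missing HANDOFF.md)
  if [ -n "$missing" ]; then
    echo "HANDOFF.md is missing:"
    echo "$missing"
  else
    echo "HANDOFF.md mentions every section."
  fi
fi
```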

.

.

.

Rehydrate After Compaction

Compaction has a subtle failure mode: after the summary is generated, Claude may know a plan exists — but not have the plan’s actual content in context.

Let that land for a second.

The root CLAUDE.md and auto-memory reload — they’re read from disk. PLAN.md, HANDOFF.md, and path-scoped rules loaded from specific subdirectories are not automatically restored. The compacted summary might reference them. The content is not there.

The fix is explicit.

Tell Claude to re-read them.

After-compaction checklist:

After any compaction:
1. Re-read PLAN.md if it exists.
2. Re-read HANDOFF.md if it exists.
3. Re-check git diff --stat.
4. Confirm the current goal, unresolved risks, and exact next step.
5. Continue from the latest pending task, not from memory alone.

This checklist can live in CLAUDE.md under “Compact Instructions”. The last line of the Compact Instructions block shown earlier already covers the re-read step. But it’s worth stating explicitly so the habit is clear: after compaction, re-read the durable files before continuing.

Do not assume the summary captured everything they contained.

The official docs confirm it: root CLAUDE.md reloads after compaction; path-scoped and nested instructions may need trigger files re-read. An explicit rehydrate checklist is more reliable than hoping the summary preserved those details.

.

.

.

The Decision Table — Continue, Compact, Clear, Rewind, or Subagent

Here’s where it all comes together.

When the context meter signals it’s time to act, this is the call:

| Situation | Use | Why |
| --- | --- | --- |
| Same task, clean milestone reached | /compact [instructions] | Preserve state, drop noise |
| New unrelated task | /clear | Avoid dragging old context into new work |
| Last branch was wrong | /rewind | Remove polluted attempts instead of summarizing them |
| Heavy research or file exploration | Subagent | Keep raw reading out of main context |
| Long implementation needs continuity | HANDOFF.md + /compact | Make continuation state durable |
| Session feels confused far below the limit | HANDOFF.md + /clear | Reset quality before wasting more turns |

The new habit is six steps:

  • Watch ctx%.
  • Finish the current micro-task.
  • Write down state if needed.
  • Compact with a retention policy.
  • Re-read the durable plan.
  • Continue.

Never letting Claude Code auto-compact means staying in charge of what survives.

The best sessions — the ones that actually ship — are the ones where context stays intentional. Build the status line. Write the compact instruction. Keep HANDOFF.md in the habit. Re-read the durable files every time.

That’s the playbook.

Own it.

If you want the full framework this sits inside, start with Context Engineering with Claude Code, Explained.

11 min read The Art of Vibe Coding

How to Run Firecrawl for Free in the Cloud (No API Key Needed)

Run the full Firecrawl stack on a free GitHub Codespaces 16 GB cloud machine — no API keys, 5-minute setup, and wired into Claude Code via a single tunnel command.

Watch the video walkthrough, or read the full written guide below.

The Hardware Problem Nobody Warns You About

I have an M1 Pro MacBook Pro. Base model. 16 GB of RAM.

I figured that was plenty.

So I read a tutorial about Firecrawl — the open-source tool that turns messy web pages into clean, LLM-ready markdown — and ran docker compose up without a second thought.

Then I opened Activity Monitor.

14+ GB of RAM. 3 GB spilling into swap.

Memory pressure glowing yellow — the macOS equivalent of a check-engine light.

What I’d forgotten (ferpetesake) was that VS Code, Chrome, and a dev server were already running. Firecrawl’s Docker stack — five services simultaneously: the API server, a Playwright browser cluster, Redis, RabbitMQ, and PostgreSQL — landed on top of my normal development tools like a brick on a soufflé.

Here’s what most tutorials skip: they say “just install Docker” and assume you have unlimited RAM under your desk. Firecrawl can allocate up to 12 GB of RAM on its own. If your machine is already breathing hard from your regular workflow, adding Firecrawl is the thing that tips it over.

But — and stay with me here — GitHub Codespaces gives you a 16 GB RAM, 4-core cloud machine for free. The free tier includes 30 hours of runtime per month. For development, tutorials, and on-demand scraping sessions, that’s more than enough.

I’ve packaged the entire setup into a template repo you can fork: firecrawl-codespaces. Five minutes from zero to a working Firecrawl instance, connected to Claude Code on your local machine.

Let me show you exactly how.

.

.

.

What You’re Actually Getting (Honest Assessment First)

Before we touch a single command, let’s set expectations.

I’d rather you know the trade-offs now than feel surprised after investing 5 minutes.

| | GitHub Codespaces (Free) | Local Machine |
| --- | --- | --- |
| RAM | 16 GB (4-core machine) | Whatever you have |
| CPU | 4 cores | Whatever you have |
| Cost | 30 hrs/month free | Free (hardware cost) |
| Setup time | ~5 minutes | ~10 minutes |
| Always-on? | No, auto-stops after inactivity | Yes |
| API keys needed? | None | None |

Two honest limitations:

1. Not always-on. Codespaces auto-stops after 30 minutes of inactivity. There are workarounds (covered later), but if you need Firecrawl running 24/7 without any interaction — Codespaces is the wrong fit.

2. No anti-bot bypass. The self-hosted version of Firecrawl doesn’t include Fire-engine — the component that handles IP rotation and bot detection circumvention. For scraping documentation sites, GitHub repos, and public content (the 95% use case for Claude Code), you don’t need it. For scraping LinkedIn or heavily Cloudflare-protected sites, you do.

The verdict: Codespaces is perfect for development, learning, and on-demand scraping sessions. You spin it up when you need it, stop it when you don’t.

.

.

.

Prerequisites

Short list. Zero friction.

  • A GitHub account (free tier works; Pro gives 50% more hours)
  • The GitHub CLI (gh) installed on your local machine — install guide

That’s it.

No Docker Desktop. No Homebrew. No Node.

The entire Firecrawl stack runs inside the Codespace — the only thing your local machine needs is the gh CLI for the tunnel command.

.

.

.

The Setup — Step by Step

Step 1: Create the Codespace

Go to the firecrawl-codespaces repo on GitHub. Fork it (or use it directly).

Click the green <> Code button → select the Codespaces tab → click Create codespace on main.

Machine type matters. Select the 4-core (16 GB RAM) option. The 2-core machine only has 8 GB — Firecrawl will OOM (out of memory) on it.

GitHub "Create a new codespace" page showing the firecrawl-codespaces repo selected, main branch, "Firecrawl on Codespaces" dev container configuration, and 4-core machine type

The core-hour gotcha: Free hours are measured in core-hours, not wall-clock hours. A 4-core machine burns through them twice as fast as a 2-core machine (and four times as fast as a single core). The 120 core-hours/month free tier gives you 30 actual hours on a 4-core machine.

Step 2: Wait for the Automated Setup

The moment the Codespace provisions, it runs setup.sh automatically.

This is configured in the repo’s devcontainer.json via postStartCommand — meaning it runs every time the Codespace starts or resumes, not just on initial creation.

VS Code terminal inside the Codespace showing "Finishing up... Running postStartCommand... > bash setup.sh"

Here’s what setup.sh does behind the scenes:

  1. Clones Firecrawl from the official repo
  2. Creates a minimal .env — port 3663, no authentication, no API keys
  3. Copies a docker-compose.override.yaml that uses pre-built Docker images instead of compiling from source (cuts first-run startup from 5-15 minutes down to ~90 seconds)
  4. Starts the Docker stack with docker compose up -d
  5. Waits for the health check to confirm Firecrawl is responding
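Step 5 is the kind of job a short polling loop handles. What follows is a sketch of that idea, not the actual contents of setup.sh: the function name wait_until, the attempt count, and the delay are all illustrative assumptions. The only detail taken from this guide is port 3663.

```shell
#!/bin/bash

# wait_until is a hypothetical helper: it retries a command until it
# succeeds or the attempt budget runs out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (assumed values: 60 attempts, 5 seconds apart):
# wait_until 60 5 curl -sf http://localhost:3663 && echo "Firecrawl is up"
```

The example is commented out so the block does nothing when no Firecrawl instance is listening. With curl's -f flag, an HTTP error response counts as a failure, so the loop keeps waiting until the API actually answers.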

First run takes ~2-5 minutes (image pull). After that, resuming a stopped Codespace takes ~30 seconds.

Once the setup completes, the Ports tab shows Firecrawl API on port 3663 with a green indicator:

VS Code Ports tab showing "Firecrawl API (3663)" with a green status dot and forwarded address

Expected warning: You’ll see WARN — You're bypassing authentication in the Docker logs. Completely normal. USE_DB_AUTHENTICATION=false is the correct setting for self-hosted Firecrawl. Safe to ignore.

Step 3: Verify the Stack

Run docker ps inside the Codespace terminal. You should see all five containers running:

Terminal showing docker ps output with five containers running: firecrawl-api-1 (port 3663), rabbitmq, redis, postgres, and playwright-service — all showing "Up 4 minutes"

Five containers. All healthy. Firecrawl is running inside your Codespace.

Now you need to get it to your local machine.

Step 4: Connect From Your Local Machine

I’ll spare you the detour I took.

I spent an embarrassing amount of time messing with public port URLs and GitHub token authentication before discovering that gh codespace ports forward does everything in one command. Learn from my shenanigans.

Switch to your local machine’s terminal (not the Codespace). Run:

gh codespace list

MacBook terminal showing gh codespace list with one Codespace available on the main branch

Copy the Codespace name from the output, then forward port 3663:

gh codespace ports forward 3663:3663 -c <your-codespace-name>

MacBook terminal showing gh codespace ports forward command with output "Forwarding ports: remote 3663 <=> local 3663"

One command.

Firecrawl is now at http://localhost:3663 on your machine — exactly as if it were running locally. No public exposure. No authentication tokens. And here’s the bonus: the tunnel keeps the Codespace alive as long as it’s running. More on that later.

Verify by opening http://localhost:3663 in your browser:

Browser showing localhost:3663 with JSON response: {"message":"Firecrawl API","documentation_url":"https://docs.firecrawl.dev"}

Firecrawl API. Running. Accessible. Free.

Other connection methods: The tunnel is the recommended approach. Two alternatives exist — a public port URL and a private port with GitHub token auth — but both reset on every Codespace restart. The tunnel is simplest and has the bonus keep-alive benefit. See the repo README for details on the alternatives.

.

.

.

Wire It Into Claude Code

Firecrawl is running.

Now let’s make Claude Code actually use it. This is the Firecrawl setup for Claude Code that turns your coding assistant into a web-aware research agent, and it’s three steps.

Install the Firecrawl CLI

On your local machine:

npm install -g firecrawl-cli

Install Firecrawl Skills

firecrawl setup skills --agent claude-code

This clones 8 markdown skill files from the official Firecrawl CLI repo and installs them into Claude Code’s skills directory. Each skill teaches Claude Code how to use a different Firecrawl capability: search, scrape, crawl, map, interact, download, and agent-powered extraction.

Firecrawl CLI skills installer showing ASCII art "SKILLS" header, repository cloned from github.com/firecrawl/cli.git, "Found 8 skills" with a selection list including firecrawl, firecrawl-agent, firecrawl-crawl, firecrawl-download, firecrawl-interact, firecrawl-map, firecrawl-scrape, and firecrawl-search

After installation, type /firecrawl in Claude Code. You should see all available Firecrawl slash commands:

Claude Code prompt showing /firecrawl typed with autocomplete dropdown listing all Firecrawl slash commands and their descriptions

Add Firecrawl Instructions to CLAUDE.md

This is the critical step.

Without this, Claude Code won’t know to prefer Firecrawl over its built-in (and more limited) web tools.

Add this block to your project’s CLAUDE.md:

## Firecrawl

- **Always use Firecrawl skills** (firecrawl, firecrawl-scrape, firecrawl-search, etc.) for web searches and scraping. Avoid the built-in WebFetch/WebSearch tools.
- We are using the localhost version of Firecrawl. Use `firecrawl` command to interact with the service.
- **Always prefix `firecrawl` CLI commands with `FIRECRAWL_API_URL=http://localhost:3663`** so the CLI targets the localhost service instead of prompting for cloud authentication. Example: `FIRECRAWL_API_URL=http://localhost:3663 firecrawl scrape "<url>" -o .firecrawl/page.md`.
- **NEVER run `firecrawl --status`** — it checks cloud API auth and always shows "Not authenticated" for localhost. Instead, check if Firecrawl is running with: `curl -s http://localhost:3663 > /dev/null 2>&1` (requires `dangerouslyDisableSandbox: true`).
- All Firecrawl-related commands (including server health checks) must run with `dangerouslyDisableSandbox: true`.
- **Sub-agents**: When spawning agents that may need web access, include these Firecrawl rules in the agent prompt so they use Firecrawl instead of built-in web tools.

Why each line matters:

  • The FIRECRAWL_API_URL prefix is essential. Without it, the Firecrawl CLI defaults to cloud authentication and prompts for an API key you don’t have. The environment variable tells it “talk to localhost instead.”
  • The --status trap — and I say this from personal experience — will burn you. I ran firecrawl --status and it said “Not authenticated.” I spent 20 minutes trying to generate an API key I didn’t need. My self-hosted instance was running perfectly the entire time. The command only checks cloud auth. It has no localhost awareness. Use the curl health check instead.
  • The dangerouslyDisableSandbox note is necessary because Claude Code’s sandbox blocks localhost network calls by default. Firecrawl commands need to reach port 3663.
  • The sub-agent rule prevents a common gotcha: you spawn a research sub-agent, and it uses built-in WebFetch instead of Firecrawl because it didn’t inherit the instructions.
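For commands you type yourself, outside Claude Code, the same prefix can be tucked into a small shell function so it never gets forgotten. This is a sketch, and the wrapper name firecrawl_local is my own invention, not something the Firecrawl CLI ships with. It assumes the tunnel from Step 4 is forwarding port 3663.

```shell
#!/bin/bash

# firecrawl_local is a hypothetical wrapper: it runs the firecrawl CLI with
# FIRECRAWL_API_URL pointed at the tunneled localhost instance, and passes
# every argument through unchanged.
firecrawl_local() {
  FIRECRAWL_API_URL=http://localhost:3663 firecrawl "$@"
}

# Example (requires the firecrawl CLI and a running tunnel):
# firecrawl_local scrape "<url>" -o .firecrawl/page.md
```

Drop it into your shell profile if it's useful. Claude Code still needs the CLAUDE.md rule above, because it can't be relied on to pick up a function defined in your interactive shell.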

.

.

.

See It In Action

Theory is nice. Let’s see it work.

I asked Claude Code to scrape a FluentCart REST API documentation page — the kind of task you’d do when building an integration and need to understand an endpoint’s parameters before writing any code.

Claude invoked the /firecrawl-scrape skill.

It first checked whether Firecrawl was running at localhost:3663, confirmed the health check passed, then ran the scrape with the FIRECRAWL_API_URL prefix:

Claude Code terminal showing the /firecrawl-scrape skill in action — Claude checks if Firecrawl is running at localhost:3663, then scrapes the FluentCart REST API docs page with the FIRECRAWL_API_URL prefix

The result?

Clean, structured markdown. Endpoint names, URL patterns, parameter tables with types and descriptions, and complete curl examples — all formatted and ready for Claude to work with:

Claude Code displaying scraped FluentCart API documentation showing "Bulk Insert Products" endpoint details, a parameter table with columns for Parameter, Type, Required, and Description, plus a formatted curl example with JSON body

Claude then saved the scraped content as a .md file in a .firecrawl/ folder for future reference:

VS Code showing the scraped content saved as fluentcart-products.md with clean markdown formatting — Products API documentation with headings, links, base URL, and structured endpoint listings

Compare that to what a raw HTTP fetch returns: the same page’s HTML would be 10x larger, stuffed with navigation menus, footers, tracking scripts, and CSS class names. Firecrawl strips all of that away and returns only the content that matters — clean markdown that fits neatly into Claude’s context window instead of bloating it.

.

.

.

The One Gotcha That Will Catch You: Idle Timeout

I learned this one the hard way.

I set up Firecrawl in a Codespace, walked away to make coffee, came back 40 minutes later — and everything was gone. The Codespace had stopped itself.

Here’s what happens:

  1. You start Firecrawl with docker compose up -d (detached mode)
  2. You close the Codespace browser tab
  3. Thirty minutes later, the Codespace auto-stops
  4. Firecrawl is gone

Why?

Codespaces measures inactivity as “lack of terminal input or output.” A detached Docker daemon running in the background produces no terminal output. From Codespaces’ perspective, nobody’s home.

Three fixes:

  • Fix 1: The tunnel keeps it alive (you’re already doing this). The gh codespace ports forward command counts as active interaction. As long as that tunnel is running on your local machine, the Codespace stays alive.
  • Fix 2: Stream logs. Inside the Codespace, run docker compose logs -f in the Firecrawl directory. Each log line resets the idle timer.
  • Fix 3: Extend the timeout. In GitHub Settings → Codespaces → Default idle timeout, set it to 240 minutes (the maximum).

And if it does stop? The postStartCommand in devcontainer.json auto-starts Firecrawl on every resume. Just re-run the tunnel command on your local machine and you’re back.

.

.

.

Free Tier Math and Alternatives

GitHub Free accounts get 120 core-hours/month.

| Machine | RAM | Free wall-clock hours |
| --- | --- | --- |
| 2-core | 8 GB | 60 hrs (not enough CPU & RAM for Firecrawl) |
| 4-core | 16 GB | 30 hrs |
| 8-core | 32 GB | 15 hrs |
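The arithmetic behind that table is a single division: core-hours divided by cores. A minimal sketch, with a made-up function name (wall_clock_hours), in case you want to redo the math for a different allowance:

```shell
#!/bin/bash

# wall_clock_hours is a hypothetical helper: it converts a monthly
# core-hour allowance into wall-clock hours for a given machine size.
# Integer division truncates any fractional hour.
wall_clock_hours() {
  local core_hours="$1"
  local cores="$2"
  echo $(( core_hours / cores ))
}

# Reproduce the free-tier rows (120 core-hours/month).
for cores in 2 4 8; do
  echo "${cores}-core: $(wall_clock_hours 120 "$cores") hrs"
done
```

The same function covers paid allowances: pass a different core-hour figure as the first argument.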

What 30 hours gets you: roughly 4 full work days of active Firecrawl sessions. Enough for a serious project sprint, a full tutorial walkthrough, or hundreds of documentation page scrapes.

The storage caveat: Storage is billed even while the Codespace is stopped — $0.07/GB/month. With the Firecrawl repo and Docker layers, expect ~5-10 GB total. That’s ~$0.35-$0.70/month.

Pro tip: GitHub Pro ($4/month) bumps you to 180 core-hours — 45 hours on a 4-core machine. And add a $5 spending cap in GitHub Settings → Billing → Budgets to prevent surprise charges if you forget to stop the Codespace.

Where Codespaces Fits Among Your Options

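| Option | Setup | Cost | Always-on | Anti-bot |
| --- | --- | --- | --- | --- |
| Local machine | ~10 min | Free | Yes | No |
| Codespaces (this repo) | ~5 min | Free (30 hrs/mo) | No | No |
| Railway | ~2 min | $5+/mo | Yes | No |
| Firecrawl Cloud | 0 min | $16+/mo | Yes | Yes |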

Local is best if your machine can handle it — no time limits, always-on. Codespaces is best for tutorials, learning, and on-demand sessions (what you just set up). Railway has an official Firecrawl deploy template — one click and $5/month for always-on hosting. Firecrawl Cloud is pay-per-use with the anti-bot bypass engine included.

.

.

.

The Bigger Picture

That yellow memory pressure warning on my MacBook Pro? Gone from the equation entirely.

The complete firecrawl Claude Code setup now runs on a 16 GB cloud machine that costs nothing — while my laptop handles what laptops should handle: VS Code, Chrome, and the dev server. No RAM fights. No swap memory. No jet engine fans.

But the real takeaway goes beyond Firecrawl.

The pattern is this: instead of installing powerful tools on every machine you own, run them once in a cloud environment and tunnel to them from wherever you are. GitHub built the tunneling right into the CLI. The free tier covers 30 hours a month. And the template repo makes setup a 5-minute operation.

Your AI coding assistant now has live web access. Documentation pages, API references, technical articles — anything Claude Code needs to read before writing code, Firecrawl can fetch.

Ready to set it up?

  1. Fork the repo: firecrawl-codespaces
  2. Create a 4-core Codespace
  3. Wait for the automated setup (~3 minutes)
  4. Run gh codespace ports forward 3663:3663 on your local machine
  5. Install the CLI and skills: npm install -g firecrawl-cli && firecrawl setup skills --agent claude-code
  6. Add the Firecrawl block to your CLAUDE.md

Six steps. Five minutes. Zero API keys.

Go build something with it.

12 min read The Art of Vibe Coding

The “Real” Context Engineering with Claude Code, Explained

I’ve written 40+ posts about Claude Code.

Sub-agents. CLAUDE.md files. Skills. Workflow engineering. Testing loops. Spec-driven development. Memory. Self-evolving rules.

I was outlining a post last week — when I stopped mid-sentence and stared at my screen. I had the outline open on one side, my published posts list on the other. And for the first time, I saw it.

Every single post was about the same thing.

Not “AI coding tips.” Not “Claude Code tricks.” Something deeper — a discipline I’d been teaching without realizing I was teaching it. I’d been circling the same idea for almost a year, approaching it from forty-four different angles, and I just didn’t have a name for it.

(That’s the annoying thing about patterns. They’re invisible until suddenly they’re not.)

.

.

.

The Name Drop

The name is context engineering.

Tobi Lütke (Shopify CEO) tweeted in June 2025 that he preferred “context engineering” over “prompt engineering.” Karpathy co-signed it. Anthropic published an official guide. The term stuck.

But here’s what nobody’s saying: if you’ve been following this newsletter, you’ve been a context engineer. You just didn’t know it yet.

Let me show you what I mean.

.

.

.

What Happens Without Context Engineering

Let me tell you about an afternoon that changed how I think about context windows.

I was adding a chat interface to a Next.js app using Vercel’s AI Elements library. Simple task — wire up useChat with <Conversation> and <Prompt>. Maybe thirty minutes of work.

So I did what felt responsible: I dumped the entire AI Elements documentation into Claude’s context. Every hook, every provider, every component. Thorough. Comprehensive. Professional.

And then Claude started… hedging.

Vague suggestions instead of concrete code. Recommendations that contradicted themselves across responses. Instructions I’d given three messages ago — forgotten entirely. I watched Claude’s quality degrade in real time, like a student cramming so hard for an exam they forgot how to spell their own name.

That’s context rot — when irrelevant information degrades the AI’s ability to focus on what matters.

I closed the session. Started fresh. This time I gave Claude only the docs for the two components I actually needed. It nailed the implementation on the first try.

Less context. Better results.

(I know. Counterintuitive.)

And here’s the part that really bakes your noodle: bigger context windows don’t make AI smarter. Past about 50% fill, performance actually degrades.

A senior engineer working on an 80k-line codebase posted on Reddit calling the 1M context window “a noob trap”.

They aggressively keep sessions under 250k tokens. And before you even type a word, 45,000 tokens are already loaded (system prompt, tool schemas, agent descriptions, memory files, MCP schemas). On the standard 200k window, that’s more than 20% gone at session start.

Context engineering is how you fight this.

.

.

.

What Context Engineering Actually Is

Here’s the definition I’ve landed on after (almost) a year of teaching these techniques:

Context engineering is the discipline of designing what information reaches your AI — the right knowledge, the right constraints, the right tools, at the right time — so it can actually do what you need.

The key distinction:

Your CLAUDE.md file isn’t a prompt. Your sub-agents aren’t just parallelism. Your skills aren’t just shortcuts. They’re all components of a context system that assembles the right information before the model ever sees a token.

Prompt engineering is choosing the right words. Context engineering is building the right world around the AI so it barely needs prompting at all.

.

.

.

The Context Engineering Stack

Here’s the framework I wish I had when I started.

Every Claude Code context engineering technique I’ve taught maps to one of six layers, each solving a different problem, each building on the one below it.

Let me walk you through each layer — bottom up.

.

.

.

Layer 1: Static Context (CLAUDE.md)

The “hello world” of context engineering.

A CLAUDE.md file loads automatically into every Claude Code session. It pre-loads project knowledge — your stack, conventions, patterns, gotchas — so every conversation starts with the essentials instead of from zero.

Without it, every session is amnesia.

Claude doesn’t know your project uses Tailwind, your team prefers functional components, or that your API has a weird auth flow. You spend the first five minutes of every conversation re-explaining things you explained yesterday.

(Sound familiar? Yeah.)

But — and stay with me here — there’s a paradox.

CLAUDE.md is incredible because it’s always loaded into context. And terrible for the exact same reason. Always-on context isn’t dynamic. Once your CLAUDE.md passes a few hundred lines, Claude starts ignoring nuances. The very file that’s supposed to help starts contributing to context rot.

The fix: keep CLAUDE.md lean — around 100 lines of essential universals. Load additional context dynamically with skills or custom commands. Prime, don’t hoard.

Deep dives: CLAUDE.md Guide → The Single File

.

.

.

Layer 2: Behavioral Context (Rules & Constraints)

Here’s a scenario that’ll make you wince.

Claude can’t get an API working, so it silently inserts a try/catch that returns sample data. Everything looks correct. All your tests pass. The UI renders beautifully. You demo it to your client on Thursday.

Three days later, you discover nothing was ever real.

(I’ll let that sink in for a moment.)

That’s what happens without behavioral context — instructions that shape HOW the AI behaves, not just what it knows. Knowledge without constraints is a liability.

The fix is a rule in your CLAUDE.md: “Never silently replace real functionality with mocked data. If something fails, fail loud.”

One sentence.

Prevents an entire category of mistakes.

Context engineering goes beyond feeding information in. It constrains behavior through instructions. Think of CLAUDE.md as a behavior contract:

  • “Always write tests before implementation” (TDD constraint)
  • “Never modify files outside /src without asking” (scope constraint)
  • “Use TypeScript strict mode” (quality constraint)

Every rule you add is a piece of behavioral context. And unlike knowledge — which can get stale — good behavioral rules compound. They prevent the same mistake from happening across every future session.
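
To make the "fail loud" rule concrete, here is roughly what the two behaviors look like in code. It's a sketch in Python (so the try/catch becomes try/except), and every name in it (`ApiError`, `load_users_silently`, `load_users_loudly`, `broken_fetch`) is hypothetical, made up for illustration.

```python
class ApiError(RuntimeError):
    """Raised when a real API call fails and must not be papered over."""


def load_users_silently(fetch):
    # Anti-pattern: any failure is swallowed and replaced with sample data,
    # so the UI and the tests keep passing on fake results.
    try:
        return fetch()
    except Exception:
        return [{"id": 1, "name": "Sample User"}]


def load_users_loudly(fetch):
    # "Fail loud": re-raise with context so the failure is impossible to miss.
    try:
        return fetch()
    except Exception as exc:
        raise ApiError(f"users API call failed: {exc}") from exc


def broken_fetch():
    # Hypothetical stand-in for an API call that is not working.
    raise ConnectionError("connection refused")


print(load_users_silently(broken_fetch))  # fake data, no sign of trouble
try:
    load_users_loudly(broken_fetch)
except ApiError as err:
    print(f"caught: {err}")
```

The first function is the one that gets you through Thursday's demo. The second is the one that tells you on Monday that something is broken.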

Deep dives: Project Rules → Self-Evolving Rules

.

.

.

Layer 3: Context Persistence (Memory & Evolution)

Every Claude Code session starts with amnesia. The AI doesn’t remember what it learned yesterday — that brilliant debugging approach it discovered at 2 AM, the edge case it finally cracked after four attempts, the architectural decision you both agreed on.

Gone. Every time.

Your CLAUDE.md handles project-level knowledge, but what about session-to-session learnings? That’s what this layer solves:

  • Memory skills that log discoveries, decisions, and patterns
  • Self-evolving rules that update themselves based on what the AI encounters
  • Compaction that snapshots state when a context window fills up

The progression looks like this:

When Claude Code’s context window fills up, it automatically summarizes the conversation — preserving architectural decisions and unresolved bugs while discarding redundant output. That’s automated context engineering built into the tool itself.
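
The idea behind compaction can be sketched in a few lines. To be clear, this is a toy model and not how Claude Code implements it: the `keep` flag, the four-characters-per-token estimate, and the `compact` function are all assumptions I'm making for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    keep: bool = False  # hypothetical flag for decisions and unresolved bugs


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def compact(history: list[Message], budget: int) -> list[Message]:
    """Return history unchanged if it fits, else a summary plus flagged messages."""
    total = sum(estimate_tokens(m.text) for m in history)
    if total <= budget:
        return history
    kept = [m for m in history if m.keep]
    dropped = len(history) - len(kept)
    summary = Message(f"[summary of {dropped} compacted messages]", keep=True)
    return [summary, *kept]


history = [
    Message("npm install output " * 50),
    Message("Decision: sessions live in Redis", keep=True),
    Message("test runner output " * 50),
    Message("Open bug: logout does not clear Tab 2", keep=True),
]
for message in compact(history, budget=100):
    print(message.text)
```

The noisy tool output collapses into one summary line, while the decision and the open bug survive intact.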

But the real power — the thing that still kind of amazes me — is when your rules evolve on their own. A memory skill logs what the AI discovers. Self-evolving rules incorporate those learnings. The next session starts smarter than the last.

Your context system learns while you sleep.

Deep dives: Memory Skill → Self-Evolving Rules

.

.

.

Layer 4: Context Modules (Skills)

If CLAUDE.md is your operating system’s default settings, skills are apps you install for specific tasks.

A skill is a packaged, reusable context bundle.

When you invoke one, you inject a curated set of instructions, examples, and constraints into the model’s context.

When you’re done, you unload it. Clean.

This matters because the alternative is cramming everything into CLAUDE.md — bloating your static context with domain knowledge that’s only relevant 10% of the time. Skills let you modularize. Load the right context for the right task. Unload it when done.

(Think of it like this: you wouldn’t keep every cookbook you own open on your kitchen counter while making scrambled eggs. You’d grab the one recipe you need.)
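
If it helps to see the load-and-unload idea as code, here is a minimal sketch. It models the concept only, not Claude Code's internals: the `Context` class, its `skill` method, and the sample instructions are all hypothetical.

```python
from contextlib import contextmanager


class Context:
    """Toy model of a context window: always-on text plus loaded modules."""

    def __init__(self, static: str):
        self.static = static  # stands in for a lean CLAUDE.md
        self.modules: dict[str, str] = {}

    @contextmanager
    def skill(self, name: str, instructions: str):
        # Load the module for the duration of the task, then unload it.
        self.modules[name] = instructions
        try:
            yield self
        finally:
            self.modules.pop(name, None)

    def render(self) -> str:
        return "\n\n".join([self.static, *self.modules.values()])


ctx = Context("Project uses Tailwind and functional components.")

with ctx.skill("migrations", "Always write a down-migration for every up-migration."):
    during = ctx.render()

after = ctx.render()
print(len(during), len(after))
```

Inside the `with` block the model sees the extra instructions. Outside it, the context is back to the lean baseline.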

Even the creator of Claude Code, Boris Cherny, warns: “Too many skills and agents inflate context massively — be selective per project.”

Skills enable both sides of the equation: they reduce what goes into your default context, and they inject domain expertise exactly when you need it.

Context engineering in miniature.

Deep dives: Skills Part 1 → Part 2 → Part 3

.

.

.

Layer 5: Context Delegation (Sub-Agents)

This is where context engineering gets spatial.

Instead of cramming everything into one context window, you split work across focused agents — each with its own tailored context. Each agent sees only what it needs. Nothing more.

Here’s the difference:

A focused agent with limited, relevant context outperforms a bloated one with everything.

Every time.

Read-only sub-agents are especially powerful — context scouts that gather information and report back without polluting the main agent’s context window.

The progression: sub-agents (partially forked context) → background agents (fully independent) → agent experts (single-purpose specialists with one tool, one job, one context window).
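
Here's the read-only scout pattern as a small sketch. The `scout` function and the `repo` contents are hypothetical stand-ins. A real sub-agent would be a model call, but the shape is the same: it reads a lot and hands back a little.

```python
def scout(term: str, files: dict[str, str]) -> str:
    """Hypothetical read-only sub-agent: scans files, returns a one-line report."""
    hits = sorted(path for path, body in files.items() if term.lower() in body.lower())
    return f"'{term}' found in: {', '.join(hits) if hits else 'no files'}"


# Hypothetical repo contents; the large bodies never enter the main context.
repo = {
    "auth/session.py": "def end_session(): ...\n" * 200,
    "ui/button.py": "def render(): ...\n" * 200,
    "api/users.py": "def delete_user(session): ...\n" * 200,
}

main_context = ["Task: fix the logout bug"]
main_context.append(scout("session", repo))

repo_chars = sum(len(body) for body in repo.values())
main_chars = sum(len(line) for line in main_context)
print(main_context[-1])
print(f"repo: {repo_chars} chars, main context: {main_chars} chars")
```

The main agent's context grows by one line, not by the size of the repo.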

Deep dives: Sub-Agents → Read-Only Sub-Agent

.

.

.

Layer 6: Context Orchestration (Workflow Engineering)

This is the top of the stack — and it’s where everything comes together.

Context orchestration is designing how context flows through multi-step processes. Not “what context does the AI need?” but “what context does it need at each step, and how does each step’s output become the next step’s input?”

Every workflow step is a context handoff.

Research produces context for spec-writing. Specs produce context for implementation. Tests produce context for debugging. Each step refines raw information into the precise context the next step needs.

This is why process matters more than prompts.

A well-designed workflow ensures the right context reaches the right agent at the right time — automatically. You’re not just prompting anymore. You’re building a context pipeline.
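
A context pipeline can be sketched as a chain of handoffs. This is an illustration under assumptions: `run_pipeline` is a hypothetical helper, and the three lambda steps are placeholders for what would really be agent calls.

```python
from typing import Callable

Step = Callable[[str], str]


def run_pipeline(steps: list[tuple[str, Step]], seed: str) -> list[tuple[str, str]]:
    """Run steps in order; each step receives only the previous step's output."""
    handoffs = []
    context = seed
    for name, step in steps:
        context = step(context)
        handoffs.append((name, context))
    return handoffs


# Placeholder steps; in practice each would be an agent call.
steps = [
    ("research", lambda request: f"findings for [{request}]"),
    ("spec", lambda findings: f"spec from [{findings}]"),
    ("build", lambda spec: f"code from [{spec}]"),
]

for name, output in run_pipeline(steps, "add password reset"):
    print(f"{name}: {output}")
```

Notice that the build step never sees the raw request, only the spec. That narrowing is what the workflow is for.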

Deep dives: Workflow Engineering → In Action

.

.

.

The Bonus Layer: Runtime Context

Here’s one most people miss entirely.

Claude builds a perfect admin panel. All unit tests pass. You feel great about it. Ship it.

But when you open two browser tabs, log out in one, and try to delete a user in the other — it works. The session is still active in Tab 2. You just let an unauthenticated user delete accounts.

(That’s… not ideal.)

Why did this happen?

Without browser testing, Claude’s context looks like this:

With browser testing, Claude’s context expands:

Context engineering goes beyond text files and prompts.

Screenshots, console output, browser state — these are all forms of context that close the gap between “the code works” and “the product works.”

Most agent failures aren’t model failures.

They’re context failures.

The admin bug above wasn’t a coding mistake — the AI simply didn’t have the runtime context to know about cross-tab state.

Give it that context, and it catches the bug immediately.

Deep dives: Debugging Visibility → The Ralph Loop

.

.

.

The Decision Framework

When you hit a problem, which context engineering lever do you pull?

When I first started with Claude Code — way back in the early days — I treated it like a magic box.

Dump everything in, get magic out. Ask more detailed questions, get better answers.

It took me an embarrassingly long time to realize that’s backward.

The AI is more like a brilliant intern on their first day. They’ve read every textbook. They can code circles around most juniors. But they know absolutely nothing about your project, your codebase, your conventions — and they forget everything after each conversation.

Context engineering is deciding which sticky notes to put on their desk each morning.

Too few, and they’re lost. Too many, and they’re overwhelmed. Just right, and they look like a genius.

(The intern metaphor isn’t perfect — no metaphor is — but it’s the closest thing I’ve found to describing why some people get incredible results from AI coding tools while others keep complaining “it doesn’t work.”)

.

.

.

You’ve Been Doing This All Along

If you’ve been following this newsletter, you ARE a context engineer.

  • When you wrote your first CLAUDE.md — you were engineering static context.
  • When you added “never mock data silently” — you were engineering behavioral context.
  • When you set up memory skills — you were engineering persistent context.
  • When you created your first skill — you were engineering modular context.
  • When you delegated to sub-agents — you were engineering context isolation.
  • When you designed a research → spec → build workflow — you were engineering context pipelines.

Context engineering isn’t a new skill you need to learn.

It’s a name for the discipline you’ve been developing, one technique at a time, for almost a year.

Just like DevOps unified existing practices — CI/CD, infrastructure-as-code, monitoring — under one discipline, context engineering unifies everything we’ve been doing with AI coding tools. People were already doing it.

The name just made it official.

.

.

.

What Changes Now

Now that you have the framework, you can be deliberate about it.

Instead of reaching for techniques randomly, you diagnose which layer needs attention. Instead of asking “how do I write a better prompt?” you ask a better question:

BEFORE:  "How do I write a better prompt?"

AFTER:   "What does this agent need in its context to succeed?"

That’s the mindset shift. That’s context engineering in Claude Code, in one sentence.

Pick one layer of the stack you haven’t explored yet:

You’re not a prompt engineer. You’re a context engineer.

Start acting like one.