ISSUE #41 · Mar 13, 2026 · 8 MIN READ

The Claude Code Skill Creator Now Has Evals (And My Skills Finally Have Proof They Work)

Here’s a confession.

For months, I’ve been building Claude Code skills with what I can only describe as the “hope and pray” methodology. Write the SKILL.md. Test it once. Ship it. Whisper a small prayer to the LLM gods. Move on with my life.

Did the skill actually trigger when it should? ¯\_(ツ)_/¯

Did it make Claude’s output better? Honestly… no idea.

I’ve been using skills since they were added to Claude Code — and until last week, I had zero way to answer either of those questions.

(Stay with me. This story has a happy ending.)

.

.

.

The Problem With Skills (That Nobody Wants to Admit)

Here’s the thing about Claude Code skills: they’re just text prompts. Fancy, well-organized text prompts — but text prompts nonetheless.

And text prompts don’t come with test suites.

I’ve built dozens of skills over the past few months. Frontend design patterns. WordPress security checklists. Newsletter writing styles. Documentation generators. Each one followed the same ritual:

  • Write a SKILL.md file
  • Test it manually (once, maybe twice if I’m feeling thorough)
  • Hope it works
  • Wonder — weeks later — if it’s actually triggering
  • Wonder — with increasing anxiety — if it’s helping when it does trigger
  • Have absolutely no data to know either way

The old skill-creator plugin could generate skills for you, which was genuinely useful. But it had no evals. No testing. No benchmarks. You’d create a skill, and then… that was it. Cross your fingers, close the terminal, pretend everything was fine.

I kept using skills because they felt useful. But I couldn’t prove it. I couldn’t point to a number and say “this skill improves output quality by 9.5%.”

Every skill I created was a guess. A lovingly crafted, well-intentioned guess — but a guess.


The Upgrade That Changes Everything

The Claude Code skill creator plugin just got a massive upgrade. And honestly? It solves the exact problem I’ve been complaining about for months.

The new version adds something skills have never had: a testing and benchmarking layer.

[Screenshot: Claude Code plugin discovery interface showing the skill-creator plugin by claude-plugins-official, with 19.1K installs and the description "Create new skills, improve existing skills, and measure s..."]

Here’s what the updated skill creator can do:

  • Create skills from your requirements (same as before)
  • Generate evals — actual test cases — automatically
  • Run parallel A/B benchmarks comparing skill vs. baseline Claude
  • Optimize trigger descriptions so your skill activates when it should
  • Iterate until the skill measurably improves output

That last part bears repeating: measurably improves output. With numbers. And charts. And side-by-side comparisons.

Let me show you how this works with a real skill I built last week.

.

.

.

Building a WordPress Security Review Skill (The Whole Process)

I've built several WooCommerce plugins, which means security reviews are part of my regular workflow. But Claude's baseline security reviews felt… inconsistent. Sometimes thorough, sometimes surface-level. No predictable structure.

Perfect candidate for a skill.

Step 1: Describe What You Want

I asked Claude Code to create a skill using the skill-creator plugin:

[Screenshot: Claude Code terminal with my prompt requesting a skill called "wp-security-review" that reviews WordPress plugin PHP code for security vulnerabilities: SQL injection, XSS, CSRF, insecure direct object references, missing capability checks, unsafe file operations, insecure superglobal usage, and hardcoded secrets.]

My prompt included the specific vulnerability types I wanted covered: SQL injection, XSS, CSRF, missing nonce verification, insecure $_GET/$_POST usage, and more.

Step 2: The Skill Creator Explores Your Codebase

Here’s where things get interesting.

Claude loaded the skill-creator skill and immediately started exploring my project:

[Screenshot: Claude Code terminal showing skill-creator successfully loaded, then searching for 2 patterns and reading files to understand the project structure, existing security references, and PHP patterns before creating the skill.]

The skill-creator looked at my existing code, found security patterns already in the project, and used that context to build a skill tailored to my codebase. (Not a generic one-size-fits-all approach.)

Step 3: The Generated Skill

Claude wrote 330 lines to .claude/skills/wp-security-review/SKILL.md:

[Screenshot: Claude Code terminal showing the created wp-security-review skill (330 lines written), with a description covering SQL injection, XSS, CSRF, missing capability checks, unsafe file operations, and hardcoded secrets, plus 3 test prompts: reviewing CartHandler.php, checking BulkActions.php, and a full plugin security audit.]

The skill included:

  • A detailed trigger description (optimized for when Claude should activate it)
  • A vulnerability checklist with 8 categories
  • WooCommerce-specific nuances — like wc_price() double-escaping and WC Settings API nonce delegation
  • Structured output format with severity ratings
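To make the shape concrete: a SKILL.md is a markdown file with a YAML frontmatter header whose description doubles as the trigger. What follows is a minimal sketch of a scaffold for such a file, not the actual 330-line skill Claude generated; the checklist wording and description text here are my own illustrative stand-ins.

```python
from pathlib import Path

# Illustrative skeleton only -- the real generated skill is ~330 lines.
# SKILL.md uses YAML frontmatter (name + description); Claude reads the
# description to decide when to load the skill.
SKILL_MD = """\
---
name: wp-security-review
description: Review WordPress plugin PHP and JS code for security issues
  (SQL injection, XSS, CSRF, missing capability checks, unsafe file
  operations, insecure superglobals, hardcoded secrets). Use when the
  user asks for a security review or audit of WordPress/WooCommerce code.
---

# WordPress Security Review

Check every vulnerability category for each file and report findings as
[SEVERITY] entries (Critical/High/Medium/Low/Info), followed by a
"Passed checks" section listing what was verified clean.
"""

def scaffold(root: str) -> Path:
    """Write the skeleton to .claude/skills/wp-security-review/SKILL.md."""
    path = Path(root) / ".claude" / "skills" / "wp-security-review" / "SKILL.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(SKILL_MD)
    return path
```

The frontmatter description is doing double duty here: it is both documentation and the trigger signal, which is why the optimizer step later in this post focuses on it.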

All good stuff. But a skill is only as good as its results.

And until now, I had no way to measure those results.

.

.

.

The Part That Made Me Actually Stop and Stare: Evals

After creating the skill, Claude immediately said: “Now let me set up test cases and run them.”

Wait, what?

[Screenshot: Claude Code terminal showing creation of the evals.json file, with test cases like "Review the CartHandler.php for security issues" and expected outputs describing structured security reports identifying $_POST sanitization issues, nonce verification patterns, and price manipulation risks.]

The skill-creator generated an evals.json file with:

  • 3 test prompts targeting different aspects of my plugin
  • Expected outputs for each test
  • Specific files to review
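I haven't dug into the plugin's actual schema, so treat this as a guess at the shape rather than the documented format — field names are reconstructed from what the terminal showed:

```python
import json

# Hypothetical sketch of the generated evals.json -- key names are my
# reconstruction from the terminal output, not the plugin's documented schema.
evals = {
    "skill": "wp-security-review",
    "cases": [
        {
            "prompt": "Review the CartHandler.php for security issues",
            "files": ["includes/CartHandler.php"],
            "expected": "Structured security report identifying $_POST "
                        "sanitization issues, nonce verification patterns, "
                        "and price manipulation risks",
        },
        # ...plus one case each for BulkActions.php and the full plugin audit
    ],
}

payload = json.dumps(evals, indent=2)  # what would land on disk as evals.json
```

The essential idea survives any schema differences: each case pairs a realistic prompt with a description of what a good answer must contain.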

And then — and I genuinely did not expect this — it launched parallel agents.

Running 6 Agents Simultaneously

[Screenshot: Claude Code terminal showing 6 agents launched in parallel: 3 "with skill" runs and 3 "without skill" baseline runs for the CartHandler, BulkActions, and full-audit test cases, all running in the background simultaneously.]

Claude launched 6 parallel agents:

  • 3 running the tests with the skill
  • 3 running the same tests without the skill (baseline Claude)

While those ran in the background, Claude drafted the evaluation assertions:

[Screenshot: Claude Code terminal displaying the evaluation assertions for each test. Eval 1 (CartHandler): severity ratings, identifying unsanitized $_POST, nonce patterns, structured reports, no false positives on WC hooks. Eval 2 (BulkActions): $_GET without nonce, capability checks, severity levels. Eval 3 (full audit): coverage of all files, summary counts, cross-file patterns, and passed-checks sections.]

Each test case had specific pass/fail criteria:

  • has_severity_ratings — Findings must be categorized by severity
  • identifies_unsanitized_post — Must flag $_POST['wcpwyw_price'] at line 126
  • no_false_positives_on_wc_hooks — Must NOT flag WC Settings API as missing nonces
  • includes_passed_checks — Must list what passed, not just failures

(Real criteria. Measurable criteria. I could have cried.)
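To show what "measurable" means here, this is a sketch of how such assertions could be checked against a report's text. The real eval runner presumably grades with a model rather than regexes, so these string checks are a deliberate simplification; the criterion names come from the output above, the matching logic is mine:

```python
import re

# Simplified graders: one predicate per pass/fail criterion.
# (Illustrative only -- the actual runner's grading method isn't shown.)
ASSERTIONS = {
    "has_severity_ratings":
        lambda r: bool(re.search(r"\[(CRITICAL|HIGH|MEDIUM|LOW|INFO)\]", r)),
    "identifies_unsanitized_post":
        lambda r: "$_POST['wcpwyw_price']" in r,
    "no_false_positives_on_wc_hooks":
        lambda r: "WC Settings API missing nonce" not in r,
    "includes_passed_checks":
        lambda r: "Passed checks" in r,
}

def grade(report: str) -> dict:
    """Return criterion-name -> pass/fail for one report."""
    return {name: check(report) for name, check in ASSERTIONS.items()}

sample = (
    "[HIGH] Unsanitized $_POST['wcpwyw_price'] at line 126\n"
    "Passed checks: output escaping, capability checks\n"
)
scores = grade(sample)  # every criterion passes for this sample
```

Note the negative criterion: a grader that only rewards findings would push the skill toward false positives, so "must NOT flag X" checks matter as much as the positive ones.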

.

.

.

Results Rolling In

As the agents completed, results started appearing:

[Screenshot: Eval 2 (BulkActions) with skill completed: summary of 0 Critical, 0 High, 1 Medium, 2 Low, 1 Info findings; 38,976 tokens in 227 seconds.]
[Screenshot: Eval 2 (BulkActions) baseline completed: 135 lines of output, 31,386 tokens in 236 seconds. Both runs produced similar findings for this relatively simple file, but the with-skill version used a structured severity format and checklist notation while the baseline used freeform numbered lists.]
[Screenshot: All 6 agents (Eval 1, 2, 3 with and without skill) completed; extracting remaining reports from transcripts. Eval 3 (full audit) with skill reviewed 23 files: 0 Critical, 2 High, 4 Medium, 5 Low, 3 Info.]
[Screenshot: All 6 reports saved; generating the eval viewer for side-by-side comparison and writing benchmark.json with skill_name "wp-security-review" and configurations for the "with_skill" and "without_skill" runs.]

All 6 agents finished. Claude compiled the results and generated something I’ve never seen in skill development before.

.

.

.

The Eval Review Playground

Claude generated an HTML-based eval viewer and opened it in my browser:

[Screenshot: Browser-based eval review interface titled "Eval Review: wp-security-review", test case 1 of 6, tagged WITH SKILL. Prompt: "Review the CartHandler.php for security issues". Output: a structured security review with a Summary (0 Critical, 2 High, 2 Medium, 2 Low, 2 Info) and severity-tagged findings such as "[HIGH] Price Manipulation via Cart Session - Missing Server-Side Re-validation in applyCartItemPrice".]
[Screenshot: The same prompt tagged WITHOUT SKILL (baseline). Output: a plain "Security Review: CartHandler.php" header with file path, date, and "Claude Opus 4.6 (automated review)" as reviewer, followed by an Executive Summary in prose rather than a structured findings list.]

Side-by-side comparison. Same prompt, same file, two different approaches.

The difference was immediately visible:

  • With skill: [HIGH] Price Manipulation via Cart Session — structured, scannable, severity-tagged
  • Without skill: Prose-style Executive Summary, harder to scan

But subjective impressions only get you so far. Here’s where the numbers come in.

.

.

.

The Benchmark Results (This Is the Good Part)

[Screenshot: Eval viewer benchmark comparison table. Pass rate 100% (21/21) with skill vs 90.5% (19/21) baseline (+9.5%). Avg tokens 74,427 vs 69,734 (+6.7%). Avg time 276s vs 307s (9.9% faster). Key differences: the skill version elevated the price cap bypass to HIGH severity, avoided false positives on WC nonces, and produced more structured passed-checks sections.]
  Metric       With Skill      Baseline        Delta
  Pass rate    100% (21/21)    90.5% (19/21)   +9.5%
  Avg tokens   74,427          69,734          +6.7%
  Avg time     276s            307s            9.9% faster
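The deltas in that table are plain arithmetic. Spelled out (note that recomputing from the rounded averages shown gives a time saving of about 10.1%, so the viewer's 9.9% was presumably computed from unrounded per-run times):

```python
# Figures taken straight from the benchmark table above.
with_skill = {"passed": 21, "total": 21, "tokens": 74_427, "time_s": 276}
baseline   = {"passed": 19, "total": 21, "tokens": 69_734, "time_s": 307}

# Pass-rate delta: percentage-point difference between the two pass rates.
pass_delta = round(100 * (with_skill["passed"] / with_skill["total"]
                          - baseline["passed"] / baseline["total"]), 1)

# Token overhead: extra tokens relative to the baseline run.
token_delta = round(100 * (with_skill["tokens"] - baseline["tokens"])
                    / baseline["tokens"], 1)

# Time saved: seconds saved relative to the baseline run.
time_saved = round(100 * (baseline["time_s"] - with_skill["time_s"])
                   / baseline["time_s"], 1)
```

So the skill buys 9.5 percentage points of assertion pass rate for a ~6.7% token overhead, and still finishes sooner.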

👉 The skill achieved 100% pass rate on all 21 assertions.

Baseline Claude hit 90.5% — missing structured passed-checks sections and some WooCommerce-specific nuances.

And here’s the kicker: the skill was actually faster despite being more thorough.

Key differences Claude identified:

  • Skill version elevated price cap bypass and session revalidation to HIGH severity (correct — these affect money)
  • Skill version correctly noted that current_user_can checks are unnecessary on frontend AJAX handlers
  • Skill version had more structured passed-checks sections
  • Baseline missed nuances like WC Settings API nonce delegation

For the first time ever, I had proof my skill was helping.

.

.

.

Iteration: When The Evals Teach You What to Fix

Here’s where it gets even better.

Looking at the results, Claude identified areas where even the improved skill could be better:

[Screenshot: Claude Code terminal: after I say "please proceed", Claude lists 3 improvements based on analyzing all 6 reports: 1) add guidance on reviewing JavaScript files (the baseline caught DOM XSS in JS that the skill missed), 2) clarify severity calibration for wp_unslash on numeric fields (inconsistent HIGH/MEDIUM ratings), 3) add a note about checking business logic consistency across code paths (the price cap bypass was the most valuable finding).]

Claude proposed 3 refinements:

  1. Add JavaScript file guidance — The baseline actually caught DOM XSS in JS that the skill version missed (oops)
  2. Calibrate severity for wp_unslash — Numeric field sanitization was rated inconsistently between HIGH and MEDIUM
  3. Add business logic consistency checks — The price cap bypass (validation in one path but not another) was the most valuable finding type

[Screenshot: Detailed diff of the SKILL.md updates: JavaScript scanning for DOM-based XSS, notes on WooCommerce-specific escaping patterns for wc_price(), wp_unslash severity on numeric fields recalibrated to Low instead of High, and business logic consistency checking to catch validation that exists in one code path but not another.]

The skill-creator modified SKILL.md with surgical precision:

  • Added DOM-based XSS scanning for JavaScript files
  • Added guidance on wc_price() double-escaping
  • Calibrated wp_unslash on numeric fields to Low severity
  • Added business logic consistency patterns

The evals taught me where my skill was weak. Then Claude fixed it.

.

.

.

Trigger Description Optimization

A brilliant skill that never activates is… well, useless.

Claude ran a trigger optimization loop:

[Screenshot: Claude Code terminal showing the skill updated with 3 improvements, then running description optimization: creating trigger-eval.json with test queries like "I just wrote a new AJAX handler for my WooCommerce plugin that lets admins delete orders. Can you check if it's secure?" marked should_trigger: true, and running the optimization loop in the background (up to 5 iterations).]

The skill-creator generated test prompts that should trigger the skill, and prompts that shouldn’t. Then it ran those prompts against Claude to see if the skill activates correctly.

[Screenshot: Optimization running (36+ API calls to Opus). Current state: skill at ~350 lines covering 8 vulnerability categories, structured report format with severity ratings, WooCommerce-specific nuances. Benchmark results: 100% pass rate with skill vs 90.5% baseline; skill 9.9% faster. Improvements applied: JS/DOM scanning, cross-path consistency checking, wp_unslash calibration, and WP bulk actions nuances.]

After multiple iterations, trigger accuracy improved. The skill now fires when it should — and stays quiet when it shouldn’t.
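Scoring this is conceptually simple: each test query carries a should_trigger label, and accuracy is the fraction of queries where the skill's actual behavior matched the label. A sketch (the field names mirror the trigger-eval.json shown above, but the did_trigger values are hardcoded for illustration — in reality they'd come from running each query against Claude):

```python
# Hypothetical trigger-eval cases: labels say what SHOULD happen,
# did_trigger records what actually happened on a run.
trigger_cases = [
    {"query": "Check if my new WooCommerce AJAX handler is secure",
     "should_trigger": True,  "did_trigger": True},
    {"query": "Help me style this pricing table with CSS",
     "should_trigger": False, "did_trigger": False},
    {"query": "Review BulkActions.php for vulnerabilities",
     "should_trigger": True,  "did_trigger": False},  # a false negative
]

def trigger_accuracy(cases) -> float:
    """Fraction of cases where actual activation matched the label."""
    hits = sum(c["should_trigger"] == c["did_trigger"] for c in cases)
    return hits / len(cases)

accuracy = trigger_accuracy(trigger_cases)  # 2 of 3 matched here
```

The negative cases are what keep the optimizer honest: without should_trigger: false queries, the "optimal" description would simply trigger on everything.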

.

.

.

The Final Skill

[Screenshot: Claude Code terminal: "The skill is complete." Final summary: wp-security-review created at .claude/skills/wp-security-review/SKILL.md. Reviews WordPress plugin PHP and JS code for 8 vulnerability categories including SQL injection, XSS (including DOM XSS), CSRF, IDOR, missing capability checks, unsafe file operations, insecure superglobals, and hardcoded secrets. Unique value over baseline: structured [SEVERITY] format, comprehensive passed-checks section, WooCommerce-specific nuances, cross-path consistency checking, and correct severity calibration.]
[Screenshot: VS Code file explorer showing the wp-security-review skill folder, with SKILL.md and an evals subfolder containing evals.json.]

The completed skill:

  • Reviews WordPress plugin PHP and JS code
  • Covers 8 vulnerability categories
  • Produces structured [SEVERITY] tagged output
  • Includes WooCommerce-specific nuances (nonce delegation, wc_price() escaping, frontend vs admin hooks)
  • Catches business logic inconsistencies (validation in one path but not another)
  • Benchmarks at 100% pass rate vs 90.5% baseline

And I have the data to prove it works.

.

.

.

Why This Matters For Your Skills

The Claude Code skill creator fundamentally changes what’s possible.

👉 Before: Skills were art. Intuition. Trial and error. Hope and prayer.

👉 After: Skills are engineering. Testable. Measurable. Improvable.

Here’s what becomes possible:

1. A/B Test Every Skill You Build

Every skill you create can be benchmarked against baseline Claude. If your skill doesn’t measurably improve output, you know immediately — before you ship it, not three weeks later.

2. Catch Regressions When Models Update

When Claude Opus 5.0 ships, run your benchmarks again. If baseline now matches or exceeds your skill’s performance, the skill may be locking in outdated patterns. Time to retire it — or improve it.

3. Tune Your Trigger Descriptions

A skill that triggers 50% of the time is only half as valuable. The description optimizer catches false positives (triggering when it shouldn’t) and false negatives (not triggering when it should).

4. Run Continuous Improvement Loops

Each eval run produces actionable feedback. Claude identifies gaps, proposes fixes, and re-benchmarks — all without you manually debugging SKILL.md files at midnight.

.

.

.

Your Next Steps

  1. Open Claude Code
  2. Type /plugin and search for skill-creator
  3. Install the official Anthropic plugin (19,100+ installs and counting)
  4. Pick one skill you’ve already built — or a new one you’ve been meaning to create
  5. Ask Claude to create evals and benchmark it
  6. Watch the data tell you exactly where to improve

What skill are you going to benchmark first?

The developers who run evals will build better skills than those who don’t. That’s just… math.

Go build yours.

Now.

Nathan Onn

Freelance web developer. Since 2012 he’s built WordPress plugins, internal tools, and AI-powered apps. He writes The Art of Vibe Coding, a practical newsletter that helps indie builders ship faster with AI—calmly.
