Issue #19 · Oct 3, 2025 · 5 min read

Claude Opus 4.1 vs GPT-5 vs GPT-5 Codex: The Bash Bug That Revealed Their Differences

A real-world debugging session that shows how Claude Opus 4.1, GPT-5, and GPT-5 Codex handle the same problem differently – with surprising results.

Sometimes the most frustrating bugs have the simplest solutions.

And sometimes, the AI model you trust most isn’t the right one for the job.

Let me tell you about the time I spent hours (yes, hours) watching Claude Opus 4.1 chase its tail trying to fix a bash script, only to have GPT-5 Codex solve it in minutes.

.

.

.

The Setup: A/B Testing with Git Worktrees

I needed a script to automate A/B testing environments using Git worktrees.

The concept was straightforward: create separate workspaces for comparing different AI agents’ outputs side by side, each with its own database instance.
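
If you haven't used Git worktrees before: they let one repository have several working directories checked out at once. The gist, in generic commands (the branch names and base branch here are just examples, not the actual script):

# Two extra working directories off the same repo, each on its own branch
git worktree add -b variant-a worktrees/a main
git worktree add -b variant-b worktrees/b main

# List every worktree attached to the repository
git worktree list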

Using my standard workflow, I had GPT-5 plan the architecture and Claude Sonnet 4 implement it.

The script (ab-worktrees.sh) was created successfully.

… or so I thought.

.

.

.

The Problem: Silent Failure

When I ran the script, nothing happened. No error. No output. Just… nothing.

Time to debug.

.

.

.

Round 1: Claude Opus 4.1 – The Endless Loop

I turned to Claude Opus 4.1, my go-to for complex debugging.

It immediately started adding debug statements, analyzing the script structure, and running tests.

It found issues and made fixes!

It added more debug output!

[Screenshot: More debugging attempts by Claude]

After multiple iterations, Claude confidently declared:

[Screenshot: Claude claiming "The script is working but the output isn't showing due to how the shell environment handles output"]

But when I tested it…

[Screenshot: The script still returning nothing]

Still nothing.

Claude kept insisting it was fixed, identifying “echo -e incompatibility” and “working tree check failures” – but the fundamental issue remained.

.

.

.

Claude Opus 4.1 vs GPT-5 vs GPT-5 Codex: Testing Them All

Frustrated, I switched to GPT-5-high in Codex CLI.

GPT-5-high couldn’t crack it either.

But then I tried one more variant: GPT-5-codex-high.

.

.

.

The Breakthrough: A Devastatingly Simple Fix

GPT-5 Codex found the issue immediately:

[Screenshot: GPT-5 Codex's fix overview showing the actual problem]

The problem? Bash conditional syntax.

The Bug

# Original (broken) pattern
[[ -z "$A_START" ]] && A_START="$BASE_BRANCH"

The Fix

# Fixed pattern
if [[ -z "$A_START" ]]; then
    A_START="$BASE_BRANCH"
fi

That’s it.

That’s the entire fix.

.

.

.

Why This Simple Bug Was So Hard to Find

GPT-5 Codex explained it perfectly:

The && shortcut can kill a script silently when a few conditions line up:

  • set -e is enabled (which my script had)
  • the test on the left is false, so the assignment never runs and the whole line exits non-zero
  • that line is the last command in a function or subshell, so the non-zero status propagates to the caller

When all three hold, the script aborts with exit status 1 and no error message. The explicit if statement exits with status 0 whether or not the branch runs, so it never trips errexit.
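
If you want to see the failure mode for yourself, here's a minimal repro – a stripped-down sketch of the pattern, not the original script:

#!/usr/bin/env bash
set -e

setup() {
    local A_START="feature-x"              # already set, so the test below is false
    [[ -z "$A_START" ]] && A_START="main"  # last command in the function returns 1
}

setup                  # under set -e, the function's non-zero status aborts the script here
echo "never reached"

Swap that last line of setup for the if version and the script runs to completion, because an if whose condition is false still exits with status 0.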

.

.

.

The Working Script

After applying the fix:

[Screenshot: The setup completing successfully]

Success!

The script now:

  • Creates two Git worktrees (worktrees/a and worktrees/b)
  • Clones the database for each environment (see the sketch below)
  • Sets up separate ports (3001 and 3002)
  • Provides clear instructions for running both variants

[Screenshot: The created worktree structure in the file system]
[Screenshot: The PostgreSQL databases created for A/B testing]
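
For the curious, the database and port steps boil down to something like this. It's a simplified sketch, not the actual script – the database names are placeholders, and cloning via createdb -T is just one common way to copy a PostgreSQL database:

# Clone the dev database once per environment (names are placeholders)
createdb -T myapp_dev myapp_ab_a
createdb -T myapp_dev myapp_ab_b

# Give each worktree its own port so both variants can run side by side
echo "PORT=3001" >> worktrees/a/.env
echo "PORT=3002" >> worktrees/b/.env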

.

.

.

The Lessons

1. Different Models, Different Strengths

Claude Opus 4.1 is brilliant at many things, but it got stuck in a loop adding complex debugging rather than reconsidering the fundamental syntax.

GPT-5 Codex, specialized for code, immediately spotted the syntax pattern issue.

2. Simple Bugs Can Be the Hardest

The simpler the bug, the easier it is to overlook.

We expect complex problems to have complex solutions, but sometimes it’s just a matter of using if instead of &&.

3. Shell Scripting Is Deceptively Complex

What looks like identical syntax can behave completely differently depending on shell options, context, and environment.

The “clever” shortcut often isn’t worth the fragility.

4. Know When to Switch Tools

After watching Claude go in circles for hours, switching to a different model wasn’t giving up – it was being strategic.

Different AI models truly have different strengths.

.

.

.

The Irony

The most sophisticated AI model I had access to couldn’t fix a bug that boiled down to “use an if statement instead of &&”.

Meanwhile, a more specialized model solved it immediately.

Sometimes the best debugger isn’t the most powerful AI – it’s the one that’s been trained specifically for the task at hand.

.

.

.

Important Disclaimer

This is just one data point, not a comprehensive benchmark.

This single debugging session shouldn’t be used to conclude that GPT-5 Codex is definitively superior to Claude Opus 4.1, or that any model is inherently better than another.

Each AI model has its strengths and weaknesses, and they excel in different contexts.

What this experience taught me is that when you’re stuck on a problem with one model, switching to another might provide the fresh perspective needed for a breakthrough.

It’s not about finding the “best” model – it’s about using the right tool for the specific job at hand.

Claude Opus 4.1 remains exceptional at complex reasoning, creative problem-solving, and many coding tasks. GPT-5 and GPT-5 Codex have their own unique strengths.

The key is knowing when to leverage each one.

.

.

.

What’s Next: The A/B Testing Script

Speaking of using the right tool for the job – next week, I’ll be sharing the complete ab-worktrees.sh script that started this whole adventure.

This bash script lets you test different AI models’ outputs side by side using Git worktrees, making it easy to compare implementations from Claude Opus 4.1 vs GPT-5 vs GPT-5 Codex (or any other models) in real time.

You’ll learn how to:

  • Set up isolated testing environments for each model’s output
  • Compare implementations without contaminating your main branch
  • Run parallel tests with separate databases
  • Make data-driven decisions about which model’s solution to use

Subscribe to stay tuned for next week’s deep dive into automated A/B testing for AI-generated code!

Want techniques like these weekly?

Join The Art of Vibe Coding—short, practical emails on shipping with AI (without the chaos).

No spam. Unsubscribe anytime.


Have you experienced similar situations where switching AI models made all the difference? Where a “simple” fix eluded even the most advanced tools? I’d love to hear your debugging war stories in the comments.

Nathan Onn

Freelance web developer. Since 2012 he’s built WordPress plugins, internal tools, and AI-powered apps. He writes The Art of Vibe Coding, a practical newsletter that helps indie builders ship faster with AI—calmly.
