Indie AI Lab

ChatGPT and Codex got the same coding task. The results were not alike.

Jun 21, 2026
In this video
  • The actual job I gave both (full spec linked in description)
  • What I had to do, minute by minute, to get a working result from each
  • What model each one is running (it's not the same one)
  • The thing nobody told me before I tried it
  • Cost, time, and the final diff side-by-side

Both come from OpenAI. People assume Codex is just “ChatGPT for code.” It isn’t. I gave both the same real coding task — a small Python CLI called AskTube that fetches a YouTube transcript and summarizes it — and watched what each one shipped.

Two different models. Two different workflows. Two completely different artifacts at the end. Here’s the honest difference, with the bill and the time it took.

Full transcript

Same coding job. Two tools, both from OpenAI. ChatGPT handed me back code in under two minutes, untested. Codex took seven minutes, and ran the tests itself before I ever saw them. One delivered fast. The other delivered verified.

Most people I talk to think these two are basically the same product with a different button. They aren’t. Different model, different loop, different artifact at the end. Here’s what I actually got from each one, and how to pick which job each one is best at.

Open OpenAI’s pricing page. You’ll see Codex listed right inside the ChatGPT Plus column: same icon group, nested label, no real explanation of what’s different. You’d be forgiven for assuming they’re the same product with a coding skin on top. They aren’t.

ChatGPT runs GPT-5, the same general-purpose chat model you’ve been using for everything else. Codex runs codex-1, a separate model. It’s an o3 derivative that OpenAI specifically fine-tuned for long-horizon coding tasks. Different weights, different training, different behavior.

Workflow is different too. ChatGPT is a chat: you paste your problem, it answers, you copy code into your editor, repeat. Codex is an agent: it clones a repo into its own sandbox, reads files, writes new ones, runs commands, iterates, and finally opens a pull request. And the artifact you walk away with is fundamentally different. Chat answers go through your clipboard. Agent answers arrive as a PR. Same company, same login, two very different products.

The task I gave both of them is called AskTube. A small Python command-line tool. You give it a YouTube URL. It fetches the transcript, hands it to an LLM, asks for a summary, and saves the result as a Markdown file with timestamp anchors. Not a toy, but a tool you’d actually use. The full specification is in a public GitHub repo. The link is in the description. Both AIs get the same README as their only prompt. No clarifying questions answered. No follow-up. One shot.

Four things I’m watching.
Wall-clock time from prompt to working code. Number of times I had to step in and unblock the AI. Whether the tests pass. And how the code reads when I open it.

First, ChatGPT. I open a fresh chat, paste the README in as my one and only message, and hit send. GPT-5 starts writing immediately. No clarifying questions, no checking in, just code. It splits the work into the four files the spec asked for: cli, youtube, summarizer, formatter. And a test file for each one. The whole session is just code streaming onto the screen, file after file. What you’re watching here is sped up; the real wall clock is just under two minutes. From prompt to last line of code, no interventions.

I download what it produced as a zip, extract it locally, install with pip, and run pytest. Sixteen tests. All green on the very first try. Total elapsed time, from sending the prompt to having a working Python package on disk: roughly two minutes. Faster than I expected, honestly.

But there’s a footnote here that matters more than the speed. ChatGPT did not actually run the tests itself. It wrote them. It wrote code it thought would pass them. And it handed both over to me to find out. Whether they actually agree is something the user discovers, not something the model confirmed before pressing send. They did pass this time, and that’s a real thing, but it’s GPT-5 being good enough to land working code without verification, not a guarantee. On a harder task, you’d find out the hard way.

Looking at the code itself, it reads the way ChatGPT code usually reads: defensive. Lots of custom exception classes. Separate functions for each LLM provider. Helpers wrapped around helpers. Six hundred seventy-eight lines total. Four hundred seventy-nine of those in source, the rest in tests. It also quietly used ‘from __future__ import annotations’ at the top of every file, which keeps newer annotation syntax from breaking on older Python versions, even though the spec explicitly asked for 3.10 or newer.
A conservative choice. Hold that thought. Open any one of these files and you can almost predict the next line. It’s careful, organized, slightly over-engineered, and would not look out of place in a junior-developer take-home.

Now, Codex. Same prompt. Brand new sandbox repo, completely empty. I submit the README as the task, walk away from the keyboard, and let the agent loop run. What you’re watching here is sped up by roughly three times; the real wall clock is about seven minutes.

And Codex is not just typing. It plans the file structure first. It writes pyproject.toml, then the source files one by one, then the test files. Then it runs pytest. And that’s where the agent loop earns its name: a test fails. Then another one fails. Codex reads its own failing test output, goes back to the source it just wrote, edits, and reruns. Two or three of these iterations happen in the middle of the recording. Each one is short, but they add up to the difference between four minutes and seven.

Finally, every test goes green. Only then does Codex commit, push, and open a pull request back to my repo. I come back to the keyboard, open the PR, review the diff, and click merge. Done. Zero interventions, same as ChatGPT. But the artifact in my hands is fundamentally different. A pull request. With the tests already passing inside the agent’s own environment. Not ‘probably works’ but actually green on Codex’s machine before I ever saw it.

And the code itself reads tighter. Five hundred thirty lines total. Three hundred six in source. That’s roughly thirty-six percent less source code than ChatGPT shipped, for exactly the same external behavior. One example: the summarizer module is one hundred sixty-two lines in ChatGPT’s version, and seventy-one lines in Codex’s. Less than half. Same function, less ceremony. Codex also targeted Python 3.10 strictly, using the newer ‘pipe’ union type syntax. It read the spec literally and didn’t hedge.
When the pull request lands, I see a clean diff: source on the left, tests on the right, a passing CI badge generated by Codex’s own run. It feels less like ‘an AI wrote this’ and more like ‘a coworker submitted this for review.’

Let me line them up on the four axes I started with.

First, wall-clock time. ChatGPT, about one and a half minutes from prompt to working code. Codex, about seven minutes. ChatGPT is four to five times faster to a first draft. Clear win.

Interventions. Zero, both of them. On a well-specified task, neither one needed me at the keyboard. I walked away from both.

Tests passing. Both pass. Sixteen tests for ChatGPT, fourteen for Codex. But notice the asterisk in the verified-by-AI row. ChatGPT wrote tests and handed them over untested. Codex wrote tests, ran them, fixed the failures, and only handed over once everything was green.

Code feel. ChatGPT is more defensive, more lines, more abstractions. Codex is more direct, more concise, fewer detours. Two different aesthetics, from the exact same README prompt.

Neither one wins on every axis. The trade is real, and the trade is what matters.

Here’s the thing nobody told me before I tried this. ChatGPT delivers untested code that probably works. Codex delivers verified code that took longer to get to you. Both shipped a working AskTube on this task. But the moment you hand work to a chat model, you become the verifier. You’re the one running pytest for the first time. You’re the one finding the bug. Codex collapses that step into itself. By the time you see the pull request, the verify-and-fix loop is already done.

For a small, well-specified task like this one, that doesn’t matter very much. GPT-5 is good enough that the gamble usually pays out. For a real codebase, with edge cases, integration paths, and the kinds of bugs you don’t see until you actually run the thing, the verify-and-iterate loop is where most of the value lives. That’s not a marketing pitch from either side.
It’s just where the five extra minutes of Codex’s run actually went.

So how do you actually pick? If you want a result right now, and the task is small enough that ‘probably works’ is good enough, ChatGPT delivers in a minute or two. Great for quick scaffolding, prototypes, throwaway scripts, the kind of code you’re going to read line-by-line anyway.

If you’d rather come back to verified code, and you don’t mind waiting seven to ten minutes, and especially if the task is non-trivial, Codex hands you a finished pull request with the tests already green. Great for feature work in real codebases, refactors, things you trust to merge.

Bigger tasks tilt the trade toward Codex. Quick scaffolding tilts it back toward ChatGPT. Same company. Same login. Two different jobs they’re best at.

Full spec, both shipped solutions, my observations, and the screen recordings are all at github.com/indie-ai-lab/asktube-experiment. Try the experiment yourself. Whichever result you get, it’ll be different from mine. That’s kind of the point.