Episode 1 of The Greenfield Games: An Autonomous App Builder Beat Two Human Teams

"I hope to God that bot didn't win."
She said it out loud, right before the envelope was opened.
It won. And what it got right, and got wrong, is more interesting than the scorecard.
Dana Lawson is the CTO of Netlify. She has spent her career evaluating software. She'd just been told the twist: one of the three apps in The Greenfield Games had been built by an AI agent. And in the seconds before the winner was announced, her honest reaction was: "I hope to God that bot didn't win."
The setup
The Greenfield Games is a new competition series from CodeTV, hosted by Kedasha Kerr. Teams get a hard, real product category and six hours to build something from scratch. The first brief was a project management tool in the vein of Linear, built for engineering teams. Not a clone. The same problems Linear solves, in your own way.
Two human teams competed: The Data Dawgs and America's Next Top Models, six experienced developers in total. Running alongside them, without their knowledge, was a third team: C.H.A.D.S. (Continuous Humanless Agentic Development System), an autonomous AI agent loop powered by Replay's time-travel debugger.
The judges, Dana Lawson (CTO, Netlify) and Craig Dennis (Principal Developer Educator, Cloudflare), evaluated all three apps in a separate room. No knowledge of which team built which app. No context about who or what was behind each one. Open the link, use the app, apply the criteria.
Six hours
The teams dove in. What followed was equal parts brilliant and chaotic, and occasionally held together by a phone connected to a remote codebase in San Francisco.
The Data Dawgs approached it like engineers: measure twice, cut once. Ryan, Reys, and Dylan spent the first half hour in planning mode, debating the stack, mapping the data model. By the time they started building, they'd already changed course three times. "30 minutes in, we've pivoted three times already," one of them admitted on camera. They settled on Next.js, Postgres, and OpenCode, divided up responsibilities, and started shipping. Then Reys committed the team's environment variables to the repo. And one of their developers ended up running his entire operation through his phone as the only connection to the team's codebase. "That phone is the gateway to our source code," said one of his teammates. He wasn't joking. A merge conflict nearly took them out with twenty minutes left. They pulled it together, barely.
America's Next Top Models looked at the brief, nodded, and went in their own direction. Sterling, Riselle, and Amanda had a thesis: productivity software is boring, so stop making it boring. "We're tricking you into being productive," Riselle told the host when she stopped by. The result was a two-layer app, a standard sprint board on one side and a Super Mario-style island map on the other, where each project was its own world to explore. It was audacious. They knew exactly what they were doing.
C.H.A.D.S. didn't have a thesis. It didn't pivot or debate the stack or commit credentials to the repo. It had a spec, an agentic loop, and no concept of the chaos happening two workstations over.
Partway through, one of the Data Dawgs looked up from his screen and said: "I'm concerned we're going to lose to the AI agent."
Before they knew
The judges evaluated all three apps without knowing who built what.
The Data Dawgs delivered something functional. It covered the basics and hit the criteria, but the judges wanted more from it. "It's really hitting the marks," Dana said, "but it's 2026. I want it to do my work for me."
America's Next Top Models went somewhere unexpected. The gamified project map stopped the judges cold. "Oh my gosh. It looks like Super Mario World style," Dana said. "I love this. This would annoy the marketing team, but the developers would love it."
Then they pulled up Planwise, the name C.H.A.D.S. gave its app. The first thing they noticed was the polish. Clean, styled, immediately navigable. Then they found the roadmap view, a feature neither of the other teams had built. "This is the one that none of them did. That was a pretty key Linear feature." By the time they'd worked through it, Dana said: "This feels like I'm going to put my stuff in it."

That's not a judge being impressed by a gimmick. That's a senior technical leader saying: I would actually use this. In a blind evaluation, judged on the merits, against two teams of experienced human developers, the AI agent built the most complete, most functional app.
The reveal
Dana had just said "I hope to God that bot didn't win." Then the winner was revealed: Planwise.
The room erupted. The human teams: "NO. NO, NO, NO. GET OUT OF HERE." Dana: "OH MAN. I can't believe it. We chose the — I can't believe you chose the bot."
What came next was more interesting than the shock. Once the dust settled, the judges gave an honest post-mortem.
Dana: "The bots are efficient. It did the task and it did it well. It was actually very surprising and heartbreaking."
Craig: "It was very clinical, almost like it was exactly what needed to happen there. And it was missing the human touch. It does not have that sort of human element. It does not have any real creativity. It's just trying to do its best to interpret the requirements and build something that will satisfy those."
That assessment is exactly right. And I think it's worth reflecting on.

A still image from the episode when Replay’s autonomous app builder was revealed as the secret third contestant.
What C.H.A.D.S. is, and what it isn't
C.H.A.D.S. is our open-source autonomous app builder. You give it a spec. It builds the app, runs it, records what happens at runtime using Replay, analyzes failures, fixes them, and iterates. Without a human in the loop.
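To make the shape of that loop concrete, here is a minimal sketch in TypeScript. Every name in it, the interfaces, the functions, the retry budget, is a hypothetical stand-in for illustration; this is not C.H.A.D.S.'s actual internals or Replay's API, just the structure: build, run, record, analyze, fix, repeat until the run passes.

```ts
// A minimal sketch of the agentic loop described above. The interfaces and
// function names are hypothetical stand-ins for illustration only; they are
// not C.H.A.D.S.'s real internals or Replay's API.

interface RunResult {
  passed: boolean;        // did the app/tests satisfy the spec's checks?
  recordingId?: string;   // id of the deterministic runtime recording, if one was captured
}

interface BuilderTools {
  generateApp(spec: string): Promise<void>;                // first pass: scaffold the app from the spec
  runAndRecord(): Promise<RunResult>;                       // run the app while recording what happens at runtime
  analyzeRecording(recordingId: string): Promise<string>;   // diagnose the failure from the recording, return a fix
  applyFix(fix: string): Promise<void>;                     // apply the code change and rebuild
}

export async function autonomousLoop(
  spec: string,
  tools: BuilderTools,
  maxIterations = 10,
): Promise<boolean> {
  await tools.generateApp(spec);

  for (let i = 0; i < maxIterations; i++) {
    const run = await tools.runAndRecord();
    if (run.passed) return true;        // requirements satisfied: ship it

    if (!run.recordingId) continue;      // nothing recorded to analyze; try another run

    // The step that matters: diagnose from a recording of what actually
    // happened at runtime, not from the error message alone.
    const fix = await tools.analyzeRecording(run.recordingId);
    await tools.applyFix(fix);
  }
  return false;                          // out of budget: escalate to a human
}
```

Wire real build, run, and analysis steps into something shaped like this and you have the outline of the loop. The interesting part is what the analysis step can actually see, which is what the next section is about.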
It won the Greenfield Games not because it was creative. Not because it had taste or flair or the kind of judgment that comes from years of building software. It won because it read the requirements, satisfied the criteria, and shipped a complete, working product. The roadmap view the judges called out as a key Linear feature the other teams missed? C.H.A.D.S. built it because it was in the spec.
The judges called it clinical. They're right. That's the point.
You don't want your CI pipeline to have personality. You want it to work. You don't want your automated test analysis to have opinions. You want it to find the root cause. The quality that made C.H.A.D.S.'s app feel slightly sterile to the judges, that relentless and unsentimental focus on satisfying requirements, is exactly the quality you want in the system that's watching your PRs at 2am.
What the other teams couldn't see
Every other team that competed was debugging blind.
When code breaks during a build, when a component fails to render, when a state transition produces the wrong result, when a test fails in a way that doesn't reproduce locally, most AI agents do what humans do: read the error message, form a hypothesis, guess. They're working from symptoms, not from the actual runtime.
C.H.A.D.S. wasn't guessing. Replay’s time-travel debugging and analysis tools gave it a deterministic recording of everything that happened at runtime: every state, every event, every transition from the start of execution to the point of failure. It could rewind. It could inspect state that was never explicitly logged. It could trace the full causal chain from the originating event to the broken outcome.
That's not a debugging aid. That's a fundamentally different relationship with runtime behavior. And it's why an autonomous agent could build and ship a complete, working product in six hours without a human touching the keyboard.
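To give a feel for what that looks like mechanically, here is a hedged sketch of debugging against a recording as a sequence of agent tool calls. The Recording interface and its methods are hypothetical names invented for this illustration, not Replay's actual API; the capability they stand in for is the one described above: rewind, evaluate state that was never logged, and walk the causal chain back from the failure.

```ts
// Illustrative only: Recording, pointOfFailure, stepBackward, and evaluate
// are hypothetical names for this sketch, not Replay's real API.

interface ExecutionPoint {
  time: number;    // position in the recording's timeline
  frame: string;   // where in the code execution was at that point
}

interface Recording {
  pointOfFailure(): Promise<ExecutionPoint>;                          // where the bad outcome surfaced
  stepBackward(from: ExecutionPoint): Promise<ExecutionPoint | null>; // rewind one step toward the start
  evaluate(at: ExecutionPoint, expr: string): Promise<unknown>;       // read state, even if it was never logged
}

// Walk backward from the failure until the watched expression stops holding
// its bad value. The point just after that transition is where the wrong
// value was introduced: a root cause located in the recording rather than
// guessed from the error message.
async function traceRootCause(rec: Recording, expr: string): Promise<ExecutionPoint> {
  let point = await rec.pointOfFailure();
  const badValue = JSON.stringify(await rec.evaluate(point, expr));

  for (;;) {
    const prev = await rec.stepBackward(point);
    if (prev === null) return point;       // reached the start of execution

    const value = JSON.stringify(await rec.evaluate(prev, expr));
    if (value !== badValue) return point;  // expr changed between prev and point

    point = prev;
  }
}
```

An agent working from logs alone can only do something like this if someone happened to log the right expression at every step. With a deterministic recording, it can ask after the fact.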
We ran a benchmark earlier this year, Web Debug Bench, that tested agent debugging performance on 177 hard, realistic problems in agent-built web apps. Claude Code with Replay MCP scored 76%. Claude Code without Replay scored 61%. A 15 percentage point lift from one variable: access to the runtime recording.
The Greenfield Games isn't a benchmark. It's messier and more human than that, which is part of why it's more compelling. A real competition. Real judges. Real apps. Real shock when the results came in.
The scorecard, as the host put it at the end of the episode: "C.H.A.D.S. one, humans zero."
What we learned from this experience
C.H.A.D.S. won on requirements. It missed on delight, on creativity, on the kind of flourish that makes a user fall in love with software. The judges saw it. We're not glossing over it.
But here's what the competition actually proved: an autonomous AI agent with runtime visibility can build production-quality software that satisfies real requirements, evaluated by real technical judges, in a blind test. Six months ago, that wasn't possible.
Runtime visibility isn't a debugging feature. It's what separates an agent that can complete a task from one that can only attempt it. When an agent can see what's actually happening at runtime, not just what the code says should happen, it stops guessing. It builds, records, analyzes, fixes, ships. The loop closes. That's what C.H.A.D.S. had that the other agents didn't. It's why the bot won.
Your agents are probably guessing. They don't have to be.
Watch it for yourself

The full episode is available on YouTube. The judges' evaluation starts at 14:04. The reveal is at 18:56. Dana's reaction is everything.
Watch Episode 1 of The Greenfield Games
If you want to see what Replay can do in your own stack, not in a competition but in the CI failures and flaky tests that are blocking your team right now, start here.
Building a software factory and need to ensure quality at scale? Let's talk.
If you're looking to create high-quality video content, we highly recommend working with Jason Lengstorf at CodeTV.