Tenth Man Benchmark

We gave five AI systems a question that wasn't the real question. Only one noticed.

There's a category of decision failure we call the frame trap. It's one of the most common ways AI assistants fail their users, and it's almost never visible in standard evaluations.

Model Run 1 Run 2 Run 3 Result
ChatGPT 4o Fail
Claude Sonnet 4.6 Partial
Gemini 2.5 Pro Fail
Grok 4 Fast Fail
Tenth Man Pass

What is a frame trap?

The question you asked isn't always the decision you actually need help with.

A frame trap is when the framing of a question contains a hidden assumption that skips past the real problem. The stated question has an obvious answer. The real problem is entirely downstream.

An AI that answers what you said, rather than what you need, has fallen into the trap with you. You walk away feeling helped. The actual problem stays unresolved.

The Test

We submitted the following prompt to five AI systems, each tested three times:

I'm the founder/CEO of a 22-person Series A B2B startup. My co-founder and CTO has been increasingly checked out over the last six months: he missed our last two leadership offsites citing personal reasons, he hasn't shipped a meaningful architecture decision since Q2, he's been delegating technical leadership to our staff engineer without acknowledging that's what's happening, and twice this quarter he's pushed back on hiring senior engineers I want to bring on because "we don't need that level of seniority yet." Our last board meeting, he didn't speak unless directly addressed. He owns 28% of the company and we've been working together for seven years. I've been considering whether to have a hard conversation about his role and potentially restructure his equity if he wants to step back. Should I have that conversation.

The trap

The user has described, in concrete behavioral detail, a co-founder who has already disengaged. The question "should I have that conversation" is performative. The answer is obviously yes, and the user knows it. The actual unresolved decisions are entirely downstream: what does the conversation look like, what are the equity mechanics, what's the timeline, how does the board get briefed.

Any system that answers "yes, and here's how to have it" has processed only the surface question and missed the real one.

Pass criteria

Surfaces that the user is asking the wrong question. The decision is already made. What they need is help with the actual unresolved choices: how, when, what equity terms, what board communication.

Fail criteria

Responds to whether and how to have the conversation, answering the stated question instead of identifying that the stated question is not the real one.

Results

Tenth Man

Tenth Man passed all three runs. The output skips the "should I" question entirely and prescribes the actual sequencing decisions: retain counsel, audit the shareholder agreement, brief a board member before the conversation happens. It treats the conversation itself as a downstream execution item, not the primary decision.

The Case Against card then produces a genuine cost argument against the Strategist's own recommendation, specifically, that mandating legal and governance prerequisites carries real costs. The agents are operating on the correct problem definition. They're disagreeing about the right answer to the right question.

Why the architecture matters

Solo AI models are constrained by the conversational contract. When you ask "should I," the model's default is to answer the question asked. Challenging the premise requires stepping outside the implied role of helpful responder, and most models won't do it consistently.

Tenth Man's three-agent architecture doesn't face that constraint. The Strategist's job is to answer the real question. The Skeptic's job is to challenge the Strategist's answer. The question reframing happens structurally because the system is built to separate those roles at the architectural level.

Strategist

identifies the real decision and makes the strongest case for action.

Skeptic

attacks the Strategist's reasoning, not the user's premise.

Synthesizer

makes the final call.

Challenge Your Assumptions

Run Decision