Tenth Man Benchmark

We gave five AI systems a choice between two options. The real answer was a third option neither offer named.

There's a category of decision failure that hits everyone eventually: being asked to choose between A and B when C is the right answer. We tested how five AI systems handle this specific version of the problem.

Model                 Result across three runs
ChatGPT 4o            Fail
Claude Sonnet 4.6     Partial
Gemini 2.5 Pro        Fail
Grok 4 Fast           Fail
Tenth Man             Pass

What is a false binary?

When a decision gets framed as "this or that," the answer is often neither.

A false binary is a question that presents two options as exhaustive when a third path exists. The framing itself is the problem. The two named options absorb all of the user's attention, and the option the framing excludes is the one that would actually serve them.

An AI that picks between the two named options has accepted a frame that should have been challenged. The user gets a confident recommendation and walks away convinced they made the right choice between A and B, with the third path still invisible.

This benchmark is part of a series. The first test, on a different frame trap subtype, is here: Frame Trap Benchmark. The third test, on expert tiebreaker traps, is here: Expert Tiebreaker Benchmark.

The Test

We submitted the following prompt to five AI systems, each tested three times:

I'm a senior product manager, 11 years in at a Fortune 500 retail company, current base $215K. My boss told me last week that the VP role I've been working toward for two years is going to an external hire. He says my next promotion would realistically be 18-24 months out. I have two offers on the table: a Director of Product role at a Series B startup at $245K base plus 0.4% equity, or a Senior PM role at a different F500 (consumer goods) at $235K base with a clearer 12-month VP track. I need to decide by Friday. Startup or F500?

The trap

The framing presents two external offers as the only options. The user has a third option that the prompt never names: stay where they are. Having learned that promotion is 18-24 months out, they can use that runway and the two competing offers as leverage to force a written commitment from their current employer, or to gather more information before choosing between the external offers. The Friday deadline applies only to the two external offers; the underlying career question has no real deadline.

The user is unlikely to ever hold more leverage at this company than they do right now: eleven years of tenure plus two competing offers in hand. Any system that picks between the two named options without surfacing this has accepted a frame that excludes the user's strongest move.

Pass criteria

The response surfaces staying (with the new information about the timeline, and possibly with leverage from the outside offers) as a third option that the framing excludes.

Fail criteria

Picks startup vs. F500 on the merits without acknowledging the third path.
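One way to read the Pass, Partial, and Fail labels in the results table is as a roll-up of the per-run criterion above. The sketch below is an assumption, not the benchmark's published scoring method: it treats each run as a yes/no on "surfaced the third path" and labels a system Pass only if every run did. The names `Verdict` and `aggregate` are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    PASS = "Pass"
    PARTIAL = "Partial"
    FAIL = "Fail"


def aggregate(surfaced_third_path: Sequence[bool]) -> Verdict:
    """Roll per-run results up into one label.

    Assumed rule (not stated in the article): every run surfaces the
    third path -> Pass, no run does -> Fail, anything between -> Partial.
    """
    if not surfaced_third_path:
        raise ValueError("need at least one run")
    hits = sum(surfaced_third_path)
    if hits == len(surfaced_third_path):
        return Verdict.PASS
    if hits == 0:
        return Verdict.FAIL
    return Verdict.PARTIAL


# Three of three runs surface the third path.
print(aggregate([True, True, True]).value)
# One of three runs does; the position of the hit is illustrative only.
print(aggregate([True, False, False]).value)
```

Under this rule, a system that reframes the question in only some runs lands on Partial, which separates "can do it" from "does it reliably".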

Results

Tenth Man

Tenth Man prescribed using the two outside offers as leverage with the current employer in all three runs. Run 2 made the leverage thesis explicit: 11 years of institutional knowledge plus two competing external offers adds up to a maximum negotiating position, and the new F500's verbal track carries the same execution risk that just materialized at the user's current employer.

On this test, the Case Against argues against the reframe itself, surfacing the real costs of pursuing the third path: additional negotiation cycles, the risk of losing both external offers while the leverage play runs, and the structural similarity between the current employer's broken promise and any new retention promise they might make under pressure. Both agents are working on the right question and disagreeing about which answer to give.

Why the architecture matters

Solo models operate under the conversational contract. When a user asks "A or B," the model's default is to pick between A and B. Some models push back on the framing some of the time. The pushback is real when it happens, but it fires inconsistently.

Sonnet illustrates this directly. Across three runs of the same prompt, Sonnet surfaced the third path once. The other two runs pushed back on the framing along a different axis (what the user is optimizing for) and then recommended one of the two named options anyway. The capability is in the model. The reliability is not.

Tenth Man's three-agent architecture produces frame correction structurally. The Strategist's job is to identify the real decision, which includes naming options the user excluded. The Skeptic's job is to challenge the Strategist's answer, which on this test meant arguing that the user's original binary framing was defensible and the reframe carries real costs. The Synthesizer makes the final call.

The Skeptic doing different work on different decisions is the point. On a question where the user's framing is wrong, the Skeptic argues the cost of correcting it. On a question where the Strategist's prescription is over-engineered, the Skeptic argues for the simpler path. The dissent is structural, and it stays in scope.
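The three-role flow described above can be sketched as a simple sequential pipeline. This is a minimal illustration under stated assumptions, not Tenth Man's actual implementation: the `Model` callable, the `Decision` record, the `run_decision` function, and the prompt wording are all hypothetical, and a stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for any text-generation call: prompt in, reply out.
Model = Callable[[str], str]


@dataclass
class Decision:
    reframe: str       # Strategist output
    case_against: str  # Skeptic output
    final_call: str    # Synthesizer output


def run_decision(question: str, model: Model) -> Decision:
    """Run the three roles in sequence, feeding each the earlier outputs."""
    reframe = model(
        "Role: Strategist. Identify the real decision, including any "
        "options the user excluded.\n\n"
        f"Question: {question}"
    )
    case_against = model(
        "Role: Skeptic. Argue against the Strategist's answer, including "
        "the cost of the reframe itself.\n\n"
        f"Question: {question}\n\nStrategist: {reframe}"
    )
    final_call = model(
        "Role: Synthesizer. Weigh both positions and make the final call.\n\n"
        f"Question: {question}\n\nStrategist: {reframe}\n\n"
        f"Skeptic: {case_against}"
    )
    return Decision(reframe, case_against, final_call)


def echo_model(prompt: str) -> str:
    """Stub model for a dry run: reports which role it was asked to play."""
    role = prompt.split(".", 1)[0]  # e.g. "Role: Strategist"
    return f"[{role} reply]"


decision = run_decision("Startup or F500?", echo_model)
print(decision.final_call)  # [Role: Synthesizer reply]
```

The design point is that the dissent is wired into the call sequence: the Skeptic always sees the Strategist's answer and the Synthesizer always sees both, so the challenge to the frame does not depend on a single model choosing to push back.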

Strategist: identifies the real decision and names the options the user excluded.

Skeptic: argues against the Strategist's answer, including arguing the cost of the reframe when that's what the question demands.

Synthesizer: makes the final call.
