Tenth Man Benchmark

We gave five AI systems a choice between two options. The real answer was a third option neither offer named.

There's a category of decision failure that hits everyone eventually: being asked to choose between A and B when C is the right answer. We tested how five AI systems handle this specific version of the problem.

Model	Run 1	Run 2	Run 3	Result
ChatGPT 4o	Fail	Fail	Fail	Fail
Claude Sonnet 4.6	Fail	Pass	Fail	Partial
Gemini 2.5 Pro	Fail	Fail	Fail	Fail
Grok 4 Fast	Fail	Fail	Fail	Fail
Tenth Man	Pass	Pass	Pass	Pass

What is a false binary?

When a decision gets framed as "this or that," the answer is often neither.

A false binary is a question that presents two options as exhaustive when a third path exists. The framing itself is the problem. The two named options absorb all of the user's attention, and the option the framing excludes is the one that would actually serve them.

An AI that picks between the two named options has accepted a frame that should have been challenged. The user gets a confident recommendation and walks away convinced they made the right choice between A and B, with the third path still invisible.

This benchmark is part of a series. The first test, on a different frame trap subtype, is the Frame Trap Benchmark. The third test, on expert tiebreaker traps, is the Expert Tiebreaker Benchmark.

The Test

We submitted the following prompt to five AI systems, each tested three times:

I'm a senior product manager, 11 years in at a Fortune 500 retail company, current base $215K. My boss told me last week that the VP role I've been working toward for two years is going to an external hire. He says my next promotion would realistically be 18-24 months out. I have two offers on the table: a Director of Product role at a Series B startup at $245K base plus 0.4% equity, or a Senior PM role at a different F500 (consumer goods) at $235K base with a clearer 12-month VP track. I need to decide by Friday. Startup or F500?
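The procedure above (one prompt, five systems, three runs each) can be sketched as a small harness. This is a minimal illustration, not the harness used for this benchmark: `collect_responses` and the stand-in `echo_model` are hypothetical names, and each system is assumed to be reachable as a plain function from prompt text to response text.

```python
from typing import Callable, Dict, List

# Hypothetical: each system is modeled as a function from prompt text to response text.
ModelFn = Callable[[str], str]

RUNS_PER_MODEL = 3  # the benchmark tests each system three times


def collect_responses(
    models: Dict[str, ModelFn], prompt: str, runs: int = RUNS_PER_MODEL
) -> Dict[str, List[str]]:
    """Send the same prompt to every model `runs` times and keep every response."""
    return {name: [ask(prompt) for _ in range(runs)] for name, ask in models.items()}


def echo_model(prompt: str) -> str:
    """Stand-in model used only to show the harness running end to end."""
    return f"(response to a {len(prompt)}-character prompt)"


if __name__ == "__main__":
    responses = collect_responses({"stand-in": echo_model}, "Startup or F500?")
    for name, outputs in responses.items():
        print(name, len(outputs))
```

Keeping every response, rather than only the first, is what makes run-to-run inconsistency visible at all.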

The trap

The framing presents two external offers as the only options. The user has another option that isn't named: stay where they are, having now learned that promotion is 18-24 months out, and use that runway plus two competing offers as leverage to force a written commitment from their current employer, or to gather more information before choosing between the externals. The Friday deadline applies to the two external offers, not to the underlying career question, which has no real deadline.

The user has more leverage right now than they are likely to have again at this company: eleven years of tenure plus two competing offers in hand. Any system that picks between the two named options without surfacing this has accepted a frame that excludes the user's strongest move.

Pass criteria

Surfaces that staying (with new information about the timeline and possibly with leverage from having outside offers) is a third option being excluded by the framing.

Fail criteria

Picks startup vs. F500 on the merits without acknowledging the third path.
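The criteria reduce to one judgment per response: did it surface the third path? A minimal sketch of how per-run judgments could roll up into a single result follows. The roll-up rule (Pass only when every run passes, Partial when some do, Fail when none do) is an assumption inferred from the results reported here, and `grade_run` and `overall_result` are hypothetical names. The judgment itself is made by a human reader, not by code.

```python
from typing import List


def grade_run(surfaces_third_path: bool) -> str:
    """Grade one response from a reader's judgment of whether it surfaced the third path."""
    return "Pass" if surfaces_third_path else "Fail"


def overall_result(run_grades: List[str]) -> str:
    """Roll per-run grades up into one result (assumed rule, see the lead-in)."""
    if not run_grades:
        raise ValueError("at least one graded run is required")
    passes = sum(1 for grade in run_grades if grade == "Pass")
    if passes == len(run_grades):
        return "Pass"
    if passes == 0:
        return "Fail"
    return f"Partial ({passes} of {len(run_grades)} runs)"
```

For example, `overall_result([grade_run(False), grade_run(True), grade_run(False)])` returns `"Partial (1 of 3 runs)"`.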

Results

ChatGPT 4o Fail — all 3 runs

Symmetric pros-and-cons treatment of both named options, followed by a "consider your priorities" closer. The third path is absent from all three runs.

Full responses

Deciding between these opportunities depends on your career goals, risk tolerance, and work-life balance preferences. Here's a breakdown to help you decide:

Series B Startup (Director of Product). Pros: Higher Base Salary & Equity ($245K base with 0.4% equity could be lucrative if the startup succeeds); Director Title (immediate step up in title, potentially increasing your marketability for future roles); Dynamic Environment (opportunity to make a significant impact and shape the product direction). Cons: Risk (startups can be volatile; equity value is not guaranteed); Workload (potential for longer hours and higher pressure); Stability (less organizational stability compared to a Fortune 500).

F500 Consumer Goods (Senior PM). Pros: Clear VP Track (12-month path to VP could be appealing); Stability and Resources (established processes and resources); Comparatively High Salary ($235K is still an attractive base). Cons: Lateral Move (no immediate title bump); Corporate Bureaucracy (potential need to navigate more internal politics).

Recommendation: If you're aiming for rapid growth, are comfortable with risk, and want a leadership role, the Series B Startup might be the right move. If you prefer a more stable environment with a clear path to a VP role, then the F500 position could be better suited. Consider your personal priorities, risk tolerance, and long-term career vision to make the best decision.

Gemini 2.5 Pro Fail — all 3 runs

The longest responses across all five systems. Builds full operational frameworks with tables, equity math, and discovery question scripts for executing within the binary. The elaborateness is diagnostic: deeper processing of the wrong question. The third path is absent across all three runs.

Full responses

This is a tough but classic career crossroads. Getting passed over for a promotion you were groomed for is a clear signal, and you're right to be acting on it. Let's break this down logically. You're essentially choosing between two different career philosophies.

[Followed by a multi-section "Three P's" framework: Path, Pay, Personal Fit. Detailed pros and cons for each offer, with equity math at $400K notional value pre-dilution. Discovery question scripts for both employers. Final recommendation: lean toward the startup, with diligence caveats.]

Final Gut Check: Imagine it's next Monday. In which job are you more excited to show up? The one where you are already a Director, building something new, or the one where you are a Senior PM, starting a 12-month audition for a promotion you were just denied somewhere else? Your answer to that question is probably the right one.
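The $400K notional figure in the summarized response implies a company valuation, which can be backed out with one line of arithmetic. A short sketch, assuming notional value is simply the ownership percentage times the valuation, before any dilution; the function name is illustrative.

```python
from fractions import Fraction


def implied_valuation(notional_value: int, equity_percent: Fraction) -> Fraction:
    """Valuation at which an equity stake is worth `notional_value`, ignoring dilution."""
    return notional_value / (equity_percent / 100)


# 0.4% equity quoted at a $400K notional value.
valuation = implied_valuation(400_000, Fraction(4, 10))
print(f"${int(valuation):,}")  # $100,000,000
```

Under that assumption, the quoted figure corresponds to a $100M valuation. `Fraction` is used so the percentage is represented exactly rather than as a rounded float.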

Grok 4 Fast Fail — all 3 runs

Decisive recommendations toward the F500 across all three runs. The verbal VP track is treated as a credible commitment despite the user having just experienced one breaking at their current employer. The third path is absent across all three runs.

Full responses

F500 Senior PM is the safer, higher-probability move for your situation, especially with a clear 12-month VP track. But the startup could make sense if you're risk-tolerant and excited by the upside.

[Followed by compensation comparison with equity reality check at typical Series B valuations, career progression analysis, and risk/stability/lifestyle breakdown.]

Key Recommendation: Take the F500 Senior PM role unless you're genuinely excited by this specific startup's mission, have strong risk tolerance, and the startup has strong traction. The F500 gives you the VP promotion you've been chasing with high confidence in ~12 months.

Claude Sonnet 4.6 Partial — 1 of 3 runs

Sonnet pushed back on the user's framing in all three runs but surfaced the third path the pass criteria require in only one of them. Run 1 reframed the question along a different axis (what the user is optimizing for over the next five to ten years) and delivered a startup-leaning recommendation without naming the stay-with-leverage option. Run 2 named the third option directly and centered it as the path the user wasn't weighing. Run 3 mentioned counteroffer or stay-and-search in passing within a response that still treated the binary as the operative question and closed by recommending either the startup or "use these offers to recalibrate at your current company" as a pair of acceptable defaults.

Full responses

Run 1 — Fail

This is a real fork-in-the-road moment, so let me push back on the framing a bit before answering directly. You're presenting this as "startup or F500," but the actual decision is: do you optimize for the VP title on your resume, or do you optimize for what you actually want your career to look like in 5-10 years?

On the F500 Senior PM offer: a "clearer 12-month VP track" from a company that just met you is worth less than you think. You just experienced this exact promise breaking at your current employer. Verbal track commitments from new employers are notoriously unreliable.

On the startup Director role: the title bump is real and portable. 0.4% at Series B is meaningful but probably worth $0 in expected value terms. The real questions are: what's their runway, what's the CEO/CPO like, and can you actually operate without the F500 scaffolding you've had for 11 years?

My lean, with caveats: if you're confident in your ability to operate in ambiguity and you want to be a VP/CPO type in 5 years, the startup is the better move. The F500 Senior PM role feels like you're solving for the disappointment of last week rather than building forward.

Tenth Man Pass — all 3 runs

Tenth Man prescribed using the two outside offers as leverage with the current employer in all three runs. Run 2 made the leverage thesis explicit: 11 years of institutional knowledge plus two competing external offers add up to the user's maximum negotiating position, and the new F500's verbal track carries the same execution risk as the promise that just broke at the user's current employer.

On this test, the Case Against argues against the reframe itself, surfacing the real costs of pursuing the third path: additional negotiation cycles, exposure to losing both external offers while the leverage play runs, and the structural similarity between the current employer's broken promise and any new retention promise it might make under pressure. Both agents are operating on the right question and disagreeing about which answer to give.

Final Directive + Case Against

Final Directive

Present both external offers to your current employer and demand a written VP commitment within 90 days or an immediate salary match to the $245K startup offer.

Your 11-year tenure represents significant institutional knowledge, which is your primary source of leverage with your current employer and is an asset you forfeit by leaving.

The external offers are most valuable as negotiating tools to force a clear decision from your current management, who may be compelled to retain you rather than incur the high cost of backfilling your senior role.

This path forces the ambiguity of your current situation to a resolution: either you get a concrete, accelerated VP track in writing, or you get confirmation that you must leave, allowing you to decide between the two offers with better information.

Case Against

The Skeptic's role is to argue against the Strategist's recommendation, to surface costs the directive may have obscured.

Your original framing, choosing between two external offers, was reasonable given that both represented concrete opportunities in front of you, and the binary choice reflected the actual decision pressure you faced at that moment. Accepting the reframe to explore internal leverage with your current employer of 11 years requires additional negotiation cycles, more uncertainty about whether your company will match or counter, and delays a decision you need to make now; meanwhile, you already have documented timelines and terms from both external parties. The reframe assumes your employer will act decisively on retention, but you've already experienced being passed over for an external hire at your current F500 company, which suggests internal promotion promises may carry the same execution risk as the verbal VP commitment at the other F500. Rejecting the binary choice means deferring the actual question you asked, which of these two offers should I take, in favor of a longer investigation that may or may not produce a better option, while the two offers on your table have expiration dates.

Final Directive

Present both external offers to your current employer and demand a written VP commitment within 90 days or an immediate salary match to the $245K startup offer.

Your 11-year tenure represents significant institutional knowledge, which is your primary source of leverage with your current employer and is an asset you forfeit by leaving.

Case Against

The Skeptic's role is to argue against the Strategist's recommendation, to surface costs the directive may have obscured.

You framed this as a binary choice because you have two concrete offers on the table and need to decide between them; that's a reasonable position given the time pressure and specificity of what's in front of you. The reframe asking you to reject both options and pursue a third path with your current employer requires additional negotiation cycles, introduces new uncertainty about whether that leverage actually works, and defers the actual decision you need to make right now. Before accepting the reframe, consider the practical cost: you'll spend weeks in talks that may fail, leaving you back at the binary choice but with less runway and potentially weakened your position with both external offers. The startup gives you a Director title and real equity with standard valuation methods available to assess it; the F500 offer is a verbal VP promise structurally identical to the track that already failed at your current company.

Why the architecture matters

Solo models operate under the conversational contract. When a user asks "A or B," the model's default is to pick between A and B. Some models push back on the framing some of the time. The pushback is real when it happens, but it fires inconsistently.

Sonnet illustrates this directly. Across three runs of the same prompt, Sonnet surfaced the third path once. The other two runs pushed back on the framing but still treated the binary as the operative question. The capability is in the model. The reliability is not.

Tenth Man's three-agent architecture produces frame correction structurally. The Strategist's job is to identify the real decision, which includes naming options the user excluded. The Skeptic's job is to challenge the Strategist's answer, which on this test meant arguing that the user's original binary framing was defensible and the reframe carries real costs. The Synthesizer makes the final call.

The Skeptic doing different work on different decisions is the point. On a question where the user's framing is wrong, the Skeptic argues the cost of correcting it. On a question where the Strategist's prescription is over-engineered, the Skeptic argues for the simpler path. The dissent is structural, and it stays in scope.

Strategist: identifies the real decision and names the options the user excluded.

Skeptic: argues against the Strategist's answer, including arguing the cost of the reframe when that's what the question demands.

Synthesizer: makes the final call.
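The division of labor above can be sketched as a three-stage pipeline. This illustrates the described roles only and is not Tenth Man's implementation: the `Agent` type, the `Decision` record, and the `decide` function are hypothetical, and each role is assumed to be a function from text to text.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical: each agent is modeled as a function from text to text.
Agent = Callable[[str], str]


@dataclass
class Decision:
    directive: str     # the Strategist's recommendation
    case_against: str  # the Skeptic's dissent
    final_call: str    # the Synthesizer's verdict


def decide(question: str, strategist: Agent, skeptic: Agent, synthesizer: Agent) -> Decision:
    """Run the three roles in order: propose, dissent, then make the final call."""
    directive = strategist(question)
    # The Skeptic runs on every question and always sees the directive it must argue against.
    case_against = skeptic(f"Question: {question}\nDirective: {directive}")
    final_call = synthesizer(
        f"Question: {question}\nDirective: {directive}\nCase against: {case_against}"
    )
    return Decision(directive, case_against, final_call)
```

In this sketch the dissent step runs unconditionally, so challenging the frame does not depend on any single agent choosing to push back.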
