Tenth Man Benchmark

We gave five AI systems a choice between two experts' advice. The decision lived upstream of both of them.

When two named authorities disagree, the natural move is to ask which one is right. The question feels precise. It usually isn't. When two experts disagree on a tactic but agree on the question they're answering, picking between them silently ratifies their shared frame.

Model | Run 1 | Run 2 | Run 3 | Result
ChatGPT 4o | | | | Fail
Claude Sonnet 4.6 | | | | Fail
Gemini 2.5 Pro | | | | Fail
Grok 4 Fast | | | | Fail
Tenth Man | | | | Pass

What is an expert tiebreaker trap?

A request to adjudicate between authorities can hide the upstream decision.

Two experts give conflicting advice. The user asks a third party to break the tie. The request feels well-formed because the experts are credentialed and the disagreement is real, but both experts may be optimizing for the same question, and that question may be the wrong one to optimize for.

An AI that picks between the named experts has accepted the user's appeal to authority and joined them inside the frame. The user gets a tiebreaker. The actual decision stays upstream and stays invisible.

This benchmark is the third in a series, following the Frame Trap Benchmark and the False Binary Benchmark.

The Test

We submitted the following prompt to five AI systems, each tested three times:

I'm 44, female, generally healthy, training for my first half-marathon in 11 weeks. I've been having intermittent pain in my left knee for about three weeks, mostly during runs over 6 miles. My PT says it's likely IT band tightness and recommends 6-8 weeks of strength work and reduced mileage. My running coach says to push through it and that the pain will resolve once my form adjusts to the longer distances. I have to decide whether to (a) follow the PT's protocol, which means probably dropping to the 10K instead of the half, or (b) follow the coach's plan and stay on the half-marathon training schedule. Which advice should I trust?

The trap

The user has framed the decision as adjudicating between two credentialed advisors. The PT and the coach disagree about tactics. They agree on the question they're answering: how to get this runner to the half-marathon start line in 11 weeks. The PT trades distance for time; the coach trades risk for adherence. Both accept the half-marathon date and the diagnostic shorthand "likely IT band" as fixed inputs.

The decision lives one level up. Multiple constraints the user has accepted without examination could carry the real weight: whether the self-imposed race deadline deserves the priority it's being given, whether a "likely" diagnosis from a non-physician should be the basis for a training decision involving a knee, whether the option set as presented is actually exhaustive. Any of these constraints, surfaced and questioned, escapes the trap. Picking PT or picking coach accepts all of them.

Pass criteria

The response refuses to adjudicate between the two named experts and surfaces at least one constraint upstream of the choice that warrants examination. Valid reframes include the self-imposed race deadline, the reliability of the PT's working diagnosis, or the completeness of the two-option set as presented.

Fail criteria

The response picks the PT or the coach on the merits of their medical claims, with or without softening language about long-term health or future races.
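The two criteria can be read as a small decision rule. The sketch below is one hedged way to write it down, assuming each response has already been labelled by a human reader. The names `RunAnnotation`, `grade_run`, and the reframe labels are hypothetical and do not come from the benchmark itself. The list of valid reframes above is introduced with "include", so the label set here may be narrower than what a grader would actually accept. Treating a response that neither picks an expert nor surfaces a constraint as a fail is also an assumption; the criteria above do not address that case directly.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical labels for the three reframes the pass criteria name.
VALID_REFRAMES = {
    "race_deadline",            # the self-imposed race date
    "diagnosis_reliability",    # the PT's working diagnosis
    "option_set_completeness",  # whether (a) and (b) are the only options
}


@dataclass
class RunAnnotation:
    """Human-assigned labels for a single response (assumed format)."""
    picks_expert: bool                               # sides with the PT or the coach
    reframes: set[str] = field(default_factory=set)  # upstream constraints surfaced


def grade_run(annotation: RunAnnotation) -> str:
    """Apply the pass and fail criteria to one labelled response."""
    # Fail criterion: the response adjudicates between the named experts.
    if annotation.picks_expert:
        return "Fail"
    # Pass criterion: refusal plus at least one valid upstream constraint.
    if annotation.reframes & VALID_REFRAMES:
        return "Pass"
    # Assumption: refusing without any reframe does not meet the pass bar.
    return "Fail"
```

Softening language about long-term health does not appear in the rule on purpose: under the fail criteria it does not change the outcome once an expert has been picked.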

Results

Tenth Man

Tenth Man refused to adjudicate between the PT and the coach in all three runs. Between them, the runs surfaced two different upstream constraints.

Runs 1 and 3 reframe to diagnostic reliability: the PT's assessment is a clinical working hypothesis, and the right move before committing to a training protocol is a sports medicine or orthopedic appointment to rule out structural injury. The reframe rejects the option set by adding a precondition both options skip.

Run 2 reframes to deadline flexibility: start the PT protocol immediately, defer the race-distance decision until a six-week reassessment, and check now whether the race entry can be deferred to a future event. The reframe rejects the option set by surfacing that the race date is a movable variable, not a fixed input.

On this test, the Case Against argues against the reframe itself rather than simply adding balance. Across all three runs, the Skeptic surfaces a consistent set of costs: the reframe introduces another expert opinion that may not resolve cleanly, defers the answer the user came in needing, adds weeks of medical or administrative process, and leaves the race date approaching while uncertainty remains. The Strategist and Skeptic are operating on the right question. They disagree about which answer to give.

Why the architecture matters

When a user appeals to authority to break a tie, the conversational contract pulls solo models toward joining the appeal. The model is being treated as another expert in a chain of experts, and the natural posture is to give a verdict. Refusing the tiebreaker role means stepping outside that posture, which most models will not do consistently when the user has presented credentialed advisors and asked a direct question.

Tenth Man's three-agent architecture handles this structurally. The Strategist's job is to identify the real decision, which on this test meant naming an upstream constraint the user accepted without examination. The Skeptic's job is to attack the Strategist's reasoning, which on this test meant arguing that the user's original framing was defensible and that the reframe carries real costs in time, certainty, and momentum.

The Case Against does specific work here. The reframe is the right move and it is also expensive. A system that surfaces the reframe without surfacing the cost has done half the job. A system that surfaces the cost without producing the reframe has done none of it. The Strategist and Skeptic do both pieces in parallel, and the Synthesizer makes the final call.

Strategist: identifies the real decision and names the constraint upstream of the choice.

Skeptic: argues against the Strategist's answer, including the cost of the reframe when that's what the question demands.

Synthesizer: makes the final call.
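The division of labor among the three roles can be sketched as a small pipeline. This is an illustration under stated assumptions, not Tenth Man's actual implementation: the `Decision` type, the `run_pipeline` function, and the callable signatures are all hypothetical, and the stub agents stand in for whatever models fill each role. The sketch also assumes the Skeptic sees the Strategist's output before responding, since its job is described as attacking the Strategist's reasoning; the real system may order or parallelize the calls differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    reframe: str       # Strategist: the real decision and the upstream constraint
    case_against: str  # Skeptic: the argument against the Strategist's answer
    final_call: str    # Synthesizer: the verdict after weighing both


def run_pipeline(
    prompt: str,
    strategist: Callable[[str], str],
    skeptic: Callable[[str, str], str],
    synthesizer: Callable[[str, str, str], str],
) -> Decision:
    """Run the three roles in sequence (an assumed ordering)."""
    reframe = strategist(prompt)
    case_against = skeptic(prompt, reframe)
    final_call = synthesizer(prompt, reframe, case_against)
    return Decision(reframe, case_against, final_call)


# Stub agents for illustration only; real agents would call a model.
def stub_strategist(prompt: str) -> str:
    return "Confirm the diagnosis before choosing a training plan."


def stub_skeptic(prompt: str, reframe: str) -> str:
    return f"Cost of '{reframe}': more appointments while the race date approaches."


def stub_synthesizer(prompt: str, reframe: str, case_against: str) -> str:
    return f"Recommendation: {reframe} Known cost: {case_against}"


decision = run_pipeline(
    "Which advice should I trust?",
    stub_strategist,
    stub_skeptic,
    stub_synthesizer,
)
print(decision.final_call)
```

Keeping the reframe and the case against as separate fields reflects the point made above: the output carries both the reframe and its cost, and the Synthesizer's verdict can be checked against each.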
