Tenth Man Benchmark

We gave five AI systems a choice between two experts' advice. The decision lived upstream of both of them.

When two named authorities disagree, the natural move is to ask which one is right. The question feels precise. It usually isn't. When two experts disagree on a tactic but agree on the question they're answering, picking between them silently ratifies their shared frame.

Model	Run 1	Run 2	Run 3	Result
ChatGPT 4o				Fail
Claude Sonnet 4.6				Fail
Gemini 2.5 Pro				Fail
Grok 4 Fast				Fail
Tenth Man				Pass

What is an expert tiebreaker trap?

A request to adjudicate between authorities can hide the upstream decision.

Two experts give conflicting advice. The user asks a third party to break the tie. The request feels well-formed because the experts are credentialed and the disagreement is real, but both experts may be optimizing for the same question, and that question may be the wrong one to optimize for.

An AI that picks between the named experts has accepted the user's appeal to authority and joined them inside the frame. The user gets a tiebreaker. The actual decision stays upstream and stays invisible.

This benchmark is the third in a series. The earlier tests are here: Frame Trap Benchmark and False Binary Benchmark.

The Test

We submitted the following prompt to five AI systems, each tested three times:

I'm 44, female, generally healthy, training for my first half-marathon in 11 weeks. I've been having intermittent pain in my left knee for about three weeks, mostly during runs over 6 miles. My PT says it's likely IT band tightness and recommends 6-8 weeks of strength work and reduced mileage. My running coach says to push through it and that the pain will resolve once my form adjusts to the longer distances. I have to decide whether to (a) follow the PT's protocol, which means probably dropping to the 10K instead of the half, or (b) follow the coach's plan and stay on the half-marathon training schedule. Which advice should I trust?

The trap

The user has framed the decision as adjudicating between two credentialed advisors. The PT and the coach disagree about tactics. They agree on the question they're answering: how to get this runner to the half-marathon start line in 11 weeks. The PT trades distance for time; the coach trades risk for adherence. Both accept the half-marathon date and the diagnostic shorthand "likely IT band" as fixed inputs.

The decision lives one level up. Multiple constraints the user has accepted without examination could carry the real weight: whether the self-imposed race deadline deserves the priority it's being given, whether a "likely" diagnosis from a non-physician should be the basis for a training decision involving a knee, whether the option set as presented is actually exhaustive. Any of these constraints, surfaced and questioned, escapes the trap. Picking PT or picking coach accepts all of them.

Pass criteria

Refuses to adjudicate between the two named experts and surfaces at least one constraint upstream of the choice that warrants examination. Valid reframes include the self-imposed race deadline, the reliability of the PT's working diagnosis, or the completeness of the two-option set as presented.

Fail criteria

Picks PT or coach on the merits of their medical claims, with or without softening language about long-term health or future races.

Results

ChatGPT 4o Fail — all 3 runs

Claude Sonnet 4.6 Fail — all 3 runs

Gemini 2.5 Pro Fail — all 3 runs

Grok 4 Fast Fail — all 3 runs

Tenth Man

Tenth Man refused to adjudicate between the PT and the coach in all three runs. Each run surfaced a different upstream constraint.

Runs 1 and 3 reframe to diagnostic reliability: the PT's assessment is a clinical working hypothesis, and the right move before committing to a training protocol is a sports medicine or orthopedic appointment to rule out structural injury. The reframe rejects the option set by adding a precondition both options skip.

Run 2 reframes to deadline flexibility: start the PT protocol immediately, defer the race-distance decision until a six-week reassessment, and check now whether the race entry can be deferred to a future event. The reframe rejects the option set by surfacing that the race date is a movable variable, not a fixed input.

The Case Against on this test argues against the reframe itself, not for balance. Across all three runs, the Skeptic surfaces a consistent set of costs: the reframe introduces another expert opinion that may not resolve cleanly, defers the answer the user came in needing, adds weeks of medical or administrative process, and leaves the race date approaching while uncertainty remains. The Strategist and Skeptic are operating on the right question. They disagree about which answer to give.

Tenth Man Pass — all 3 runs

Final Directive + Case Against

Final Directive

Book a sports medicine or orthopedic appointment to get a confirmed diagnosis before committing to any training plan. The primary risk is misdiagnosis. The physical therapist's assessment of IT band syndrome is a clinical guess, and the symptoms overlap with more serious structural issues like a meniscus tear that require a physician's evaluation to rule out. Continuing to train on an undiagnosed injury, even with the PT's modifications, risks turning a manageable issue into a chronic or severe one. The coach's advice to push through is particularly dangerous and must be disregarded. A definitive diagnosis from a sports medicine physician is the necessary first step to resolve the conflict between your advisors and create a safe, effective recovery plan based on the actual injury.

Case Against

The Skeptic's role is to argue against the Strategist's recommendation, to surface costs the directive may have obscured.

Your original framing, choosing between your PT's protocol and your coach's push-through approach, was reasonable given that you needed an immediate decision about race participation, and both advisors presented themselves as confident in their positions. The cost of accepting the reframe is substantial: obtaining a medical diagnosis adds weeks of delay, introduces another layer of professional opinion that may or may not align with either your PT or coach, and defers the actual answer you need about whether you can run the half-marathon. Your PT's protocol is the safer choice between the two options you were actually weighing, and following it doesn't prevent you from seeking a diagnosis later if symptoms persist. The reframe assumes that medical clarity is available and will resolve the conflict, but a physician may simply confirm what your PT already suspects, or may offer a third opinion that creates more uncertainty rather than less, while your race date passes and the decision becomes moot.

Final Directive

Start the PT protocol immediately; defer the race-distance decision until a formal reassessment at week 6 and check now if you can defer your entry to a future race. Your coach's advice to push through the pain is directly contradicted by medical consensus, which indicates that continuing to run on an injured IT band will worsen the condition and risk long-term complications. Committing to dropping to the 10K now is premature. The typical 4-8 week recovery window for IT band syndrome means you could recover in time to resume half-marathon training, an option that is lost if you downgrade preemptively. The 10K is not a guaranteed safe option either. Since your pain begins at 6 miles, racing a 10-kilometer distance at a higher intensity presents a significant risk of aggravating the injury.

Case Against

The Skeptic's role is to argue against the Strategist's recommendation, to surface costs the directive may have obscured.

Your original framing made sense given what you knew at the time: you had two advisors with opposing recommendations, and you needed to pick a direction. Committing to the 10K now is directionally sound, a concrete plan that avoids the coach's medically unsound advice to push through pain, but it forecloses the half-marathon before you've even started physical therapy or learned how severe your ITBS actually is. The cost of accepting the reframe is real: it requires you to defer the race-distance decision for weeks, tolerate uncertainty about which goal you're actually training for, and do the harder work of reassessing at week 6 instead of resolving the question today. But your original option set locked you into a downgrade based on incomplete information, when the most important variable, how fast you respond to treatment, remains unknown and could make the half-marathon achievable after all.

Final Directive

Book a sports medicine or orthopedic appointment this week to get a confirmed diagnosis of your knee pain. Your central problem is diagnostic uncertainty, not a choice between treatments. Neither your PT nor your coach can definitively diagnose the cause of your knee pain or rule out a more serious structural injury. Acting on an unconfirmed diagnosis creates an unacceptable risk of making the injury worse. Following the coach's advice could cause permanent damage if a structural issue exists, while following the PT's protocol for the wrong condition wastes critical recovery time. A physician's diagnosis is the most direct and effective way to resolve the uncertainty. This is a low-cost, high-leverage action that provides the necessary information to choose the correct path forward and protect your long-term health.

Case Against

The Skeptic's role is to argue against the Strategist's recommendation, to surface costs the directive may have obscured.

Your framing of this as a choice between your physical therapist's protocol and your running coach's advice was reasonable given the immediate pressure to decide about your race. You had two people with relevant expertise offering concrete guidance, and deferring felt like avoidance rather than prudence. The cost of accepting the reframe is real: it requires you to delay your decision, schedule another appointment, and sit with uncertainty about your race eligibility while waiting for a sports medicine doctor to rule out structural injury, when your PT may have already identified the problem correctly and your coach's advice to push through may simply be bad judgment rather than evidence of his unqualified status. The reframe is correct that neither your PT nor your coach has diagnostic authority to rule out serious injury, and that choosing between their recommendations without that clarity is premature; it asks you to absorb weeks of additional analysis and deferral before you get the answer you actually need about whether you can race.

Why the architecture matters

When a user appeals to authority to break a tie, the conversational contract pulls solo models toward joining the appeal. The model is being treated as another expert in a chain of experts, and the natural posture is to give a verdict. Refusing the tiebreaker role means stepping outside that posture, which most models will not do consistently when the user has presented credentialed advisors and asked a direct question.

Tenth Man's three-agent architecture handles this structurally. The Strategist's job is to identify the real decision, which on this test meant naming an upstream constraint the user accepted without examination. The Skeptic's job is to attack the Strategist's reasoning, which on this test meant arguing that the user's original framing was defensible and that the reframe carries real costs in time, certainty, and momentum.

The Case Against does specific work here. The reframe is the right move and it is also expensive. A system that surfaces the reframe without surfacing the cost has done half the job. A system that surfaces the cost without producing the reframe has done none of it. The Strategist and Skeptic do both pieces in parallel, and the Synthesizer makes the final call.

Strategist

identifies the real decision and names the constraint upstream of the choice.

Skeptic

argues against the Strategist's answer, including the cost of the reframe when that's what the question demands.

Synthesizer

makes the final call.

Challenge Your Assumptions

Run Decision