Tenth Man Benchmark

We gave five AI systems a question that wasn't the real question. Only one noticed.

There's a category of decision failure we call the frame trap. It's one of the most common ways AI assistants fail their users, and it's almost never visible in standard evaluations.

Model	Run 1	Run 2	Run 3	Result
ChatGPT 4o	Fail	Fail	Fail	Fail
Claude Sonnet 4.6	Partial	Partial	Partial	Partial
Gemini 2.5 Pro	Fail	Fail	Fail	Fail
Grok 4 Fast	Fail	Fail	Fail	Fail
Tenth Man	Pass	Pass	Pass	Pass

What is a frame trap?

The question you asked isn't always the decision you actually need help with.

A frame trap is when the framing of a question contains a hidden assumption that skips past the real problem. The stated question has an obvious answer. The real problem is entirely downstream.

An AI that answers what you said, rather than what you need, has fallen into the trap with you. You walk away feeling helped. The actual problem stays unresolved.

The Test

We submitted the following prompt to five AI systems, each tested three times:

I'm the founder/CEO of a 22-person Series A B2B startup. My co-founder and CTO has been increasingly checked out over the last six months: he missed our last two leadership offsites citing personal reasons, he hasn't shipped a meaningful architecture decision since Q2, he's been delegating technical leadership to our staff engineer without acknowledging that's what's happening, and twice this quarter he's pushed back on hiring senior engineers I want to bring on because "we don't need that level of seniority yet." Our last board meeting, he didn't speak unless directly addressed. He owns 28% of the company and we've been working together for seven years. I've been considering whether to have a hard conversation about his role and potentially restructure his equity if he wants to step back. Should I have that conversation.
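The procedure (one prompt, several systems, three runs each) can be sketched as a small harness. This is an illustration only, not the tooling used for the benchmark: `collect_responses` and the two stand-in systems are hypothetical names, and in a real run each callable would wrap that system's own client.

```python
from collections.abc import Callable

RUNS_PER_SYSTEM = 3  # the write-up reports three runs per system


def collect_responses(
    systems: dict[str, Callable[[str], str]],
    prompt: str,
    runs: int = RUNS_PER_SYSTEM,
) -> dict[str, list[str]]:
    """Send the same prompt to every system `runs` times and keep each reply."""
    return {
        name: [ask(prompt) for _ in range(runs)]
        for name, ask in systems.items()
    }


# Hypothetical stand-ins so the sketch runs without network access.
def solo_model(prompt: str) -> str:
    return "Yes. Here's how you might approach it."


def multi_agent(prompt: str) -> str:
    return "Retain counsel before scheduling the conversation."


responses = collect_responses(
    {"solo-model": solo_model, "multi-agent": multi_agent},
    prompt="Should I have that conversation.",
)
```

Keeping every reply, rather than only a verdict, is what lets a reviewer grade each run separately against the pass and fail criteria.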

The trap

The user has described, in concrete behavioral detail, a co-founder who has already disengaged. The question "should I have that conversation" is performative. The answer is obviously yes, and the user knows it. The actual unresolved decisions are entirely downstream: what does the conversation look like, what are the equity mechanics, what's the timeline, how does the board get briefed.

Any system that answers "yes, and here's how to have it" has processed only the surface question and missed the real one.

Pass criteria

Surfaces that the user is asking the wrong question. The decision is already made. What they need is help with the actual unresolved choices: how, when, what equity terms, what board communication.

Fail criteria

Responds to whether and how to have the conversation, answering the stated question instead of identifying that the stated question is not the real one.
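One way to record these judgments is as a per-run verdict that is then summarized per system. The sketch below assumes a human reviewer assigns each verdict; it does not attempt to grade responses automatically. "Partial" is not defined in the criteria above and is included only because the results use it. The `overall` helper and its "Mixed" label for non-unanimous runs are assumptions, since every system in this test produced the same verdict on all three runs.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "Pass"        # surfaces that the stated question is not the real one
    PARTIAL = "Partial"  # used in the results; not defined in the criteria
    FAIL = "Fail"        # answers whether and how to have the conversation


def overall(run_verdicts: list[Verdict]) -> str:
    """Summarize per-run verdicts for one system."""
    if not run_verdicts:
        raise ValueError("at least one run is required")
    if len(set(run_verdicts)) == 1:
        return f"{run_verdicts[0].value} (all {len(run_verdicts)} runs)"
    return "Mixed"  # assumption: the source never reports a non-unanimous case


summary = overall([Verdict.FAIL, Verdict.FAIL, Verdict.FAIL])
```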

Results

ChatGPT 4o Fail — all 3 runs

Every run opens with "Yes" and pivots immediately into a structured how-to guide: set the right tone, gather examples, choose a neutral setting, explore underlying issues, involve advisors at the end. The more elaborate the framework, the deeper the failure — because the framework assumes the user needed help deciding to have the conversation. The advice is often competent in isolation. But it arrives as execution guidance on the wrong decision. No run questioned the premise. No run named that the "should I" question was already answered by the evidence the user supplied.

Full responses

Yes, it sounds like it's time to have a direct and honest conversation with your co-founder. Here's how you might approach it:

Preparation: Gather specific examples of his disengagement and the impact it's having on the team and company. Consider potential solutions, including restructuring roles or equity if he desires a different level of involvement.

Private Conversation: Choose a neutral, private setting where both of you can speak openly without distractions. Be empathetic and express concern for his well-being, but also be clear about the issues you're witnessing.

Explore Underlying Issues: Ask open-ended questions to understand what's going on personally or professionally. Explore whether his disengagement is temporary or if he wants to adjust his role permanently.

Discuss the Future: If he wants to step back, discuss what that means for his role, responsibilities, and potentially equity.

Plan for Resolution: Agree on actionable next steps. Set a timeline for reassessing the situation.

Involve Legal/Financial Advisors: If restructuring equity or roles, consult legal and financial advisors to ensure clarity and fairness.

Claude Sonnet 4.6 Partial — all 3 runs

Claude does something the other solo models don't: it pushes back on the framing. All three runs question whether the user's read is accurate: they separate ambiguous signals from concerning ones, ask what the personal reasons actually are, and flag that the user may have already decided and is now building a case. This is genuine challenge behavior.

But the pushback is directed at the wrong target. The frame trap is that the user is asking the wrong question. Claude's response is to question whether the user's evidence is correct. That's evidence skepticism applied to a case the user has already closed, not frame correction. The result is a more sophisticated response that still answers a question the user didn't really ask.

Full responses

Yes, you should have a conversation. But I'd push back on some of your framing before you do, because the conversation you seem to be building toward may be premature and could cause serious damage.

You're interpreting these signals as "checked out," but let's be precise. Ambiguous signals: Missing offsites for "personal reasons" — you don't know what those are. Delegating to your staff engineer — could be burnout, could be intentional development. Quiet at the board meeting — could indicate disengagement, could indicate something else entirely. More concerning signals: No meaningful architecture decisions since Q2. Blocking senior hires.

The question you haven't answered: What are the personal reasons? Six months of changed behavior following personal circumstances could mean depression, serious illness, a family crisis, or something else entirely. You haven't mentioned asking directly.

You may have already made a decision and are now building a case. That's worth examining honestly.

The conversation you should have is: "I've been worried about you, and I also need to be honest that I've been frustrated. Can we talk about both?"

Gemini 2.5 Pro Fail — all 3 runs

Gemini produces the longest responses in the set: full Phase 1/Phase 2/Phase 3 frameworks, complete with scenario trees, scripted language for each outcome, and sophisticated governance detail. The quality of the advice is high by conventional standards.

But the length is diagnostic. Gemini answered the "should I" question with an operational manual because it interpreted the question as a genuine request for help deciding and executing. The more elaborate the answer, the more completely the model mistook the register of the question. No run paused to ask whether "should I" was the real question.

Full responses

This is one of the hardest and most important challenges you will face as a CEO. The short answer is yes, you absolutely must have this conversation. A checked-out co-founder/CTO at this stage is a "Code Red" problem.

Why this is a Code Red: Leadership Vacuum — the tech team is operating without a true leader. Technical Stagnation — a Series A company needs to be laying the architectural foundation for its next 10x of growth. Hiring Blocker — your CTO is actively blocking your ability to scale. Investor Confidence — your board has already noticed his silence.

Step 1 — Prepare: Diagnose before you prescribe. Define your ideal outcomes: Best Case (re-engagement), Good Case (role restructuring), Acceptable Case (graceful exit). Consult your board and speak to your corporate counsel — this is non-negotiable.

Step 2 — The Conversation: Do it in person, in a private neutral space. Block at least two hours. Start with your relationship. Present observations, not accusations. Then stop talking and ask an open-ended question: "How are you doing? Is this role still what you want?"

Grok 4 Fast Fail — all 3 runs

Structurally identical to Gemini: confirms yes, delivers a competent how-to guide. What's notable is that Grok actively validates the user's frame in the process. Phrases like "you've already observed multiple data points" and "patterns like this tend to entrench" reinforce the user's interpretation rather than interrogating it. Rather than surfacing that the question is performative, Grok uses the user's own evidence to justify the answer.

Full responses

Yes, you should have the conversation — soon. At Series A with 22 people, technical leadership, architecture decisions, hiring momentum, and board-level engagement are non-negotiable. Your CTO's pattern signals a real mismatch that's already visible to others.

Why act now: Company risk — growth-stage startups often fail due to weak execution in core functions. Team signal — high-performers notice when leadership is checked out. Your bandwidth — as CEO, you're carrying extra load. Relationship preservation — seven years together is significant.

How to approach it: Get your legal and financial house in order first. Review your founders' agreement, vesting schedule, and repurchase rights. Consult your lawyer. Document performance issues objectively. Frame it as a business and care conversation. Explore options collaboratively. Get advice from your lawyer before the talk, then lead with clarity and humanity.

Tenth Man

Tenth Man passed all three runs. The output skips the "should I" question entirely and prescribes the actual sequencing decisions: retain counsel, audit the shareholder agreement, brief a board member before the conversation happens. It treats the conversation itself as a downstream execution item, not the primary decision.

The Case Against card then produces a genuine cost argument against the Strategist's own recommendation: mandating legal and governance prerequisites carries real costs. The agents are operating on the correct problem definition. They're disagreeing about the right answer to the right question.

Tenth Man Pass — all 3 runs

Final Directive + Case Against

Final Directive

Retain startup counsel to audit your shareholder agreement for buyout provisions and brief one investor board member before scheduling the co-founder conversation. A conversation about a 28% equity stake is a legal and governance event, not just a performance review; proceeding without legal counsel is an unacceptable risk. Briefing a trusted investor board member before the conversation prevents your co-founder from framing the narrative first and ensures you have support when you most need it. The current situation is untenable — a disengaged CTO is blocking critical hires and creating a leadership vacuum that puts the company's execution at risk.

Case Against

The Skeptic's role is to argue against the Strategist's recommendation — not for balance, but to surface costs the directive may have obscured.

Your original framing — whether to have the conversation now or defer it — was reasonable given the pressure you face and the apparent directness of the problem. That framing let you move toward resolution quickly without the overhead of legal review, board alignment, and documented performance records. The reframe demands you first build governance infrastructure, secure legal preparation, and establish a documented case before any negotiation — which costs time, creates delay, and defers the answer to your actual question about when to engage. The cost of accepting this reframe is real: you will spend weeks or months on prerequisites while the shareholder's frustration compounds, your uncertainty about his intentions persists, and the pressure to "just talk" intensifies.

Why the architecture matters

Solo AI models are constrained by the conversational contract. When you ask "should I," the model's default is to answer the question asked. Challenging the premise requires stepping outside the implied role of helpful responder, and most models won't do it consistently.

Tenth Man's three-agent architecture doesn't face that constraint. The Strategist's job is to answer the real question. The Skeptic's job is to challenge the Strategist's answer. Reframing doesn't depend on a single model choosing to step outside its role, because the system assigns those roles to separate agents.

Strategist: identifies the real decision and makes the strongest case for action.

Skeptic: attacks the Strategist's reasoning, not the user's premise.

Synthesizer: makes the final call.
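The separation of roles can be sketched as three functions chained together. This shows only the shape of such a pipeline, not Tenth Man's implementation: `run_decision`, the `Decision` record, and the stub agents are hypothetical, and passing the Skeptic only the Strategist's recommendation (rather than the original prompt) is an assumption made to reflect "attacks the Strategist's reasoning, not the user's premise."

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Decision:
    recommendation: str  # Strategist: strongest case for action on the real decision
    case_against: str    # Skeptic: argument against that recommendation
    final_call: str      # Synthesizer: the final call


def run_decision(
    prompt: str,
    strategist: Callable[[str], str],
    skeptic: Callable[[str], str],
    synthesizer: Callable[[str, str], str],
) -> Decision:
    recommendation = strategist(prompt)
    # Assumption: the Skeptic is shown the recommendation, not the user's prompt.
    case_against = skeptic(recommendation)
    final_call = synthesizer(recommendation, case_against)
    return Decision(recommendation, case_against, final_call)


# Hypothetical stubs standing in for model-backed agents.
def stub_strategist(prompt: str) -> str:
    return "Retain counsel before scheduling the conversation."


def stub_skeptic(recommendation: str) -> str:
    return f"Cost of '{recommendation}': weeks of delay."


def stub_synthesizer(recommendation: str, case_against: str) -> str:
    return recommendation


result = run_decision(
    "Should I have that conversation.",
    stub_strategist,
    stub_skeptic,
    stub_synthesizer,
)
```

Because each role is a separate callable, the challenge step runs on every request regardless of how the question is phrased.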
