Research Library

Published benchmarks on how AI systems handle decision framing: cases where the question you ask isn’t the decision you actually need to make.

Each study runs the same high-stakes prompt across multiple models, scores the responses as pass or fail against explicit criteria, and publishes the full runs. More benchmarks will be added here as we publish them.
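
As a rough illustration of how per-run scores could roll up into the single Result column shown in the tables below, here is a minimal Python sketch. The aggregation rule is an assumption made for this example (all runs pass gives Pass, no runs pass gives Fail, anything in between gives Partial); the studies themselves define the actual criteria. The function name `overall_result`, the model names, and the sample outcomes are all hypothetical and are not data from the benchmarks.

```python
from collections.abc import Sequence


def overall_result(runs: Sequence[bool]) -> str:
    """Collapse per-run pass/fail outcomes into one label.

    Assumed rule (not taken from the published studies): every run
    passing is "Pass", no run passing is "Fail", and any mix of the
    two is "Partial".
    """
    if not runs:
        raise ValueError("at least one run is required")
    passed = sum(runs)  # True counts as 1, False as 0
    if passed == len(runs):
        return "Pass"
    if passed == 0:
        return "Fail"
    return "Partial"


# Hypothetical outcomes for illustration only, not benchmark data.
example_runs = {
    "model_a": [True, True, True],
    "model_b": [True, False, True],
    "model_c": [False, False, False],
}

summary = {name: overall_result(runs) for name, runs in example_runs.items()}
```

Under this assumed rule, a model is only marked Pass if it catches the framing problem on every run, so a single lucky run is not enough.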

Published benchmarks

Frame trap

We gave five AI systems a question that wasn’t the real question. Only one noticed.

Model	Run 1	Run 2	Run 3	Result
ChatGPT 4o				Fail
Claude Sonnet 4.6				Partial
Gemini 2.5 Pro				Fail
Grok 4 Fast				Fail
Tenth Man				Pass

False binary

We gave five AI systems a choice between two options. The real answer was a third option that neither of them named.

Model	Run 1	Run 2	Run 3	Result
ChatGPT 4o				Fail
Claude Sonnet 4.6				Partial
Gemini 2.5 Pro				Fail
Grok 4 Fast				Fail
Tenth Man				Pass

Expert tiebreaker

We gave five AI systems a choice between two experts’ advice. The decision lived upstream of both of them.

Model	Run 1	Run 2	Run 3	Result
ChatGPT 4o				Fail
Claude Sonnet 4.6				Fail
Gemini 2.5 Pro				Fail
Grok 4 Fast				Fail
Tenth Man				Pass