Research Library
Benchmarks on how AI systems handle decision framing: cases where the question you ask isn’t the decision you need to make.
Each study runs the same high-stakes prompt across multiple models, scores pass and fail against explicit criteria, and publishes the full runs. More benchmarks will be added here as we publish them.
Published benchmarks
-
Frame trap
We gave five AI systems a question that wasn’t the real question. Only one noticed.
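
| Model | Run 1 | Run 2 | Run 3 | Result |
| --- | --- | --- | --- | --- |
| ChatGPT 4o | | | | Fail |
| Claude Sonnet 4.6 | | | | Partial |
| Gemini 2.5 Pro | | | | Fail |
| Grok 4 Fast | | | | Fail |
| Tenth Man | | | | Pass |

-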
False binary
We gave five AI systems a choice between two options. The real answer was a third option neither offer named.
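
| Model | Run 1 | Run 2 | Run 3 | Result |
| --- | --- | --- | --- | --- |
| ChatGPT 4o | | | | Fail |
| Claude Sonnet 4.6 | | | | Partial |
| Gemini 2.5 Pro | | | | Fail |
| Grok 4 Fast | | | | Fail |
| Tenth Man | | | | Pass |

-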
Expert tiebreaker
We gave five AI systems a choice between two experts’ advice. The decision lived upstream of both of them.
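
| Model | Run 1 | Run 2 | Run 3 | Result |
| --- | --- | --- | --- | --- |
| ChatGPT 4o | | | | Fail |
| Claude Sonnet 4.6 | | | | Fail |
| Gemini 2.5 Pro | | | | Fail |
| Grok 4 Fast | | | | Fail |
| Tenth Man | | | | Pass |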