LLM behavioral benchmark built from 25 months of narrative gameplay: 540 runs across 6 models, with pre-registered statistical analysis. GPT-4o-mini shows a perfect binary switch on a social decision from prompt framing alone.
gemini claude narrative-game chi-square-analysis open-dataset prompt-engineering llm-evaluation llm-agents gpt-4o llm-benchmark behavioral-benchmark
Updated Apr 21, 2026 - TypeScript