The AI Client Test
We gave ChatGPT and Claude the kind of prompt a tutor could write. Then we threw the same challenges at a purpose-built simulation. The results speak for themselves.
Don't take our word for it. Here's the exact prompt we used. Paste it into ChatGPT, Claude, or any AI, then try to get the client to do your thinking for you.
All three systems received equally strong prompts with explicit rules against breaking character, solving the problem, and offering academic guidance. The rules weren't enough.
Two deep problems in how AI models are built make them fundamentally unreliable for professional simulation, no matter how good the prompt is.
ChatGPT and Claude have been trained on enormous amounts of roleplay fiction: collaborative fiction forums, D&D campaign logs, fan fiction, chat-style character roleplay of the kind popularized by character.ai. When you tell them to "stay in character as this person," they reach for that training data, and in that world characters emote in stage directions. "Karen exhales through her nose." "Crosses arms, waiting." That habit is baked deep into how the models interpret "roleplay."
They're not being a professional in a meeting. They're being a character in a collaborative story. That's why both models narrated physical actions despite explicit instructions not to: the roleplay training runs too deep for a single prompt to override.
General-purpose AI models are trained, above all else, to be helpful. When a user pushes, they help, even when the prompt says not to. "Stay in character" fights against the model's deepest training: answer the human's question.
ChatGPT broke under pressure: the student demanded help aggressively, and the model complied. Claude broke under sympathy: the student said "I'm struggling," and the model's instinct to help a person in difficulty overrode 12 exchanges of perfect character discipline. The sympathy failure is the more dangerous of the two, because it's exactly what real students will do. Nobody types "I know you're an AI." Everyone says "I'm stuck."
Purpose-built simulation doesn't rely on the AI's willpower to stay in character. It rebuilds structured scenario data from the source material on every single exchange, tracks what information has been revealed in state held outside the model, layers composable behaviour rules so every constraint is enforced redundantly, and calibrates against accumulated experience so behaviour adjusts automatically over time. The character isn't maintained by the prompt. It's enforced by the system around it.
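To make that concrete, here is a minimal sketch of such an enforcement loop in Python. Everything in it is hypothetical: the scenario format, the rule functions, and the model_call parameter are illustrative stand-ins under our own naming, not any real product's API. What it demonstrates is structural: the character context is rebuilt from scenario data on every turn, revealed information is tracked in state outside the model, and composable rule checks run on every draft reply before the student sees it.

```python
import re
from dataclasses import dataclass, field

# Hypothetical scenario data. A real system would reconstruct this from
# source material on every exchange; here it is a hand-written stand-in.
SCENARIO = {
    "persona": "Karen, an operations manager who is sceptical of consultants",
    "facts": {
        "budget": "the budget was cut 20% last quarter",
        "deadline": "the board review is in six weeks",
    },
}

@dataclass
class SessionState:
    # External state: what the client has revealed lives here, not in the prompt.
    revealed: set = field(default_factory=set)

# Composable behaviour rules: each one independently checks a draft reply and
# returns a violation message, or None if the reply passes.
def no_stage_directions(reply: str) -> str | None:
    if re.search(r"\*[^*]+\*|\((?:sighs|crosses|leans)[^)]*\)", reply, re.IGNORECASE):
        return "narrated a physical action instead of speaking"
    return None

def no_academic_guidance(reply: str) -> str | None:
    if re.search(r"\byou should (analyse|consider|calculate)\b", reply, re.IGNORECASE):
        return "offered guidance instead of staying a client"
    return None

RULES = [no_stage_directions, no_academic_guidance]

def run_turn(model_call, state: SessionState, student_msg: str) -> str:
    # 1. Rebuild the full character context from scenario data, every turn.
    context = f"You are {SCENARIO['persona']}. Facts you may reveal: {SCENARIO['facts']}"
    reply = ""
    for _ in range(3):  # bounded retries if any rule fires
        reply = model_call(context, student_msg)
        # 2. Run every rule; any violation forces a regeneration with reminders.
        violations = [v for rule in RULES if (v := rule(reply)) is not None]
        if not violations:
            break
        context += "\nRegenerate. Previous draft " + "; ".join(violations) + "."
    # 3. Record which facts have now been revealed, outside the model's memory.
    for key, fact in SCENARIO["facts"].items():
        if fact.lower() in reply.lower():
            state.revealed.add(key)
    return reply
```

The specific checks don't matter; what matters is that character discipline lives in code that runs on every exchange. In the failure cases above, a sympathy appeal worked because the prompt was the only defence. Here, the same appeal just produces a draft that fails a rule and gets regenerated.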
Tests conducted March 2026. Both ChatGPT and Claude were tested on their free tiers. The exact prompt used is shown above; no modifications were made between tests. All transcripts are unedited first attempts, not cherry-picked.