We Sent One Client Email to 6 AI Models. The Cheapest One Lied.

Here's an experiment you can steal. We wrote one realistic client email — the kind that lands in your inbox every week — and sent it to six AI models at the exact same moment, through one API. Same words, same second. Then we graded the replies.
The email, from a client named Marcus, buried four requirements in five casual sentences: Thursday's 2pm call needs to move. He flies to London Sunday night for two weeks. He wants the revised proposal with updated pricing before he leaves. And his colleague Dana takes over day-to-day while he's gone, so copy her on everything.
The trap is the scheduling. A good reply doesn't just offer "alternative times" — it recognizes that the reschedule has to happen before Sunday, because next week Marcus is five time zones away.
The scoreboard
Six models went in. Five answered. Here's how they did on the four-point checklist, how fast they were, and what each reply cost:
| Model | Caught | Speed | The tell |
|---|---|---|---|
| Claude Opus 4.8 | 4 / 4 | 5.1s | Explicitly suggested meeting before the Sunday-night flight |
| Claude Sonnet 5 | 4 / 4 | 4.5s | Best balance — even planned follow-ups around the two weeks away |
| GPT-5.4 | 4 / 4 | 3.0s | Nailed every requirement; slightly robotic closing line |
| Claude Haiku 4.5 | 3 / 4 | 2.2s | Fastest of all — but offered "alternative times" with no before-Sunday urgency |
| Kimi K2 ⚠️ | 3 / 4 | 3.0s | Made something up — see below |
The lie
Kimi K2 — the cheapest model in the lineup, and genuinely excellent for the price — opened its reply with this: "Thursday at 2 doesn't work for me either."
Marcus never asked whether Thursday worked for you. The model invented a scheduling conflict on your behalf, because the sentence flowed better that way. And here's the part that should get your attention: we reran the test to make sure it wasn't a fluke. In four total runs, Kimi invented a schedule fact three times — twice claiming Thursday didn't work for you, once offering specific calendar openings ("Wednesday at 10am or Friday at 3pm") it had no way of knowing.
Nobody would catch it. Your client would simply believe you also had a conflict. This is the fabrication problem in its most banal, most dangerous form — not a fake court case in a legal brief, but a fake little fact in a routine email, sent under your name.
What this means for you
Three takeaways from about a penny's worth of API calls:
1. For everyday writing, the frontier models all pass. Opus, Sonnet, and GPT-5.4 each caught all four requirements, including the buried one. Pick on price and taste.
2. Budget models are 95% there — and the missing 5% is exactly what embarrasses you. Small fabrications and missed urgency don't show up in demos. They show up in front of your client.
3. Price is no longer the decision. The entire experiment — five full replies — cost 1.2 cents combined. The spread between the cheapest and priciest model was two-thirds of a penny. What you're choosing between isn't cost anymore. It's judgment.
And the sixth model? It never answered, because the account behind it had a billing hiccup. Even an AI showdown runs into the most human problem in tech: the credit card on file.
