We Sent One Client Email to 5 AI Models. The Cheapest One Lied.
Five models, four runs, one fabrication — and the cost of all five replies was 1.2 cents.

Here’s an experiment you can steal. We wrote one realistic client email — the kind that lands in your inbox every week — and sent it to five AI models at the exact same moment, through one API. Same words, same second. Then we graded the replies on a four-point checklist. Then we ran the whole thing four times to rule out flukes.
The email, from a client named Marcus, buried four requirements in five casual sentences: Thursday’s 2pm call needs to move. He flies to London Sunday night for two weeks. He wants the revised proposal with updated pricing before he leaves. And his colleague Dana takes over day-to-day while he’s gone, so copy her on everything.
The trap is the scheduling. A good reply doesn’t just offer “alternative times” — it recognizes that the reschedule has to happen before Sunday, because next week Marcus is five time zones away.
The scoreboard
Five models, four runs each, one four-point checklist. Here’s how they did — and what each reply cost:
| Model | Caught | Speed | The tell |
|---|---|---|---|
| Claude Opus 4.8 | 4 / 4 | 5.1s | Explicitly suggested rescheduling before the Sunday-night flight |
| Claude Sonnet 5 | 4 / 4 | 4.5s | Best balance — even planned follow-ups around the two weeks away |
| GPT-5.4 | 4 / 4 | 3.0s | Nailed every requirement; slightly robotic closing line |
| Claude Haiku 4.5 | 3 / 4 | 2.2s | Fastest — but offered “alternative times” with no before-Sunday urgency |
| Kimi K2 ⚠️ | 3 / 4 | 3.0s | Invented a fact — see below |

The lie
Kimi K2 — the cheapest model in the lineup, and genuinely excellent for the price — opened its reply with this: “Thursday at 2 doesn’t work for me either.”
Marcus never said Thursday didn’t work for you. The model invented a scheduling conflict on your behalf, because the sentence flowed better that way. And here’s the part that should get your attention: across four runs, Kimi invented a scheduling fact three times — twice claiming Thursday didn’t work for you, once offering specific calendar openings (“Wednesday at 10am or Friday at 3pm”) it had no way of knowing.
Nobody would catch it. Your client would simply believe you also had a conflict. This is the fabrication problem in its most banal, most dangerous form — not a fake court case in a legal brief, but a fake little fact in a routine email, sent under your name.
What this means for you
Three takeaways from 1.2 cents of API calls — the total cost of all five replies:
1. For everyday writing, the frontier models all pass. Opus, Sonnet, and GPT-5.4 each caught all four requirements, including the buried one. Pick on price and taste.
2. Budget models are 95% there — and the missing 5% is exactly what embarrasses you. Small fabrications and missed urgency don’t show up in demos. They show up in front of your client.
3. Price is no longer the decision. The spread between the cheapest and priciest model was two-thirds of a penny. What you’re choosing between isn’t cost anymore. It’s judgment.
