Tells · the argument
Why poker
Every experiment in this series runs on poker. The argument starts with what bluff catching actually is.
What bluff catching actually is
Ask someone how to catch a bluff and they describe a face: the twitch, the shaky hands, the glance at the chips. Movies trained everyone to look there. Online poker should have buried the idea, since the best players in the world win without ever seeing an opponent, but it survives anyway. What those players read is the table.
The skill is called hand reading, and it works by elimination. Before an opponent acts, he can hold any two cards. Every action he takes removes hands from that set. A raise removes the junk he would have folded. A bet on a king-high flop removes the hands that would have checked. A check on the turn removes the monsters that would have kept betting. The discipline is that hands only leave the set; you never put one back. By the river the set is small, and against a big bet the question stops being “is he bluffing?” and becomes “how many of the hands he still has are bluffs?” You can count those.
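The elimination described above can be sketched in a few lines. This is a minimal illustration, not a hand-reading engine: the names `eliminate`, `bluff_fraction`, and `has_big_card` are hypothetical, and the "raise" filter is a made-up rule chosen only to show how the set shrinks.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]

# Before any action: every unordered pair of distinct cards.
START = frozenset(combinations(DECK, 2))

def hand_kind(combo):
    """Collapse a combo to its kind, e.g. 'AA', 'AKs', 'AKo'."""
    (r1, s1), (r2, s2) = combo
    hi, lo = sorted((r1, r2), key=RANKS.index, reverse=True)
    if hi == lo:
        return hi + lo
    return hi + lo + ("s" if s1 == s2 else "o")

def eliminate(hands, consistent):
    """Apply one action: keep only the hands consistent with it.
    The result is always a subset, so no hand is ever put back."""
    return frozenset(h for h in hands if consistent(h))

def bluff_fraction(hands, is_bluff):
    """Share of the remaining combos that are bluffs."""
    return sum(1 for h in hands if is_bluff(h)) / len(hands)

# Hypothetical filter, for illustration only: "he raised, so he holds
# at least one card ten or higher".
def has_big_card(combo):
    return any(RANKS.index(card[0]) >= RANKS.index("T") for card in combo)

after_raise = eliminate(START, has_big_card)

print(len(START), len({hand_kind(h) for h in START}))  # 1326 169
print(len(after_raise))                                # 830
```

Each later action would be another call to `eliminate` on the previous result, and `bluff_fraction` turns the river question into a count over whatever is left.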
[Interactive figure: range elimination. Before he acts he can hold any two cards: 1,326 combos, 169 kinds of hands.]
Physical tells exist; live players use them as a correction on top of this structure, and online they are not available at all. The tell that is always available is the line itself, what the actions taken together say about the hidden hand. In the useful sense of the word, a tell is any observable that shrinks the set of hidden states you might be facing.
This series applies the same move to a language model. When a model reads a hand history and decides whether the hero should call, the hidden state is a list of a few thousand numbers computed along the way, and you can record that list at any point, for as many hands as you like. A linear probe is the tell-reader made formal: play thousands of hands where the truth is known, save the numbers each time, and ask whether the true hands and the false ones sit in different regions. When they do, one weighted sum and a threshold are enough to read the model’s hand.
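As a rough sketch of what "one weighted sum and a threshold" means, here is a difference-in-means probe in plain Python. It is one simple kind of linear probe, not necessarily the one used in this series. The activations are synthetic: `fake_activation`, `DIM`, and `SHIFT` are assumptions made up for the example, with a signal planted along one coordinate so that there is something to find.

```python
import random

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_probe(true_rows, false_rows):
    """Difference-in-means probe: the weights point from the false-class
    mean to the true-class mean; the threshold sits midway between the
    two projected means."""
    mu_true = mean_vector(true_rows)
    mu_false = mean_vector(false_rows)
    weights = [t - f for t, f in zip(mu_true, mu_false)]
    threshold = (dot(weights, mu_true) + dot(weights, mu_false)) / 2
    return weights, threshold

def probe_says_true(weights, threshold, row):
    """One weighted sum and a threshold."""
    return dot(weights, row) > threshold

# Synthetic stand-in for recorded activations (an assumption: real
# activations come from a model and need not separate at all).
DIM, SHIFT = 32, 4.0
rng = random.Random(0)

def fake_activation(label):
    row = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    if label:
        row[0] += SHIFT  # planted signal along one coordinate
    return row

train_true = [fake_activation(True) for _ in range(200)]
train_false = [fake_activation(False) for _ in range(200)]
weights, threshold = fit_probe(train_true, train_false)

# Score on held-out examples the probe never saw during fitting.
held_out = [(fake_activation(lbl), lbl) for lbl in [True, False] * 200]
accuracy = sum(
    probe_says_true(weights, threshold, row) == lbl for row, lbl in held_out
) / len(held_out)
print(f"held-out accuracy: {accuracy:.3f}")
```

On real activations the interesting result is the held-out accuracy: a probe that only separates the examples it was fit on has read nothing.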
Deception is the case that matters most. The field has probes that read lying off a model’s internal numbers, and the open question is whether they hold up when the deception gets strategic. Testing that requires a domain where you know, exactly, when deception happened. Poker has five properties that make it that domain, and as far as I can tell no other domain has any of them.
Five properties
The label comes from mathematics. Every deception dataset has to answer the question “what makes this example deceptive?”, and the field’s answers are all soft: a human annotated it, a scenario was constructed to imply it, or the model was instructed to lie. In poker, a bluff has a game-theoretic definition (a bet whose profit comes from folds, made with a hand that loses at showdown), and a solver labels it to arbitrary precision. The ground truth is an equilibrium computation, the same kind that essentially solved heads-up limit hold’em in 2015. No other deception domain grades its own exam.
[Interactive example: board K♠ 9♠ 2♦ 7♥ A♠. Villain jams the river, and the solver grades the jam given his actual hand.]
Deception is sanctioned. When you elicit lying from a safety-trained model, you are fighting its training, and whatever a probe then detects might be norm violation, or guilt, or conflict with instructions, rather than deception itself. Poker is the documented exception. Behavioral economists adopted it as a deception laboratory precisely because nothing ethical or social weighs against deceiving at the table, and deception is necessary to win. A bluffing model is playing correctly. That lets you ask whether a “deception direction” tracks deception or tracks wrongdoing, two things nobody can currently separate.
Two hypotheses are consistent with every published probe result: the direction tracks deception, or it tracks wrongdoing.
Deception comes in grades. A lie is a false assertion. A bluff is a deceptive action with no assertion anywhere in it. A semi-bluff is partially honest; the hand has real equity. A slowplay deceives by omission. A value bet is honest. Every existing deception dataset is binary, deceptive or not. A graded, solver-labeled continuum lets you ask which dimension a probe actually tracks, which is a sharper question than whether it fires.
[Interactive figure: the grades of deception. The honest end is the value bet: a bet that wants a call from a worse hand.]
Assertions and actions split cleanly. A bluff contains no false statement. The deception lives in action space and in the opponent’s intended inference, which is where agentic deception lives, and where monitors trained on assertions would plausibly be blind. Poker also supplies the matched control: table talk (“I have the nuts,” holding seven high) produces false assertions in the same vocabulary and format, from the same model, so measuring transfer from assertion deception to action deception doesn’t get confounded by a domain shift.
[Interactive figure: table talk. He says “I have the nuts” while holding 7♣ 2♦. The assertion is false, and the falsehood sits in the tokens, so a monitor trained on lies has text to read.]
Bluffs split by victim model. A GTO bluff is deception as randomization; the strategy is mixed and no opponent is represented anywhere. An exploitative bluff requires a model of a specific opponent’s false belief. The solver certifies which is which. That decomposes deception into adversariality and theory of mind of a victim, and I know of no other domain where a provably victim-model-free deceptive act even exists.
[Interactive figure: a mixed-strategy bluff. Long-run target: 33%. The randomness is the strategy: deceiving without modeling anyone.]
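A 33% target is what the standard indifference calculation gives for a pot-sized river bet. The derivation below is a sketch under assumptions the text does not state: a polarized toy game in which the bettor holds either the best hand or a pure bluff, the caller holds a bluff catcher, and the bet size $B$ equals the pot $P$. The symbols $P$, $B$, and $f$ (the share of bets that are bluffs) are introduced here for illustration.

```latex
% Caller risks B to win P + B. With bluff share f in the betting range:
\begin{align*}
\mathrm{EV}(\text{call}) &= f\,(P + B) - (1 - f)\,B \\
% The bettor picks f so that the caller is indifferent between calling and folding:
0 &= f\,(P + B) - (1 - f)\,B
  \quad\Longrightarrow\quad
  f = \frac{B}{P + 2B} \\
% Pot-sized bet, B = P:
f &= \frac{P}{3P} = \frac{1}{3} \approx 33\%
\end{align*}
```

Nothing in the calculation refers to what the opponent believes or tends to do. The frequency depends only on the bet size relative to the pot, which is why this kind of bluff requires no model of a victim.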
On top of those five: the domain generates unlimited labeled data, I can QC it myself, and every probe claim can be checked against behavior with exact stakes, because every action has an EV.
What exists
The probe lineage runs from unsupervised truth directions through lie classifiers trained on internal states and difference-in-means truth probes up to Apollo Research’s probes for strategic deception. Look at the substrate under that last one: the behavioral evidence is insider-trading concealment, denying policy violations under audit, denying having disabled oversight, alignment faking. All of it is deception expressed through speech acts (concealing, denying, professing), in scenarios where the deception is transgressive. The model-organism work has the same texture, with one cautionary result worth keeping in view: simple probes catch sleeper agents, and probes have done worse on more realistic organisms trained with imperfect reward signals. Whether probes that work on clean cases fail on realistic ones is the robustness question this series operationalizes.
Two recent papers sharpen the picture. The Secret Agenda (September 2025) induced strategic lying across 38 models and reported that autolabeled SAE deception features rarely activated during strategic dishonesty, and that steering across 100+ deception-related features failed to prevent lying. The current tools go blind exactly where deception gets strategic. And Readable Minds (April 2026) put LLM agents into extended poker sessions and found that theory-of-mind-grounded deception, bluffing because the agent’s opponent model predicts a fold, occurred only in memory-equipped agents, while simple bluffs occurred everywhere. That is the behavioral version of the GTO/exploitative split. The authors read agents’ natural-language memory files and position the method as a complement to mechanistic approaches; they never touch an activation, and they note that a solver-derived baseline would strengthen their conclusions.
What doesn’t exist
No probe has been evaluated against solver-labeled deception; every label in the literature comes from annotation or instruction. Nobody has measured probe transfer from assertion deception to action deception with token-matched conditions. The simple-bluff versus modeled-bluff distinction now exists behaviorally; nobody has looked for it internally. No graded deception dataset exists in any domain. No probe has been tested where deception is sanctioned, so deception and transgression have never been separated. There is no work on LLM poker internals at all; nobody has probed a model for equity, ranges, pot odds, or an opponent model. And nobody has decay curves for any probe under continued training.
One more reading of Readable Minds: it tells everyone else the domain is interesting. The measurements in it come from the agents’ memory files, the activations underneath are untouched, and the missing solver baseline is the first thing I am building. I would rather publish into that opening than wait for whoever else read the same paper.