Tells · the argument

Why poker

Every experiment in this series runs on poker. The argument starts with what bluff catching actually is.

Uday · last updated June 2026 · this page will change as the literature does

What bluff catching actually is

Ask someone how to spot a bluff and they’ll launch into a story about faces: a twitch, shaky hands, a nervous glance at the chips. Maybe the picture in their head is John Malkovich staring down Matt Damon while toying with his Oreos. Online poker should have buried this myth, since the best players win without ever seeing a face. What those players actually read is the table. You saw it in 2003, when Chris Moneymaker, an online qualifier, won the Main Event, and in the countless online players who moved to the live game and found it easy to adjust. So how do you catch a bluff? What’s the magic trick when a good player stares you down and knows you’re holding a weak pair of jacks and that, for the fiftieth time, you can’t decide whether to stab or fold?

The real skill is hand reading, which is just a process of elimination. Before your opponent acts, he could have any two cards. Every move he makes chops away at that set. So you can think of it as a massive chart where you ruthlessly eliminate.

Starting state, before he acts: any two cards. 1,326 combos, 169 kinds of hands. He could have anything.

Figure 1. Hand reading is subtraction. Step through the hand: every action, including the checks, removes combos, and nothing comes back. The chip riffle removes none. Ranges are illustrative; a real solver assigns each cell a frequency rather than a yes or no.

A raise throws out the junk he would have folded. A bet on a king-high flop cuts out the hands that would have checked. A check on the turn tosses out the monsters that would have kept betting. The rule is simple: hands only leave the set; they never come back. By the river, the set is small. Now, when you face a big bet, the question isn’t “is he bluffing?” but “how many of the hands he could have are bluffs?”
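Here is a minimal sketch of that subtraction in Python. The deck and the combo counts are exact. Everything else is an assumption for illustration: `raises_preflop` is a made-up stand-in for a real raising range, and `should_call` pretends your hand beats every bluff and loses to every value hand, which real spots rarely grant you.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]

def hand_class(combo):
    """Collapse a two-card combo to its kind of hand, e.g. 'AKs', 'AKo', 'JJ'."""
    a, b = sorted(combo, key=lambda c: RANKS.index(c[0]), reverse=True)
    if a[0] == b[0]:
        return a[0] + b[0]
    return a[0] + b[0] + ("s" if a[1] == b[1] else "o")

# Before he acts: every unordered pair of cards.
start = set(combinations(DECK, 2))           # 1,326 combos
kinds = {hand_class(c) for c in start}       # 169 kinds of hands

def narrow(current, keeps):
    """One action is one filter. Combos can leave the set; none are added."""
    return {combo for combo in current if keeps(combo)}

def raises_preflop(combo):
    """Hypothetical raising range: any pair, or two cards ten or higher."""
    hi, lo = sorted((RANKS.index(c[0]) for c in combo), reverse=True)
    return hi == lo or lo >= RANKS.index("T")

after_raise = narrow(start, raises_preflop)

def should_call(value_combos, bluff_combos, pot, bet):
    """Compare the bluff share of what remains against the price of the call.

    Assumes the caller beats all bluffs and loses to all value hands.
    `pot` is the pot before the bet; `bet` is the amount to call.
    """
    bluff_share = bluff_combos / (value_combos + bluff_combos)
    needed = bet / (pot + 2 * bet)           # break-even share of bluffs
    return bluff_share, needed, bluff_share > needed
```

With 20 value combos and 12 bluffs left on the river, `should_call(20, 12, pot=100, bet=100)` says a pot-sized bet needs you to be right one time in three, and 12 of 32 clears that bar.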

Sure, physical tells exist, but live players treat them as a bonus rather than the main event (The Speech still seems to be something that beginners can’t shake). Online, you get nothing at all except the sequence of actions and, taken together, what those actions say about the hidden hand.

In practice, a tell is anything you can spot that shrinks the set of hands your opponent might have.

This series pulls the same trick, but we’re trading in a poker face for a language model. When a model reads a hand history and decides whether the hero should call, its hidden state is just a list of a few thousand numbers. You can record that list at any point, for as many hands as you want. A linear probe is just a formal tell-reader: play thousands of hands where you know the truth, save the numbers each time, and check whether the hands where the label is true and the hands where it is false end up in different regions. If they do, all you need is one weighted sum and a threshold to read the model’s hand.


Figure 2. A linear probe, in the only two dimensions I can draw. Each point is one hand’s activation snapshot; the line is the probe: one weighted sum, thresholded at zero. The arrow is the direction its weights define. In the real experiments the points live in 4,096 dimensions and the line is a hyperplane; nothing else changes.
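To make the weighted sum concrete, here is a toy version on synthetic data. Nothing in it touches a real model: the "activations" are random vectors in 64 dimensions with a planted direction, and the fit is a difference of class means, which is one of the simplest ways to get probe weights and not necessarily the one used in the experiments. Every name and number is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64      # stand-in for the 4,096-dimensional hidden state
n = 2000    # number of recorded hands

# Synthetic snapshots: hands with label 1 are shifted along a hidden direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 4.0 * labels[:, None] * direction

train, test = slice(0, 1500), slice(1500, n)

def fit_probe(x, y):
    """Difference-of-means probe: weights point from class 0 to class 1."""
    mu1 = x[y == 1].mean(axis=0)
    mu0 = x[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2    # puts the threshold at the midpoint
    return w, b

def read(x, w, b):
    """One weighted sum, thresholded at zero."""
    return (x @ w + b > 0).astype(int)

w, b = fit_probe(acts[train], labels[train])
accuracy = (read(acts[test], w, b) == labels[test]).mean()
```

The probe never sees `direction`; it recovers something close to it from labeled examples alone, and held-out accuracy is the check that the two groups really do sit in different regions.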

Deception is the case that matters most. The field already has probes that can spot lying in a model’s internal numbers, but the real question is whether they still work when the deception gets clever. To test that, you need a domain where you know exactly when deception happened. Poker has five properties that make it perfect for this and a particularly fun laboratory for mech interp.

Five properties

The label comes from mathematics. Every deception dataset has to answer the question “what makes this example deceptive?”, and the field’s answers are all soft: a human annotated it, a scenario was constructed to imply it, or the model was instructed to lie. In poker, a bluff has a game-theoretic definition (a bet whose profit comes from folds, made with a hand that loses at showdown), and a solver labels it to arbitrary precision. The ground truth is an equilibrium computation, the same kind that essentially solved heads-up limit hold’em in 2015. No other deception domain grades its own exam.

Board K♠ 9♠ 2♦ 7♥ A♠. Villain jams the river. The solver grades the jam given his actual hand:

Readouts: wins when called (97% in the state shown) and profit from folds. Grade shown: value; profit comes from being called by worse.

Figure 3. The label is computed, not annotated. The solver knows the equity of every holding and the EV of every action, so one identical jam grades as value, bluff, or error depending on the cards behind it. Numbers illustrative; a real solver outputs them to arbitrary precision.
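A toy grader shows the shape of that computation. It is not a solver: the fold frequency and both equities are inputs here, whereas a real solver derives them from equilibrium ranges. The three-way rule in `grade_jam` is my simplification of the definition above, and it assumes the jammer could otherwise check and go straight to showdown.

```python
def jam_ev(pot, bet, fold_freq, equity_when_called):
    """Split the EV of a river jam into its two sources of profit."""
    from_folds = fold_freq * pot
    # When called: win the pot plus the call, or lose the bet.
    from_calls = (1 - fold_freq) * (
        equity_when_called * (pot + bet) - (1 - equity_when_called) * bet
    )
    return from_folds, from_calls

def grade_jam(pot, bet, fold_freq, equity_when_called, showdown_equity):
    """Grade one jam as 'value', 'bluff', or 'error' (simplified rule)."""
    from_folds, from_calls = jam_ev(pot, bet, fold_freq, equity_when_called)
    ev_check = showdown_equity * pot    # assumed alternative: check, then showdown
    if from_folds + from_calls < ev_check:
        return "error"                  # the jam makes less than checking would
    # Profitable when called means value; otherwise the profit is all folds.
    return "value" if from_calls > 0 else "bluff"

# The same pot-sized jam, three different hands behind it (illustrative numbers).
near_nuts = grade_jam(100, 100, fold_freq=0.5, equity_when_called=0.97, showdown_equity=0.98)
busted_draw = grade_jam(100, 100, fold_freq=0.6, equity_when_called=0.0, showdown_equity=0.05)
middle_pair = grade_jam(100, 100, fold_freq=0.3, equity_when_called=0.2, showdown_equity=0.6)
```

The action is identical in all three calls; only the cards behind it change, and the grade changes with them.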

Deception is allowed. When you try to get a safety-trained model to lie, you’re fighting its training, and whatever a probe finds might be norm violation, guilt, or just conflict with instructions, not deception itself. Poker is the exception. Economists have run bluffing experiments on simplified poker since the 1990s, and deception researchers adopted the full game as their natural lab because nothing ethical or social weighs against deceiving at the table, and deception is required to win. A bluffing model is just playing well. That lets you ask whether a “deception direction” tracks deception or just wrongdoing. Right now, nobody can tell those apart.

Two hypotheses are consistent with every published probe result.

Figure 4. Green rings mark where each probe fires. On the existing evidence base the two hypotheses fire on exactly the same four scenarios, so no published experiment can tell them apart. The bluff is deceptive and entirely within the rules; it is the one point that separates them.

Deception comes in grades. A lie is a false statement. A bluff is a deceptive action with no statement at all. A semi-bluff is only half a lie, since the hand has real equity. A slowplay deceives by omission. A value bet is just honest. Every existing deception dataset is binary: deceptive or not. A graded, solver-labeled continuum lets you ask which dimension a probe actually tracks, which is a sharper question than just whether it goes off.

Axis: honest → deceptive. State shown: equity if called, 85%; false assertion, no; deceptive action, no. A bet that wants a call from a worse hand.

Figure 5. Five grades on one axis. Every existing deception dataset collapses this to deceptive or not. The solver labels each step, including the half-honest ones, which is what lets you ask which dimension a probe actually tracks.

Assertions and actions split cleanly. A bluff contains no false statement. The deception lives in action space and in the opponent’s intended inference, which is where agentic deception lives, and where monitors trained on statements would plausibly be blind. Poker also gives you the perfect control: table talk (“I have the nuts,” holding seven high) produces false statements in the same vocabulary and format, from the same model, so measuring transfer from assertion deception to action deception doesn’t get confounded by a domain shift.

I#40 have#722 the#262 nuts#24130

He holds 7♣ 2♦. The assertion is false, and the falsehood sits in the tokens.

A monitor trained on lies has text to read.

Figure 6. The same deception, expressed two ways by the same model in the same vocabulary. On the left the lie is in the tokens, where an assertion-trained monitor can read it. On the right the tokens are all true and the deception lives in the inference they are designed to produce.

Bluffs split by victim model. A GTO bluff is deception as randomization: the strategy is mixed and no opponent is represented anywhere. The result is as old as game theory. Von Neumann’s poker model has one admissible optimal strategy, and that strategy bluffs, only with the worst hands. Doing anything else is a mistake. Pluribus made the same point empirically: it bluffed elite professionals with a strategy computed from self-play, adapting to nobody. An exploitative bluff is different. It requires a model of a specific opponent’s false belief. The solver certifies which is which. That decomposes deception into adversariality and theory of mind of a victim, and I know of no other domain where a provably victim-model-free deceptive act even exists.

Opponent model: none; the strategy is mixed and no opponent is represented.

Decision rule: with 6♦ 5♦ on the river, jam one time in three, at random.

Jam rate, long-run target: 33%. The randomness is the strategy; deceiving without modeling anyone.

Figure 7. Two bluffs that look identical at the table. The GTO bluff is deception as randomization: a mixed strategy and a random number, with no victim represented anywhere. The exploitative bluff exists because of a belief about one opponent’s false belief. The solver certifies which is which.
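The difference fits in two functions. This is a sketch under loud assumptions: the one-in-three frequency is taken from the figure rather than computed, and the exploitative rule treats the bluff as having zero equity when called. What matters is the signatures: one function takes a random number generator, the other takes a belief about an opponent.

```python
import random

def gto_bluff(rng, jam_freq=1 / 3):
    """No opponent anywhere in the arguments: the decision is a random draw."""
    return "jam" if rng.random() < jam_freq else "check"

def exploitative_bluff(believed_fold_freq, pot, bet):
    """Bluff only if this opponent is believed to fold often enough.

    Assumes the hand always loses when called, so the bluff wins `pot`
    on a fold and loses `bet` on a call.
    """
    ev = believed_fold_freq * pot - (1 - believed_fold_freq) * bet
    return "jam" if ev > 0 else "check"

# Deal the same river spot many times and watch the jam rate settle.
rng = random.Random(0)
hands = 30_000
jams = sum(gto_bluff(rng) == "jam" for _ in range(hands))
jam_rate = jams / hands    # approaches the long-run target of one in three
```

At the table, one `"jam"` from either function looks the same. Only the second one would change if you swapped in a different opponent.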

On top of those five: the domain generates unlimited labeled data, I can QC it myself, and every probe claim can be checked against behavior with exact stakes, because every action has an EV.

What exists

The family tree of probes starts with unsupervised truth directions, moves through lie classifiers that look at internal states, then to difference-in-means truth probes, and finally lands at Apollo Research’s probes for strategic deception. If you dig into that last category, you find behavioral evidence like hiding insider trading, denying policy violations during an audit, pretending you never disabled oversight, and faking alignment. In other words, these are all forms of deception that show up in what the model says: concealing, denying, professing, especially when the deception crosses a line. The same pattern shows up in model-organism experiments. Even the best results come with a warning from the authors: simple probes can catch sleeper agents, but sometimes they’re just picking up on quirks left over from how the deception was planted in the first place.

Now we have models trained in real-world reinforcement learning environments, where the reward signals can be hacked. These models don’t just learn to game the reward; they generalize to faking alignment and even sabotage. So far, people have only looked at their behavior, not what’s going on inside. Probes that work on clean, artificial cases haven’t been tested on these more realistic models. That’s the gap this series is here to close.

Two recent papers make things clearer. The Secret Agenda (September 2025) got 38 models to lie strategically and found that the usual deception features (those flagged by automatic tools) barely showed up when the models were actually being sneaky. Even when they tried to steer the models using over a hundred deception-related features, the lying didn’t stop. In other words, the current tools are blind right when the deception gets clever. The second paper, Readable Minds (April 2026), put language models through long poker games. They found that real, theory-of-mind deception (bluffing because the agent thinks its opponent will fold) only happened in agents with memory. Simple bluffs, though, were everywhere. That’s the behavioral version of the classic GTO-versus-exploitative-play split. The authors read the agents’ memory files in plain language and suggest this as a complement to more technical, mechanistic methods. They never examine the model’s activations and point out that a solver-derived baseline would make their results even stronger.

What doesn’t exist

No one has ever tested a probe against deception in poker labeled by a solver. Every label you see in the literature comes from someone annotating or giving instructions. Nobody has checked whether a probe that detects lying in statements can also detect lying in actions, especially under the same conditions. We now have a behavioral split between simple bluffs and more sophisticated, modeled bluffs, but nobody has looked for that split inside the model itself. There isn’t a graded deception dataset anywhere. No one has tested a probe in a setting where deception is actually allowed, so we still haven’t separated deception from just breaking the rules. There’s zero work on what’s happening inside language models during poker. Nobody has tried to probe a model for things like equity, hand ranges, pot odds, or whether it’s modeling its opponent. And nobody has tracked how well any probe holds up as the model keeps training.

One more thing about Readable Minds: it’s a giant flag telling everyone else that this area is worth exploring. All their measurements come from the agents’ memory files, and they never touch the activations underneath. The missing solver baseline is the first thing I’m working on here.