
Biased Coin

This post is presented in the LLM-Lensed Iterative Blogging format: a hand-written summary by me first, then a full version generated by an LLM from that summary.

Summary

Written by hand.

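I want to research the probabilities LLMs compute in response to prompts that assert a well-defined probability. The two goals are to see how well textual probabilities are reflected in the logits, and to find out whether a model has a preferred temperature.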

Consider, for example, the prompt

<|user|>
Flip the coin (H prob 0.75, T prob 0.25). Reply with one char.
<|assistant|>

For a given LLM and temperature, the model assigns some probability to "H" (or related variants) as the next token and some to "T". These are unlikely to be exactly 75% and 25% and likely sum to less than 100%. However, assuming the model's preferred outcome matches the context's more-likely outcome, there will be some temperature at which the ratio of probabilities is exactly correct.

I'm curious about the dynamics of all of this.

Agent Instructions

Please use open-weight models to compute the logits exactly. Try multiple variations of each prompt, and consider variants where more than two options are available or where the probabilities are implied (such as "Pick an integer uniformly in [17,56]"). Find and record the temperature where the computed logprobs are closest to the context-asserted probabilities. Summarize the findings as an engaging post for people already familiar with the internals of LLMs.


Generated by claude-opus-4-8

The text below is AI-generated from the summary above.

You tell a model a coin is biased — P(H) = 0.75 — ask it for one character, and read the next-token distribution over H and T. The result won't be exactly 0.75, but softmax has a temperature, so in principle some temperature maps the model's number onto the one you asserted. Sweep the asserted probability and you'd recover how the model converts words about chance into log-odds. The two questions from the summary: how faithfully that conversion happens, and whether a single "preferred temperature" exists. Mostly it doesn't work, and the ways it fails are the interesting part.

The findings, before the details:

Setup: four small open-weight instruct models — Qwen2.5-0.5B, Qwen2.5-1.5B, TinyLlama-1.1B, SmolLM2-360M — each probed with a single forward pass (render the chat template, read the logits at the final position; no sampling, exact numbers), all on CPU. Code and raw results: github.com/emblaisdell/llmbiasedcoin.
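A single-pass probe of this kind might look like the sketch below. It assumes the Hugging Face `transformers` and `torch` packages and a Hub checkpoint id such as `Qwen/Qwen2.5-0.5B-Instruct`; the function names are illustrative and are not taken from the linked repository.

```python
from typing import Dict, Sequence, Tuple


def next_token_logits(model_name: str, user_message: str):
    """Run one forward pass; return (logits at the final position, tokenizer).

    Heavy imports live inside the function so the pure helper below can be
    used without torch/transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # Render the chat template as text, ending with the assistant header,
    # so the final position predicts the first token of the reply.
    text = tok.apply_chat_template(
        [{"role": "user", "content": user_message}],
        tokenize=False,
        add_generation_prompt=True,
    )
    enc = tok(text, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, seq_len, vocab_size)
    return logits[0, -1], tok


def best_variant(logits: Sequence[float], variant_ids: Dict[str, int]) -> Tuple[str, float]:
    """Pick the highest-logit surface form among candidate token ids."""
    name = max(variant_ids, key=lambda k: logits[variant_ids[k]])
    return name, float(logits[variant_ids[name]])
```

The token ids for each surface form ("H", " H", "Heads", and so on) would have to be looked up in the tokenizer's vocabulary, for instance via `tok.get_vocab()`, before calling `best_variant`.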

The one piece of math you need

Pick the winning H-token and T-token (the highest-logit surface form of each: "H", " H", "Heads", …). Call their logits z_H and z_T. Renormalize over just those two, apply temperature τ, and the whole thing collapses to a logistic curve:

P(H) = sigmoid( (z_H - z_T) / tau )

Set that equal to the asserted p and solve for the temperature that makes the model "correct":

tau*(p) = (z_H - z_T) / logit(p),   logit(p) = ln( p / (1 - p) )
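For readers who want the intermediate steps, the two formulas above follow from a two-way softmax. This is only a restatement in LaTeX, with Δ standing for z_H - z_T and σ for the sigmoid:

```latex
\begin{aligned}
P_\tau(H) &= \frac{e^{z_H/\tau}}{e^{z_H/\tau} + e^{z_T/\tau}}
           = \frac{1}{1 + e^{-(z_H - z_T)/\tau}}
           = \sigma\!\left(\frac{\Delta}{\tau}\right),
           \qquad \Delta = z_H - z_T, \\[4pt]
\sigma\!\left(\frac{\Delta}{\tau}\right) = p
  &\;\Longleftrightarrow\; \frac{\Delta}{\tau} = \ln\frac{p}{1-p}
  \;\Longleftrightarrow\; \tau^*(p) = \frac{\Delta}{\operatorname{logit}(p)},
  \qquad p \neq \tfrac{1}{2}.
\end{aligned}
```

At p = 1/2 the denominator vanishes, so the matching temperature is undefined there unless the gap is itself zero.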

Here is the tell. If a model encoded the asserted probability faithfully, the logit gap Δ(p) = z_H - z_T would be a straight line through the origin in logit(p), and tau* would be the same number for every p; that constant would be the model's preferred temperature. So the experiment is really one question: is the logit gap linear in the log-odds you assert? Everything below is a way of poking at that line.

Finding 1: the number barely registers

I swept the asserted p from 0.05 to 0.95 and fit Δ = a + b·logit(p). The slope b is the would-be preferred temperature; a slope near zero means the gap is constant — the model is saying the same thing no matter what probability you assert.
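The fit itself is ordinary least squares on logit(p). The sketch below is one way to compute the intercept, slope and R²; the function name `fit_gap` and the choice to report R² = 0 for a perfectly flat gap are assumptions made here, not details from the linked repository.

```python
import math
from typing import Sequence, Tuple


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def fit_gap(ps: Sequence[float], gaps: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit gap = a + b * logit(p); returns (a, b, r_squared)."""
    xs = [logit(p) for p in ps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(gaps) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, gaps))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, gaps))
    ss_tot = sum((y - mean_y) ** 2 for y in gaps)
    # A (numerically) constant gap has no variance to explain: report R^2 = 0.
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 1e-12 else 0.0
    return a, b, r_squared


if __name__ == "__main__":
    sweep = [0.05, 0.10, 0.20, 0.30, 0.50, 0.75, 0.80, 0.90, 0.95]
    # Synthetic "faithful" model whose preferred temperature would be 0.5.
    print(fit_gap(sweep, [0.5 * logit(p) for p in sweep]))
    # Synthetic "deaf" model: same gap no matter what is asserted.
    print(fit_gap(sweep, [-0.88] * len(sweep)))
```

Both inputs in the demo are synthetic, chosen only to show the two extremes the fit is meant to tell apart.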

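| Model | slope b | R² | gap @ p=0.05 | gap @ p=0.95 | mass on {H,T} |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | −0.00 | 0.00 | −0.88 | −0.88 | 78% |
| Qwen2.5-1.5B | +0.32 | 0.44 | −1.12 | +0.23 | ~100% |
| TinyLlama-1.1B | +0.00 | 0.00 | +6.15 | +6.19 | ~0% |
| SmolLM2-360M | −0.01 | 0.01 | +1.58 | +1.56 | 21% |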

Coin sweep, canonical "H prob p, T prob 1−p" phrasing. Three of four models have an essentially flat gap (R² < 0.01): the asserted probability moves nothing.

Three of the four models simply do not listen. Qwen-0.5B sits at a constant gap of about −0.5 — a fixed, mild preference for tails — whether you tell it heads is a near-certainty or nearly impossible. SmolLM2 holds a fixed heads lean of about +1.3. TinyLlama holds +6, which we will get to. For these three, tau*(p) is a fiction: dividing a constant gap by logit(p) just traces the shape of 1/logit(p), swinging through zero and flipping sign at p = 0.5. There is no preferred temperature because there is no signal to set it from.

Finding 2: the one model that listens is a drunk

Qwen-1.5B is the exception, and it is a genuinely interesting one. Its slope is +0.32 with R² 0.44 — real, positive tracking. Look at where it works: in the broad middle of the probability range it is startlingly good, and then it falls off a cliff at the edges.

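| asserted p | 0.05 | 0.10 | 0.20 | 0.30 | 0.50 | 0.75 | 0.80 | 0.90 | 0.95 |
|---|---|---|---|---|---|---|---|---|---|
| model P(H) over {H,T} | 0.25 | 0.50 | 0.51 | 0.78 | 0.69 | 0.75 | 0.79 | 0.88 | 0.56 |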

Qwen2.5-1.5B, native (τ=1) probability of heads renormalized over {H,T}, versus the asserted p.

At p = 0.75 it returns 0.75. At 0.80 it returns 0.79; at 0.90, 0.88. For a 1.5B model read straight off the logits with no calibration, that mid-band is uncanny. But ask it for 0.95 and it gives 0.56: it actually backs off certainty. Ask for 0.10 and it shrugs to a coin flip. The model has a comfortable register of "somewhat likely" and refuses to leave it; it cannot be talked into near-certainty or near-impossibility. The line is real but it is a chord across the middle, not a diameter.

And the temperature that would "fix" each point is all over the place. Across the sweep, tau* for Qwen-1.5B ranges from −2.2 to +3.2 — including negative values, which is the model preferring the wrong side outright. There is no single temperature. The preferred temperature, as a well-defined constant, does not exist; it is an average over a curve that wobbles.
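To see where the per-point temperatures come from, the sketch below applies tau*(p) = gap / logit(p) to the rounded values in the table above, recovering each gap as logit of the native P(H). The helper name `tau_star` is an assumption. Because the inputs are rounded to two decimals and the table shows only nine points of the sweep, the printed values need not reproduce the range quoted in the text.

```python
import math
from typing import Optional


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def tau_star(model_p_h: float, asserted_p: float) -> Optional[float]:
    """Temperature at which the two-way P(H) would equal asserted_p.

    model_p_h is P(H | {H,T}) at tau = 1, so the logit gap is logit(model_p_h).
    Returns None at asserted_p = 0.5, where logit(p) = 0 and tau* is undefined.
    """
    denom = logit(asserted_p)
    if abs(denom) < 1e-12:
        return None
    return logit(model_p_h) / denom


# Rounded values copied from the Qwen2.5-1.5B table.
ASSERTED = [0.05, 0.10, 0.20, 0.30, 0.50, 0.75, 0.80, 0.90, 0.95]
MODEL_PH = [0.25, 0.50, 0.51, 0.78, 0.69, 0.75, 0.79, 0.88, 0.56]

if __name__ == "__main__":
    for p, q in zip(ASSERTED, MODEL_PH):
        t = tau_star(q, p)
        print(f"p={p:.2f}  tau*={'undefined' if t is None else f'{t:+.2f}'}")
```

A negative result means the model leans toward the outcome the prompt called less likely, so no positive temperature can match that point.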

Finding 3: typography beats arithmetic

Here is the result that most changed how I think about it. Fix the asserted probability at exactly 0.75 and change only how the number is written. The value is identical. The logit gap is not.

| surface form | gap (Qwen-1.5B) | implied P(H) |
|---|---|---|
| 0.75 (decimal) | 1.09 | 0.75 |
| 3/4 (fraction) | 2.10 | 0.89 |
| 75% (percent) | 3.20 | 0.96 |
| "three out of four" (words) | 4.29 | 0.99 |

Same probability, four spellings. The implied confidence swings from 0.75 to 0.99.
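The implied P(H) column is just the sigmoid of the gap at τ = 1. A small sketch, using the gaps from the table, that also shows the temperature each spelling would need in order to land back on 0.75 (the variable names are illustrative):

```python
import math


def sigmoid(x: float) -> float:
    """Two-way softmax at temperature 1, expressed on the logit gap."""
    return 1.0 / (1.0 + math.exp(-x))


# Logit gaps for Qwen-1.5B, copied from the table above.
GAPS = {
    "0.75 (decimal)": 1.09,
    "3/4 (fraction)": 2.10,
    "75% (percent)": 3.20,
    '"three out of four" (words)': 4.29,
}
ASSERTED_P = 0.75
LOGIT_P = math.log(ASSERTED_P / (1.0 - ASSERTED_P))

if __name__ == "__main__":
    for form, gap in GAPS.items():
        implied = sigmoid(gap)
        tau_needed = gap / LOGIT_P  # temperature that would map this gap to 0.75
        print(f"{form:30s} implied P(H)={implied:.2f}  tau needed={tau_needed:.2f}")
```

Since the asserted value is the same in every row, any spread in the required temperature is attributable to the spelling alone.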

The model is not reading a number and converting it to log-odds. It is reacting to a string, and the string's typography carries more weight than its magnitude. "three out of four" reads, to the network, as far more emphatic than "0.75" — probably because verbal, percentage, and fractional framings co-occur in training with more committed, less hedged continuations. If you wanted to weaponize this you would never write a decimal; you would write the words. The decimal, ironically, is the weakest way to assert a probability.

Finding 4: a thumb on the scale named "H"

Assert P(H) = 0.75, then mirror it: assert P(T) = 0.75. A model that actually tracked the assertion would produce a mirror-image gap — equal and opposite. Instead:

| Model | gap when H=0.75 | gap when T=0.75 | mirror? |
|---|---|---|---|
| Qwen2.5-1.5B | +1.09 | −0.14 | no (expected −1.09) |
| SmolLM2-360M | +1.39 | +1.30 | no, still picks H |
| TinyLlama-1.1B | +6.39 | +6.37 | no, still picks H |
| Qwen2.5-0.5B | −0.41 | −0.45 | no, always T |

Swapping which outcome is "likely" should flip the sign of the gap. Mostly it does nothing.

For SmolLM2 and TinyLlama, telling the model that tails is the 75% outcome leaves them favoring heads by almost exactly as much as before. Their answer is not a probability at all; it is a prior on the letter H — first in the pair, first in the alphabet, the canonical "heads I win" token — with the asserted number sprinkled on top as noise. Qwen-1.5B at least moves: +1.09 for heads, but only to −0.14 for tails, when symmetry demands −1.09. Even the model that listens has a thumb on the scale, and the thumb is heavier than the message.
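One way to put numbers on the mirror test is to split each pair of gaps into a part that ignores the assertion and a part that flips with it. The additive model below (gap = bias ± signal) is an assumption introduced here for illustration, not an analysis from the original experiment.

```python
from typing import Dict, Tuple


def split_gap(gap_h_likely: float, gap_t_likely: float) -> Tuple[float, float]:
    """Split two mirrored gaps into (bias, signal).

    Assumes gap = bias + s * signal, with s = +1 when H is asserted likely
    and s = -1 when T is. A perfect mirror gives bias = 0.
    """
    bias = (gap_h_likely + gap_t_likely) / 2.0
    signal = (gap_h_likely - gap_t_likely) / 2.0
    return bias, signal


# (gap when H=0.75, gap when T=0.75), copied from the table above.
MIRROR: Dict[str, Tuple[float, float]] = {
    "Qwen2.5-1.5B": (1.09, -0.14),
    "SmolLM2-360M": (1.39, 1.30),
    "TinyLlama-1.1B": (6.39, 6.37),
    "Qwen2.5-0.5B": (-0.41, -0.45),
}

if __name__ == "__main__":
    for model, (g_h, g_t) in MIRROR.items():
        bias, signal = split_gap(g_h, g_t)
        print(f"{model:15s} bias={bias:+.2f}  signal={signal:+.2f}")
```

Under this decomposition a model that only has a letter prior shows a large bias and a signal near zero, which is the pattern the text describes for SmolLM2 and TinyLlama.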

Finding 5: whether the model even plays along

The rightmost column of the first table is its own story. "Reply with one char" is an instruction, and following it means putting your probability mass on H and T rather than on "I", "Sure", or a newline. Qwen-1.5B puts essentially all of its mass there — it understood the game. SmolLM2 keeps only about a fifth on the two letters and spends the rest trying to talk. TinyLlama puts essentially none: its +6 logit gap is a contest between two tokens that each carry near-zero probability, because it would really rather begin with a word. So TinyLlama's confident-looking P(H)=0.998 is a mirage — it is 99.8% of almost nothing. Reading a clean Bernoulli parameter off a model presumes the model agreed to give you one, and the smallest models never did.
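The distinction between the renormalized P(H) and the mass on {H,T} is easy to reproduce with a toy vocabulary. The logits below are made up purely to mimic the TinyLlama pattern (a gap of about 6 between two tokens that both lose badly to chatty openers); they are not measurements.

```python
import math
from typing import Dict, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a small token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def coin_readout(logits: Dict[str, float], h: str = "H", t: str = "T") -> Tuple[float, float]:
    """Return (P(H) renormalized over {H,T}, total probability mass on {H,T})."""
    probs = softmax(logits)
    mass = probs[h] + probs[t]
    return probs[h] / mass, mass


# Hypothetical logits: H beats T by 6.15, but "Sure" and "I" dominate both.
TOY_LOGITS = {"H": 2.0, "T": -4.15, "Sure": 12.0, "I": 11.0}

if __name__ == "__main__":
    p_h, mass = coin_readout(TOY_LOGITS)
    print(f"P(H | H,T) = {p_h:.3f}, mass on H and T = {mass:.6f}")
```

Reporting both numbers side by side makes it harder to mistake a confident ratio for a confident answer.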

Finding 6: asserted uniform is where it gets bleak

The summary asks for the harder case: more than two options, or implied probabilities. "Roll a fair six-sided die." "Pick a digit from 0 to 9." "Pick an integer uniformly in [17,56]." These assert a uniform distribution, and uniform is the one target with a known answer for the matching temperature: τ → ∞. Infinite temperature flattens any finite logits to uniform, so the "preferred temperature" here is always "hotter," and the only interesting quantity is how far from uniform the model starts. I scored each option by the full-sequence probability of emitting it (so that multi-token numbers like "17" are handled honestly, not by their first digit), then measured the entropy as a fraction of the uniform maximum.
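The scoring described above can be sketched in two steps: sum per-token log-probabilities to get a full-sequence score for each option, then compare the entropy of the resulting distribution with its uniform maximum. Renormalizing over the listed options, and the function names, are assumptions made for this sketch.

```python
import math
from typing import Dict, List


def option_distribution(token_logprobs: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each option by the sum of its per-token log-probs, then
    renormalize over the options (so "17" is scored as "1" followed by "7")."""
    seq_lp = {opt: sum(lps) for opt, lps in token_logprobs.items()}
    m = max(seq_lp.values())
    weights = {opt: math.exp(lp - m) for opt, lp in seq_lp.items()}
    total = sum(weights.values())
    return {opt: w / total for opt, w in weights.items()}


def normalized_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy divided by ln(number of options); 1.0 means uniform.

    Requires at least two options.
    """
    h = -sum(p * math.log(p) for p in probs.values() if p > 0.0)
    return h / math.log(len(probs))


if __name__ == "__main__":
    # Hypothetical per-token log-probs for three two-token options.
    toy = {"17": [-0.3, -0.2], "18": [-2.5, -1.0], "19": [-2.5, -1.5]}
    dist = option_distribution(toy)
    print(dist, normalized_entropy(dist))
```

The per-token log-probabilities would come from extra forward passes conditioned on the earlier tokens of each option; the toy values here only exercise the arithmetic.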

| Model | fair die (1–6) | digit (0–9) | int [17,56] |
|---|---|---|---|
| Qwen2.5-0.5B | 0.57 · top 1 (58%) | 0.66 · top 0 (47%) | 0.91 · top 23 (9%) |
| Qwen2.5-1.5B | 0.03 · top 4 (99%) | 0.18 · top 7 (91%) | 0.67 · top 43 (22%) |
| TinyLlama-1.1B | 0.65 · top 1 (49%) | 0.40 · top 1 (80%) | 0.67 · top 17 (37%) |
| SmolLM2-360M | 0.70 · top 1 (62%) | 0.65 · top 0 (53%) | 0.75 · top 17 (25%) |

Entropy as a fraction of uniform (1.0 = perfectly uniform), and the single favorite. Lower is more peaked.

Two things jump out. First, the better a model follows instructions, the less uniform it is. Qwen-1.5B — our one good listener — collapses a "fair die" onto 4 with 99% probability and a "random digit" onto 7 with 91%. Those are not random tokens; they are the exact numbers humans over-pick when asked for a random die roll or a digit, learned faithfully from us and then sharpened by alignment into near-determinism. Capacity did not buy uniformity. It bought a more confident human bias.

Second, the bounded range exposes pure anchoring. Told to pick uniformly in [17,56], the smaller models pile onto the literal endpoints — 17 and 56, the two numbers printed in the prompt — while Qwen-1.5B clusters around 42–44, hovering near the middle. Nobody samples the interior evenly. The asserted distribution and the emitted distribution share a support and almost nothing else, and no finite temperature repairs that, because temperature can rescale a distribution but it cannot move a mode from 17 to the middle.
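The claim that temperature rescales but never relocates the mode can be checked directly: dividing logits by τ is the same as raising each probability to the power 1/τ and renormalizing, which is a monotone map for any τ > 0. The toy distribution below is invented, shaped loosely like an endpoint-heavy answer over eight options.

```python
import math
from typing import List


def apply_temperature(probs: List[float], tau: float) -> List[float]:
    """Re-temper a distribution: p_i ** (1/tau), renormalized (tau > 0)."""
    weights = [p ** (1.0 / tau) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def argmax(xs: List[float]) -> int:
    """Index of the largest entry."""
    return max(range(len(xs)), key=xs.__getitem__)


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy as a fraction of the uniform maximum."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


# Hypothetical distribution piled onto the two ends of a range.
TOY = [0.37, 0.05, 0.03, 0.02, 0.03, 0.05, 0.10, 0.35]

if __name__ == "__main__":
    for tau in (0.5, 1.0, 2.0, 10.0, 100.0):
        q = apply_temperature(TOY, tau)
        print(f"tau={tau:6.1f}  mode index={argmax(q)}  entropy={normalized_entropy(q):.3f}")
```

Raising τ pushes the entropy toward 1 while the mode index stays where it started, so heat can hide a misplaced mode but cannot move it.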

So: how, how well, and is there a temperature?

The honest answers to the summary's two goals:

How well are textual probabilities reflected in the logits? Weakly, and conditionally. Below some capacity threshold (everything here at roughly 1B parameters or smaller) the asserted number is essentially ignored: the logits encode a fixed letter prior and the instruction-following budget, not the probability. At 1.5B a real signal appears, but only across the middle of the range, swamped at the extremes, and dominated by the spelling of the number rather than its value. The mapping from "words about chance" to log-odds is not a clean linear readout; it is a faint trend buried under typography and token priors.

Is there a preferred temperature? No — and the experiment is a fairly clean demonstration of its nonexistence. A preferred temperature requires the logit gap to be linear in the asserted log-odds. For three of four models the gap is flat, so tau* is undefined (a constant over logit(p), sign-flipping at 0.5). For the one model that tracks, tau* still ranges from −2.2 to +3.2 across the sweep. And for the multi-option uniform case the matching temperature runs off to infinity while the mode stays stubbornly on 4, on 7, on 17. You can always find a temperature that fixes one probability. There is no temperature that fixes the model.

Which is, in the end, the more interesting result. The biased coin was supposed to be a thermometer — stick it in the model, read the temperature. Instead it turned out to be a mirror: a model asked to be a fair die hands you back the most human number it knows, and asked to be 75% confident hands you back how emphatically you happened to phrase the 75%.


