The importance of watermarking for large language models (LLMs) cannot be overstated. Aaronson [1] proposed a statistics-based method to embed a watermark into LLM-generated text. He initially presented only a brief overview of the method in a set of slides, without proofs, while Fernandez et al. [2] offer a more detailed theoretical treatment using some probability tricks.

This blog post aims to explain Aaronson’s watermarking method in a straightforward, beginner-friendly manner, avoiding reliance on advanced probability techniques. Notably, we highlight that this method is essentially the same as using the Gumbel-Max Trick.

## Method

Aaronson's [1] watermarking method modifies the token-selection process of an LLM to embed an invisible trace in the generated text. During generation, a secret key $k$ is used to produce a random vector $\mathbf{r}$, where each element $r_v$ corresponds to a token $v$ in the vocabulary $V$ and is uniformly distributed in $[0, 1]$.
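As a concrete sketch, the key-derived vector might look like the following. Note that the seeding scheme here (hashing the key together with the recent context) is our own illustrative assumption; Aaronson's slides tie the randomness to the key and preceding tokens but do not pin down one construction.

```python
import hashlib
import numpy as np

def random_vector(key: bytes, context: tuple, vocab_size: int) -> np.ndarray:
    """Derive a deterministic vector r of Uniform[0, 1) values, one per token.

    The hash-based seeding below is illustrative, not Aaronson's exact scheme:
    we mix the secret key with the recent context so that the detector, which
    knows the key, can recompute the very same r at every step.
    """
    h = hashlib.sha256(key + repr(context).encode()).digest()
    seed = int.from_bytes(h[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.random(vocab_size)

r = random_vector(b"secret-key", ("the", "cat"), 8)
```

Because the vector is a deterministic function of the key and context, anyone holding the key can reproduce $\mathbf{r}$ at detection time; without the key, the values look like fresh uniform noise.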

Formally, suppose the LLM at each step produces a probability distribution $\mathbf{p} = (p_1, p_2, \ldots, p_V)$ over the vocabulary $V$ (writing $V$ for its size as well), where $p_v$ is the probability of token $v$. Aaronson's method replaces ordinary sampling with a deterministic rule: the next token $x$ is selected as

$x = \arg\max_{v \in V} \left( r_v ^{1/p_v} \right)$

This favors tokens with both a high model probability $p_v$ and a large random value $r_v$. In fact, over the randomness of $\mathbf{r}$, token $v$ is selected with probability exactly $p_v$, so the watermarked model's output distribution matches the original one — this is precisely the Gumbel-Max trick mentioned above.
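The selection rule is a one-liner; a small simulation also checks the distribution-preserving property empirically (the helper name and toy distribution below are ours):

```python
import numpy as np

def watermark_sample(p: np.ndarray, r: np.ndarray) -> int:
    """Select argmax_v r_v^(1/p_v), computed in log space for stability.

    log(r_v) / p_v has the same argmax because log is monotone increasing.
    """
    return int(np.argmax(np.log(r) / p))

# Sanity check: over fresh uniform vectors r, token v should be selected
# with probability p_v, leaving the output distribution unchanged.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
counts = np.zeros(3)
for _ in range(20000):
    counts[watermark_sample(p, rng.random(3))] += 1
print(counts / counts.sum())  # ≈ [0.5, 0.3, 0.2]
```

Working with $\log(r_v)/p_v$ rather than $r_v^{1/p_v}$ avoids underflow when $p_v$ is tiny, since $r_v^{1/p_v}$ can round to zero for every token.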

For detection, the watermark's presence is verified by computing a score $S_T$ over a sequence of $T$ tokens, where $r_{x(t)}$ denotes the random value assigned to the token $x(t)$ chosen at step $t$ (recomputed from the secret key). The score aggregates the log-transformed random values of the selected tokens:

$S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$
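The score itself is trivial to compute; the simulation below (our own toy setup, with a uniform 8-token distribution) illustrates why it separates watermarked from unwatermarked text. For text generated without the key, each $r_{x(t)}$ is an independent uniform draw, so each term $-\ln(1 - r_{x(t)})$ is Exponential(1) and $\mathbb{E}[S_T] = T$; watermarked generation favors tokens whose $r_v$ is close to 1, inflating $S_T$ well above $T$.

```python
import numpy as np

def detection_score(r_selected: np.ndarray) -> float:
    """S_T = -sum_t ln(1 - r_{x(t)}) over the r-values of the chosen tokens."""
    return float(-np.sum(np.log(1.0 - r_selected)))

T = 1000
rng = np.random.default_rng(1)

# Unwatermarked baseline: the r-values of the chosen tokens are plain
# uniforms, so the score concentrates around T = 1000.
baseline = detection_score(rng.random(T))

# Watermarked: at each step, record the r-value of the argmax token under
# a toy uniform distribution over 8 tokens.
p = np.full(8, 1.0 / 8)
r_hits = []
for _ in range(T):
    r = rng.random(8)
    r_hits.append(r[np.argmax(np.log(r) / p)])
watermarked = detection_score(np.array(r_hits))

print(baseline, watermarked)  # watermarked score is markedly larger
```

In practice the detector recomputes each $r_{x(t)}$ from the secret key and compares $S_T$ against a threshold calibrated to the Exponential(1) null distribution.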