The importance of watermarking for large language models (LLMs) cannot be overstated. Aaronson [1] proposes a statistics-based method for embedding a watermark into LLM-generated text. The method was initially presented only as a set of slides with no proofs; Fernandez et al. [2] later offer a more detailed theoretical treatment using some probability tricks.
This blog post aims to explain Aaronson’s watermarking method in a straightforward, beginner-friendly manner, avoiding reliance on advanced probability techniques. Notably, we highlight that this method is essentially the same as using the Gumbel-Max Trick.
Method
Aaronson's [1] watermarking method modifies the token selection process in large language models to embed an invisible trace into the generated text. During generation, a secret key is used to produce a random vector $r = (r_1, \ldots, r_{|V|})$, where each element $r_v$ of this vector corresponds to a token $v$ in the vocabulary $V$ and is uniformly distributed in $[0, 1]$.
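As a concrete illustration, here is a minimal sketch of how such a vector could be derived. Seeding a generator with a hash of the secret key together with the preceding tokens is one natural choice, but it is our assumption for illustration; the slides do not fix a particular scheme.

```python
import hashlib
import numpy as np

def random_vector(secret_key: bytes, context: list[int], vocab_size: int) -> np.ndarray:
    """Derive r in [0, 1)^|V| from the secret key and the preceding tokens.

    The seeding scheme (key + context hashed with SHA-256) is an
    illustrative assumption, not a detail fixed by Aaronson's slides.
    """
    digest = hashlib.sha256(secret_key + repr(context).encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.random(vocab_size)  # one uniform value per vocabulary token
```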
Formally, given a large language model (LLM) that at each step generates a probability distribution $p = (p_1, \ldots, p_{|V|})$ over the vocabulary $V$, where $p_v$ is the probability of token $v$, Aaronson's method adjusts the selection process. The next token is selected as:

$$x = \arg\max_{v \in V} \; r_v^{1/p_v}.$$
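In code, this rule is a one-liner once we move to log space, since $\arg\max_v r_v^{1/p_v} = \arg\max_v \log(r_v)/p_v$ and working with logs avoids underflow for small $p_v$. A sketch:

```python
import numpy as np

def watermarked_select(p: np.ndarray, r: np.ndarray) -> int:
    """Pick the next token as argmax_v r_v^(1/p_v), computed in log space."""
    eps = 1e-12  # guards against division by zero; tokens with p_v = 0 are never picked
    return int(np.argmax(np.log(np.clip(r, eps, 1.0)) / np.clip(p, eps, None)))
```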
This favors tokens that combine a high model probability $p_v$ with a large random value $r_v$. Crucially, over the randomness of $r$, token $v$ is chosen with probability exactly $p_v$, so the watermark does not change the model's output distribution.
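It is easy to check this empirically, and doing so also makes the Gumbel-Max connection visible: transforming the same uniform values via $g_v = -\ln(-\ln r_v)$ yields Gumbel noise, and $\arg\max_v (\log p_v + g_v)$ picks the same token as the rule above. A small simulation (with a toy distribution of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # toy distribution (illustrative)
counts, agree, trials = np.zeros_like(p), 0, 100_000

for _ in range(trials):
    r = rng.random(p.size)                   # fresh uniform vector each draw
    pick = np.argmax(np.log(r) / p)          # argmax_v r_v^(1/p_v)
    g = -np.log(-np.log(r))                  # Gumbel(0, 1) noise from the same r
    agree += pick == np.argmax(np.log(p) + g)
    counts[pick] += 1

print(counts / trials)  # ~ [0.5, 0.3, 0.2]: matches p
print(agree / trials)   # ~ 1.0: identical to the Gumbel-Max selection
```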
For detection, the watermark's presence is verified by computing a score for a sequence of tokens $x_1, \ldots, x_T$. This score aggregates the log-transformed random values of the selected tokens:

$$S = \sum_{t=1}^{T} \ln \frac{1}{1 - r_{t, x_t}},$$

where $r_{t, x_t}$ is the random value assigned to the token $x_t$ chosen at step $t$. For text generated without the watermark, each term $-\ln(1 - r_{t, x_t})$ is an independent Exponential(1) variable (since $-\ln(1 - U) \sim \mathrm{Exp}(1)$ for uniform $U$), so $S$ concentrates around $T$; the watermarked selection rule biases the chosen $r$ values toward 1, inflating $S$ and making the watermark statistically detectable.
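A minimal sketch of the detector side, assuming it can recompute each step's random values from the secret key (e.g., with the same hypothetical random_vector above) and look up the value of each observed token:

```python
import numpy as np
from scipy.stats import gamma

def detection_score(r_selected: np.ndarray) -> float:
    """S = sum_t ln(1 / (1 - r_t)) over the observed tokens' random values."""
    return float(np.sum(-np.log(1.0 - r_selected)))

def watermark_p_value(score: float, num_tokens: int) -> float:
    """Probability of a score this large without a watermark: S ~ Gamma(T, 1)."""
    return float(gamma.sf(score, a=num_tokens))
```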