
Theory Proofs for Aaronson's LLM Watermark Method

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

The importance of watermarking for large language models (LLMs) cannot be overstated. Aaronson et al. [1] propose a statistics-based method to embed a watermark into LLM-generated text. They initially provided only a brief overview of the method in a set of slides, with no proofs, while Fernandez et al. [2] offer a more detailed theoretical treatment that relies on some probability tricks.

This blog post aims to explain Aaronson’s watermarking method in a straightforward, beginner-friendly manner, avoiding reliance on advanced probability techniques. Notably, we highlight that this method is essentially the same as using the Gumbel-Max Trick.

Method

Aaronson's watermarking method [1] modifies the token selection process in large language models to embed an invisible trace into the generated text. During generation, a secret key $k$ is used to produce a random vector $\mathbf{r}$, where each element $r_v$ of this vector corresponds to a token $v$ in the vocabulary $V$, and $r_v$ is uniformly distributed in $[0, 1]$.

Formally, given a large language model (LLM) that at each step generates a probability distribution $\mathbf{p} = (p_1, p_2, \ldots, p_V)$ over the vocabulary $V$, where $p_v$ is the probability of token $v$, Aaronson's method adjusts the selection process. The next token $x$ is selected as:

$$x = \arg\max_{v \in V} \left( r_v^{1/p_v} \right)$$

This ensures that tokens with both high original probabilities and favorable random values $r_v$ are chosen more frequently.

For detection, the watermark's presence is verified by computing a score $S_T$ for a sequence of $T$ tokens. This score aggregates the log-transformed random values of the selected tokens:

$$S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$$

where $r_{x(t)}$ is the random value corresponding to the $t$-th token $x(t)$. A threshold is used to determine whether the score $S_T$ is significantly higher than expected under the null hypothesis $H_0$ (no watermark). If $S_T$ exceeds this predefined threshold, the text is flagged as watermarked. This method maintains the original probability distribution on average, making the watermark imperceptible while allowing robust detection using the secret key.
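
To make the mechanics concrete, here is a minimal Python sketch of the selection rule above. Aaronson's actual key/hashing construction is not specified in this post, so the function name `watermarked_sample`, its seeding scheme, and the context window used below are illustrative assumptions, not the original implementation.

```python
import hashlib
import numpy as np

def watermarked_sample(probs, context_ids, secret_key):
    """Select the next token as argmax_v r_v^(1/p_v).

    probs: the model's next-token distribution p over the vocabulary.
    The vector r is pseudo-random but reproducible: it is derived from the
    secret key and the recent context, so a detector holding the same key
    can regenerate it. (The hashing scheme below is an illustrative choice.)
    """
    seed_material = f"{secret_key}|{tuple(context_ids)}".encode()
    seed = int.from_bytes(hashlib.sha256(seed_material).digest()[:8], "big")
    rng = np.random.default_rng(seed)

    r = rng.random(len(probs))                      # r_v ~ U(0, 1)
    # argmax r_v^(1/p_v) is computed in log space for numerical stability:
    # log(r_v) / p_v is a monotone transform, so the argmax is unchanged.
    scores = np.log(r) / np.maximum(probs, 1e-12)   # guard against p_v = 0
    return int(np.argmax(scores))

# Example on a toy 5-token vocabulary
probs = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
token = watermarked_sample(probs, context_ids=[17, 42], secret_key="my-key")
```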

Proof Sketch

Generating

When generating the tokens, we alter the scores from $\mathbf{p} = (p_1, \ldots, p_V)$ to $(R_1^{1/p_1}, R_2^{1/p_2}, \ldots, R_V^{1/p_V})$, where $R_i \sim U(0,1)$. The first result we need (Proposition 2) is that $\mathbb{P}(V^{\star} = v) = p_v$, where $V^{\star} = \arg\max_v R_v^{1/p_v}$. This means that selecting the token that maximizes the new scores ($R_v^{1/p_v}$) is equivalent to sampling the token from the original probability distribution ($p_v$). This is an important property: it transforms the sampling operation into a maximization operation. In fact, this is exactly the Gumbel-Max Trick, and we show the equivalence in Proposition 1.

Detecting

When detecting the watermark, we first regenerate the same uniform random vector $\mathbf{r}$ (based on the secret key and the same context words) and look up the entry $r_{x(t)}$, where $x(t)$ is the ID of the $t$-th token of the text. For non-watermarked text, the tokens are not tied to the maximal entries of the vector, so $r_{x(t)}$ is simply a uniform random variable. However, if the text is watermarked, we can prove that $r_{x(t)}$ follows a Beta distribution (Corollary 3). Thus, non-watermarked text yields a uniform distribution, whereas watermarked text yields a Beta distribution for the random variable $r_{x(t)}$. We can test which distribution $r_{x(t)}$ follows: if it follows a uniform distribution, the text is non-watermarked; if it follows a Beta distribution, the text is watermarked.

The next question is how to differentiate between the uniform distribution and the Beta distribution. This is where the score $S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$ comes into play. Given $T$ tokens, for the uniform distribution we can prove that the expectation of the score is exactly $T$ (Proposition 4). Meanwhile, for the Beta distribution we can prove (Proposition 5) that the expectation of the score is lower-bounded:

$$\mathbb{E}[S_T] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln \frac{1}{1 - r} \, dr \geq T + \left( \frac{\pi^2}{6} - 1 \right) H_T,$$

where $H_T = -\sum_{t=1}^{T} p_t \ln(p_t)$. This means the expected score for watermarked text is higher than that for non-watermarked text. We can therefore set a threshold: if the score exceeds this threshold, the text is flagged as following the Beta distribution, i.e., as watermarked.

To summarize:

  1. Calculate the Score: Compute the score $S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$ for the given text over $T$ tokens.

  2. Expectation for Uniform Distribution: For non-watermarked text (uniform distribution), the expected score is $T$.

  3. Expectation for Beta Distribution: For watermarked text (Beta distribution), the expected score is higher than $T$, as bounded by the integral expression above.

  4. Set a Threshold: Establish a threshold value slightly higher than $T$ to distinguish between the two distributions.

  5. Compare Scores:

    • If $S_T$ is approximately $T$ or lower, the text is likely non-watermarked.
    • If $S_T$ is significantly higher than $T$, the text is likely watermarked.

This method allows for effective differentiation between texts following a uniform distribution and those following a Beta distribution.
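
As a companion sketch to the generation code above, the detection side only needs the reproduced $r_{x(t)}$ values and the score $S_T$. The helper below is one simple way to implement the threshold comparison: under $H_0$ each term $-\ln(1 - r)$ is an Exp(1) random variable, so $S_T$ follows a Gamma$(T, 1)$ distribution, whose tail can be used to turn the score into a p-value. The function names and the cutoff `alpha` are illustrative choices, not part of the original method description.

```python
import numpy as np
from scipy.stats import gamma

def detection_score(r_values):
    """S_T = -sum_t ln(1 - r_{x(t)}), where r_{x(t)} is the reproduced
    random value of the t-th observed token."""
    r = np.asarray(r_values, dtype=float)
    return float(-np.log(1.0 - r).sum())

def is_watermarked(r_values, alpha=1e-3):
    """Flag the text if S_T is significantly above its H0 expectation T.
    Under H0, S_T ~ Gamma(T, 1), so we threshold on the Gamma tail."""
    T = len(r_values)
    s = detection_score(r_values)
    p_value = gamma.sf(s, a=T)          # P(S_T >= s) under H0
    return s, p_value, p_value < alpha
```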

Equivalence to the Gumbel-Max Trick

Recalling the Gumbel-Max Trick

The Gumbel-Max Trick is a method for sampling from a categorical distribution using the Gumbel distribution. Here’s a step-by-step explanation of how it works:

  1. Generate Gumbel Noise: For each category $i$ in the distribution, generate a random value $g_i$ from the Gumbel distribution. The Gumbel distribution can be sampled using the following formula:

    $$g_i = -\log(-\log(U_i))$$

    where $U_i$ is a uniform random variable between 0 and 1.

  2. Combine with Log-Probabilities: Add the log-probability of each category to the corresponding Gumbel noise. If $p_i$ is the probability of category $i$, compute:

    $$y_i = \log(p_i) + g_i$$

  3. Select the Maximum: Choose the category with the highest value of $y_i$:

    $$\text{selected category} = \arg\max_i y_i$$

This method leverages the properties of the Gumbel distribution to efficiently sample from a categorical distribution. The addition of Gumbel noise transforms the problem of sampling into finding the maximum value, which is computationally straightforward.

The Gumbel-Max Trick is particularly useful in settings where sampling needs to be performed repeatedly or efficiently, such as in machine learning algorithms and Monte Carlo simulations.
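
The following small numpy experiment (with illustrative numbers) checks the recipe above numerically: adding Gumbel noise to log-probabilities and taking the argmax reproduces the original categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                  # a toy categorical distribution
n = 200_000

U = rng.random((n, len(p)))
g = -np.log(-np.log(U))                        # Gumbel noise g_i = -log(-log(U_i))
samples = np.argmax(np.log(p) + g, axis=1)     # argmax_i (log p_i + g_i)

freq = np.bincount(samples, minlength=len(p)) / n
print(freq)                                    # close to [0.5, 0.3, 0.2]
```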

Proof of the Equivalence to the Gumbel-Max Trick

Proposition 1

Consider a discrete distribution $\mathbf{p} = (p_1, \ldots, p_V)$ and $V$ random variables $\mathbf{R} = (R_1, \ldots, R_V)$ such that the $R_v$ are i.i.d. with $R_v \sim \mathcal{U}_{[0,1]}$.
Let $V^{\star} = \arg\max_v R_v^{1/p_v}$ and $G^{\star} = \arg\max_v \left(\log(p_v) + g_v\right)$, where $g_v = -\log(-\log(R_v))$. Then

$$V^{\star} = G^{\star}$$

Proof

$$\begin{aligned} \arg \max_i G_i &= \arg \max_i \left(\log(p_i) + g_i\right) \\ &= \arg \max_i \left(\log(p_i) - \log(-\log(R_i))\right) \\ &= \arg \max_i \exp \left(\log(p_i) - \log(-\log(R_i))\right) \\ &= \arg \max_i \left(\exp(\log(p_i)) \cdot \exp(-\log(-\log(R_i)))\right) \\ &= \arg \max_i \left(p_i \cdot \frac{1}{-\log(R_i)}\right) \\ &= \arg \min_i \left(\frac{-\log(R_i)}{p_i}\right) \\ &= \arg \max_i \left(\frac{\log(R_i)}{p_i}\right) \\ &= \arg \max_i \log\left(R_i^{1/p_i}\right) \\ &= \arg \max_i R_i^{1/p_i} \end{aligned}$$

Therefore,

$$\arg \max_v \left(\log(p_v) + g_v\right) = \arg \max_v R_v^{1/p_v},$$

and thus the proposition is proved:

$$V^{\star} = G^{\star}$$

Detailed Proofs

Proposition 2

Consider a discrete distribution $\mathbf{p} = (p_1, \ldots, p_V)$ and $V$ random variables $\mathbf{R} = (R_1, \ldots, R_V)$ such that the $R_v$ are i.i.d. with $R_v \sim \mathcal{U}_{[0,1]}$.
Let $V^{\star} = \arg\max_v R_v^{1/p_v}$. Then $\mathbb{P}(V^{\star} = v) = p_v$.

Proof
To prove this proposition, we need to show that the probability of $V^{\star}$ being equal to a specific $v$ is exactly $p_v$.

  1. Define Xv=Rv1/pvX_v = R_v^{1/p_v}. Since RvU[0,1]R_v \sim \mathcal{U}_{[0,1]}, the cumulative distribution function (CDF) of RvR_v is FRv(r)=rF_{R_v}(r) = r for r[0,1]r \in [0,1].

  2. The CDF of XvX_v can be derived as follows:

    FXv(x)=P(Xvx)=P(Rv1/pvx)F_{X_v}(x) = \mathbb{P}(X_v \leq x) = \mathbb{P}(R_v^{1/p_v} \leq x)

    =P(Rvxpv)=FRv(xpv)=xpv for x[0,1]= \mathbb{P}(R_v \leq x^{p_v}) = F_{R_v}(x^{p_v}) = x^{p_v} \text{ for } x \in [0,1]

  3. The probability density function (PDF) of XvX_v is then:

    fXv(x)=ddxFXv(x)=pvxpv1 for x[0,1]f_{X_v}(x) = \frac{d}{dx} F_{X_v}(x) = p_v x^{p_v-1} \text{ for } x \in [0,1]

  4. Consider V=argmaxvRv1/pv=argmaxvXvV^{\star} = \arg \max_v R_v^{1/p_v} = \arg \max_v X_v. We need to find P(V=v)\mathbb{P}(V^{\star} = v).

  5. The event $V^{\star} = v$ means that $X_v$ is the largest among all $X_i$ for $i = 1, \ldots, V$. Since the $X_i$ are independent, we can condition on the value of $X_v$ and multiply the probabilities that every other $X_i$ falls below it:

    $$\mathbb{P}(V^{\star} = v) = \int_0^1 f_{X_v}(x) \prod_{i \neq v} \mathbb{P}(X_i < x) \, dx = \int_0^1 p_v x^{p_v - 1} \prod_{i \neq v} x^{p_i} \, dx$$

  6. Collecting the exponents and using $\sum_{i=1}^{V} p_i = 1$, we get $p_v - 1 + \sum_{i \neq v} p_i = 0$, so the integrand reduces to the constant $p_v$:

    $$\mathbb{P}(V^{\star} = v) = p_v \int_0^1 x^{0} \, dx = p_v$$

Thus, we have proved that $\mathbb{P}(V^{\star} = v) = p_v$, as required.
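
A quick Monte Carlo check of Proposition 2 (a sketch with illustrative numbers): drawing uniform $R_v$, taking $\arg\max_v R_v^{1/p_v}$ (computed as $\arg\max_v \log(R_v)/p_v$, as in Proposition 1), and counting how often each index wins reproduces the original probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])
n = 200_000

R = rng.random((n, len(p)))
V_star = np.argmax(np.log(R) / p, axis=1)      # argmax_v R_v^(1/p_v) in log space

freq = np.bincount(V_star, minlength=len(p)) / n
print(freq)                                    # close to [0.6, 0.25, 0.1, 0.05]
```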

Corollary 3

$$R_{V^{\star}} \sim \text{Beta}\left(\frac{1}{p_{V^{\star}}}, 1\right)$$

Proof
To prove this corollary, we need to show that the random variable $R_{V^{\star}}$, where $V^{\star} = \arg\max_v R_v^{1/p_v}$, follows a Beta distribution with parameters $\left(\frac{1}{p_{V^{\star}}}, 1\right)$; that is, conditionally on $V^{\star} = v$, the value $R_v$ has CDF $r^{1/p_v}$.

  1. Let $X_v = R_v^{1/p_v}$. We derived in the proof of Proposition 2 that the CDF of $X_v$ is $F_{X_v}(x) = x^{p_v}$ for $x \in [0,1]$.

  2. For $r \in [0,1]$, the joint probability that $R_v \leq r$ and that token $v$ wins the maximization is

    $$\mathbb{P}(R_v \leq r, V^{\star} = v) = \int_0^r \prod_{i \neq v} \mathbb{P}\left(X_i < s^{1/p_v}\right) ds = \int_0^r \prod_{i \neq v} s^{p_i/p_v} \, ds = \int_0^r s^{(1 - p_v)/p_v} \, ds = p_v \, r^{1/p_v}$$

    where we used $\sum_{i \neq v} p_i = 1 - p_v$.

  3. Dividing by $\mathbb{P}(V^{\star} = v) = p_v$ (Proposition 2) gives

    $$\mathbb{P}(R_v \leq r \mid V^{\star} = v) = r^{1/p_v},$$

    which is exactly the CDF of the $\text{Beta}\left(\frac{1}{p_v}, 1\right)$ distribution, whose density is $\frac{1}{p_v} r^{1/p_v - 1}$ on $[0,1]$.

Thus, the corollary is proved: $R_{V^{\star}} \sim \text{Beta}\left(\frac{1}{p_{V^{\star}}}, 1\right)$.
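
This conditional law can also be verified empirically. The sketch below (illustrative parameters) keeps the draws where token $v$ wins the maximization and compares the empirical distribution of $R_v$ against the Beta$(1/p_v, 1)$ CDF $r^{1/p_v}$ with a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])
n = 500_000
v = 0                                          # token whose conditional law we test

R = rng.random((n, len(p)))
V_star = np.argmax(np.log(R) / p, axis=1)
r_winner = R[V_star == v, v]                   # R_v conditioned on V* = v

# Beta(1/p_v, 1) has CDF F(r) = r^(1/p_v)
stat, pval = kstest(r_winner, lambda r: r ** (1.0 / p[v]))
print(stat, pval)                              # small KS statistic => consistent with Beta(1/p_v, 1)
```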

Proposition 4

$$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T \int_0^1 \ln \frac{1}{1-r} \, dr = T$$

where the $r_t$ are i.i.d. random variables uniformly distributed over $(0,1)$.

Proof
We proceed as follows:

  1. Expectation of the Sum: By the linearity of expectation,

    $$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = \sum_{t=1}^{T} \mathbb{E}\left[\ln \frac{1}{1-r_t}\right]$$

    Since the $r_t$ are i.i.d., the expectation of $\ln \frac{1}{1-r_t}$ is the same for each $t$, so we have:

    $$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T\, \mathbb{E}\left[\ln \frac{1}{1-r}\right]$$

  2. Expectation of $\ln \frac{1}{1-r}$: We now need to calculate $\mathbb{E}\left[\ln \frac{1}{1-r}\right]$ where $r \sim \mathcal{U}(0,1)$. The expectation is given by:

    $$\mathbb{E}\left[\ln \frac{1}{1-r}\right] = \int_0^1 \ln \frac{1}{1-r}\, f(r) \, dr$$

    Since $r$ is uniformly distributed over $(0,1)$, its probability density function is $f(r) = 1$ for $r \in (0,1)$. Therefore, the integral simplifies to:

    $$\mathbb{E}\left[\ln \frac{1}{1-r}\right] = \int_0^1 \ln \frac{1}{1-r} \, dr$$

  3. Evaluating the Integral: To evaluate the integral $\int_0^1 \ln \frac{1}{1-r} \, dr$, we can use a substitution. Let $u = 1 - r$. Then $du = -dr$, and the limits of integration change from $r = 0$ to $r = 1$ into $u = 1$ to $u = 0$. The integral becomes:

    $$\int_0^1 \ln \frac{1}{1-r} \, dr = \int_1^0 \ln \frac{1}{u} \, (-du) = \int_0^1 \ln \frac{1}{u} \, du$$

    Simplifying further:

    $$\int_0^1 \ln \frac{1}{u} \, du = \int_0^1 -\ln u \, du$$

    The integral of $-\ln u$ can be evaluated using integration by parts. Let $v = -\ln u$ and $dw = du$. Then $dv = -\frac{1}{u}\, du$ and $w = u$. Integration by parts gives:

    $$\int v \, dw = vw \bigg|_0^1 - \int w \, dv$$

    Substituting $v$ and $w$:

    $$\int_0^1 -\ln u \, du = \left[-u \ln u\right]_0^1 - \int_0^1 u \left(-\frac{1}{u}\right) du = \left[-u \ln u\right]_0^1 + \int_0^1 du$$

    Evaluating these terms (using $\lim_{u \to 0^+} u \ln u = 0$):

    $$\left[-u \ln u\right]_0^1 = \left(-1 \cdot \ln 1\right) - \left(-0 \cdot \ln 0\right) = 0 - 0 = 0$$

    $$\int_0^1 du = u \bigg|_0^1 = 1 - 0 = 1$$

    Combining these results, we get:

    $$\int_0^1 -\ln u \, du = 0 + 1 = 1$$

    Thus:

    $$\mathbb{E}\left[\ln \frac{1}{1-r}\right] = \int_0^1 \ln \frac{1}{1-r} \, dr = 1$$

  4. Conclusion: Therefore,

    $$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T \cdot 1 = T$$

Thus, we have proved that

$$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T$$
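
A short simulation (illustrative sizes) confirms this: with uniform $r_t$, the average score over many simulated sequences concentrates around $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 100, 20_000

r = rng.random((n_runs, T))                    # r_t ~ U(0, 1): no watermark
scores = -np.log(1.0 - r).sum(axis=1)          # S_T for each simulated sequence
print(scores.mean())                           # close to T = 100
```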

Proposition 5

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln \frac{1}{1 - r} \, dr \geq T + \left( \frac{\pi^2}{6} - 1 \right) H_T$$

where $r_t \sim \text{Beta}\left(\frac{1}{p_t}, 1\right)$, $p_t$ is the model probability of the token selected at step $t$, and $H_T = -\sum_{t=1}^{T} p_t \ln(p_t)$.

Proof:

We start with the expression:

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right]$$

Writing out the expectation with the $\text{Beta}\left(\frac{1}{p_t}, 1\right)$ density $f(r) = \frac{1}{p_t} r^{1/p_t - 1}$ on $[0,1]$, we have:

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] = -\sum_{t=1}^{T} \int_0^1 \frac{1}{p_t} r^{1/p_t - 1} \ln(1 - r) \, dr$$

Next, we use integration by parts with $u = 1 - r^{1/p_t}$ and $v = \ln(1 - r)$:

$$-\int_0^1 \frac{1}{p_t} r^{1/p_t - 1} \ln(1 - r) \, dr = \int_0^1 \frac{1 - r^{1/p_t}}{1 - r} \, dr = \mathcal{H}_{1/p_t}$$

where $\mathcal{H}_z$ is the generalized harmonic number defined as $\mathcal{H}_z = \sum_{n=1}^{\infty} \left(\frac{1}{n} - \frac{1}{n+z}\right)$. Therefore, we have:

$$-\int_0^1 \frac{1}{p_t} r^{1/p_t - 1} \ln(1 - r) \, dr = \sum_{n=1}^{\infty} \left(\frac{1}{n} - \frac{1}{n + 1/p_t}\right) = 1 + \sum_{n=1}^{\infty} \left( \frac{1}{n + 1} - \frac{1}{n + 1/p_t} \right)$$

For all $n \in \mathbb{N}^\star$, we have:

$$\begin{aligned} (n + 1)^2 \left( \frac{1}{n + 1} - \frac{1}{n + 1/p_t} \right) &= \frac{(n + 1)(n + 1/p_t) - (n + 1)^2}{n + 1/p_t} \\ &= \frac{n + 1}{n + 1/p_t} \left(\frac{1}{p_t} - 1\right) \\ &\geq -\frac{n + 1}{n + 1/p_t} \ln(p_t) \\ &\geq -p_t \ln(p_t) \end{aligned}$$

where the first inequality uses $\frac{1}{p_t} - 1 \geq \ln \frac{1}{p_t} = -\ln(p_t)$, and the second uses $\frac{n + 1}{n + 1/p_t} \geq p_t$ (which holds since $p_t \leq 1$) together with $-\ln(p_t) \geq 0$.

Rearranging, $\frac{1}{n + 1} - \frac{1}{n + 1/p_t} \geq \frac{-p_t \ln(p_t)}{(n + 1)^2}$. Therefore, by summing over $n$ and then over all $t \in [1, T]$,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] \geq T + \left( \sum_{n=1}^{\infty} \frac{1}{(n + 1)^2} \right) \left( \sum_{t=1}^{T} - p_t \ln(p_t) \right) = T + \left( \frac{\pi^2}{6} - 1 \right) H_T$$

where we used $\sum_{n=1}^{\infty} \frac{1}{(n+1)^2} = \frac{\pi^2}{6} - 1$ and $H_T = -\sum_{t=1}^{T} p_t \ln(p_t)$.

This completes the proof that:

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln \frac{1}{1 - r} \, dr \geq T + \left( \frac{\pi^2}{6} - 1 \right) H_T$$
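
The bound can also be checked numerically. The sketch below (with arbitrary, illustrative $p_t$ values) samples $r_t \sim \text{Beta}(1/p_t, 1)$ via the inverse CDF $F^{-1}(u) = u^{p_t}$ and compares the average score against $T + (\pi^2/6 - 1) H_T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 100, 20_000

p_t = rng.uniform(0.05, 0.9, size=T)           # probabilities of the selected tokens (illustrative)

u = rng.random((n_runs, T))
r = u ** p_t                                   # r_t ~ Beta(1/p_t, 1) via inverse CDF u^(p_t)

scores = -np.log(1.0 - r).sum(axis=1)          # watermarked score S_T
H_T = float(-(p_t * np.log(p_t)).sum())
lower_bound = T + (np.pi ** 2 / 6 - 1) * H_T

print(scores.mean(), ">=", lower_bound)        # empirical mean exceeds the bound
```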

References

[1] Scott Aaronson, Hendrik Kirchner. Watermarking GPT Outputs. Slides, 2022.
[2] Fernandez P, Chaffin A, Tit K, et al. Three Bricks to Consolidate Watermarks for Large Language Models. 2023 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2023: 1-6.
[3] Kirchenbauer J, Geiping J, Wen Y, et al. A Watermark for Large Language Models. International Conference on Machine Learning. PMLR, 2023: 17061-17084.