By Z.H. Fu
https://fuzihaofzh.github.io/blog/
The importance of watermarking for large language models (LLMs) cannot be overstated. Aaronson and Kirchner [1] propose a statistics-based method to embed a watermark into LLM-generated text. They initially presented only a brief overview of the method in a set of slides, without proofs, while Fernandez et al. [2] offer a more detailed theoretical proof that relies on some probability tricks.
This blog post aims to explain Aaronson’s watermarking method in a straightforward, beginner-friendly manner, avoiding reliance on advanced probability techniques. Notably, we highlight that this method is essentially the same as using the Gumbel-Max Trick.
Method
Aaronson's watermarking method [1] modifies the token selection process in large language models to embed an invisible trace into the generated text. During generation, a secret key $k$ is used to produce a random vector $r$, where each element $r_v$ of this vector corresponds to a token $v$ in the vocabulary $V$, and $r_v$ is uniformly distributed in $[0,1]$.
Formally, given a large language model (LLM) that at each step generates a probability distribution $p = (p_1, p_2, \ldots, p_V)$ over the vocabulary $V$, where $p_v$ is the probability of token $v$, Aaronson's method adjusts the selection process. The next token $x$ is selected as:
$$x = \arg\max_{v \in V} \left( r_v^{1/p_v} \right)$$
This ensures that tokens with both high original probabilities and favorable random values $r_v$ are chosen more frequently.
For detection, the watermark's presence is verified by computing a score $S_T$ for a sequence of $T$ tokens. This score aggregates the log-transformed random values of the selected tokens:
$$S_T = -\sum_{t=1}^{T} \ln\left(1 - r_{x^{(t)}}\right)$$
where $r_{x^{(t)}}$ is the random value corresponding to the $t$-th token $x^{(t)}$. A threshold is used to determine whether the score $S_T$ is significantly higher than expected under the null hypothesis $H_0$ (no watermark). If $S_T$ exceeds this predefined threshold, the text is flagged as watermarked. This method maintains the original probability distribution on average, making the watermark imperceptible while allowing robust detection using the secret key.
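To make the selection rule concrete, here is a minimal NumPy sketch of one generation step and the score computation. This is only an illustration, not Aaronson's actual implementation: the pseudorandom function keyed by $k$ and the context is stood in for by a seeded `numpy` generator, and all probabilities are assumed strictly positive.

```python
import numpy as np

def watermarked_sample(p, rng):
    """One generation step of the watermarking scheme (a sketch).

    p   : array of token probabilities over the vocabulary (all > 0, sums to 1).
    rng : a numpy Generator, standing in for the pseudorandom function
          keyed by the secret key k and the current context.
    """
    r = rng.random(len(p))                  # r_v ~ U[0,1] for every token v
    return int(np.argmax(r ** (1.0 / p)))   # x = argmax_v r_v^(1/p_v)

def score(r_selected):
    """Detection score S_T = -sum_t ln(1 - r_{x^(t)})."""
    return -np.log(1.0 - np.asarray(r_selected)).sum()
```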
Proof Sketch
Generating
When generating tokens, we alter the distribution from $p = (p_1, \ldots, p_V)$ to $(R_1^{1/p_1}, R_2^{1/p_2}, \ldots, R_V^{1/p_V})$, where the $R_i$ are i.i.d. with $R_i \sim U(0,1)$. The first proposition we need to prove (Proposition 2) is that $P(V^\star = v) = p_v$, where $V^\star = \arg\max_v R_v^{1/p_v}$. This means that selecting the token that maximizes the new scores $R_v^{1/p_v}$ is equivalent to sampling the token from the original probability distribution $p$. This is an important property: it transforms the sampling operation into a maximization operation. In fact, this is exactly the Gumbel-Max Trick, and we show the equivalence in Proposition 1.
Detecting
When detecting the watermark, we first regenerate the same uniform random vector $r$ (based on a fixed random seed and the same context words) and look up $r_{x^{(t)}}$, where $x^{(t)}$ is the ID of the current token. For non-watermarked text, the token is not the one associated with the maximal score, so $r_{x^{(t)}}$ is simply a uniform random variable. However, if the text is watermarked, we can prove that $r_{x^{(t)}}$ follows a Beta distribution (Corollary 3). Thus, non-watermarked text yields a uniform distribution, whereas watermarked text yields a Beta distribution for the random variable $r_{x^{(t)}}$. We can test which distribution $r_{x^{(t)}}$ follows: if it follows a uniform distribution, the text is non-watermarked; if it follows a Beta distribution, the text is watermarked.
The next question is how to differentiate between the uniform distribution and the Beta distribution. This is where the score $S_T = -\sum_{t=1}^{T}\ln(1 - r_{x^{(t)}})$ comes into play. Given $T$ tokens, for the uniform distribution we can prove that the expectation of the score is $T$ (Proposition 4). Meanwhile, for the Beta distribution, we can prove (Proposition 5) that the expectation of the score equals
$$\sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln\frac{1}{1-r}\, dr \;\ge\; T + \left(\frac{\pi^2}{6} - 1\right) H_T,$$
where $H_T = -\sum_{t=1}^{T} p_t \ln p_t$, which means the score will be higher than for the uniform distribution. We can set a threshold: if the score is higher than this threshold, it indicates a Beta distribution.
To summarize:
- Calculate the Score: Compute the score $S_T = -\sum_{t=1}^{T}\ln(1 - r_{x^{(t)}})$ for the given text over $T$ tokens.
- Expectation for Uniform Distribution: For non-watermarked text (uniform distribution), the expected score is $T$.
- Expectation for Beta Distribution: For watermarked text (Beta distribution), the expected score is higher than $T$, lower-bounded by the integral expression above.
- Set a Threshold: Establish a threshold value slightly higher than $T$ to distinguish between the two distributions.
- Compare Scores:
  - If $S_T$ is approximately $T$ or lower, the text is likely non-watermarked.
  - If $S_T$ is significantly higher than $T$, the text is likely watermarked.

This method allows for effective differentiation between texts following a uniform distribution and those following a Beta distribution.
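As a sketch under the same assumptions as the generation snippet above, the whole detection procedure fits in a few lines. The `threshold_factor` below is a hypothetical illustrative parameter; a real detector would calibrate the threshold from a target false-positive rate (under $H_0$, each term $-\ln(1-r)$ is Exponential(1), so $S_T$ follows a Gamma$(T,1)$ distribution).

```python
import numpy as np

def is_watermarked(r_selected, threshold_factor=1.2):
    """Flag text as watermarked if S_T clearly exceeds its null mean T.

    r_selected       : the values r_{x^(t)} recomputed from the secret key
                       and context for each of the T observed tokens.
    threshold_factor : illustrative margin above T (hypothetical choice).
    """
    r_selected = np.asarray(r_selected)
    s_t = -np.log(1.0 - r_selected).sum()      # the score S_T
    return s_t > threshold_factor * len(r_selected)
```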
Equivalence to the Gumbel-Max Trick
Recall the Gumbel-Max Trick
The Gumbel-Max Trick is a method for sampling from a categorical distribution using the Gumbel distribution. Here’s a step-by-step explanation of how it works:
- Generate Gumbel Noise: For each category $i$ in the distribution, generate a random value $g_i$ from the Gumbel distribution. The Gumbel distribution can be sampled using the following formula:
$$g_i = -\log(-\log(U_i))$$
where $U_i$ is a uniform random variable between 0 and 1.
- Combine with Log-Probabilities: Add the log-probability of each category to the corresponding Gumbel noise. If $p_i$ is the probability of category $i$, compute:
$$y_i = \log(p_i) + g_i$$
- Select the Maximum: Choose the category with the highest value of $y_i$:
$$\text{selected category} = \arg\max_i y_i$$
This method leverages the properties of the Gumbel distribution to efficiently sample from a categorical distribution. The addition of Gumbel noise transforms the problem of sampling into finding the maximum value, which is computationally straightforward.
The Gumbel-Max Trick is particularly useful in settings where sampling needs to be performed repeatedly or efficiently, such as in machine learning algorithms and Monte Carlo simulations.
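Here is a minimal NumPy sketch of these three steps; the distribution `p` is an arbitrary example:

```python
import numpy as np

def gumbel_max_sample(p, rng):
    """Draw one category from the distribution p via the Gumbel-Max Trick."""
    u = rng.random(len(p))                  # U_i ~ U(0,1)
    g = -np.log(-np.log(u))                 # g_i ~ Gumbel(0,1)
    return int(np.argmax(np.log(p) + g))    # argmax_i (log p_i + g_i)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
draws = [gumbel_max_sample(p, rng) for _ in range(100_000)]
print(np.bincount(draws) / len(draws))      # empirical freq. ~ [0.5, 0.3, 0.2]
```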
Proof of the Equivalence to the Gumbel-Max Trick
Proposition 1
Consider a discrete distribution $p = (p_1, \ldots, p_V)$ and $V$ random variables $R = (R_1, \ldots, R_V)$ such that the $R_v$ are i.i.d. with $R_v \sim U[0,1]$.
Let $V^\star = \arg\max_v R_v^{1/p_v}$ and $G^\star = \arg\max_i \left(\log(p_i) + g_i\right)$, where $g_i = -\log(-\log(R_i))$. Then
$$V^\star = G^\star$$
Proof
$$
\begin{aligned}
\arg\max_i G_i &= \arg\max_i \left(\log(p_i) + g_i\right) \\
&= \arg\max_i \left(\log(p_i) - \log(-\log(R_i))\right) \\
&= \arg\max_i \exp\left(\log(p_i) - \log(-\log(R_i))\right) \\
&= \arg\max_i \left(\exp(\log(p_i)) \cdot \exp(-\log(-\log(R_i)))\right) \\
&= \arg\max_i \left(p_i \cdot \frac{1}{-\log(R_i)}\right) \\
&= \arg\min_i \left(-\frac{\log(R_i)}{p_i}\right) \\
&= \arg\max_i \left(\frac{\log(R_i)}{p_i}\right) \\
&= \arg\max_i \log\left(R_i^{1/p_i}\right) \\
&= \arg\max_i R_i^{1/p_i}
\end{aligned}
$$
Therefore,
$$G^\star = \arg\max_v R_v^{1/p_v}$$
Thus, the proposition is proved:
$$V^\star = G^\star$$
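The equivalence is easy to confirm numerically: feeding the same uniform draws $R$ to both rules always selects the same index. A small check (vocabulary and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.2, 0.3, 0.4])
for _ in range(10_000):
    r = rng.random(len(p))
    v_star = np.argmax(r ** (1.0 / p))                  # argmax_v R_v^(1/p_v)
    g_star = np.argmax(np.log(p) - np.log(-np.log(r)))  # argmax_i (log p_i + g_i)
    assert v_star == g_star                             # identical every time
print("V* == G* on all trials")
```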
Detailed Proofs
Proposition 2
Consider a discrete distribution $p = (p_1, \ldots, p_V)$ and $V$ random variables $R = (R_1, \ldots, R_V)$ such that the $R_v$ are i.i.d. with $R_v \sim U[0,1]$.
Let $V^\star = \arg\max_v R_v^{1/p_v}$. Then $P(V^\star = v) = p_v$.
Proof
To prove this proposition, we need to show that the probability of $V^\star$ being equal to a specific $v$ is exactly $p_v$.
- Define $X_v = R_v^{1/p_v}$. Since $R_v \sim U[0,1]$, the cumulative distribution function (CDF) of $R_v$ is $F_{R_v}(r) = r$ for $r \in [0,1]$.
- The CDF of $X_v$ can be derived as follows:
$$F_{X_v}(x) = P(X_v \le x) = P(R_v^{1/p_v} \le x) = P(R_v \le x^{p_v}) = F_{R_v}(x^{p_v}) = x^{p_v} \quad \text{for } x \in [0,1]$$
- The probability density function (PDF) of $X_v$ is then:
$$f_{X_v}(x) = \frac{d}{dx} F_{X_v}(x) = p_v x^{p_v - 1} \quad \text{for } x \in [0,1]$$
- Consider $V^\star = \arg\max_v R_v^{1/p_v} = \arg\max_v X_v$. We need to find $P(V^\star = v)$.
- The event $V^\star = v$ means that $X_v$ is the largest among all the $X_i$. Conditioning on the value of $X_v$ and using the independence of the $X_i$, the probability of this event is:
$$P(V^\star = v) = \int_0^1 f_{X_v}(x)\, P(X_i < x \text{ for all } i \neq v)\, dx = \int_0^1 p_v x^{p_v - 1} \prod_{i \neq v} F_{X_i}(x)\, dx$$
- Substituting $F_{X_i}(x) = x^{p_i}$ and using the fact that $\sum_{i=1}^{V} p_i = 1$, the exponents collapse:
$$P(V^\star = v) = p_v \int_0^1 x^{p_v - 1} \prod_{i \neq v} x^{p_i}\, dx = p_v \int_0^1 x^{\left(\sum_{i=1}^{V} p_i\right) - 1}\, dx = p_v \int_0^1 x^0\, dx = p_v$$

Thus, we have proved that $P(V^\star = v) = p_v$, as required.
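A quick Monte Carlo check of the proposition (an illustrative distribution over four tokens): the empirical selection frequencies of $V^\star$ should converge to $p$.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.1, 0.2, 0.3, 0.4])
n = 200_000
r = rng.random((n, len(p)))                 # rows of i.i.d. U[0,1] draws
v_star = np.argmax(r ** (1.0 / p), axis=1)  # V* for each row
print(np.bincount(v_star) / n)              # ~ [0.1, 0.2, 0.3, 0.4]
```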
Corollary 3
$$R_{V^\star} \sim \mathrm{Beta}\!\left(\frac{1}{p_{V^\star}},\, 1\right)$$
Proof
To prove this corollary, we need to show that the random variable $R_{V^\star}$, where $V^\star = \arg\max_v R_v^{1/p_v}$, follows a Beta distribution with parameters $\left(\frac{1}{p_{V^\star}}, 1\right)$.
- From the proposition, we know that $V^\star$ is the index such that $R_{V^\star}^{1/p_{V^\star}}$ is the maximum among $R_1^{1/p_1}, \ldots, R_V^{1/p_V}$.
- Let $X_v = R_v^{1/p_v}$. We derived in the proof of the proposition that the CDF of $X_v$ is $F_{X_v}(x) = x^{p_v}$ and the PDF is $f_{X_v}(x) = p_v x^{p_v - 1}$ for $x \in [0,1]$.
- $X_{V^\star} = \max_v X_v$ is the maximum of $V$ independent random variables, so its CDF is the product of the individual CDFs: $P(X_{V^\star} \le x) = \prod_{v} x^{p_v} = x^{\sum_v p_v} = x$. That is, $X_{V^\star} \sim U[0,1]$; repeating the integral from the proof of Proposition 2 with upper limit $x$ shows that this also holds conditionally on $V^\star = v$.
- Since $R_{V^\star} = X_{V^\star}^{p_{V^\star}}$ is a power of a uniform random variable, its CDF is:
$$P(R_{V^\star} \le r) = P\!\left(X_{V^\star} \le r^{1/p_{V^\star}}\right) = r^{1/p_{V^\star}} \quad \text{for } r \in [0,1]$$
- A distribution on $[0,1]$ with CDF $r^{\alpha}$ is exactly $\mathrm{Beta}(\alpha, 1)$. Applying this with $\alpha = \frac{1}{p_{V^\star}}$, we get:
$$R_{V^\star} \sim \mathrm{Beta}\!\left(\frac{1}{p_{V^\star}}, 1\right)$$

Thus, the corollary is proved: $R_{V^\star} \sim \mathrm{Beta}\!\left(\frac{1}{p_{V^\star}}, 1\right)$.
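Since the mean of $\mathrm{Beta}(\alpha, 1)$ is $\frac{\alpha}{\alpha+1}$, the corollary predicts $\mathbb{E}[R_{V^\star} \mid V^\star = v] = \frac{1/p_v}{1/p_v + 1} = \frac{1}{1 + p_v}$, which a simulation can confirm (the distribution below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(3)
p = np.array([0.1, 0.2, 0.3, 0.4])
r = rng.random((200_000, len(p)))
v_star = np.argmax(r ** (1.0 / p), axis=1)
r_star = r[np.arange(len(r)), v_star]       # R_{V*} for each trial
for v, pv in enumerate(p):
    # empirical mean given V* = v  vs.  the Beta(1/p_v, 1) mean 1/(1+p_v)
    print(v, r_star[v_star == v].mean(), 1.0 / (1.0 + pv))
```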
Proposition 4
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = T \int_0^1 \ln\frac{1}{1-r}\, dr = T$$
where the $r_t$ are i.i.d. random variables uniformly distributed over $(0,1)$. To prove this, we proceed as follows:
- Expectation of the Sum: By the linearity of expectation,
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = \sum_{t=1}^{T} \mathbb{E}\left[\ln\frac{1}{1-r_t}\right]$$
Since the $r_t$ are i.i.d., the expectation of $\ln\frac{1}{1-r_t}$ is the same for each $t$, so we have:
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = T\, \mathbb{E}\left[\ln\frac{1}{1-r}\right]$$
- Expectation of $\ln\frac{1}{1-r}$: We now need to calculate $\mathbb{E}\left[\ln\frac{1}{1-r}\right]$ where $r \sim U(0,1)$. The expectation is given by:
$$\mathbb{E}\left[\ln\frac{1}{1-r}\right] = \int_0^1 \ln\frac{1}{1-r}\, f(r)\, dr$$
Since $r$ is uniformly distributed over $(0,1)$, its probability density function is $f(r) = 1$ for $r \in (0,1)$. Therefore, the integral simplifies to:
$$\mathbb{E}\left[\ln\frac{1}{1-r}\right] = \int_0^1 \ln\frac{1}{1-r}\, dr$$
- Evaluating the Integral: To evaluate the integral $\int_0^1 \ln\frac{1}{1-r}\, dr$, we can use a substitution. Let $u = 1 - r$. Then $du = -dr$, and the limits of integration change from $r = 0$ to $r = 1$ into $u = 1$ to $u = 0$. The integral becomes:
$$\int_0^1 \ln\frac{1}{1-r}\, dr = \int_1^0 \ln\frac{1}{u}\, (-du) = \int_0^1 \ln\frac{1}{u}\, du$$
Simplifying further:
$$\int_0^1 \ln\frac{1}{u}\, du = \int_0^1 -\ln u\, du$$
The integral of $-\ln u$ can be evaluated using integration by parts. Let $v = -\ln u$ and $dw = du$. Then $dv = -\frac{1}{u}\, du$ and $w = u$. Integration by parts gives:
$$\int v\, dw = \left. vw \right|_0^1 - \int w\, dv$$
Substituting $v$ and $w$:
$$\int_0^1 -\ln u\, du = \left[-u \ln u\right]_0^1 - \int_0^1 u \left(-\frac{1}{u}\right) du = \left[-u \ln u\right]_0^1 + \int_0^1 du$$
Evaluating these terms:
$$\left[-u \ln u\right]_0^1 = (-1 \cdot \ln 1) - \lim_{u \to 0^+} (-u \ln u) = 0 - 0 = 0$$
$$\int_0^1 du = \left. u \right|_0^1 = 1 - 0 = 1$$
Combining these results, we get:
$$\int_0^1 -\ln u\, du = 0 + 1 = 1$$
Thus:
$$\mathbb{E}\left[\ln\frac{1}{1-r}\right] = \int_0^1 \ln\frac{1}{1-r}\, dr = 1$$
- Conclusion: Therefore,
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = T \cdot 1 = T$$
Thus, we have proved that
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = T$$
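A one-line Monte Carlo check that the per-token expectation is indeed 1, so the null score concentrates around $T$:

```python
import numpy as np

rng = np.random.default_rng(4)
r = rng.random(1_000_000)                # r ~ U(0,1)
print(np.log(1.0 / (1.0 - r)).mean())    # ~ 1.0, hence E[S_T] = T under H0
```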
Proposition 5
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln\frac{1}{1-r}\, dr \;\ge\; T + \left(\frac{\pi^2}{6} - 1\right) H_T$$
where $r_t \sim \mathrm{Beta}\!\left(\frac{1}{p_t}, 1\right)$, $p_t$ is the probability of the token selected at step $t$, and $H_T = -\sum_{t=1}^{T} p_t \ln p_t$.
Proof
We start with the expression:
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right]$$
Since $r_t \sim \mathrm{Beta}\!\left(\frac{1}{p_t}, 1\right)$ has density $f(r) = \frac{1}{p_t} r^{1/p_t - 1}$ on $[0,1]$, we can write:
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = -\sum_{t=1}^{T} \int_0^1 \frac{1}{p_t}\, r^{1/p_t - 1} \ln(1-r)\, dr$$
Next, we use integration by parts with $u = 1 - r^{1/p_t}$ (so that $du = -\frac{1}{p_t} r^{1/p_t - 1}\, dr$) and $v = \ln(1-r)$:
$$-\int_0^1 \frac{1}{p_t}\, r^{1/p_t - 1} \ln(1-r)\, dr = \int_0^1 \frac{1 - r^{1/p_t}}{1-r}\, dr = H_{1/p_t}$$
where $H_z$ is the $z$-th (generalized) harmonic number, defined as $H_z = \sum_{n=1}^{\infty} \left(\frac{1}{n} - \frac{1}{n+z}\right)$. Therefore, we have:
$$-\int_0^1 \frac{1}{p_t}\, r^{1/p_t - 1} \ln(1-r)\, dr = \sum_{n=1}^{\infty} \left(\frac{1}{n} - \frac{1}{n+1/p_t}\right) = 1 + \sum_{n=1}^{\infty} \left(\frac{1}{n+1} - \frac{1}{n+1/p_t}\right)$$
For all $n \in \mathbb{N}^\star$, we have:
$$(n+1)^2 \left(\frac{1}{n+1} - \frac{1}{n+1/p_t}\right) = \frac{(n+1)(n+1/p_t) - (n+1)^2}{n+1/p_t} = \frac{(1+n)(1/p_t - 1)}{1/p_t + n} \ge -\frac{1+n}{1/p_t + n} \ln(p_t) \ge -p_t \ln(p_t)$$
where the first inequality uses $1/p_t - 1 \ge \ln(1/p_t) = -\ln(p_t)$, and the second uses $1/p_t + n \le (1+n)/p_t$.
Therefore, by summing over all $t \in [1,T]$,
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] \ge T + \left(\sum_{n=1}^{\infty} \frac{1}{(n+1)^2}\right) \left(\sum_{t=1}^{T} -p_t \ln(p_t)\right) = T + \left(\frac{\pi^2}{6} - 1\right) H_T$$
This completes the proof that:
$$\mathbb{E}\left[\sum_{t=1}^{T} \ln\frac{1}{1-r_t}\right] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln\frac{1}{1-r}\, dr \ge T + \left(\frac{\pi^2}{6} - 1\right) H_T$$
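The bound can also be checked by simulation. By inverse-CDF sampling, a $\mathrm{Beta}(1/p_t, 1)$ draw is just $U^{p_t}$ with $U \sim U(0,1)$; the token probabilities `p_t` below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
p_t = np.array([0.1, 0.3, 0.5, 0.7, 0.9])     # probabilities of chosen tokens
T = len(p_t)
r = rng.random((200_000, T)) ** p_t           # r_t ~ Beta(1/p_t, 1), per column
scores = np.log(1.0 / (1.0 - r)).sum(axis=1)  # S_T for each simulated text
H_T = -(p_t * np.log(p_t)).sum()
print(scores.mean(), ">=", T + (np.pi**2 / 6 - 1) * H_T)   # bound holds
```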
References
[1] Aaronson S., Kirchner H. Watermarking GPT Outputs. 2022.
[2] Fernandez P., Chaffin A., Tit K., et al. Three Bricks to Consolidate Watermarks for Large Language Models. 2023 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2023: 1-6.
[3] Kirchenbauer J., Geiping J., Wen Y., et al. A Watermark for Large Language Models. International Conference on Machine Learning. PMLR, 2023: 17061-17084.