
Theory Proofs for Aaronson's LLM Watermark Method

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

The importance of watermarking for large language models (LLMs) cannot be overstated. Aaronson et al. [1] propose a statistics-based method to embed a watermark into LLM-generated text. They initially provided only a brief overview of the method in a set of slides, with no proofs, while Fernandez et al. [2] offer a more detailed theoretical treatment that relies on some probability tricks.

This blog post aims to explain Aaronson’s watermarking method in a straightforward, beginner-friendly manner, avoiding reliance on advanced probability techniques. Notably, we highlight that this method is essentially the same as using the Gumbel-Max Trick.

Method

Aaronson's watermarking method [1] modifies the token selection process in large language models to embed an invisible trace into the generated text. During generation, a secret key $k$ is used to produce a random vector $\mathbf{r}$, where each element $r_v$ of this vector corresponds to a token $v$ in the vocabulary $V$, and $r_v$ is uniformly distributed in $[0, 1]$.

Formally, given a large language model (LLM) that at each step generates a probability distribution $\mathbf{p} = (p_1, p_2, \ldots, p_V)$ over the vocabulary $V$, where $p_v$ is the probability of token $v$, Aaronson's method adjusts the selection process. The next token $x$ is selected as:

$$x = \arg\max_{v \in V} \left( r_v^{1/p_v} \right)$$

This ensures that tokens with both high original probabilities and favorable random values $r_v$ are chosen more frequently.

For detection, the watermark's presence is verified by computing a score $S_T$ for a sequence of $T$ tokens. This score aggregates the log-transformed random values of the selected tokens:

$$S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$$

where $r_{x(t)}$ is the random value corresponding to the $t$-th token $x(t)$. A threshold is used to determine whether the score $S_T$ is significantly higher than expected under the null hypothesis $H_0$ (no watermark). If $S_T$ exceeds this predefined threshold, the text is flagged as watermarked. This method maintains the original probability distribution on average, making the watermark imperceptible while allowing robust detection using the secret key.
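
To make the mechanics concrete, here is a minimal Python sketch of the selection rule above. Aaronson's actual key/hashing construction is not specified in this post, so the function name `watermarked_sample`, its seeding scheme, and the context window used below are illustrative assumptions, not the original implementation.

```python
import hashlib
import numpy as np

def watermarked_sample(probs, context_ids, secret_key):
    """Select the next token as argmax_v r_v^(1/p_v).

    probs: the model's next-token distribution p over the vocabulary.
    The vector r is pseudo-random but reproducible: it is derived from the
    secret key and the recent context, so a detector holding the same key
    can regenerate it. (The hashing scheme below is an illustrative choice.)
    """
    seed_material = f"{secret_key}|{tuple(context_ids)}".encode()
    seed = int.from_bytes(hashlib.sha256(seed_material).digest()[:8], "big")
    rng = np.random.default_rng(seed)

    r = rng.random(len(probs))                      # r_v ~ U(0, 1)
    # argmax r_v^(1/p_v) is computed in log space for numerical stability:
    # log(r_v) / p_v is a monotone transform, so the argmax is unchanged.
    scores = np.log(r) / np.maximum(probs, 1e-12)   # guard against p_v = 0
    return int(np.argmax(scores))

# Example on a toy 5-token vocabulary
probs = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
token = watermarked_sample(probs, context_ids=[17, 42], secret_key="my-key")
```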

Proof Sketch

Generating

When generating the tokens, we alter the scores from $\mathbf{p} = (p_1, \ldots, p_V)$ to $(R_1^{1/p_1}, R_2^{1/p_2}, \ldots, R_V^{1/p_V})$, where $R_i \sim U(0,1)$. The first result we need (Proposition 2) is that $\mathbb{P}(V^{\star} = v) = p_v$, where $V^{\star} = \arg\max_v R_v^{1/p_v}$. This means that selecting the token that maximizes the new scores ($R_v^{1/p_v}$) is equivalent to sampling the token from the original probability distribution ($p_v$). This is an important property: it transforms the sampling operation into a maximization operation. In fact, this is exactly the Gumbel-Max Trick, and we show the equivalence in Proposition 1.

Detecting

When detecting the watermark, we first regenerate the same uniform random vector $\mathbf{r}$ (based on the secret key and the same context words) and look up the entry $r_{x(t)}$, where $x(t)$ is the ID of the $t$-th token of the text. For non-watermarked text, the tokens are not tied to the maximal entries of the vector, so $r_{x(t)}$ is simply a uniform random variable. However, if the text is watermarked, we can prove that $r_{x(t)}$ follows a Beta distribution (Corollary 3). Thus, non-watermarked text yields a uniform distribution, whereas watermarked text yields a Beta distribution for the random variable $r_{x(t)}$. We can test which distribution $r_{x(t)}$ follows: if it follows a uniform distribution, the text is non-watermarked; if it follows a Beta distribution, the text is watermarked.

The next question is how to differentiate between the uniform distribution and the Beta distribution. This is where the score $S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$ comes into play. Given $T$ tokens, for the uniform distribution we can prove that the expectation of the score is exactly $T$ (Proposition 4). Meanwhile, for the Beta distribution we can prove (Proposition 5) that the expectation of the score is lower-bounded:

$$\mathbb{E}[S_T] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln \frac{1}{1 - r} \, dr \geq T + \left( \frac{\pi^2}{6} - 1 \right) H_T,$$

where $H_T = -\sum_{t=1}^{T} p_t \ln(p_t)$. This means the expected score for watermarked text is higher than that for non-watermarked text. We can therefore set a threshold: if the score exceeds this threshold, the text is flagged as following the Beta distribution, i.e., as watermarked.

To summarize:

  1. Calculate the Score: Compute the score $S_T = -\sum_{t=1}^{T} \ln(1 - r_{x(t)})$ for the given text over $T$ tokens.

  2. Expectation for Uniform Distribution: For non-watermarked text (uniform distribution), the expected score is $T$.

  3. Expectation for Beta Distribution: For watermarked text (Beta distribution), the expected score is higher than $T$, as bounded by the integral expression above.

  4. Set a Threshold: Establish a threshold value slightly higher than $T$ to distinguish between the two distributions.

  5. Compare Scores:

    • If $S_T$ is approximately $T$ or lower, the text is likely non-watermarked.
    • If $S_T$ is significantly higher than $T$, the text is likely watermarked.

This method allows for effective differentiation between texts following a uniform distribution and those following a Beta distribution.
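
As a companion sketch to the generation code above, the detection side only needs the reproduced $r_{x(t)}$ values and the score $S_T$. The helper below is one simple way to implement the threshold comparison: under $H_0$ each term $-\ln(1 - r)$ is an Exp(1) random variable, so $S_T$ follows a Gamma$(T, 1)$ distribution, whose tail can be used to turn the score into a p-value. The function names and the cutoff `alpha` are illustrative choices, not part of the original method description.

```python
import numpy as np
from scipy.stats import gamma

def detection_score(r_values):
    """S_T = -sum_t ln(1 - r_{x(t)}), where r_{x(t)} is the reproduced
    random value of the t-th observed token."""
    r = np.asarray(r_values, dtype=float)
    return float(-np.log(1.0 - r).sum())

def is_watermarked(r_values, alpha=1e-3):
    """Flag the text if S_T is significantly above its H0 expectation T.
    Under H0, S_T ~ Gamma(T, 1), so we threshold on the Gamma tail."""
    T = len(r_values)
    s = detection_score(r_values)
    p_value = gamma.sf(s, a=T)          # P(S_T >= s) under H0
    return s, p_value, p_value < alpha
```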

Equivalence to the Gumbel-Max Trick

Recalling the Gumbel-Max Trick

The Gumbel-Max Trick is a method for sampling from a categorical distribution using the Gumbel distribution. Here’s a step-by-step explanation of how it works:

  1. Generate Gumbel Noise: For each category $i$ in the distribution, generate a random value $g_i$ from the Gumbel distribution. The Gumbel distribution can be sampled using the following formula:

    $$g_i = -\log(-\log(U_i))$$

    where $U_i$ is a uniform random variable between 0 and 1.

  2. Combine with Log-Probabilities: Add the log-probability of each category to the corresponding Gumbel noise. If $p_i$ is the probability of category $i$, compute:

    $$y_i = \log(p_i) + g_i$$

  3. Select the Maximum: Choose the category with the highest value of $y_i$:

    $$\text{selected category} = \arg\max_i y_i$$

This method leverages the properties of the Gumbel distribution to efficiently sample from a categorical distribution. The addition of Gumbel noise transforms the problem of sampling into finding the maximum value, which is computationally straightforward.

The Gumbel-Max Trick is particularly useful in settings where sampling needs to be performed repeatedly or efficiently, such as in machine learning algorithms and Monte Carlo simulations.
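
The following small numpy experiment (with illustrative numbers) checks the recipe above numerically: adding Gumbel noise to log-probabilities and taking the argmax reproduces the original categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                  # a toy categorical distribution
n = 200_000

U = rng.random((n, len(p)))
g = -np.log(-np.log(U))                        # Gumbel noise g_i = -log(-log(U_i))
samples = np.argmax(np.log(p) + g, axis=1)     # argmax_i (log p_i + g_i)

freq = np.bincount(samples, minlength=len(p)) / n
print(freq)                                    # close to [0.5, 0.3, 0.2]
```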

Proof of the Equivalence to the Gumbel-Max Trick

Proposition 1

Consider a discrete distribution $\mathbf{p} = (p_1, \ldots, p_V)$ and $V$ random variables $\mathbf{R} = (R_1, \ldots, R_V)$ such that the $R_v$ are i.i.d. with $R_v \sim \mathcal{U}_{[0,1]}$.
Let $V^{\star} = \arg\max_v R_v^{1/p_v}$ and $G^{\star} = \arg\max_v \left(\log(p_v) + g_v\right)$, where $g_v = -\log(-\log(R_v))$. Then

$$V^{\star} = G^{\star}$$

Proof

$$\begin{aligned} \arg \max_i G_i &= \arg \max_i \left(\log(p_i) + g_i\right) \\ &= \arg \max_i \left(\log(p_i) - \log(-\log(R_i))\right) \\ &= \arg \max_i \exp \left(\log(p_i) - \log(-\log(R_i))\right) \\ &= \arg \max_i \left(\exp(\log(p_i)) \cdot \exp(-\log(-\log(R_i)))\right) \\ &= \arg \max_i \left(p_i \cdot \frac{1}{-\log(R_i)}\right) \\ &= \arg \min_i \left(\frac{-\log(R_i)}{p_i}\right) \\ &= \arg \max_i \left(\frac{\log(R_i)}{p_i}\right) \\ &= \arg \max_i \log\left(R_i^{1/p_i}\right) \\ &= \arg \max_i R_i^{1/p_i} \end{aligned}$$

Therefore,

$$\arg \max_v \left(\log(p_v) + g_v\right) = \arg \max_v R_v^{1/p_v},$$

and thus the proposition is proved:

$$V^{\star} = G^{\star}$$

Detailed Proofs

Proposition 2

Consider a discrete distribution $\mathbf{p} = (p_1, \ldots, p_V)$ and $V$ random variables $\mathbf{R} = (R_1, \ldots, R_V)$ such that the $R_v$ are i.i.d. with $R_v \sim \mathcal{U}_{[0,1]}$.
Let $V^{\star} = \arg\max_v R_v^{1/p_v}$. Then $\mathbb{P}(V^{\star} = v) = p_v$.

Proof
To prove this proposition, we need to show that the probability of $V^{\star}$ being equal to a specific $v$ is exactly $p_v$.

  1. Define Xv=Rv1/pvX_v = R_v^{1/p_v}. Since RvU[0,1]R_v \sim \mathcal{U}_{[0,1]}, the cumulative distribution function (CDF) of RvR_v is FRv(r)=rF_{R_v}(r) = r for r[0,1]r \in [0,1].

  2. The CDF of XvX_v can be derived as follows:

    FXv(x)=P(Xvx)=P(Rv1/pvx)F_{X_v}(x) = \mathbb{P}(X_v \leq x) = \mathbb{P}(R_v^{1/p_v} \leq x)

    =P(Rvxpv)=FRv(xpv)=xpv for x[0,1]= \mathbb{P}(R_v \leq x^{p_v}) = F_{R_v}(x^{p_v}) = x^{p_v} \text{ for } x \in [0,1]

  3. The probability density function (PDF) of XvX_v is then:

    fXv(x)=ddxFXv(x)=pvxpv1 for x[0,1]f_{X_v}(x) = \frac{d}{dx} F_{X_v}(x) = p_v x^{p_v-1} \text{ for } x \in [0,1]

  4. Consider V=argmaxvRv1/pv=argmaxvXvV^{\star} = \arg \max_v R_v^{1/p_v} = \arg \max_v X_v. We need to find P(V=v)\mathbb{P}(V^{\star} = v).

  5. The event $V^{\star} = v$ means that $X_v$ is the largest among all $X_i$ for $i = 1, \ldots, V$. Since the $X_i$ are independent, we can condition on the value of $X_v$ and multiply the probabilities that every other $X_i$ falls below it:

    $$\mathbb{P}(V^{\star} = v) = \int_0^1 f_{X_v}(x) \prod_{i \neq v} \mathbb{P}(X_i < x) \, dx = \int_0^1 p_v x^{p_v - 1} \prod_{i \neq v} x^{p_i} \, dx$$

  6. Collecting the exponents and using $\sum_{i=1}^{V} p_i = 1$, we get $p_v - 1 + \sum_{i \neq v} p_i = 0$, so the integrand reduces to the constant $p_v$:

    $$\mathbb{P}(V^{\star} = v) = p_v \int_0^1 x^{0} \, dx = p_v$$

Thus, we have proved that $\mathbb{P}(V^{\star} = v) = p_v$, as required.
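
A quick Monte Carlo check of Proposition 2 (a sketch with illustrative numbers): drawing uniform $R_v$, taking $\arg\max_v R_v^{1/p_v}$ (computed as $\arg\max_v \log(R_v)/p_v$, as in Proposition 1), and counting how often each index wins reproduces the original probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])
n = 200_000

R = rng.random((n, len(p)))
V_star = np.argmax(np.log(R) / p, axis=1)      # argmax_v R_v^(1/p_v) in log space

freq = np.bincount(V_star, minlength=len(p)) / n
print(freq)                                    # close to [0.6, 0.25, 0.1, 0.05]
```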

Corollary 3

$$R_{V^{\star}} \sim \text{Beta}\left(\frac{1}{p_{V^{\star}}}, 1\right)$$

Proof
To prove this corollary, we need to show that the random variable $R_{V^{\star}}$, where $V^{\star} = \arg\max_v R_v^{1/p_v}$, follows a Beta distribution with parameters $\left(\frac{1}{p_{V^{\star}}}, 1\right)$; that is, conditionally on $V^{\star} = v$, the value $R_v$ has CDF $r^{1/p_v}$.

  1. Let $X_v = R_v^{1/p_v}$. We derived in the proof of Proposition 2 that the CDF of $X_v$ is $F_{X_v}(x) = x^{p_v}$ for $x \in [0,1]$.

  2. For $r \in [0,1]$, the joint probability that $R_v \leq r$ and that token $v$ wins the maximization is

    $$\mathbb{P}(R_v \leq r, V^{\star} = v) = \int_0^r \prod_{i \neq v} \mathbb{P}\left(X_i < s^{1/p_v}\right) ds = \int_0^r \prod_{i \neq v} s^{p_i/p_v} \, ds = \int_0^r s^{(1 - p_v)/p_v} \, ds = p_v \, r^{1/p_v}$$

    where we used $\sum_{i \neq v} p_i = 1 - p_v$.

  3. Dividing by $\mathbb{P}(V^{\star} = v) = p_v$ (Proposition 2) gives

    $$\mathbb{P}(R_v \leq r \mid V^{\star} = v) = r^{1/p_v},$$

    which is exactly the CDF of the $\text{Beta}\left(\frac{1}{p_v}, 1\right)$ distribution, whose density is $\frac{1}{p_v} r^{1/p_v - 1}$ on $[0,1]$.

Thus, the corollary is proved: $R_{V^{\star}} \sim \text{Beta}\left(\frac{1}{p_{V^{\star}}}, 1\right)$.
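
This conditional law can also be verified empirically. The sketch below (illustrative parameters) keeps the draws where token $v$ wins the maximization and compares the empirical distribution of $R_v$ against the Beta$(1/p_v, 1)$ CDF $r^{1/p_v}$ with a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])
n = 500_000
v = 0                                          # token whose conditional law we test

R = rng.random((n, len(p)))
V_star = np.argmax(np.log(R) / p, axis=1)
r_winner = R[V_star == v, v]                   # R_v conditioned on V* = v

# Beta(1/p_v, 1) has CDF F(r) = r^(1/p_v)
stat, pval = kstest(r_winner, lambda r: r ** (1.0 / p[v]))
print(stat, pval)                              # small KS statistic => consistent with Beta(1/p_v, 1)
```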

Proposition 4

$$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T \int_0^1 \ln \frac{1}{1-r} \, dr = T$$

where the $r_t$ are i.i.d. random variables uniformly distributed over $(0,1)$.

Proof
We proceed as follows:

  1. Expectation of the Sum: By the linearity of expectation,

    $$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = \sum_{t=1}^{T} \mathbb{E}\left[\ln \frac{1}{1-r_t}\right]$$

    Since the $r_t$ are i.i.d., the expectation of $\ln \frac{1}{1-r_t}$ is the same for each $t$, so we have:

    $$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T\, \mathbb{E}\left[\ln \frac{1}{1-r}\right]$$

  2. Expectation of $\ln \frac{1}{1-r}$: We now need to calculate $\mathbb{E}\left[\ln \frac{1}{1-r}\right]$ where $r \sim \mathcal{U}(0,1)$. The expectation is given by:

    $$\mathbb{E}\left[\ln \frac{1}{1-r}\right] = \int_0^1 \ln \frac{1}{1-r}\, f(r) \, dr$$

    Since $r$ is uniformly distributed over $(0,1)$, its probability density function is $f(r) = 1$ for $r \in (0,1)$. Therefore, the integral simplifies to:

    $$\mathbb{E}\left[\ln \frac{1}{1-r}\right] = \int_0^1 \ln \frac{1}{1-r} \, dr$$

  3. Evaluating the Integral: To evaluate the integral $\int_0^1 \ln \frac{1}{1-r} \, dr$, we can use a substitution. Let $u = 1 - r$. Then $du = -dr$, and the limits of integration change from $r = 0$ to $r = 1$ into $u = 1$ to $u = 0$. The integral becomes:

    $$\int_0^1 \ln \frac{1}{1-r} \, dr = \int_1^0 \ln \frac{1}{u} \, (-du) = \int_0^1 \ln \frac{1}{u} \, du$$

    Simplifying further:

    $$\int_0^1 \ln \frac{1}{u} \, du = \int_0^1 -\ln u \, du$$

    The integral of $-\ln u$ can be evaluated using integration by parts. Let $v = -\ln u$ and $dw = du$. Then $dv = -\frac{1}{u}\, du$ and $w = u$. Integration by parts gives:

    $$\int v \, dw = vw \bigg|_0^1 - \int w \, dv$$

    Substituting $v$ and $w$:

    $$\int_0^1 -\ln u \, du = \left[-u \ln u\right]_0^1 - \int_0^1 u \left(-\frac{1}{u}\right) du = \left[-u \ln u\right]_0^1 + \int_0^1 du$$

    Evaluating these terms (using $\lim_{u \to 0^+} u \ln u = 0$):

    $$\left[-u \ln u\right]_0^1 = \left(-1 \cdot \ln 1\right) - \left(-0 \cdot \ln 0\right) = 0 - 0 = 0$$

    $$\int_0^1 du = u \bigg|_0^1 = 1 - 0 = 1$$

    Combining these results, we get:

    $$\int_0^1 -\ln u \, du = 0 + 1 = 1$$

    Thus:

    $$\mathbb{E}\left[\ln \frac{1}{1-r}\right] = \int_0^1 \ln \frac{1}{1-r} \, dr = 1$$

  4. Conclusion: Therefore,

    $$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T \cdot 1 = T$$

Thus, we have proved that

$$\mathbb{E}\left[\sum_{t=1}^{T} \ln \frac{1}{1-r_t}\right] = T$$
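
A short simulation (illustrative sizes) confirms this: with uniform $r_t$, the average score over many simulated sequences concentrates around $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 100, 20_000

r = rng.random((n_runs, T))                    # r_t ~ U(0, 1): no watermark
scores = -np.log(1.0 - r).sum(axis=1)          # S_T for each simulated sequence
print(scores.mean())                           # close to T = 100
```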

Proposition 5

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln \frac{1}{1 - r} \, dr \geq T + \left( \frac{\pi^2}{6} - 1 \right) H_T$$

where $r_t \sim \text{Beta}\left(\frac{1}{p_t}, 1\right)$, $p_t$ is the model probability of the token selected at step $t$, and $H_T = -\sum_{t=1}^{T} p_t \ln(p_t)$.

Proof:

We start with the expression:

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right]$$

Writing out the expectation with the $\text{Beta}\left(\frac{1}{p_t}, 1\right)$ density $f(r) = \frac{1}{p_t} r^{1/p_t - 1}$ on $[0,1]$, we have:

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] = -\sum_{t=1}^{T} \int_0^1 \frac{1}{p_t} r^{1/p_t - 1} \ln(1 - r) \, dr$$

Next, we use integration by parts with $u = 1 - r^{1/p_t}$ and $v = \ln(1 - r)$:

$$-\int_0^1 \frac{1}{p_t} r^{1/p_t - 1} \ln(1 - r) \, dr = \int_0^1 \frac{1 - r^{1/p_t}}{1 - r} \, dr = \mathcal{H}_{1/p_t}$$

where $\mathcal{H}_z$ is the generalized harmonic number defined as $\mathcal{H}_z = \sum_{n=1}^{\infty} \left(\frac{1}{n} - \frac{1}{n+z}\right)$. Therefore, we have:

$$-\int_0^1 \frac{1}{p_t} r^{1/p_t - 1} \ln(1 - r) \, dr = \sum_{n=1}^{\infty} \left(\frac{1}{n} - \frac{1}{n + 1/p_t}\right) = 1 + \sum_{n=1}^{\infty} \left( \frac{1}{n + 1} - \frac{1}{n + 1/p_t} \right)$$

For all $n \in \mathbb{N}^\star$, we have:

$$\begin{aligned} (n + 1)^2 \left( \frac{1}{n + 1} - \frac{1}{n + 1/p_t} \right) &= \frac{(n + 1)(n + 1/p_t) - (n + 1)^2}{n + 1/p_t} \\ &= \frac{n + 1}{n + 1/p_t} \left(\frac{1}{p_t} - 1\right) \\ &\geq -\frac{n + 1}{n + 1/p_t} \ln(p_t) \\ &\geq -p_t \ln(p_t) \end{aligned}$$

where the first inequality uses $\frac{1}{p_t} - 1 \geq \ln \frac{1}{p_t} = -\ln(p_t)$, and the second uses $\frac{n + 1}{n + 1/p_t} \geq p_t$ (which holds since $p_t \leq 1$) together with $-\ln(p_t) \geq 0$.

Rearranging, $\frac{1}{n + 1} - \frac{1}{n + 1/p_t} \geq \frac{-p_t \ln(p_t)}{(n + 1)^2}$. Therefore, by summing over $n$ and then over all $t \in [1, T]$,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] \geq T + \left( \sum_{n=1}^{\infty} \frac{1}{(n + 1)^2} \right) \left( \sum_{t=1}^{T} - p_t \ln(p_t) \right) = T + \left( \frac{\pi^2}{6} - 1 \right) H_T$$

where we used $\sum_{n=1}^{\infty} \frac{1}{(n+1)^2} = \frac{\pi^2}{6} - 1$ and $H_T = -\sum_{t=1}^{T} p_t \ln(p_t)$.

This completes the proof that:

$$\mathbb{E} \left[ \sum_{t=1}^{T} \ln \frac{1}{1 - r_t} \right] = \sum_{t=1}^{T} \frac{1}{p_t} \int_0^1 r^{1/p_t - 1} \ln \frac{1}{1 - r} \, dr \geq T + \left( \frac{\pi^2}{6} - 1 \right) H_T$$
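
The bound can also be checked numerically. The sketch below (with arbitrary, illustrative $p_t$ values) samples $r_t \sim \text{Beta}(1/p_t, 1)$ via the inverse CDF $F^{-1}(u) = u^{p_t}$ and compares the average score against $T + (\pi^2/6 - 1) H_T$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 100, 20_000

p_t = rng.uniform(0.05, 0.9, size=T)           # probabilities of the selected tokens (illustrative)

u = rng.random((n_runs, T))
r = u ** p_t                                   # r_t ~ Beta(1/p_t, 1) via inverse CDF u^(p_t)

scores = -np.log(1.0 - r).sum(axis=1)          # watermarked score S_T
H_T = float(-(p_t * np.log(p_t)).sum())
lower_bound = T + (np.pi ** 2 / 6 - 1) * H_T

print(scores.mean(), ">=", lower_bound)        # empirical mean exceeds the bound
```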

References

[1] Scott Aaronson, Hendrik Kirchner. Watermarking GPT Outputs. Slides, 2022.
[2] Fernandez P, Chaffin A, Tit K, et al. Three Bricks to Consolidate Watermarks for Large Language Models. 2023 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2023: 1-6.
[3] Kirchenbauer J, Geiping J, Wen Y, et al. A Watermark for Large Language Models. International Conference on Machine Learning. PMLR, 2023: 17061-17084.