
EM (Expectation Maximization) Tutorial

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

The Goal of This Tutorial

The EM (Expectation Maximization) algorithm is a widely used method for fitting models that contain hidden variables which cannot be observed in the dataset. In this tutorial, we focus on the outline and intuition of the EM algorithm. A Gaussian mixture model is studied as an example.

Gaussian Mixture Model

In the Gaussian mixture model, we assume that there are several generators, each of which draws samples from a Gaussian distribution. Each sample of our data, denoted $x^{(i)}=(x_1,x_2,...,x_d)$, is generated by one of the generators. We can only observe the position of each sample $x^{(i)}$; we do not know which generator $z^{(i)}$ generated it. Here, $z^{(i)}\in\{1,2,...,k\}$ stands for the ID of the generator. Each generator $j$ ($j\in[1,k]$) generates samples from a Gaussian distribution with parameters $\mu_j,\Sigma_j$. We use $\theta$ to denote all model parameters, $\theta=(\mu_1,\mu_2,...,\mu_k,\Sigma_1,\Sigma_2,...,\Sigma_k,\phi_1,\phi_2,...,\phi_k)$, where $\phi_j$ is the probability of selecting the $j$th generator and $\sum_{j=1}^k \phi_j=1$. The goal of the Gaussian mixture model is to find the parameter $\theta$ that best describes the data.
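As a concrete illustration, the generative process described above can be sketched in a few lines of NumPy. The parameter values below are hypothetical and only serve to show how samples and hidden labels are produced; in practice we only observe `x`, while `z` stays hidden.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters theta for k = 2 generators in d = 2 dimensions.
phi = np.array([0.3, 0.7])                      # mixing probabilities phi_j, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # means mu_j
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # covariances Sigma_j

m = 500
z = rng.choice(len(phi), size=m, p=phi)         # hidden variable: which generator produced each sample
x = np.stack([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])  # observed samples x^(i)
```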

However, things get complicated because we do not know which class $z^{(i)}$ each sample belongs to. People are sometimes confused about the relation between model parameters and hidden variables. The difference is that the model parameter $\theta$ is shared by all samples, while each sample has its own hidden variable $z^{(i)}$.

General Problem

We abstract the above problem into a more general form. Suppose we have some samples, where each sample $i$ is composed of visible features $x^{(i)}$ and hidden features $z^{(i)}$. Our ultimate goal is to maximize the incomplete log likelihood:

$$\begin{aligned}
\log p(x|\theta)&=\log \int_z p(x,z|\theta)\,dz\\
&= \log \int_z q(z|x)\frac{p(x,z|\theta)}{q(z|x)}\,dz\\
&\ge \int_z q(z|x) \log\frac{p(x,z|\theta)}{q(z|x)}\,dz=\mathcal{L}(q,\theta)\quad\text{(Jensen's inequality)}\\
&=\int_z q(z|x) \log p(x,z|\theta)\,dz-\int_z q(z|x) \log q(z|x)\,dz\quad(\text{the second term is not related to }\theta)
\end{aligned}$$

As shown above, $\int_z q(z|x) \log p(x,z|\theta)\,dz$ is called the expected complete log likelihood, and $\mathcal{L}(q,\theta)$ is a lower bound of the incomplete log likelihood that depends on $\theta$. If we maximize this lower bound, the incomplete log likelihood will also increase.
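For a discrete hidden variable, this decomposition can be transcribed almost literally into code. The following is a minimal sketch (the function name and array layout are my own), useful for checking that the bound indeed stays below the incomplete log likelihood:

```python
import numpy as np

def lower_bound(q, log_p_xz):
    """L(q, theta) for a single sample x with a discrete hidden variable z.

    q        -- probabilities q(z|x) for the k possible values of z
    log_p_xz -- log p(x, z|theta) for the same k values
    Returns E_q[log p(x, z|theta)] - E_q[log q(z|x)].
    """
    return float(np.sum(q * log_p_xz) - np.sum(q * np.log(q)))

q = np.array([0.5, 0.5])          # an arbitrary q(z|x) over k = 2 values of z
log_p_xz = np.log([0.1, 0.3])     # hypothetical values of p(x, z|theta)
print(lower_bound(q, log_p_xz))   # about -1.06
print(np.log(0.1 + 0.3))          # log p(x|theta), about -0.92, so the bound holds
```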

In the next step, we show how to choose $q(z|x)$. We claim that when $q(z|x)=p(z|x,\theta)$, the inequality holds with equality. Substituting $q(z|x)=p(z|x,\theta)$:

$$\begin{aligned}
\mathcal{L}(q,\theta)&=\int_z q(z|x) \log\frac{p(x,z|\theta)}{q(z|x)}\,dz\\
&=\int_z p(z|x,\theta) \log\frac{p(x,z|\theta)}{p(z|x,\theta)}\,dz\\
&=\int_z p(z|x,\theta) \log p(x|\theta)\,dz\\
&= \log p(x|\theta)\int_z p(z|x,\theta)\,dz\\
&= \log p(x|\theta)
\end{aligned}$$

The above derivation shows that when $q(z|x)=p(z|x,\theta)$, Jensen's inequality holds with equality. Once $q(z|x)$ is determined, the remaining issue is to find a $\theta$ that maximizes the expected complete log likelihood. The step of determining $q(z|x)$ is called the E-step, and the procedure of maximizing the expected complete log likelihood is called the M-step.
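The resulting alternation can be written as a short generic loop. This is only a sketch: `e_step` and `m_step` are hypothetical, model-specific functions (for the Gaussian mixture they are spelled out below), and returning the lower bound from the E-step is just one convenient way to monitor convergence.

```python
def em(x, theta, n_iters=100, tol=1e-6):
    """Generic EM loop: alternate the E-step and M-step until the lower bound stops improving."""
    prev_bound = -float("inf")
    for _ in range(n_iters):
        q, bound = e_step(x, theta)   # E-step: set q(z|x) = p(z|x, theta) and evaluate L(q, theta)
        theta = m_step(x, q)          # M-step: maximize the expected complete log likelihood over theta
        if bound - prev_bound < tol:  # L(q, theta) is non-decreasing, so this is a common stopping rule
            break
        prev_bound = bound
    return theta
```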

Back to the Gaussian Mixture Model

We use the EM algorithm to solve our Gaussian mixture model.

We denote

$$\begin{aligned}
w_j^{(i)}&=p(z^{(i)}=j|x^{(i)};\theta)\\
&=\frac{p(z^{(i)}=j)\,p(x^{(i)}|z^{(i)}=j;\theta)}{\sum_{l=1}^k p(z^{(i)}=l)\,p(x^{(i)}|z^{(i)}=l;\theta)}\\
p(z^{(i)}=j)&=\phi_j
\end{aligned}$$
$w_j^{(i)}$ can be viewed as the responsibility that component $j$ takes for 'explaining' the observation $x^{(i)}$, and $\phi_j$ is the prior probability of selecting the $j$th Gaussian. The expected complete log likelihood is:

$$\begin{aligned}
&\sum_{i=1}^m\sum_{z^{(i)}}q(z^{(i)}|x^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{q(z^{(i)}|x^{(i)})}\\
=&\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\log\frac{\frac{1}{(2\pi)^{d/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w_j^{(i)}}
\end{aligned}$$
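The E-step, i.e. computing the responsibilities $w_j^{(i)}$ defined above, can be sketched with NumPy and SciPy as follows (the function name is my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def compute_responsibilities(x, phi, mu, Sigma):
    """E-step: w[i, j] = p(z^(i) = j | x^(i); theta) for a data matrix x of shape (m, d)."""
    m, k = x.shape[0], len(phi)
    w = np.empty((m, k))
    for j in range(k):
        # Numerator of the formula above: phi_j * N(x^(i); mu_j, Sigma_j)
        w[:, j] = phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
    w /= w.sum(axis=1, keepdims=True)  # normalize over j: the denominator sums over all components
    return w
```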

We can compute the derivative of the expected complete log likelihood with respect to the mean of the $l$th component, $\mu_l$:

$$\begin{aligned}
&\nabla_{\mu_l}\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\log\frac{\frac{1}{(2\pi)^{d/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w_j^{(i)}}\\
=&-\nabla_{\mu_l}\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\\
=&\sum_{i=1}^m w_l^{(i)}\Sigma_l^{-1}(x^{(i)}-\mu_l)
\end{aligned}$$

Setting this derivative equal to 0, we get:

$$\mu_l=\frac{\sum_{i=1}^m w_l^{(i)}x^{(i)}}{\sum_{i=1}^m w_l^{(i)}}$$

The other parameters in $\theta$, namely $\Sigma_l$ and $\phi_l$, can be derived analytically by the same method.
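For completeness, here is a sketch of the full M-step: the $\mu_l$ update derived above, together with the standard closed-form updates for $\Sigma_l$ and $\phi_l$ that follow from the same procedure. The short driver loop at the end shows how it combines with the E-step sketch above, using the hypothetical data `x` from the first code snippet and a naive initialization.

```python
import numpy as np

def m_step(x, w):
    """M-step: re-estimate theta from data x of shape (m, d) and responsibilities w of shape (m, k)."""
    m, d = x.shape
    k = w.shape[1]
    N = w.sum(axis=0)            # effective number of samples assigned to each component
    phi = N / m                  # phi_l = (1/m) * sum_i w_l^(i)
    mu = (w.T @ x) / N[:, None]  # mu_l = sum_i w_l^(i) x^(i) / sum_i w_l^(i)
    Sigma = np.empty((k, d, d))
    for l in range(k):
        diff = x - mu[l]
        Sigma[l] = (w[:, l, None] * diff).T @ diff / N[l]
    return phi, mu, Sigma

# Hypothetical usage on the sampled data x from the first sketch.
phi_hat, mu_hat, Sigma_hat = np.ones(2) / 2, x[:2].copy(), np.stack([np.eye(2), np.eye(2)])
for _ in range(100):
    w = compute_responsibilities(x, phi_hat, mu_hat, Sigma_hat)  # E-step
    phi_hat, mu_hat, Sigma_hat = m_step(x, w)                    # M-step
```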

Summary

  • The EM algorithm can be used to find the parameters of models with hidden variables that are missing from the data.
  • The EM algorithm constructs a lower bound of the incomplete log likelihood and maximizes this lower bound to increase the original incomplete log likelihood.
  • The EM algorithm has two steps, the E-step and the M-step. In the E-step, we set $q(z|x)$ to the best estimate of the hidden variables, $p(z|x,\theta)$. In the M-step, we maximize the expected complete log likelihood over $\theta$, typically by setting the derivative with respect to each parameter to zero.
