
SVM Revisit

By Z.H. Fu
https://fuzihaofzh.github.io/blog/

SVM (Support Vector Machine) was an important tool in pattern recognition before the wave of different kinds of neural networks. In this post, we revisit the derivation of SVM and extract some important intuitions from the result.

The original problem

Suppose we have some training data $\mathcal{D}=\{(\mathbf{x}_1, y_1),(\mathbf{x}_2, y_2),\cdots,(\mathbf{x}_n, y_n)\}$, in which $\mathbf{x}_i$ is a feature vector and $y_i \in \{-1, 1\}$ is the class label of each sample. Our task is to find a classifier that uses a separating hyperplane to divide the points in $\mathcal{D}$ into two parts, so that the points on each side of the hyperplane share the same class label $y$. This classifier is denoted as $f(\mathbf{x})=\text{sign}(\mathbf{w}^T\mathbf{x}+b)$. For any hyperplane that successfully separates the two classes, we define the geometric margin of the $i$th point as the distance from the point to the hyperplane:

$$\gamma_i=\frac{y_i(\mathbf{w}^T\mathbf{x}_i+b)}{||\mathbf{w}||}$$


We observe that for both classes $y=1$ and $y=-1$, the geometric margin is larger than 0. We can define the geometric margin of our dataset $\mathcal{D}$ as:

$$\gamma=\min_{i=1,\cdots,n} \gamma_i$$
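
As a quick numeric illustration, here is a minimal NumPy sketch (the toy data and the hyperplane $(\mathbf{w}, b)$ are made up purely for illustration) that computes the geometric margin of every point and of the whole dataset:

```python
import numpy as np

# Toy data: two linearly separable classes in 2D (made-up values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# An arbitrary separating hyperplane w^T x + b = 0.
w = np.array([1.0, 1.0])
b = -0.5

# Geometric margin of each point: y_i (w^T x_i + b) / ||w||.
gamma_i = y * (X @ w + b) / np.linalg.norm(w)

# Geometric margin of the dataset: the smallest per-point margin.
gamma = gamma_i.min()
print("per-point margins:", gamma_i, " dataset margin:", gamma)
```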

We can see that there are many possible hyperplanes. Our ultimate goal is to find the best hyperplane (with parameters $(\mathbf{w}^*,b^*)$) that maximizes the geometric margin of the dataset $\mathcal{D}$. Only when the geometric margin of the dataset is maximized can the points lying between the two classes be classified correctly. We can write our problem in the standard optimization form as:

$$\begin{aligned} (P)\quad\quad &\max_{\mathbf{w},b}\ \gamma\\ \text{s.t.}\ & \frac{y_i(\mathbf{w}^T\mathbf{x}_i+b)}{||\mathbf{w}||} \ge \gamma,\quad i \in [1, n] \end{aligned}$$

Convert the original problem into its dual form

To make our problem easier to solve, we can convert it into its dual form. First of all, we define the function margin to help with the further derivations. The function margin of the $i$th point is defined as:

$$\hat{\gamma}_i=y_i(\mathbf{w}^T\mathbf{x}_i+b)$$

The function margin of the whole dataset is defined as:

$$\hat\gamma=\min_{i=1,\cdots,n} \hat\gamma_i$$

It can be easily observed that $\gamma=\frac{\hat\gamma}{||\mathbf{w}||}$.
The original problem can be rewritten as:

$$\begin{aligned} &\max_{\mathbf{w},b}\ \frac{\hat\gamma}{||\mathbf{w}||}\\ \text{s.t.}\ & y_i(\mathbf{w}^T\mathbf{x}_i+b) \ge \hat\gamma,\quad i \in [1, n] \end{aligned}$$

We observe that $\hat\gamma$ can be assigned any value, because if we change $\hat\gamma$ to $\lambda\hat\gamma$, we can rescale $(\mathbf{w}, b)$ to $(\lambda \mathbf{w}, \lambda b)$ and leave the problem unchanged. As a result, we can set $\hat\gamma=1$ to simplify the problem, i.e.
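
A small numeric check of this scaling argument (reusing the same made-up toy data and hyperplane as in the earlier sketch): scaling $(\mathbf{w}, b)$ by $\lambda$ scales the function margin by $\lambda$ but leaves the geometric margin, and the classifier itself, unchanged.

```python
import numpy as np

# Same made-up toy data and hyperplane as in the previous sketch.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -0.5

lam = 10.0  # an arbitrary positive scaling factor
for w_s, b_s in [(w, b), (lam * w, lam * b)]:
    f_hat = (y * (X @ w_s + b_s)).min()      # function margin of the dataset
    gamma = f_hat / np.linalg.norm(w_s)      # geometric margin of the dataset
    print(f"function margin = {f_hat:.3f}, geometric margin = {gamma:.3f}")
# The function margin is scaled by lam; the geometric margin is unchanged.
```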

$$\begin{aligned} &\max_{\mathbf{w},b}\ \frac{1}{||\mathbf{w}||}\\ \text{s.t.}\ & y_i(\mathbf{w}^T\mathbf{x}_i+b) \ge 1,\quad i \in [1, n] \end{aligned}$$

Maximizing $\frac{1}{||\mathbf{w}||}$ is the same as minimizing $||\mathbf{w}||$ (or, equivalently, $\frac{1}{2}||\mathbf{w}||^2$), and the problem changes to:

$$\begin{aligned} &\min_{\mathbf{w},b}\ \frac{1}{2}||\mathbf{w}||^2\\ \text{s.t.}\ & y_i(\mathbf{w}^T\mathbf{x}_i+b) \ge 1,\quad i \in [1, n] \end{aligned}$$
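
This is now a convex quadratic program. As a sketch of how it could be solved directly with a generic constrained optimizer (this is not how dedicated SVM solvers work, and the toy data is again made up), one can hand it to `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

# Decision variables packed as theta = [w_1, ..., w_d, b].
def objective(theta):
    w = theta[:d]
    return 0.5 * np.dot(w, w)          # (1/2) ||w||^2

# One inequality constraint per sample: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq",
     "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:d] + theta[d]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(d + 1),
               constraints=constraints, method="SLSQP")
w_opt, b_opt = res.x[:d], res.x[d]
print("w* =", w_opt, " b* =", b_opt)
```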

Next, we want to merge the constraints into the objective function to make it an unconstrained problem. We want to design a function $g_i(\mathbf{w},b)$ s.t.

$$g_i(\mathbf{w},b)=\left\{ \begin{aligned} 0&\quad \text{if } (\mathbf{w},b) \text{ fulfills the } i\text{th constraint}\\ \infty&\quad \text{otherwise} \end{aligned} \right.$$

It can be written in a more compact form with the $\max$ notation:

$$g_i(\mathbf{w},b)=\max_{\alpha_i\ge0}\alpha_i[1-y_i(\mathbf{w}^T\mathbf{x}_i+b)]$$

Here, $\alpha_i$ is called the Lagrange multiplier. If $(\mathbf{w},b)$ fulfills the constraint, then $1-y_i(\mathbf{w}^T\mathbf{x}_i+b)\le0$ and $\alpha_i=0$ attains the max value 0. On the other hand, if $(\mathbf{w},b)$ violates the constraint, then $1-y_i(\mathbf{w}^T\mathbf{x}_i+b)>0$ and letting $\alpha_i\to\infty$ drives the max value to $\infty$.
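
To make this behavior concrete, here is a tiny sketch (the helper name `g_i` and the test values are only illustrative): when the constraint holds, the inner maximum is 0, attained at $\alpha_i=0$; when it is violated, the value can be made arbitrarily large.

```python
import numpy as np

def g_i(w, b, x_i, y_i):
    """max over alpha_i >= 0 of alpha_i * (1 - y_i (w^T x_i + b))."""
    slack = 1.0 - y_i * (np.dot(w, x_i) + b)
    # If the constraint holds (slack <= 0), the best alpha_i is 0;
    # otherwise the value grows without bound as alpha_i -> infinity.
    return 0.0 if slack <= 0 else np.inf

print(g_i(np.array([1.0, 1.0]), -0.5, np.array([2.0, 2.0]), 1))  # 0.0 (feasible)
print(g_i(np.array([0.1, 0.1]),  0.0, np.array([2.0, 2.0]), 1))  # inf (violated)
```
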
Thus, the optimization problem can be written as:

$$\begin{aligned}&\min_{\mathbf{w},b}\frac{1}{2}||\mathbf{w}||^2+\sum_{i=1}^n g_i(\mathbf{w},b)\\ =&\min_{\mathbf{w},b}\frac{1}{2}||\mathbf{w}||^2+\sum_{i=1}^n\max_{\alpha_i\ge0}\alpha_i[1-y_i(\mathbf{w}^T\mathbf{x}_i+b)]\\ =&\max_{\alpha_i\ge0}\min_{\mathbf{w},b}\frac{1}{2}||\mathbf{w}||^2+\sum_{i=1}^n\alpha_i[1-y_i(\mathbf{w}^T\mathbf{x}_i+b)]\\ =&\max_{\alpha_i\ge0}\min_{\mathbf{w},b}J(\mathbf{w},b,\mathbf{\alpha}) \end{aligned}$$

(Swapping the $\min$ and the $\max$ in the third line is justified because the problem is convex with affine constraints, so strong duality holds.)
In which $J(\mathbf{w},b,\mathbf{\alpha})=\frac{1}{2}||\mathbf{w}||^2+\sum_{i=1}^n\alpha_i[1-y_i(\mathbf{w}^T\mathbf{x}_i+b)]$. The inner minimization problem can be solved by setting the derivatives to 0:

$$\begin{aligned} \frac{\partial J(\mathbf{w},b,\mathbf{\alpha})}{\partial \mathbf{w}}&=\mathbf{w}-\sum_{i=1}^n\alpha_iy_i\mathbf{x}_i=0\\ \frac{\partial J(\mathbf{w},b,\mathbf{\alpha})}{\partial b}&=-\sum_{i=1}^n\alpha_iy_i=0 \end{aligned}$$

We get $\mathbf{w}^*=\sum_{i=1}^n\alpha_iy_i\mathbf{x}_i$; substituting it into the optimization problem, we get:

$$\max_{\alpha_i\ge0}\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_iy_j\alpha_i\alpha_j\mathbf{x}_i^T\mathbf{x}_j$$
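
In more detail, the substitution uses $\mathbf{w}=\sum_{i=1}^n\alpha_iy_i\mathbf{x}_i$ together with the second condition $\sum_{i=1}^n\alpha_iy_i=0$ (which kills the term involving $b$):

$$\begin{aligned} J(\mathbf{w},b,\mathbf{\alpha})&=\frac{1}{2}\Big\|\sum_{i=1}^n\alpha_iy_i\mathbf{x}_i\Big\|^2+\sum_{i=1}^n\alpha_i-\sum_{i=1}^n\alpha_iy_i\Big(\sum_{j=1}^n\alpha_jy_j\mathbf{x}_j\Big)^T\mathbf{x}_i-b\sum_{i=1}^n\alpha_iy_i\\ &=\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_iy_j\alpha_i\alpha_j\mathbf{x}_i^T\mathbf{x}_j+\sum_{i=1}^n\alpha_i-\sum_{i=1}^n\sum_{j=1}^ny_iy_j\alpha_i\alpha_j\mathbf{x}_i^T\mathbf{x}_j\\ &=\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_iy_j\alpha_i\alpha_j\mathbf{x}_i^T\mathbf{x}_j \end{aligned}$$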

We write it as a formal optimization problem:

$$\begin{aligned} (D)\quad\quad \min_{\mathbf{\alpha}}\ \ & \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_iy_j\alpha_i\alpha_j\mathbf{x}_i^T\mathbf{x}_j-\sum_{i=1}^n\alpha_i \\ \text{s.t.} \ \ &\sum_{i=1}^n\alpha_iy_i=0\\ &\alpha_i\ge0,\quad i \in [1,n] \end{aligned}$$

This is the dual optimization problem of problem (P), and it is a standard quadratic program that can be solved by many off-the-shelf solvers.
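
As a sketch (assuming the `cvxopt` package is available; the toy data is the same made-up example as before), the dual can be put into the standard QP form $\min_{\mathbf{\alpha}} \frac{1}{2}\mathbf{\alpha}^T P\mathbf{\alpha}+q^T\mathbf{\alpha}$ with $P_{ij}=y_iy_j\mathbf{x}_i^T\mathbf{x}_j$ and $q=-\mathbf{1}$, and handed to a generic solver:

```python
import numpy as np
from cvxopt import matrix, solvers

# Same made-up toy data as in the earlier sketches.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual in standard QP form: min (1/2) a^T P a + q^T a.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))        # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))  # sum_i alpha_i y_i = 0
b = matrix(np.zeros(1))

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

# Recover the primal solution w* = sum_i alpha_i y_i x_i.
w_star = (alpha * y) @ X
print("alpha =", np.round(alpha, 4))
print("w*    =", w_star)
```

The entries of $\mathbf{\alpha}$ corresponding to non-support vectors come out numerically close to zero, and the recovered $\mathbf{w}^*$ should match the solution obtained from the primal problem.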

Some observations and intuitions

  1. The variable of the primal problem (P) is $(\mathbf{w},b)$, while the dual problem (D) optimizes over the dual variable $\mathbf{\alpha}$. The length of the dual variable equals the number of constraints of the primal problem.

  2. If $\alpha_i=0$, then $y_i(\mathbf{w}^T\mathbf{x}_i+b)>1$, i.e. the function margin of this sample is not equal to the function margin of the dataset ($\hat\gamma=1$). It means that the margin of this sample is not the smallest among all samples.
    If $\alpha_i>0$, the $i$th constraint is active and $y_i(\mathbf{w}^T\mathbf{x}_i+b)=1$. The function margin of the $i$th sample equals the function margin of the dataset. Such a sample is called a support vector.

  3. We observe the optimal value of $\mathbf{w}^*$:

$$\mathbf{w}^*=\sum_{i=1}^n\alpha_iy_i\mathbf{x}_i$$

In which $\alpha_i=0$ for all non-support vectors, which means that the value of $\mathbf{w}^*$ is determined only by the support vectors.
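
This can be checked empirically; the sketch below (assuming scikit-learn and the same made-up toy data) fits a linear SVM with a very large $C$ to approximate the hard-margin problem derived here, and verifies that `coef_` coincides with $\sum_i\alpha_iy_i\mathbf{x}_i$ computed over the support vectors alone (`dual_coef_` stores the products $\alpha_iy_i$).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM derived in this post.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print("support vectors:        ", clf.support_vectors_)
print("w from support vectors: ", w_from_sv)
print("w learned by the solver:", clf.coef_)
```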
