Motivation
WIP
I’m still working my way towards the implementation of a support vector machine, doing my best to leave no gaps in understanding. In the last post I discussed the general application of Lagrangian systems of equations to optimization problems. That approach (solving the resulting system of equations) would have been enough if the optimization problem found in SVMs yielded a system in which the solution lay directly on the intersection of all of the constraints. (Un)fortunately that is not the case, and we have to do something more interesting.
TL;DR:
We translate the previously discussed optimization problem into what is known as the “dual problem”. It will turn out that the dual problem can be expressed in terms of inner products between feature vectors. There will then be some neat kernel computations of inner products which will let us map feature spaces into higher dimensions relatively inexpensively. This will in turn allow the training steps of sequential minimal optimization, SMO (an optimization algorithm specific to constrained quadratic problems like this one), to be calculated quickly in high-dimensional spaces, which we can use to build non-linear classifiers.
The Primal and Dual problems
Previously I discussed Lagrangian representations of optimization problems with inequality constraints. Generally, and in all of the referenced discussions of the primal and dual problems, we can also include strict(er) equality constraints $h_i(x) = 0$. I’ll also drop the explicit features $x$ and $y$, from here on out writing in terms of a generalized vector $\vec{x}$ of any dimensionality. Against my better judgement I will also drop the vector mark from this $x$. With these changes the problem statement and resulting Lagrangian become:
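$$
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p
\end{aligned}
$$

$$
\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)
$$

(I’m writing $m$ for the number of inequality constraints and $p$ for the number of equality constraints, with multipliers $\lambda_i$ and $\nu_i$ respectively; this is the standard form.)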
The Primal Problem is the maximization of the Lagrangian over the Lagrange multipliers $\lambda$ and $\nu$, holding all $\lambda_i \geq 0$. This quantity, which is now a function only of $x$, can then be minimized. Wait, weren’t we minimizing $f(x)$ directly? Doesn’t this change our result? It turns out that it does not.
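In symbols, the primal problem is:

$$
\min_{x} \; \max_{\lambda, \nu \,:\, \lambda_i \geq 0} \mathcal{L}(x, \lambda, \nu)
$$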
The inner maximization problem, $\Theta_P(x)$, or the “primal objective”, turns out to be just an encoding of the original minimization problem with some special behavior which captures the constraints:
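$$
\Theta_P(x) = \max_{\lambda, \nu \,:\, \lambda_i \geq 0} \left[ f(x) + \sum_i \lambda_i g_i(x) + \sum_i \nu_i h_i(x) \right] = \begin{cases} f(x) & \text{if } x \text{ is feasible} \\ +\infty & \text{otherwise} \end{cases}
$$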
If $x$ is feasible - it satisfies $g_i(x) \leq 0$ and $h_i(x) = 0$ - then the sum of the inequality constraint terms is nonpositive, and the equality constraint terms are of course zero, the constraints having been satisfied. Nothing fancy was done there; when the constraints are satisfied you simply have a sum of nonnegative multipliers times nonpositive values, and that sum has to be less than or equal to zero. Because of these two results, for a feasible point $x^*$ the primal objective is exactly the original minimization problem. For infeasible points $x$, however, the violations - $g_i(x) > 0$, or $h_i(x) \neq 0$ - allow either of the sums to go to infinity by choosing an arbitrarily large Lagrange multiplier:
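$$
\sup_{\lambda_i \geq 0} \lambda_i \, g_i(x) = +\infty \;\; \text{when } g_i(x) > 0, \qquad \sup_{\nu_i} \nu_i \, h_i(x) = +\infty \;\; \text{when } h_i(x) \neq 0
$$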
We could describe the primal maximization as a barrier function which prevents the consideration of infeasible points: for these, the whole system blows up as a result of the structure of our inner-maximization / outer-minimization problem. Finally, writing $x^*$ for the feasible minimizer, we define the primal optimal value $P^* = \Theta_P(x^*)$.
The Dual Problem is the minimization of the Lagrangian over $x$, holding all $\lambda_i \geq 0$. This quantity, which is now a function only of the multipliers $\lambda$ and $\nu$, can then be maximized.
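In symbols, the dual problem is:

$$
\max_{\lambda, \nu \,:\, \lambda_i \geq 0} \; \Theta_D(\lambda, \nu), \qquad \Theta_D(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu)
$$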
So long as the vector $\lambda$ is feasible - $\lambda_i \geq 0$ - there is now a cost associated with having a positive (violated) $g_i(x)$; the quantity being minimized grows as a result. Looking at a feasible input $x^*$, however, the Lagrange terms behave in the same way as they did in the primal objective:
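$$
\Theta_D(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu) \leq \mathcal{L}(x^*, \lambda, \nu) = f(x^*) + \sum_i \lambda_i g_i(x^*) + \sum_i \nu_i h_i(x^*) \leq f(x^*)
$$

The equality terms vanish because $h_i(x^*) = 0$, and each $\lambda_i g_i(x^*) \leq 0$.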
We’ve ended up with the minimum of the original function minus some nonnegative value, so for any feasible choice of multipliers:
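$$
\Theta_D(\lambda, \nu) \leq f(x^*) = P^* \quad \text{for all } \nu \text{ and all } \lambda_i \geq 0
$$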
So, instead of solving the primal problem, we could solve the dual problem: any feasible choice of multipliers gives us a lower bound on the optimal value $P^*$. Maximizing that bound, as originally posed in the dual problem, then gives us the tightest lower bound on $P^*$.
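To make weak duality concrete, here is a quick sanity check on a toy problem of my own choosing (it is not the SVM problem): minimize $f(x) = x^2$ subject to $x \geq 1$, whose constrained minimum is plainly $P^* = 1$. Sweeping the dual objective over feasible multipliers never exceeds that value:

```python
# Toy weak-duality check (hypothetical example, not the SVM problem).
# Primal: minimize f(x) = x^2  subject to  g(x) = 1 - x <= 0  (i.e. x >= 1),
# so the primal optimum is P* = f(1) = 1.
#
# Lagrangian: L(x, lam) = x^2 + lam * (1 - x). For fixed lam >= 0 it is
# minimized at x = lam / 2, which gives the dual objective analytically:
#   Theta_D(lam) = lam - lam^2 / 4
import numpy as np

P_star = 1.0
lambdas = np.linspace(0.0, 4.0, 401)     # sweep of feasible multipliers
theta_D = lambdas - lambdas**2 / 4.0     # dual objective at each lambda

# Weak duality: every Theta_D(lam) is a lower bound on P*.
assert np.all(theta_D <= P_star + 1e-12)

best = lambdas[np.argmax(theta_D)]
print(f"D* = {theta_D.max():.4f} at lambda = {best:.2f}, P* = {P_star}")
# Here the tightest lower bound D* actually equals P* (the problem is
# convex, so the bound is tight), but in general we only get D* <= P*.
```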