Kernel methods, SVM
Consider ridge regression
We want to learn $f(x) = w^\top x = \sum_{j=1}^{P} w_j x_j$

Obtain $w$ as
$$w = \arg\min_w \sum_{i=1}^{N} \big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert^2$$

In matrix notation, with $X$ the $N \times P$ data matrix (the $r$-th row is the $r$-th training example) and $y$ the $N \times 1$ target vector:
$$w = \arg\min_w \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2$$
Notation:
X is a matrix, x is a vector
Solve by setting derivatives to zero, to get
$$w = (X^\top X + \lambda I)^{-1} X^\top y$$
(dimensions: $X^\top X$ is $(P \times N)(N \times P) = P \times P$; $X^\top y$ is $(P \times N)(N \times 1)$; $w$ is $P \times 1$)

For a new example $x_{new}$:
$$y_{new} = w^\top x_{new} = \big((X^\top X + \lambda I)^{-1} X^\top y\big)^\top x_{new}$$
Getting to the dual form: starting from $w = (X^\top X + \lambda I)^{-1} X^\top y$,
$$(X^\top X + \lambda I)\, w = X^\top y$$
$$\lambda w = X^\top y - X^\top X w = X^\top (y - Xw)$$
$$w = \lambda^{-1} X^\top (y - Xw) = X^\top \alpha \quad \text{where } \alpha = \lambda^{-1}(y - Xw)$$

$\alpha$ gives the dual solution, from which we can obtain $w = X^\top \alpha = \sum_{i=1}^{N} \alpha_i x_i$
(here, $x_i$ is the $i$-th example)
$$w = X^\top \alpha = \begin{bmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1P} & \cdots & x_{NP} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}$$
Substituting $w = X^\top \alpha$ in $\alpha = \lambda^{-1}(y - Xw)$, we get
$$\lambda \alpha = y - X X^\top \alpha$$
$$(X X^\top + \lambda I)\,\alpha = y$$

We can compute $\alpha$ as: $\ \alpha = (K + \lambda I)^{-1} y\ $ where $K = X X^\top$, i.e.
$$K = \begin{bmatrix} x_{11} & \cdots & x_{1P} \\ \vdots & & \vdots \\ x_{N1} & \cdots & x_{NP} \end{bmatrix} \begin{bmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1P} & \cdots & x_{NP} \end{bmatrix}, \qquad K_{ij} = \langle x_i, x_j \rangle = (x_i \cdot x_j) \ \text{(dot product)}$$
K: matrix of inner products of N vectors (Gram Matrix)
K: matrix of similarities of example pairs
(since dot product gives similarity between vectors)
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_N \rangle \\ \vdots & & \vdots \\ \langle x_N, x_1 \rangle & \cdots & \langle x_N, x_N \rangle \end{bmatrix}$$
Now, for a new example,
$$y_{new} = f(x_{new}) = w^\top x_{new} = \sum_{i=1}^{N} \alpha_i \langle x_i, x_{new} \rangle$$
(since $w = X^\top \alpha = \sum_{i=1}^{N} \alpha_i x_i$)
So in the dual form:
Compute $\alpha = (K + \lambda I)^{-1} y$, where $K = X X^\top$, i.e. $K_{ij} = \langle x_i, x_j \rangle$
Evaluate on a new data point $x_{new}$ as $\ y_{new} = f(x_{new}) = \sum_{i=1}^{N} \alpha_i \langle x_i, x_{new} \rangle$
Both of these require inner products between data points
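As a quick numeric check that the dual form gives the same predictions as the primal solution, here is a minimal sketch (the data, λ, N, and P values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 20, 5, 0.1
X = rng.normal(size=(N, P))        # N examples, P features
y = rng.normal(size=N)             # targets
x_new = rng.normal(size=P)         # a new example

# Primal: w = (X^T X + lambda I)^-1 X^T y   (a P x P system)
w = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# Dual: alpha = (K + lambda I)^-1 y with K = X X^T   (an N x N system)
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)

# Both give the same prediction for x_new
print(w @ x_new)                   # primal: w . x_new
print(alpha @ (X @ x_new))         # dual: sum_i alpha_i <x_i, x_new>
```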
Substitute the inner product with a kernel function
$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$
Then, we obtain a procedure for ridge regression over a new feature space F given by ɸ: x -> ɸ(x) ∈ F
(kernel ridge regression)
(e.g., the mapping ɸ: R² -> R³ induced by the quadratic kernel below)
We can calculate the similarity (dot product) between data points in the new feature space,
….. in terms of the Kernel function on the original data points.
i.e. in terms of K(x,z)
(why is this a ‘big deal’?)
Kernel trick
Quadratic kernel
Consider x=(x1,x2) and z=(z1,z2) in 2D space.
Then, $\langle x, z \rangle^2 = (x^\top z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2$
$= \langle\, (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2),\ (z_1^2,\ z_2^2,\ \sqrt{2}\, z_1 z_2)\, \rangle$
$= \langle \phi(x), \phi(z) \rangle = \phi(x)^\top \phi(z)$
Here, $\phi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2]$
This converts the linear regression into a quadratic regression in the original variables.
ɸ(.) projects 2D data to 3D
But we do not need to perform computations in terms of 3D space.
Use the kernel function $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K_\phi(x, z)$ to calculate the inner product (similarity) in 3D space in terms of 2D space operations.
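A minimal numeric check of this identity (the two vectors are made-up examples):

```python
import numpy as np

def phi(v):
    """Explicit 2D -> 3D feature map for the quadratic kernel."""
    v1, v2 = v
    return np.array([v1**2, v2**2, np.sqrt(2) * v1 * v2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(x, z) ** 2)            # kernel computed in the original 2D space
print(np.dot(phi(x), phi(z)))       # same value via the explicit 3D mapping
```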
Use the kernel function to train the regression model and apply it:
$$\alpha = (K + \lambda I)^{-1} y \qquad y_{new} = f(x_{new}) = \sum_{i=1}^{N} \alpha_i\, K(x_i, x_{new})$$
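A minimal kernel ridge regression sketch along these lines, using the quadratic kernel from above (the data, λ, and function names are illustrative, not from the original slides):

```python
import numpy as np

def quad_kernel(a, b):
    """Quadratic kernel K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

def fit_kernel_ridge(X, y, kernel, lam):
    """Return dual coefficients alpha = (K + lambda I)^-1 y."""
    N = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict(X, alpha, kernel, x_new):
    """f(x_new) = sum_i alpha_i K(x_i, x_new)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X))

# Illustrative data: y depends quadratically on x
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1]

alpha = fit_kernel_ridge(X, y, quad_kernel, lam=0.1)
print(predict(X, alpha, quad_kernel, np.array([0.5, -0.5])))
```

Note that the fit never forms the 3D features explicitly; all computation stays in the original 2D space.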
For classification - learn a non-linear separation by using a kernel
[Figure: 'o' and 'x' class points that are not linearly separable in the original space; after mapping each point by ɸ, the ɸ(o) and ɸ(x) points can be separated by a line in the transformed space]
more on this next -- SVM
Linear separation in transformed space
Some commonly used kernel functions
Polynomial: $K(u, v) = (u \cdot v)^d$, or $K(u, v) = (u \cdot v + c)^d$. Here $c$ is a parameter that controls the influence of higher-order terms in the polynomial.
Corresponds to polynomial regression in the implicit (transformed) feature space --- but without the computational burden of higher-order terms and added parameters (model weights)

Gaussian (radial basis function): $K(u, v) = \exp\!\left(-\dfrac{\lVert u - v \rVert^2}{2\sigma^2}\right)$
(or often written, with a gamma parameter, as $\exp\!\left(-\gamma \lVert u - v \rVert^2\right)$)
The transformed space has infinite dimension – it considers polynomials of all orders

Sigmoid: $K(u, v) = \tanh(\eta\, u \cdot v + r)$
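Minimal sketches of these kernel functions (the default parameter values are illustrative assumptions):

```python
import numpy as np

def polynomial_kernel(u, v, d=3, c=1.0):
    """K(u, v) = (u . v + c)^d"""
    return (np.dot(u, v) + c) ** d

def rbf_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta=0.01, r=0.0):
    """K(u, v) = tanh(eta * u . v + r)"""
    return np.tanh(eta * np.dot(u, v) + r)
```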
Consider data in 2D space. x=(x1,x2) and z=(z1,z2)
Polynomial kernel $K(x, z) = (x \cdot z + 1)^3$
This kernel maps the 2D data to 10 dimensions, built from the monomials
$1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2,\ x_1 x_2^2,\ x_1^2 x_2,\ x_1^3,\ x_2^3$
We do not need to explicitly consider the 10D space. We calculate similarity between vectors in this 10D space in terms of their 2D vectors.
Kernel functions
- Allow operating on data as though it were projected onto a higher dimensional space, but by using calculations on the original space
- Allow calculating dot products in the transformed space, without having to actually project the data into the new space
- Kernel functions need to satisfy certain properties; can construct new kernels from basic kernels
- Learning problems can be formulated as optimization tasks. Consider primal and dual formulations of the optimization problem. The dual problem is solved in terms of the higher dimensional space; kernel functions allow computing dot products in this new space in terms of the original dimensions
- Kernel PCA, kernel functions in Support Vector Machines
Kernel PCA
Support vector machines
Find the maximum margin hyperplane separating two classes
Use of kernel functions to transform data into higher dimensional space
Linear separation in higher dimensions
Linear separation – which one is better ? (why?)
Larger margin between classes is better
- maximum margin hyperplane
Points at/near the boundary are the support vectors
- these determine the optimal separating hyperplane
Learn the separating hyperplane which has the maximum margin from points belonging to the two classes on different sides of the hyperplane
[Figure: separating hyperplane H ($w^\top x + b = 0$) with margin M, bounded by hyperplanes H+ ($w^\top x + b = 1$) and H− ($w^\top x + b = -1$); class +1 data on the side $w^\top x + b \ge 1$, class −1 data on the side $w^\top x + b \le -1$]

Find the model (i.e. $w$, $b$) that maximizes the margin M.
Consider two points $p$, $q$ on the +1 plane H+. Then $w^\top p + b = 1$ and $w^\top q + b = 1$, so $w^\top (p - q) = 0$, i.e. $w$ is normal to the hyperplanes.
Constraints: $w^\top x + b \ge 1$ for all data $x$ belonging to class +1, and $w^\top x + b \le -1$ for all data of class −1; equivalently, $y_i (w^\top x_i + b) \ge 1$ for all $i$.
The margin works out to $M = 2 / \lVert w \rVert$, so maximizing M is equivalent to minimizing $w^\top w / 2$ subject to these constraints.
We will re-cast this optimization in terms of a Lagrangian formulation
- each constraint is replaced by a Lagrange multiplier – easier to handle
- training data appears as dot products between pairs of data points
-- allows use of Kernel functions
What to do with points that may not be linearly separable?
- can still use quadratic programming with a penalty for errors
- introduce slack variables $\varepsilon_i$

[Figure: separating hyperplane H, with slacks $\varepsilon_p$, $\varepsilon_q$ for a class +1 point and a class −1 point that fall on the wrong side of their margin]

Minimize $\ \dfrac{w^\top w}{2} + C \sum_{i=1}^{N} \varepsilon_i$

subject to the constraints:
$y_i (w^\top x_i + b) \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0 \ \text{ for all } i$
The constant C > 0 sets the tradeoff between maximizing the margin (first term in the objective) and minimizing the slack.
Note – examples with positive slack will be ‘within’ the margin (i.e. on the ‘wrong’ side of the margin)
(soft margin SVM)
Formulate the optimization in terms of a Lagrangian:
$$\min_{w, b}\ \frac{w^\top w}{2} \quad \text{s.t.}\quad y_i (w^\top x_i + b) \ge 1 \ \text{ for all } i$$
Constraints are violated when $y_i (w^\top x_i + b) - 1 < 0$

Use Lagrange multipliers $\alpha_i$ to express the constraints. Then the optimization problem can be written in terms of the Lagrangian
$$L_P = \frac{w^\top w}{2} - \sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big]$$
With all $\alpha_i \ge 0$, any violated constraint will increase $L_P$ – so we seek $\min L_P$
$\alpha_i \ge 0$ for all $i$
(For the soft margin case, we constrain $0 \le \alpha_i \le C$)

Minimize the objective by taking derivatives with respect to $w$ and $b$:
$\dfrac{\partial L_P}{\partial w} = 0$ gives $w = \sum_i \alpha_i y_i x_i$, and $\dfrac{\partial L_P}{\partial b} = 0$ gives $\sum_i \alpha_i y_i = 0$.
Substituting these in the expression above, we get the dual formulation (SVM in terms of the dual):
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j)$$
$$\text{s.t.}\quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0 \ \text{ for all } i$$
(For the soft margin case, $0 \le \alpha_i \le C$)
Note the term $x_i \cdot x_j$ – the dot product of training data points.
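A rough sketch of solving this dual with a generic optimizer (scipy's SLSQP; the toy data, C value, and tolerances are illustrative, and production SVM solvers use specialized methods such as SMO instead):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N, C = len(y), 10.0

G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):
    # Negative of the dual objective: -(sum_i alpha_i - 1/2 alpha^T G alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))  # a margin support vector
b = y[sv] - w @ X[sv]                        # from y_k (w . x_k + b) = 1
print(alpha.round(3), w, b)
```

Most of the resulting $\alpha_i$ come out (numerically) zero; the few non-zero ones are the support vectors discussed below.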
The $w$ vector is expressed in terms of the $\alpha_i$: $\ w = \sum_i \alpha_i y_i x_i$, and the separating boundary is $w \cdot x + b = 0$.
To evaluate a new data point $x_{new}$:
$$f(x_{new}) = w \cdot x_{new} + b = \sum_i \alpha_i y_i\, (x_i \cdot x_{new}) + b$$
Note the dot product of the new data point with the training data points; $i$ sums over the training examples.
But we only need to consider those $i$ for which $\alpha_i$ is non-zero
- points along the hyperplanes H+ and H− (the support vectors)
- for the soft margin case, there may be points within the margin for which $\alpha_i$ is non-zero
Points at/near the boundary are the support vectors
- these determine the optimal separating hyperplane
Change the other data points, and we get the same model
For applying the model to new examples, we only need to store the support vectors
Note:
- we only need the dot products between examples
- $\alpha_i$ values are non-zero only for data points on the separating boundary, i.e. on one of the hyperplanes H+ or H−; these support vectors determine the separating boundary
If other training data points are removed, the same separating boundary will be returned as optimal
- for classification decisions, only the support vectors with $\alpha_i > 0$ matter
Other data are not needed once training is complete
- as many parameters $\alpha_i$ ($i = 1$ to $N$) as the number of training examples?
… but the number of non-zero parameters is bounded by the number of support vectors
- number of parameters does not depend on dimensions of the space
(can project to high dimensional space)
What if data is not linearly separable?
Use a kernel function to map the data to a higher dimensional feature space where the data will be linearly separable
With sufficiently high dimensions, the data will be linearly separable
The optimal model is given by $w = \sum_i \alpha_i y_i\, \phi(x_i)$, where the non-zero coefficients $\alpha_i$ (Lagrange multipliers) correspond to the support vectors. The $\alpha_i$ are found from the kernelized dual:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, [\phi(x_i) \cdot \phi(x_j)]$$
$$\text{s.t.}\quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0 \ \text{ for all } i \quad \text{(for the soft margin case, } 0 \le \alpha_i \le C\text{)}$$
Remember
- we do not need to operate in the high-dimensional space, do not need to explicitly know the high-dimensional mapping
Instead, use the kernel function to get the dot product: $\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j)$
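As a usage sketch, scikit-learn's SVC solves this kernelized dual internally (the RBF kernel choice, toy data, and parameter values here are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data that is not linearly separable: class depends on distance from origin
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)     # dual SVM with a Gaussian kernel
clf.fit(X, y)

print(len(clf.support_vectors_))              # only the support vectors are retained
print(clf.predict([[0.1, 0.2], [2.0, 2.0]]))  # classify new points
```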
Mapping to higher dimensions
Consider a polynomial of degree d, and a dataset with p variables.
With d=2 and p=50, we have a transformed feature space with 2500 features.
But, we only need to compute the dot products in terms of the kernel function
- operations on original space
Example – use of quadratic kernel
Consider data in 2D space, i.e. data points $x = (x_1, x_2)$ and $z = (z_1, z_2)$
Then, $(x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2$
$= \langle\, (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2),\ (z_1^2,\ z_2^2,\ \sqrt{2}\, z_1 z_2)\, \rangle$
$= \phi(x) \cdot \phi(z)$
Here, $\phi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2]$ – an explicit mapping into 3D space.

[Figures: data plotted on the original axes $x_1$, $x_2$ with no linear separation in the original space; after the mapping ɸ(.), plotted on axes $x_1^2$, $x_2^2$, $\sqrt{2}x_1x_2$, there is a linear separation in the transformed space]
SVM for classification
- general purpose classifier with good performance across problems
- works well with an appropriate choice of hyper-parameters: the penalty factor C, and the kernel function parameters

SVM parameters are the $w_i$ (or $\alpha_i$) and $b$ values. Parameters of the SVM model (like C) are called hyper-parameters. C controls the tradeoff between model complexity and training error
- too large a C can result in more support vectors, a more complex model, and overfitting
- try a range of values, and examine performance on a separate test set, or with cross-validation

Example: with a polynomial kernel, try values of the hyper-parameters d and C, e.g. d in {2, 3, 4, 5} and C in {0.1, 1, 10, 100, 1000}
(RapidMiner provides an operator for doing a grid search over parameters)
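The same grid search in scikit-learn might look roughly like this (a sketch; the toy dataset is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy binary classification data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] + X[:, 2] ** 2 > 0.5).astype(int)

param_grid = {"degree": [2, 3, 4, 5],
              "C": [0.1, 1, 10, 100, 1000]}

# 5-fold cross-validation over all (degree, C) combinations of a polynomial-kernel SVM
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```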
Support vector regression

[Figure: data in (x, y) with the line $y = w \cdot x + b$ and an ε-insensitive band between $+\varepsilon$ and $-\varepsilon$ around it]

Find a model $y = w \cdot x + b$ such that
- predicted values have at most ε deviation from the true values
- the function $w \cdot x + b$ is as 'flat' as possible (to avoid overfitting from having too many terms)

Minimize $\ \lVert w \rVert^2 / 2$
s.t. $\ y_i - (w \cdot x_i + b) \le \varepsilon, \quad y_i - (w \cdot x_i + b) \ge -\varepsilon \quad$ for all $i$ in $1..N$

[Figure: the same plot with points falling outside the ε band, their distance from the band marked by slack variables $\xi_i$, $\xi_i^*$]

For points like these, outside the band:
- add a penalty term

Minimize $\ \lVert w \rVert^2 / 2 + C \sum_i (\xi_i + \xi_i^*)$
s.t. $\ y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i, \quad y_i - (w \cdot x_i + b) \ge -\varepsilon - \xi_i^*, \quad \xi_i, \xi_i^* \ge 0 \quad$ for all $i$ in $1..N$
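A brief usage sketch with scikit-learn's SVR (the data and the C, epsilon values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative 1D regression data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# epsilon sets the width of the insensitive band, C the penalty for points outside it
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)

print(len(reg.support_))        # number of support vectors (points on or outside the band)
print(reg.predict([[0.5]]))     # predict at a new point
```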
Support vector regression
Can take similar approach as earlier, with the Lagrangian multipliers and solving the dual optimization problem.
The $\alpha_i$ will have non-zero values only for points outside the margin band (the support vectors)
Use of kernel functions to transform into higher dimensional space
- do the regression in higher dimensional space
Q. – how is SVM regression different from kernel ridge regression?
Support vector regression
Parameters are C, ε, and kernel function parameters.
C affects the tradeoff between model complexity (how 'flat' it is) and the extent of tolerance for points away from the band. If C is too large, we can get a complex model.
ε controls the width of the ε-insensitive band – a large band leads to fewer support vectors, and a 'flatter' model.
ε relates to the data value ranges – so, normalize the data beforehand.

[Figure: non-linear data in (x, y) mapped by ɸ(.) to (ɸ(x), y), where a linear ε-band regression fits; mapping back via ɸ⁻¹(.) gives a non-linear fit in the original space]
Variable selection in SVM
The magnitude of the weights in w relates to the importance of the corresponding attributes
Variable selection - evaluate models with different attribute sets using cross-validation
Recursive Feature Elimination (SVM-RFE)
Start with all attributes
Repeat until there are no more attributes:
- build a model, and evaluate it (using cross-validation)
- remove the one attribute corresponding to the smallest magnitude value in w
Select the best-performing model that has the fewest attributes
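A minimal sketch of this SVM-RFE loop for a binary problem (the function and variable names are my own; scikit-learn's sklearn.feature_selection.RFE provides a similar, more complete implementation):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_rfe(X, y, cv=5):
    """Recursive feature elimination with a linear SVM on binary-labeled data.

    X: N x P feature matrix, y: binary labels. Returns (best feature subset, its CV score).
    """
    features = list(range(X.shape[1]))
    results = []                                   # (cv score, feature subset) at each step
    while features:
        clf = SVC(kernel="linear", C=1.0)
        score = cross_val_score(clf, X[:, features], y, cv=cv).mean()
        results.append((score, list(features)))
        clf.fit(X[:, features], y)                 # refit to get the weight vector w
        w = clf.coef_.ravel()                      # one weight per remaining attribute
        features.pop(int(np.argmin(np.abs(w))))    # drop the smallest-|w| attribute
    # best score; among ties, prefer the subset with fewest attributes
    best = max(results, key=lambda r: (r[0], -len(r[1])))
    return best[1], best[0]
```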
Class probabilities from SVM
Score for a data point = distance of the point from the separating hyperplane
Magnitude of score can be taken to indicate confidence of prediction
Works for ranking of cases, lift, etc.
Can scale the score to [0,1] values, but this may not give well-calibrated probability scores
Determine a histogram from validation data scores.
For a new data point, find which histogram bin it falls in
Predict the probability as the proportion of the (positive) class in that bin
Platt’s method: $p(\text{class} \mid x) = \dfrac{1}{1 + \exp(A\, s(x) + B)}$, where $s(x)$ is the distance from the hyperplane, and A, B are parameters selected based on validation data (values that minimize the negative log-likelihood of the data).
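A rough sketch of fitting A and B by minimizing the negative log-likelihood on validation scores (the array names are illustrative; scikit-learn's SVC(probability=True) performs a similar calibration internally):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit A, B in p = 1 / (1 + exp(A*s + B)) by minimizing the negative log-likelihood.

    scores: decision values s(x) on validation data; labels: 1 for the positive class, 0 otherwise.
    """
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)            # avoid log(0)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    res = minimize(nll, x0=np.array([-1.0, 0.0]))   # A is typically negative
    return res.x                                    # fitted (A, B)

def platt_probability(score, A, B):
    """Calibrated probability of the positive class for a new score s(x)."""
    return 1.0 / (1.0 + np.exp(A * score + B))
```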