Kernel methods, SVM
Consider ridge regression
We want to learn $f(x) = w^\top x = \sum_{j=1}^{P} w_j x_j$

Obtain $w$ as
$$w = \arg\min_w \sum_{i=1}^{N} \big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert^2$$

In matrix notation, with $X$ the $N \times P$ data matrix (the $r$-th row is the $r$-th training example) and $y$ the $N \times 1$ target vector:
$$w = \arg\min_w \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2$$
Notation:
X is a matrix, x is a vector
Solve by setting derivatives to zero, to get
$$w = (X^\top X + \lambda I)^{-1} X^\top y$$
(dimensions: $X^\top X$ is $(P \times N)(N \times P) = P \times P$; $X^\top y$ is $(P \times N)(N \times 1)$; $w$ is $P \times 1$)

For a new example $x_{new}$:
$$y_{new} = w^\top x_{new} = \big((X^\top X + \lambda I)^{-1} X^\top y\big)^\top x_{new}$$
Getting to the dual form: starting from $w = (X^\top X + \lambda I)^{-1} X^\top y$,
$$(X^\top X + \lambda I)\, w = X^\top y$$
$$\lambda w = X^\top y - X^\top X w = X^\top (y - Xw)$$
$$w = \lambda^{-1} X^\top (y - Xw) = X^\top \alpha \quad \text{where } \alpha = \lambda^{-1}(y - Xw)$$

$\alpha$ gives the dual solution, from which we can obtain $w = X^\top \alpha = \sum_{i=1}^{N} \alpha_i x_i$
(here, $x_i$ is the $i$-th example)
$$w = X^\top \alpha = \begin{bmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1P} & \cdots & x_{NP} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}$$
Substituting $w = X^\top \alpha$ in $\alpha = \lambda^{-1}(y - Xw)$, we get
$$\lambda \alpha = y - X X^\top \alpha$$
$$(X X^\top + \lambda I)\,\alpha = y$$

We can compute $\alpha$ as: $\ \alpha = (K + \lambda I)^{-1} y\ $ where $K = X X^\top$, i.e.
$$K = \begin{bmatrix} x_{11} & \cdots & x_{1P} \\ \vdots & & \vdots \\ x_{N1} & \cdots & x_{NP} \end{bmatrix} \begin{bmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & & \vdots \\ x_{1P} & \cdots & x_{NP} \end{bmatrix}, \qquad K_{ij} = \langle x_i, x_j \rangle = (x_i \cdot x_j) \ \text{(dot product)}$$
K: matrix of inner products of N vectors (Gram Matrix)
K: matrix of similarities of example pairs
(since dot product gives similarity between vectors)
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_N \rangle \\ \vdots & & \vdots \\ \langle x_N, x_1 \rangle & \cdots & \langle x_N, x_N \rangle \end{bmatrix}$$
Now, for a new example,
$$y_{new} = f(x_{new}) = w^\top x_{new} = \sum_{i=1}^{N} \alpha_i \langle x_i, x_{new} \rangle$$
(since $w = X^\top \alpha = \sum_{i=1}^{N} \alpha_i x_i$)
So in the dual form:
Compute $\alpha = (K + \lambda I)^{-1} y$, where $K = X X^\top$, i.e. $K_{ij} = \langle x_i, x_j \rangle$
Evaluate on a new data point $x_{new}$ as $\ y_{new} = f(x_{new}) = \sum_{i=1}^{N} \alpha_i \langle x_i, x_{new} \rangle$
Both of these require inner products between data points
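As a quick numeric check that the dual form gives the same predictions as the primal solution, here is a minimal sketch (the data, λ, N, and P values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 20, 5, 0.1
X = rng.normal(size=(N, P))        # N examples, P features
y = rng.normal(size=N)             # targets
x_new = rng.normal(size=P)         # a new example

# Primal: w = (X^T X + lambda I)^-1 X^T y   (a P x P system)
w = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)

# Dual: alpha = (K + lambda I)^-1 y with K = X X^T   (an N x N system)
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)

# Both give the same prediction for x_new
print(w @ x_new)                   # primal: w . x_new
print(alpha @ (X @ x_new))         # dual: sum_i alpha_i <x_i, x_new>
```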
Substitute the inner product with a kernel function
$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$
Then, we obtain a procedure for ridge regression over a new feature space F given by ɸ: x -> ɸ(x) ∈ F
(kernel ridge regression)
(e.g., the mapping ɸ: R² -> R³ induced by the quadratic kernel below)
We can calculate the similarity (dot product) between data points in the new feature space,
….. in terms of the Kernel function on the original data points.
i.e. in terms of K(x,z)
(why is this a ‘big deal’?)
Kernel trick
Quadratic kernel
Consider x=(x1,x2) and z=(z1,z2) in 2D space.
Then, $\langle x, z \rangle^2 = (x^\top z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2$
$= \langle\, (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2),\ (z_1^2,\ z_2^2,\ \sqrt{2}\, z_1 z_2)\, \rangle$
$= \langle \phi(x), \phi(z) \rangle = \phi(x)^\top \phi(z)$
Here, $\phi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2]$
This converts the linear regression into a quadratic regression in the original variables.
ɸ(.) projects 2D data to 3D
But we do not need to perform computations in terms of 3D space.
Use the kernel function $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K_\phi(x, z)$ to calculate the inner product (similarity) in 3D space in terms of 2D space operations.
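A minimal numeric check of this identity (the two vectors are made-up examples):

```python
import numpy as np

def phi(v):
    """Explicit 2D -> 3D feature map for the quadratic kernel."""
    v1, v2 = v
    return np.array([v1**2, v2**2, np.sqrt(2) * v1 * v2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(x, z) ** 2)            # kernel computed in the original 2D space
print(np.dot(phi(x), phi(z)))       # same value via the explicit 3D mapping
```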
Use the kernel function to train the regression model and apply it:
$$\alpha = (K + \lambda I)^{-1} y \qquad y_{new} = f(x_{new}) = \sum_{i=1}^{N} \alpha_i\, K(x_i, x_{new})$$
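A minimal kernel ridge regression sketch along these lines, using the quadratic kernel from above (the data, λ, and function names are illustrative, not from the original slides):

```python
import numpy as np

def quad_kernel(a, b):
    """Quadratic kernel K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

def fit_kernel_ridge(X, y, kernel, lam):
    """Return dual coefficients alpha = (K + lambda I)^-1 y."""
    N = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict(X, alpha, kernel, x_new):
    """f(x_new) = sum_i alpha_i K(x_i, x_new)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X))

# Illustrative data: y depends quadratically on x
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1]

alpha = fit_kernel_ridge(X, y, quad_kernel, lam=0.1)
print(predict(X, alpha, quad_kernel, np.array([0.5, -0.5])))
```

Note that the fit never forms the 3D features explicitly; all computation stays in the original 2D space.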
For classification - learn a non-linear separation by using a kernel
[Figure: 'o' and 'x' class points that are not linearly separable in the original space; after mapping each point by ɸ, the ɸ(o) and ɸ(x) points can be separated by a line in the transformed space]
more on this next -- SVM
Linear separation in transformed space
Some commonly used kernel functions
Polynomial: $K(u, v) = (u \cdot v)^d$, or $K(u, v) = (u \cdot v + c)^d$. Here $c$ is a parameter that controls the influence of higher-order terms in the polynomial.
Corresponds to polynomial regression in the implicit (transformed) feature space --- but without the computational burden of higher-order terms and added parameters (model weights)

Gaussian (radial basis function): $K(u, v) = \exp\!\left(-\dfrac{\lVert u - v \rVert^2}{2\sigma^2}\right)$
(or often written, with a gamma parameter, as $\exp\!\left(-\gamma \lVert u - v \rVert^2\right)$)
The transformed space has infinite dimension – it considers polynomials of all orders

Sigmoid: $K(u, v) = \tanh(\eta\, u \cdot v + r)$
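Minimal sketches of these kernel functions (the default parameter values are illustrative assumptions):

```python
import numpy as np

def polynomial_kernel(u, v, d=3, c=1.0):
    """K(u, v) = (u . v + c)^d"""
    return (np.dot(u, v) + c) ** d

def rbf_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta=0.01, r=0.0):
    """K(u, v) = tanh(eta * u . v + r)"""
    return np.tanh(eta * np.dot(u, v) + r)
```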
Consider data in 2D space. x=(x1,x2) and z=(z1,z2)
Polynomial kernel $K(x, z) = (x \cdot z + 1)^3$
This kernel maps the 2D data to 10 dimensions, built from the monomials
$1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2,\ x_1 x_2^2,\ x_1^2 x_2,\ x_1^3,\ x_2^3$
We do not need to explicitly consider the 10D space. We calculate similarity between vectors in this 10D space in terms of their 2D vectors.
Kernel functions
- Allow operating on data as though it were projected onto a higher dimensional space, but by using calculations on the original space
- Allow calculating dot products in the transformed space, without having to actually project the data into the new space
- Kernel functions need to satisfy certain properties; can construct new kernels from basic kernels
- Learning problems can be formulated as optimization tasks. Consider primal and dual formulations of the optimization problem. The dual problem is solved in terms of the higher dimensional space; kernel functions allow computing dot products in this new space in terms of the original dimensions
- Kernel PCA, kernel functions in Support Vector Machines
Kernel PCA
Support vector machines
Find the maximum margin hyperplane separating two classes
Use of kernel functions to transform data into higher dimensional space
Linear separation in higher dimensions
Linear separation – which one is better ? (why?)
Larger margin between classes is better
- maximum margin hyperplane
Points at/near the boundary are the support vectors
- these determine the optimal separating hyperplane
Learn the separating hyperplane which has the maximum margin from points belonging to the two classes on different sides of the hyperplane
[Figure: separating hyperplane H ($w^\top x + b = 0$) with margin M, bounded by hyperplanes H+ ($w^\top x + b = 1$) and H− ($w^\top x + b = -1$); class +1 data on the side $w^\top x + b \ge 1$, class −1 data on the side $w^\top x + b \le -1$]

Find the model (i.e. $w$, $b$) that maximizes the margin M.
Consider two points $p$, $q$ on the +1 plane H+. Then $w^\top p + b = 1$ and $w^\top q + b = 1$, so $w^\top (p - q) = 0$, i.e. $w$ is normal to the hyperplanes.
Constraints: $w^\top x + b \ge 1$ for all data $x$ belonging to class +1, and $w^\top x + b \le -1$ for all data of class −1; equivalently, $y_i (w^\top x_i + b) \ge 1$ for all $i$.
The margin works out to $M = 2 / \lVert w \rVert$, so maximizing M is equivalent to minimizing $w^\top w / 2$ subject to these constraints.
We will re-cast this optimization in terms of a Lagrangian formulation
- each constraint is replaced by a Lagrange multiplier – easier to handle
- training data appears as dot products between pairs of data points
-- allows use of Kernel functions
What to do with points that may not be linearly separable?
- can still use quadratic programming with a penalty for errors
- introduce slack variables $\varepsilon_i$

[Figure: separating hyperplane H, with slacks $\varepsilon_p$, $\varepsilon_q$ for a class +1 point and a class −1 point that fall on the wrong side of their margin]

Minimize $\ \dfrac{w^\top w}{2} + C \sum_{i=1}^{N} \varepsilon_i$

subject to the constraints:
$y_i (w^\top x_i + b) \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0 \ \text{ for all } i$
The constant C > 0 sets the tradeoff between maximizing the margin (first term in the objective) and minimizing the slack.
Note – examples with positive slack will be ‘within’ the margin (i.e. on the ‘wrong’ side of the margin)
(soft margin SVM)
Formulate the optimization in terms of a Lagrangian:
$$\min_{w, b}\ \frac{w^\top w}{2} \quad \text{s.t.}\quad y_i (w^\top x_i + b) \ge 1 \ \text{ for all } i$$
Constraints are violated when $y_i (w^\top x_i + b) - 1 < 0$

Use Lagrange multipliers $\alpha_i$ to express the constraints. Then the optimization problem can be written in terms of the Lagrangian
$$L_P = \frac{w^\top w}{2} - \sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big]$$
With all $\alpha_i \ge 0$, any violated constraint will increase $L_P$ – so we seek $\min L_P$
$\alpha_i \ge 0$ for all $i$
(For the soft margin case, we constrain $0 \le \alpha_i \le C$)

Minimize the objective by taking derivatives with respect to $w$ and $b$:
$\dfrac{\partial L_P}{\partial w} = 0$ gives $w = \sum_i \alpha_i y_i x_i$, and $\dfrac{\partial L_P}{\partial b} = 0$ gives $\sum_i \alpha_i y_i = 0$.
Substituting these in the expression above, we get the dual formulation (SVM in terms of the dual):
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j)$$
$$\text{s.t.}\quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0 \ \text{ for all } i$$
(For the soft margin case, $0 \le \alpha_i \le C$)
Note the term $x_i \cdot x_j$ – the dot product of training data points.
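A rough sketch of solving this dual with a generic optimizer (scipy's SLSQP; the toy data, C value, and tolerances are illustrative, and production SVM solvers use specialized methods such as SMO instead):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N, C = len(y), 10.0

G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):
    # Negative of the dual objective: -(sum_i alpha_i - 1/2 alpha^T G alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))  # a margin support vector
b = y[sv] - w @ X[sv]                        # from y_k (w . x_k + b) = 1
print(alpha.round(3), w, b)
```

Most of the resulting $\alpha_i$ come out (numerically) zero; the few non-zero ones are the support vectors discussed below.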
The $w$ vector is expressed in terms of the $\alpha_i$: $\ w = \sum_i \alpha_i y_i x_i$, and the separating boundary is $w \cdot x + b = 0$.
To evaluate a new data point $x_{new}$:
$$f(x_{new}) = w \cdot x_{new} + b = \sum_i \alpha_i y_i\, (x_i \cdot x_{new}) + b$$
Note the dot product of the new data point with the training data points; $i$ sums over the training examples.
But we only need to consider those $i$ for which $\alpha_i$ is non-zero
- points along the hyperplanes H+ and H− (the support vectors)
- for the soft margin case, there may be points within the margin for which $\alpha_i$ is non-zero
Points at/near the boundary are the support vectors
- these determine the optimal separating hyperplane
Change the other data points, and we get the same model
For applying the model to new examples, we only need to store the support vectors
Note:
- we only need the dot products between examples
- $\alpha_i$ values are non-zero only for data points on the separating boundary, i.e. on one of the hyperplanes H+ or H−; these support vectors determine the separating boundary
If other training data points are removed, the same separating boundary will be returned as optimal
- for classification decisions, only the support vectors with $\alpha_i > 0$ matter
Other data are not needed once training is complete
- as many parameters $\alpha_i$ ($i = 1$ to $N$) as the number of training examples?
… but the number of non-zero parameters is bounded by the number of support vectors
- number of parameters does not depend on dimensions of the space
(can project to high dimensional space)
What if data is not linearly separable?
Use a kernel function to map the data to a higher dimensional feature space where the data will be linearly separable
With sufficiently high dimensions, the data will be linearly separable
The optimal model is given by $w = \sum_i \alpha_i y_i\, \phi(x_i)$, where the non-zero coefficients $\alpha_i$ (Lagrange multipliers) correspond to the support vectors. The $\alpha_i$ are found from the kernelized dual:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, [\phi(x_i) \cdot \phi(x_j)]$$
$$\text{s.t.}\quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0 \ \text{ for all } i \quad \text{(for the soft margin case, } 0 \le \alpha_i \le C\text{)}$$
Remember
- we do not need to operate in the high-dimensional space, do not need to explicitly know the high-dimensional mapping
Instead, use the kernel function to get the dot product: $\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j)$
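As a usage sketch, scikit-learn's SVC solves this kernelized dual internally (the RBF kernel choice, toy data, and parameter values here are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data that is not linearly separable: class depends on distance from origin
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)     # dual SVM with a Gaussian kernel
clf.fit(X, y)

print(len(clf.support_vectors_))              # only the support vectors are retained
print(clf.predict([[0.1, 0.2], [2.0, 2.0]]))  # classify new points
```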
Mapping to higher dimensions
Consider a polynomial of degree d, and a dataset with p variables.
With d=2 and p=50, we have a transformed feature space with 2500 features.
But, we only need to compute the dot products in terms of the kernel function
- operations on original space
Example – use of quadratic kernel
Consider data in 2D space, i.e. data points $x = (x_1, x_2)$ and $z = (z_1, z_2)$
Then, $(x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2$
$= \langle\, (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2),\ (z_1^2,\ z_2^2,\ \sqrt{2}\, z_1 z_2)\, \rangle$
$= \phi(x) \cdot \phi(z)$
Here, $\phi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2]$ – an explicit mapping into 3D space.

[Figures: data plotted on the original axes $x_1$, $x_2$ with no linear separation in the original space; after the mapping ɸ(.), plotted on axes $x_1^2$, $x_2^2$, $\sqrt{2}x_1x_2$, there is a linear separation in the transformed space]
SVM for classification
- general purpose classifier with good performance across problems
- works well with an appropriate choice of hyper-parameters: the penalty factor C, and the kernel function parameters

SVM parameters are the $w_i$ (or $\alpha_i$) and $b$ values. Parameters of the SVM model (like C) are called hyper-parameters. C controls the tradeoff between model complexity and training error
- too large a C can result in more support vectors, a more complex model, and overfitting
- try a range of values, and examine performance on a separate test set, or with cross-validation

Example: with a polynomial kernel, try values of the hyper-parameters d and C, e.g. d in {2, 3, 4, 5} and C in {0.1, 1, 10, 100, 1000}
(RapidMiner provides an operator for doing a grid search over parameters)
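The same grid search in scikit-learn might look roughly like this (a sketch; the toy dataset is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy binary classification data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] + X[:, 2] ** 2 > 0.5).astype(int)

param_grid = {"degree": [2, 3, 4, 5],
              "C": [0.1, 1, 10, 100, 1000]}

# 5-fold cross-validation over all (degree, C) combinations of a polynomial-kernel SVM
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```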
Support vector regression

[Figure: data in (x, y) with the line $y = w \cdot x + b$ and an ε-insensitive band between $+\varepsilon$ and $-\varepsilon$ around it]

Find a model $y = w \cdot x + b$ such that
- predicted values have at most ε deviation from the true values
- the function $w \cdot x + b$ is as 'flat' as possible (to avoid overfitting from having too many terms)

Minimize $\ \lVert w \rVert^2 / 2$
s.t. $\ y_i - (w \cdot x_i + b) \le \varepsilon, \quad y_i - (w \cdot x_i + b) \ge -\varepsilon \quad$ for all $i$ in $1..N$

[Figure: the same plot with points falling outside the ε band, their distance from the band marked by slack variables $\xi_i$, $\xi_i^*$]

For points like these, outside the band:
- add a penalty term

Minimize $\ \lVert w \rVert^2 / 2 + C \sum_i (\xi_i + \xi_i^*)$
s.t. $\ y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i, \quad y_i - (w \cdot x_i + b) \ge -\varepsilon - \xi_i^*, \quad \xi_i, \xi_i^* \ge 0 \quad$ for all $i$ in $1..N$
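A brief usage sketch with scikit-learn's SVR (the data and the C, epsilon values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative 1D regression data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# epsilon sets the width of the insensitive band, C the penalty for points outside it
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)

print(len(reg.support_))        # number of support vectors (points on or outside the band)
print(reg.predict([[0.5]]))     # predict at a new point
```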
Support vector regression
Can take similar approach as earlier, with the Lagrangian multipliers and solving the dual optimization problem.
The $\alpha_i$ will have non-zero values only for points outside the margin band (the support vectors)
Use of kernel functions to transform into higher dimensional space
- do the regression in higher dimensional space
Q. – how is SVM regression different from kernel ridge regression?
Support vector regression
Parameters are C, ε, and kernel function parameters.
C affects the tradeoff between model complexity (how 'flat' it is) and the extent of tolerance for points away from the band. If C is too large, we can get a complex model.
ε controls the width of the ε-insensitive band – a large band leads to fewer support vectors, and a 'flatter' model.
ε relates to the data value ranges – so, normalize the data beforehand.

[Figure: non-linear data in (x, y) mapped by ɸ(.) to (ɸ(x), y), where a linear ε-band regression fits; mapping back via ɸ⁻¹(.) gives a non-linear fit in the original space]
Variable selection in SVM
The magnitude of the weights in w relates to the importance of the corresponding attributes
Variable selection - evaluate models with different attribute sets using cross-validation
Recursive Feature Elimination (SVM-RFE)
Start with all attributes
Repeat until there are no more attributes:
- build a model, and evaluate it (using cross-validation)
- remove the one attribute corresponding to the smallest magnitude value in w
Select the best-performing model that has the fewest attributes
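A minimal sketch of this SVM-RFE loop for a binary problem (the function and variable names are my own; scikit-learn's sklearn.feature_selection.RFE provides a similar, more complete implementation):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_rfe(X, y, cv=5):
    """Recursive feature elimination with a linear SVM on binary-labeled data.

    X: N x P feature matrix, y: binary labels. Returns (best feature subset, its CV score).
    """
    features = list(range(X.shape[1]))
    results = []                                   # (cv score, feature subset) at each step
    while features:
        clf = SVC(kernel="linear", C=1.0)
        score = cross_val_score(clf, X[:, features], y, cv=cv).mean()
        results.append((score, list(features)))
        clf.fit(X[:, features], y)                 # refit to get the weight vector w
        w = clf.coef_.ravel()                      # one weight per remaining attribute
        features.pop(int(np.argmin(np.abs(w))))    # drop the smallest-|w| attribute
    # best score; among ties, prefer the subset with fewest attributes
    best = max(results, key=lambda r: (r[0], -len(r[1])))
    return best[1], best[0]
```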
Class probabilities from SVM
Score for a data point = distance of the point from the separating hyperplane
Magnitude of score can be taken to indicate confidence of prediction
Works for ranking of cases, lift, etc.
Can scale the score to [0,1] values, but this may not give well-calibrated probability scores
Determine a histogram from validation data scores.
For a new data point, find which histogram bin it falls in
Predict the probability as the proportion of the (positive) class in that bin
Platt’s method: $p(\text{class} \mid x) = \dfrac{1}{1 + \exp(A\, s(x) + B)}$, where $s(x)$ is the distance from the hyperplane, and A, B are parameters selected based on validation data (values that minimize the negative log-likelihood of the data).
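A rough sketch of fitting A and B by minimizing the negative log-likelihood on validation scores (the array names are illustrative; scikit-learn's SVC(probability=True) performs a similar calibration internally):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit A, B in p = 1 / (1 + exp(A*s + B)) by minimizing the negative log-likelihood.

    scores: decision values s(x) on validation data; labels: 1 for the positive class, 0 otherwise.
    """
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)            # avoid log(0)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    res = minimize(nll, x0=np.array([-1.0, 0.0]))   # A is typically negative
    return res.x                                    # fitted (A, B)

def platt_probability(score, A, B):
    """Calibrated probability of the positive class for a new score s(x)."""
    return 1.0 / (1.0 + np.exp(A * score + B))
```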