Support Vector Machines
Ganesh Ramakrishnan
Adjunct Professor, Department of Computer Science and Engineering, IIT Bombay
&
Research Staff Member, IBM India Research Labs
Outline
Introduction and Motivation
Perceptron
Support Vector Machines
Kernel Perceptron
Kernels
Least Squares Support Vector Machines
Proximal Support Vector Machines
Support Vector Regression
Basics
Notations
Perceptron
Perceptron Update Rule
What does “w” turn out to be?
Dual Representation
Dual Representation (contd)
Duality
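Since the slide bodies are not reproduced here, a minimal NumPy sketch (with illustrative names) of the standard perceptron update rule and the dual representation it induces: every mistake on example $i$ adds $y_i x_i$ to $w$, so $w$ turns out to be $\sum_i \alpha_i y_i x_i$ with $\alpha_i$ the mistake count of example $i$:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Standard perceptron: on each mistake, w += y_i * x_i (learning rate 1).
    Tracking per-example mistake counts a_i exposes the dual representation:
    w = sum_i a_i * y_i * x_i."""
    n, m = X.shape
    a = np.zeros(n)                      # dual variables: mistake counts
    for _ in range(epochs):
        for i in range(n):
            w = (a * y) @ X              # primal w recovered from the duals
            if y[i] * (w @ X[i]) <= 0:   # mistake on example i
                a[i] += 1
    return (a * y) @ X, a                # w and its dual representation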
Linear Classifiers
Statistical Learning Theory
Minimize an upper bound on the generalization error by maximizing the margin between two disjoint half-planes
Minimize empirical risk
Find $f$ such that the following expression is minimized:

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$

where $y_i$ is the known decision of object $i$, $x_i$ is its feature vector, and $n$ is the number of objects.
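To make the formula concrete, a minimal NumPy sketch of this empirical risk; `f`, `X`, and `y` are illustrative names:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/n) * sum_i (y_i - f(x_i))^2 over the n training objects."""
    predictions = np.array([f(x) for x in X])
    return np.mean((y - predictions) ** 2)

# Example with a hypothetical linear decision function
f = lambda x: np.sign(x @ np.array([1.0, -1.0]))
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, -1])
print(empirical_risk(f, X, y))  # 0.0 -- both objects classified correctly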
Motivation: SVM

[Figure: customer churn data plotted on Income vs. Age axes; "churn" and "no churn" points separated by a hyperplane with normal $W$]
Classifier Margin
Classifier Margin (contd)
Handling Noise through margins
Handling Noise through margins (contd)
Standard SVM formulation
SVM: Linearly separable case
$n$ objects: $x_i \in \mathbb{R}^m$, $i = 1, \dots, n$; $y_i \in \{-1, +1\}$ is the label of object $i$.

Assume there exists a hyperplane $w \cdot x + b = 0$ separating positive from negative objects:

$w \cdot x_i + b \geq +1$ if $y_i = +1$
$w \cdot x_i + b \leq -1$ if $y_i = -1$

or, equivalently:

$y_i (w \cdot x_i + b) - 1 \geq 0 \quad \forall i$

The bounding hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$ lie at distances $|1 - b| / \|w\|$ and $|1 + b| / \|w\|$ from the origin $(0,0)$, so the margin between them is $2 / \|w\|$.
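As a sanity check on the geometry, a small sketch with a made-up `w`, `b`, and two objects, verifying the constraints and the $2/\|w\|$ margin:

```python
import numpy as np

w = np.array([2.0, 1.0])                   # hypothetical hyperplane normal
b = -1.0                                   # hypothetical offset
X = np.array([[2.0, 1.0], [-1.0, 0.0]])    # one positive, one negative object
y = np.array([1, -1])

# Separability constraints: y_i * (w . x_i + b) - 1 >= 0 for all i
print(y * (X @ w + b) - 1 >= 0)            # [ True  True ]

# Margin between the hyperplanes w.x + b = +1 and w.x + b = -1
print(2 / np.linalg.norm(w))               # 0.894...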
What if…
Error handling
Support Vector Machines
(For Classification)
Find function f such that:
Classification error is minimized (in training set)
Margin is maximized (generalization!)
Two objectives:
Minimize Error (train model)
Maximize Margin (generalization)
SVM: Linearly non-separable case

Introduce slack variables $q_i$:

$w \cdot x_i + b \geq +1 - q_i$ if $y_i = +1$
$w \cdot x_i + b \leq -1 + q_i$ if $y_i = -1$

Primal formulation

Minimize $\quad \frac{1}{2} \|w\|^2 + C \sum_i q_i \qquad$ (1/margin term + classification-error term)
subject to $\quad y_i (w \cdot x_i + b) \geq 1 - q_i, \quad q_i \geq 0 \quad \forall i$

$W$: normal to the hyperplane; $b$: position of the hyperplane.
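One way to solve this primal in practice is an off-the-shelf solver; a minimal sketch using scikit-learn's `LinearSVC` with hinge loss (the data here are made up for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: two slightly overlapping Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# C trades off margin width (1/2 ||w||^2) against total slack (sum of q_i)
clf = LinearSVC(C=1.0, loss="hinge").fit(X, y)
print(clf.coef_, clf.intercept_)  # w and b of the learned hyperplane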
QP Dual Formulation

Maximize $\quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,s} \alpha_i \alpha_s y_i y_s \, (x_i \cdot x_s)$
subject to $\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i = 1, \dots, n$

KERNELS!!! (the training data enter only through the dot products $x_i \cdot x_s$)
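For small problems, this dual can be handed directly to a generic constrained optimizer; a sketch using SciPy's SLSQP (an illustration only, not how production SVMs are trained):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C=1.0):
    """Maximize sum_i a_i - 1/2 sum_{i,s} a_i a_s y_i y_s (x_i . x_s)
    by minimizing its negation with SciPy's SLSQP solver."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q[i,s] = y_i y_s (x_i . x_s)
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i a_i y_i = 0
    bounds = [(0.0, C)] * n   # slide shows a_i >= 0; C caps a_i in the soft-margin case
    result = minimize(objective, np.zeros(n), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x           # the alphas; nonzero entries mark the support vectors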
Classifier
Linear classifier:

$f(x) = W \cdot x + b = \sum_i \alpha_i y_i \, (x_i \cdot x) + b$

Application to new objects:
$\text{sign}(f(x)) = +1 \Rightarrow$ class $+1$
$\text{sign}(f(x)) = -1 \Rightarrow$ class $-1$
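Recovering predictions from a dual solution is direct; a short sketch (`alphas` as produced by the solver above, `b` assumed already computed from a support vector):

```python
import numpy as np

def predict(X_train, y_train, alphas, b, x_new):
    # f(x) = sum_i alpha_i y_i (x_i . x) + b;  class = sign(f(x))
    f = np.sum(alphas * y_train * (X_train @ x_new)) + b
    return np.sign(f)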
Support Vectors
Error-Margin tradeoff
Error-Margin tradeoff (contd)
More Complex Pattern Sets…
Limitations of Linear Machines
Handling the limitations?
Handling the limitations…
Projecting onto higher dimension
Ex-OR Gate
Ex-OR Gate (contd)
Ex-OR Gate: Non-linear Mapping Gives Linear Separability
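A concrete check that XOR, inseparable in the plane, becomes linearly separable after a non-linear map; the map $\phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ used here is one common choice (an assumption, since the slide body is not reproduced):

```python
import numpy as np

# XOR: no line in the plane separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, +1, +1, -1])

# Map to 3-D by adding the product feature x1*x2
phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])
Z = np.array([phi(x) for x in X])

# In the mapped space the plane x1 + x2 - 2*x1*x2 = 1/2 separates the classes
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Z @ w + b))  # [-1.  1.  1. -1.] matches y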
Projecting into higher dimensions
Learning in Feature Space
Which mapping to choose?
High dimensions - challenging
Kernels: Mapping implicitly to Feature Space
Use the Dual!
Kernels
Using Kernels
Kernel Matrix
Kernel Matrix (contd)
Mercer’s Theorem
Mercer’s Theorem (contd)
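On finite data, Mercer's condition can be checked on the kernel (Gram) matrix, which must be symmetric positive semi-definite for a valid kernel; a minimal sketch with an illustrative `kernel` callable:

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of training points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_mercer_psd(K, tol=1e-10):
    """Finite-sample Mercer check: the Gram matrix of a valid kernel
    must be symmetric and positive semi-definite."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol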
Kernel Examples
Also…
Polynomial Kernels
Polynomial Kernels (contd)
Efficiency
Gaussian Kernels
Gaussian Kernel
Gaussian Kernel (contd)
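Since the slide bodies are not reproduced, here is a minimal sketch of the two standard kernels just named, the polynomial and the Gaussian (RBF) kernel:

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """K(x, z) = (x . z + c)^d  -- implicit map to all monomials up to degree d."""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) -- implicit infinite-dim map."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))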
Closure Properties
Kernel Machines
Feature map: $\Phi(\cdot) : x \mapsto X = \Phi(x)$, so $X_i = \Phi(x_i)$.

Linear machine in feature space:

$y = \text{sign} \left( \sum_{i \in S} \alpha_i y_i \, (X_i \cdot X) + b \right)$

Since $X_i \cdot X = \Phi(x_i) \cdot \Phi(x)$:

$y = \text{sign} \left( \sum_{i \in S} \alpha_i y_i \, (\Phi(x_i) \cdot \Phi(x)) + b \right)$

Define a kernel $K(\cdot, \cdot)$ satisfying Mercer's condition, $K(x_i, x) = \Phi(x_i) \cdot \Phi(x)$:

$y = \text{sign} \left( \sum_{i \in S} \alpha_i y_i \, K(x_i, x) + b \right)$

Here $S$ is the set of support vectors.
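The kernelized decision rule above as a short sketch; `alphas`, `b`, and the support set are assumed to come from a dual solver such as the earlier SLSQP example:

```python
import numpy as np

def kernel_predict(X_sv, y_sv, alphas, b, kernel, x_new):
    # y = sign( sum_{i in S} alpha_i y_i K(x_i, x) + b )
    k = np.array([kernel(x_i, x_new) for x_i in X_sv])
    return np.sign(np.sum(alphas * y_sv * k) + b)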
Training SVMs
Training SVMs (contd)
Avoiding Memory Problems
Does it really work?
Example
Decomposition Methods
Caching
Caching Issues
Shrinking
Least Squares Support Vector Machines (LSSVM)
The classical SVM classifier aims to minimize an upper bound on the generalization error by maximizing the margin between two disjoint half-planes.
This involves solving a quadratic programming problem that could be prohibitive on large data sets.
To overcome this problem, Suykens and Vandewalle proposed the "least squares SVM" (LSSVM) formulation. The formulation considers equality constraints and adds an extra term to the cost function. As a result, the solution follows from directly solving a set of linear equations.
LSSVM
Given a set of $M$ patterns $x_k$, where $x_k = (\dots)^T$, with corresponding labels $y_k \in \{-1, +1\}$, the LS-SVM determines a separating surface of the form $w^T \phi(x) + b = 0$ by solving a problem of the form

Minimize $\quad P(w, b, q) = \frac{1}{2} w^T w + \frac{C}{2} q^T q$
subject to $\quad y_i \left( w^T \phi(x_i) + b \right) = 1 - q_i, \quad i = 1, 2, \dots, M$

where $C$ is a parameter. The first term of the objective is for regularization, while the second term is the empirical error. The constant $C$ determines the relative importance of the two.
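Because the constraints are equalities, the LSSVM solution reduces to one linear system (the standard Suykens-Vandewalle KKT system, with $\Omega_{kl} = y_k y_l K(x_k, x_l)$); a minimal NumPy sketch assuming a linear kernel:

```python
import numpy as np

def lssvm_train(X, y, C=1.0):
    """Solve the LSSVM KKT linear system:
       [ 0    y^T         ] [ b     ]   [ 0 ]
       [ y    Omega + I/C ] [ alpha ] = [ 1 ]
    with Omega[k, l] = y_k y_l K(x_k, x_l); a linear kernel is assumed here."""
    M = len(y)
    Omega = (y[:, None] * y[None, :]) * (X @ X.T)
    A = np.zeros((M + 1, M + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(M) / C
    rhs = np.concatenate([[0.0], np.ones(M)])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # b, alpha
```

New points are then classified as $\text{sign}\left(\sum_k \alpha_k y_k K(x, x_k) + b\right)$, exactly as in the kernel machine decision rule above.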