Support Vector Machines
Ganesh Ramakrishnan
Adjunct Professor, Department of Computer Science and Engineering, IIT Bombay
&
Research Staff Member, IBM India Research Labs
Outline
Introduction and Motivation
Perceptron
Support Vector Machines
Kernel Perceptron
Kernels
Least Squares Support Vector Machines
Proximal Support Vector Machines
Support Vector Regression
Basics
Notations
Perceptron
Perceptron Update Rule
What does “w” turn out to be?
Dual Representation
Dual Representation (contd)
Duality
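Since the slide bodies are not reproduced here, a minimal NumPy sketch (with illustrative names) of the standard perceptron update rule and the dual representation it induces: every mistake on example $i$ adds $y_i x_i$ to $w$, so $w$ turns out to be $\sum_i \alpha_i y_i x_i$ with $\alpha_i$ the mistake count of example $i$:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Standard perceptron: on each mistake, w += y_i * x_i (learning rate 1).
    Tracking per-example mistake counts a_i exposes the dual representation:
    w = sum_i a_i * y_i * x_i."""
    n, m = X.shape
    a = np.zeros(n)                      # dual variables: mistake counts
    for _ in range(epochs):
        for i in range(n):
            w = (a * y) @ X              # primal w recovered from the duals
            if y[i] * (w @ X[i]) <= 0:   # mistake on example i
                a[i] += 1
    return (a * y) @ X, a                # w and its dual representation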
Linear Classifiers
Statistical Learning Theory
Minimize an upper bound on the generalization error by maximizing the margin between two disjoint half-planes
Minimize empirical risk
Find $f$ such that the following expression is minimized:

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$

where $y_i$ is the known decision of object $i$, $x_i$ is its feature vector, and $n$ is the number of objects.
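To make the formula concrete, a minimal NumPy sketch of this empirical risk; `f`, `X`, and `y` are illustrative names:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/n) * sum_i (y_i - f(x_i))^2 over the n training objects."""
    predictions = np.array([f(x) for x in X])
    return np.mean((y - predictions) ** 2)

# Example with a hypothetical linear decision function
f = lambda x: np.sign(x @ np.array([1.0, -1.0]))
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, -1])
print(empirical_risk(f, X, y))  # 0.0 -- both objects classified correctly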
Motivation: SVM

[Figure: customer churn data plotted on Income vs. Age axes; "churn" and "no churn" points separated by a hyperplane with normal $W$]
Classifier Margin
Classifier Margin (contd)
Handling Noise through margins
Handling Noise through margins (contd)
Standard SVM formulation
SVM: Linearly separable case
$n$ objects: $x_i \in \mathbb{R}^m$, $i = 1, \dots, n$; $y_i \in \{-1, +1\}$ is the label of object $i$.

Assume there exists a hyperplane $w \cdot x + b = 0$ separating positive from negative objects:

$w \cdot x_i + b \geq +1$ if $y_i = +1$
$w \cdot x_i + b \leq -1$ if $y_i = -1$

or, equivalently:

$y_i (w \cdot x_i + b) - 1 \geq 0 \quad \forall i$

The bounding hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$ lie at distances $|1 - b| / \|w\|$ and $|1 + b| / \|w\|$ from the origin $(0,0)$, so the margin between them is $2 / \|w\|$.
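As a sanity check on the geometry, a small sketch with a made-up `w`, `b`, and two objects, verifying the constraints and the $2/\|w\|$ margin:

```python
import numpy as np

w = np.array([2.0, 1.0])                   # hypothetical hyperplane normal
b = -1.0                                   # hypothetical offset
X = np.array([[2.0, 1.0], [-1.0, 0.0]])    # one positive, one negative object
y = np.array([1, -1])

# Separability constraints: y_i * (w . x_i + b) - 1 >= 0 for all i
print(y * (X @ w + b) - 1 >= 0)            # [ True  True ]

# Margin between the hyperplanes w.x + b = +1 and w.x + b = -1
print(2 / np.linalg.norm(w))               # 0.894...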
What if…
Error handling
Support Vector Machines
(For Classification)
Find function f such that:
Classification error is minimized (in training set)
Margin is maximized (generalization!)
Two objectives:
Minimize Error (train model)
Maximize Margin (generalization)
SVM: Linearly non-separable case

Introduce slack variables $q_i$:

$w \cdot x_i + b \geq +1 - q_i$ if $y_i = +1$
$w \cdot x_i + b \leq -1 + q_i$ if $y_i = -1$

Primal formulation

Minimize $\quad \frac{1}{2} \|w\|^2 + C \sum_i q_i \qquad$ (1/margin term + classification-error term)
subject to $\quad y_i (w \cdot x_i + b) \geq 1 - q_i, \quad q_i \geq 0 \quad \forall i$

$W$: normal to the hyperplane; $b$: position of the hyperplane.
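One way to solve this primal in practice is an off-the-shelf solver; a minimal sketch using scikit-learn's `LinearSVC` with hinge loss (the data here are made up for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: two slightly overlapping Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# C trades off margin width (1/2 ||w||^2) against total slack (sum of q_i)
clf = LinearSVC(C=1.0, loss="hinge").fit(X, y)
print(clf.coef_, clf.intercept_)  # w and b of the learned hyperplane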
QP Dual Formulation

Maximize $\quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,s} \alpha_i \alpha_s y_i y_s \, (x_i \cdot x_s)$
subject to $\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i = 1, \dots, n$

KERNELS!!! (the training data enter only through the dot products $x_i \cdot x_s$)
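For small problems, this dual can be handed directly to a generic constrained optimizer; a sketch using SciPy's SLSQP (an illustration only, not how production SVMs are trained):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C=1.0):
    """Maximize sum_i a_i - 1/2 sum_{i,s} a_i a_s y_i y_s (x_i . x_s)
    by minimizing its negation with SciPy's SLSQP solver."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q[i,s] = y_i y_s (x_i . x_s)
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i a_i y_i = 0
    bounds = [(0.0, C)] * n   # slide shows a_i >= 0; C caps a_i in the soft-margin case
    result = minimize(objective, np.zeros(n), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x           # the alphas; nonzero entries mark the support vectors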
Classifier
Linear classifier:

$f(x) = W \cdot x + b = \sum_i \alpha_i y_i \, (x_i \cdot x) + b$

Application to new objects:
$\text{sign}(f(x)) = +1 \Rightarrow$ class $+1$
$\text{sign}(f(x)) = -1 \Rightarrow$ class $-1$
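Recovering predictions from a dual solution is direct; a short sketch (`alphas` as produced by the solver above, `b` assumed already computed from a support vector):

```python
import numpy as np

def predict(X_train, y_train, alphas, b, x_new):
    # f(x) = sum_i alpha_i y_i (x_i . x) + b;  class = sign(f(x))
    f = np.sum(alphas * y_train * (X_train @ x_new)) + b
    return np.sign(f)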
Support Vectors
Error-Margin tradeoff
Error-Margin tradeoff (contd)
More Complex Pattern Sets…
Limitations of Linear Machines
Handling the limitations?
Handling the limitations…
Projecting onto higher dimension
Ex-OR Gate
Ex-OR Gate (contd)
Ex-OR Gate: Non-linear Mapping Gives Linear Separability
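A concrete check that XOR, inseparable in the plane, becomes linearly separable after a non-linear map; the map $\phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$ used here is one common choice (an assumption, since the slide body is not reproduced):

```python
import numpy as np

# XOR: no line in the plane separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, +1, +1, -1])

# Map to 3-D by adding the product feature x1*x2
phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])
Z = np.array([phi(x) for x in X])

# In the mapped space the plane x1 + x2 - 2*x1*x2 = 1/2 separates the classes
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Z @ w + b))  # [-1.  1.  1. -1.] matches y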
Projecting into higher dimensions
Learning in Feature Space
Which mapping to choose?
High dimensions - challenging
Kernels: Mapping implicitly to Feature Space
Use the Dual!
Kernels
Using Kernels
Kernel Matrix
Kernel Matrix (contd)
Mercer’s Theorem
Mercer’s Theorem (contd)
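On finite data, Mercer's condition can be checked on the kernel (Gram) matrix, which must be symmetric positive semi-definite for a valid kernel; a minimal sketch with an illustrative `kernel` callable:

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of training points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_mercer_psd(K, tol=1e-10):
    """Finite-sample Mercer check: the Gram matrix of a valid kernel
    must be symmetric and positive semi-definite."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol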
Kernel Examples
Also…
Polynomial Kernels
Polynomial Kernels (contd)
Efficiency
Gaussian Kernels
Gaussian Kernel
Gaussian Kernel (contd)
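Since the slide bodies are not reproduced, here is a minimal sketch of the two standard kernels just named, the polynomial and the Gaussian (RBF) kernel:

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """K(x, z) = (x . z + c)^d  -- implicit map to all monomials up to degree d."""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) -- implicit infinite-dim map."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))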
Closure Properties
Kernel Machines
Feature map: $\Phi(\cdot) : x \mapsto X = \Phi(x)$, so $X_i = \Phi(x_i)$.

Linear machine in feature space:

$y = \text{sign} \left( \sum_{i \in S} \alpha_i y_i \, (X_i \cdot X) + b \right)$

Since $X_i \cdot X = \Phi(x_i) \cdot \Phi(x)$:

$y = \text{sign} \left( \sum_{i \in S} \alpha_i y_i \, (\Phi(x_i) \cdot \Phi(x)) + b \right)$

Define a kernel $K(\cdot, \cdot)$ satisfying Mercer's condition, $K(x_i, x) = \Phi(x_i) \cdot \Phi(x)$:

$y = \text{sign} \left( \sum_{i \in S} \alpha_i y_i \, K(x_i, x) + b \right)$

Here $S$ is the set of support vectors.
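The kernelized decision rule above as a short sketch; `alphas`, `b`, and the support set are assumed to come from a dual solver such as the earlier SLSQP example:

```python
import numpy as np

def kernel_predict(X_sv, y_sv, alphas, b, kernel, x_new):
    # y = sign( sum_{i in S} alpha_i y_i K(x_i, x) + b )
    k = np.array([kernel(x_i, x_new) for x_i in X_sv])
    return np.sign(np.sum(alphas * y_sv * k) + b)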
Training SVMs
Training SVMs (contd)
Avoiding Memory Problems
Does it really work?
Example
Decomposition Methods
Caching
Caching Issues
Shrinking
Least Squares Support Vector Machines (LSSVM)
The classical SVM classifier aims to minimize an upper bound on the generalization error by maximizing the margin between two disjoint half-planes.
This involves solving a quadratic programming problem that could be prohibitive on large data sets.
To overcome this problem, Suykens and Vandewalle proposed the "least squares SVM" (LSSVM) formulation. The formulation considers equality constraints and adds an extra term to the cost function. As a result, the solution follows from directly solving a set of linear equations.
LSSVM
Given a set of $M$ patterns $x_k$, where $x_k = (\dots)^T$, with corresponding labels $y_k \in \{-1, +1\}$, the LS-SVM determines a separating surface of the form $w^T \phi(x) + b = 0$ by solving a problem of the form

Minimize $\quad P(w, b, q) = \frac{1}{2} w^T w + \frac{C}{2} q^T q$
subject to $\quad y_i \left( w^T \phi(x_i) + b \right) = 1 - q_i, \quad i = 1, 2, \dots, M$

where $C$ is a parameter. The first term of the objective is for regularization, while the second term is the empirical error. The constant $C$ determines the relative importance of the two.
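Because the constraints are equalities, the LSSVM solution reduces to one linear system (the standard Suykens-Vandewalle KKT system, with $\Omega_{kl} = y_k y_l K(x_k, x_l)$); a minimal NumPy sketch assuming a linear kernel:

```python
import numpy as np

def lssvm_train(X, y, C=1.0):
    """Solve the LSSVM KKT linear system:
       [ 0    y^T         ] [ b     ]   [ 0 ]
       [ y    Omega + I/C ] [ alpha ] = [ 1 ]
    with Omega[k, l] = y_k y_l K(x_k, x_l); a linear kernel is assumed here."""
    M = len(y)
    Omega = (y[:, None] * y[None, :]) * (X @ X.T)
    A = np.zeros((M + 1, M + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(M) / C
    rhs = np.concatenate([[0.0], np.ones(M)])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # b, alpha
```

New points are then classified as $\text{sign}\left(\sum_k \alpha_k y_k K(x, x_k) + b\right)$, exactly as in the kernel machine decision rule above.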