Robust Regression

(1)

Robust Regression

… and how I could relax after I stopped relaxing

Purushottam Kar, IIT Kanpur

Workshop on Algorithms and Optimization, ICTS, Bengaluru, Dec 03, 2017

(2)–(12)

A Recommendation System Problem

[Figure: seven items $\mathbf{x}_1, \ldots, \mathbf{x}_7$ with observed ratings 0.5, 0.2, 0.2, 0.5, 0.9, 0.8, 0.4; the ratings decompose as clean predictions plus a few corrupted entries]

• A weight vector $\mathbf{w}$ encodes user preferences
• Some ratings are corrupted – can we still recover $\mathbf{w}$?
• Corruptions are systematic – inappropriate to model them as “Gaussian noise”
• The items and ratings are not received in one go – online problem!

(13)–(21)

A Biometric Identification Problem

[Figure: a database of face images $\mathbf{x}_1, \ldots, \mathbf{x}_5$; a query face is expressed (approximately) as a weighted combination of database faces with weights 0.2, 0.4, 0.05, 0.05, 0.3, plus a structured corruption term]

• The weights can reveal the identity of the person
• Corruptions are structured; not “salt-pepper” noise

(22)


Robust Learning and Estimation

• Classical subfield of statistics (Huber, 1964, Tukey, 1960)

• Newfound interest – scalability and efficiency

Robust classification (Feng et al, 2014)

Robust regression (Bhatia et al, 2015, Chen et al, 2013)

Robust PCA (Candès et al, 2009, Netrapalli et al, 2014)

Robust matrix completion (Cherapanamjeri et al, 2017)

Robust optimization (Charikar et al, 2016)

Robust estimation (Diakonikolas et al, 2017, Lai et al, 2016, Pravesh’s talk)

• Extremely exciting area – from theory and app perspectives


(23)–(31)

Robust Learning and Estimation - Application

• When data is actually corrupted, e.g. click fraud in RecSys
• Even when data is not corrupted (to allow problem reformulation)

Foreground-background extraction can be cast as robust PCA

Image in-painting can be cast as robust regression

[Figure: a video frame decomposed as background plus foreground, i.e. a low-rank part plus a sparse part]

Netrapalli et al, 2014, Bhatia et al, 2015

(32)

A Toy Problem befitting this near-lunch Hour

• A linear least-squares regression problem
• We have $n$ data points $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$
• Most of the points are clean, i.e. for some (unknown) $\mathbf{w}^* \in \mathbb{R}^d$, $y_i = \langle \mathbf{w}^*, \mathbf{x}_i \rangle$
• However, $k$ of the points were corrupted: $y_i = \langle \mathbf{w}^*, \mathbf{x}_i \rangle + b_i$
• Let $S^*$ denote the set of $n - k$ uncorrupted points
• (When/How) can we recover $\mathbf{w}^*$ from the data?
• Will see an extremely simple and intuitive algorithm
• … and its proof of optimality (sorry, but the proof will be light)
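To make the setup concrete, here is a tiny synthetic instance of this toy problem. This is only a sketch; the dimensions, corruption magnitudes and random seed are my own illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 50                       # n points in d dimensions, k of them corrupted
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)              # unit-norm covariates (see the Notation slide)
w_star = rng.standard_normal(d)
y = X.T @ w_star                            # clean responses y_i = <w*, x_i>
bad = rng.choice(n, size=k, replace=False)  # indices of the k corrupted points
y[bad] += rng.uniform(5, 10, size=k)        # systematic, one-sided corruptions b_i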

(33)

Notation

• Let $\mathbf{y} = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$, $\mathbf{b} = (b_1, \ldots, b_n)^\top \in \mathbb{R}^n$, $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, so that $\mathbf{y} = X^\top \mathbf{w}^* + \mathbf{b}$
• Assume for the sake of simplicity that $\|\mathbf{x}_i\|_2 = 1$ for all $i \in [n]$
• Recall that since only $k$ points are corrupted, $\|\mathbf{b}\|_0 \le k$ and $S^* = [n] \setminus \mathrm{supp}(\mathbf{b})$, where $\|\mathbf{v}\|_0 = |\{i : \mathbf{v}_i \ne 0\}|$
• For $S \subseteq [n]$, let $\mathbf{y}_S, \mathbf{b}_S \in \mathbb{R}^{|S|}$ and $X_S \in \mathbb{R}^{d \times |S|}$ denote the corresponding subvectors/submatrices
• Let $C = X X^\top$ and, for any $S \subseteq [n]$, denote $C_S = X_S X_S^\top$

(34)

Some Solution Strategies

• Discover the clean set $S^*$ and recover $\mathbf{w}^*$ from it:
$$\min_{|S| = n-k}\; \min_{\mathbf{w}}\; \|\mathbf{y}_S - X_S^\top \mathbf{w}\|_2^2 \;=\; \min_{|S| = n-k}\; \min_{\mathbf{w}}\; \sum_{i \in S} \left( y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle \right)^2$$
• Discover the corruptions $\mathbf{b}$ and clean up the responses $\mathbf{y}$:
$$\min_{\|\mathbf{b}\|_0 \le k}\; \min_{\mathbf{w}}\; \|(\mathbf{y} - \mathbf{b}) - X^\top \mathbf{w}\|_2^2$$
• However, the above problems are combinatorial, even NP-hard
• Relax, and just relax!
$$\min_{\|\mathbf{b}\|_1 \le r}\; \min_{\mathbf{w}}\; \|(\mathbf{y} - \mathbf{b}) - X^\top \mathbf{w}\|_2^2, \quad \text{or even} \quad \min_{\mathbf{w}, \mathbf{b}}\; \|(\mathbf{y} - \mathbf{b}) - X^\top \mathbf{w}\|_2^2 + \lambda \cdot \|\mathbf{b}\|_1$$
• Extremely popular and well-studied approach in image processing etc. (a code sketch of the penalized form follows below)
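Here is a minimal sketch of the relaxed (penalized) formulation above, assuming data in the shape of the earlier synthetic instance. It alternates an ordinary least-squares step in $\mathbf{w}$ with a soft-thresholding step in $\mathbf{b}$; each step exactly minimizes the (jointly convex) objective over its own block. The function name and the fixed iteration count are my own choices.

import numpy as np

def l1_relaxed_regression(X, y, lam=1.0, T=200):
    # Solves min_{w,b} ||(y - b) - X^T w||_2^2 + lam * ||b||_1 by block coordinate descent.
    d, n = X.shape
    b = np.zeros(n)
    for _ in range(T):
        # w-step: least squares on the "cleaned" responses y - b
        w, *_ = np.linalg.lstsq(X.T, y - b, rcond=None)
        # b-step: soft-thresholding of the residuals (exact minimizer for fixed w)
        r = y - X.T @ w
        b = np.sign(r) * np.maximum(np.abs(r) - lam / 2, 0.0)
    return w, b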

(35)

An “Alternate” Viewpoint

• Relaxation methods can be very slow in practice
• Will see one very simple way to do better
• Reconsider the formulation
$$\min_{|S| = n-k}\; \min_{\mathbf{w}}\; \|\mathbf{y}_S - X_S^\top \mathbf{w}\|_2^2$$
• Two variables of interest: $\mathbf{w}$ and $S$
• Recovering either one recovers the other

If we know $S^*$, do least squares on it to get $\mathbf{w}^*$

If we know $\mathbf{w}^*$, the points with zero error give $S^*$

• Hmm … what if we only have “good” estimates?

Can a good estimate $\hat{S}$ of $S^*$ get me a good estimate $\hat{\mathbf{w}}$ of $\mathbf{w}^*$?

Can that good estimate $\hat{\mathbf{w}}$ get me a still better estimate of $S^*$?

(36)

AM-RR: Alt. Min. for Rob. Reg.

AM-RR
1. Data $X \in \mathbb{R}^{d \times n}$, $\mathbf{y} \in \mathbb{R}^n$, # bad pts $k$
2. Initialize $S_1 \leftarrow \{1, \ldots, n-k\}$
3. For $t = 1, 2, \ldots, T$
   $\mathbf{w}^{t+1} = \arg\min_{\mathbf{w}} \|\mathbf{y}_{S_t} - X_{S_t}^\top \mathbf{w}\|_2^2$
   $S_{t+1} = \arg\min_{|S| = n-k} \|\mathbf{y}_S - X_S^\top \mathbf{w}^{t+1}\|_2^2$
4. Repeat until convergence

Maintain an “active” set $S_t$: points that we believe are clean
Solve a least squares problem on the active set
Find the points that seem least corrupted with respect to $\mathbf{w}^{t+1}$: compute the residual $\mathbf{r}^{t+1} = \mathbf{y} - X^\top \mathbf{w}^{t+1}$ and keep the points with the least residual
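The whole algorithm fits in a few lines. Here is a minimal NumPy sketch of AM-RR exactly as in the box above; the variable names and the fixed iteration count are my own choices.

import numpy as np

def am_rr(X, y, k, T=50):
    # X: (d, n) covariates (columns are points), y: (n,) responses, k: # points assumed corrupted
    d, n = X.shape
    S = np.arange(n - k)                         # step 2: arbitrary initial active set of size n - k
    for _ in range(T):                           # step 3
        # w-step: least squares on the active set, w = argmin_w ||y_S - X_S^T w||^2
        w, *_ = np.linalg.lstsq(X[:, S].T, y[S], rcond=None)
        # S-step: residuals on all points, keep the n - k points with the smallest residuals
        r = y - X.T @ w
        S = np.argsort(np.abs(r))[: n - k]
    return w, S

On the synthetic instance generated a couple of slides back, am_rr(X, y, k) should return an estimate close to w_star after a handful of iterations.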

(37)–(66)

AM-RR at work

[Figures: a sequence of plots showing AM-RR running on an example problem (Bhatia et al, 2015)]

(67)

AM-RR at work

[Figure: regret curves for AM-RR-based and baseline recommendation algorithms (Kapoor, Patel, K., 2017)]

• The y-axis plots the “regret” of the algorithms
• The regret of an algorithm informally captures the amount of “lost opportunities” due to recommending items the user doesn’t like
• AM-RR based recommendations incur substantially less regret

(68)


Feel free to doze off

Convergence proofs ahead!


(69)

Why AM-RR works?

• Some presuppositions are necessary
• It had better be possible to uniquely identify $\mathbf{w}^*$ using the data in $S^*$
• This requires $X_{S^*}$ to be well-conditioned
• To simplify things, assume all sets $S$ of size $n - k$ are well conditioned
• This can be ensured if, for some $c > 0$, every $\mathbf{v} \in \mathbb{R}^d$ satisfies $\|X_S^\top \mathbf{v}\|_2^2 \ge c \cdot \|\mathbf{v}\|_2^2$
• We will assume the above holds with $c = \alpha \cdot (n - k)$
• This holds w.h.p. for all $S$ if $X$ is sampled from a sub-Gaussian distribution
• Note that since all covariates are unit norm, $\|\mathbf{x}_i\|_2 \le 1$, we also have $\|X_S^\top \mathbf{v}\|_2^2 \le |S| \cdot \|\mathbf{v}\|_2^2$ for all $S \subseteq [n]$

(70)

The Proof

• AM-RR solves least squares on the active set $S_t$, hence $\mathbf{w}^{t+1} = C_{S_t}^{-1} X_{S_t} \mathbf{y}_{S_t}$
• However, $\mathbf{y}_{S_t} = X_{S_t}^\top \mathbf{w}^* + \mathbf{b}_{S_t}$, which gives us $\mathbf{w}^{t+1} = \mathbf{w}^* + C_{S_t}^{-1} X_{S_t} \mathbf{b}_{S_t}$ (nice! $\mathbf{w}^{t+1}$ is $\mathbf{w}^*$ plus an error term)
• This gives us the residuals as
$$\mathbf{r}^{t+1} = \mathbf{y} - X^\top \mathbf{w}^{t+1} = \mathbf{b} - X^\top C_{S_t}^{-1} X_{S_t} \mathbf{b}_{S_t}$$
(nice! $\mathbf{r}^{t+1}$ is $\mathbf{b}$ plus an error term)
• AM-RR chooses $S_{t+1}$ to be the $n - k$ points with the least residual, so $\|\mathbf{r}^{t+1}_{S_{t+1}}\|_2^2 \le \|\mathbf{r}^{t+1}_{S^*}\|_2^2$
• Notice that since $\mathbf{r}^{t+1}$ is simply $\mathbf{b}$ plus an error term, once the error goes down, my active set will only contain clean points!

(71)

The Proof

• Elementary manipulations and applying well-conditioned-ness give
$$\|\mathbf{b}_{S_{t+1}}\|_2^2 \;\le\; \frac{k^2}{\alpha^2 (n-k)^2} \cdot \|\mathbf{b}_{S_t}\|_2^2 \;+\; \frac{2k}{\alpha (n-k)} \cdot \|\mathbf{b}_{S_t}\|_2 \cdot \|\mathbf{b}_{S_{t+1}}\|_2$$
• Solving this quadratic equation (in $\|\mathbf{b}_{S_{t+1}}\|_2$) gives us
$$\|\mathbf{b}_{S_{t+1}}\|_2 \;\le\; \frac{(\sqrt{2} + 1)\,k}{\alpha (n-k)} \cdot \|\mathbf{b}_{S_t}\|_2$$
• Thus, if $k \le \frac{\alpha}{3 + \alpha} \cdot n$, then after $t \ge \log \frac{\|\mathbf{b}\|_2}{\epsilon}$ iterations we get $\|\mathbf{w}^t - \mathbf{w}^*\|_2 \le \epsilon$
• A better way to rewrite the above is $k \le \frac{n}{3\kappa + 1}$
• The quantity $\kappa = \frac{1}{\alpha}$ captures the condition number of the problem
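To spell out the “solving this quadratic equation” step (my own expansion, writing $x = \|\mathbf{b}_{S_{t+1}}\|_2$, $y = \|\mathbf{b}_{S_t}\|_2$ and $A = \frac{k}{\alpha(n-k)}$):
$$x^2 \le A^2 y^2 + 2 A y x \;\Longrightarrow\; (x - A y)^2 \le 2 A^2 y^2 \;\Longrightarrow\; x \le (1 + \sqrt{2})\, A\, y$$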

(72)

A Generalized AM-RR

• Loss function $\ell(\mathbf{w}; y_i, \mathbf{x}_i)$ with certain nice properties:

Positivity: $\ell(\mathbf{w}; y_i, \mathbf{x}_i) \ge 0$

Realizability: $\ell(\mathbf{w}^*; y_i, \mathbf{x}_i) = 0$ for all $i \in S^*$

Normalization: $\ell(\mathbf{w}_1; y_i, \mathbf{x}_i) - \ell(\mathbf{w}_2; y_i, \mathbf{x}_i) \le \frac{\beta}{2} \cdot \|\mathbf{w}_1 - \mathbf{w}_2\|_2^2$ (plays the same role as the assumption $\|\mathbf{x}_i\|_2 \le 1$)

Well-conditioned-ness (strong convexity): $\ell(\mathbf{w}_1; y_i, \mathbf{x}_i) \ge \ell(\mathbf{w}_2; y_i, \mathbf{x}_i) + \langle \nabla \ell(\mathbf{w}_2; y_i, \mathbf{x}_i), \mathbf{w}_1 - \mathbf{w}_2 \rangle + \frac{\alpha}{2} \cdot \|\mathbf{w}_1 - \mathbf{w}_2\|_2^2$

• Denote $f(\mathbf{w}, S) = \sum_{i \in S} \ell(\mathbf{w}; y_i, \mathbf{x}_i)$
• Actually, a weaker requirement is needed: for all $S \subseteq [n]$,
$$f(\mathbf{w}_1, S) \ge f(\mathbf{w}_2, S) + \langle \nabla f(\mathbf{w}_2, S), \mathbf{w}_1 - \mathbf{w}_2 \rangle + \frac{\alpha |S|}{2} \cdot \|\mathbf{w}_1 - \mathbf{w}_2\|_2^2$$
$$f(\mathbf{w}_1, S) - f(\mathbf{w}_2, S) \le \frac{\beta |S|}{2} \cdot \|\mathbf{w}_1 - \mathbf{w}_2\|_2^2$$
(Weaker result stated for the sake of clarity)

(73)

A Generalized AM-RR

AM-RR
1. Data $X \in \mathbb{R}^{d \times n}$, $\mathbf{y} \in \mathbb{R}^n$, # bad pts $k$
2. Initialize $S_1 \leftarrow \{1, \ldots, n-k\}$
3. For $t = 1, 2, \ldots, T$
   $\mathbf{w}^{t+1} = \arg\min_{\mathbf{w}} \|\mathbf{y}_{S_t} - X_{S_t}^\top \mathbf{w}\|_2^2$
   $S_{t+1} = \arg\min_{|S| = n-k} \|\mathbf{y}_S - X_S^\top \mathbf{w}^{t+1}\|_2^2$
4. Repeat until convergence

AM-RR-gen
1. Data $X \in \mathbb{R}^{d \times n}$, $\mathbf{y} \in \mathbb{R}^n$, # bad pts $k$
2. Initialize $S_1 \leftarrow \{1, \ldots, n-k\}$
3. For $t = 1, 2, \ldots, T$
   $\mathbf{w}^{t+1} = \arg\min_{\mathbf{w}} f(\mathbf{w}, S_t)$
   $S_{t+1} = \arg\min_{|S| = n-k} f(\mathbf{w}^{t+1}, S)$
4. Repeat until convergence

Since $f$ is a convex function, it is usually easy to solve the $\mathbf{w}$-step
The $S$-step simply finds the $n - k$ points with the least loss in terms of $\ell(\mathbf{w}; y_i, \mathbf{x}_i)$
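A sketch of AM-RR-gen with a pluggable per-point loss. The function names and the use of an off-the-shelf convex solver for the w-step are my own illustrative choices.

import numpy as np
from scipy.optimize import minimize

def am_rr_gen(point_loss, w0, n, k, T=50):
    # point_loss(w, i) should return the loss of point i at w (convex and smooth in w).
    S = np.arange(n - k)
    w = np.asarray(w0, dtype=float)
    for _ in range(T):
        # w-step: minimize f(w, S_t) = sum_{i in S_t} point_loss(w, i); convex, so any solver works
        active = S.copy()
        w = minimize(lambda v: sum(point_loss(v, i) for i in active), w, method="L-BFGS-B").x
        # S-step: keep the n - k points with the least loss under the new w
        losses = np.array([point_loss(w, i) for i in range(n)])
        S = np.argsort(losses)[: n - k]
    return w, S

With point_loss(w, i) = (y[i] - X[:, i] @ w) ** 2 this reduces to the earlier AM-RR sketch, up to the choice of solver in the w-step.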

(74)

Why AM-RR-gen works!

• Since $\mathbf{w}^{t+1}$ minimizes $f(\mathbf{w}, S_t)$, we must have $\nabla f(\mathbf{w}^{t+1}, S_t) = \mathbf{0}$
• Applying well-conditioned-ness then gives us
$$\|\mathbf{w}^* - \mathbf{w}^{t+1}\|_2^2 \;\le\; \frac{2}{\alpha (n-k)} \left( f(\mathbf{w}^*, S_t) - f(\mathbf{w}^{t+1}, S_t) \right) \;\le\; \frac{2}{\alpha (n-k)}\, f(\mathbf{w}^*, S_t)$$
• Since $S_{t+1}$ minimizes $f(\mathbf{w}^{t+1}, S)$, we have $f(\mathbf{w}^{t+1}, S_{t+1}) \le f(\mathbf{w}^{t+1}, S^*)$, and hence $f(\mathbf{w}^{t+1}, S_{t+1} \setminus S^*) \le f(\mathbf{w}^{t+1}, S^* \setminus S_{t+1})$
• Since $|S_{t+1} \setminus S^*| \le k$ and $|S^* \setminus S_{t+1}| \le k$, normalization gives us
$$f(\mathbf{w}^*, S_{t+1}) = f(\mathbf{w}^*, S_{t+1} \setminus S^*) \le \beta k \cdot \|\mathbf{w}^* - \mathbf{w}^{t+1}\|_2^2$$
• Putting things together gives us $f(\mathbf{w}^*, S_{t+1}) \le \frac{2 \beta k}{\alpha (n-k)} \cdot f(\mathbf{w}^*, S_t)$
• For $k \le \frac{n}{3\kappa + 1}$ we get linear convergence ($\kappa = \frac{\beta}{\alpha}$ is the condition number)

(75)

AM-RR with Kernels!

• Calculations get messy so just a sketch-of-an-algo for now
• Data $(\mathbf{x}_i, y_i)$ and kernel $K$ with associated feature map $\phi: \mathbb{R}^d \to \mathcal{H}$, $K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$
• The model is a bit different: $y_i = \langle \mathbf{W}, \phi(\mathbf{x}_i) \rangle + \mathbf{b}_i$, for some $\mathbf{W} \in \mathcal{H}$
• Can only hope to recover the component of $\mathbf{W}$ in $\mathrm{span}\{\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)\}$
• A simplified algo that uses AM-RR as a black box (see the sketch below):

Receive the data and create new features out of the “empirical kernel map”

Create $\mathbf{z}_i = \left( K(\mathbf{x}_i, \mathbf{x}_1), K(\mathbf{x}_i, \mathbf{x}_2), \ldots, K(\mathbf{x}_i, \mathbf{x}_n) \right) \in \mathbb{R}^n$

Perform AM-RR with covariates $\mathbf{z}_i$, responses $y_i$

• Making this bullet-proof needs more work
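A sketch of the black-box reduction above, reusing the am_rr function from the earlier sketch; the RBF kernel at the end is just an illustrative choice.

import numpy as np

def am_rr_kernel(X, y, k, kernel, T=50):
    # X: (d, n) raw covariates, kernel: a PSD kernel function on pairs of points.
    n = X.shape[1]
    # Empirical kernel map: z_i = (K(x_i, x_1), ..., K(x_i, x_n))
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])
    # K is symmetric, so its i-th column is exactly z_i; feed it to AM-RR as the covariate matrix.
    return am_rr(K, y, k, T=T)

# Illustrative kernel choice:
rbf = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2)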

(76)


Please consider waking up

Open problems ahead


(77)

Non-toy Problems for relaxed introspection

• Presence of dense noise: $\mathbf{y} = X^\top \mathbf{w}^* + \boldsymbol{\epsilon} + \mathbf{b}$

Corruptions are still sparse: $\|\mathbf{b}\|_0 \le k$

Dense noise is Gaussian: $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$

Can we ensure consistent recovery, i.e. $\lim_{n \to \infty} \|\hat{\mathbf{w}} - \mathbf{w}^*\|_2 = 0$?

• Adversary Model

Fully adaptive: $\mathbf{b}$ chosen with knowledge of $X, \mathbf{w}^*, \boldsymbol{\epsilon}$

Fully oblivious: $\mathbf{b}$ chosen without knowledge of $X, \mathbf{w}^*, \boldsymbol{\epsilon}$

Partially oblivious: $\mathbf{b}$ chosen with knowledge of $X, \mathbf{w}^*$, or just $X$

• Breakdown point

Can we tolerate up to $k = \frac{n}{2} - 1$ corruptions? $k = n - d \log d$ corruptions?

What if the adversary is fully oblivious?

(78)

Non-toy Problems for relaxed introspection

• Non-linear/structured Models

What if the linear model $y \approx \langle \mathbf{w}, \mathbf{x} \rangle$ is not appropriate?

Rob. Reg. with sparse models? kernels? deep-nets?

• Loss Function

The least squares loss function $(y - \langle \mathbf{w}, \mathbf{x} \rangle)^2$ may not suit all applications

Robust regression with other loss functions? We already saw an example

• Speed

I am solving $\log \frac{1}{\epsilon}$ regression problems to solve one corrupted regression problem

Can I invoke the least squares solver less frequently?

• Feature corruption

What if the features $\mathbf{x}_i$ are corrupted (too)?

Can pass the “blame” for corruption onto $y_i$, but this does not always work

Distributed corruption: each $\mathbf{x}_i$ has only a few of its $d$ coordinates corrupted

(79)

We do have some answers

• We can ensure consistent recovery of $\mathbf{w}^*$ in the presence of dense Gaussian noise and sparse corruptions if the adversary is fully oblivious (Bhatia et al, 2017)
• We can handle sparse linear models/kernels for a fully adaptive adversary (but no dense noise) (Bhatia et al, 2015)
• Feature corruptions can be handled (Chen et al, 2013)
• AM-RR can be sped up quite a bit (see the sketch below)

Unless the active set $S_t$ is stable, do gradient steps

May replace with SGD/ConjGD steps

In practice, very few least squares calls

Extremely rapid execution in practice

• AM-RR can be extended to other losses
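A sketch of the speed-up mentioned above: replace the exact least-squares solve in AM-RR with a few gradient steps on the active-set objective. The step size rule and the inner iteration count are my own conservative choices.

import numpy as np

def am_rr_fast(X, y, k, T=100, inner_steps=3):
    d, n = X.shape
    S = np.arange(n - k)
    w = np.zeros(d)
    lr = 1.0 / np.linalg.norm(X, 2) ** 2          # safe step size: 1 / ||X||_2^2
    for _ in range(T):
        for _ in range(inner_steps):
            # gradient of 0.5 * ||y_S - X_S^T w||^2 with respect to w
            g = X[:, S] @ (X[:, S].T @ w - y[S])
            w -= lr * g
        r = y - X.T @ w
        S = np.argsort(np.abs(r))[: n - k]
    return w, S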

(80)

We do have some answers

• A “softer” approach can also be shown to work

Instead of throwing away points, down-weight them

Points with small residual get up-weighted

Perform weighted least-squares estimation (IRLS)

• Our breakdown points are quite bad

For simple linear models with no feature corruption, the best result is $k \approx \frac{n}{100}$

For feature corruption it is even worse: $k \approx \frac{n}{d}$

• Biggest missing piece: answering several questions together, e.g. consistent recovery with kernel models in the presence of an adaptive adversary and feature corruption?

That’s All!

(81)

Shameless Ad Alert!


(82)

Non-convex Optimization For Machine Learning

• A new monograph!
• Accessible but comprehensive
• Foundations: PGD/SGD, Alt-Min, EM
• Apps: sparse rec., matrix comp., rob. reg.
• 130 references, 50 exercises, 20 figures
• Official publication available from now publishers: https://tinyurl.com/ncom-book
• For the benefit of students, an arXiv version: https://tinyurl.com/ncom-arxiv (grateful to now publishers for this!)
• Don’t Relax!

Non-convex Optimization for Machine Learning
Prateek Jain and Purushottam Kar
Foundations and Trends® in Machine Learning 10:3–4

(83)

The Data Sciences Gang @ CSE, IITK


(84)

Our Strengths

Machine Learning

Vision, Image Processing

Online, Streaming Algorithms

Databases, Data Mining

Cyber-physical Systems


(85)

Gratitude

Prateek Jain (MSR)
Kush Bhatia (UC Berkeley)
Govind Gopakumar (IIT Kanpur)
Sayash Kapoor (IIT Kanpur)
Kshitij Patel (IIT Kanpur)
