## Robust Regression

… and how I could relax after I stopped relaxing

**Workshop on Algorithms and Optimization, ICTS, Bengaluru**
**Purushottam Kar**

**IIT KANPUR**

## A Recommendation System Problem

*(Figure: a user's ratings 0.5, 0.2, 0.2, 0.5, 0.9, 0.8, 0.4 for items 𝐱^{1}, …, 𝐱^{7}, modeled as ratings ≈ 𝑋^{⊤}𝐰^{∗}, where 𝐰^{∗} encodes user preferences; a few of the ratings (the bold entries 0.8, 0.2, 0.7) are corrupted additively.)*

• Can we still recover 𝐰^{∗}?

• Corruptions are systematic – inappropriate to model them as “Gaussian noise”

• The items and ratings are not received in one go – an online problem!

## A Biometric Identification Problem

*(Figure: a probe image matched against a DATABASE of images 𝐱^{1}, …, 𝐱^{5} as a weighted combination with weights 0.2, 0.4, 0.05, 0.05, 0.3, plus a structured corruption term.)*

• Weights can reveal the identity of the person

• Corruptions are structured; not “salt-pepper” noise


## Robust Learning and Estimation

• Classical subfield of statistics (Huber, 1964; Tukey, 1960)

• Newfound interest – scalability and efficiency

• Robust classification (Feng et al, 2014)

• Robust regression (Bhatia et al, 2015; Chen et al, 2013)

• Robust PCA (Candès et al, 2009; Netrapalli et al, 2014)

• Robust matrix completion (Cherapanamjeri et al, 2017)

• Robust optimization (Charikar et al, 2016)

• Robust estimation (Diakonikolas et al, 2017; Lai et al, 2016; Pravesh’s talk)

• Extremely exciting area – from both theory and applications perspectives

## Robust Learning and Estimation - Applications

• When data is actually corrupted, e.g. click fraud in RecSys

• Even when data is not corrupted (to allow problem reformulation)

• Foreground-background extraction can be cast as robust PCA

• Image in-painting can be cast as robust regression

*(Figure: a frame decomposed as the sum of two component images; Netrapalli et al, 2014; Bhatia et al, 2015.)*


## A Toy Problem befitting this near-lunch Hour

• A linear least-squares regression problem

• We have 𝑛 data points (𝐱^{𝑖}, 𝑦^{𝑖}) ∈ ℝ^{𝑑} × ℝ

• Most of the points are clean, i.e. for some (unknown) 𝐰^{∗} ∈ ℝ^{𝑑},
𝑦^{𝑖} = ⟨𝐰^{∗}, 𝐱^{𝑖}⟩

• However, 𝑘 of the points were corrupted: 𝑦^{𝑖} = ⟨𝐰^{∗}, 𝐱^{𝑖}⟩ + 𝑏_{𝑖}^{∗}

• Let 𝑆^{∗} denote the set of 𝑛 − 𝑘 uncorrupted points

• (When/How) can we recover 𝐰^{∗} from the data?

• Will see an extremely simple and intuitive algorithm

• … and its proof of optimality (sorry, but the proof will be light)
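A minimal sketch (in NumPy – an assumption, nothing here is from the talk itself) of this data model, which the later snippets reuse:

```python
import numpy as np

# Toy data model: n points in d dimensions, k of which get additive
# corruptions b* on their responses.
rng = np.random.default_rng(0)
n, d, k = 500, 10, 50

X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)              # unit-norm covariates, as assumed below
w_star = rng.standard_normal(d)

b_star = np.zeros(n)
bad = rng.choice(n, size=k, replace=False)  # the k corrupted points
b_star[bad] = 10 * rng.standard_normal(k)   # large, sparse corruptions

y = X.T @ w_star + b_star                   # y = X^T w* + b*
```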

## Notation

• Let 𝐲 = (𝑦^{1}, …, 𝑦^{𝑛})^{⊤} ∈ ℝ^{𝑛}, 𝐛^{∗} = (𝑏_{1}^{∗}, …, 𝑏_{𝑛}^{∗})^{⊤} ∈ ℝ^{𝑛}, 𝑋 = [𝐱^{1}, …, 𝐱^{𝑛}] ∈ ℝ^{𝑑×𝑛}, so that
𝐲 = 𝑋^{⊤}𝐰^{∗} + 𝐛^{∗}

• Assume for the sake of simplicity that ‖𝐱^{𝑖}‖_{2} = 1 for all 𝑖 ∈ [𝑛]

• Recall that since only 𝑘 points are corrupted, ‖𝐛^{∗}‖_{0} ≤ 𝑘 and 𝑆^{∗} = [𝑛] ∖ supp(𝐛^{∗}), where ‖𝐯‖_{0} = |{𝑖: 𝐯_{𝑖} ≠ 0}|

• For 𝑆 ⊆ [𝑛], let 𝐲_{𝑆}, 𝐛_{𝑆}^{∗} ∈ ℝ^{|𝑆|} and 𝑋_{𝑆} ∈ ℝ^{𝑑×|𝑆|} denote the corresponding subvectors/submatrices

• Let 𝐶 = 𝑋𝑋^{⊤} and, for any 𝑆 ⊆ [𝑛], denote 𝐶_{𝑆} = 𝑋_{𝑆}𝑋_{𝑆}^{⊤}

## Some Solution Strategies

• Discover the clean set 𝑆^{∗} and recover 𝐰^{∗} from it

min_{|𝑆|=𝑛−𝑘} min_{𝐰} ‖𝐲_{𝑆} − 𝑋_{𝑆}^{⊤}𝐰‖_{2}^{2} = min_{|𝑆|=𝑛−𝑘} min_{𝐰} ∑_{𝑖∈𝑆} (𝑦^{𝑖} − ⟨𝐰, 𝐱^{𝑖}⟩)^{2}

• Discover the corruptions 𝐛^{∗} and clean up the responses 𝐲

min_{‖𝐛‖_{0}≤𝑘} min_{𝐰} ‖(𝐲 − 𝐛) − 𝑋^{⊤}𝐰‖_{2}^{2}

• However, the above problems are combinatorial, even NP-hard

• Relax, and just relax!

min_{‖𝐛‖_{1}≤𝑟} min_{𝐰} ‖(𝐲 − 𝐛) − 𝑋^{⊤}𝐰‖_{2}^{2}, or even min_{𝐰,𝐛} ‖(𝐲 − 𝐛) − 𝑋^{⊤}𝐰‖_{2}^{2} + 𝜆 ⋅ ‖𝐛‖_{1}

• Extremely popular and well-studied approach in image processing etc.
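The relaxed problem above is jointly convex in (𝐰, 𝐛); a minimal sketch using CVXPY (the solver choice is an assumption, and 𝜆 is a hypothetical value that needs tuning), reusing `X`, `y`, `d`, `n` from the data sketch:

```python
import cvxpy as cp

w = cp.Variable(d)
b = cp.Variable(n)
lam = 1.0   # hypothetical regularization strength
objective = cp.Minimize(cp.sum_squares((y - b) - X.T @ w) + lam * cp.norm1(b))
cp.Problem(objective).solve()
w_hat = w.value   # estimate of w*
```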

## An “Alternate” Viewpoint

• Relaxation methods can be very slow in practice

• Will see one very simple way to do better

• Reconsider the formulation

min_{|𝑆|=𝑛−𝑘} min_{𝐰} ‖𝐲_{𝑆} − 𝑋_{𝑆}^{⊤}𝐰‖_{2}^{2}

• Two variables of interest: 𝐰^{∗} and 𝑆^{∗}

• Recovering either one recovers the other

• If we know 𝑆^{∗}, do least squares on it to get 𝐰^{∗}

• If we know 𝐰^{∗}, the points with zero error give 𝑆^{∗}

• Hmm … what if we only have “good” estimates?

• Can a good estimate 𝑆 of 𝑆^{∗} get me a good estimate 𝐰 of 𝐰^{∗}?

• Can that good estimate 𝐰 get me a still better estimate of 𝑆^{∗}?

## AM-RR: Alt. Min. for Rob. Reg.

AM-RR
1. Data 𝑋 ∈ ℝ^{𝑑×𝑛}, 𝐲 ∈ ℝ^{𝑛}, # bad pts 𝑘
2. Initialize 𝑆^{1} ← {1: 𝑛 − 𝑘}
3. For 𝑡 = 1, 2, …, 𝑇
   𝐰^{𝑡+1} = arg min_{𝐰} ‖𝐲_{𝑆^{𝑡}} − 𝑋_{𝑆^{𝑡}}^{⊤}𝐰‖_{2}^{2}
   𝑆^{𝑡+1} = arg min_{|𝑆|=𝑛−𝑘} ‖𝐲_{𝑆} − 𝑋_{𝑆}^{⊤}𝐰^{𝑡+1}‖_{2}^{2}
4. Repeat until convergence

• Maintain an “active” set 𝑆^{𝑡}: points that we believe are clean

• Solve a least squares problem on the active set

• Find the points which seem least corrupted with respect to 𝐰^{𝑡+1}: compute the residual 𝐫^{𝑡+1} = 𝐲 − 𝑋^{⊤}𝐰^{𝑡+1} and keep the points with least residual
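A minimal NumPy implementation of AM-RR as stated above (the function name and the convergence test are my own; `lstsq` plays the role of the least squares solver):

```python
import numpy as np

def am_rr(X, y, k, T=50):
    """AM-RR: alternate between least squares on the active set and
    re-selecting the n - k points with the smallest residuals."""
    d, n = X.shape
    S = np.arange(n - k)                      # step 2: initial active set
    w = np.zeros(d)
    for _ in range(T):
        # w-step: least squares restricted to the active set
        w, *_ = np.linalg.lstsq(X[:, S].T, y[S], rcond=None)
        # S-step: keep the n - k points with the least absolute residual
        r = y - X.T @ w
        S_new = np.sort(np.argsort(np.abs(r))[: n - k])
        if np.array_equal(S_new, S):          # active set stable: converged
            break
        S = S_new
    return w, S
```

On the toy data above, `am_rr(X, y, k)[0]` typically matches `w_star` to numerical precision within a handful of iterations.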

## AM-RR at work

*(Figure slides: AM-RR iterations visualized; Bhatia et al, 2015.)*

## AM-RR at work

*(Figure: regret comparison on a recommendation task; Kapoor, Patel, K., 2017.)*

• The y-axis plots the “regret” of the algorithms

• The regret of an algorithm informally captures the amount of “lost opportunities” due to recommending items the user doesn’t like

• AM-RR based recommendations incur substantially less regret

## Feel free to doze off

**Convergence proofs ahead!**


## Why AM-RR works

• Some presuppositions are necessary

• It had better be possible to uniquely identify 𝐰^{∗} using data in 𝑆^{∗}

• This requires 𝑋_{𝑆^{∗}} to be well-conditioned

• To simplify things, assume all sets 𝑆 of size 𝑛 − 𝑘 are well-conditioned

• This can be ensured if, for some 𝑐 > 0 and every 𝐯 ∈ ℝ^{𝑑},
‖𝑋_{𝑆}^{⊤}𝐯‖_{2}^{2} ≥ 𝑐 ⋅ ‖𝐯‖_{2}^{2}

• We will assume the above holds with 𝑐 = 𝛼 ⋅ (𝑛 − 𝑘)

• This holds w.h.p. for all 𝑆 if 𝑋 is sampled from a sub-Gaussian distribution

• Note that since all covariates are unit norm (‖𝐱^{𝑖}‖_{2} ≤ 1), we also have
‖𝑋_{𝑆}^{⊤}𝐯‖_{2}^{2} ≤ |𝑆| ⋅ ‖𝐯‖_{2}^{2} for all 𝑆 ⊆ [𝑛]
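This condition is easy to sanity-check numerically; a sketch (sampling random subsets rather than enumerating all of them, which is my simplification) that estimates 𝛼 on the toy data. Since ‖𝑋_{𝑆}^{⊤}𝐯‖_{2}^{2} = 𝐯^{⊤}𝐶_{𝑆}𝐯, the best constant for a given 𝑆 is the smallest eigenvalue of 𝐶_{𝑆}:

```python
import numpy as np

def estimate_alpha(X, k, trials=200):
    """Estimate alpha: min over sampled S of lambda_min(C_S) / (n - k)."""
    rng = np.random.default_rng(1)
    d, n = X.shape
    c = np.inf
    for _ in range(trials):
        S = rng.choice(n, size=n - k, replace=False)
        C_S = X[:, S] @ X[:, S].T
        c = min(c, np.linalg.eigvalsh(C_S)[0])   # smallest eigenvalue
    return c / (n - k)
```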

## The Proof

• AM-RR solves least squares on the active set 𝑆^{𝑡}, hence
𝐰^{𝑡+1} = 𝐶_{𝑆^{𝑡}}^{−1} 𝑋_{𝑆^{𝑡}} 𝐲_{𝑆^{𝑡}}

• However, 𝐲_{𝑆^{𝑡}} = 𝑋_{𝑆^{𝑡}}^{⊤}𝐰^{∗} + 𝐛_{𝑆^{𝑡}}^{∗}, which gives us
𝐰^{𝑡+1} = 𝐰^{∗} + 𝐶_{𝑆^{𝑡}}^{−1} 𝑋_{𝑆^{𝑡}} 𝐛_{𝑆^{𝑡}}^{∗}

Nice! 𝐰^{𝑡+1} is 𝐰^{∗} plus an error term

• This gives us the residuals as
𝐫^{𝑡+1} = 𝐲 − 𝑋^{⊤}𝐰^{𝑡+1} = 𝐛^{∗} − 𝑋^{⊤}𝐶_{𝑆^{𝑡}}^{−1} 𝑋_{𝑆^{𝑡}} 𝐛_{𝑆^{𝑡}}^{∗}

Nice! 𝐫^{𝑡+1} is 𝐛^{∗} plus an error term

• AM-RR chooses 𝑆^{𝑡+1} to be the 𝑛 − 𝑘 points with least residual, so
‖𝐫_{𝑆^{𝑡+1}}^{𝑡+1}‖_{2}^{2} ≤ ‖𝐫_{𝑆^{∗}}^{𝑡+1}‖_{2}^{2}

• Notice that since 𝐫^{𝑡+1} is simply 𝐛^{∗} plus an error term, once the error goes down, my active set will only contain clean points!

## The Proof

• Elementary manipulations and applying well-conditioned-ness give
‖𝐛_{𝑆^{𝑡+1}}^{∗}‖_{2}^{2} ≤ 𝑘^{2}/(𝛼^{2}(𝑛 − 𝑘)^{2}) ⋅ ‖𝐛_{𝑆^{𝑡}}^{∗}‖_{2}^{2} + 2𝑘/(𝛼(𝑛 − 𝑘)) ⋅ ‖𝐛_{𝑆^{𝑡}}^{∗}‖_{2} ⋅ ‖𝐛_{𝑆^{𝑡+1}}^{∗}‖_{2}

• Solving this quadratic inequality gives us
‖𝐛_{𝑆^{𝑡+1}}^{∗}‖_{2} ≤ (√2 + 1) ⋅ 𝑘/(𝛼(𝑛 − 𝑘)) ⋅ ‖𝐛_{𝑆^{𝑡}}^{∗}‖_{2}

• Thus, if 𝑘 ≤ 𝛼/(3 + 𝛼) ⋅ 𝑛, then after 𝑡 ≥ log(‖𝐛^{∗}‖_{2}/𝜖) iterations we get ‖𝐰^{𝑡} − 𝐰^{∗}‖_{2} ≤ 𝜖

• A better way to rewrite the above is 𝑘 ≤ 𝑛/(3𝜅 + 1)

• The quantity 𝜅 = 1/𝛼 captures the condition number of the problem
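This linear convergence is easy to observe empirically; a short sketch instrumenting the AM-RR loop from earlier on the toy data:

```python
# ||w^t - w*||_2 should shrink geometrically until numerical precision.
S = np.arange(n - k)
for t in range(10):
    w, *_ = np.linalg.lstsq(X[:, S].T, y[S], rcond=None)
    S = np.argsort(np.abs(y - X.T @ w))[: n - k]
    print(t, np.linalg.norm(w - w_star))
```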

## A Generalized AM-RR

• Loss function ℓ(𝐰; 𝑦^{𝑖}, 𝐱^{𝑖}) with certain nice properties

• Positivity: ℓ(𝐰; 𝑦^{𝑖}, 𝐱^{𝑖}) ≥ 0

• Realizability: ℓ(𝐰^{∗}; 𝑦^{𝑖}, 𝐱^{𝑖}) = 0 for all 𝑖 ∈ 𝑆^{∗}

• Normalization (plays the same role as the assumption ‖𝐱^{𝑖}‖_{2} ≤ 1):
ℓ(𝐰^{1}; 𝑦^{𝑖}, 𝐱^{𝑖}) − ℓ(𝐰^{2}; 𝑦^{𝑖}, 𝐱^{𝑖}) ≤ 𝛽/2 ⋅ ‖𝐰^{1} − 𝐰^{2}‖_{2}^{2}

• Well-conditioned-ness (strong convexity):
ℓ(𝐰^{1}; 𝑦^{𝑖}, 𝐱^{𝑖}) ≥ ℓ(𝐰^{2}; 𝑦^{𝑖}, 𝐱^{𝑖}) + ⟨𝛻ℓ(𝐰^{2}; 𝑦^{𝑖}, 𝐱^{𝑖}), 𝐰^{1} − 𝐰^{2}⟩ + 𝛼/2 ⋅ ‖𝐰^{1} − 𝐰^{2}‖_{2}^{2}

• Denote 𝑓(𝐰, 𝑆) = ∑_{𝑖∈𝑆} ℓ(𝐰; 𝑦^{𝑖}, 𝐱^{𝑖})

• Actually, a weaker requirement suffices (the per-point version above is stated for the sake of clarity): for all 𝑆 ⊆ [𝑛],
𝑓(𝐰^{1}, 𝑆) ≥ 𝑓(𝐰^{2}, 𝑆) + ⟨𝛻𝑓(𝐰^{2}, 𝑆), 𝐰^{1} − 𝐰^{2}⟩ + 𝛼|𝑆|/2 ⋅ ‖𝐰^{1} − 𝐰^{2}‖_{2}^{2}
𝑓(𝐰^{1}, 𝑆) − 𝑓(𝐰^{2}, 𝑆) ≤ 𝛽|𝑆|/2 ⋅ ‖𝐰^{1} − 𝐰^{2}‖_{2}^{2}

## A Generalized AM-RR

AM-RR
1. Data 𝑋 ∈ ℝ^{𝑑×𝑛}, 𝐲 ∈ ℝ^{𝑛}, # bad pts 𝑘
2. Initialize 𝑆^{1} ← {1: 𝑛 − 𝑘}
3. For 𝑡 = 1, 2, …, 𝑇
   𝐰^{𝑡+1} = arg min_{𝐰} ‖𝐲_{𝑆^{𝑡}} − 𝑋_{𝑆^{𝑡}}^{⊤}𝐰‖_{2}^{2}
   𝑆^{𝑡+1} = arg min_{|𝑆|=𝑛−𝑘} ‖𝐲_{𝑆} − 𝑋_{𝑆}^{⊤}𝐰^{𝑡+1}‖_{2}^{2}
4. Repeat until convergence

AM-RR-gen
1. Data 𝑋 ∈ ℝ^{𝑑×𝑛}, 𝐲 ∈ ℝ^{𝑛}, # bad pts 𝑘
2. Initialize 𝑆^{1} ← {1: 𝑛 − 𝑘}
3. For 𝑡 = 1, 2, …, 𝑇
   𝐰^{𝑡+1} = arg min_{𝐰} 𝑓(𝐰, 𝑆^{𝑡})
   𝑆^{𝑡+1} = arg min_{|𝑆|=𝑛−𝑘} 𝑓(𝐰^{𝑡+1}, 𝑆)
4. Repeat until convergence

• Since 𝑓 is a convex function, the 𝐰-step is usually easy to solve

• The 𝑆-step finds the 𝑛 − 𝑘 points with the least loss in terms of ℓ(𝐰; 𝑦^{𝑖}, 𝐱^{𝑖})
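A sketch of AM-RR-gen for an arbitrary per-point loss (the interface and the use of `scipy.optimize.minimize` for the 𝐰-step are my assumptions; for the least squares loss ℓ(𝐰; 𝑦, 𝐱) = (𝑦 − ⟨𝐰, 𝐱⟩)² this reduces to AM-RR):

```python
import numpy as np
from scipy.optimize import minimize

def am_rr_gen(point_loss, n, d, k, T=50):
    """point_loss(w, i) -> loss of point i at w; assumed to satisfy the
    positivity/realizability/conditioning properties listed above."""
    S = np.arange(n - k)
    w = np.zeros(d)
    for _ in range(T):
        # w-step: minimize f(w, S) = total loss over the active set
        w = minimize(lambda v: sum(point_loss(v, i) for i in S), w).x
        # S-step: keep the n - k points with the smallest individual loss
        losses = np.array([point_loss(w, i) for i in range(n)])
        S = np.argsort(losses)[: n - k]
    return w, S

# Example: the least squares loss recovers plain AM-RR
sq_loss = lambda w, i: (y[i] - X[:, i] @ w) ** 2
```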

## Why AM-RR-gen works!

• Since 𝐰^{𝑡+1} minimizes 𝑓(𝐰, 𝑆^{𝑡}), we must have 𝛻𝑓(𝐰^{𝑡+1}, 𝑆^{𝑡}) = 𝟎

• Applying well-conditioned-ness then gives us
‖𝐰^{∗} − 𝐰^{𝑡+1}‖_{2}^{2} ≤ 2/(𝛼(𝑛 − 𝑘)) ⋅ (𝑓(𝐰^{∗}, 𝑆^{𝑡}) − 𝑓(𝐰^{𝑡+1}, 𝑆^{𝑡})) ≤ 2/(𝛼(𝑛 − 𝑘)) ⋅ 𝑓(𝐰^{∗}, 𝑆^{𝑡})

• Since 𝑆^{𝑡+1} minimizes 𝑓(𝐰^{𝑡+1}, 𝑆), we have 𝑓(𝐰^{𝑡+1}, 𝑆^{𝑡+1}) ≤ 𝑓(𝐰^{𝑡+1}, 𝑆^{∗}), i.e.
𝑓(𝐰^{𝑡+1}, 𝑆^{𝑡+1}∖𝑆^{∗}) ≤ 𝑓(𝐰^{𝑡+1}, 𝑆^{∗}∖𝑆^{𝑡+1})

• Since |𝑆^{𝑡+1}∖𝑆^{∗}| ≤ 𝑘 and |𝑆^{∗}∖𝑆^{𝑡+1}| ≤ 𝑘, normalization gives us
𝑓(𝐰^{∗}, 𝑆^{𝑡+1}) = 𝑓(𝐰^{∗}, 𝑆^{𝑡+1}∖𝑆^{∗}) ≤ 𝛽𝑘 ⋅ ‖𝐰^{∗} − 𝐰^{𝑡+1}‖_{2}^{2}

• Putting things together gives us 𝑓(𝐰^{∗}, 𝑆^{𝑡+1}) ≤ 2𝛽𝑘/(𝛼(𝑛 − 𝑘)) ⋅ 𝑓(𝐰^{∗}, 𝑆^{𝑡})

• For 𝑘 ≤ 𝑛/(3𝜅 + 1) we get linear convergence (𝜅 = 𝛽/𝛼 is the condition number)

## AM-RR with Kernels!

• Calculations get messy so just a sketch-of-an-algo for now

• Data (𝐱^{𝑖}, 𝑦^{𝑖}) and kernel 𝐾 with associated feature map 𝜙: ℝ^{𝑑} → ℋ,
𝐾(𝐱^{𝑖}, 𝐱^{𝑗}) = ⟨𝜙(𝐱^{𝑖}), 𝜙(𝐱^{𝑗})⟩

• Model is a bit different: 𝑦^{𝑖} = ⟨𝐖, 𝜙(𝐱^{𝑖})⟩ + 𝑏_{𝑖}^{∗}, for some 𝐖 ∈ ℋ

• Can only hope to recover the component of 𝐖 in span{𝜙(𝐱^{1}), …, 𝜙(𝐱^{𝑛})}

• A simplified algo that uses AM-RR as a black box

• Receive the data and create new features out of the “empirical kernel map”

• Create 𝐳^{𝑖} = (𝐾(𝐱^{𝑖}, 𝐱^{1}), 𝐾(𝐱^{𝑖}, 𝐱^{2}), …, 𝐾(𝐱^{𝑖}, 𝐱^{𝑛})) ∈ ℝ^{𝑛}

• Perform AM-RR with covariates 𝐳^{𝑖} and responses 𝑦^{𝑖} (see the sketch below)

• Making this bullet-proof needs more work
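A sketch of this black-box reduction (the RBF kernel is purely illustrative; `am_rr` is the routine from earlier, and the unit-norm assumption of the theory is ignored here):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2) for columns of A, B."""
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-gamma * sq)

Z = rbf_kernel(X, X)             # row i is the empirical kernel map z^i
W_hat, S_hat = am_rr(Z.T, y, k)  # run AM-RR with covariates z^i
```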

## Please consider waking up

**Open problems ahead!**

## Non-toy Problems for relaxed introspection

• Presence of dense noise: 𝐲 = 𝑋^{⊤}𝐰^{∗} + 𝛜 + 𝐛

• Corruptions are still sparse: ‖𝐛‖_{0} ≤ 𝑘

• Dense noise is Gaussian: 𝛜 ∼ 𝒩(𝟎, 𝐼)

• Can we ensure consistent recovery, i.e. lim_{𝑛→∞} ‖𝐰 − 𝐰^{∗}‖_{2} = 0?

• Adversary Model

• Fully adaptive: 𝐛 chosen with knowledge of 𝑋, 𝐰^{∗}, 𝛜

• Fully oblivious: 𝐛 chosen without knowledge of 𝑋, 𝐰^{∗}, 𝛜

• Partially oblivious: 𝐛 chosen with knowledge of 𝑋, 𝐰^{∗}, or just 𝑋

• Breakdown point

• Can we tolerate up to 𝑘 = 𝑛/2 − 1 corruptions? 𝑘 = 𝑛 − 𝑑 log 𝑑 corruptions?

• What if the adversary is fully oblivious?

## Non-toy Problems for relaxed introspection

• Non-linear/structured Models

• What if the linear model 𝑦 ≈ ⟨𝐰, 𝐱⟩ is not appropriate?

• Rob. Reg. with sparse models? kernels? deep-nets?

• Loss Function

• The least squares loss (𝑦 − ⟨𝐰, 𝐱⟩)^{2} may not suit all applications

• Robust regression with other loss functions? We already saw an example

• Speed

• I am solving log(1/𝜖) regression problems to solve one corrupted regression problem

• Can I invoke the least squares solver less frequently?

• Feature corruption

• What if the features 𝐱^{𝑖} are corrupted (too)?

• Can pass the “blame” for corruption onto 𝑦^{𝑖}, but this does not always work

• Distributed corruption: each 𝐱^{𝑖} has only a few of its 𝑑 coordinates corrupted

## We do have some answers

• We can ensure consistent recovery of 𝐰^{∗} in the presence of dense Gaussian noise and sparse corruptions if the adversary is fully oblivious (Bhatia et al, 2017)

• We can handle sparse linear models/kernels for a fully adaptive adversary (but no dense noise) (Bhatia et al, 2015)

• Feature corruptions can be handled (Chen et al, 2013)

• AM-RR can be sped up quite a bit

• Unless the active set 𝑆^{𝑡} is stable, take gradient steps rather than solving least squares exactly (see the sketch below)

• May replace these with SGD/conjugate-gradient steps

• In practice, very few least squares calls

• Extremely rapid execution in practice

• AM-RR can be extended to other losses
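A sketch of that speed-up (the step size and inner-iteration count are hypothetical choices, not values from the talk):

```python
import numpy as np

def am_rr_fast(X, y, k, T=100, inner=3, eta=0.5):
    """AM-RR with a few gradient steps per round instead of an exact solve."""
    d, n = X.shape
    S = np.arange(n - k)
    w = np.zeros(d)
    for _ in range(T):
        Xs = X[:, S]
        for _ in range(inner):                      # cheap inexact w-step
            w -= eta * Xs @ (Xs.T @ w - y[S]) / len(S)
        S = np.argsort(np.abs(y - X.T @ w))[: n - k]
    return w, S
```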

## We do have some answers

• A “softer” approach can also be shown to work

• Instead of throwing away points, down-weight them

• Points with small residual get up-weighted

• Perform weighted least-squares estimation (IRLS) – see the sketch below

• Our breakdown points are quite bad

• For simple linear models with no feature corruption, the best result is 𝑘 ≈ 𝑛/100

• For feature corruption it is even worse: 𝑘 ≈ 𝑛/𝑑

• Biggest missing piece: answering several questions together, e.g. consistent recovery with kernel models in the presence of an adaptive adversary and feature corruption?
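A minimal IRLS-style sketch of that softer approach (the particular weighting scheme is an illustrative assumption, not the exact one from the literature):

```python
import numpy as np

def soft_robust_reg(X, y, T=50, eps=1e-3):
    """Weighted least squares with weights inversely tied to residuals."""
    d, n = X.shape
    s = np.ones(n)                            # per-point weights
    for _ in range(T):
        W = X * s                             # scale column i by weight s_i
        w = np.linalg.solve(W @ X.T, W @ y)   # weighted normal equations
        r = np.abs(y - X.T @ w)
        s = 1.0 / np.maximum(r, eps)          # small residual -> large weight
    return w
```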

## That’s All!


## Shameless Ad Alert!

## Non-convex Optimization for Machine Learning

• A new monograph!

• Accessible but comprehensive

• Foundations: PGD/SGD, Alt-Min, EM

• Apps: sparse recovery, matrix completion, robust regression

• 130 references, 50 exercises, 20 figures

• Official publication available from now publishers: https://tinyurl.com/ncom-book

• For the benefit of students, an arXiv version: https://tinyurl.com/ncom-arxiv (grateful to now publishers for this!)

• Don’t Relax!

**Non-convex Optimization for Machine Learning**
Prateek Jain and Purushottam Kar
Foundations and Trends® in Machine Learning 10:3-4

## The Data Sciences Gang @ CSE, IITK

Our Strengths

• Machine Learning

• Vision, Image Processing

• Online, Streaming Algorithms

• Databases, Data Mining

• Cyber-physical Systems

## Gratitude

• Prateek Jain, MSR

• Kush Bhatia, UC Berkeley

• Govind Gopakumar, IIT Kanpur

• Sayash Kapoor, IIT Kanpur

• Kshitij Patel, IIT Kanpur