
## Conditional Models for Non-smooth Ranking Loss Functions

Avinava Dubey (IBM Research India), Jinesh Machchhar (IIT Bombay), Chiranjib Bhattacharyya (IISc Bangalore), Soumen Chakrabarti (IIT Bombay)


## Ranking

- Given a set of documents and a query, order the documents so that those relevant to the query are placed above those that are irrelevant
- Not the same as binary classification
- User attention drops off with rank
- Minimize cognitive burden till the information need is satisfied
- Has led to many models of ranking loss functions


## Overview of learning to rank

- Training data
  - Query set $Q$
  - Document set $D_q$ associated with each query $q \in Q$
  - Each document $i \in D_q$ is assigned a relevance $z_{qi}$
  - For binary relevance, $D_q$ is partitioned into relevant/good documents $D_q^+ \subseteq D_q$ and irrelevant/bad documents $D_q^- = D_q \setminus D_q^+$
  - Feature vector $x_{qi} \in \mathbb{R}^d$ is constructed from $q$ and document $i$
- Training fits a model $w \in \mathbb{R}^d$
- At test time, assign score $w^\top x_{ti}$ to each feature vector $x_{ti} \in D_t$ for query $t$
- Sort by decreasing score; take the top 10, say
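The test-time procedure above is just scoring and sorting; a minimal sketch, where the feature vectors and the trained weight vector are made up purely for illustration:

```python
import numpy as np

# Made-up feature vectors x_ti for four documents of a test query t,
# and a made-up trained model w (both purely illustrative).
X_t = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.5],
                [0.1, 0.1]])
w = np.array([1.0, -0.5])

scores = X_t @ w               # s_ti = w^T x_ti for each document i
ranking = np.argsort(-scores)  # document indices by decreasing score
top_k = ranking[:10]           # take the top 10, say
print(ranking.tolist())
```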


## Structured learning interpretation

- Instead of scoring individual documents . . .
- . . . think of assigning $D_q$ a permutation or partial order as a classification label
- From a huge label set $\mathcal{Y}$
- In case of permutations, $\mathcal{Y}_q$ is the set of all permutations (1:1 orderings) of $D_q$
- In case of good/bad partial orders, $y \in \mathcal{Y}_q = \{-1,+1\}^{n_q^+ n_q^-}$, where $n_q^+ = |D_q^+|$ and $n_q^- = |D_q^-|$
  - $y_{gb} = +1$ ($-1$) if document $g$ is ranked above (below) document $b$
- What's the benefit of thinking this way?


## List-oriented loss functions

- $y_q$ is the best $y \in \mathcal{Y}$ for query $q$
  - All good docs ranked above all bad docs
- Loss function $\Delta(y, y_q) \in \mathbb{R}_{\ge 0}$ tells us how bad $y$ is w.r.t. the ideal $y_q$
- $\Delta$ can encode what item- or pair-decomposable losses cannot; decomposable examples:
  - Number of misclassified documents (0/1 additive error)
  - Number of pair inversions (not rank-sensitive)
- Can be exploited to target $w$ to
  - Mean average precision (MAP)
  - Normalized discounted cumulative gain (NDCG)
  - Mean reciprocal rank (MRR)
  - . . . and traditional decomposable losses like area under curve (AUC)


## Feature map φ(x, y)

- $y$-cognizant aggregation of document features
- E.g., add vectors from rank 1 through 10 and subtract the rest
- More commonly used:

$$\varphi_{po}(x_q, y) = \frac{1}{n_q^+ n_q^-} \sum_{g,b} y_{gb}\,(x_g - x_b)$$

- At test time, the goal is to find $\arg\max_y w^\top \varphi(x_q, y)$
- Can be intractable, but sometimes not
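A minimal sketch of computing $\varphi_{po}$, assuming a toy query with made-up good/bad document features, and representing $y$ as a map from (good, bad) pairs to ±1:

```python
import numpy as np
from itertools import product

# Toy query: two good and two bad documents with made-up 2-d features.
X_good = np.array([[1.0, 0.0], [0.8, 0.2]])
X_bad  = np.array([[0.1, 0.9], [0.3, 0.4]])
n_pos, n_neg = len(X_good), len(X_bad)

def phi_po(y):
    """phi_po(x_q, y) = (1 / (n+ n-)) sum_{g,b} y_gb (x_g - x_b),
    where y maps each (good, bad) pair to +1 or -1."""
    total = np.zeros(X_good.shape[1])
    for g, b in product(range(n_pos), range(n_neg)):
        total += y[(g, b)] * (X_good[g] - X_bad[b])
    return total / (n_pos * n_neg)

# Ideal label y_q: every good document ranked above every bad one.
y_q = {(g, b): +1 for g in range(n_pos) for b in range(n_neg)}
print(phi_po(y_q))
```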


## Training

- Look for $w$ that minimizes

$$\sum_q \Delta\!\Big(y_q,\ \arg\max_y w^\top \varphi(x_q, y)\Big)$$

- Objective not continuous, differentiable, or convex
- Therefore, use the convex upper-bounding hinge loss

$$\min_w \sum_q \max\Big(0,\ \max_y \big[\Delta_q(y) - w^\top \delta\varphi_q(y)\big]\Big),$$

where $\delta\varphi_q(y) = \varphi(x_q, y_q) - \varphi(x_q, y)$

- The bound can be loose
- Only support vectors decide $w$
- May not always yield a good model


## Our contributions

- Parametric conditional probabilistic model $\Pr(y \mid x; w) \propto \exp(w^\top \varphi(x, y))$
- Intuitive minimization of expected ranking loss (unfortunately non-convex) as an alternative to hinge loss
- For specific $\varphi$ and $\Delta$, exact, efficient, closed-form expressions for the above optimization
- Convex bound on the expected-loss objective (unfortunately sums over exponentially many $y$s)
- Monte-Carlo recipe to sample a few $y$s and still optimize well
- Favorable experimental outcome on LETOR data


## Minimizing aggregated expected loss

- Define

$$\Pr(y \mid x_q; w) = \frac{\exp(w^\top \varphi(x_q, y))}{Z_q}, \qquad \text{where } Z_q = \sum_{y'} \exp(w^\top \varphi(x_q, y'))$$

- Minimize the aggregated expected loss

$$\sum_q \sum_y \Pr(y \mid x_q; w)\, \Delta(y_q, y)$$

or the log of a monotone function of the expected loss

$$\sum_q \log\left( \sum_y \Pr(y \mid x_q; w)\, f(\Delta(y_q, y)) \right)$$
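For a small enough label set, the conditional distribution and the expected loss can be computed exactly by enumeration; this sketch uses made-up features and, as an illustrative assumption, the fraction of inverted pairs as $\Delta$:

```python
import numpy as np
from itertools import product

# Tiny query: one good, two bad documents (made-up features).
X_good = np.array([[1.0, 0.0]])
X_bad  = np.array([[0.2, 0.8], [0.4, 0.3]])
pairs = [(0, 0), (0, 1)]

def phi_po(y):
    s = sum(y[i] * (X_good[g] - X_bad[b]) for i, (g, b) in enumerate(pairs))
    return s / len(pairs)

def delta(y):
    """Illustrative Delta: fraction of inverted pairs."""
    return sum(yi == -1 for yi in y) / len(pairs)

def expected_loss(w):
    """sum_y Pr(y | x_q; w) Delta(y_q, y), with Pr(y) prop. to exp(w^T phi)."""
    ys = list(product([-1, +1], repeat=len(pairs)))
    logits = np.array([w @ phi_po(y) for y in ys])
    p = np.exp(logits - logits.max())
    p /= p.sum()                    # softmax over all 2^{n+ n-} labels y
    return float(sum(pi * delta(y) for pi, y in zip(p, ys)))

# A w that ranks good above bad has low expected loss; its negation does not.
print(expected_loss(np.array([1.0, -1.0])))
print(expected_loss(np.array([-1.0, 1.0])))
```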


## From loss to gain

- The objective has $\sum_y \cdots$
- Most $y$s are terrible rankings with $\Delta \to 1$
- Expected loss may drop very slightly on optimization
- To better condition the problem, use gain $G$ rather than loss $\Delta$, e.g., NDCG instead of $\Delta_{\text{NDCG}}$; ExpGain:

$$\max_w \sum_q \log\left( \sum_y \Pr(y \mid x_q; w)\, G_q(y) \right)$$

- Neither loss nor gain optimization is convex
- Beware the sum over $y$


## Polynomial form for AUC and φ_po

- Notation:

$$S_{qi} = w^\top x_{qi}, \qquad h^q_{gb}(y_{gb}) = \exp\!\left(\frac{y_{gb}\,(S_{qg} - S_{qb})}{n_q^+ n_q^-}\right)$$

- With gain $G$ being AUC, the objective can be written as

$$\sum_q \left[ \log\left( \sum_y \#1_y \prod_{g,b} h^q_{gb}(y_{gb}) \right) - \log Z_q \right],$$

where $\#1_y = \sum_{g,b} [\![\, y_{gb} = 1 \,]\!]$


## Polynomial form for AUC and φ_po (2)

- Because of $\sum_y \cdots$ we would still take $\sum_q (n_q^+ + n_q^-)!$ or $\sum_q 2^{n_q^+ n_q^-}$ time, which is impractical
- Handy identities (see paper) replace the $\sum_y \cdots$ with an expression that can be computed in $\sum_q (n_q^+ + n_q^-)^2$ time


## Toward convexity

- Define $\delta\varphi_q(y) = \varphi(x_q, y_q) - \varphi(x_q, y)$ and consider the distribution

$$\Pr(y \mid x_q, y_q; w) = \frac{\exp(-w^\top \delta\varphi_q(y))}{\sum_{y'} \exp(-w^\top \delta\varphi_q(y'))} \tag{1}$$

- Maximum likelihood estimate (MLE):

$$L_1(w) = \sum_q L_{1q}(w) = \sum_q \log Z_q(w) \tag{2}$$


## Toward convexity (2)

- Expected loss using the new distribution:

$$L_2(w) = \sum_q L_{2q}(w) = \sum_q \mathbb{E}_{Y \sim (1)}\big[\Delta_q(Y)\big] \tag{3}$$

- Finally, consider another distribution

$$\Pr(y \mid x_q, y_q) = \frac{\exp(-w^\top \delta\varphi_q(y) + \Delta_q(y))}{\sum_{y'} \exp(-w^\top \delta\varphi_q(y') + \Delta_q(y'))} \tag{4}$$


## Toward convexity (3)

- . . . and the corresponding ConvexLoss:

$$L(w) = \sum_q \log \sum_{y'} \exp(-w^\top \delta\varphi_q(y') + \Delta_q(y')) \tag{5}$$

- Can find $w$ to minimize ConvexLoss
- Problem: $\sum_y \cdots$ is back
- Anyway, we want to go beyond AUC and $\varphi_{po}$
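ConvexLoss is a log-sum-exp of functions affine in $w$, hence convex; the brute-force sketch below (toy features and pair-inversion $\Delta$, both illustrative assumptions) evaluates it for a single query and checks the midpoint inequality numerically:

```python
import numpy as np
from itertools import product

# Tiny query: one good, two bad documents (made-up features).
X_good = np.array([[1.0, 0.0]])
X_bad  = np.array([[0.2, 0.8], [0.4, 0.3]])
pairs = [(0, 0), (0, 1)]
y_q = (+1, +1)

def phi_po(y):
    s = sum(y[i] * (X_good[g] - X_bad[b]) for i, (g, b) in enumerate(pairs))
    return s / len(pairs)

def delta(y):
    """Illustrative Delta: fraction of inverted pairs."""
    return sum(yi == -1 for yi in y) / len(pairs)

def convex_loss(w):
    """Eq. (5) for a single query: log sum_y exp(-w^T dphi(y) + Delta(y))."""
    terms = [float(-w @ (phi_po(y_q) - phi_po(y))) + delta(y)
             for y in product([-1, +1], repeat=len(pairs))]
    return float(np.log(np.sum(np.exp(terms))))

# Log-sum-exp of affine functions of w is convex, so the midpoint
# inequality L((u + v) / 2) <= (L(u) + L(v)) / 2 must hold.
u, v = np.array([1.0, -1.0]), np.array([-2.0, 0.5])
print(convex_loss((u + v) / 2) <= (convex_loss(u) + convex_loss(v)) / 2)
```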


## Markov Chain Monte Carlo sampling

```
1: while not enough samples do
2:   (re)start a random walk at a well-chosen state y0
3:   for some number of steps t = 1, 2, ... do
4:     transition from y_{t-1} to y_t
5:     make an accept/reject decision on y_t
6:   collect some subset of y0, y1, y2, ... as samples
```

- Choose restart states to span a variety of $\Delta$s
- In each walk, make local changes in $y$ so as to stay near the restart $\Delta$
- New swap method to make transitions (see paper)
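The walk above can be sketched as a Metropolis-Hastings sampler, with the paper's swap transition replaced here by a simple single-flip proposal (an illustrative assumption) and the target taken as $\Pr(y \mid x_q; w) \propto \exp(w^\top \varphi_{po}(x_q, y))$ on made-up features:

```python
import math
import random

import numpy as np

# Tiny query: one good, two bad documents (made-up features); the paper's
# swap transition is replaced by a simple single-flip proposal.
X_good = np.array([[1.0, 0.0]])
X_bad  = np.array([[0.2, 0.8], [0.4, 0.3]])
pairs = [(0, 0), (0, 1)]
w = np.array([1.0, -1.0])

def log_target(y):
    """w^T phi_po(x_q, y): unnormalized log-probability of label y."""
    s = sum(y[i] * (X_good[g] - X_bad[b]) for i, (g, b) in enumerate(pairs))
    return float(w @ (s / len(pairs)))

def sample(n_steps, y0, rng):
    y, samples = list(y0), []
    for _ in range(n_steps):
        y_new = list(y)
        y_new[rng.randrange(len(y))] *= -1           # propose a local change
        if math.log(rng.random()) < log_target(y_new) - log_target(y):
            y = y_new                                # accept; else reject
        samples.append(tuple(y))
    return samples

rng = random.Random(0)
samples = sample(5000, (+1, +1), rng)  # restart at the ideal ordering y_q
frac_ideal = samples.count((+1, +1)) / len(samples)
print(frac_ideal)  # under this w, the ideal label should dominate
```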


## Experiments: Accuracy

- All data sets from LETOR
- Baselines: RankSVM, SVMmap, RankBoost
- Example accuracy comparison on HP2004:

| Method | NDCG@1 | NDCG@5 | NDCG@10 | MAP |
|---|---|---|---|---|
| SVMmap | 0.665 | 0.835 | 0.845 | 0.746 |
| RankBoost | 0.653 | 0.821 | 0.845 | 0.739 |
| RankSVM | 0.695 | 0.852 | 0.877 | 0.764 |
| MLE | 0.665 | 0.854 | 0.872 | 0.755 |
| ExpGain_AUC | 0.695 | 0.862 | 0.885 | 0.774 |
| ExpGain_MAP | 0.680 | 0.875 | 0.895 | 0.775 |
| ExpGain_NDCG | 0.680 | 0.872 | 0.890 | 0.772 |
| L_3 AUC | 0.695 | 0.880 | 0.898 | 0.771 |
| L_3 MAP | 0.709 | 0.883 | 0.900 | 0.786 |
| L_3 NDCG | 0.681 | 0.885 | 0.900 | 0.773 |
| ConvexLoss_AUC | 0.682 | 0.890 | 0.905 | 0.778 |
| ConvexLoss_MAP | 0.724 | 0.874 | 0.885 | 0.791 |
| ConvexLoss_NDCG | 0.709 | 0.892 | 0.904 | 0.793 |


## Accuracy vs. MCMC samples

[Plot: accuracy (NDCG@1, NDCG@5, NDCG@10, MAP) vs. number of MCMC samples, 1000–5000]

- More samples $\Rightarrow$ better accuracy
- Saturates long before $|\mathcal{Y}|$ is approached


## Effect of restart skew

[Plot: accuracy (NDCG@1, NDCG@5, NDCG@10, MAP) vs. restart skewness, ranging from skewed toward the worst possible $y$, through uniform, to skewed toward the optimal ordering $y_q$]

- Two restart points, $\Delta = 0$ and $\Delta_{\max} \approx 1$
- Skewing toward the best $y$ is better


## Effect of restart skew (2)

- Potentially surprising, given that learning needs both good and bad examples
- Sample $y$ with probability $\propto \exp(k\Delta)$
- $k \ll 0$ means favor $\Delta \to 0$, which is better:

| $k \to$ | 5 | 0 | −5 | −10 | −20 |
|---|---|---|---|---|---|
| MAP | .033 | .089 | .706 | .774 | .827 |
| NDCG@10 | .041 | .124 | .815 | .905 | .918 |

- Why?


## Effect of restart skew (3)

[Histogram: percentage of samples at each sample $\Delta$ (0.1–0.9), for three restart seeds at $\Delta = 0$, $\Delta = 0.5$, and $\Delta = 1$]

- Three restart seeds


## Effect of restart skew (4)

- Perturbing a $y$ with $\Delta = 0$ can push it far, say to $\Delta \approx 0.7$, quite easily
- If $y$ has $\Delta \approx 1$, most perturbations will keep $\Delta$ near 1
- So the algorithm will not see enough good $y$s
- Which is why skewing restarts toward $\Delta \approx 0$ is good


## Conclusion

- Conditional probabilistic model $\Pr(y \mid x; w) \propto \exp(w^\top \varphi(x, y))$ for ranking
- Challenges: $\sum_{y \in \mathcal{Y}} \cdots$, nonconvexity
- Efficient closed form for specific but widely-used $\varphi$ and $\Delta$
- Convex upper bound to expected loss in all cases
- Monte-Carlo sampling method to avoid $\sum_{y \in \mathcal{Y}} \cdots$
- Competitive accuracy in experiments
- In particular, generally better than
  - Hinge loss with max-margin learning techniques
  - Boosting with weak learners
- Training by sampling the space of $y$s this way may have broader application
