
Conditional Models for Non-smooth Ranking Loss Functions

Avinava Dubey1 Jinesh Machchhar2 Chiranjib Bhattacharyya3 Soumen Chakrabarti4

1IBM Research India

2IIT Bombay

3IISc Bangalore

4IIT Bombay



I Given a set of documents and a query, order the documents so that those relevant to the query are placed above those that are irrelevant

I Not the same as binary classification

I User attention drops off with rank

I Minimize cognitive burden until the information need is satisfied

I Has led to many models of ranking loss functions


Overview of learning to rank

I Training data

I Query set Q

I Document set Dq associated with each query q ∈ Q

I Each document i ∈ Dq is assigned a relevance zqi

I For binary relevance, Dq is partitioned into relevant/good documents D+q ⊆ Dq and irrelevant/bad documents D−q = Dq \ D+q

I Feature vector xqi ∈ Rd is constructed from q and document i

I Training fits a model w ∈ Rd

I At test time assign score w⊤xti to each feature vector xti ∈ Dt for query t

I Sort by decreasing score, take top 10, say
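The test-time procedure above can be sketched in a few lines of Python (the model, document ids, and feature vectors below are toy illustrations, not from the paper):

```python
def score(w, x):
    """Inner product w'x: the model's score for one document."""
    return sum(wi * xi for wi, xi in zip(w, x))

def rank_documents(w, docs, k=10):
    """Score each (doc_id, feature_vector) pair and return the ids of
    the top-k documents by decreasing score."""
    scored = sorted(docs, key=lambda d: -score(w, d[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Toy model and documents (d = 2 features per query-document pair).
w = [1.0, -0.5]
docs = [("d1", [0.2, 0.9]), ("d2", [0.8, 0.1]), ("d3", [0.5, 0.5])]
print(rank_documents(w, docs, k=2))  # d2 and d3 score highest
```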


Structured learning interpretation

I Instead of scoring individual documents . . .

I . . . think of assigning Dq a permutation or partial order as a classification label

I From a huge label set Y

I In case of permutations, Yq = {bijections Dq → Dq}, i.e., all permutations of Dq

I In case of good/bad partial orders,

y ∈ Yq = {−1,+1}^(n+q n−q), where n+q = |D+q| and n−q = |D−q|

I ygb = +1 (−1) if document g is ranked above (below) document b

I What’s the benefit of thinking this way?
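As a concrete sketch of the good/bad partial-order labels (the document ids and helper names below are ours, for illustration), y can be stored as a dict over (g, b) pairs:

```python
def ideal_label(good_ids, bad_ids):
    """The ideal y_q: every good g ranked above every bad b, y_gb = +1."""
    return {(g, b): +1 for g in good_ids for b in bad_ids}

def label_from_scores(scores, good_ids, bad_ids):
    """Pairwise label induced by a scoring: y_gb = +1 iff g outscores b."""
    return {(g, b): (+1 if scores[g] > scores[b] else -1)
            for g in good_ids for b in bad_ids}

yq = ideal_label(["g1", "g2"], ["b1"])  # all pairs correctly ordered
y = label_from_scores({"g1": 0.9, "g2": 0.1, "b1": 0.5}, ["g1", "g2"], ["b1"])
# y ranks g1 above b1 but g2 below b1, so y differs from yq
```

Note that |Yq| = 2^(n+q n−q) grows exponentially with the number of pairs, which is why Y is a "huge label set".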


List-oriented loss functions

I yq is the best y ∈ Y for query q

I All good docs ranked above all bad docs

I Loss function ∆(y,yq) ∈ R≥0 tells us how bad y is wrt ideal yq

I ∆ can encode what item- or pair-decomposable losses cannot; decomposable losses are limited to, e.g.:

I Number of misclassified documents (0/1 additive error)

I Number of pair inversions (not rank-sensitive)

I Can be exploited to target w to

I Mean average precision (MAP)

I Normalized discounted cumulative gain (NDCG)

I Mean reciprocal rank (MRR)

I . . . and traditional decomposable losses like area under curve (AUC)
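These metrics are standard; a minimal sketch for binary relevance (function names are ours; `rels` is the 0/1 relevance down the ranked list, top first):

```python
import math

def average_precision(rels):
    """AP: mean of precision@i over the ranks i holding relevant docs."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / max(hits, 1)

def ndcg(rels, k=10):
    """NDCG@k with binary gains and log2(rank + 1) discount."""
    def dcg(rs):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rs[:k], start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def reciprocal_rank(rels):
    """RR: inverse rank of the first relevant document."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0
```

None of these decompose over single documents or pairs alone: each depends on the ranks themselves, which is exactly what a list-oriented ∆ can capture.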


Feature map φ(x, y)

I y-cognizant aggregation of document features

I E.g., add vectors from rank 1 through 10 and subtract rest

I More commonly used:

φpo(xq,y) = (1/(n+q n−q)) Σg,b ygb (xg − xb)

I At test time, goal is to find arg maxy w⊤φ(xq,y)

I Can be intractable, but sometimes not
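A sketch of φpo and of the test-time arg max (toy dicts mapping doc id to feature vector; names ours). For φpo the arg max is tractable because w⊤φpo decomposes over pairs, so each ygb can be chosen independently:

```python
def phi_po(good, bad, y):
    """phi_po(xq, y) = (1/(n+ n-)) sum_{g,b} y_gb (x_g - x_b)."""
    d = len(next(iter(good.values())))
    n = len(good) * len(bad)
    out = [0.0] * d
    for g, xg in good.items():
        for b, xb in bad.items():
            for j in range(d):
                out[j] += y[(g, b)] * (xg[j] - xb[j]) / n
    return out

def argmax_y(w, good, bad):
    """arg max_y w'phi_po(xq, y): pick y_gb = sign(w'x_g - w'x_b) per pair."""
    s = lambda x: sum(wj * xj for wj, xj in zip(w, x))
    return {(g, b): (1 if s(xg) >= s(xb) else -1)
            for g, xg in good.items() for b, xb in bad.items()}

good = {"g": [1.0, 0.0]}
bad = {"b": [0.0, 1.0]}
print(phi_po(good, bad, {("g", "b"): +1}))   # [1.0, -1.0]
print(argmax_y([1.0, 0.0], good, bad))       # {('g', 'b'): 1}
```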



I Look for w that minimizes Σq ∆(yq, arg maxy w⊤φ(xq,y))

I Objective not continuous, differentiable, or convex

I Therefore, use a convex upper bound, the structured hinge loss:

minw (λ/2)‖w‖² + Σq maxy [ ∆(yq,y) − w⊤(φ(xq,yq) − φ(xq,y)) ]+

I The bound can be loose

I Only support vectors decide w

I May not always yield a good model
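For intuition, the structured hinge for one query, maxy of ∆(yq,y) − w⊤(φ(xq,yq) − φ(xq,y)) clipped at zero, can be brute-forced on a tiny query. This enumeration is exponential and purely illustrative; ∆ here is taken as 1 − AUC and φ as φpo:

```python
import itertools

def hinge_loss_q(w, good, bad):
    """max_y [ delta(yq,y) - w'(phi(xq,yq) - phi(xq,y)) ]_+ by
    enumerating all 2^(n+ n-) pairwise labels y."""
    pairs = [(g, b) for g in sorted(good) for b in sorted(bad)]
    n = len(pairs)
    def wphi(y):  # w'phi_po(xq, y)
        return sum(ygb * sum(wj * (gj - bj)
                             for wj, gj, bj in zip(w, good[g], bad[b])) / n
                   for ygb, (g, b) in zip(y, pairs))
    wphi_ideal = wphi([+1] * n)       # yq: every pair correctly ordered
    best = 0.0                        # the [.]_+ clip at zero
    for y in itertools.product([-1, 1], repeat=n):
        delta = sum(v == -1 for v in y) / n        # 1 - AUC
        best = max(best, delta - (wphi_ideal - wphi(y)))
    return best
```

With w = 0 the margin term vanishes and the hinge equals the worst ∆; a w that separates good from bad by margin at least ∆ drives it to 0, which is why minimizing the hinge upper-bounds the ranking loss.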


Our contributions

I Parametric conditional probabilistic model Pr(y|x;w) ∝ exp(w⊤φ(x,y))

I Intuitive minimization of expected ranking loss (unfortunately non-convex) as an alternative to hinge loss

I For specific φ and ∆, exact, efficient, closed-form expressions for above optimization

I Convex bound on expected loss objective

(unfortunately sums over exponentially many ys)

I Monte-Carlo recipe to sample few ys and still optimize well

I Favorable experimental outcome on LETOR data


Minimizing aggregated expected loss

I Define

Pr(y|xq;w) = exp(w⊤φ(xq,y)) / Zq, where Zq = Σy′ exp(w⊤φ(xq,y′))

I Minimize aggregated expected loss

Σq Σy Pr(y|xq;w) ∆(yq,y)

or a log of a monotone function of the expected loss

Σq log ( Σy Pr(y|xq;w) ∆(yq,y) )
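The expected-loss objective can be evaluated by brute force on a tiny query, which makes the definition concrete (helper names ours; ∆ here is the fraction of inverted pairs, i.e. 1 − AUC, and φ is φpo):

```python
import itertools, math

def one_minus_auc(y):
    """delta(yq, y): fraction of good-below-bad pairs in y."""
    return sum(v == -1 for v in y.values()) / len(y)

def expected_loss(w, good, bad, delta):
    """E_{y ~ Pr(y|xq;w)}[delta(y)] with Pr(y) = exp(w'phi_po(xq,y)) / Zq,
    summing over all 2^(n+ n-) pairwise labels (tiny queries only)."""
    pairs = [(g, b) for g in sorted(good) for b in sorted(bad)]
    n = len(pairs)
    def wphi(y):
        return sum(ygb * sum(wj * (gj - bj)
                             for wj, gj, bj in zip(w, good[g], bad[b])) / n
                   for ygb, (g, b) in zip(y, pairs))
    ys = list(itertools.product([-1, 1], repeat=n))
    weights = [math.exp(wphi(y)) for y in ys]
    Z = sum(weights)
    return sum(wt * delta(dict(zip(pairs, y)))
               for wt, y in zip(weights, ys)) / Z

good, bad = {"g": [1.0]}, {"b": [0.0]}
print(expected_loss([0.0], good, bad, one_minus_auc))  # 0.5: uniform Pr(y)
```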

From loss to gain

I Objective contains a sum over all y ∈ Y

I Most ys are terrible rankings with ∆ → 1

I Expected loss may drop very slightly on optimization

I To better condition the problem, use gain G rather than loss ∆, e.g., NDCG instead of ∆NDCG

ExpGain: maxw Σq log ( Σy Pr(y|xq;w) G(yq,y) )

I Neither loss nor gain optimization is convex

I Beware the sum over y


Polynomial form for AUC and φ


I Notation:

Sqi = w⊤xqi,  hqgb(ygb) = exp( ygb (Sqg − Sqb) / (n+q n−q) )

I With gain G being AUC, the objective can be written as

Σq log ( (1/(Zq n+q n−q)) Σy #1y Πg,b hqgb(ygb) ), where #1y = Σg,b ⟦ygb = 1⟧


Polynomial form for AUC and φ



I Because of the Σy · · · we would still take Σq (n+q + n−q)! or Σq 2^(n+q n−q) time, impractical

I Handy identities (see paper) replace the Σy · · · with an expression that can be computed in Σq (n+q + n−q)² time

I Likewise with gradient expression


Toward convexity

I Define δφq(y) = φ(xq,yq) − φ(xq,y) and consider the distribution

Pr(y|xq,yq;w) = exp(−w⊤δφq(y)) / Σy′ exp(−w⊤δφq(y′)) (1)

I Maximum likelihood estimate (MLE):

L1(w) = Σq L1q(w) = Σq log Zq(w) (2)


Toward convexity (2)

I Expected loss using the new distribution:

L2(w) = Σq L2q(w) = Σq EY∼(1)[∆q(Y)] (3)

I Finally consider another distribution

Pr(y|xq,yq) = exp(−w⊤δφq(y) + ∆q(y)) / Σy′ exp(−w⊤δφq(y′) + ∆q(y′)) (4)


Toward convexity (3)

I . . . and the corresponding ConvexLoss:

L3(w) = Σq log Σy′ exp(−w⊤δφq(y′) + ∆q(y′)) (5)

I Can find w to minimize ConvexLoss

I Problem: the Σy · · · is back

I Anyway we want to go beyond AUC and φpo
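To see why (5) is convex, note each summand inside the log is the exponential of a function affine in w, and a log-sum-exp of affine functions is convex. A brute-force sketch for one tiny query (∆ taken as 1 − AUC, φ as φpo; helper names ours):

```python
import itertools, math

def convex_loss_q(w, good, bad):
    """L3_q(w) = log sum_y exp(-w'dphi_q(y) + delta_q(y)), computed by
    enumerating all pairwise labels y, with a max-shift for stability."""
    pairs = [(g, b) for g in sorted(good) for b in sorted(bad)]
    n = len(pairs)
    def wphi(y):  # w'phi_po(xq, y)
        return sum(ygb * sum(wj * (gj - bj)
                             for wj, gj, bj in zip(w, good[g], bad[b])) / n
                   for ygb, (g, b) in zip(y, pairs))
    wphi_ideal = wphi([+1] * n)   # phi at yq, so dphi(y) = wphi_ideal - wphi(y)
    terms = []
    for y in itertools.product([-1, 1], repeat=n):
        delta = sum(v == -1 for v in y) / n        # 1 - AUC
        terms.append(-(wphi_ideal - wphi(y)) + delta)
    m = max(terms)                # log-sum-exp trick
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Since each L3q is convex in w, so is the sum over queries.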


Markov Chain Monte Carlo sampling

1: while not enough samples do

2: (re)start a random walk at a well-chosen state y0

3: for some number of steps t = 1,2, . . . do

4: transition from yt−1 to yt

5: make an accept/reject decision on yt

6: collect some subset of y0,y1,y2, . . . as samples

I Choose restart states to span a variety of ∆s

I In each walk, make local changes in y so as to stay near the restart ∆

I New swap method to make transitions (see paper)
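A simplified Metropolis-style version of the loop above (single-coordinate flips stand in for the paper's swap transitions; `energy` is the negative log of the unnormalized target, e.g. −w⊤φ(xq,y), and all names are ours):

```python
import math, random

def mcmc_samples(energy, n_pairs, n_samples, restart_every=100, seed=0):
    """Sample pairwise labels y in {-1,+1}^n_pairs, restarting the walk
    periodically (step 2) and accepting/rejecting local moves (steps 4-5)."""
    rng = random.Random(seed)
    y = [rng.choice([-1, 1]) for _ in range(n_pairs)]
    samples = []
    for t in range(n_samples):
        if t % restart_every == 0:                 # (re)start the walk
            y = [rng.choice([-1, 1]) for _ in range(n_pairs)]
        cand = list(y)
        cand[rng.randrange(n_pairs)] *= -1         # local flip move
        # Metropolis rule: accept with prob min(1, exp(E(y) - E(cand)))
        if math.log(rng.random() + 1e-300) < energy(y) - energy(cand):
            y = cand
        samples.append(tuple(y))
    return samples

# Target strongly favoring the all-(+1) label (the ideal ordering).
draws = mcmc_samples(lambda y: -5.0 * sum(y), n_pairs=3, n_samples=500)
```

In the paper's setting the restart states are chosen deliberately to span a range of ∆ values rather than uniformly at random, which is the skew studied in the experiments.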


Experiments: Accuracy

I All data sets from LETOR

I Baselines RankSVM, SVMmap, RankBoost

I Example accuracy comparison on HP2004:


SVMmap           0.665 0.835 0.845 0.746
Rankboost        0.653 0.821 0.845 0.739
RankSVM          0.695 0.852 0.877 0.764
MLE              0.665 0.854 0.872 0.755
ExpGain_AUC      0.695 0.862 0.885 0.774
ExpGain_MAP      0.680 0.875 0.895 0.775
ExpGain_NDCG     0.680 0.872 0.890 0.772
L_3 AUC          0.695 0.880 0.898 0.771
L_3 MAP          0.709 0.883 0.900 0.786
L_3 NDCG         0.681 0.885 0.900 0.773
ConvexLoss_AUC   0.682 0.890 0.905 0.778
ConvexLoss_MAP   0.724 0.874 0.885 0.791
ConvexLoss_NDCG  0.709 0.892 0.904 0.793


Accuracy vs. MCMC samples

[Plot: accuracy (0.58–0.88) vs. number of samples (1000–5000)]

I More samples =⇒ better accuracy

I Saturates long before Y is approached


Effect of restart skew

[Plot: accuracy (0.60–0.90) for restarts skewed toward the worst possible y, uniform, and skewed toward the optimal ordering yq]




I Two restart points, ∆ = 0,∆max ≈ 1

I Skewing toward best y is better


Effect of restart skew (2)

I Potentially surprising, given that learning needs both good and bad examples

I Sample y w.p. ∝ exp(k∆)

I k ≪ 0 means favor ∆ → 0, which is better:

k        5     0     −5    −10   −20
MAP      .033  .089  .706  .774  .827
NDCG@10  .041  .124  .815  .905  .918

I Why?


Effect of restart skew (3)

[Plot: distribution of sampled ∆ values (0.1–0.9) for restart seeds Delta=0, Delta=0.5, Delta=1]

I Three restart seeds


Effect of restart skew (4)

I Perturbing a y with ∆ = 0 can push it far, say to ∆ ≈ 0.7, quite easily

I If y has ∆ ≈ 1, most perturbations will keep ∆ near 1

I Thus the algorithm will not see enough good ys

I Which is why skewing restarts toward ∆ ≈ 0 is good



Summary

I Conditional probabilistic model Pr(y|x;w) ∝ exp(w⊤φ(x,y)) for ranking

I Challenges: the sum over exponentially many y ∈ Y, nonconvexity

I Efficient closed form for specific but widely-used φ and ∆

I Convex upper bound to expected loss in all cases

I Monte-Carlo sampling method to avoid the sum over y ∈ Y

I Competitive accuracy in experiments

I In particular, generally better than

I Hinge loss with max-margin learning techniques

I Boosting with weak learners

I Training by sampling the space of ys this way may have broader application



