• No results found

Overview of learning to rank


Academic year: 2022

Share "Overview of learning to rank"


Loading.... (view fulltext now)

Full text


Conditional Models for Non-smooth Ranking Loss Functions

Avinava Dubey1 Jinesh Machchhar2 Chiranjib Bhattacharyya3 Soumen Chakrabarti4

1IBM Research India

2IIT Bombay

3IISc Bangalore

4IIT Bombay



I Given a set of documents and a query, order the documents in such a way that those documents which are relevant to the query is placed above those that are irrelevant

I Not the same as binary classification

I User attention drops off with rank

I Minimize cognitive burden till information need satisfaction

I Has led to many models of ranking loss functions


Overview of learning to rank

I Training data

I Query set Q

I Document set Dq associated with each query qQ

I Each documenti Dq is assigned relevance zqi

I For binary relevance Dq is partitioned into relevant/good documentsDq+ Dq and irrelevant/bad documents Dq=Dq\Dq+

I Feature vectorxqi Rd is constructed from q and doc i

I Training fits a model w ∈ Rd

I At test time assign score w>xti to each feature vector xti ∈ Dt for query t

I Sort by decreasing score, take top 10, say


Structured learning interpretation

I Instead of scoring individual documents . . .

I . . . think of assigning Dq a permutation or partial order as a classification label

I From a huge label set Y

I In case of permutations, Y = (Dq1:1 Dq)

I In case of good/bad partial orders,

y ∈ Yq = {−1,+1}n+qnq, where n+q = |Dq+| and nq = |Dq|

I ygb = +1 (−1) if document g is ranked above (below) document b

I What’s the benefit of thinking this way?


List-oriented loss functions

I yq is the best y ∈ Y for query q

I All good docs ranked above all bad docs

I Loss function ∆(y,yq) ∈ R≥0 tells us how bad y is wrt ideal yq

I ∆ can encode what item- or pair-decomposable losses cannot, e.g.:

I Number of misclassified documents (0/1 additive error)

I Number of pair inversions (not rank-sensitive)

I Can be exploited to target w to

I Mean average precision (MAP)

I Normalized discounted cumulative gain (NDCG)

I Mean reciprocal rank (MRR)

I . . . and traditional decomposable losses like area under curve (AUC)


Feature map φ(x , y )

I y-cognizant aggregation of document features

I E.g., add vectors from rank 1 through 10 and subtract rest

I More commonly used:

φpo(xq,y) = 1 nq+nq



ygb(xg −xb)

I At test time, goal is to find arg maxy w>φ(xq,y)

I Can be intractable, but sometimes not



I Look for w that minimizes P

q∆ yq,arg maxyw>φ(xq,y)

I Objective not continuous, differentiable, or convex

I Therefore, use convex upper bound hinge loss minw






I The bound can be loose

I Only support vectors decide w

I May not always yield a good model


Our contributions

I Parametric conditional probabilistic model Pr(y|x;w) ∝ exp(w>φ(x,y))

I Intuitive minimization of expected ranking loss (unfortunately non-convex) as an alternative to hinge loss

I For specific φ and ∆, exact, efficient, closed-form expressions for above optimization

I Convex bound on expected loss objective

(unfortunately sums over exponentially many ys)

I Monte-Carlo recipe to sample few ys and still optimize well

I Favorable experimental outcome on LETOR data


Minimizing aggregated expected loss

I Define

Pr(y|xq;w) = exp(w>φ(xq,y)) Zq

where Zq = P

y0 exp(w>φ(x,y0))

I Minimize aggregated expected loss X




Pr(y|xq;w)∆(yq,y) or log of monotone function of expected loss



log X





From loss to gain

I Objective has P

y· · ·

I Most ys are terrible rankings with ∆ →1

I Expected loss may drop very slightly on optimization

I To better condition the problem, use gain G rather than loss ∆, e.g., NDCG instead of ∆NDCG ExpGain: max




log X



! .

I Neither loss nor gain optimization is convex

I Beware the sum over y


Polynomial form for AUC and φ


I Notation:

Sqi = w>xqi hqgb(ygb) = exp

ygb(Sqg −Sqb) n+qnq

I With gain G being AUC, objective can be written as



 log

 X



 Y


hgbq (ygb)


 where #1y = P

g,b~ygb = 1


Polynomial form for AUC and φ



I Because of P

y · · · we would still take P

q(nq+nq)! or P

q2n+qnq time, impractical

I Handy identities (see paper) to replace the P

y · · · with an expression that can be computed in P

q(nq+nq)2 time

I Likewise with gradient expression


Toward convexity

I Define δφq(y) = φ(xq,yq)−φ(xq,y) and consider the distribution

Pr(y|xq,yq;w) = exp(−w>δφq(y)) P

y0exp(−w>δφq(y0)) (1)

I Maximum Likelihood Estimate MLE L1(w) =X


L1q(w) =X


logZq(w) (2)


Toward convexity (2)

I Expected loss using new distribution L2(w) = X


L2q(w) = X


EY∼(1)(∆q(Y)) (3)

I Finally consider another distribution

Pr(y|xq,yq) = exp(−w>δφq(y) + ∆q(y)) X


exp(−w>δφq(y0) + ∆q(y0)) (4)


Toward convexity (3)

I . . . and corresponding ConvexLoss:

L(w) =X




exp(−w>δφq(y0) + ∆q(y0)) (5)

I Can find w to minimize ConvexLoss

I Problem: P

y · · · is back

I Anyway we want to go beyond AUC and φpo


Markov Chain Monte Carlo sampling

1: while not enough samples do

2: (re)start a random walk at a well-chosen state y0

3: for some number of steps t = 1,2, . . . do

4: transition from yt−1 to yt

5: make an accept/reject decision on yt

6: collect some subset of y0,y1,y2, . . . as samples

I Choose restart states to span a variety of ∆s

I In each walk, make local changes in y so as to stay near to the restart ∆

I New swap method to make transitions (see paper)


Experiments: Accuracy

I All data sets from LETOR

I Baselines RankSVM, SVMmap, RankBoost

I Example accuracy comparison on HP2004:


SVMmap 0.665 0.835 0.845 0.746

Rankboost 0.653 0.821 0.845 0.739

RankSVM 0.695 0.852 0.877 0.764

MLE 0.665 0.854 0.872 0.755

ExpGain_AUC 0.695 0.862 0.885 0.774

ExpGain_MAP 0.680 0.875 0.895 0.775

ExpGain_NDCG 0.680 0.872 0.890 0.772

L_3 AUC 0.695 0.880 0.898 0.771

L_3 MAP 0.709 0.883 0.900 0.786

L_3 NDCG 0.681 0.885 0.900 0.773

ConvexLoss_AUC 0.682 0.890 0.905 0.778 ConvexLoss_MAP 0.724 0.874 0.885 0.791 ConvexLoss_NDCG 0.709 0.892 0.904 0.793


Accuracy vs. MCMC samples

0.580 0.680 0.780 0.880

1000 3000 5000


Number of Samples


I More samples =⇒ better accuracy

I Saturates long before Y is approached


Effect of restart skew

0.600 0.700 0.800 0.900

Skewed towards worst possible y

Uniform Skewed towards optimal ordereing y_q




I Two restart points, ∆ = 0,∆max ≈ 1

I Skewing toward best y is better


Effect of restart skew (2)

I Potentially surprising, given learning needs both good and bad examples to learn

I Sample y w.p. ∝ exp(k∆)

I k 0 means favor ∆ →0, better

k → 5 0 −5 −10 −20

MAP .033 .089 .706 .774 .827 NDCG@10 .041 .124 .815 .905 .918

I Why?


Effect of restart skew (3)

0 20 40 60 80 100

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Sample Delta


Delta=0 Delta=0.5 Delta=1

I Three restart seeds


Effect of restart skew (4)

I Perturbing y with ∆ = 0 can push it far, say,

∆≈ 0.7 quite easily

I If y has ∆≈ 1, most perturbations will keep ∆ near 1

I There algo will not see enough good ys

I Which is why skewing restarts toward ∆ ≈ 0 is good



I Conditional probabilistic model

Pr(y|x;w) ∝ exp(w>φ(x,y)) for ranking

I Challenges: P

y∈Y · · ·, nonconvexity

I Efficient closed form for specific but widely-used φ and ∆

I Convex upper bound to expected loss in all cases

I Monte-Carlo sampling method to avoid P

y∈Y· · ·

I Competitive accuracy in experiments

I In particular, generally better than

I Hinge loss with max-margin learning techniques

I Boosting with weak learners

I Training by sampling the space of ys this way may have broader application


Related documents

All relevant required documents along with evidences are to be enclosed and properly tagged along with technical bid in Envelope I , and only quoted rates (as per Format

These gains in crop production are unprecedented which is why 5 million small farmers in India in 2008 elected to plant 7.6 million hectares of Bt cotton which

INDEPENDENT MONITORING BOARD | RECOMMENDED ACTION.. Rationale: Repeatedly, in field surveys, from front-line polio workers, and in meeting after meeting, it has become clear that

 The World Wide Web (WWW) is an internet based service, which uses common set of rules known as Protocols, to distribute documents across the Internet in a standard way..  The

In addition a crux objective of the collaborative tagging systems is to discover rich and riveting resources and documents such that the virtuosity dimension of the users are fully

3.6., which is a Smith Predictor based NCS (SPNCS). The plant model is considered in the minor feedback loop with a virtual time delay to compensate for networked induced

Area for uploading Techno- Commercial Unpriced Bid*.. Also, they must ensure that above documents which are to be submitted in a sealed envelope are also submitted at the

3.4.1 Encoding Domain Knowledge using Fuzzy Ontology Structures 84 3.4.2 Creation of Fuzzy Ontology Structure 92 3.5 Handling Inconsistent Ontology Concept Descriptions 98