
## Conditional Models for Non-smooth Ranking Loss Functions

Avinava Dubey (IBM Research India), Jinesh Machchhar (IIT Bombay), Chiranjib Bhattacharyya (IISc Bangalore), Soumen Chakrabarti (IIT Bombay)


## Ranking

- Given a set of documents and a query, order the documents so that those relevant to the query are placed above those that are irrelevant
- Not the same as binary classification
- User attention drops off with rank
- Minimize cognitive burden till the information need is satisfied
- Has led to many models of ranking loss functions


## Overview of learning to rank

- Training data
  - Query set $Q$
  - Document set $D_q$ associated with each query $q \in Q$
  - Each document $i \in D_q$ is assigned a relevance $z_{qi}$
  - For binary relevance, $D_q$ is partitioned into relevant/good documents $D_q^+ \subseteq D_q$ and irrelevant/bad documents $D_q^- = D_q \setminus D_q^+$
  - Feature vector $x_{qi} \in \mathbb{R}^d$ is constructed from $q$ and document $i$
- Training fits a model $w \in \mathbb{R}^d$
- At test time, assign score $w^\top x_{ti}$ to each feature vector $x_{ti} \in D_t$ for query $t$
- Sort by decreasing score; take the top 10, say
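The test-time procedure above is just scoring and sorting; a minimal sketch, where the feature vectors and the trained weight vector are made up purely for illustration:

```python
import numpy as np

# Made-up feature vectors x_ti for four documents of a test query t,
# and a made-up trained model w (both purely illustrative).
X_t = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.5],
                [0.1, 0.1]])
w = np.array([1.0, -0.5])

scores = X_t @ w               # s_ti = w^T x_ti for each document i
ranking = np.argsort(-scores)  # document indices by decreasing score
top_k = ranking[:10]           # take the top 10, say
print(ranking.tolist())
```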


## Structured learning interpretation

- Instead of scoring individual documents . . .
- . . . think of assigning $D_q$ a permutation or partial order as a classification label
- From a huge label set $\mathcal{Y}$
- In case of permutations, $\mathcal{Y}_q$ is the set of all permutations (1:1 orderings) of $D_q$
- In case of good/bad partial orders, $y \in \mathcal{Y}_q = \{-1,+1\}^{n_q^+ n_q^-}$, where $n_q^+ = |D_q^+|$ and $n_q^- = |D_q^-|$
  - $y_{gb} = +1$ ($-1$) if document $g$ is ranked above (below) document $b$
- What's the benefit of thinking this way?


## List-oriented loss functions

- $y_q$ is the best $y \in \mathcal{Y}$ for query $q$
  - All good docs ranked above all bad docs
- Loss function $\Delta(y, y_q) \in \mathbb{R}_{\ge 0}$ tells us how bad $y$ is w.r.t. the ideal $y_q$
- $\Delta$ can encode what item- or pair-decomposable losses cannot; decomposable examples:
  - Number of misclassified documents (0/1 additive error)
  - Number of pair inversions (not rank-sensitive)
- Can be exploited to target $w$ to
  - Mean average precision (MAP)
  - Normalized discounted cumulative gain (NDCG)
  - Mean reciprocal rank (MRR)
  - . . . and traditional decomposable losses like area under curve (AUC)


## Feature map φ(x, y)

- $y$-cognizant aggregation of document features
- E.g., add vectors from rank 1 through 10 and subtract the rest
- More commonly used:

$$\varphi_{po}(x_q, y) = \frac{1}{n_q^+ n_q^-} \sum_{g,b} y_{gb}\,(x_g - x_b)$$

- At test time, the goal is to find $\arg\max_y w^\top \varphi(x_q, y)$
- Can be intractable, but sometimes not
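A minimal sketch of computing $\varphi_{po}$, assuming a toy query with made-up good/bad document features, and representing $y$ as a map from (good, bad) pairs to ±1:

```python
import numpy as np
from itertools import product

# Toy query: two good and two bad documents with made-up 2-d features.
X_good = np.array([[1.0, 0.0], [0.8, 0.2]])
X_bad  = np.array([[0.1, 0.9], [0.3, 0.4]])
n_pos, n_neg = len(X_good), len(X_bad)

def phi_po(y):
    """phi_po(x_q, y) = (1 / (n+ n-)) sum_{g,b} y_gb (x_g - x_b),
    where y maps each (good, bad) pair to +1 or -1."""
    total = np.zeros(X_good.shape[1])
    for g, b in product(range(n_pos), range(n_neg)):
        total += y[(g, b)] * (X_good[g] - X_bad[b])
    return total / (n_pos * n_neg)

# Ideal label y_q: every good document ranked above every bad one.
y_q = {(g, b): +1 for g in range(n_pos) for b in range(n_neg)}
print(phi_po(y_q))
```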


## Training

- Look for $w$ that minimizes

$$\sum_q \Delta\!\Big(y_q,\ \arg\max_y w^\top \varphi(x_q, y)\Big)$$

- Objective not continuous, differentiable, or convex
- Therefore, use the convex upper-bounding hinge loss

$$\min_w \sum_q \max\Big(0,\ \max_y \big[\Delta_q(y) - w^\top \delta\varphi_q(y)\big]\Big),$$

where $\delta\varphi_q(y) = \varphi(x_q, y_q) - \varphi(x_q, y)$

- The bound can be loose
- Only support vectors decide $w$
- May not always yield a good model


## Our contributions

- Parametric conditional probabilistic model $\Pr(y \mid x; w) \propto \exp(w^\top \varphi(x, y))$
- Intuitive minimization of expected ranking loss (unfortunately non-convex) as an alternative to hinge loss
- For specific $\varphi$ and $\Delta$, exact, efficient, closed-form expressions for the above optimization
- Convex bound on the expected-loss objective (unfortunately sums over exponentially many $y$s)
- Monte-Carlo recipe to sample a few $y$s and still optimize well
- Favorable experimental outcome on LETOR data


## Minimizing aggregated expected loss

- Define

$$\Pr(y \mid x_q; w) = \frac{\exp(w^\top \varphi(x_q, y))}{Z_q}, \qquad \text{where } Z_q = \sum_{y'} \exp(w^\top \varphi(x_q, y'))$$

- Minimize the aggregated expected loss

$$\sum_q \sum_y \Pr(y \mid x_q; w)\, \Delta(y_q, y)$$

or the log of a monotone function of the expected loss

$$\sum_q \log\left( \sum_y \Pr(y \mid x_q; w)\, f(\Delta(y_q, y)) \right)$$
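For a small enough label set, the conditional distribution and the expected loss can be computed exactly by enumeration; this sketch uses made-up features and, as an illustrative assumption, the fraction of inverted pairs as $\Delta$:

```python
import numpy as np
from itertools import product

# Tiny query: one good, two bad documents (made-up features).
X_good = np.array([[1.0, 0.0]])
X_bad  = np.array([[0.2, 0.8], [0.4, 0.3]])
pairs = [(0, 0), (0, 1)]

def phi_po(y):
    s = sum(y[i] * (X_good[g] - X_bad[b]) for i, (g, b) in enumerate(pairs))
    return s / len(pairs)

def delta(y):
    """Illustrative Delta: fraction of inverted pairs."""
    return sum(yi == -1 for yi in y) / len(pairs)

def expected_loss(w):
    """sum_y Pr(y | x_q; w) Delta(y_q, y), with Pr(y) prop. to exp(w^T phi)."""
    ys = list(product([-1, +1], repeat=len(pairs)))
    logits = np.array([w @ phi_po(y) for y in ys])
    p = np.exp(logits - logits.max())
    p /= p.sum()                    # softmax over all 2^{n+ n-} labels y
    return float(sum(pi * delta(y) for pi, y in zip(p, ys)))

# A w that ranks good above bad has low expected loss; its negation does not.
print(expected_loss(np.array([1.0, -1.0])))
print(expected_loss(np.array([-1.0, 1.0])))
```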


## From loss to gain

- The objective has $\sum_y \cdots$
- Most $y$s are terrible rankings with $\Delta \to 1$
- Expected loss may drop very slightly on optimization
- To better condition the problem, use gain $G$ rather than loss $\Delta$, e.g., NDCG instead of $\Delta_{\text{NDCG}}$; ExpGain:

$$\max_w \sum_q \log\left( \sum_y \Pr(y \mid x_q; w)\, G_q(y) \right)$$

- Neither loss nor gain optimization is convex
- Beware the sum over $y$


## Polynomial form for AUC and φ_po

- Notation:

$$S_{qi} = w^\top x_{qi}, \qquad h^q_{gb}(y_{gb}) = \exp\!\left(\frac{y_{gb}\,(S_{qg} - S_{qb})}{n_q^+ n_q^-}\right)$$

- With gain $G$ being AUC, the objective can be written as

$$\sum_q \left[ \log\left( \sum_y \#1_y \prod_{g,b} h^q_{gb}(y_{gb}) \right) - \log Z_q \right],$$

where $\#1_y = \sum_{g,b} [\![\, y_{gb} = 1 \,]\!]$


## Polynomial form for AUC and φ_po (2)

- Because of $\sum_y \cdots$ we would still take $\sum_q (n_q^+ + n_q^-)!$ or $\sum_q 2^{n_q^+ n_q^-}$ time, which is impractical
- Handy identities (see paper) replace the $\sum_y \cdots$ with an expression that can be computed in $\sum_q (n_q^+ + n_q^-)^2$ time


## Toward convexity

- Define $\delta\varphi_q(y) = \varphi(x_q, y_q) - \varphi(x_q, y)$ and consider the distribution

$$\Pr(y \mid x_q, y_q; w) = \frac{\exp(-w^\top \delta\varphi_q(y))}{\sum_{y'} \exp(-w^\top \delta\varphi_q(y'))} \tag{1}$$

- Maximum likelihood estimate (MLE):

$$L_1(w) = \sum_q L_{1q}(w) = \sum_q \log Z_q(w) \tag{2}$$


## Toward convexity (2)

- Expected loss using the new distribution:

$$L_2(w) = \sum_q L_{2q}(w) = \sum_q \mathbb{E}_{Y \sim (1)}\big[\Delta_q(Y)\big] \tag{3}$$

- Finally, consider another distribution

$$\Pr(y \mid x_q, y_q) = \frac{\exp(-w^\top \delta\varphi_q(y) + \Delta_q(y))}{\sum_{y'} \exp(-w^\top \delta\varphi_q(y') + \Delta_q(y'))} \tag{4}$$


## Toward convexity (3)

- . . . and the corresponding ConvexLoss:

$$L(w) = \sum_q \log \sum_{y'} \exp(-w^\top \delta\varphi_q(y') + \Delta_q(y')) \tag{5}$$

- Can find $w$ to minimize ConvexLoss
- Problem: $\sum_y \cdots$ is back
- Anyway, we want to go beyond AUC and $\varphi_{po}$
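ConvexLoss is a log-sum-exp of functions affine in $w$, hence convex; the brute-force sketch below (toy features and pair-inversion $\Delta$, both illustrative assumptions) evaluates it for a single query and checks the midpoint inequality numerically:

```python
import numpy as np
from itertools import product

# Tiny query: one good, two bad documents (made-up features).
X_good = np.array([[1.0, 0.0]])
X_bad  = np.array([[0.2, 0.8], [0.4, 0.3]])
pairs = [(0, 0), (0, 1)]
y_q = (+1, +1)

def phi_po(y):
    s = sum(y[i] * (X_good[g] - X_bad[b]) for i, (g, b) in enumerate(pairs))
    return s / len(pairs)

def delta(y):
    """Illustrative Delta: fraction of inverted pairs."""
    return sum(yi == -1 for yi in y) / len(pairs)

def convex_loss(w):
    """Eq. (5) for a single query: log sum_y exp(-w^T dphi(y) + Delta(y))."""
    terms = [float(-w @ (phi_po(y_q) - phi_po(y))) + delta(y)
             for y in product([-1, +1], repeat=len(pairs))]
    return float(np.log(np.sum(np.exp(terms))))

# Log-sum-exp of affine functions of w is convex, so the midpoint
# inequality L((u + v) / 2) <= (L(u) + L(v)) / 2 must hold.
u, v = np.array([1.0, -1.0]), np.array([-2.0, 0.5])
print(convex_loss((u + v) / 2) <= (convex_loss(u) + convex_loss(v)) / 2)
```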


## Markov Chain Monte Carlo sampling

```
1: while not enough samples do
2:   (re)start a random walk at a well-chosen state y0
3:   for some number of steps t = 1, 2, ... do
4:     transition from y_{t-1} to y_t
5:     make an accept/reject decision on y_t
6:   collect some subset of y0, y1, y2, ... as samples
```

- Choose restart states to span a variety of $\Delta$s
- In each walk, make local changes in $y$ so as to stay near the restart $\Delta$
- New swap method to make transitions (see paper)
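The walk above can be sketched as a Metropolis-Hastings sampler, with the paper's swap transition replaced here by a simple single-flip proposal (an illustrative assumption) and the target taken as $\Pr(y \mid x_q; w) \propto \exp(w^\top \varphi_{po}(x_q, y))$ on made-up features:

```python
import math
import random

import numpy as np

# Tiny query: one good, two bad documents (made-up features); the paper's
# swap transition is replaced by a simple single-flip proposal.
X_good = np.array([[1.0, 0.0]])
X_bad  = np.array([[0.2, 0.8], [0.4, 0.3]])
pairs = [(0, 0), (0, 1)]
w = np.array([1.0, -1.0])

def log_target(y):
    """w^T phi_po(x_q, y): unnormalized log-probability of label y."""
    s = sum(y[i] * (X_good[g] - X_bad[b]) for i, (g, b) in enumerate(pairs))
    return float(w @ (s / len(pairs)))

def sample(n_steps, y0, rng):
    y, samples = list(y0), []
    for _ in range(n_steps):
        y_new = list(y)
        y_new[rng.randrange(len(y))] *= -1           # propose a local change
        if math.log(rng.random()) < log_target(y_new) - log_target(y):
            y = y_new                                # accept; else reject
        samples.append(tuple(y))
    return samples

rng = random.Random(0)
samples = sample(5000, (+1, +1), rng)  # restart at the ideal ordering y_q
frac_ideal = samples.count((+1, +1)) / len(samples)
print(frac_ideal)  # under this w, the ideal label should dominate
```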


## Experiments: Accuracy

- All data sets from LETOR
- Baselines: RankSVM, SVMmap, RankBoost
- Example accuracy comparison on HP2004:

| Method | NDCG@1 | NDCG@5 | NDCG@10 | MAP |
|---|---|---|---|---|
| SVMmap | 0.665 | 0.835 | 0.845 | 0.746 |
| RankBoost | 0.653 | 0.821 | 0.845 | 0.739 |
| RankSVM | 0.695 | 0.852 | 0.877 | 0.764 |
| MLE | 0.665 | 0.854 | 0.872 | 0.755 |
| ExpGain_AUC | 0.695 | 0.862 | 0.885 | 0.774 |
| ExpGain_MAP | 0.680 | 0.875 | 0.895 | 0.775 |
| ExpGain_NDCG | 0.680 | 0.872 | 0.890 | 0.772 |
| L_3 AUC | 0.695 | 0.880 | 0.898 | 0.771 |
| L_3 MAP | 0.709 | 0.883 | 0.900 | 0.786 |
| L_3 NDCG | 0.681 | 0.885 | 0.900 | 0.773 |
| ConvexLoss_AUC | 0.682 | 0.890 | 0.905 | 0.778 |
| ConvexLoss_MAP | 0.724 | 0.874 | 0.885 | 0.791 |
| ConvexLoss_NDCG | 0.709 | 0.892 | 0.904 | 0.793 |


## Accuracy vs. MCMC samples

[Plot: accuracy (NDCG@1, NDCG@5, NDCG@10, MAP) vs. number of MCMC samples, 1000–5000]

- More samples $\Rightarrow$ better accuracy
- Saturates long before $|\mathcal{Y}|$ is approached


## Effect of restart skew

[Plot: accuracy (NDCG@1, NDCG@5, NDCG@10, MAP) vs. restart skewness, ranging from skewed toward the worst possible $y$, through uniform, to skewed toward the optimal ordering $y_q$]

- Two restart points, $\Delta = 0$ and $\Delta_{\max} \approx 1$
- Skewing toward the best $y$ is better


## Effect of restart skew (2)

- Potentially surprising, given that learning needs both good and bad examples
- Sample $y$ with probability $\propto \exp(k\Delta)$
- $k \ll 0$ means favor $\Delta \to 0$, which is better:

| $k \to$ | 5 | 0 | −5 | −10 | −20 |
|---|---|---|---|---|---|
| MAP | .033 | .089 | .706 | .774 | .827 |
| NDCG@10 | .041 | .124 | .815 | .905 | .918 |

- Why?


## Effect of restart skew (3)

[Histogram: percentage of samples at each sample $\Delta$ (0.1–0.9), for three restart seeds at $\Delta = 0$, $\Delta = 0.5$, and $\Delta = 1$]

- Three restart seeds


## Effect of restart skew (4)

- Perturbing a $y$ with $\Delta = 0$ can push it far, say to $\Delta \approx 0.7$, quite easily
- If $y$ has $\Delta \approx 1$, most perturbations will keep $\Delta$ near 1
- So the algorithm will not see enough good $y$s
- Which is why skewing restarts toward $\Delta \approx 0$ is good


## Conclusion

- Conditional probabilistic model $\Pr(y \mid x; w) \propto \exp(w^\top \varphi(x, y))$ for ranking
- Challenges: $\sum_{y \in \mathcal{Y}} \cdots$, nonconvexity
- Efficient closed form for specific but widely-used $\varphi$ and $\Delta$
- Convex upper bound to expected loss in all cases
- Monte-Carlo sampling method to avoid $\sum_{y \in \mathcal{Y}} \cdots$
- Competitive accuracy in experiments
- In particular, generally better than
  - Hinge loss with max-margin learning techniques
  - Boosting with weak learners
- Training by sampling the space of $y$s this way may have broader application
