## Conditional Models for Non-smooth Ranking Loss Functions

Avinava Dubey (IBM Research India) · Jinesh Machchhar (IIT Bombay) · Chiranjib Bhattacharyya (IISc Bangalore) · Soumen Chakrabarti (IIT Bombay)

## Ranking

- Given a set of documents and a query, order the documents so that those relevant to the query are placed above those that are irrelevant
- Not the same as binary classification
- User attention drops off with rank
- Minimize cognitive burden until the information need is satisfied
- Has led to many models of ranking loss functions

## Overview of learning to rank

- Training data:
  - query set Q
  - document set D_q associated with each query q ∈ Q
  - each document i ∈ D_q is assigned a relevance z_qi
  - for binary relevance, D_q is partitioned into relevant/good documents D_q^+ ⊂ D_q and irrelevant/bad documents D_q^− = D_q \ D_q^+
  - feature vector x_qi ∈ R^d is constructed from q and document i
- Training fits a model w ∈ R^d
- At test time, assign score w^⊤x_{ti} to each feature vector x_{ti} ∈ D_t for query t
- Sort by decreasing score, take the top 10, say

## Structured learning interpretation

- Instead of scoring individual documents . . .
- . . . think of assigning D_q a permutation or partial order as a classification label
- From a huge label set Y:
  - for permutations, Y = (D_q →^{1:1} D_q)
  - for good/bad partial orders, y ∈ Y_q = {−1,+1}^{n_q^+ n_q^−}, where n_q^+ = |D_q^+| and n_q^− = |D_q^−|
  - y_{gb} = +1 (−1) if document g is ranked above (below) document b
- What's the benefit of thinking this way?

## List-oriented loss functions

- y_q is the best y ∈ Y for query q: all good docs ranked above all bad docs
- Loss function ∆(y, y_q) ∈ R^{≥0} tells us how bad y is w.r.t. the ideal y_q
- ∆ can encode what item- or pair-decomposable losses cannot, e.g.:
  - number of misclassified documents (0/1 additive error)
  - number of pair inversions (not rank-sensitive)
- Can be exploited to target w to:
  - mean average precision (MAP)
  - normalized discounted cumulative gain (NDCG)
  - mean reciprocal rank (MRR)
  - . . . and traditional decomposable losses like area under the curve (AUC)

## Feature map φ(x, y)

- y-cognizant aggregation of document features
- E.g., add the vectors from ranks 1 through 10 and subtract the rest
- More commonly used:

$$\phi_{\mathrm{po}}(x_q, y) = \frac{1}{n_q^+ n_q^-} \sum_{g,b} y_{gb}\,(x_g - x_b)$$

- At test time, the goal is to find arg max_y w^⊤φ(x_q, y)
- Can be intractable, but sometimes not (a sketch of φ_po follows)
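As a concrete illustration, here is a minimal NumPy sketch of φ_po for a single query; `x_good`, `x_bad`, and the ±1 pair-preference matrix `y` are illustrative names, not from the paper.

```python
import numpy as np

def phi_po(x_good, x_bad, y):
    """Pairwise-order feature map phi_po(x_q, y).

    x_good: (n_plus, d) feature vectors of relevant documents g
    x_bad:  (n_minus, d) feature vectors of irrelevant documents b
    y:      (n_plus, n_minus) matrix with entries y[g, b] in {-1, +1}
    Returns (1 / (n+ n-)) * sum_{g,b} y_gb * (x_g - x_b), a d-vector.
    """
    n_plus, n_minus = x_good.shape[0], x_bad.shape[0]
    diff = x_good[:, None, :] - x_bad[None, :, :]      # (n_plus, n_minus, d) pairwise differences
    return (y[:, :, None] * diff).sum(axis=(0, 1)) / (n_plus * n_minus)
```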

## Training

- Look for w that minimizes

$$\sum_q \Delta\!\left(y_q,\ \arg\max_y w^\top \phi(x_q, y)\right)$$

- Objective is not continuous, differentiable, or convex
- Therefore, use its convex upper bound, the hinge loss:

$$\min_w \sum_q \max\!\left(0,\ \max_y\left[\Delta_q(y) - w^\top \delta\phi_q(y)\right]\right)$$

- The bound can be loose
- Only support vectors decide w
- May not always yield a good model (a brute-force sketch of the inner max follows)
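To make the max-over-y structure concrete, here is a toy sketch that evaluates the hinge upper bound for one query by brute-force enumeration of Y, reusing the `phi_po` helper above; `delta` is any loss ∆_q, e.g. `lambda y: np.mean(y < 0)` for the inverted-pair fraction. Feasible only for tiny n_q^+ n_q^−.

```python
import itertools
import numpy as np

def hinge_upper_bound(w, x_good, x_bad, delta):
    """max(0, max_y [Delta_q(y) - w^T dphi_q(y)]) for one query.

    Enumerates all y in {-1,+1}^(n+ x n-); illustrative only.
    """
    n_plus, n_minus = x_good.shape[0], x_bad.shape[0]
    y_q = np.ones((n_plus, n_minus))                   # ideal labeling: all pairs +1
    phi_ideal = phi_po(x_good, x_bad, y_q)
    best = 0.0
    for bits in itertools.product([-1.0, 1.0], repeat=n_plus * n_minus):
        y = np.array(bits).reshape(n_plus, n_minus)
        dphi = phi_ideal - phi_po(x_good, x_bad, y)    # delta phi_q(y)
        best = max(best, delta(y) - float(w @ dphi))
    return best
```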

## Our contributions

- Parametric conditional probabilistic model Pr(y|x; w) ∝ exp(w^⊤φ(x, y))
- Intuitive minimization of expected ranking loss (unfortunately non-convex) as an alternative to hinge loss
- For specific φ and ∆, exact, efficient, closed-form expressions for the above optimization
- Convex bound on the expected-loss objective (unfortunately sums over exponentially many ys)
- Monte-Carlo recipe to sample few ys and still optimize well
- Favorable experimental outcomes on LETOR data

## Minimizing aggregated expected loss

- Define

$$\Pr(y|x_q; w) = \frac{\exp(w^\top \phi(x_q, y))}{Z_q}, \qquad Z_q = \sum_{y'} \exp(w^\top \phi(x_q, y'))$$

- Minimize the aggregated expected loss (brute-force sketch below)

$$\sum_q \sum_y \Pr(y|x_q; w)\,\Delta(y_q, y)$$

  or the log of a monotone function of the expected loss

$$\sum_q \log\!\left(\sum_y \Pr(y|x_q; w)\, f(\Delta(y_q, y))\right)$$
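For intuition, here is a brute-force sketch of the per-query expected loss, enumerating Y and reusing the `phi_po` helper above; the inverted-pair fraction again stands in for ∆. Feasible only for tiny n_q^+ n_q^−.

```python
import itertools
import numpy as np

def expected_loss(w, x_good, x_bad):
    """E_{y ~ Pr(y|x_q; w)}[Delta(y_q, y)] by enumerating all y."""
    n_plus, n_minus = x_good.shape[0], x_bad.shape[0]
    scores, losses = [], []
    for bits in itertools.product([-1.0, 1.0], repeat=n_plus * n_minus):
        y = np.array(bits).reshape(n_plus, n_minus)
        scores.append(float(w @ phi_po(x_good, x_bad, y)))
        losses.append(np.mean(y < 0))          # Delta(y_q, y): inverted-pair fraction
    s = np.array(scores)
    p = np.exp(s - s.max())
    p /= p.sum()                               # Pr(y|x_q; w), computed stably
    return float(p @ np.array(losses))
```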

## From loss to gain

- The objective contains a sum over y
- Most ys are terrible rankings with ∆ → 1
- Expected loss may drop only very slightly under optimization
- To better condition the problem, use a gain G rather than a loss ∆, e.g., NDCG instead of ∆_NDCG:

$$\text{ExpGain:}\quad \max_w \sum_q \log\!\left(\sum_y \Pr(y|x_q; w)\, G_q(y)\right)$$

- Neither loss nor gain optimization is convex
- Beware the sum over y

## Polynomial form for AUC and φ_po

- Notation:

$$S_{qi} = w^\top x_{qi}, \qquad h^q_{gb}(y_{gb}) = \exp\!\left(\frac{y_{gb}(S_{qg} - S_{qb})}{n_q^+ n_q^-}\right)$$

- With the gain G being AUC, the objective can be written as

$$\sum_q \left[\log\!\left(\sum_y \#1_y \prod_{g,b} h^q_{gb}(y_{gb})\right) - \log Z_q\right]$$

  where #1_y = Σ_{g,b} 1[y_{gb} = +1] counts the correctly ordered pairs

## Polynomial form for AUC and φ_po (2)

- Because of the sum over y, we would still need Σ_q (n_q^+ n_q^−)! or Σ_q 2^{n_q^+ n_q^−} time, which is impractical
- Handy identities (see paper) replace the sum over y with an expression that can be computed in Σ_q (n_q^+ n_q^−)^2 time, as sketched below
- Likewise with the gradient expression
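To convey the flavor of such identities (a sketch, not necessarily the paper's exact derivation): since exp(w^⊤φ_po(x_q, y)) = Π_{g,b} h^q_{gb}(y_{gb}) factors over independent pair variables, the sum over y collapses to per-pair sums, and assuming G_AUC(y) = #1_y / (n_q^+ n_q^−), the per-query ExpGain term reduces to a mean of sigmoids:

```python
import numpy as np

def log_expected_auc(w, x_good, x_bad):
    """log E_{y ~ Pr(y|x_q; w)}[AUC(y)] for one query, without summing over y.

    Since the distribution factors over pairs, Pr(y_gb = +1) = sigmoid(2 a_gb)
    with a_gb = (S_qg - S_qb) / (n+ n-), so
        E[#1_y / (n+ n-)] = mean_{g,b} sigmoid(2 a_gb).
    A sketch under the stated assumptions, not the paper's exact identity.
    """
    s_good, s_bad = x_good @ w, x_bad @ w
    n = x_good.shape[0] * x_bad.shape[0]               # n_q^+ * n_q^-
    a = (s_good[:, None] - s_bad[None, :]) / n         # per-pair exponent
    p_correct = 1.0 / (1.0 + np.exp(-2.0 * a))         # Pr(y_gb = +1) under the model
    return float(np.log(p_correct.mean()))
```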

## Toward convexity

- Define δφ_q(y) = φ(x_q, y_q) − φ(x_q, y) and consider the distribution

$$\Pr(y|x_q, y_q; w) = \frac{\exp(-w^\top \delta\phi_q(y))}{\sum_{y'} \exp(-w^\top \delta\phi_q(y'))} \qquad (1)$$

- Maximum likelihood estimate (MLE):

$$L_1(w) = \sum_q L_{1q}(w) = \sum_q \log Z_q(w) \qquad (2)$$

## Toward convexity (2)

- Expected loss using the new distribution:

$$L_2(w) = \sum_q L_{2q}(w) = \sum_q E_{Y \sim (1)}\!\left[\Delta_q(Y)\right] \qquad (3)$$

- Finally, consider another distribution:

$$\Pr(y|x_q, y_q) = \frac{\exp(-w^\top \delta\phi_q(y) + \Delta_q(y))}{\sum_{y'} \exp(-w^\top \delta\phi_q(y') + \Delta_q(y'))} \qquad (4)$$

## Toward convexity (3)

- . . . and the corresponding ConvexLoss:

$$L(w) = \sum_q \log \sum_{y'} \exp\!\left(-w^\top \delta\phi_q(y') + \Delta_q(y')\right) \qquad (5)$$

- Can find w to minimize ConvexLoss (a log-sum-exp of functions affine in w, hence convex; see the sketch below)
- Problem: the sum over y is back
- Anyway, we want to go beyond AUC and φ_po
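Here is a brute-force sketch of the per-query ConvexLoss, reusing `phi_po` from above with the inverted-pair fraction as an illustrative stand-in for ∆_q; each term inside the log-sum-exp is affine in w, which is what makes L(w) convex.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def convex_loss(w, x_good, x_bad):
    """log sum_{y'} exp(-w^T dphi_q(y') + Delta_q(y')) for one query.

    Enumerates all y'; illustrative only (tiny n+ n-).
    """
    n_plus, n_minus = x_good.shape[0], x_bad.shape[0]
    phi_ideal = phi_po(x_good, x_bad, np.ones((n_plus, n_minus)))
    terms = []
    for bits in itertools.product([-1.0, 1.0], repeat=n_plus * n_minus):
        y = np.array(bits).reshape(n_plus, n_minus)
        dphi = phi_ideal - phi_po(x_good, x_bad, y)
        delta = np.mean(y < 0)                         # stand-in Delta_q(y)
        terms.append(-float(w @ dphi) + delta)
    return logsumexp(terms)
```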

## Markov Chain Monte Carlo sampling

- while not enough samples:
  - (re)start a random walk at a well-chosen state y^0
  - for some number of steps t = 1, 2, . . .:
    - transition from y^{t−1} to y^t
    - make an accept/reject decision on y^t
  - collect some subset of y^0, y^1, y^2, . . . as samples

- Choose restart states to span a variety of ∆s
- In each walk, make local changes to y so as to stay near the restart's ∆
- New swap method to make transitions (see paper); a generic Metropolis-style sketch follows
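A minimal Metropolis-style sketch of such a walk over y ∈ {−1,+1}^{n+ × n−}; single-entry sign flips are an illustrative stand-in for the paper's swap transitions, and `log_prob` would be, e.g., `lambda y: float(w @ phi_po(x_good, x_bad, y))`.

```python
import numpy as np

def mcmc_samples(log_prob, y0, n_steps, thin=10, seed=0):
    """Metropolis random walk over y in {-1,+1}^(n+ x n-).

    log_prob(y): unnormalized log Pr(y|x_q; w).
    Single-entry flips stand in for the paper's swap transitions.
    """
    rng = np.random.default_rng(seed)
    y, lp, samples = y0.copy(), log_prob(y0), []
    for t in range(1, n_steps + 1):
        g = rng.integers(y.shape[0])               # propose flipping one
        b = rng.integers(y.shape[1])               # pair preference y[g, b]
        y[g, b] *= -1
        lp_new = log_prob(y)
        if np.log(rng.random()) < lp_new - lp:
            lp = lp_new                            # accept the flip
        else:
            y[g, b] *= -1                          # reject: undo the flip
        if t % thin == 0:
            samples.append(y.copy())
    return samples
```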

## Experiments: Accuracy

- All data sets from LETOR
- Baselines: RankSVM, SVMmap, RankBoost
- Example accuracy comparison on HP2004:

| Method | NDCG@1 | NDCG@5 | NDCG@10 | MAP |
|---|---|---|---|---|
| SVMmap | 0.665 | 0.835 | 0.845 | 0.746 |
| RankBoost | 0.653 | 0.821 | 0.845 | 0.739 |
| RankSVM | 0.695 | 0.852 | 0.877 | 0.764 |
| MLE | 0.665 | 0.854 | 0.872 | 0.755 |
| ExpGain_AUC | 0.695 | 0.862 | 0.885 | 0.774 |
| ExpGain_MAP | 0.680 | 0.875 | 0.895 | 0.775 |
| ExpGain_NDCG | 0.680 | 0.872 | 0.890 | 0.772 |
| L_3 AUC | 0.695 | 0.880 | 0.898 | 0.771 |
| L_3 MAP | 0.709 | 0.883 | 0.900 | 0.786 |
| L_3 NDCG | 0.681 | 0.885 | 0.900 | 0.773 |
| ConvexLoss_AUC | 0.682 | 0.890 | 0.905 | 0.778 |
| ConvexLoss_MAP | 0.724 | 0.874 | 0.885 | 0.791 |
| ConvexLoss_NDCG | 0.709 | 0.892 | 0.904 | 0.793 |

## Accuracy vs. MCMC samples

[Plot: accuracy (NDCG@1, NDCG@5, NDCG@10, MAP) vs. number of MCMC samples, 1000–5000; accuracy axis spans 0.580–0.880]

- More samples ⇒ better accuracy
- Saturates long before the size of Y is approached

## Effect of restart skew

[Plot: accuracy (NDCG@1, NDCG@5, NDCG@10, MAP) vs. restart skewness, from skewed toward the worst possible y, through uniform, to skewed toward the optimal ordering y_q]

- Two restart points, ∆ = 0 and ∆_max ≈ 1
- Skewing toward the best y is better

## Effect of restart skew (2)

- Potentially surprising, given that learning needs both good and bad examples
- Sample a restart y with probability ∝ exp(k∆)
- k ≪ 0 means favor ∆ → 0, which works better:

| k → | 5 | 0 | −5 | −10 | −20 |
|---|---|---|---|---|---|
| MAP | .033 | .089 | .706 | .774 | .827 |
| NDCG@10 | .041 | .124 | .815 | .905 | .918 |

- Why?

## Effect of restart skew (3)

[Histogram: percentage of samples vs. sample ∆ (0.1 to 0.9), for three restart seeds with ∆ = 0, ∆ = 0.5, and ∆ = 1]

- Three restart seeds

## Effect of restart skew (4)

- Perturbing a y with ∆ = 0 can push it far, say to ∆ ≈ 0.7, quite easily
- If y has ∆ ≈ 1, most perturbations will keep ∆ near 1
- Therefore the algorithm will not see enough good ys
- Which is why skewing restarts toward ∆ ≈ 0 is good

## Conclusion

- Conditional probabilistic model Pr(y|x; w) ∝ exp(w^⊤φ(x, y)) for ranking
- Challenges: the sum over y ∈ Y, nonconvexity
- Efficient closed form for specific but widely used φ and ∆
- Convex upper bound on expected loss in all cases
- Monte-Carlo sampling method to avoid the sum over y ∈ Y
- Competitive accuracy in experiments
- In particular, generally better than:
  - hinge loss with max-margin learning techniques
  - boosting with weak learners
- Training by sampling the space of ys this way may have broader application