Learning to Rank using Gradient Descent

Posted by c cm on June 26, 2016

Keywords: Bing, L2R, RankNet, gradient descent

Probabilistic Ranking Cost Function

The problem is formulated as follows:
Given a pair of documents $d_a, d_b$ whose feature vectors lie in $R^d$, we want to predict $\hat{P}_{ab}$, the probability that document a should be ranked ahead of document b.

The model is a function $f: R^d \to R$, where $f(x_1) > f(x_2)$ means the former is ranked higher than the latter, i.e. $x_1 \triangleright x_2$.

Write $P(x_i \triangleright x_j) \equiv P_{ij},\ o_i \equiv f(x_i),\ o_{ij} \equiv f(x_i) - f(x_j)$.

As in logistic regression, the output difference $o_{ij}$ is mapped to a posterior probability $P_{ij}$, and the cross entropy cost function is defined between $P_{ij}$ and the target $\hat P_{ij}$.
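Writing these out from the definitions above (the standard logistic form used in the paper):

$$P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}$$

$$C_{ij} = -\hat P_{ij} \log P_{ij} - (1 - \hat P_{ij}) \log (1 - P_{ij}) = -\hat P_{ij}\, o_{ij} + \log\left(1 + e^{o_{ij}}\right)$$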

The cost is asymptotically linear in $o_{ij}$, and is symmetric when $\hat P = 1/2$.
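Concretely, from the closed form above: as $o_{ij} \to +\infty$ the cost approaches $(1 - \hat P_{ij})\, o_{ij}$, and as $o_{ij} \to -\infty$ it approaches $-\hat P_{ij}\, o_{ij}$; when $\hat P_{ij} = 1/2$ both asymptotes have slope $1/2$, giving a cost symmetric about $o_{ij} = 0$.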

Adjacency Posteriors

p.3: Theorem:

Given a sample set $x_i, i = 1, \dots, m$ and any permutation $Q$ of the consecutive integers $\{1, 2, \dots, m\}$, suppose that an arbitrary target posterior $0 \le \hat P_{kj} \le 1$ is specified for every adjacent pair $k = Q(i), j = Q(i+1), i = 1, \dots, m-1$.

Denote the set of such $\hat P$'s, for a given choice of $Q$, a set of 'adjacency posteriors'.

Then specifying any set of adjacency posteriors is necessary and sufficient to uniquely identify a target posterior $0 \le \hat P_{ij} \le 1$ for every pair of samples $x_i, x_j$.
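A sketch of how the non-adjacent targets are recovered (following the paper's construction): requiring the target log-odds $\hat o_{ij} \equiv \log\frac{\hat P_{ij}}{1 - \hat P_{ij}}$ to be additive, $\hat o_{ik} = \hat o_{ij} + \hat o_{jk}$, gives the combination rule

$$\hat P_{ik} = \frac{\hat P_{ij}\,\hat P_{jk}}{1 + 2\hat P_{ij}\hat P_{jk} - \hat P_{ij} - \hat P_{jk}}$$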

RankNet

For a general neural network:

For a given training sample, denote the network outputs by $o_i$ and the targets by $t_i$; assuming $q$ output nodes in total, the cost function is $\sum_{i=1}^q f(o_i, t_i)$.

Let $\alpha_k$ denote the model parameters and $\eta_k$ the learning rate.
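A single gradient descent step, written out from these definitions, moves each parameter against the gradient of the cost:

$$\alpha_k \to \alpha_k - \eta_k \frac{\partial f}{\partial \alpha_k}$$

so that to first order $\delta f \approx \sum_k \frac{\partial f}{\partial \alpha_k}\,\delta\alpha_k = -\sum_k \eta_k \left(\frac{\partial f}{\partial \alpha_k}\right)^2 \le 0$.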

Let $g^j$ denote the transfer function of layer $j$.
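For a net with one hidden layer (numbering the hidden layer 2 and the output layer 3, roughly in the paper's notation), the output of node $i$ can then be written as

$$o_i = g^3\!\left(\sum_j w_{ij}^{32}\, g^2\!\left(\sum_k w_{jk}^{21} x_k + b_j^2\right) + b_i^3\right)$$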

Taking partial derivatives of the cost with respect to the parameters gives the usual back-propagation equations.
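Written out for the two-layer form above (a sketch, with $f'$ the derivative of the cost with respect to $o_i$ and $g'$ the derivative of the transfer function at its argument):

$$\frac{\partial f}{\partial b_i^3} = f'\,{g'}^3 \equiv \Delta_i^3, \qquad \frac{\partial f}{\partial w_{ij}^{32}} = \Delta_i^3\, g_j^2$$

$$\frac{\partial f}{\partial b_j^2} = {g'}_j^2 \sum_i \Delta_i^3 w_{ij}^{32} \equiv \Delta_j^2, \qquad \frac{\partial f}{\partial w_{jk}^{21}} = \Delta_j^2\, x_k$$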

Combining the pairwise cost function $f(o_2 - o_1)$ described earlier with the network above gives RankNet: the same network scores both documents of a pair, and the cost is applied to the difference of the two outputs.
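A minimal numpy sketch of this combination (not code from the paper; it assumes a one-hidden-layer tanh net, a target $\hat P = 1$ meaning the first document should rank first, and all variable names are just illustrative):

```python
import numpy as np

# Small scoring net f: R^d -> R, shared by both documents of a pair.
rng = np.random.default_rng(0)
d, h = 10, 16                                   # input dim, hidden units (arbitrary)
W1, b1 = rng.normal(scale=0.1, size=(h, d)), np.zeros(h)
w2, b2 = rng.normal(scale=0.1, size=h), 0.0

def score(x):
    """Return the ranking score f(x) and the hidden activations."""
    z = np.tanh(W1 @ x + b1)
    return w2 @ z + b2, z

def pair_loss_and_grads(x_a, x_b, p_target=1.0):
    """Cross-entropy cost C(o_ab) with o_ab = f(x_a) - f(x_b), plus gradients."""
    o_a, z_a = score(x_a)
    o_b, z_b = score(x_b)
    o_ab = o_a - o_b
    p = 1.0 / (1.0 + np.exp(-o_ab))             # P_ab = e^{o_ab} / (1 + e^{o_ab})
    loss = -p_target * o_ab + np.log1p(np.exp(o_ab))
    d_o = p - p_target                          # dC/do_ab
    grads = {}
    # The pair gradient is the difference of the two per-document gradients,
    # each scaled by the scalar dC/do_ab (ordinary backprop through the net).
    for sign, x, z in [(+1.0, x_a, z_a), (-1.0, x_b, z_b)]:
        dz = sign * d_o * w2 * (1.0 - z ** 2)   # back through tanh
        contrib = {"w2": sign * d_o * z, "b2": sign * d_o,
                   "W1": np.outer(dz, x), "b1": dz}
        for k, v in contrib.items():
            grads[k] = grads.get(k, 0.0) + v
    return loss, grads

# One SGD step on a random pair where x_a should outrank x_b.
x_a, x_b = rng.normal(size=d), rng.normal(size=d)
loss, g = pair_loss_and_grads(x_a, x_b)
eta = 0.1
W1 -= eta * g["W1"]; b1 -= eta * g["b1"]
w2 -= eta * g["w2"]; b2 -= eta * g["b2"]
```

Training then simply repeats this SGD step over many labeled pairs.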

Advantages:

  1. It can be trained on inconsistent data, e.g. rank a > rank b, rank b > rank c, rank c > rank a.
  2. The cost attains its minimum at $o_{ij} = 0$ (when $\hat P_{ij} = 1/2$) and asymptotes to a linear function far from 0, making it more robust.
  3. certain -> more certain: combining confident adjacency posteriors yields an even more confident posterior for the implied pair (see the example after this list).
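For example, plugging $\hat P_{ij} = \hat P_{jk} = 0.6$ into the combination rule from the Adjacency Posteriors section gives

$$\hat P_{ik} = \frac{0.36}{1 + 0.72 - 0.6 - 0.6} = \frac{0.36}{0.52} \approx 0.69 > 0.6$$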

Further reading

LambdaRank

Video