Giter VIP home page Giter VIP logo

pigrank's Introduction

PigRank

Build Status

Purpose

Apache Pig UDFs for computing ranking measures.

Installation

To build the JAR from the source release, run:

./gradlew assemble

Overview

This project provides three user-defined functions for the Apache Pig language useful for Learning-to-Rank applications: DCG and MRR as evaluation measures, and Similarity to compare two distinct rankings.

DCG and MRR expect as input unordered bags of tuples; each tuple should have one column containing the rank score, and one column containing the target. The UDF sorts the bag in descending order of the former, and uses the latter one to compute the ranking quality. For DCG, any positive numbers are valid, while for MRR, any nonzero value will be regarded as a positive target.

Similarity expects two bags of tuples with corresponding rank score columns. Two additional columns are used as unique identifiers to decide whether an item in the first list is identical to one in the second list.

Note that ties in the rank score can give rise to multiple different rankings and hence rank measures. DCG and MRR take this into account by computing the expectation over all possible over all possible permutations of the tied items.

DCG

Compute (normalized) discounted cumulative gain or rank-weighted average, with weights logarithmically decreasing with rank.

DCG(normalization, cutoff, scoreCol, targetCol)

with

  • normalization: one of the strings

    • "unnormalized": Absolute DCG.
    • "normalized": Divide DCG by maximum achievable (i.e., compute nDCG).
    • "weighted_average": Divide DCG by total sum of logarithmic discount factors.
  • cutoff: Maximum rank to consider in measure, as a string. Values of zero or less are interpreted as 'no cutoff'.

  • scoreCol: Zero-based column index of the ranking score, as a string.

  • targetCol: Zero-based column index of the target score, as a string.

Example

 define NDCG pigrank.DCG('normalized', '-1', '1', '2');
 -- nDCG without rank cutoff
 -- assuming the second column contains ranking scores, the third one the target

data = load 'input' using PigStorage('\t') as ( query:chararray, score:double, target:double );

data_gr = group data by query;

eval = foreach data_gr generate flatten(group) as query, NDCG(data)

store eval into 'output';

MRR

Compute Mean Reciprocal Rank (MRR) from an unordered bag; i.e., the inverse of the rank of the first non-zero target.

MRR(scoreCol, targetCol)

Example

 define MRR pigrank.MRR('1', '2');
 -- the second column contains ranking scores, the third one the target

data = load 'input' using PigStorage('\t') as ( query:chararray, score:double, target:double );

data_gr = group data by query;

eval = foreach data_gr generate flatten(group) as query, MRR(data) as mrr ;

store eval into 'output';

Similarity

Called with two unordered bags, computes the similarity of two rankings according to one of the following measures:

  • Jaccard coefficient: Size of intersection over size of union.
  • Cosine similarity: Cosine between two vectors with dimensions corresponding to items, and inverse ranks as weights.
  • Rank-biased Overlap (RBO): Set overlap at common prefix length, averaged over exponential user top-down exploration. RBO has an intuitive probabilistic interpretation, and accounts for uneven list size across rankings and queries.

Similarity(simType, param, idCol1, scoreCol1, idCol2, scoreCol2)

with

  • simType: Type of similarity function, one of the strings "jaccard", "cosine", or "rbo".
  • param: Parameter for similarity function.
    • For jaccard or cosine, the maximum rank to include (cutoff). Values less than one are interpreted as "no cutoff".
    • For rbo, the "persistence" probability that the user will look at the next rank. Typical values: 0.9 (resp. 0.98) means that the first 10 (resp. 50) ranks have 86% of the weight if the evaluation.
  • idCol1: Unique identifier for items in the first bag, used to test for equality with items in the second bag (zero-based column index).
  • scoreCol1: Zero-based column index of ranking score for the first bag.
  • idCol2: Identifier for items in the second bag (column index).
  • scoreCol2:" Ranking score for the second bag (column index).

Example

 define JACCARD   pigrank.Similarity('jaccard', '-1', '2', '3', '2', '3');
 define JACCARD_2 com.a9.pigudf.eval.ranking.Similarity('jaccard', '2', '2', '3', '2', '3');
 define COSINE    com.a9.pigudf.eval.ranking.Similarity('cosine', '-1', '2', '3', '2', '3');
 define RBO       com.a9.pigudf.eval.ranking.Similarity('rbo', '0.9', '2', '3', '2', '3');

data = load 'input' using PigStorage('\t') as ( query:chararray, treatment:chararray, asin:chararray, score:double );

data_gr = group data by (query, treatment);

data_gr = foreach data_gr generate flatten(group) as (query, treatment), data ;

split data_gr into data1 if treatment=='t1', data2 otherwise;

side_by_side = cogroup data1 by query, data2 by query;

eval = foreach side_by_side generate flatten(group) as query, flatten(data1.data) as group_t1, flatten(data2.data) as group_t2 ;

describe eval;

eval = foreach eval generate query, JACCARD(group_t1, group_t2), JACCARD_2(group_t1, group_t2), COSINE(group_t1, group_t2), RBO(group_t1, group_t2) ;

store eval into 'output';

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.