
Interleaving

A Python library for interleaving, a method for comparing two or more rankers based on observed user clicks by interleaving their results.


Introduction

A/B testing is a well-known technique for comparing two or more systems based on user behavior in a production environment, and has been used to improve the quality of many services. Interleaving, an alternative to A/B testing for comparing rankings, has been shown to be up to 100 times more efficient [1, 2]. Since efficiency matters greatly when many alternatives must be compared, interleaving is a promising technique for user-based ranking evaluation. This library aims to provide most of the algorithms that have been proposed in the literature.

Interleaving algorithms

Interleaving for two rankers

  • Balanced interleaving [3]
  • Team draft interleaving [4]
  • Probabilistic interleaving [5]
  • Optimized interleaving [6]

Interleaving for multiple rankers

  • Team draft multileaving [7]
  • Probabilistic multileaving [8]
  • Optimized multileaving [7]
  • Roughly optimized multileaving [9]
  • Pairwise preference multileaving [10]

Note that probabilistic interleaving and probabilistic multileaving use different strategies to select the ranker from which a document is drawn. In the original papers, probabilistic interleaving samples a ranker with replacement, i.e. one of the two rankers is sampled at every document selection. Probabilistic multileaving samples a ranker without replacement: let D be the set of all rankers; a ranker is sampled from D without replacement, and when D becomes empty, all the rankers are put into D again. The Probabilistic class has a keyword argument replace by which either of these strategies can be selected.
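The two sampling strategies can be sketched in plain Python (the helper names here are illustrative, not part of the library):

```python
import random

def sample_with_replacement(rankers, k):
    # Probabilistic interleaving: draw a ranker independently at each of
    # the k document selections.
    return [random.choice(rankers) for _ in range(k)]

def sample_without_replacement(rankers, k):
    # Probabilistic multileaving: draw from a pool D without replacement,
    # refilling D with all rankers whenever it becomes empty.
    pool, draws = [], []
    for _ in range(k):
        if not pool:
            pool = list(rankers)
            random.shuffle(pool)
        draws.append(pool.pop())
    return draws
```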

Prerequisites

  • NumPy
  • SciPy
  • PuLP

Installation

interleaving and its prerequisites can be installed by

$ pip install git+https://github.com/mpkato/interleaving.git

An alternative is

$ git clone https://github.com/mpkato/interleaving.git
$ cd interleaving
$ python setup.py install

Usage

>>> import interleaving
>>>
>>> a = [1, 2, 3, 4, 5] # Ranking 1
>>> b = [4, 3, 5, 1, 2] # Ranking 2
>>> method = interleaving.TeamDraft([a, b]) # initialize an interleaving method
>>>
>>> ranking = method.interleave() # interleaving
>>> ranking
[1, 4, 2, 3, 5]
>>>
>>> clicks = [0, 2] # observed clicks, i.e. documents 1 and 2 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # (0, 1) indicates Ranking 1 won Ranking 2.
[(0, 1)]
>>>
>>> clicks = [1, 3] # observed clicks, i.e. documents 4 and 3 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # (1, 0) indicates Ranking 2 won Ranking 1.
[(1, 0)]
>>>
>>> clicks = [0, 1] # observed clicks, i.e. documents 1 and 4 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # if (0, 1) or (1, 0) does not appear in the result,
>>>        # it indicates a tie between Rankings 1 and 2.
[]

Note

The ranking sampling algorithm of optimized multileaving [7] and roughly optimized multileaving [9] may take a long time or even run into an infinite loop. To work around this problem, this implementation supports a secure_sampling flag that limits the number of sampling attempts to sample_num.

>>> import interleaving
>>> interleaving.Optimized([[1, 2], [2, 3]], sample_num=4, secure_sampling=True)

References

  1. Chapelle et al. "Large-scale Validation and Analysis of Interleaved Search Evaluation." ACM TOIS 30.1 (2012): 6.
  2. Schuth, Hofmann, and Radlinski. "Predicting Search Satisfaction Metrics with Interleaved Comparisons." SIGIR 2015.
  3. Joachims. "Evaluating Retrieval Performance Using Clickthrough Data." Text Mining 2003.
  4. Radlinski, Kurup, and Joachims. "How Does Clickthrough Data Reflect Retrieval Quality?" CIKM 2008.
  5. Hofmann, Whiteson, and de Rijke. "A Probabilistic Method for Inferring Preferences from Clicks." CIKM 2011.
  6. Radlinski and Craswell. "Optimized Interleaving for Online Retrieval Evaluation." WSDM 2013.
  7. Schuth et al. "Multileaved Comparisons for Fast Online Evaluation." CIKM 2014.
  8. Schuth et al. "Probabilistic Multileave for Online Retrieval Evaluation." SIGIR 2015.
  9. Manabe et al. "A Comparative Live Evaluation of Multileaving Methods on a Commercial cQA Search." SIGIR 2017.
  10. Oosterhuis and de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." CIKM 2017.

License

MIT License (see LICENSE file).


interleaving's Issues

Proposal for revising MultileavingMethod

The current implementation generates an interleaved ranking when two rankings are given.
This may not be convenient when

  1. The method requires much computation time, which caching could save, and
  2. The distribution of rankings is needed (e.g. to store the results in a file for later use).

A proposal here is to accept two or more rankings in __init__, compute the distribution during initialization, and return a ranking when the interleave method is called (without any arguments).
Moreover, we can return a dict or a list of tuples that includes all the possible rankings with their probability.

If computing the probability or generating all the possible rankings is not feasible, we can just receive two or more rankings in the initialization, and generate a ranking when the interleave method is called.
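A minimal sketch of the proposed interface, under the assumption that the distribution can be computed up front (the class name and the placeholder distribution are illustrative, not the library's actual API):

```python
import random

class CachedMultileaving:
    # Hypothetical sketch of the proposal: rankings are given at
    # initialization, the distribution is computed once, and interleave()
    # takes no arguments.
    def __init__(self, *rankings):
        self._rankings = [list(r) for r in rankings]
        self._distribution = self._compute_distribution()

    def _compute_distribution(self):
        # Placeholder: a real method would enumerate all possible
        # interleaved rankings with their probabilities.
        n = len(self._rankings)
        return {tuple(r): 1.0 / n for r in self._rankings}

    def interleave(self):
        # Draw one ranking according to the cached distribution.
        rankings = list(self._distribution)
        weights = list(self._distribution.values())
        return list(random.choices(rankings, weights=weights)[0])

    def ranking_distribution(self):
        # All possible rankings with their probabilities.
        return dict(self._distribution)
```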

Optimized's infinite loop

The current Optimized._sample_ranking samples rankings until the number of sampled rankings becomes sample_num.
However, this sometimes never ends: the number of possible rankings can be less than sample_num.

So it must be "until rankings are sampled sample_num times", not "until the number of sampled rankings reaches sample_num".
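The fix can be sketched as follows, assuming a sampler callable that returns one ranking per call (the function names are illustrative):

```python
def sample_rankings(sampler, sample_num):
    # Attempt exactly sample_num draws, rather than looping until
    # sample_num *distinct* rankings have been collected -- the latter
    # never terminates when fewer than sample_num rankings exist.
    rankings = set()
    for _ in range(sample_num):
        rankings.add(tuple(sampler()))
    return rankings
```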

SystemError: Parent module '' not loaded

>python ./tests/test_balanced.py
Traceback (most recent call last):
  File "./tests/test_balanced.py", line 4, in <module>
    from .test_methods import TestMethods
SystemError: Parent module '' not loaded, cannot perform relative import

Dumping Ranking instances

I have been thinking about how to pass the interleaving results to a production environment.

Probably, an instance of InterleavingMethod can have function dump_rankings(self, file), which dumps sampled rankings into a file. Rankings, their probability, and some additional information (e.g. teams or credits) must be stored in the dump file. It is probably convenient to assign a unique ID for each ranking by using hash.

The format of the dump file should be something readable by the other programs, such as json or xml.

A production system can load the dump file, show the rankings to users, observe user clicks, and record the number of impressions and rank of clicks for each ranking.

InterleavingMethod.load_rankings(cls, file) can then load the rankings from the dump file and somehow obtain the number of impressions and rank of clicks. InterleavingMethod.evaluate(cls, ranking, clicks) can be used for deciding the ranking preference based on the loaded information.

In summary, my proposal is to implement:

  1. InterleavingMethod.dump_rankings(self, file) for dumping sampled rankings
  2. InterleavingMethod.load_rankings(cls, file) for loading the dumped rankings
  3. Ranking.dumps(self) and Ranking.loads(self, s)
  4. The format of the dump file can be
{ 
    ranking_id: {
        'probability': float,
        'ranking': {
            'ranking_list': list,
            'teams': { ranker_index: list }
        }
    }
}
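As a sketch, a dump in this format round-trips through the standard json module (the ranking ID and all values below are illustrative; note that JSON object keys must be strings, so ranker indices are stringified):

```python
import json

# Hypothetical dump following the proposed format; the top-level key
# would be a hash-based unique ID for the ranking.
dump = {
    "140052": {
        "probability": 0.5,
        "ranking": {
            "ranking_list": [1, 4, 2, 3, 5],
            "teams": {"0": [1, 2], "1": [4, 3]},
        },
    }
}
serialized = json.dumps(dump)      # what dump_rankings would write
restored = json.loads(serialized)  # what load_rankings would read
```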

Output of InterleavingMethod.evaluate

Is it OK if the output of InterleavingMethod.evaluate is changed to a list where each element (i,j) indicates i won j?

Current:

Return one of the following tuples:
- (1, 0): Ranking 'a' won
- (0, 1): Ranking 'b' won
- (0, 0): Tie

Expected:

Return a list of pairs of ranker indices in which element (i, j) indicates i won j.
e.g. a result [(1, 0), (2, 1), (2, 0)]  indicates ranker 1 won ranker 0, and ranker 2 won rankers 0 as well as 1.

Probabilistic._compute_scores IS WRONG

I think the original PM paper is misleading.

When we compute P(a|l,q) by using P(l|a,q), it is necessary to use P(l|q) for normalization.
More precisely,

P(a|l,q) = P(l|a,q)P(a|q)/P(l|q)

Since P(l|q) = sum_a(P(l|a,q)P(a|q)) and P(a|q) = 1/(# rankers), the correct equation should be

P(a|l,q) = P(l|a,q)/sum_a(P(l|a, q))

Therefore, Algorithm 2 in the original paper must normalize p' before computing the expected outcome.
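Numerically, the normalization is just dividing each likelihood by their sum; since the prior P(a|q) is uniform, it cancels out (the likelihood values below are made up for illustration):

```python
# Hypothetical per-ranker likelihoods P(l|a,q) for three rankers; with a
# uniform prior P(a|q), the posterior P(a|l,q) is each likelihood divided
# by the sum over rankers.
likelihoods = [0.2, 0.1, 0.1]
total = sum(likelihoods)                      # proportional to P(l|q)
posteriors = [p / total for p in likelihoods]
# posteriors sums to 1 (approximately [0.5, 0.25, 0.25] here)
```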

Simulation

Implement the simulation with LETOR data.

pypi package

This library is great. Would it be possible to make a pypi package out of it?

If you need help in doing so, I'd love to help out.

Needs a fix for Ranking.__hash__ from TeamDraft

Suppose A = [1, 2, 3] and B = [2, 3, 1]. Two rankings are obtained by TeamDraft: [1, 2, 3] and [2, 1, 3]. However, there must be four types of rankings, i.e.

  • [1, 2, 3] with team A = [1, 3] and team B = [2],
  • [1, 2, 3] with team A = [1] and team B = [2, 3],
  • [2, 1, 3] with team A = [1, 3] and team B = [2], and
  • [2, 1, 3] with team A = [1] and team B = [2, 3].

The current implementation of Ranking.__hash__ only takes into account the equality of the ranking, not that of the teams.
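A sketch of a hash that also incorporates the teams (the attribute layout here is an assumption for illustration, not the library's actual Ranking class):

```python
class Ranking(list):
    # Hypothetical: the ranked documents plus a teams dict mapping each
    # ranker index to the documents credited to it.
    def __init__(self, docs, teams):
        super().__init__(docs)
        self.teams = teams

    def __hash__(self):
        # Combine the document order with a canonical form of the teams,
        # so equal rankings with different team assignments hash apart.
        team_key = tuple(sorted((i, tuple(t)) for i, t in self.teams.items()))
        return hash((tuple(self), team_key))
```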

setup.py does not install the simulation subpackage

>python setup.py build
...
>python setup.py install
...
>python ./tests/test_methods.py
Traceback (most recent call last):
  File "./tests/test_methods.py", line 1, in <module>
    import interleaving as il
  File "C:\Users\owner\AppData\Local\Programs\Python\Python35\lib\site-packages\interleaving-0.0.1-py3.5.egg\interleaving\__init__.py", line 6, in <module>
ImportError: cannot import name 'simulation'

Prefix constraint sampling

In [Schuth+, CIKM 2014], the return value L of the prefix constraint sampling is a set, so each prefix should be distinct. However, the current implementation of Optimized._sample_rankings (inherited from InterleavingMethod) seems to allow generating one prefix multiple times. Shouldn't we modify the method?

A tiny bugfix

I forgot to remove a call to print from the test code for Probabilistic.

Deprecation Warning

/Users/kato/.local/share/virtualenvs/interleaving-M68ZqYhg/lib/python3.6/site-packages/pulp/solvers.py:71: DeprecationWarning: The SafeConfigParser class has been renamed to ConfigParser in Python 3.2. This alias will be removed in future versions. Use ConfigParser directly instead.
  'os':operating_system, 'arch':arch})

/Users/kato/dev/interleaving/interleaving/probabilistic.py:215: DeprecationWarning: `logsumexp` is deprecated!
Importing `logsumexp` from scipy.misc is deprecated in scipy 1.0.0. Use `scipy.special.logsumexp` instead.
  p_all = np.exp(p_log_all - misc.logsumexp(p_log_all))

There are two messages about deprecation.

Random team selection in Probabilistic.multileave

I wonder whether the current implementation of the random team selection in Probabilistic.multileave is correct.

The current code randomizes the order of the team list, uses all of the teams in that order, and then randomizes the list again once all of them have been used.

In "Probabilistic Multileave for Online Retrieval Evaluation, SIGIR 2015", on the other hand, Line 6 in Algorithm 1 may suggest that it draws one team from the team list every time.

What do you think?

Use `dict.fromkeys` instead of `set` for deterministic behavior

Some methods use sets to ensure the uniqueness of items and rely on the order of items in the set. Python's set order is not deterministic across interpreter runs, which makes it hard to control, especially during tests, as there is no seed for set ordering.

Using dict.fromkeys instead of set achieves the same uniqueness while guaranteeing a deterministic (insertion-order) ordering of items.

The following methods are affected:

  • PairwisePreference._current_candidates
  • Optimized._sample_rankings
  • Optimized._sample
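For example (plain Python, independent of the library):

```python
items = [3, 1, 3, 2, 1]
# set() deduplicates, but its iteration order depends on hashing and is
# not something tests can seed or control.
# dict.fromkeys() deduplicates while preserving first-seen order
# (dicts keep insertion order as of Python 3.7).
unique_ordered = list(dict.fromkeys(items))
# unique_ordered == [3, 1, 2]
```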

Too many samplings

Some tests conduct too many samplings, which makes the tests quite slow.

Pairwise Preference Multileaving

Harrie Oosterhuis, and Maarten de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017.

The choice of rankers in Probabilistic Interleaving and Multileaving

It seems that the choice of rankers is different at Probabilistic Interleaving and Multileaving.

In Probabilistic Interleaving, the authors said:

After constructing s1 and s2, l is generated similarly to the team draft method. However, instead of randomizing the ranker to contribute the next document per pair, one of the softmax functions is randomly selected at each rank. Doing so is mathematically convenient, as the only component that changes at each rank is the distribution over documents.

"one of the softmax functions is randomly selected at each rank" may mean that it randomly selects a ranker every time it decides a document to be put in the combined ranking.

In Probabilistic Multileaving, on the other hand, it randomly picks up one of the rankers without replacement, as you explained.

The conclusion might be one of the following:

  1. my understanding is wrong: they use the same method for selecting a ranker
  2. Probabilistic Interleaving and Multileaving use different strategies for selecting a ranker; so we need two types of implementation.
  3. Probabilistic Interleaving and Multileaving use different strategies for selecting a ranker; yet we should select either for simplicity.

Discussion on the arguments in InterleavingMethod

Sorry for trying to change the arguments in InterleavingMethod again.

The current arguments of InterleavingMethod are:

def __init__(self, max_length, sample_num, *lists)

This could be improved since keyword arguments with default values cannot be used with a variable-length argument.

There are some options:

def __init__(self, max_length, lists, sample_num=None, some_keyword_argument=3)

(lists is no longer a variable-length argument)

def __init__(self, max_length, *lists, **kwargs)

(kwargs is a dict for keyword arguments)
e.g.

im = InterleavingMethod(10, [1, 2], [3, 4], hoge=3)
# kwargs = {"hoge": 3}
# If "foo" is missing from this dict, we can fall back to a default explicitly:
if "foo" not in kwargs:
    kwargs["foo"] = FOO_DEFAULT_VALUE

Which do you prefer?
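For reference, Python 3 also allows keyword-only arguments after a variable-length parameter, which provides defaults without inspecting **kwargs by hand (the function and parameter names below are illustrative):

```python
def make_method(max_length, *lists, sample_num=None, foo=3):
    # sample_num and foo are keyword-only: callers must pass them by name,
    # and they keep their defaults otherwise.
    return {"max_length": max_length, "lists": lists,
            "sample_num": sample_num, "foo": foo}
```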
