
Interleaving

A Python library for interleaving, a method for comparing two or more rankers based on observed user clicks by interleaving their results.


Introduction

A/B testing is a well-known technique for comparing two or more systems based on user behavior in a production environment, and has been used to improve the quality of many services. Interleaving, an alternative to A/B testing for comparing rankings, has been shown to be up to 100 times more efficient [1, 2]. Since efficiency matters greatly when many alternatives must be compared, interleaving is a promising technique for user-based ranking evaluation. This library aims to provide most of the algorithms that have been proposed in the literature.

Interleaving algorithms

Interleaving for two rankers

  • Balanced interleaving [3]
  • Team draft interleaving [4]
  • Probabilistic interleaving [5]
  • Optimized interleaving [6]

Interleaving for multiple rankers

  • Team draft multileaving [7]
  • Probabilistic multileaving [8]
  • Optimized multileaving [7]
  • Roughly optimized multileaving [9]
  • Pairwise preference multileaving [10]

Note that probabilistic interleaving and probabilistic multileaving use different strategies to select the ranker from which a document is drawn. In the original papers, probabilistic interleaving samples a ranker with replacement, i.e. one of the two rankers is sampled at every document selection. Probabilistic multileaving samples a ranker without replacement: let D be the set of all rankers; a ranker is sampled from D without replacement, and when D becomes empty, all the rankers are put into D again. The Probabilistic class has a keyword argument replace by which either of these strategies can be selected.
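The two sampling strategies can be sketched in plain Python (the helper names here are illustrative, not part of the library):

```python
import random

def sample_with_replacement(rankers, k):
    # Probabilistic interleaving: draw a ranker independently at each of
    # the k document selections.
    return [random.choice(rankers) for _ in range(k)]

def sample_without_replacement(rankers, k):
    # Probabilistic multileaving: draw from a pool D without replacement,
    # refilling D with all rankers whenever it becomes empty.
    pool, draws = [], []
    for _ in range(k):
        if not pool:
            pool = list(rankers)
            random.shuffle(pool)
        draws.append(pool.pop())
    return draws
```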

Prerequisites

  • NumPy
  • SciPy
  • PuLP

Installation

interleaving and its prerequisites can be installed by

$ pip install git+https://github.com/mpkato/interleaving.git

An alternative is

$ git clone https://github.com/mpkato/interleaving.git
$ cd interleaving
$ python setup.py install

Usage

>>> import interleaving
>>>
>>> a = [1, 2, 3, 4, 5] # Ranking 1
>>> b = [4, 3, 5, 1, 2] # Ranking 2
>>> method = interleaving.TeamDraft([a, b]) # initialize an interleaving method
>>>
>>> ranking = method.interleave() # interleaving
>>> ranking
[1, 4, 2, 3, 5]
>>>
>>> clicks = [0, 2] # observed clicks, i.e. documents 1 and 2 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # (0, 1) indicates Ranking 1 won Ranking 2.
[(0, 1)]
>>>
>>> clicks = [1, 3] # observed clicks, i.e. documents 4 and 3 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # (1, 0) indicates Ranking 2 won Ranking 1.
[(1, 0)]
>>>
>>> clicks = [0, 1] # observed clicks, i.e. documents 1 and 4 are clicked
>>> result = interleaving.TeamDraft.evaluate(ranking, clicks)
>>> result # if (0, 1) or (1, 0) does not appear in the result,
>>>        # it indicates a tie between Rankings 1 and 2.
[]

Note

The ranking sampling algorithm of optimized multileaving [7] and roughly optimized multileaving [9] may take a long time or even run into an infinite loop. To work around this problem, this implementation supports a secure_sampling flag that limits the number of sampling attempts to sample_num.

>>> import interleaving
>>> interleaving.Optimized([[1, 2], [2, 3]], sample_num=4, secure_sampling=True)

References

  1. Chapelle et al. "Large-scale Validation and Analysis of Interleaved Search Evaluation." ACM TOIS 30.1 (2012): 6.
  2. Schuth, Hofmann, and Radlinski. "Predicting Search Satisfaction Metrics with Interleaved Comparisons." SIGIR 2015.
  3. Joachims. "Evaluating Retrieval Performance Using Clickthrough Data." Text Mining 2003.
  4. Radlinski, Kurup, and Joachims. "How Does Clickthrough Data Reflect Retrieval Quality?" CIKM 2008.
  5. Hofmann, Whiteson, and de Rijke. "A Probabilistic Method for Inferring Preferences from Clicks." CIKM 2011.
  6. Radlinski and Craswell. "Optimized Interleaving for Online Retrieval Evaluation." WSDM 2013.
  7. Schuth et al. "Multileaved Comparisons for Fast Online Evaluation." CIKM 2014.
  8. Schuth et al. "Probabilistic Multileave for Online Retrieval Evaluation." SIGIR 2015.
  9. Manabe et al. "A Comparative Live Evaluation of Multileaving Methods on a Commercial cQA Search." SIGIR 2017.
  10. Oosterhuis and de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." CIKM 2017.

License

MIT License (see LICENSE file).


interleaving's Issues

Proposal for revising MultileavingMethod

The current implementation generates an interleaved ranking when two rankings are given.
This may not be convenient when

  1. The method requires much computation time, which caching could save, and
  2. The distribution of rankings is needed (e.g. to store the results in a file for later use).

A proposal here is to accept two or more rankings in __init__, compute the distribution during initialization, and return a ranking when the interleave method is called (without any arguments).
Moreover, we can return a dict or a list of tuples that includes all the possible rankings with their probability.

If computing the probability or generating all the possible rankings is not feasible, we can just receive two or more rankings in the initialization, and generate a ranking when the interleave method is called.
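A minimal sketch of the proposed interface, under the assumption that the distribution can be computed up front (the class name and the placeholder distribution are illustrative, not the library's actual API):

```python
import random

class CachedMultileaving:
    # Hypothetical sketch of the proposal: rankings are given at
    # initialization, the distribution is computed once, and interleave()
    # takes no arguments.
    def __init__(self, *rankings):
        self._rankings = [list(r) for r in rankings]
        self._distribution = self._compute_distribution()

    def _compute_distribution(self):
        # Placeholder: a real method would enumerate all possible
        # interleaved rankings with their probabilities.
        n = len(self._rankings)
        return {tuple(r): 1.0 / n for r in self._rankings}

    def interleave(self):
        # Draw one ranking according to the cached distribution.
        rankings = list(self._distribution)
        weights = list(self._distribution.values())
        return list(random.choices(rankings, weights=weights)[0])

    def ranking_distribution(self):
        # All possible rankings with their probabilities.
        return dict(self._distribution)
```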

Optimized's infinite loop

The current Optimized._sample_ranking samples rankings until the number of sampled rankings becomes sample_num.
However, this sometimes never ends: the number of possible rankings can be less than sample_num.

So it must be "until rankings are sampled sample_num times", not "until the number of sampled rankings reaches sample_num".
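The fix can be sketched as follows, assuming a sampler callable that returns one ranking per call (the function names are illustrative):

```python
def sample_rankings(sampler, sample_num):
    # Attempt exactly sample_num draws, rather than looping until
    # sample_num *distinct* rankings have been collected -- the latter
    # never terminates when fewer than sample_num rankings exist.
    rankings = set()
    for _ in range(sample_num):
        rankings.add(tuple(sampler()))
    return rankings
```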

SystemError: Parent module '' not loaded

>python ./tests/test_balanced.py
Traceback (most recent call last):
  File "./tests/test_balanced.py", line 4, in <module>
    from .test_methods import TestMethods
SystemError: Parent module '' not loaded, cannot perform relative import

Dumping Ranking instances

I have been thinking about how to pass the interleaving results to a production environment.

Probably, an instance of InterleavingMethod can have function dump_rankings(self, file), which dumps sampled rankings into a file. Rankings, their probability, and some additional information (e.g. teams or credits) must be stored in the dump file. It is probably convenient to assign a unique ID for each ranking by using hash.

The format of the dump file should be something readable by the other programs, such as json or xml.

A production system can load the dump file, show the rankings to users, observe user clicks, and record the number of impressions and rank of clicks for each ranking.

InterleavingMethod.load_rankings(cls, file) can then load the rankings from the dump file and somehow obtain the number of impressions and rank of clicks. InterleavingMethod.evaluate(cls, ranking, clicks) can be used for deciding the ranking preference based on the loaded information.

In summary, my proposal is to implement:

  1. InterleavingMethod.dump_rankings(self, file) for dumping sampled rankings
  2. InterleavingMethod.load_rankings(cls, file) for loading the dumped rankings
  3. Ranking.dumps(self) and Ranking.loads(self, s)
  4. The format of the dump file can be
{ 
    ranking_id: {
        'probability': float,
        'ranking': {
            'ranking_list': list,
            'teams': { ranker_index: list }
        }
    }
}
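As a sketch, a dump in this format round-trips through the standard json module (the ranking ID and all values below are illustrative; note that JSON object keys must be strings, so ranker indices are stringified):

```python
import json

# Hypothetical dump following the proposed format; the top-level key
# would be a hash-based unique ID for the ranking.
dump = {
    "140052": {
        "probability": 0.5,
        "ranking": {
            "ranking_list": [1, 4, 2, 3, 5],
            "teams": {"0": [1, 2], "1": [4, 3]},
        },
    }
}
serialized = json.dumps(dump)      # what dump_rankings would write
restored = json.loads(serialized)  # what load_rankings would read
```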

Output of InterleavingMethod.evaluate

Is it OK if the output of InterleavingMethod.evaluate is changed to a list where each element (i,j) indicates i won j?

Current:

Return one of the following tuples:
- (1, 0): Ranking 'a' won
- (0, 1): Ranking 'b' won
- (0, 0): Tie

Expected:

Return a list of pairs of ranker indices in which element (i, j) indicates i won j.
e.g. a result [(1, 0), (2, 1), (2, 0)]  indicates ranker 1 won ranker 0, and ranker 2 won rankers 0 as well as 1.

Probabilistic._compute_scores IS WRONG

I think the original PM paper is misleading.

When we compute P(a|l,q) by using P(l|a,q), it is necessary to use P(l|q) for normalization.
More precisely,

P(a|l,q) = P(l|a,q)P(a|q)/P(l|q)

Since P(l|q) = sum_a(P(l|a,q)P(a|q)) and P(a|q) = 1/(# rankers), the correct equation should be

P(a|l,q) = P(l|a,q)/sum_a(P(l|a, q))

Therefore, Algorithm 2 in the original paper must normalize p' before computing the expected outcome.
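Numerically, the normalization is just dividing each likelihood by their sum; since the prior P(a|q) is uniform, it cancels out (the likelihood values below are made up for illustration):

```python
# Hypothetical per-ranker likelihoods P(l|a,q) for three rankers; with a
# uniform prior P(a|q), the posterior P(a|l,q) is each likelihood divided
# by the sum over rankers.
likelihoods = [0.2, 0.1, 0.1]
total = sum(likelihoods)                      # proportional to P(l|q)
posteriors = [p / total for p in likelihoods]
# posteriors sums to 1 (approximately [0.5, 0.25, 0.25] here)
```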

Simulation

Implement the simulation with LETOR data.

pypi package

This library is great. Would it be possible to make a pypi package out of it?

If you need help in doing so, I'd love to help out.

Needs a fix for Ranking.__hash__ from TeamDraft

Suppose A = [1, 2, 3] and B = [2, 3, 1]. Two rankings are obtained by TeamDraft: [1, 2, 3] and [2, 1, 3]. However, there must be four types of rankings, i.e.

  • [1, 2, 3] with team A = [1, 3] and team B = [2],
  • [1, 2, 3] with team A = [1] and team B = [2, 3],
  • [2, 1, 3] with team A = [1, 3] and team B = [2], and
  • [2, 1, 3] with team A = [1] and team B = [2, 3].

The current implementation of Ranking.__hash__ only takes into account the equality of the ranking, not that of the teams.
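A sketch of a hash that also incorporates the teams (the attribute layout here is an assumption for illustration, not the library's actual Ranking class):

```python
class Ranking(list):
    # Hypothetical: the ranked documents plus a teams dict mapping each
    # ranker index to the documents credited to it.
    def __init__(self, docs, teams):
        super().__init__(docs)
        self.teams = teams

    def __hash__(self):
        # Combine the document order with a canonical form of the teams,
        # so equal rankings with different team assignments hash apart.
        team_key = tuple(sorted((i, tuple(t)) for i, t in self.teams.items()))
        return hash((tuple(self), team_key))
```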

setup.py does not install the simulation subpackage

>python setup.py build
...
>python setup.py install
...
>python ./tests/test_methods.py
Traceback (most recent call last):
  File "./tests/test_methods.py", line 1, in <module>
    import interleaving as il
  File "C:\Users\owner\AppData\Local\Programs\Python\Python35\lib\site-packages\interleaving-0.0.1-py3.5.egg\interleaving\__init__.py", line 6, in <module>
ImportError: cannot import name 'simulation'

Prefix constraint sampling

In [Schuth+, CIKM 2014], the return value L of the prefix constraint sampling is a set, so each prefix should be distinct. However, the current implementation of Optimized._sample_rankings (inherited from InterleavingMethod) seems to allow generating one prefix multiple times. Shouldn't we modify the method?

A tiny bugfix

I forgot to remove a call to print from the test code for Probabilistic.

Deprecation Warning

/Users/kato/.local/share/virtualenvs/interleaving-M68ZqYhg/lib/python3.6/site-packages/pulp/solvers.py:71: DeprecationWarning: The SafeConfigParser class has been renamed to ConfigParser in Python 3.2. This alias will be removed in future versions. Use ConfigParser directly instead.
  'os':operating_system, 'arch':arch})

/Users/kato/dev/interleaving/interleaving/probabilistic.py:215: DeprecationWarning: `logsumexp` is deprecated!
Importing `logsumexp` from scipy.misc is deprecated in scipy 1.0.0. Use `scipy.special.logsumexp` instead.
  p_all = np.exp(p_log_all - misc.logsumexp(p_log_all))

There are two messages about deprecation.

Random team selection in Probabilistic.multileave

I wonder whether the current implementation of the random team selection in Probabilistic.multileave is correct.

The current code randomizes the order of the team list, uses all of the teams in that order, and then randomizes the list again once all of them have been used.

In "Probabilistic Multileave for Online Retrieval Evaluation, SIGIR 2015", on the other hand, Line 6 in Algorithm 1 may suggest that it draws one team from the team list every time.

What do you think?

Use `dict.fromkeys` instead of `set` for deterministic behavior

Some methods use sets to ensure the uniqueness of items and rely on the order of items in the set. Python's set order is not deterministic across interpreter runs, which makes it hard to control, especially during tests, as there is no seed for set ordering.

Using dict.fromkeys instead of set achieves the same uniqueness while guaranteeing a deterministic (insertion-order) ordering of items.

The following methods are affected:

  • PairwisePreference._current_candidates
  • Optimized._sample_rankings
  • Optimized._sample
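For example (plain Python, independent of the library):

```python
items = [3, 1, 3, 2, 1]
# set() deduplicates, but its iteration order depends on hashing and is
# not something tests can seed or control.
# dict.fromkeys() deduplicates while preserving first-seen order
# (dicts keep insertion order as of Python 3.7).
unique_ordered = list(dict.fromkeys(items))
# unique_ordered == [3, 1, 2]
```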

Too many samplings

Some tests conduct too many samplings, which makes the tests quite slow.

Pairwise Preference Multileaving

Harrie Oosterhuis, and Maarten de Rijke. "Sensitive and Scalable Online Evaluation with Theoretical Guarantees." Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017.

The choice of rankers in Probabilistic Interleaving and Multileaving

It seems that the choice of rankers is different at Probabilistic Interleaving and Multileaving.

In Probabilistic Interleaving, the authors said:

After constructing s1 and s2, l is generated similarly to the team draft method. However, instead of randomizing the ranker to contribute the next document per pair, one of the softmax functions is randomly selected at each rank. Doing so is mathematically convenient, as the only component that changes at each rank is the distribution over documents.

"one of the softmax functions is randomly selected at each rank" may mean that it randomly selects a ranker every time it decides a document to be put in the combined ranking.

In Probabilistic Multileaving, on the other hand, it randomly picks up one of the rankers without replacement, as you explained.

The conclusion might be one of the following:

  1. my understanding is wrong: they use the same method for selecting a ranker
  2. Probabilistic Interleaving and Multileaving use different strategies for selecting a ranker; so we need two types of implementation.
  3. Probabilistic Interleaving and Multileaving use different strategies for selecting a ranker; yet we should select either for simplicity.

Discussion on the arguments in InterleavingMethod

Sorry for trying to change the arguments in InterleavingMethod again.

The current arguments of InterleavingMethod are:

def __init__(self, max_length, sample_num, *lists)

This could be improved since keyword arguments with default values cannot be used with a variable-length argument.

There are some options:

def __init__(self, max_length, lists, sample_num=None, some_keyword_argument=3)

(lists is no longer a variable-length argument)

def __init__(self, max_length, *lists, **kwargs)

(kwargs is a dict for keyword arguments)
e.g.

im = InterleavingMethod(10, [1, 2], [3, 4], hoge=3)
# kwargs = {"hoge": 3}
# If "foo" is missing from this dict, we can fall back to a default explicitly:
if "foo" not in kwargs:
    kwargs["foo"] = FOO_DEFAULT_VALUE

Which do you prefer?
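For reference, Python 3 also allows keyword-only arguments after a variable-length parameter, which provides defaults without inspecting **kwargs by hand (the function and parameter names below are illustrative):

```python
def make_method(max_length, *lists, sample_num=None, foo=3):
    # sample_num and foo are keyword-only: callers must pass them by name,
    # and they keep their defaults otherwise.
    return {"max_length": max_length, "lists": lists,
            "sample_num": sample_num, "foo": foo}
```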
