pyclickmodels's Introduction

pyClickModels

A Cython implementation of ClickModels that uses Probabilistic Graphical Models to infer user behavior when interacting with Search Page Results (Ranking).

How It Works

ClickModels uses the concept of Probabilistic Graphical Models to model components that describe the interactions between users and a list of items ranked by a set of retrieval rules.

These models are useful for estimating whether a given document is a good match for a given search query, a score known in the literature as a judgment grade. The estimate comes from evaluating past observed clicks and the positions at which each document appeared on the results pages for each query.

There are several proposed approaches to handle this problem. This repository implements a Dynamic Bayesian Network, similar to previous works also done in Python:

dbn

Main differences are:

  1. Implemented on top of Cython: publicly available solutions rely on CPython, optionally combined with PyPy for additional speed-ups. Unfortunately, this might still not be good enough in terms of performance. To address that, this implementation relies entirely on C/C++ for further speed optimization. There is no official benchmark yet, but an improvement of 15x to 18x over CPython is expected (on the same data, PyPy yields an increase of about 3x).
  2. Memory friendly: expects input data to follow a JSON format in which each row already contains all clickstream sessions for its query. This saves memory and allows the library to process larger amounts of data.
  3. Purchase variable: as businesses such as e-commerce sites can greatly benefit from better understanding their search engine, this repository adds the variable Purchase to further describe customer behavior.

The file notebooks/DBN.ipynb has a complete description of how the model has been implemented along with all the mathematics involved.

Installation

As this project relies on binaries compiled by Cython, currently only the Linux (manylinux) platform is supported. It can be installed with:

pip install pyClickModels

Getting Started

Input Data

pyClickModels expects input data to be stored in a set of gzip-compressed (.gz) files located in the same folder. All file names should start with the string "judgments", for instance judgments0.gz. Each file should contain newline-delimited JSON, one object per line. The following is an example of one such JSON object (pretty-printed here for readability):

{
    "search_keys": {
        "search_term": "blue shoes",
        "region": "south",
        "favorite_brand": "super brand",
        "user_size": "L",
        "avg_ticket": 10
    },
    "judgment_keys": [
        {
            "session": [
                {"click": 0, "purchase": 0, "doc": "doc0"},
                {"click": 1, "purchase": 0, "doc": "doc1"},
                {"click": 1, "purchase": 1, "doc": "doc2"}
            ]
        },
        {
            "session": [
                {"click": 1, "purchase": 0, "doc": "doc0"},
                {"click": 0, "purchase": 0, "doc": "doc1"},
                {"click": 0, "purchase": 0, "doc": "doc2"}
            ]
        }
    ]
}

The key search_keys sets the context for the search. In the above example, a given customer (or cluster of customers with the same context) searched for blue shoes. Their region is south (it could be any chosen value), favorite brand is super brand and so on.

These keys set the context in which the search happened. When pyClickModels runs its optimization, it considers the whole context at once, which means that the judgments obtained also apply to the whole context setting.

If no context is desired, just use {"search_keys": {"search_term": "user search"}}.

There's no required schema here: the library loops through all keys available in search_keys and builds the optimization process considering the whole context as a single query.

As for judgment_keys, this is a list of sessions. The key session is mandatory. Each session contains the clickstream of users (if the variable purchase is not required, set it to 0).
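As a minimal sketch of producing such a file, the snippet below writes rows in the layout described above using only the Python standard library. The helper name write_judgments_file and the temporary folder are illustrative choices and are not part of pyClickModels.

```python
import gzip
import json
import os
import tempfile


def write_judgments_file(rows, folder, index=0):
    """Write one JSON object per line to <folder>/judgments<index>.gz.

    Hypothetical helper: the name and signature are illustrative only.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"judgments{index}.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            # json.dumps emits a single line, which matches the
            # one-object-per-line layout described in the text.
            f.write(json.dumps(row) + "\n")
    return path


example_row = {
    "search_keys": {"search_term": "blue shoes", "region": "south"},
    "judgment_keys": [
        {
            "session": [
                {"click": 0, "purchase": 0, "doc": "doc0"},
                {"click": 1, "purchase": 1, "doc": "doc1"},
            ]
        }
    ],
}

# Round trip in a temporary folder to show the resulting file layout.
with tempfile.TemporaryDirectory() as folder:
    path = write_judgments_file([example_row], folder)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        lines = f.read().splitlines()
```

Grouping every session of the same search context into a single row keeps the file compact, which is in line with the memory-friendly design mentioned earlier.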

To run DBN from pyClickModels, here's a simple example:

from pyClickModels.DBN import DBN

model = DBN()
model.fit(input_folder="/tmp/clicks_data/", iters=10)
model.export_judgments("/tmp/output.gz")

The output file is newline-delimited JSON containing the judgments for each query and each document observed for that query, e.g.:

{"search_term:blue shoes|region:south|brand:super brand": {"doc0": 0.2, "doc1": 0.3, "doc2": 0.4}}
{"search_term:query|region:north|brand:other_brand": {"doc0": 0.0, "doc1": 0.0, "doc2": 0.1}}

Judgments here vary between 0 and 1. Some libraries require integer grades ranging from 0 to 4. In that case, choose a transformation that best suits your data.
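One possible transformation, shown here only as a sketch, is a linear rescaling followed by rounding. The function names to_grade and grades_from_line are hypothetical and not part of pyClickModels; a quantile-based mapping may suit skewed data better.

```python
import json
import math


def to_grade(judgment, max_grade=4):
    """Map a judgment in [0, 1] to an integer grade in 0..max_grade.

    Hypothetical helper: linear scaling with round-half-up. Values
    outside [0, 1] are clipped first.
    """
    clipped = min(max(judgment, 0.0), 1.0)
    return int(math.floor(clipped * max_grade + 0.5))


def grades_from_line(line, max_grade=4):
    """Convert one line of the exported judgments into integer grades."""
    parsed = json.loads(line)
    return {
        query: {doc: to_grade(value, max_grade) for doc, value in docs.items()}
        for query, docs in parsed.items()
    }


example_line = (
    '{"search_term:blue shoes|region:south|brand:super brand": '
    '{"doc0": 0.2, "doc1": 0.3, "doc2": 0.4}}'
)
example_grades = grades_from_line(example_line)
```

With this mapping, the three documents in the example line receive grades 1, 1 and 2 respectively.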

Warnings

This library is still alpha! Use it with caution. It has been fully unit tested, but parts of it use pure C, whose exceptions might not have been fully considered yet. Before using this library in production environments, it's recommended to test it thoroughly with different datasets and sizes to evaluate how it performs.

Contributing

Contributions are very welcome! Also, if you find bugs, please report them :).

pyclickmodels's Issues

Exception ignored in: 'pyClickModels.DBN.DBNModel.build_CP_vector_given_e' MemoryError: std::bad_alloc

I am getting the below error
Exception ignored in: 'pyClickModels.DBN.DBNModel.build_CP_vector_given_e'
MemoryError: std::bad_alloc

I ran the following code:

from pyClickModels.DBN import DBN

model = DBN()
model.fit(input_folder="/clicks_data/")
model.export_judgments("/output.gz")

There are 80+ judgment.gz files of size 60KB-100KB each. The output.gz file is still generated with some data, but is it missing or skipping some data because of this error?
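One rough way to check whether queries were skipped is to compare the number of distinct search contexts in the input folder with the number of queries in the exported file. The sketch below assumes that the export is gzip-compressed, newline-delimited JSON as in the README example and that each distinct search_keys object yields one output key; both are assumptions, and the function names are hypothetical.

```python
import glob
import gzip
import json
import os
import tempfile


def count_input_contexts(input_folder):
    """Count distinct search_keys objects across judgments*.gz files."""
    contexts = set()
    for path in glob.glob(os.path.join(input_folder, "judgments*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    row = json.loads(line)
                    # Serialize with sorted keys so equal contexts compare equal.
                    contexts.add(json.dumps(row["search_keys"], sort_keys=True))
    return len(contexts)


def count_output_queries(output_path):
    """Count distinct query keys in the exported judgments file."""
    queries = set()
    with gzip.open(output_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                queries.update(json.loads(line).keys())
    return len(queries)


# Tiny self-contained demonstration with two input contexts and one output query.
with tempfile.TemporaryDirectory() as tmp:
    rows = [
        {"search_keys": {"search_term": "a"}, "judgment_keys": []},
        {"search_keys": {"search_term": "b"}, "judgment_keys": []},
    ]
    with gzip.open(os.path.join(tmp, "judgments0.gz"), "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    output_path = os.path.join(tmp, "output.gz")
    with gzip.open(output_path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"search_term:a": {"doc0": 0.2}}) + "\n")
    n_contexts = count_input_contexts(tmp)
    n_queries = count_output_queries(output_path)
```

A smaller query count than context count would suggest that some inputs did not make it into the export. Note that the glob pattern follows the README, which says file names should start with "judgments".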

Model fitting produces NaNs for certain inputs

First of all, I want to thank you for the library and the great work you did to make it work with a very efficient implementation.

While using the library, I noticed that some of the data I'm using can converge to NaN values. I managed to find an example that converges to NaN in two iterations.

Input JSON:

{
  "search_keys": {
    "search_term": "xyz"
  },
  "judgment_keys": [
    {
      "session": [
        {"click": 0, "purchase": 0, "doc": "doc0"},
        {"click": 0, "purchase": 0, "doc": "doc1"},
        {"click": 0, "purchase": 0, "doc": "doc2"},
        {"click": 1, "purchase": 1, "doc": "doc3"},
        {"click": 0, "purchase": 0, "doc": "doc4"},
        {"click": 0, "purchase": 0, "doc": "doc5"},
        {"click": 0, "purchase": 0, "doc": "doc6"},
        {"click": 0, "purchase": 0, "doc": "doc7"},
        {"click": 0, "purchase": 0, "doc": "doc8"},
        {"click": 0, "purchase": 0, "doc": "doc9"},
        {"click": 0, "purchase": 0, "doc": "doc10"},
        {"click": 0, "purchase": 0, "doc": "doc11"},
        {"click": 1, "purchase": 1, "doc": "doc12"},
        {"click": 0, "purchase": 0, "doc": "doc13"},
        {"click": 1, "purchase": 0, "doc": "doc14"},
        {"click": 0, "purchase": 0, "doc": "doc15"},
        {"click": 0, "purchase": 0, "doc": "doc16"},
        {"click": 0, "purchase": 0, "doc": "doc17"},
        {"click": 0, "purchase": 0, "doc": "doc18"},
        {"click": 0, "purchase": 0, "doc": "doc19"}
      ]
    }
  ]
}

Result after one iteration:

{
  "search_term:xyz": {
    "doc18": 0.32974863052368164,
    "doc17": 0.21041131019592285,
    "doc13": 0.1666666716337204,
    "doc12": 0.3333333432674408,
    "doc11": 0.1666666716337204,
    "doc8": 0.1666666716337204,
    "doc16": 0.23937760293483734,
    "doc7": 0.1666666716337204,
    "doc14": 0.26920539140701294,
    "doc3": 0.3333333432674408,
    "doc5": 0.1666666716337204,
    "doc4": 0.1666666716337204,
    "doc9": 0.1666666716337204,
    "doc19": 0.3070479929447174,
    "doc15": 0.2477390617132187,
    "doc0": 0.1666666716337204,
    "doc2": 0.1666666716337204,
    "doc10": 0.1666666716337204,
    "doc6": 0.1666666716337204,
    "doc1": 0.1666666716337204
  }
}

Result after two iterations:

{
  "search_term:xyz": {
    "doc18": null,
    "doc17": null,
    "doc13": 0.1666666716337204,
    "doc12": 0.3333333432674408,
    "doc11": 0.1666666716337204,
    "doc8": 0.1666666716337204,
    "doc16": null,
    "doc7": 0.1666666716337204,
    "doc14": null,
    "doc3": 0.3333333432674408,
    "doc5": 0.1666666716337204,
    "doc4": 0.1666666716337204,
    "doc9": 0.1666666716337204,
    "doc19": null,
    "doc15": null,
    "doc0": 0.1666666716337204,
    "doc2": 0.1666666716337204,
    "doc10": 0.1666666716337204,
    "doc6": 0.1666666716337204,
    "doc1": 0.1666666716337204
  }
}

It might be just a coincidence, but it looks like nulls only appear in the later documents (from the 15th to the 20th), and the first null appears at position 15, which is where the last click happened.

I would appreciate it if you could help me identify the problem.

Let me know if I can provide you with more information.
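To locate affected documents in a larger export, a small scan over the output lines can help. This is only a sketch: the function name find_invalid_judgments is hypothetical, and it assumes the export format shown in the results above, where a failed estimate is written as null (or possibly NaN).

```python
import json
import math


def find_invalid_judgments(lines):
    """Return (query, doc) pairs whose judgment is null or NaN.

    Hypothetical helper operating on lines of the exported judgments.
    """
    invalid = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Python's json module parses null as None and the non-standard
        # literal NaN as float("nan").
        for query, docs in json.loads(line).items():
            for doc, value in docs.items():
                if value is None or (isinstance(value, float) and math.isnan(value)):
                    invalid.append((query, doc))
    return invalid


sample_lines = [
    '{"search_term:xyz": {"doc18": null, "doc13": 0.1666666716337204, "doc14": NaN}}'
]
invalid_pairs = find_invalid_judgments(sample_lines)
```

Running the same scan after each value of iters would show at which iteration the invalid values first appear.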

Not able to pip install on macbook

pip install pyClickModels

Collecting pyClickModels
Using cached pyClickModels-0.0.2.tar.gz (167 kB)
ERROR: Command errored out with exit status 1:
command: /Users/username/.virtualenvs/project/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4eda7aacdb76462085bec22ea68494e0/setup.py'"'"'; file='"'"'/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4eda7aacdb76462085bec22ea68494e0/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-pip-egg-info-r34sbm5i
cwd: /private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4eda7aacdb76462085bec22ea68494e0/
Complete output (11 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4eda7aacdb76462085bec22ea68494e0/setup.py", line 158, in <module>
ext_modules=cythonize(
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 965, in cythonize
module_list, module_metadata = create_extension_list(
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 815, in create_extension_list
for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 114, in nonempty
raise ValueError(error_msg)
ValueError: 'pyClickModels/DBN.pyx' doesn't match any files
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/3f/5f/5229d10f6eec879ad957594e179cc1e320353e4870f77e20987e2cc34117/pyClickModels-0.0.2.tar.gz#sha256=3ac63b80498c74cec1af488a9a934c085be10dce093d380703eea9cf4a0a8ca1 (from https://pypi.org/simple/pyclickmodels/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Using cached pyClickModels-0.0.1.tar.gz (167 kB)
ERROR: Command errored out with exit status 1:
command: /Users/username/.virtualenvs/project/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_5f775ef69f4f464b9c1dc572769066db/setup.py'"'"'; file='"'"'/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_5f775ef69f4f464b9c1dc572769066db/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-pip-egg-info-3nnvo4zu
cwd: /private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_5f775ef69f4f464b9c1dc572769066db/
Complete output (11 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_5f775ef69f4f464b9c1dc572769066db/setup.py", line 158, in <module>
ext_modules=cythonize(
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 965, in cythonize
module_list, module_metadata = create_extension_list(
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 815, in create_extension_list
for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 114, in nonempty
raise ValueError(error_msg)
ValueError: 'pyClickModels/DBN.pyx' doesn't match any files
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/b6/6e/ed40b3e17f85bfd48430b3c101a597cb6baac82cd41c5445b57de48a05aa/pyClickModels-0.0.1.tar.gz#sha256=bcfb2a352966925106f9c11ae96f708132b6430276b7fd3e65b2339f8a37fc7e (from https://pypi.org/simple/pyclickmodels/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Using cached pyClickModels-0.0.0.tar.gz (170 kB)
ERROR: Command errored out with exit status 1:
command: /Users/username/.virtualenvs/project/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4586157e779b49808b4d954180a9f488/setup.py'"'"'; file='"'"'/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4586157e779b49808b4d954180a9f488/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-pip-egg-info-7v1x5660
cwd: /private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4586157e779b49808b4d954180a9f488/
Complete output (11 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/hp/p_tnsqq1451f5n9v0z5l08xr0000gn/T/pip-install-_evtqamc/pyclickmodels_4586157e779b49808b4d954180a9f488/setup.py", line 158, in <module>
ext_modules=cythonize(
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 965, in cythonize
module_list, module_metadata = create_extension_list(
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 815, in create_extension_list
for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
File "/Users/username/.virtualenvs/project/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 114, in nonempty
raise ValueError(error_msg)
ValueError: 'pyClickModels/DBN.pyx' doesn't match any files
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/bc/a3/9518743addb2593dfbd9e4d7ec0bb22e3fae73c573615c4e4aa84ac4c4c9/pyClickModels-0.0.0.tar.gz#sha256=e6c58070cda55a63ba39c491f1a5b55ce65f50b9995375101deb4ca10d62e0e7 (from https://pypi.org/simple/pyclickmodels/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
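The README states that only Linux (manylinux) is currently supported, and the log above shows pip falling back to the source archives, whose setup.py cannot find pyClickModels/DBN.pyx. A setup script or notebook could guard against this with a platform check before attempting the install. The sketch below uses only the standard library; the function name is hypothetical.

```python
import platform


def is_supported_platform(system=None):
    """Return True when running on Linux, the only platform the README lists.

    Hypothetical helper; `system` can be passed explicitly for testing,
    otherwise platform.system() is used.
    """
    system = platform.system() if system is None else system
    return system == "Linux"


if not is_supported_platform():
    print("pyClickModels currently ships Linux (manylinux) binaries only; "
          "consider running inside a Linux container or VM.")
```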

Multiple purchases per session?

Hi,

First of all, thanks for the great package,

I was wondering how you would go about handling multiple purchases in a session (this can happen in some contexts).

I thought about it and got two ideas:

  1. Rework the maths (ugh...) to replace equation seven from here, P(Er = 1|Sr-1=0 ) = 0, with something like P(Er = 1|Sr-1=0 ) = gamma_2. This would mean that, despite being satisfied, there is still a chance you continue to evaluate the next items. I didn't work out the full math because I first wanted to ask your opinion on this. Do you think this would work?
  2. A much simpler hack: treat a session with multiple purchases as separate sessions (either incorporating the first purchase into the second session as a simple click with no purchase, or simply not incorporating it).
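The second idea can be sketched as a preprocessing step on the input data. The version below is one possible reading of it, not something the library provides: each purchase produces its own session truncated at that purchase, and earlier purchases are either downgraded to plain clicks or dropped. The function name and the keep_earlier_as_click flag are hypothetical.

```python
def split_multi_purchase_session(session, keep_earlier_as_click=True):
    """Split a session with several purchases into one session per purchase.

    Each resulting session ends at its purchase. Earlier purchased docs are
    kept as click-only entries, or dropped when keep_earlier_as_click is False.
    """
    purchase_positions = [i for i, d in enumerate(session) if d["purchase"] == 1]
    if len(purchase_positions) <= 1:
        # Nothing to split: return a single copy of the session.
        return [[dict(d) for d in session]]
    result = []
    for pos in purchase_positions:
        new_session = []
        for i, d in enumerate(session[: pos + 1]):
            if i < pos and d["purchase"] == 1:
                if keep_earlier_as_click:
                    new_session.append({**d, "click": 1, "purchase": 0})
                # Otherwise the earlier purchase is left out entirely.
            else:
                new_session.append(dict(d))
        result.append(new_session)
    return result


example_session = [
    {"click": 0, "purchase": 0, "doc": "a"},
    {"click": 1, "purchase": 1, "doc": "b"},
    {"click": 0, "purchase": 0, "doc": "c"},
    {"click": 1, "purchase": 1, "doc": "d"},
]
split_keep = split_multi_purchase_session(example_session)
split_drop = split_multi_purchase_session(example_session, keep_earlier_as_click=False)
```

Dropping earlier purchases shifts the positions of the remaining documents, which may affect position-dependent estimates, so the click-only variant is probably the less intrusive of the two.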

Any chance to get your opinion on this?

Thanks.
