PlackettLuce

Package website: https://hturner.github.io/PlackettLuce/.

Overview

The PlackettLuce package implements a generalization of the model jointly attributed to Plackett (1975) and Luce (1959) for modelling rankings data. Examples of rankings data might be the finishing order of competitors in a race, or the preference of consumers over a set of competing products.

The output of the model is an estimated worth for each item that appears in the rankings. The parameters are generally presented on the log scale for inference.
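As a rough illustration of what the worth parameters mean, the sketch below computes the probability of a complete ordering without ties under the standard Plackett-Luce model: at each stage the next item is chosen with probability proportional to its worth among the items still remaining. The helper `pl_prob` and the worth values are hypothetical, and the package's generalization to ties is not covered here.

```r
# Probability of an ordering (best to worst) under the standard
# Plackett-Luce model, given a named vector of positive worths.
# `pl_prob` is a hypothetical helper, not part of the package.
pl_prob <- function(ordering, worth) {
  w <- worth[ordering]
  # remaining[i] = total worth of the items not yet chosen at stage i
  remaining <- rev(cumsum(rev(w)))
  prod(w / remaining)
}

# Made-up worths that sum to one
worth <- c(apples = 0.5, bananas = 0.3, pears = 0.2)
pl_prob(c("apples", "bananas", "pears"), worth)
```

Summing `pl_prob` over all possible orderings of the items gives one, and the probability that an item is chosen first equals its worth once the worths are scaled to sum to one.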

The implementation of the Plackett-Luce model in PlackettLuce:

  • Accommodates ties (of any order) in the rankings, e.g. bananas $\succ$ {apples, oranges} $\succ$ pears.
  • Accommodates sub-rankings, e.g. pears $\succ$ apples, when the full set of items is {apples, bananas, oranges, pears}.
  • Handles disconnected or weakly connected networks implied by the rankings, e.g. where one item always loses as in figure below. This is achieved by adding pseudo-rankings with a hypothetical or ghost item.


In addition the package provides methods for

  • Obtaining quasi-standard errors, which do not depend on the constraints applied to the worth parameters for identifiability.
  • Fitting Plackett-Luce trees, i.e. a tree that partitions the rankings by covariate values, such as consumer attributes or racing conditions, identifying subgroups with different sets of worth parameters for the items.

Installation

The package may be installed from CRAN via

install.packages("PlackettLuce")

The development version can be installed via

# install.packages("devtools")
devtools::install_github("hturner/PlackettLuce")

Usage

The Netflix Prize was a competition devised by Netflix to improve the accuracy of its recommendation system. To facilitate this, Netflix released movie ratings from users of the system; these have been transformed to preference data and are available from PrefLib (Bennett and Lanning 2007). Each data set comprises rankings of a set of 3 or 4 movies selected at random. Here we consider rankings for just one set of movies to illustrate the functionality of PlackettLuce.

The data can be read in using the read.soc function in PlackettLuce:

library(PlackettLuce)
preflib <- "https://www.preflib.org/static/data/"
netflix <- read.soc(file.path(preflib, "netflix/00004-00000138.soc"))
head(netflix, 2)
##   Freq Rank 1 Rank 2 Rank 3 Rank 4
## 1   68      2      1      4      3
## 2   53      1      2      4      3

Each row corresponds to a unique ordering of the four movies in this data set. The number of Netflix users that assigned that ordering is given in the first column, followed by the four movies in preference order. So for example, 68 users ranked movie 2 first, followed by movie 1, then movie 4 and finally movie 3.

PlackettLuce, the model-fitting function in the PlackettLuce package, requires that the data are provided in the form of rankings rather than orderings, i.e. the rankings are expressed by giving the rank of each item, rather than by listing the items in order. We can create a "rankings" object from a set of orderings as follows:

R <- as.rankings(netflix[,-1], input = "orderings",
                 items = attr(netflix, "items"))
R[1:3, as.rankings = FALSE]
##   Mean Girls Beverly Hills Cop The Mummy Returns Mission: Impossible II
## 1          2                 1                 4                      3
## 2          1                 2                 4                      3
## 3          2                 1                 3                      4

Note that read.soc saved the names of the movies in the "items" attribute of netflix, so we have used these to label the items. Subsetting the rankings object R with as.rankings = FALSE returns the underlying matrix of rankings corresponding to the subset. So for example, in the first ranking the second movie (Beverly Hills Cop) is ranked number 1, followed by the first movie (Mean Girls) with rank 2, followed by the fourth movie (Mission: Impossible II) and finally the third movie (The Mummy Returns), giving the same ordering as in the original data.
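The difference between an ordering and a ranking can be sketched in base R. This is only an illustration for complete orderings without ties; `ordering_to_ranking` is a hypothetical helper, and `as.rankings` remains the function to use in practice.

```r
# Convert one ordering (item indices, best first) to a ranking
# (the rank of each item). Hypothetical helper for illustration only.
ordering_to_ranking <- function(ordering) {
  ranking <- integer(length(ordering))
  ranking[ordering] <- seq_along(ordering)
  ranking
}

# First row of the Netflix data: movie 2, then 1, then 4, then 3
ordering_to_ranking(c(2, 1, 4, 3))

# An ordering whose ranking looks different: item 3 is best, so it gets rank 1
ordering_to_ranking(c(3, 1, 2))
```

For the first Netflix row the ordering and the ranking happen to be the same vector, which is why the second call is included: there the ordering 3, 1, 2 becomes the ranking 2, 3, 1.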

Various methods are provided for "rankings" objects. In particular, if we subset the rankings without as.rankings = FALSE, the result is again a "rankings" object and the corresponding print method is used:

R[1:3]
##                                          1 
## "Beverly Hills Cop > Mean Girls > Mis ..." 
##                                          2 
## "Mean Girls > Beverly Hills Cop > Mis ..." 
##                                          3 
## "Beverly Hills Cop > Mean Girls > The ..."
print(R[1:3], width = 60)
##                                                              1 
## "Beverly Hills Cop > Mean Girls > Mission: Impossible II  ..." 
##                                                              2 
## "Mean Girls > Beverly Hills Cop > Mission: Impossible II  ..." 
##                                                              3 
## "Beverly Hills Cop > Mean Girls > The Mummy Returns > Mis ..."

The rankings can now be passed to PlackettLuce to fit the Plackett-Luce model. The counts of each ranking provided in the downloaded data are used as weights when fitting the model.

mod <- PlackettLuce(R, weights = netflix$Freq)
coef(mod, log = FALSE)
##             Mean Girls      Beverly Hills Cop      The Mummy Returns 
##              0.2306285              0.4510655              0.1684719 
## Mission: Impossible II 
##              0.1498342

Calling coef with log = FALSE gives the worth parameters, constrained to sum to one. These parameters represent the probability that each movie is ranked first.

For inference these parameters are converted to the log scale, by default setting the first parameter to zero so that the standard errors are estimable:

summary(mod)
## Call: PlackettLuce(rankings = R, weights = netflix$Freq)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## Mean Girls              0.00000         NA      NA       NA    
## Beverly Hills Cop       0.67080    0.07472   8.978  < 2e-16 ***
## The Mummy Returns      -0.31404    0.07593  -4.136 3.53e-05 ***
## Mission: Impossible II -0.43128    0.07489  -5.759 8.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual deviance:  3493.5 on 3525 degrees of freedom
## AIC:  3499.5 
## Number of iterations: 6

In this way, Mean Girls is treated as the reference movie. The positive parameter for Beverly Hills Cop shows that it was more popular among the users, while the negative parameters for the other two movies show that they were less popular.
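The link between the log-scale estimates and the worths reported earlier can be checked by hand. The sketch below copies the estimates from the summary output and rescales them; the pairwise probability at the end assumes a comparison of just two movies with no tie, which is a simplification.

```r
# Log-worth estimates copied from the summary(mod) output
log_worth <- c("Mean Girls" = 0,
               "Beverly Hills Cop" = 0.67080,
               "The Mummy Returns" = -0.31404,
               "Mission: Impossible II" = -0.43128)

# Worths constrained to sum to one (compare with coef(mod, log = FALSE))
worth <- exp(log_worth) / sum(exp(log_worth))
round(worth, 4)

# Probability that Beverly Hills Cop is preferred to Mean Girls when only
# these two movies are compared (assuming no tie)
plogis(log_worth[["Beverly Hills Cop"]] - log_worth[["Mean Girls"]])
```

Because the worths are a rescaling of the exponentiated estimates, the choice of reference movie changes the log-scale estimates but not the worths.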

Comparisons between different pairs of movies can be made visually by plotting the log-worth parameters with comparison intervals based on quasi-standard errors.

qv <- qvcalc(mod)
plot(qv, ylab = "Worth (log)", main = NULL)

If the intervals for two movies overlap, there is no significant difference between them. So we can see that Beverly Hills Cop is significantly more popular than the other three movies, Mean Girls is significantly more popular than The Mummy Returns or Mission: Impossible II, but there was no significant difference in users’ preference for these last two movies.

Going Further

The core functionality of PlackettLuce is illustrated in the package vignette, along with details of the model used in the package and a comparison to other packages. The vignette can be found on the package website or from within R once the package has been installed, e.g. via

vignette("Overview", package = "PlackettLuce")

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

References

Bennett, J., and S. Lanning. 2007. “The Netflix Prize.” In Proceedings of the KDD Cup Workshop 2007, 3–6. ACM.

Luce, R. Duncan. 1959. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley.

Plackett, Robert L. 1975. “The Analysis of Permutations.” Appl. Statist. 24 (2): 193–202. https://doi.org/10.2307/2346567.

plackettluce's People

Contributors

davidfirth, hturner, ikosmidis, kauedesousa

plackettluce's Issues

Add comment on interpretation of coef to PlackettLuce.Rd

Should add to PlackettLuce.Rd that mod$coefficients are the probabilities for each item that it is ranked first, given that first place is not tied. Then coef(mod) returns the log-probabilities, by default with the log-probability of the first item set to zero.

Issue with ties

I tried to run PlackettLuce with the following rankings:
[1,] 1 2 3 4 5
[2,] 1 2 3 5 4
[3,] 1 2 4 3 5
[4,] 1 2 5 3 4
[5,] 1 3 2 4 5
[6,] 3 1 2 5 4
[7,] 0 2 1 4 5
[8,] 2 4 3 1 5
[9,] 0 0 0 1 5
[10,] 0 0 0 0 1
With the freq being [1] 10 1 3 2 2 1 2 1 1 1

I get the following error, when I try to summarize the results:

print(summary(mod))
Error in X %*% as.vector(coefs) :
Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 88

If I remove line [7, ] 0 2 1 4 5, then I get the proper summary.

Do you have any suggestions on how to solve this? It seems some ties (like the one in line 9 and 10) work whereas some others do not.

Thanks for the nice package,
Kishor

update the beans data

Dear Heather,

I want to use the beans data as a demo for the paper that I am writing for the package climatrends. But the current version of PlackettLuce::beans is missing the planting dates (planting_dates) and the geographic coordinates.

What do you think about updating this dataset? I can do it and make a pull request.

Plackett-Luce vs Bradley-Terry on rank data

This is not a bug/issue, so I hope you don't mind me posting this here. Feel free to close this thread if this is annoying.

Before I had learned about the Plackett-Luce model, my intuition - which I know is not unique - for analyzing ranking data was to arrange the data into pairwise comparisons and use Bradley-Terry. I recently came across a 2017 post on Stack Exchange where it is pointed out that this method of arranging the data via "rank-breaking" and using Bradley-Terry is perhaps not a statistically valid thing to do.

I do not have the mathematical chops to intuit this on my own and I couldn't find any resources online, so I was wondering if it could be explained why Bradley-Terry ought to be avoided for ranking data (aside from the fact that Plackett-Luce is specifically for ranking data.)
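To make the question concrete, the sketch below shows what "rank-breaking" does to a single ranking, using base R only. The helper `rank_break` is hypothetical and belongs to neither package. One ranking of four items turns into six pairwise rows, and a commonly cited concern is that a Bradley-Terry fit would treat those rows as independent comparisons even though they come from one ranker.

```r
# Break one ordering (best first) into all implied pairwise comparisons.
# `rank_break` is a hypothetical helper, not part of either package.
rank_break <- function(ordering) {
  pairs <- t(utils::combn(ordering, 2))
  data.frame(winner = pairs[, 1], loser = pairs[, 2])
}

pairs <- rank_break(c("Beverly Hills Cop", "Mean Girls",
                      "Mission: Impossible II", "The Mummy Returns"))
pairs
nrow(pairs)  # choose(4, 2) rows from a single ranking
```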

Thank you for all the work you've put into BradleyTerry2 and PlackettLuce packages, they are incredibly useful.

Goodness of fit metric for pltree

Not an issue, but I would appreciate your suggestion on this.

What would be an appropriate goodness-of-fit metric for a pltree model, especially to assess predictive power with out-of-sample data?
Is a pseudo R2 (e.g. McFadden, Maddala, etc.) a suitable option? Or is it better to use an accuracy metric comparing predicted vs observed rankings?
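As one possible reading of "an accuracy metric comparing predicted vs observed rankings", the sketch below computes Kendall's tau between two rankings of the same items with base R. The rankings are made up for illustration, and this is not a recommendation from the package authors.

```r
# Kendall's tau between an observed and a predicted ranking of the same items
# (made-up rankings; 1 = best)
observed  <- c(apple = 1, banana = 2, orange = 3, pear = 4)
predicted <- c(apple = 1, banana = 2, orange = 4, pear = 3)

tau <- cor(observed, predicted, method = "kendall")
tau
```

With one swapped pair out of six, five pairs are concordant and one is discordant, so tau is (5 - 1) / 6. Averaging such a measure over the rankings in a held-out fold would give a single out-of-sample score.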

Many thanks.

Release PlackettLuce 0.3.0

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Dealing with weak networks

Dear Heather,

Here is an issue that may be related to issue #25. We now have a better clue about where the problem is: it arises mostly when we are performing cross-validation and pltree() is exposed to a set of data with a weak network.

Here is an example

library("PlackettLuce")
source("https://raw.githubusercontent.com/AgrDataSci/ClimMob-analysis/master/R/functions.R")

R <- matrix(c(1, 2, 0, 0, 3,
              4, 1, 0, 0, 2,
              2, 1, 0, 0, 3,
              1, 2, 0, 4, 3,
              2, 1, 0, 3, 4,
              4, 1, 0, 0, 2,
              2, 1, 0, 0, 3,
              1, 2, 0, 1, 3,
              2, 0, 0, 0, 1,
              0, 0, 0, 1, 2), nrow = 10, byrow = TRUE)

colnames(R) <- c("apple", "banana", "orange", "pear", "grape")

R <- as.rankings(R)

# drop rows 9 and 10, supposing that they belong to a different fold in a
# cross-validation
R <- R[-c(9:10), ]

G <- group(R, index = 1:length(R))
p <- data.frame(p = rep(1, length(G)))
dt <- cbind(G, p)

pl <- pltree(G ~ p, data = dt)

# predict() and AIC() fail, as shown in issue #25
predict(pl, newdata = dt)
AIC(pl, newdata = dt)

# predict() works with vcov = FALSE
predict(pl, newdata = dt, vcov = FALSE)

# but AIC() still does not work
AIC(pl, newdata = dt, vcov = FALSE)

# this is because orange dropped out of the network when we sampled the folds
a <- adjacency(R)

plot(network(a))

# the issue persists even if we increase npseudo
pl2 <- pltree(G ~ p, data = dt, npseudo = 0.8)


The question is, do you think that this problem can be solved with npseudo (eventually) or should we deal with it by passing vcov = FALSE to the predict() method?
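A quick way to spot such a fold before fitting is to check which items are never ranked, or never win or never lose a comparison. The sketch below does this in base R on rows 1 to 8 of the matrix above. It is a rough diagnostic under the assumption that 0 means "not ranked", not a replacement for the package's own adjacency() function.

```r
# Rankings from the example above (rows 1 to 8; 0 means "not ranked")
M <- matrix(c(1, 2, 0, 0, 3,
              4, 1, 0, 0, 2,
              2, 1, 0, 0, 3,
              1, 2, 0, 4, 3,
              2, 1, 0, 3, 4,
              4, 1, 0, 0, 2,
              2, 1, 0, 0, 3,
              1, 2, 0, 1, 3), nrow = 8, byrow = TRUE,
            dimnames = list(NULL, c("apple", "banana", "orange", "pear", "grape")))

# Items that never appear in any ranking of this fold
never_ranked <- colnames(M)[colSums(M > 0) == 0]

# wins[i, j] counts rankings in which item i is placed strictly above item j
wins <- matrix(0, ncol(M), ncol(M), dimnames = list(colnames(M), colnames(M)))
for (k in seq_len(nrow(M))) {
  r <- M[k, ]
  ranked <- r > 0
  wins <- wins + (outer(r, r, "<") & outer(ranked, ranked, "&"))
}

# Items that never win or never lose a comparison
weak_items <- names(which(rowSums(wins) == 0 | colSums(wins) == 0))

never_ranked
weak_items
```

Here both checks flag only orange, which matches the observation in the code comments that orange dropped out of the network when the folds were sampled.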

Thanks in advance

Refactor code in PlackettLuce

Refactor to compute log-abilities directly. Try using standard competition rankings to allow some vectorization (this may avoid creating the pattern matrix). Improvements may roll out later to the loglik and fitted computations.

Release PlackettLuce 0.2-9

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

logLikNull from a pltree

I recall that version 0.2-4 used to show the logLikNull from a pltree object in its print method. However I cannot see it now. Can you show me how to get it from a pltree in v0.2-6?

example("beans", package = "PlackettLuce")
G <- grouped_rankings(R, rep(seq_len(nrow(beans)), 4))
d <- cbind(G, beans)

pl <- pltree(G ~ maxTN, data = d)

print(pl)

Release PlackettLuce 0.4.2

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

Release PlackettLuce 0.4.3

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • usethis::use_github_links()
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)

Example for reading the Netflix `.soc` file is invalid.

I'm using R 4.2.0 on two x86_64 Linux distributions.

I can't load the example Netflix dataset suggested in the documentation. First of all, it seems the PrefLib static URL has changed. I believe the following lines will need to be changed:

--- a/README.md
+++ b/README.md
@@ -74,8 +74,8 @@ The data can be read in using the `read.soc` function in

 library(PlackettLuce)
-preflib <- "https://www.preflib.org/static/data/ED/"
-netflix <- read.soc(file.path(preflib, "netflix/ED-00004-00000138.soc"))
+preflib <- "https://www.preflib.org/static/data/"
+netflix <- read.soc(file.path(preflib, "netflix/00004-00000138.soc"))
 head(netflix, 2)

However, even after applying this change, the read.soc function throws an error:

Error in if (nrows < 0L) 5 else min(5L, (header + nrows)) :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In read.items(file) : NAs introduced by coercion

vcov.PlackettLuce sometimes fails with an error

This was reported initially as an apparent error in qvcalc: DavidFirth/qvcalc#7. But actually the error occurs in vcov.PlackettLuce.

The original error report points to a (large!) reproducible example. In the "educators.R" script for that example, the call to qvcalc() fails where it calls vcov(p_mle).

The error occurs on this line of vcov.PlackettLuce, it seems:

Browse[3]>  fit <-  as.vector(exp(X %*% as.vector(coefs)))
Error in X %*% as.vector(coefs) : 
  Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

The issue appears to be that coefs and X have incompatible dimensions:

Browse[3]> length(coefs)
[1] 23
Browse[3]> dim(X)
[1] 16320875       29

So the problem is perhaps with the function poisson_rankings?

@hturner if you have time to take a look, that would be great. Otherwise I can investigate this further --- but I don't want to risk messing up something else here!

Release PlackettLuce 0.2-8

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • check reverse dependencies
  • Polish NEWS
  • Update pkgdown

Submit to CRAN:

  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Release PlackettLuce 0.2-7

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

How to interpret ties

Dear @hturner

I have been ignoring ties for a long time, as we used to treat them as errors in data collection. But now ClimMob allows ties, and some of our partners are interested in using them when necessary.

I have some questions on how to interpret it. In this example,

library("PlackettLuce")

R <- matrix(c(1, 2, 3, 0, 0,
              0, 1, 2, 3, 0,
              1, 0, 0, 2, 3,
              1, 1, 2, 1, 0,
              0, 0, 2, 2, 1,
              0, 2, 1, 2, 0), nrow = 6, byrow = TRUE)
colnames(R) <- c("apple", "banana", "orange", "pear","grape")
R <- as.rankings(R)

R

mod <- PlackettLuce(R)

summary(mod)

We have three types of ties (apple = banana = pear; orange = pear; and orange > banana = pear), but these are registered only as tie2 and tie3. This raises some questions:

  1. Why are there only two types of ties?
  2. In a large set of data it will be difficult to know which tie is tie2 and tie3. Is there a way to add labels to these ties so we know which comparison it represents?
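One reading of the output, offered here as an assumption rather than as the maintainers' answer, is that the tie parameters are indexed by the order of the tie (how many items share a rank) rather than by which items are tied. The base R sketch below counts the tie orders present in the matrix above.

```r
# Rankings from the example above (0 means "not ranked")
M <- matrix(c(1, 2, 3, 0, 0,
              0, 1, 2, 3, 0,
              1, 0, 0, 2, 3,
              1, 1, 2, 1, 0,
              0, 0, 2, 2, 1,
              0, 2, 1, 2, 0), nrow = 6, byrow = TRUE,
            dimnames = list(NULL, c("apple", "banana", "orange", "pear", "grape")))

# For each ranking, count how many items share each rank and keep
# the counts greater than one (these are the tie orders)
tie_orders <- unlist(lapply(seq_len(nrow(M)), function(k) {
  r <- M[k, M[k, ] > 0]
  counts <- table(r)
  as.integer(counts[counts > 1])
}))

sort(unique(tie_orders))
```

Only ties of order 2 and order 3 occur in these data, which would be consistent with seeing just two tie parameters regardless of which items are involved.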

Thanks

could not find function "isFALSE"

Dear Heather, I am using v0.2-5 but I am getting this error:

Error in isFALSE(gamma) : could not find function "isFALSE"

Here is an example:

library("PlackettLuce")
example("beans", package = "PlackettLuce")
G <- grouped_rankings(R, rep(seq_len(nrow(beans)), 4))
d <- cbind(G, beans)
mod <- pltree(G ~ maxTN, data = d)
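Base R's isFALSE() was introduced in R 3.5.0, so this error most likely indicates an older R installation. Upgrading R is the straightforward fix. For reference, the sketch below shows an equivalent check written with functions available in older versions; the name `is_false_compat` is hypothetical and is not something the package defines.

```r
# Equivalent of isFALSE() for R versions older than 3.5.0.
# `is_false_compat` is a hypothetical name used for illustration.
is_false_compat <- function(x) {
  is.logical(x) && length(x) == 1L && !is.na(x) && !x
}

is_false_compat(FALSE)            # a single FALSE
is_false_compat(NA)               # NA is not FALSE
is_false_compat(c(FALSE, FALSE))  # length must be exactly one
```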

Release PlackettLuce 0.4.0

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

predict Deviance not working

I found that the function deviance() returns the same result with or without the argument newdata. Is that correct? Here is an example using the "beans" data.

library(PlackettLuce)

example("beans", package = "PlackettLuce")
G <- grouped_rankings(R, rep(seq_len(nrow(beans)), 4))
formula <- as.formula("G ~ maxTN")
d <- cbind(G, beans)

#split the data into a training and test sample
n <- nrow(d)
s <- sample(1:n, n*0.7)
train <- d[s, ]
test <- d[-s, ]

#fit the model with training data
mod <- pltree(formula, data = train)

#predict estimates on test data based on fitted model
predict(mod, newdata = test)

#the AIC of fitted model
AIC(mod)

#the AIC predicted using the test data
AIC(mod, newdata = test) #different than value above (prediction is working)

#the deviance of fitted model
deviance(mod)

#the deviance predicted using the test data
deviance(mod, newdata = test) #same as the value above (prediction not working)

#Should the deviance for the test data instead be computed like this?
AIC <- AIC(mod, newdata = test)
df <- attr(logLik(mod), "df")
AIC - (2*df)
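The relation used in the last three lines can be checked against the Netflix example in the Usage section, where the reported AIC and residual deviance differ by exactly twice the number of free parameters. The sketch below only reproduces that arithmetic from the printed summary; whether deviance() is meant to honour newdata in the same way is not confirmed here.

```r
# Values copied from the summary(mod) output in the Usage section
aic <- 3499.5
n_par <- 3  # three free log-worth parameters (Mean Girls is fixed at 0)

# AIC minus twice the number of parameters
dev_from_aic <- aic - 2 * n_par
dev_from_aic  # compare with the reported residual deviance of 3493.5
```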

Guidance on convergence problems

Some guidance could be given on algorithm settings, at least in ?PlackettLuce. The vignette should also at least mention Steffensen acceleration.

Predict model with weights

library(PlackettLuce)

example("beans", package = "PlackettLuce")
G <- grouped_rankings(R, rep(seq_len(nrow(beans)), 4))

weights <- c(rep(0.3, 400), rep(1, 442))

tree <- pltree(G ~ maxTN,
data = beans, alpha = 0.05, weights = weights )

plot(tree) #does not work

predict(tree, newdata = beans)

#predict fails when model is made with weights and type is "itempar". Works with the other options.
