cohorts

Creating cohort tables from event data is complicated and requires several lines of code. The cohorts package lets users convert data frames to cohort tables in both long and wide formats with simple functions. Users may choose between day and month level cohorts.

Installation

You can install the released version of cohorts from CRAN with:

install.packages("cohorts")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("PeerChristensen/cohorts")

Creating a month level cohort table

In this example, we use a dataset consisting of customer IDs and invoice dates.

library(cohorts)

head(online_cohorts)
#>   CustomerID InvoiceDate
#> 1      17850  2010-12-01
#> 2      13047  2010-12-01
#> 3      12583  2010-12-01
#> 4      13748  2010-12-01
#> 5      15100  2010-12-01
#> 6      15291  2010-12-01

We can then turn this into a cohort table where each customer ID is tracked from the first invoice month until the last month in the period.

online_cohorts %>%
  cohort_table_month(CustomerID, InvoiceDate)
#> # A tibble: 13 × 14
#>    cohort `Dec 2010` `Jan 2011` `Feb 2011` `Mar 2011` `Apr 2011` `May 2011`
#>     <int>      <int>      <int>      <int>      <int>      <int>      <int>
#>  1      1        949        363        318        368        342        377
#>  2      2         NA        421        101        119        102        138
#>  3      3         NA         NA        380         94         73        106
#>  4      4         NA         NA         NA        440         84        112
#>  5      5         NA         NA         NA         NA        299         68
#>  6      6         NA         NA         NA         NA         NA        279
#>  7      7         NA         NA         NA         NA         NA         NA
#>  8      8         NA         NA         NA         NA         NA         NA
#>  9      9         NA         NA         NA         NA         NA         NA
#> 10     10         NA         NA         NA         NA         NA         NA
#> 11     11         NA         NA         NA         NA         NA         NA
#> 12     12         NA         NA         NA         NA         NA         NA
#> 13     13         NA         NA         NA         NA         NA         NA
#> # … with 7 more variables: `Jun 2011` <int>, `Jul 2011` <int>,
#> #   `Aug 2011` <int>, `Sep 2011` <int>, `Oct 2011` <int>, `Nov 2011` <int>,
#> #   `Dec 2011` <int>
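
To make clear what the table counts, here is a minimal base R sketch on a toy data frame. The data, the column names `id` and `date`, and the use of `table()` are assumptions for illustration only; this is not how the package is implemented. Note that `table()` reports 0 where the cohort table above shows NA.

```r
# Toy event data (hypothetical values, not taken from online_cohorts)
events <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3, 3),
  date = as.Date(c("2021-01-05", "2021-02-10", "2021-01-20", "2021-03-02",
                   "2021-02-14", "2021-02-20", "2021-03-15"))
)

# Month of each event, and the month of each id's first event (its cohort)
events$month <- format(events$date, "%Y-%m")
events$first_month <- ave(events$month, events$id, FUN = min)

# Count each id at most once per month
active <- unique(events[, c("id", "first_month", "month")])

# Rows: cohort (month of first event); columns: month of activity
counts <- table(cohort = active$first_month, month = active$month)
counts
```

Each row follows the ids that first appeared in a given month, and each cell is the number of those ids that were active in the column's month.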

Creating a day level cohort table

If we need to track activity on a daily basis, we can instead use the cohort_table_day() function.

gamelaunch %>%
  cohort_table_day(userid, eventDate)
#> # A tibble: 31 × 32
#>    cohort `2016-04-27` `2016-04-28` `2016-04-29` `2016-04-30` `2016-05-01`
#>     <int>        <int>        <int>        <int>        <int>        <int>
#>  1      1           96           65           55           46           46
#>  2      2           NA          200          117           96           84
#>  3      3           NA           NA          370          207          181
#>  4      4           NA           NA           NA          387          223
#>  5      5           NA           NA           NA           NA          405
#>  6      6           NA           NA           NA           NA           NA
#>  7      7           NA           NA           NA           NA           NA
#>  8      8           NA           NA           NA           NA           NA
#>  9      9           NA           NA           NA           NA           NA
#> 10     10           NA           NA           NA           NA           NA
#> # … with 21 more rows, and 26 more variables: `2016-05-02` <int>,
#> #   `2016-05-03` <int>, `2016-05-04` <int>, `2016-05-05` <int>,
#> #   `2016-05-06` <int>, `2016-05-07` <int>, `2016-05-08` <int>,
#> #   `2016-05-09` <int>, `2016-05-10` <int>, `2016-05-11` <int>,
#> #   `2016-05-12` <int>, `2016-05-13` <int>, `2016-05-14` <int>,
#> #   `2016-05-15` <int>, `2016-05-16` <int>, `2016-05-17` <int>,
#> #   `2016-05-18` <int>, `2016-05-19` <int>, `2016-05-20` <int>, …

Converting to percentages

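To see the percentage of each cohort's customers that remain active in subsequent periods, we can pipe the above code into the cohort_table_pct() function. Each value is expressed relative to the cohort's size in its first period, which is why every cohort starts at 100.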

gamelaunch %>%
  cohort_table_day(userid, eventDate) %>%
  cohort_table_pct(decimals = 1)
#> # A tibble: 31 × 32
#>    cohort `2016-04-27` `2016-04-28` `2016-04-29` `2016-04-30` `2016-05-01`
#>     <int>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#>  1      1          100         67.7         57.3         47.9         47.9
#>  2      2           NA        100           58.5         48           42  
#>  3      3           NA         NA          100           55.9         48.9
#>  4      4           NA         NA           NA          100           57.6
#>  5      5           NA         NA           NA           NA          100  
#>  6      6           NA         NA           NA           NA           NA  
#>  7      7           NA         NA           NA           NA           NA  
#>  8      8           NA         NA           NA           NA           NA  
#>  9      9           NA         NA           NA           NA           NA  
#> 10     10           NA         NA           NA           NA           NA  
#> # … with 21 more rows, and 26 more variables: `2016-05-02` <dbl>,
#> #   `2016-05-03` <dbl>, `2016-05-04` <dbl>, `2016-05-05` <dbl>,
#> #   `2016-05-06` <dbl>, `2016-05-07` <dbl>, `2016-05-08` <dbl>,
#> #   `2016-05-09` <dbl>, `2016-05-10` <dbl>, `2016-05-11` <dbl>,
#> #   `2016-05-12` <dbl>, `2016-05-13` <dbl>, `2016-05-14` <dbl>,
#> #   `2016-05-15` <dbl>, `2016-05-16` <dbl>, `2016-05-17` <dbl>,
#> #   `2016-05-18` <dbl>, `2016-05-19` <dbl>, `2016-05-20` <dbl>, …
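
As a quick check of the arithmetic, the sketch below recomputes the first five percentages for cohort 1 from the counts shown in the day level table. It assumes each percentage is the count divided by the cohort's first-day count, rounded to one decimal; the variable names are ours.

```r
# Counts for cohort 1 over its first five days, copied from the day level table
cohort_1 <- c(96, 65, 55, 46, 46)

# Share of the starting cohort still active on each day, rounded to one decimal
pct <- round(100 * cohort_1 / cohort_1[1], 1)
pct
```

The result matches the first row of the percentage table: 100, 67.7, 57.3, 47.9 and 47.9.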

Left-shifted cohort tables

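Another option is to shift cohort tables left. Here, each cohort's row is aligned to its own starting period, so the date columns are replaced by relative time periods t0, t1, t2, and so on. This makes cohorts directly comparable at the same age.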

To left-shift a cohort table, we can use the shift_left() function.

gamelaunch %>%
  cohort_table_day(userid, eventDate) %>%
  shift_left()
#> # A tibble: 31 × 32
#>    cohort    t0    t1    t2    t3    t4    t5    t6    t7    t8    t9   t10
#>     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1      1    96    65    55    46    46    45    44    33    34    31    26
#>  2      2   200   117    96    84    82    76    62    72    63    52    59
#>  3      3   370   207   181   152   138   127   114    98    95    89    84
#>  4      4   387   223   177   151   129   122   107   115   114    86    88
#>  5      5   405   222   178   152   130   131   128   103    86    98    84
#>  6      6   325   183   146   125   119   105    85    73    72    59    59
#>  7      7   270   165   129   113   113   102    85    89    75    74    72
#>  8      8   264   142   124    91    73    76    81    63    60    55    55
#>  9      9   267   153   114   110    99    94    89    72    68    62    65
#> 10     10   127    74    58    51    42    42    50    41    42    40    32
#> # … with 21 more rows, and 20 more variables: t11 <dbl>, t12 <dbl>, t13 <dbl>,
#> #   t14 <dbl>, t15 <dbl>, t16 <dbl>, t17 <dbl>, t18 <dbl>, t19 <dbl>,
#> #   t20 <dbl>, t21 <dbl>, t22 <dbl>, t23 <dbl>, t24 <dbl>, t25 <dbl>,
#> #   t26 <dbl>, t27 <dbl>, t28 <dbl>, t29 <dbl>, t30 <dbl>
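
The idea behind left-shifting can be sketched in base R on a small toy table. The counts and column names below are hypothetical, and padding the end of each row with NA is an assumption; this illustrates the concept and is not the package's own code.

```r
# Toy wide cohort table (hypothetical counts); NA marks periods before a cohort starts
wide <- data.frame(
  cohort = 1:3,
  d1 = c(10, NA, NA),
  d2 = c(6, 8, NA),
  d3 = c(4, 5, 7)
)

# For each row, move the non-missing values to the front and pad the end with NA
vals <- as.matrix(wide[, -1])
shifted <- t(apply(vals, 1, function(x) c(x[!is.na(x)], rep(NA, sum(is.na(x))))))

# Name the columns by relative period: t0, t1, t2
colnames(shifted) <- paste0("t", seq_len(ncol(shifted)) - 1)

shifted_table <- data.frame(cohort = wide$cohort, shifted)
shifted_table
```

After the shift, column t0 holds every cohort's starting size, so rows can be compared period by period.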

We can also express the left-shifted counts as percentages by using the shift_left_pct() function instead.

gamelaunch %>%
  cohort_table_day(userid, eventDate) %>%
  shift_left_pct()
#> # A tibble: 31 × 32
#>    cohort    t0    t1    t2    t3    t4    t5    t6    t7    t8    t9   t10
#>     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1      1   100  67.7  57.3  47.9  47.9  46.9  45.8  34.4  35.4  32.3  27.1
#>  2      2   100  58.5  48    42    41    38    31    36    31.5  26    29.5
#>  3      3   100  55.9  48.9  41.1  37.3  34.3  30.8  26.5  25.7  24.1  22.7
#>  4      4   100  57.6  45.7  39    33.3  31.5  27.6  29.7  29.5  22.2  22.7
#>  5      5   100  54.8  44    37.5  32.1  32.3  31.6  25.4  21.2  24.2  20.7
#>  6      6   100  56.3  44.9  38.5  36.6  32.3  26.2  22.5  22.2  18.2  18.2
#>  7      7   100  61.1  47.8  41.9  41.9  37.8  31.5  33    27.8  27.4  26.7
#>  8      8   100  53.8  47    34.5  27.7  28.8  30.7  23.9  22.7  20.8  20.8
#>  9      9   100  57.3  42.7  41.2  37.1  35.2  33.3  27    25.5  23.2  24.3
#> 10     10   100  58.3  45.7  40.2  33.1  33.1  39.4  32.3  33.1  31.5  25.2
#> # … with 21 more rows, and 20 more variables: t11 <dbl>, t12 <dbl>, t13 <dbl>,
#> #   t14 <dbl>, t15 <dbl>, t16 <dbl>, t17 <dbl>, t18 <dbl>, t19 <dbl>,
#> #   t20 <dbl>, t21 <dbl>, t22 <dbl>, t23 <dbl>, t24 <dbl>, t25 <dbl>,
#> #   t26 <dbl>, t27 <dbl>, t28 <dbl>, t29 <dbl>, t30 <dbl>

Line plots

To visualize the data, we can turn a cohort table into long format and create a line plot.

In this example, we select only the first seven cohorts.

library(tidyverse)

gamelaunch_long <- gamelaunch %>%
  cohort_table_day(userid, eventDate) %>%
  shift_left_pct() %>%
  pivot_longer(-cohort) %>%
  mutate(time = as.numeric(str_remove(name,"t"))) 

gamelaunch_long %>%
  filter(value > 0, cohort <= 7, time > 0) %>%
  ggplot(aes(time, value, colour = factor(cohort), group = cohort)) +
  geom_line(linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 1.5) +
  theme_light()

Cohort tables plotted

Another way to plot a cohort table is by means of tiles. In this case we label each tile with its percentage and colour the tiles by the log of that value.

gamelaunch_long %>%
  filter(time > 0, value > 0) %>%
  ggplot(aes(time, reorder(cohort, desc(cohort)))) +
  geom_raster(aes(fill = log(value))) +
  coord_equal(ratio = 1) +
  geom_text(aes(label = glue::glue("{round(value,0)}%")), size = 2, color = "snow") +
  scale_fill_gradient(guide = "none") +
  theme_minimal() +
  theme(panel.grid   = element_blank(),
        panel.border = element_blank()) +
  labs(y= "cohort")
