Giter VIP home page Giter VIP logo

recommender-system's Introduction

Multi-Objective Recommender System

Introduction

This repo. contains code I created during the Kaggle competition OTTO - Multi-Objective Recommender System during January 2023 (Kaggle Profile for Myles Dunlap). The approach in this repo. was able to finish in the top 2% of the 2,587 competing teams.

The objective of the competition was to predict e-commerce clicks, carts, and orders for a one-week period. The primary approach I took for this competition was a Candidate Ranker model which are commonly used for large dataset recommender systems. Candidate reranker models are explain in References section [1, 2].

The basic concept for a candidate reranker model is shown in Figure 1.

Figure 1: A Large Dataset Candidate Ranker Process [1]

In candidate ranker models the item corpus that can contain millions or billions of items and users need to be reduced so that items can be properly recommended to users. The concept is to first create a set of candidates (see candidates generation) which can be done in a multitude of ways; For this competition hand-crafted rules based on human logic, Word2Vec, and Bayseian Personalized Rankking (BPR) were used to provide a variable amounts of candidates and items pairs. The code for generating the candidates was performed in co_vis_matrices.py and candidates.py.

Co-visitation matrices describe products that are frequenctly viewed and bought together. Three co-visitation matrices were used to provide similar items that were 1) clicked, 2) clicked/carted/ordered, and 3) bought [3]. The RAPIDS cuDF GPU data science library was used to quickly process these matrices as opposed to multi-core dataframe libraries.

Data

In this competition the a large dataset was provided for training and testing. The time span of data was approximately 6 weeks and provided by the largest German online shop, OTTO. The information in this data The number of events for this dataset, listed below with units of millions, were:

  • 12.9M sessions or user
  • 1.8M items or aids
  • 194.7M clicks
  • 16.8M carts
  • 5.1M orders

The objective was to recommend the next 20 items a user would: click, cart, and order.

Libaries

  • RAPIDS AI - GPU acceleration of data science operations (e.g., dataframes)
  • POLARS - Fast dataframe libary written in Rust. This library allows for lazy execuation and is optimized for multi-threaded operations.
  • XGBoost - an XGB Ranker model, using a pairwise objective function, was the model used to refine item recommendations from the first stage of candidates. The XGB Ranker model can be found in file xgb_ranker.py. Additionally, the XGB model was trained on the GPU.
# Model Setup
xgb_parms = {'objective':'rank:pairwise',
             'tree_method':'gpu_hist',
             'learning_rate': hps.learning_rate,
             'max_depth': hps.max_depth,
             'colsample_bytree': hps.colsample_bytree,
            }
model = xgb.train(xgb_parms, 
                  dtrain=dtrain,
                  evals=[(dtrain,'train'),
                         (dvalid,'valid')],
                  num_boost_round=hps.num_boost_round,
                  verbose_eval=hps.verbose_eval,
                  early_stopping_rounds=hps.early_stopping_rounds,
                  )

In this competition the magnitude of the data would often create Out-of-Memory errors (i.e., not enough RAM) and to handle this issue the data was loaded from Disk in chunks and the XGB model trained in intervals. This was done us in the below code which can be found in xgb_ranker.py.

class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0 # set iterator to 0
        self.batch_size = batch_size
        self.batches = int( np.ceil( len(df) / self.batch_size ) )
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0 # Return 0 when there's no more batch.
        
        a = self.it * self.batch_size
        b = min( (self.it + 1) * self.batch_size, len(self.df) )
        dt = cudf.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target]) #, weight=dt['weight'])
        self.it += 1
        return 1


def stratify_folds(df):
    # Stratify Group K-Fold
    skf = GroupKFold(n_splits=5)

    strat = {}
    for fold,(train_idx, valid_idx) in enumerate(skf.split(df, df['target'], groups=df['session'])):
        strat[fold] = {'fold': fold,
                       'train_idx': train_idx,
                       'valid_idx': valid_idx,
                       'num_rows': len(df)}
    return strat

References

  1. Recommendation Systems for Large Datasets
  2. How to Build a GBT Ranker Model
  3. Candidate ReRank Model using Handcrafted Rules

recommender-system's People

Contributors

mddunlap924 avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.