Building Recommender Systems with Machine Learning and AI
Section 1: Getting Started
+ YouTube's candidate generation:
```
top-N <--- kNN index <---video vectors--- softmax ---> class probabilities
               ^                             ^
               |                             |
               |________ user vector ________|
                             ^
                             |
                  __________ ReLU ____________
                             ^
                             |
                  __________ ReLU ____________
                             ^
                             |
                  __________ ReLU ____________
                             ^
                             |
  watch vector   search vector   geographic   age   gender ...
       ^              ^
       |              |
    average        average
   |||||..||      |||||..||
 video watches  search tokens
```
+ (one) anatomy of a top-N recommender
```
individual interests --> candidate generation <--> item similarities
                                  |
                          candidate ranking
                                  |
                              filtering
                                  |
                               output
```
+ autoencoders for recommendations ("autorec")
```
output layer:  R1i  R2i  R3i ... Rmi         (reconstructed ratings)
hidden layer:  M1i  M2i ...           (+1 bias unit)
input layer:   R1i  R2i  R3i ... Rmi  (+1 bias unit)
```
+ Frameworks: TensorFlow, DSSTNE, SageMaker, Apache Spark.
+ Install Anaconda:
+ Notes: if you hit a `PermissionError` after creating the RecSys environment, open that environment's terminal and launch Spyder with `nohup sudo spyder &`. The same workaround applies to JupyterLab and other applications; just replace `spyder` with that application's command.
+ Getting Started
+ What is a recommender system?
+ A recommender system is NOT just any system that recommends arbitrary values; that describes machine learning in general.
+ For example,
+ A system that recommends prices for a house you're selling is NOT a recommender system.
+ A system that recommends whether a transaction is fraudulent is NOT a recommender system.
+ These are general ML problems, where you'd apply techniques such as regression, deep learning, xgboost, etc.
+ A recommender system predicts the ratings or preferences a user might give to an item, and recommends things based on people's past behavior. The predictions are often sorted and presented as top-N recommendations; such systems are also known as recommender engines or platforms.
+ Customers don't want to see your ability to predict their rating for an item, they just want to see things they're likely to love.
+ implicit ratings: purchase data, video viewing data, click data (a by-product of the user's natural behavior)
+ explicit ratings: star reviews (the user is explicitly asked to rate something)
+ `GettingStarted.py`
+ Use `SVD` (singular value decomposition). [explained](https://www.youtube.com/watch?v=P5mlg91as1c)
+ A[m x n] = U[m x r] * Sigma[r x r] * (V[n x r])^T
+ A[m x n]: (rows x cols)
+ documents x terms(different words)
* a given document (in a row) contains a list of terms/words (in columns)
+ users x movies:
* a given user (in a row) watches a list of movies (in columns)
+ U: Left singular vectors (i.e. user-to-concept similarity matrix)
+ m x r matrix (m documents, r concepts)
+ sigma: singular values (i.e. its diagonal elements: strength of each concept)
+ r x r diagonal matrix (strength of each 'concept' in r)
+ r: rank of the matrix A
+ V: right singular vectors (i.e. movie-to-concept similarity matrix)
+ n x r matrix (n terms, r concepts)
A = U*Sigma*V^T = sum_i sigma_i * u_i * v_i^T
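+ A minimal numpy sketch (not from the course) of this factorization on a tiny ratings matrix:
```
import numpy as np

# Factor a small ratings matrix A into U, Sigma, V^T.
A = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 1.0],
              [0.0, 1.0, 5.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma = np.diag(s)                      # singular values: strength of each concept
A_reconstructed = U @ Sigma @ Vt        # A = U * Sigma * V^T
print(np.allclose(A, A_reconstructed))  # True
```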
+ [SVD case study](https://www.youtube.com/watch?v=K38wVcdNuFc)
+ Compare the SVD algorithm to KNNBasic:
+ test MAE score: SVD(0.2731) KNN(0.6871)
+ test RMSE score: SVD(0.3340) KNN(0.9138)
+ SVD (user 81)
We recommend:
Gladiator (1992)
Lord of the Rings: The Fellowship of the Ring, The (2001)
To Kill a Mockingbird (1962)
Ghost in the Shell (Kôkaku kidôtai) (1995)
Godfather: Part II, The (1974)
Seven Samurai (Shichinin no samurai) (1954)
African Queen, The (1951)
Memento (2000)
Band of Brothers (2001)
General, The (1926)
+ KNN (user 81)
We recommend:
One Magic Christmas (1985)
Art of War, The (2000)
Taste of Cherry (Ta'm e guilass) (1997)
King Is Alive, The (2000)
Innocence (2000)
Maelström (2000)
Faust (1926)
Seconds (1966)
Amazing Grace (2006)
Unvanquished, The (Aparajito) (1957)
+ Take-away note: SVD clearly outperforms KNN on both error metrics in this scenario.
Section 2: Intro to Python
Section 3: Evaluating Recommender Systems
+ Train/test/cross-validation
+ full data -> train and test
+ trainset -> machine learning -> fit
+ use trained model for testing on test data.
+ K-fold cross validation
+ full data -> split into K folds -> for each fold i: train on the rest -> measure accuracy on fold i -> average the K scores (this is distinct from bagging, which resamples the training data with replacement); see the sketch below.
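+ A minimal sketch of K-fold cross-validation using Surprise's `cross_validate` helper, assuming the built-in MovieLens ml-100k data:
```
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Each of the 5 folds takes a turn as the test set; scores are averaged.
data = Dataset.load_builtin('ml-100k')
cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
```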
+ Accuracy metrics: RMSE/MAE
  + MAE: lower is better
    MAE = sum_{i=1..n} |y_i - x_i| / n
  + RMSE: lower is better
    RMSE = sqrt(sum_{i=1..n} (y_i - x_i)^2 / n)
    + it penalizes you more when your prediction is way off and less when you are reasonably close: squaring inflates the penalty for larger errors.
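+ A minimal numpy sketch of both metrics over hypothetical predicted vs. actual ratings:
```
import numpy as np

def mae(predicted, actual):
    # Mean absolute error: average magnitude of the rating errors.
    return np.mean(np.abs(np.array(predicted) - np.array(actual)))

def rmse(predicted, actual):
    # Root mean squared error: squaring inflates the penalty for large errors.
    return np.sqrt(np.mean((np.array(predicted) - np.array(actual)) ** 2))

print(mae([3.5, 4.0], [3.0, 5.0]))   # 0.75
print(rmse([3.5, 4.0], [3.0, 5.0]))  # ~0.79
```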
+ Top-N hit rate - many ways
+ evaluating top-n recommenders:
+ Hit rate: generate top-N recommendations for every user in your test set; if one of the top-N items is something that user actually rated, it's a hit. Hit rate = total hits / total users.
+ leave-one-out cross validation:
+ compute the top-n recommendations for each user in our training data,
+ then intentionally remove one of those items from that user's training data.
+ then test our recommender system's ability to recommend that item that was left out in the top-n results it creates for that user in the testing phase.
+ Notes: hit rate with leave-one-out works best on a very large dataset.
+ Average reciprocal hit rate (ARHR):
+ sum{1..n}(1/rank(i))/users
+ it measures our ability to recommend items that actually appeared in a user's top-N highest-rated movies, giving more weight to hits that appear nearer the top of the top-N list.
+ cumulative hit rate (cHR):
+ throw away hits if our predicted rating is below some threshold. The idea is that we shouldn't get credit for recommending items to a user that we think they won't actually enjoy.
+ rating hit rate (rHR)
+ break it down by predicted rating score. The idea is that we recommend movies that they actually liked and breaking down the distribution gives you some sense of how well you're doing in more detail.
+ Take-away: RMSE and hit rate are not always related.
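+ A hedged sketch of hit rate and ARHR; `top_n` (user -> ranked list of recommended item IDs) and `left_out` (user -> the held-out item) are assumed inputs, not course code:
```
def hit_rate(top_n, left_out):
    # A hit = the held-out item shows up anywhere in that user's top-N.
    hits = sum(1 for user, item in left_out.items() if item in top_n[user])
    return hits / len(left_out)

def average_reciprocal_hit_rank(top_n, left_out):
    # Like hit rate, but a hit at rank r only counts 1/r.
    total = 0.0
    for user, item in left_out.items():
        if item in top_n[user]:
            total += 1.0 / (top_n[user].index(item) + 1)  # 1-based rank
    return total / len(left_out)
```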
+ Coverage, Diversity, and Novelty Metrics
+ Coverage:
+ the percentage of possible recommendations that your system is able to provide.
`% of <user,item> pairs that can be predicted.`
+ Can be important to watch because it gives you a sense of how quickly new items in your catalog start to appear in recommendations. E.g., when a new book comes out on Amazon, it won't appear in recommendations until at least a few people buy it, thereby establishing purchase patterns with other items. Until those patterns exist, that new book will reduce Amazon's coverage metric.
+ Diversity:
+ How broad a variety of items your recommender system puts in front of people. Low diversity means it recommends, say, the next books in a series you've started reading, but never books from different authors or movies related to what you've read/watched.
+ We can use similarity scores to measure diversity.
+ If we look at the similarity scores of every possible pair in a list of top-n recommendations, we can average them to get a measure of how similar the recommended items in the list are to each other, called S.
+ Diversity = 1 - S
+ S: avg similarity between recommendation pairs.
+ Novelty:
+ mean popularity rank of recommended items.
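+ A hedged sketch of both metrics; `similarity(a, b)` and `popularity_rank` are assumed to exist:
```
import itertools
import numpy as np

def diversity(top_n_items, similarity):
    # S = average similarity over every possible pair in the top-N list.
    pairs = list(itertools.combinations(top_n_items, 2))
    S = np.mean([similarity(a, b) for (a, b) in pairs])
    return 1 - S

def novelty(top_n_items, popularity_rank):
    # Mean popularity rank: higher rank numbers = more obscure items.
    return np.mean([popularity_rank[item] for item in top_n_items])
```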
+ Churn, Responsiveness, and A/B Tests
+ How often do recommendations change?
+ perceived quality: rate your recommendations.
+ The results of online A/B tests are the metric that matters more than anything else.
+ Review ways to measure your recommender
+ Recommender Metrics
+ The Surprise package is about making rating predictions, so we need a method to get top-N recommendations out of it.
+ Test Metrics
+ run the test.
```
Loading movie ratings...
Computing movie popularity ranks so we can measure novelty later...
Computing item similarities so we can measure diversity later...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Building recommendation model...
Computing recommendations...
Evaluating accuracy of model...
RMSE: 0.9033701087151801
MAE: 0.6977882196132263
Evaluating top-10 recommendations...
Computing recommendations with leave-one-out...
Predict ratings for left-out set...
Predict all missing ratings...
Compute top 10 recs per user...
Hit Rate: 0.029806259314456036
rHR (Hit Rate by Rating value):
3.5 0.017241379310344827
4.0 0.0425531914893617
4.5 0.020833333333333332
5.0 0.06802721088435375
cHR (Cumulative Hit Rate, rating >= 4): 0.04960835509138381
ARHR (Average Reciprocal Hit Rank): 0.0111560570576964
Computing complete recommendations, no hold outs...
User coverage: 0.9552906110283159 [this is good]
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Diversity: 0.9665208258150911 [this is not good, too high]
Novelty (average popularity rank): 491.5767777960256
[this is not good, too high][as long tail in distribution]
```
+ Measure the performance of SVD recommender.
Section 4: A Recommender Engine Framework
+ Build a recommender engine:
+ use `surpriselib`'s algorithm base class
+ AlgoBase: SVD, KNNBasic, SVDpp, Custom
+ Creating a custom algorithm:
+ implement an estimate function.
```
from surprise import AlgoBase

class myOwnAlgorithm(AlgoBase):

    def __init__(self):
        AlgoBase.__init__(self)

    def estimate(self, user, item):
        # A trivial baseline: predict a rating of 3 for everything.
        return 3
```
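+ A hedged usage sketch: train and test the custom algorithm with Surprise's built-in MovieLens data (ml-100k downloads on first use):
```
from surprise import Dataset
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25)

algo = myOwnAlgorithm()
algo.fit(trainset)
predictions = algo.test(testset)  # every estimate is 3, our trivial baseline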
+ Building on top of surpriselib:
+ create a new class, EvaluatedAlgorithm(AlgoBase)
+ algorithm: AlgoBase
+ Evaluate(EvaluationData)
+ RecommenderMetrics
+ EvaluationData(Dataset):
+ GetTrainSet()
+ GetTestSet()
+ algorithm bake-offs
+ Evaluator(DataSet):
+ AddAlgorithm(algorithm)
+ Evaluate()
+ dataset: EvaluatedDataSet
+ algorithms: EvaluatedAlgorithm[]
+ Implementation:
```
# load up common dataset for the recommender algos.
(evaluationData, rankings) = LoadMovieLensData()
# construct an evaluator to evaluate them
evaluator = Evaluator(evaluationData, rankings)
# Throw in an SVD recommender.
SVDAlgo = SVD(random_state=10)
evaluator.AddAlgorithm(SVDAlgo, "SVD")
# Just make random recommendations
Random = NormalPredictor()
evaluator.AddAlgorithm(Random, "Random")
# Evaluate
evaluator.Evaluate(True)
```
+ Code:
```
Loading movie ratings...
Computing movie popularity ranks so we can measure novelty later...
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating SVD ...
Evaluating accuracy...
Evaluating top-N with leave-one-out...
Computing hit-rate and rank metrics...
Computing recommendations with full data set...
Analyzing coverage, diversity, and novelty...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Analysis complete.
Evaluating Random ...
Evaluating accuracy...
Evaluating top-N with leave-one-out...
Computing hit-rate and rank metrics...
Computing recommendations with full data set...
Analyzing coverage, diversity, and novelty...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Analysis complete.
Algorithm RMSE MAE HR cHR ARHR Coverage Diversity Novelty
SVD 0.9034 0.6978 0.0298 0.0298 0.0112 0.9553 0.0445 491.5768
Random 1.4385 1.1478 0.0089 0.0089 0.0015 1.0000 0.0719 557.8365
Legend:
RMSE: Root Mean Squared Error. Lower values mean better accuracy.
MAE: Mean Absolute Error. Lower values mean better accuracy.
HR: Hit Rate; how often we are able to recommend a left-out rating. Higher is better.
cHR: Cumulative Hit Rate; hit rate, confined to ratings above a certain threshold. Higher is better.
ARHR: Average Reciprocal Hit Rank - Hit rate that takes the ranking into account. Higher is better.
Coverage: Ratio of users for whom recommendations above a certain threshold exist. Higher is better.
Diversity: 1-S, where S is the average similarity score between every possible pair of recommendations
for a given user. Higher means more diverse.
Novelty: Average popularity rank of recommended items. Higher means more novel.
```
Section 5: Content-Based Filtering
+ Cosine similarity:
+ Represent each movie as a vector; the angle between two movie vectors measures their similarity: cos(90°) = 0 means not similar at all, while cos(0°) = 1 means essentially the same thing.
+ multi-dimensional space:
+ convert genres to dimensions (i.e. multiple one-hot encoding).
+ compute multi-dimensional cosines:
```
CosSim(x, y) = sum{1..n} xi*yi / (sqrt(sum{1..n}xi^2) * sqrt(sum{1..n}yi^2))
```
+ code:
```
def computeCosineSimilarity(self, movie1, movie2, genres):
    genres1 = genres[movie1]
    genres2 = genres[movie2]
    sumxx, sumxy, sumyy = 0, 0, 0
    # Go through every genre dimension of both movies.
    for i in range(len(genres1)):
        x, y = genres1[i], genres2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)
```
+ compute time similarity:
```
def computeYearSimilarity(self, movie1, movie2, years):
    # Exponential decay: movies ~10 years apart score about 0.37.
    diff = abs(years[movie1] - years[movie2])
    sim = math.exp(-diff / 10.0)
    return sim
```
+ K-nearest-neighbors:
+ similarity scores between this movie and all others the user rated
=> sort top 40 nearest movies
=> weighted average (weighting them by the rating the user gave them)
=> rating prediction.
+ knn code:
```
# Build up similarity scores between this item and everything the user rated.
neighbors = []
for rating in self.trainset.ur[u]:
    genreSimilarity = self.similarities[i, rating[0]]
    neighbors.append((genreSimilarity, rating[1]))

# Extract the top-k most-similar ratings.
k_neighbors = heapq.nlargest(self.k, neighbors, key=lambda t: t[0])

# Compute the average similarity score of the k neighbors,
# weighted by the ratings the user gave them.
simTotal = weightedSum = 0
for (simScore, rating) in k_neighbors:
    if simScore > 0:
        simTotal += simScore
        weightedSum += simScore * rating

if simTotal == 0:
    raise PredictionImpossible('No neighbors')

return weightedSum / simTotal
```
+ Producing and evaluating content-based filtering movies recommendation.
+ `getPopularityRanks`
```
Algorithm  RMSE       MAE
contentKNN 0.9375     0.7263
Random     1.4385     1.1478

Legend:
RMSE: Root Mean Squared Error. Lower values mean better accuracy.
MAE:  Mean Absolute Error. Lower values mean better accuracy.

Using recommender contentKNN
We recommend:
Presidio, The (1988) 3.841314676872932
Femme Nikita, La (Nikita) (1990) 3.839613347087336
Wyatt Earp (1994) 3.8125061475551796
Shooter, The (1997) 3.8125061475551796
Bad Girls (1994) 3.8125061475551796
The Hateful Eight (2015) 3.812506147555179
True Grit (2010) 3.812506147555179
Open Range (2003) 3.812506147555179
Big Easy, The (1987) 3.7835412549266985
Point Break (1991) 3.764158410102279

Using recommender Random
We recommend:
Sleepers (1996) 5
Beavis and Butt-Head Do America (1996) 5
Fear and Loathing in Las Vegas (1998) 5
Happiness (1998) 5
Summer of Sam (1999) 5
Bowling for Columbine (2002) 5
Babe (1995) 5
Birdcage, The (1996) 5
Carlito's Way (1993) 5
Wizard of Oz, The (1939) 5
```
+ `mixed years and genres`
```
Algorithm  RMSE       MAE
contentKNN 0.9441     0.7310
Random     1.4385     1.1478
--
Using recommender contentKNN
True Grit (2010) 3.81250614755518
Open Range (2003) 3.81250614755518
The Hateful Eight (2015) 3.8125061475551796
Wyatt Earp (1994) 3.8125061475551796
Shooter, The (1997) 3.8125061475551796
Bad Girls (1994) 3.8125061475551796
Romeo Must Die (2000) 3.771364493375906
Femme Nikita, La (Nikita) (1990) 3.7678571120506548
RoboCop (1987) 3.7594365328860415
Die Hard (1988) 3.75840413236323
```
+ matrix factorization: the general idea is to describe users and movies as combinations of different amounts of latent features. For example, Bob might be described as 80% an action fan and 20% a comedy fan; we'd then know to match him up with movies that score similarly on those features.
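+ A toy numpy sketch of that idea (hypothetical two-factor vectors): a match score is just the dot product of a user's factor vector and a movie's factor vector:
```
import numpy as np

bob = np.array([0.8, 0.2])    # 80% action fan, 20% comedy fan
movie = np.array([0.9, 0.1])  # mostly an action movie
print(bob.dot(movie))         # 0.74 -- higher dot product = better match
```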
+ PCA (see explain from Statquest)
+ Dimensionality reduction in order to accurately describe a movie.
+ Eigenvectors are the principal components.
+ PCA on movie ratings
+ Singular Value Decomposition
+ Running SVD and SVD++ and improve it.
+ Downside: with categorical data, you have to prepare the data before it can work with it.
+ Probabilistic Latent Semantic Analysis (PLSA) is promising. You can use it to extract latent features from content itself.
+ Tuning svd:
```
print("Searching for best parameters...")
param_grid = {'n_epochs': [20,30], 'lr_all':[0.005,0.010], 'n_factors':[50,100]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse','mae'], cv=3)
gs.fit(evaluationData)
# best RMSE score
print("Best RMSE score attained: ", gs.best_score['rmse'])
params = gs.best_params['rmse']
SVDtuned = SVD(n_epochs = params['n_epochs'], lr_all=params['lr_all'], n_factors=params['n_factors'])
```
+ Sparse Linear Methods (SLIM)
+ Paper: "SLIM: Sparse Linear Methods for Top-N Recommender Systems".
Section 8: Intro to Deep Learning
+ Autodiff is what TensorFlow uses under the hood to implement its gradient descent.
+ The computer calculates the partial derivatives of the loss function by breaking it down into a graph of scalar operations and working on each one.
C(y, w, x, b) = y - max(0, w*x + b)
+ Because autodiff can only calculate the partial derivative of an expression at a specific point, we have to assign initial values to each variable.
+ Then calculate the partial derivatives of each connection between operations/edges.
+ from w5 = w1 * w2 => d_w5/d_w1 = w2
+ from w6 = w5 + w3 => d_w6/d_w5 = 1
+ from w7 = max(0, w6) => d_w7/d_w6 = 1 if w6 > 0 else 0
+ from w8 = w4 - w7 => d_w8/d_w7 = -1 and d_w8/d_w4 = 1
+ from z = w8 => d_z/d_w8 = 1
+ The max(0, x) piecewise function converts all negative values to zero and keeps all positive values as they are; its graph is the ReLU shape.
+ Now we calculate the partials with respect to the weights by simply multiplying up the edges:
d_z/d_w1 = (d_z/d_w8) * (d_w8/d_w7) * (d_w7/d_w6) * (d_w6/d_w5) * (d_w5/d_w1) = 1 * (-1) * (1 if w6 > 0 else 0) * 1 * w2 = -w2 if w6 > 0 else 0
+ Gradient descent requires knowledge of the gradient from your cost function (MSE)
+ Mathematically, we need the first partial derivatives of all the inputs.
+ This is inefficient if you just throw calculus at the problem.
+ Reverse-mode autodiff to the rescue.
+ Optimized for many inputs + few outputs (like a neuron)
+ Computes all partial derivatives in outputs + 1 graph traversals.
+ Still fundamentally a calculus trick.
+ this is what TensorFlow uses (see the sketch below).
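+ A minimal sketch of reverse-mode autodiff via TensorFlow's `GradientTape`, applied to the cost function above (the values here are arbitrary):
```
import tensorflow as tf

w = tf.Variable(2.0)
b = tf.Variable(1.0)
x = tf.constant(3.0)
y = tf.constant(10.0)

# The tape records the forward pass; one backward traversal yields all partials.
with tf.GradientTape() as tape:
    cost = y - tf.maximum(0.0, w * x + b)  # C(y, w, x, b) = y - max(0, w*x + b)

dC_dw, dC_db = tape.gradient(cost, [w, b])
print(dC_dw.numpy(), dC_db.numpy())        # -3.0 -1.0 (since w*x + b > 0)
```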
+ backpropagation:
+ For each training sample, compute the output error using the weights currently in place for each connection between neurons. Then take the error computed at the end of the network and propagate it backwards through the network, distributing it across every connection all the way back to the inputs, using the current weights. Finally, use that error information to tweak the weights through gradient descent so the next pass (an epoch of training) gets a better result.
+ Sum up:
+ Run a set of weights.
+ Measure the errors.
+ Back propagate the error using those weights.
+ Tune things using gradient descent, and try again until the system converges.
+ Activation functions (rectifier).
+ Sigmoid/logistic, tanh, ELU, ReLU, Leaky ReLU, Noisy ReLU, etc.
+ Sigmoid, Logistic, Tanh:
+ Scales everything from 0 to 1 for sigmoid and logistic, and scales -1 to 1 for tanh/hyperbolic tangent.
+ For extremely high or low input values the output changes very slowly, so the gradients become tiny => the vanishing gradient problem.
+ computationally expensive.
+ Tanh is preferred over sigmoid.
+ ReLU (Rectified Linear Unit)
+ fast compute.
+ But when inputs are zero or negative, the output and gradient are stuck at zero => the "dying ReLU" problem.
+ Leaky ReLU:
+ Solves "dying ReLU" by introducing a slightly negative slope below 0.
+ Parametric ReLU
+ ReLU but the slope in the negative part is learned via backpropagation. [complicated!!]
+ Exponential Linear Unit (ELU):
+ Swish: performs well with very deep networks (40+ layers)
+ Maxout:
+ outputs the max of the inputs.
+ ReLU is a special case of maxout.
+ Not practical because it doubles the number of parameters that need to be trained.
+ Notes: Step functions don't work with gradient descent as no derivative.
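+ A minimal numpy sketch of the main activation functions discussed above:
```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def relu(x):
    return np.maximum(0.0, x)             # zero for all negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small negative slope avoids dying ReLU

x = np.array([-2.0, -0.5, 0.0, 1.0])
print(relu(x))        # [0. 0. 0. 1.]
print(leaky_relu(x))  # [-0.02 -0.005 0. 1.]
```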
+ Optimization functions:
+ Momentum optimization: introduces a momentum to the descent, so it slows down as things start to flatten and speeds up as the slope is steep.
+ Nesterov Accelerated gradient: a small tweak on momentum optimization: it computes momentum based on the gradient slightly ahead of the current state.
+ RMSProp: adaptive learning rate to help point toward the minimum.
+ Adam: adaptive moment estimation: momentum + RMSProp.
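+ A minimal Keras sketch showing how these optimizers are selected (the learning rates are arbitrary):
```
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)  # momentum + RMSProp
```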
+ Avoiding Overfitting:
+ Early stopping (when performance starts dropping)
+ Regularization terms added to cost function during training.
+ Dropout - randomly ignore a percentage (e.g., 50%) of all neurons at each training step.
+ works surprisingly well.
+ forces your model to spread out its learning.
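+ A hedged Keras sketch combining two of these techniques, dropout and early stopping, on hypothetical random data:
```
import numpy as np
import tensorflow as tf

X = np.random.rand(200, 20).astype("float32")
y = np.random.rand(200, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # randomly ignore 50% of neurons each step
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop training once validation loss stops improving for 3 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[early_stop], verbose=0)
```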
+ Tuning your topology:
+ Trial & Error
+ Evaluate a smaller network with fewer neurons in the hidden layers.
+ Evaluate a larger net with more layers:
+ Try reducing the size of each layer as you progress.
+ More layers can yield faster learning.
+ Using more layers and neurons than you need doesn't matter much as long as you use early stopping.
+ Use "model zoos".
+ Softmax:
+ used for classification
+ Given a score for each class
+ It produces a probability of each class
+ The class with the highest probability is the answer you get.
+ softmax(z)_i = exp(z_i) / sum_j exp(z_j)
+ z is the vector of raw class scores (logits).
+ each output lands between 0 and 1, and the outputs sum to 1.
+ used on the final output layer of a multiple classification problem.
+ basically converts outputs from logits to probabilities of each classification.
+ can't produce more than one label for something (but sigmoid can.)
+ don't worry about memorizing the actual function.
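+ A minimal numpy sketch of softmax itself:
```
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))           # [0.659 0.242 0.099] -- sums to 1
print(softmax(scores).argmax())  # 0 -> the most probable class
```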
+ Choosing an activation function.
+ for multiple classification, use softmax on the output layer.
+ RNN's do well with Tanh.
+ For everything else:
+ Start with ReLU.
+ If need to do better, Leaky ReLU.
+ Last resort: PReLU, Maxout.
+ Swish for really deep nets.
+ In review:
+ Gradient descent is an algorithm for minimizing error over multiple steps. In other words, it aims to minimize the loss by tweaking the weights and biases.
+ Autodiff is a calculus trick for finding the gradients in gradient descent.
+ Softmax is a function for choosing the most probable classification given several input values.
+ Tensorflow (TF):
+ Overview:
+ General purpose, not specific to neural networks; it's more an architecture for executing a graph of numerical operations.
+ TF can optimize the processing of that graph, and distribute its processing across a network (CPUs, GPUs, clusters at scale).
+ TF can work on GPUs, while Apache Spark doesn't.
+ TF Basics:
```
import tensorflow as tf
a = tf.Variable(1, name="a")
b = tf.Variable(2, name="b")
f = a + b
tf.print(f)
```
+ The bias term can be added onto the result of the matrix multiplication:
```
output = tf.matmul(previous_layer, layer_weights) + layer_biases  ## y = x*w + b
```
+ creating a neural net with tensorflow.
+ Load up our training and testing data.
+ Construct a graph describing our neural net.
+ Use "placeholders" for the `input data` and `target labels`.
This way we can use the same graph for training + testing.
+ Use "variables" for the learned weights for each connection and learned biases for each neuron.
Variables are preserved across runs within a tensorflow session.
+ Associate an optimizer (ie gradient descent) to the network.
+ Run the optimizer with your training data.
+ Evaluate your trained network with your testing data.
+ Make sure your features are normalized.
+ neural nets work best if your input data is normalized.
+ That is, zero mean and unit variance.
+ The real goal is that every input feature is comparable in terms of magnitude.
+ scikit-learn's StandardScaler can do this (see the sketch below).
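+ A minimal sketch of StandardScaler on hypothetical features:
```
import numpy as np
from sklearn.preprocessing import StandardScaler

features = np.array([[1.0, 200.0],
                     [2.0, 300.0],
                     [3.0, 400.0]])
scaled = StandardScaler().fit_transform(features)
print(scaled.mean(axis=0))  # ~[0. 0.]  zero mean
print(scaled.std(axis=0))   # ~[1. 1.]  unit variance
```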
Section 9: Deep Learning for Recommender Systems
+ RBM's for recsys
+ contrastive divergence
+ It samples the probability distribution during training using Gibbs sampling. We only train on the ratings that actually exist, but re-use the resulting weights and biases across other users to fill in the missing ratings we want to predict.
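+ A very rough sketch (not the course's implementation) of one contrastive-divergence (CD-1) update for a binary RBM; `W`, `vb`, `hb` are hypothetical weight/bias arrays:
```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, vb, hb, lr=0.01, rng=np.random):
    h_prob0 = sigmoid(v0 @ W + hb)                             # up: hidden probabilities
    h0 = (rng.random(h_prob0.shape) < h_prob0).astype(float)   # Gibbs sample
    v_prob1 = sigmoid(h0 @ W.T + vb)                           # down: reconstruct visible
    h_prob1 = sigmoid(v_prob1 @ W + hb)                        # up again
    # Move weights toward the data and away from the reconstruction.
    W += lr * (np.outer(v0, h_prob0) - np.outer(v_prob1, h_prob1))
    vb += lr * (v0 - v_prob1)
    hb += lr * (h_prob0 - h_prob1)
```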
+ Algorithm result:
```
Algorithm RMSE MAE HR cHR ARHR Coverage Diversity Novelty
RBM 1.3257 1.1337 0.0000 0.0000 0.0000 0.0000 0.7505 4597.7419
Random 1.4366 1.1468 0.0149 0.0149 0.0041 1.0000 0.0721 552.4610
Legend:
RMSE: Root Mean Squared Error. Lower values mean better accuracy.
MAE: Mean Absolute Error. Lower values mean better accuracy.
HR: Hit Rate; how often we are able to recommend a left-out rating. Higher is better.
cHR: Cumulative Hit Rate; hit rate, confined to ratings above a certain threshold. Higher is better.
ARHR: Average Reciprocal Hit Rank - Hit rate that takes the ranking into account. Higher is better.
Coverage: Ratio of users for whom recommendations above a certain threshold exist. Higher is better.
Diversity: 1-S, where S is the average similarity score between every possible pair of recommendations
for a given user. Higher means more diverse.
Novelty: Average popularity rank of recommended items. Higher means more novel.
```
```
We recommend:
Suburban Commando (1991) 2.7896125
Candyman: Farewell to the Flesh (1995) 2.7879963
Salesman (1969) 2.786376
Maze Runner, The (2014) 2.78613
Heavy (1995) 2.7853022
Coal Miner's Daughter (1980) 2.7840955
High Plains Drifter (1973) 2.7836945
Salsa (1988) 2.7833865
Beasts of No Nation (2015) 2.782932
Aguirre: The Wrath of God (Aguirre, der Zorn Gottes) (1972) 2.7823327
Using recommender Random
Building recommendation model...
Computing recommendations...
We recommend:
Beavis and Butt-Head Do America (1996) 5
Gods Must Be Crazy, The (1980) 5
Seven (a.k.a. Se7en) (1995) 5
Reality Bites (1994) 5
Young Guns (1988) 5
Fear and Loathing in Las Vegas (1998) 5
Pet Sematary (1989) 5
Ghostbusters (a.k.a. Ghost Busters) (1984) 5
Requiem for a Dream (2000) 5
Herbie Rides Again (1974) 5
```
+ Autoencoders for recommendations ("autorec")
+ see the AutoRec paper.
+ Clickstream recommendation with RNN
+ "Session-based Recommendations with Recurrent Neural Networks" (ICLR'16)
+ GRU4Rec (gated recurrent unit)
+ Architecture:
* input layer (one-hot encoded item)
-> embedding layer
-> gru layers
-> feedforward layers
-> output scores on items
+ Twists:
+ session-parallel mini-batches
+ sampling the output
+ ranking loss
+ Deep matrix factorization:
+ paper DeepFM (IJCAI'17) = blending of factorization machine + deep neural net.
+ More emerging tech:
+ word2vec
+ string = "to boldly go where no one has"
+ string -> embedding layer -> hidden layer -> softmax -> "gone"
+ extending word2vec
+ songs = song1, song2, song3, ..., songn
+ songs -> embedding layer -> hidden layer -> softmax -> songk
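+ A hedged sketch of that extension using gensim (hypothetical session data): treat each user's listening session as a "sentence" of song IDs and learn song embeddings:
```
from gensim.models import Word2Vec

sessions = [["song1", "song2", "song3"],
            ["song2", "song3", "song4"]]
model = Word2Vec(sessions, vector_size=32, window=2, min_count=1)
print(model.wv.most_similar("song2"))  # songs that co-occur in sessions
```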
+ 3D CNN for session-based recs.
+ descriptions vs. categories vs. clicks (time)
Section 10: Scaling it Up
+ Apache Spark
+ Amazon DSSTNE
Section 11: Real-World Challenges of Recommender Systems
+ Cold-start: new user solutions
+ use implicit data
+ use cookies (carefully)
+ geo-ip
+ recommend top-sellers or promotions
+ interview the user.
+ use content-based attributes
+ map attributes to latent features (LearnAROMA)
+ random exploration.
+ stoplist:
+ adult-oriented content
+ vulgarity
+ legally prohibited topics
+ terrorism/political extremism
+ bereavement/medical
+ drug use
+ religion
+ Never build a RecSys based on Image Clicks.
+ Temporal effects, seasonality.
Section 12: Case Studies
+ YouTube:
```
top-N <--- kNN index <===video vectors=== softmax ---> class probabilities
               ^                             ^
               |                             |
               |________ user vector ________|
                             ^
                             |
                     (three ReLU layers)
                             ^
                             |
watch_vector  search_token_vector  geo info  age  gender ...
```