A course project of Analyzing Unstructured Data class at UCSD
In this project, we conduct analysis on the rating and fit of customers based on the e-commerce clothing data collected from RentTheRunway.com, and then develop a recommender to return similar items given the user inputs. From the model results, we conclude that review texts could largely improve the performance of classifiers which only includes customer demographics and body measurements. Other than that, word embedding models such as Skip-gram and CBOW which focus on learning from context perform best among a series of text mining models.
In this project, we use open-sourced clothing fit data which is collected from RentTheRunway. The RentTheRunway dataset contains 192,544 records from 105,508 users regarding 5,850 items. Our original source of data is in json format. To better manipulate data and conduct further analysis, we convert it to a pandas dataframe. As Figure 1 shows, the raw dataset contains missing values and some of its body measurement indexes are not in numeric formats.
Figure 1: First 5 rows of raw dataset
To handle these issues, first of all, we check the proportion of missing values in each column. we can see that 6 features contain missing values. For these columns, we fill in their modes to replace missing values. And for feature columns which we expect to have numeric, we transform them to their corresponding units. For bust size and weight, we extract the number part as feature values, and create a new variable cup to store the letters after the number in bust size. And for height, since the original values are in feet and inches, we extract the numbers of feet and inches, convert 1 feet to 12 inches, and recalculate these values in inches to obtain the height data in numeric format.
Figure 2: Proportion of missing values
From Figure 2, we can see that 6 features contain missing values. For these columns, we fill in their modes to replace missing values. And for feature columns which we expect to have numeric, we transform them to their corresponding units. For bust size and weight, we extract the number part as feature values, and create a new variable cup to store the letters after the number in bust size. And for height, since the original values are in feet and inches, we extract the numbers of feet and inches, convert 1 feet to 12 inches, and recalculate these values in inches to obtain the height data in numeric format.
And for text variables, we use nltk package to conduct tokenization and stemming operations for them, then we convert tokens to lower cases and remove the stop words. Specifically, we find the list of stop words in nltk.corpus package does not contain some frequent stop words in review text, so we manually added about 30 words into this list to keep it consistent with our data.
To explore the characteristics of each column, we conduct univariate analysis for features and response variables.
For our 4 response variables fit, rating, rented for, and category, since rating only have 5 levels of response value and the other 3 are all categorical variables, we create some barplot to explore their distributions.
Figure 3: Proportion of variables
From Figure 3 we find that for fit, more than 70% of customers think the clothes that they get fits all, and the probabilities of being small or being large are about 10% for each level.
For rating, 10 (the highest rating) has the highest proportion of about 60%, and as rating decreases, the proportion also decreases. These 2 variables indicate that most customers are satisfied with the products and services offered by RentTheRunway.
And from the distribution of variables rented for and category, we find that most popular categories are formal dresses which are required for some public affairs. It makes sense since RentTheRunway offers clothes sharing services. For daily dressing which has higher re-use probability, people are more likely to make purchases instead of renting.
Figure 4: Proportion of categorical features
Then for features, we use barplot for categorical features and boxplot for numeric features. In Figure 4, we can see that most customers have medium body shapes. It may be because these kinds of people could find more suitable clothes on a cloth-sharing website and the website also has a willingness to offer more clothes in medium sizes since they will have higher turnover rate and generate more profit.
Figure 5: Distribution of numeric features
In Figure 5, we use boxplot to explore and examine the distribution of numeric features. Specifically, what we should pay attention to is the distribution of age. For variable age, we find there are many outliers and the range of age is [0,120], which is obviously abnormal. We assume that some customers are unwilling to offer their actual ages and inputs below 15 or above 70 may be fake ages. We replace these values with mode to make the variable more likely to obey a convincing distribution.
Besides univariate analysis, we also generate pairplot for bivariate analysis.
Figure 6: Pairplot of numeric features
From Figure 6, we find that among 4 features bust_number, weight, height, and age, variable weight shows some positive correlations with height and bust_number from the scatter plot. To further examine the correlations, we build a correlation matrix of these variables. In the correlation matrix, all pairs of variables show positive correlations, which is consistent with the actual fact that people who are taller or have larger bust numbers will have higher weights. However, the correlation between bust_number and weight is noteworthy since it is rather high. In our modelling part, we will keep track of these 2 variables to check if we should do some adjustments.
Then, we perform the VIF test on all variables as a whole, and the obtained VIF value is 2.28, indicating that there is no significant multicollinearity among the variables. Therefore, we keep all the current features.
Figure 7: Correlation matrix
The RentTheRunway dataset also contains 2 text variables, review_summary and review_text. Previewing the raw data, we find review_summary contains phrases extracted from review_text which include some key information such as cloth category and customer sentiments, but other customer features are omitted. Since we need text to generate word vectors and build our model, we choose review_text as our text variable since it could provide more comprehensive information. After tokenization, we build a word cloud to demonstrate the frequent words in our text.
Figure 8: Word cloud of review_text
From our word cloud, “dress” is the most mentioned word, followed by words related to customer experience such as “fit” and “size”. Among these words, while “fit” is mentioned the most, “little” is also frequently brought up, which might be an indication of less satisfied customers. Also there are some frequent positive words like “great”, “perfect” and “loved”, which is consistent with the previous barplot that customers overall feedback in rating are rather satisfactory.
There are three goals of our research. First, we want to know whether adding review would increase model accuracy when predicting whether a clothes fit a person or not. Second, whether adding reviews variable would increase model performance when predicting customer rating. Third, we build a recommendation model which recommends clothing categories based on user input.
For predictive models which predict fit and rating, the input variables include bust number, weight, height, size, age, cup, body type, review. For the recommender, the input is the item which a user rated the highest in the past. For the reviews in predictive models, we try different preprocessing methods to convert text. These include converting reviews to vectors applying word2vec (Skip-gram & CBOW), GloVe, and Tf-Idf.
The output variables for predictive models are fit and rating. For the recommender, the output variable is item_id, and its corresponding category and rent for, which means that the model would recommend top 5 items which have the highest similarity with the user input.
The goal for our project is to see whether adding “text” would improve model accuracy, thus the baseline model is adding all variables except review. The variables include bust number, weight, height, size, age, cup, body type. The reason that the model adding text could outperforms the baseline is that first, we only keep important words in each row, and deleted pronouns, prepositions, and other common words, such as “I, you, he, she, when, where, what, and, also, to, oh, of”, thus reduced the noise caused by not related words. Second, language models learn patterns from corpus, thus they are able to increase characteristics for each data entry.
For fit predicting model and rating predicting model, we compare the performance of model using different NLP techniques including word2vec (Skip-gram & CBOW), GloVe, and Tf-Idf. To evaluate model performance, we compare AUC-score, F1-score, and accuracy of each model. The higher the AUC, the better the performance of the model at distinguishing classes.
Since our purpose is to classify variables fit and rating, we try logistic regression without and with review text to clarify whether text is useful in this task.
To compare the effect of text on the model, we try different bags of words and word embedding methods to process review text. Bag of words can build a vocabulary from a corpus of documents and counts how many times the words appear in each document. To put it another way, each word in the vocabulary becomes a feature and a document is represented by a vector with the same length of the vocabulary. This approach causes a significant dimensionality problem: the more documents you have the larger is the vocabulary, so the feature matrix will be a huge sparse matrix. Therefore, in order to use this way more efficiently, we do a lot of important preprocessing like word cleaning, stop word removal, stemming to reduce the dimensionality problem.
Word embeddings are methods which look for a latent space to preserve the semantics as much as possible. Words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution for each word appearing before or after another. These methods are able to capture many linguistic regularities of words as well as taking their context into consideration.
To be more specific, we try binary, frequency, TF-IDF, word to vector, GloVe and CBOW to process data and retrain the classification models. Then we compare the models’ accuracy, auroc_rate, F1-score(both macro and micro)and choose the best one as the with-text result. Then we compare it with the model’s result without text and to find the improvement of adding text.
Finally, to make our project more useful, we design a recommender system based on existing information. The recommendation system is constructed based on the ratings given to the item by existing users and the Jaccard similarity. When existing users enter their id, they can see the top 10 products recommended by the system based on their ratings. In addition, we also output the categories of these ten products and the occurrences used, so that customers can choose according to their own needs.
To reduce the cost of training and tuning, we randomly shuffle the original dataset and pick 50,000 records as our dataset for model comparison. Among these 50,000 records, we split them into a training set with 40,000 records and a testing set with 10,000 records. Table 1 shows the evaluation scores of different fitting models on the testing set.
As we can see from Table 1, all the 4 models show improvement from baseline, which indicates that review texts do provide more latent information for us to predict user fit. The performance of these models is not very different. However, in this project, we believe that AUROC can better reflect the pros and cons of the model in a balanced manner, so we finally choose the AUROC indicator to select the model. And among the 4 text models, CBOW has the best AUROC score.
And from Table 2, when predicting rating, text models also outperform the baseline model. Among these models, Skip Gram has the best AUROC score.
From above results, in general, word2v (CBOW and Skip Gram) has the best performance. Then, because the model effect of CBOW and skip-gram is not very different, and in this task, whether the clothes fit the customer is more important, finally we chose the CBOW model. We re-estimate this model on the whole dataset with 192,544 records and get the following results.
As Table 3, the performances of CBOW model are similar on partial datasets as well as on the full dataset, which means our model is stable.
In our recommender, based on cosine similarity, for each user within the dataset, we find 10 most similar items for them and return the categories and rented_for reasons as well. In Table 4, we randomly choose a user and show the recommendation for them.
In conclusion, we have the following findings from our analysis.
First, from the data collected from RentTheRunway, the majority of its customers have high evaluation of the services and products. The main reason that customers would like to use cloth sharing service is to prepare for some formal affairs, and most of the customers are of medium figure, which may be because this kind of customers are exactly whom RentTheRunway wants to target.
Second, baseline models of logistic regression which excluded text variables show poor performance in predicting user fit and rating. In Particular, the macro F1 scores of baseline models are extremely low, which suggests that the classifiers are affected by imbalanced data and have low accuracy in predicting the minority levels.
Third, Skip-gram and CBOW models show the best performance among all the text models involved in this project. Possible reason is that under this e-commerce situation, a user’s fit and overall evaluation are much more likely to be mentioned in user comments, so these models which generate word embeddings based on the study of context outperform other models.
Also, our project still has some weaknesses which suggest the potential space for improvement.
First, when examining the impact of text compared with other features, we use only logistic regression models. Though models with text features added do perform better than logistic regression, if we parse original features into more complex machine learning models, they may generate better performance. In the future, we will consider comparing text models with other models to further study their impacts.
Secondly, when building text models, we only develop some simple frequency-based models and word embedding models. During the experiment, we tried to build a Bert model, however it failed to return expected outputs in time since it had higher requirements for devices. In the future, if possible, we are going to try more language models with advanced devices.
And for the recommender system we build, it returns recommendations based on existing records of certain users. However it does not solve the cold start problem that we cannot apply it to generate recommendations for new customers. To address this issue, we have proposed plans including introducing random numbers, returning average, or asking for more features from new customers that we can use KNN to find some similar users in our records and then apply our recommender system to them.
And lastly, for the deliverable, we are planning to create some more applicable forms to better demonstrate our models and let them be accessible to amateurs as well. In the future, we are going to build an interactive tool which allows inputs of body features and desired categories/situations to return customized recommendations for users.
[1] Rishabh Misra, Mengting Wan, Julian McAuley (2018). Decomposing fit semantics for product size recommendation in metric spaces.
[2] Hsin-Yu Chen, Huang-Chin Yen, Tian Tian (2020) Female Clothing E-commerce Review and Rating.
Code:
import pandas as pd
import numpy as np
from numpy import dot
from numpy.linalg import norm
import json
import gzip
import math
from math import log
import scipy.optimize
import random
import string
import os
import pickle
import itertools
import re
from tqdm import tqdm
import scipy.sparse as sparse
import pyrsm as rsm
import statsmodels.formula.api as smf
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from collections import defaultdict
import gensim.models
import gensim.downloader as api
from gensim.test.utils import common_texts
from gensim.models import word2vec
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from statsmodels.genmod.families import Binomial
from statsmodels.genmod.families.links import logit
import en_core_web_sm
nlp_processor = en_core_web_sm.load()
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem.porter import *
import sklearn
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from wordcloud import WordCloud
from statsmodels.stats.outliers_influence import variance_inflation_factor
/usr/local/lib/python3.8/dist-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
warnings.warn(msg)
path = "renttherunway_final_data.json"
dataset = []
for line in open(path, 'r'):
dataset.append(json.loads(line))
dataset[0]
{'fit': 'fit',
'user_id': '420272',
'bust size': '34d',
'item_id': '2260466',
'weight': '137lbs',
'rating': '10',
'rented for': 'vacation',
'review_text': "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.",
'body type': 'hourglass',
'review_summary': 'So many compliments!',
'category': 'romper',
'height': '5\' 8"',
'size': 14,
'age': '28',
'review_date': 'April 20, 2016'}
df_raw = pd.DataFrame(dataset)
# display first 5 rows
df_raw.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
fit | user_id | bust size | item_id | weight | rating | rented for | review_text | body type | review_summary | category | height | size | age | review_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | fit | 420272 | 34d | 2260466 | 137lbs | 10 | vacation | An adorable romper! Belt and zipper were a lit... | hourglass | So many compliments! | romper | 5' 8" | 14 | 28 | April 20, 2016 |
1 | fit | 273551 | 34b | 153475 | 132lbs | 10 | other | I rented this dress for a photo shoot. The the... | straight & narrow | I felt so glamourous!!! | gown | 5' 6" | 12 | 36 | June 18, 2013 |
2 | fit | 360448 | NaN | 1063761 | NaN | 10 | party | This hugged in all the right places! It was a ... | NaN | It was a great time to celebrate the (almost) ... | sheath | 5' 4" | 4 | 116 | December 14, 2015 |
3 | fit | 909926 | 34c | 126335 | 135lbs | 8 | formal affair | I rented this for my company's black tie award... | pear | Dress arrived on time and in perfect condition. | dress | 5' 5" | 8 | 34 | February 12, 2014 |
4 | fit | 151944 | 34b | 616682 | 145lbs | 10 | wedding | I have always been petite in my upper body and... | athletic | Was in love with this dress !!! | gown | 5' 9" | 12 | 27 | September 26, 2016 |
# check columns types
df_raw.dtypes
fit object
user_id object
bust size object
item_id object
weight object
rating object
rented for object
review_text object
body type object
review_summary object
category object
height object
size int64
age object
review_date object
dtype: object
# check the proportion of null values
df_raw.isnull().sum()/df_raw.shape[0]
fit 0.000000
user_id 0.000000
bust size 0.095620
item_id 0.000000
weight 0.155715
rating 0.000426
rented for 0.000052
review_text 0.000000
body type 0.076019
review_summary 0.000000
category 0.000000
height 0.003516
size 0.000000
age 0.004986
review_date 0.000000
dtype: float64
# get modes of columns to fill in missing values
col_feature = ["bust size","weight","body type","height","size","age"]
df_raw.loc[:,col_feature].mode()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
bust size | weight | body type | height | size | age | |
---|---|---|---|---|---|---|
0 | 34b | 130lbs | hourglass | 5' 4" | 8 | 31 |
# data pre-processing
for i in dataset:
if 'rented for' not in i.keys():
dataset.remove(i)
elif (i['rating'] == None) or (len(i['review_text']) == 0) or (len(i['fit']) == 0) or (len(i['category']) == 0) or (len(i['rented for']) == 0):
dataset.remove(i)
len(dataset)
192452
bust_size = []
for d in dataset:
if 'bust size' in d.keys():
bust_size.append(d['bust size'])
else:
bust_size.append(0)
bust_number = []
cup = []
for b in bust_size:
if b != 0 :
bust_number.append(int(b[0:2]))
cup.append(b[2:])
else:
bust_number.append(34)
cup.append('b')
body_type = []
for d in dataset:
if 'body type' in d.keys():
body_type.append(d['body type'])
else:
body_type.append('hourglass')
weight = []
for d in dataset:
if 'weight' in d.keys():
weight.append(int(d['weight'].split('l',1)[0]))
else:
weight.append(130)
height = []
for d in dataset:
if 'height' in d.keys():
height.append(12*int(d['height'].split("\' ")[0])+int(d['height'].split("\' ")[1].split('"')[0]))
else:
height.append(64)
size = []
for d in dataset:
if 'size' in d.keys():
size.append(d['size'])
else:
size.append(8)
age = []
for d in dataset:
if 'age' in d.keys():
age.append(int(d['age']))
else:
age.append(31)
fit = [d['fit'] for d in dataset]
rate = [(d['rating']) for d in dataset]
category = [d['category'] for d in dataset]
rent = []
for d in dataset:
if 'rented for' in d.keys():
rent.append(d['rented for'])
else:
rent.append('Unknown')
reviews = [d['review_text'] for d in dataset]
df_clean = pd.DataFrame({'fit': fit, 'rating': rate, 'category': category, 'rent_for': rent,
'bust_number': bust_number, 'cup': cup, 'body_type': body_type, 'weight' : weight, 'height': height,
'size': size, 'age': age, 'review_text': reviews})
df_clean.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
fit | rating | category | rent_for | bust_number | cup | body_type | weight | height | size | age | review_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | fit | 10 | romper | vacation | 34 | d | hourglass | 137 | 68 | 14 | 28 | An adorable romper! Belt and zipper were a lit... |
1 | fit | 10 | gown | other | 34 | b | straight & narrow | 132 | 66 | 12 | 36 | I rented this dress for a photo shoot. The the... |
2 | fit | 10 | sheath | party | 34 | b | hourglass | 130 | 64 | 4 | 116 | This hugged in all the right places! It was a ... |
3 | fit | 8 | dress | formal affair | 34 | c | pear | 135 | 65 | 8 | 34 | I rented this for my company's black tie award... |
4 | fit | 10 | gown | wedding | 34 | b | athletic | 145 | 69 | 12 | 27 | I have always been petite in my upper body and... |
# Plotting barplots for response variables
# increase plot resolution
mpl.rcParams["figure.dpi"] = 80
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
df_clean.fit.value_counts(normalize=True,ascending=True).plot(kind='barh',ax=ax1)
df_clean.rating.value_counts(normalize=True,ascending=True).plot(kind='barh',ax=ax2)
df_clean.rent_for.value_counts(normalize=True,ascending=True).plot(kind='barh',ax=ax3)
category_sort = df_clean.category.value_counts().reset_index().sort_values(by=['category'],ascending=False)
category_top10 = list(category_sort['index'][:10])
df_clean[df_clean.category.isin(category_top10)].category.value_counts(normalize=True,ascending=True).plot(kind='barh',ax=ax4)
#fig.suptitle('Proportion of Variables',y=1.0)
ax1.set_title('Fit')
ax2.set_title('Rating')
ax3.set_title('Rented for')
ax4.set_title('Category')
plt.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.4,
hspace=0.4)
plt.show()
fig, (ax1, ax2) = plt.subplots(2)
df_clean.cup.value_counts(normalize=True,ascending=False).plot(kind='bar',ax=ax1, rot=0)
df_clean.body_type.value_counts(normalize=True,ascending=False).plot(kind='bar',ax=ax2, rot=30)
#fig.suptitle('Proportion of Categorical Features',y=1.0)
plt.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.4,
hspace=0.4)
ax1.set_title('Cup')
ax2.set_title('Body Type')
plt.show()
#mpl.rcParams["figure.dpi"] = 80
fig, axes = plt.subplots(1,3)
sns.boxplot(y=df_clean.bust_number,ax=axes[0])
sns.boxplot(y=df_clean.weight,ax=axes[1])
sns.boxplot(y=df_clean.height,ax=axes[2])
#fig.suptitle('Boxplot of Features',y=1.0)
plt.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.6,
hspace=0.4)
plt.show()
#mpl.rcParams["figure.dpi"] = 80
fig, axes = plt.subplots(1,2)
sns.boxplot(y=df_clean["age"],ax=axes[0])
sns.boxplot(y=df_clean["size"],ax=axes[1])
#fig.suptitle('Boxplot of Features',y=1.0)
plt.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.6,
hspace=0.4)
plt.show()
df_clean.loc[(df_clean['age'] > 70) | (df_clean['age'] < 15), 'age'] = 31
mpl.rcParams["figure.dpi"] = 80
sns.pairplot(data = df_clean, vars=["bust_number","weight","height","age"])
plt.show()
#mpl.rcParams["figure.dpi"] = 150
# Creating a matrix using age, salry, balance as rows and columns
df_clean[['weight','height','bust_number']].corr()
#plot the correlation matrix of salary, balance and age in data dataframe.
sns.heatmap(df_clean[['weight','height','bust_number']].corr(), annot=True, cmap = 'Reds')
plt.show()
# Output wordcloud
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
stemmer = PorterStemmer()
stopWords = stopwords.words('english')
review_summary = []
for d in dataset:
r = ''.join([c for c in d['review_summary'].lower()])
for w in r.split():
#w = stemmer.stem(w)
if (w not in punctuation) and (w not in stopWords):
review_summary.append(w)
nltk.download('stopwords')
stop = set(stopwords.words('english'))
ps = PorterStemmer()
# return a list of tokens
def pre_processing_by_nltk(doc, stemming = True, need_sent = False):
# step 1: get sentences
sentences = sent_tokenize(doc)
# step 2: get tokens
tokens = []
for sent in sentences:
words = word_tokenize(sent)
# step 3: stemming
if stemming:
words = [ps.stem(word) for word in words]
if need_sent:
tokens.append(words)
else:
tokens += words
# remove stop words and punctuations in reutrns
return [w.lower() for w in tokens if w.lower() not in stop if w.isalnum()]
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
review_text = np.array(df_clean["review_text"].agg(pre_processing_by_nltk, stemming = False, need_sent = False))
review_text = review_text.reshape(-1).tolist()
review_text_lst = [i for lst in review_text for i in lst]
reviewtext_dict=Counter(review_text_lst)
reviewtext_wordcloud = WordCloud(width = 1000, height = 500, background_color='white').generate_from_frequencies(reviewtext_dict)
plt.figure(figsize=(15,8))
plt.imshow(reviewtext_wordcloud)
plt.axis("off")
plt.show()
df_clean.shape
(192452, 12)
dataset_50k = df_clean.sample(frac=1,random_state=42).reset_index(drop=True)[:50000]
dataset_50k.shape
(50000, 12)
dataset_50k.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
fit | rating | category | rent_for | bust_number | cup | body_type | weight | height | size | age | review_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | large | 10 | maxi | party | 34 | b | pear | 140 | 64 | 8 | 36 | I straddle the 4/6 sizes and this fit me perfe... |
1 | fit | 10 | gown | formal affair | 32 | d | hourglass | 130 | 63 | 8 | 31 | I didn't want to send it back! The 4 fit perf... |
2 | fit | 10 | sheath | formal affair | 32 | d | athletic | 142 | 68 | 12 | 30 | Fit great, if you are between sizes order up, ... |
3 | fit | 10 | dress | wedding | 38 | c | apple | 175 | 67 | 20 | 31 | I went out on a limb with this dress - all the... |
4 | fit | 8 | midi | wedding | 36 | ddd/e | full bust | 170 | 69 | 24 | 31 | It is low, although you're covered you can see... |
df = pd.get_dummies(dataset_50k, columns = ['cup','body_type'])
X = df[['bust_number', 'weight',
'height', 'size', 'age', 'cup_a', 'cup_aa', 'cup_b',
'cup_c', 'cup_d', 'cup_d+', 'cup_dd', 'cup_ddd/e', 'cup_f', 'cup_g',
'cup_h', 'cup_i', 'cup_j', 'body_type_apple', 'body_type_athletic',
'body_type_full bust', 'body_type_hourglass', 'body_type_pear',
'body_type_petite', 'body_type_straight & narrow']]
X_fit_train, X_fit_test, y_fit_train, y_fit_test = train_test_split(X, df.loc[:,'fit'], random_state=42, test_size=0.2, shuffle=True)
X_rate_train, X_rate_test, y_rate_train, y_rate_test = train_test_split(X,df.loc[:,'rating'], random_state=42, test_size=0.2, shuffle=True)
# Test for multicollinearity
X['intercept'] =1
vif = variance_inflation_factor(X.values, 1)
vif
<ipython-input-21-c2510b5486bd>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
X['intercept'] =1
2.2781909944415895
def eval(model, x, true):
auroc = roc_auc_score(true, model.predict_proba(x), multi_class = 'ovr')
f1_mac = f1_score(true, model.predict(x), average='macro')
f1_mic = f1_score(true, model.predict(x), average='micro')
accuracy = accuracy_score(true, model.predict(x))
print("auroc_rate: ", auroc)
print("f1_mac_rate: ", f1_mac)
print("f1_mic_rate: ", f1_mic)
print("accuracy_rate: ", accuracy)
return(auroc, f1_mac, f1_mic, accuracy)
X_fit_train_sparse = sparse.csr_matrix(X_fit_train)
X_rate_train_sparse = sparse.csr_matrix(X_rate_train)
clf_fit_base = LogisticRegression(max_iter=100000000,C=1).fit(X_fit_train, y_fit_train)
auroc_fit_base, f1_mac_fit_base, f1_mic_fit_base, accuracy_fit_base = eval(clf_fit_base, sparse.csr_matrix(X_fit_test), y_fit_test)
auroc_rate: 0.5866287174095914
f1_mac_rate: 0.28463118112481195
f1_mic_rate: 0.7338999999999999
accuracy_rate: 0.7339
clf_rate_base = LogisticRegression(max_iter=100000000,C=1).fit(X_rate_train, y_rate_train)
auroc_rate_base, f1_mac_rate_base, f1_mic_rate_base, accuracy_rate_base = eval(clf_rate_base,sparse.csr_matrix( X_rate_test), y_rate_test)
auroc_rate: 0.5455740291115136
f1_mac_rate: 0.15766387980128438
f1_mic_rate: 0.6506
accuracy_rate: 0.6506
# Text Pre-Processing
stop_words = set(stopwords.words('english'))
newStopWords = ['actually', 'also', 'although',
'always', 'became', 'become', 'could','do', 'does', 'either', 'else','hence','how', 'may', 'to', 'on'
'maybe','me','might','mine','must','my','of','oh','ok','whereas','wherever',
'whenever','whether','whoever','whose','why','will','within','with','without','yet','your', 'would', 'all']
stop_words.update(newStopWords)
# print(stop) must be lowercase an √ ; AN X
ps = PorterStemmer()
# return a list of tokens
def pre_processing_by_nltk(doc, stemming = True, need_sent = False):
# step 1: get sentences
sentences = sent_tokenize(doc)
# step 2: get tokens
tokens = []
for sent in sentences:
words = word_tokenize(sent)
# step 3: stemming
if stemming:
words = [ps.stem(word) for word in words]
if need_sent:
tokens.append(words)
else:
tokens += words
return [w.lower() for w in tokens if w.lower() not in stop_words]
# Create vocab
DF = defaultdict(float)
for doc in tqdm(df.review_text):
tokens = pre_processing_by_nltk(doc)
for token in set(tokens):
DF[token] += 1
IDF, vocab = dict(), dict()
for token in DF:
if DF[token] < 15:
# words (freq < 10) becomes an unk
pass
else:
vocab[token] = len(vocab)
IDF[token] = log(1 + len(df.review_text) / DF[token])
IDF['<UNK>'] = 1
vocab['<UNK>'] = len(vocab)
len(vocab)
100%|██████████| 50000/50000 [01:26<00:00, 579.70it/s]
2951
def tfidf_feature_extractor(doc, vocab, IDF):
tokens = pre_processing_by_nltk(doc)
for i, token in enumerate(tokens):
if token not in vocab:
tokens[i] = '<UNK>'
TF = defaultdict(int)
for token in tokens:
TF[token] += 1
x = [0] * len(vocab)
for token in set(tokens):
tfidf = log(TF[token] + 1) * IDF[token]
token_id = vocab[token]
# print(token, TF[token], IDF[token])
x[token_id] = tfidf # this will be a dense matrix
return x
X_review_tfidf = []
for doc in tqdm(df.review_text):
X_review_tfidf.append(tfidf_feature_extractor(doc, vocab, IDF))
X_review = pd.DataFrame(np.array(X_review_tfidf))
100%|██████████| 50000/50000 [01:33<00:00, 532.47it/s]
X_add_idf = pd.concat([X,X_review],axis=1)
X_fit_train, X_fit_test, y_fit_train, y_fit_test = train_test_split(X_add_idf, df.loc[:,'fit'], random_state=42, test_size=0.2, shuffle=True)
X_rate_train, X_rate_test, y_rate_train, y_rate_test = train_test_split(X_add_idf,df.loc[:,'rating'], random_state=42, test_size=0.2, shuffle=True)
X_fit_train_sparse = sparse.csr_matrix(X_fit_train)
X_rate_train_sparse = sparse.csr_matrix(X_rate_train)
clf_fit_idf = LogisticRegression(max_iter=100000000,C=1).fit(X_fit_train_sparse, y_fit_train)
auroc_fit_idf, f1_mac_fit_idf, f1_mic_fit_idf, accuracy_fit_idf = eval(clf_fit_idf, sparse.csr_matrix(X_fit_test), y_fit_test)
auroc_rate: 0.790526609454432
f1_mac_rate: 0.6030773228208012
f1_mic_rate: 0.7731
accuracy_rate: 0.7731
clf_rate_idf = LogisticRegression(max_iter=100000000,C=1).fit(X_rate_train_sparse, y_rate_train)
auroc_rate_idf, f1_mac_rate_idf, f1_mic_rate_idf, accuracy_rate_idf = eval(clf_rate_idf, sparse.csr_matrix(X_rate_test), y_rate_test)
auroc_rate: 0.7883374631115887
f1_mac_rate: 0.34772206797681615
f1_mic_rate: 0.6721
accuracy_rate: 0.6721
def build_vocab(sentences):
# Build vocabulary
word_counts = Counter(itertools.chain(*sentences))
# Mapping from index to word
vocabulary_inv = [x[0] for x in word_counts.most_common()]
# Mapping from word to index
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
return word_counts, vocabulary, vocabulary_inv
def get_embeddings(inp_data, vocabulary_inv, size_features=100,
mode='skipgram',
min_word_count=2,
context=5):
model_name = "embedding"
model_name = os.path.join(model_name)
num_workers = 15 # Number of threads to run in parallel
downsampling = 1e-3 # Downsample setting for frequent words
print('Training Word2Vec model...')
# use inp_data and vocabulary_inv to reconstruct sentences
sentences = [[vocabulary_inv[w] for w in s] for s in inp_data]
if mode == 'skipgram':
sg = 1
print('Model: skip-gram')
elif mode == 'cbow':
sg = 0
print('Model: CBOW')
embedding_model = word2vec.Word2Vec(sentences, workers=num_workers,
sg=sg,
vector_size=size_features,
min_count=min_word_count,
window=context,
sample=downsampling)
print("Saving Word2Vec model {}".format(model_name))
embedding_weights = np.zeros((len(vocabulary_inv), size_features))
for i in range(len(vocabulary_inv)):
word = vocabulary_inv[i]
if word in embedding_model.wv:
embedding_weights[i] = embedding_model.wv[word]
else:
embedding_weights[i] = np.random.uniform(-0.25, 0.25,
embedding_model.vector_size)
return embedding_weights
def preprocess_df(df):
# get English stopwords
stop_words = set(stopwords.words('english'))
stop_words.add('would')
# prepare translation table to translate punctuation to space
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
preprocessed_sentences = []
for i, row in df.iterrows():
sent = row["review_text"]
sent_nopuncts = sent.translate(translator)
words_list = sent_nopuncts.strip().split()
filtered_words = [word for word in words_list if word not in stop_words and len(word) != 1] # also skip space from above translation
preprocessed_sentences.append(" ".join(filtered_words))
df["review_text"] = preprocessed_sentences
return df
train = df[:len(df)//10*8]
test = df[len(df)//10*8:]
train_feature = X[:len(X)//10*8]
test_feature = X[len(X)//10*8:]
train = preprocess_df(train)
test = preprocess_df(test)
<ipython-input-41-424ecf821ad2>:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["review_text"] = preprocessed_sentences
# tokenization
tagged_data = [word_tokenize(_d) for i, _d in enumerate(train["review_text"])]
# build vocabulary from tokenized data
word_counts, vocabulary, vocabulary_inv = build_vocab(tagged_data)
# use the above mapping to create input data
inp_data = [[vocabulary[word] for word in text] for text in tagged_data]
# get embedding vector
embedding_weights = get_embeddings(inp_data, vocabulary_inv)
tagged_train_data = [word_tokenize(_d) for i, _d in enumerate(train["review_text"])]
tagged_test_data = [word_tokenize(_d) for i, _d in enumerate(test["review_text"])]
Training Word2Vec model...
Model: skip-gram
Saving Word2Vec model embedding
train_vec = []
for doc in tagged_train_data:
vec = 0
for w in doc:
vec += embedding_weights[vocabulary[w]]
if len(doc) != 0:
vec = vec / len(doc)
else: vec = [0] * 100
train_vec.append(vec)
test_vec = []
for doc in tagged_test_data:
vec = 0
length = 0
for w in doc:
try:
vec += embedding_weights[vocabulary[w]]
length += 1
except:
continue
if length != 0:
vec = vec / length
else: vec = [0] * 100
test_vec.append(vec)
train_vec_features = [np.append(i, j) for (i,j) in zip(train_vec, np.array(train_feature))]
test_vec_features = [np.append(i, j) for (i,j) in zip(test_vec, np.array(test_feature))]
train_just_features = [i for i in np.array(train_feature)]
test_just_features = [i for i in np.array(test_feature)]
model_fit_SG = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features, train['fit'])
auroc_fit_SG, f1_mac_fit_SG, f1_mic_fit_SG, accuracy_fit_SG = eval(model_fit_SG, test_vec_features, test['fit'])
auroc_rate: 0.7852504496310196
f1_mac_rate: 0.48416385275120405
f1_mic_rate: 0.7669
accuracy_rate: 0.7669
model_rating_SG = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features, train['rating'])
auroc_rate_SG, f1_mac_rate_SG, f1_mic_rate_SG, accuracy_rate_SG = eval(model_rating_SG, test_vec_features, test['rating'])
auroc_rate: 0.8226084393207364
f1_mac_rate: 0.26879388027214113
f1_mic_rate: 0.6763
accuracy_rate: 0.6763
# get embedding vector
embedding_weights = get_embeddings(inp_data, vocabulary_inv, mode='cbow' )
tagged_train_data = [word_tokenize(_d) for i, _d in enumerate(train["review_text"])]
tagged_test_data = [word_tokenize(_d) for i, _d in enumerate(test["review_text"])]
Training Word2Vec model...
Model: CBOW
Saving Word2Vec model embedding
train_vec = []
for doc in tagged_train_data:
vec = 0
for w in doc:
vec += embedding_weights[vocabulary[w]]
if len(doc) != 0:
vec = vec / len(doc)
else: vec = [0] * 100
train_vec.append(vec)
test_vec = []
for doc in tagged_test_data:
vec = 0
length = 0
for w in doc:
try:
vec += embedding_weights[vocabulary[w]]
length += 1
except:
continue
if length != 0:
vec = vec / length
else: vec = [0] * 100
test_vec.append(vec)
train_vec_features = [np.append(i, j) for (i,j) in zip(train_vec, np.array(train_feature))]
test_vec_features = [np.append(i, j) for (i,j) in zip(test_vec, np.array(test_feature))]
model_fit_cbow = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features, train['fit'])
auroc_fit_cbow, f1_mac_fit_cbow, f1_mic_fit_cbow, accuracy_fit_cbow = eval(model_fit_cbow, test_vec_features, test['fit'])
auroc_rate: 0.7948469125828606
f1_mac_rate: 0.508236085791458
f1_mic_rate: 0.7708
accuracy_rate: 0.7708
model_rating_cbow = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features, train['rating'])
auroc_rate_cbow, f1_mac_rate_cbow, f1_mic_rate_cbow, accuracy_rate_cbow = eval(model_rating_cbow, test_vec_features, test['rating'])
auroc_rate: 0.810268130641866
f1_mac_rate: 0.25538126948925366
f1_mic_rate: 0.6646
accuracy_rate: 0.6646
tagged_train_data_list = []
for i in tagged_train_data:
tagged_train_data_list.append(' '.join(i))
tagged_test_data_list = []
for i in tagged_test_data:
tagged_test_data_list.append(' '.join(i))
glove_df_train = pd.DataFrame(tagged_train_data_list).rename(columns={0:'text'})
glove_df_test = pd.DataFrame(tagged_test_data_list).rename(columns={0:'text'})
glove_model = {}
with open ('glove.6B.100d.txt') as f:
for line in f:
line_split = line.split()
word = line_split[0]
embedding = np.array(line_split[1:], dtype=np.float64)
glove_model[word] = embedding
glove_model_new = {}
for k, v in glove_model.items():
if v.size ==100:
glove_model_new[k] = v
def cos_sim(a, b):
return dot(a, b)/(norm(a)*norm(b))
def GloVe(doc, wv):
vecs = []
for token in doc.split():
try:
vecs.append(wv[token])
except KeyError:
vecs.append(np.array([0]*100))
return np.mean(vecs, axis=0)
train_vec = []
for sent in glove_df_train.text:
train_vec.append(GloVe(sent, glove_model_new))
test_vec = []
for sent in glove_df_test.text:
test_vec.append(GloVe(sent, glove_model_new))
/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py:3419: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
train_vec_features_df = pd.DataFrame(train_vec_features)
train_vec_features_df = pd.concat([train_vec_features_df, train[['fit', 'rating']]], axis=1)
test_vec_features_df = pd.DataFrame(test_vec_features)
test_vec_features_df = pd.concat([test_vec_features_df, test[['fit', 'rating']].reset_index()], axis=1)
train_vec_features_df = train_vec_features_df.dropna()
test_vec_features_df = test_vec_features_df.dropna()
test_vec_features_df = test_vec_features_df.drop(columns=['index'])
model_fit_Glove = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features_df.iloc[:,:-2], train_vec_features_df['fit'])
auroc_fit_Glove, f1_mac_fit_Glove, f1_mic_fit_Glove, accuracy_fit_Glove = eval(model_fit_Glove, test_vec_features_df.iloc[:,:-2], test_vec_features_df['fit'])
auroc_rate: 0.7500948658707619
f1_mac_rate: 0.4330966750550509
f1_mic_rate: 0.7553308639503454
accuracy_rate: 0.7553308639503454
model_rate_Glove = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features_df.iloc[:,:-2], train_vec_features_df['rating'])
auroc_rate_Glove, f1_mac_rate_Glove, f1_mic_rate_Glove, accuracy_rate_Glove = eval(model_rate_Glove, test_vec_features_df.iloc[:,:-2], test_vec_features_df['rating'])
auroc_rate: 0.754920472670829
f1_mac_rate: 0.19990658444584186
f1_mic_rate: 0.649614576033637
accuracy_rate: 0.649614576033637
# fit
compare_fit = pd.DataFrame({
'baseline':[auroc_fit_base, f1_mac_fit_base, f1_mic_fit_base, accuracy_fit_base],
'TFIDF':[auroc_fit_idf, f1_mac_fit_idf, f1_mic_fit_idf, accuracy_fit_idf],
'skip-gram':[auroc_fit_SG, f1_mac_fit_SG, f1_mic_fit_SG, accuracy_fit_SG],
'CBOW':[auroc_fit_cbow, f1_mac_fit_cbow, f1_mic_fit_cbow, accuracy_fit_cbow],
'Glove':[auroc_fit_Glove, f1_mac_fit_Glove, f1_mic_fit_Glove, accuracy_fit_Glove]
},index = ['auroc','f1_mac','f1_mic','accuracy'])
compare_fit.round(4)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
baseline | TFIDF | skip-gram | CBOW | Glove | |
---|---|---|---|---|---|
auroc | 0.5866 | 0.7905 | 0.7853 | 0.7948 | 0.7501 |
f1_mac | 0.2846 | 0.6031 | 0.4842 | 0.5082 | 0.4331 |
f1_mic | 0.7339 | 0.7731 | 0.7669 | 0.7708 | 0.7553 |
accuracy | 0.7339 | 0.7731 | 0.7669 | 0.7708 | 0.7553 |
# rate
compare_rate = pd.DataFrame({
'baseline':[auroc_rate_base, f1_mac_rate_base, f1_mic_rate_base, accuracy_rate_base],
'TFIDF':[auroc_rate_idf, f1_mac_rate_idf, f1_mic_rate_idf, accuracy_rate_idf],
'skip-gram':[auroc_rate_SG, f1_mac_rate_SG, f1_mic_rate_SG, accuracy_rate_SG],
'CBOW':[auroc_rate_cbow, f1_mac_rate_cbow, f1_mic_rate_cbow, accuracy_rate_cbow],
'Glove':[auroc_rate_Glove, f1_mac_rate_Glove, f1_mic_rate_Glove, accuracy_rate_Glove]
},index = ['auroc','f1_mac','f1_mic','accuracy'])
compare_rate.round(4)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
baseline | TFIDF | skip-gram | CBOW | Glove | |
---|---|---|---|---|---|
auroc | 0.5456 | 0.7883 | 0.8226 | 0.8103 | 0.7549 |
f1_mac | 0.1577 | 0.3477 | 0.2688 | 0.2554 | 0.1999 |
f1_mic | 0.6506 | 0.6721 | 0.6763 | 0.6646 | 0.6496 |
accuracy | 0.6506 | 0.6721 | 0.6763 | 0.6646 | 0.6496 |
df_full = df_clean.copy()
df_full = pd.get_dummies(df_full, columns = ['cup','body_type'])
df_feature_full = df_full[['bust_number', 'weight',
'height', 'size', 'age', 'cup_a', 'cup_aa', 'cup_b',
'cup_c', 'cup_d', 'cup_d+', 'cup_dd', 'cup_ddd/e', 'cup_f', 'cup_g',
'cup_h', 'cup_i', 'cup_j', 'body_type_apple', 'body_type_athletic',
'body_type_full bust', 'body_type_hourglass', 'body_type_pear',
'body_type_petite', 'body_type_straight & narrow']]
train_full = df_full[:len(df_full)//10*8]
test_full = df_full[len(df_full)//10*8:]
train_feature_full = df_feature_full[:len(df_feature_full)//10*8]
test_feature_full = df_feature_full[len(df_feature_full)//10*8:]
train_full = preprocess_df(train_full)
test_full = preprocess_df(test_full)
<ipython-input-41-424ecf821ad2>:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["review_text"] = preprocessed_sentences
# tokenization
tagged_data_full = [word_tokenize(_d) for i, _d in enumerate(train_full["review_text"])]
# build vocabulary from tokenized data
word_counts_full, vocabulary_full, vocabulary_inv_full = build_vocab(tagged_data_full)
# use the above mapping to create input data
inp_data_full = [[vocabulary_full[word] for word in text] for text in tagged_data_full]
# get embedding vector
embedding_weights_full = get_embeddings(inp_data_full, vocabulary_inv_full,mode='cbow')
tagged_train_data_full = [word_tokenize(_d) for i, _d in enumerate(train_full["review_text"])]
tagged_test_data_full = [word_tokenize(_d) for i, _d in enumerate(test_full["review_text"])]
Training Word2Vec model...
Model: CBOW
Saving Word2Vec model embedding
train_vec_full = []
for doc in tagged_train_data_full:
vec = 0
for w in doc:
vec += embedding_weights_full[vocabulary_full[w]]
if len(doc) != 0:
vec = vec / len(doc)
else: vec = [0] * 100
train_vec_full.append(vec)
test_vec_full = []
for doc in tagged_test_data_full:
vec = 0
length = 0
for w in doc:
try:
vec += embedding_weights_full[vocabulary_full[w]]
length += 1
except:
continue
if length != 0:
vec = vec / length
else: vec = [0] * 100
test_vec_full.append(vec)
train_vec_features_full = [np.append(i, j) for (i,j) in zip(train_vec_full, np.array(train_feature_full))]
test_vec_features_full = [np.append(i, j) for (i,j) in zip(test_vec_full, np.array(test_feature_full))]
train_just_features_full = [i for i in np.array(train_feature_full)]
test_just_features_full = [i for i in np.array(test_feature_full)]
model_rating_baseline_full = LogisticRegression(C =1, random_state=42, max_iter = 100000000).fit(train_just_features_full, train_full['rating'])
auroc_rating_baseline_full, f1_mac_rating_baseline_full, f1_mic_rating_baseline_full, accuracy_rating_baseline_full = eval(model_rating_baseline_full, test_just_features_full, test_full['rating'])
auroc_rate: 0.5339586064760585
f1_mac_rate: 0.1569258943512993
f1_mic_rate: 0.6455886937545464
accuracy_rate: 0.6455886937545464
model_rating_full = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features_full, train_full['rating'])
auroc_rating_CBOW_full, f1_mac_rating_CBOW_full, f1_mic_rating_CBOW_full, accuracy_rating_CBOW_full=eval(model_rating_full, test_vec_features_full, test_full['rating'])
auroc_rate: 0.8207798206398245
f1_mac_rate: 0.27803024060448356
f1_mic_rate: 0.6739582250857321
accuracy_rate: 0.6739582250857321
model_fit_baseline_full = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_just_features_full, train_full['fit'])
auroc_fit_baseline_full, f1_mac_fit_baseline_full, f1_mic_fit_baseline_full, accuracy_fit__baseline_full= eval(model_fit_baseline_full, test_just_features_full, test_full['fit'])
auroc_rate: 0.5856004766883958
f1_mac_rate: 0.285921879398892
f1_mic_rate: 0.7369583290034293
accuracy_rate: 0.7369583290034293
model_fit_full = LogisticRegression(C = 1, random_state=42, max_iter = 100000000).fit(train_vec_features_full, train_full['fit'])
auroc_fit_CBOW_full, f1_mac_fit_CBOW_full, f1_mic_fit_CBOW_full, accuracy_fit_CBOW_full=eval(model_fit_full, test_vec_features_full, test_full['fit'])
auroc_rate: 0.7880492150879822
f1_mac_rate: 0.5023983522203012
f1_mic_rate: 0.7683934324015379
accuracy_rate: 0.768393432401538
# fit
compare_full_fit = pd.DataFrame({
'baseline_full':[auroc_fit_baseline_full, f1_mac_fit_baseline_full, f1_mic_fit_baseline_full, accuracy_fit__baseline_full],
'CBOW_full':[auroc_fit_CBOW_full, f1_mac_fit_CBOW_full, f1_mic_fit_CBOW_full, accuracy_fit_CBOW_full]
},index = ['auroc','f1_mac','f1_mic','accuracy'])
compare_full_fit['improvement'] = (compare_full_fit['CBOW_full']-compare_full_fit['baseline_full'])
compare_full_fit.round(4)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
baseline_full | CBOW_full | improvement | |
---|---|---|---|
auroc | 0.5340 | 0.7880 | 0.2541 |
f1_mac | 0.1569 | 0.5024 | 0.3455 |
f1_mic | 0.6456 | 0.7684 | 0.1228 |
accuracy | 0.7370 | 0.7684 | 0.0314 |
# rate
compare_full_rate = pd.DataFrame({
'baseline_full':[auroc_rating_baseline_full, f1_mac_rating_baseline_full, f1_mic_rating_baseline_full, accuracy_rating_baseline_full],
'CBOW_full':[auroc_rating_CBOW_full, f1_mac_rating_CBOW_full, f1_mic_rating_CBOW_full, accuracy_rating_CBOW_full]
},index = ['auroc','f1_mac','f1_mic','accuracy'])
compare_full_rate['improvement'] = (compare_full_rate['CBOW_full']-compare_full_rate['baseline_full'])
compare_full_rate.round(4)
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
baseline_full | CBOW_full | improvement | |
---|---|---|---|
auroc | 0.5340 | 0.8208 | 0.2868 |
f1_mac | 0.1569 | 0.2780 | 0.1211 |
f1_mic | 0.6456 | 0.6740 | 0.0284 |
accuracy | 0.6456 | 0.6740 | 0.0284 |
This part is for exist customers, when you input your id, we can output some recommend items for you.
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)
catPerItem = defaultdict(set)
rentforPerItem = defaultdict(set)
for d in dataset:
user,item,category,rentfor = d['user_id'], d['item_id'],d['category'],d['rented for']
usersPerItem[item].add(user)
itemsPerUser[user].add(item)
catPerItem[item].add(category)
rentforPerItem[item].add(rentfor)
def Jaccard(s1, s2):
numer = len(s1.intersection(s2))
denom = len(s1.union(s2))
return numer / denom
def user_fav(user_id) :
ratings = []
items = []
rating = []
for d in dataset:
if (d['user_id'] == user_id) :
rating.append(d['rating'])
max_rating = max(rating)
for d in dataset:
if (d['user_id'] == user_id) & (d['rating'] == max_rating):
items.append(str(d['item_id']))
items.sort()
return items[0]
def mostSimilarItem(i):
similarities = []
users = usersPerItem[i]
candidateItem = set()
for u in users:
candidateItem = candidateItem.union(itemsPerUser[u])
for i2 in candidateItem:
if i2 == i: continue
sim = Jaccard(users, usersPerItem[i2])
similarities.append((sim, i2))
sortedList = [i[1] for i in sorted(similarities, key = lambda y: (-y[0], y[1]))[:10]]
cats = [catPerItem[j] for j in sortedList]
occ = [rentforPerItem[j] for j in sortedList]
result = pd.DataFrame({'itemid':sortedList, 'category': cats, 'occasion': occ})
return result
UserId = input("Please input your user id: " )
print(mostSimilarItem(user_fav(str(UserId))))
Please input your user id: 420272
itemid category occasion
0 2732613 {shirt} {date, everyday, party, other, work}
1 925104 {dress} {wedding, everyday, party, vacation, other}
2 1006590 {dress} {date, party, wedding}
3 940419 {dress} {date, wedding, everyday, party, formal affair...
4 962489 {dress} {date, everyday, party, vacation, other, work}
5 1601603 {dress} {date, wedding, party, formal affair, other}
6 2375866 {skirt} {vacation, party, everyday}
7 2363191 {jumpsuit} {date, everyday, party, formal affair, vacatio...
8 1635675 {dress} {date, wedding, everyday, party, formal affair...
9 2211789 {romper} {vacation, party, everyday, other}
renttherunway_final_data.json
The raw data collected from RentTheRunway.com, containing 192,544 records from 105,508 users regarding 5,850 items.
To download this file, please go to https://www.dropbox.com/s/9omw6s9bh8vxkgu/renttherunway_final_data.json?dl=0.
glove.6B.100d.txt
The corpus which is used to train the Glove model (with the embedding dimensions of 100).
To download this file, please go to https://www.dropbox.com/s/sc5uzmrcype7udr/glove.6B.100d.txt?dl=0 or refer to https://nlp.stanford.edu/projects/glove/.
code.ipynb
The code containing all the works mentioned in our report.
report.pdf
The final report which includes detailed process.
Rishabh Misra, Mengting Wan, Julian McAuley (2018). Decomposing fit semantics for product size recommendation in metric spaces.
https://cseweb.ucsd.edu//~jmcauley/pdfs/recsys18e.pdf
Hsin-Yu Chen, Huang-Chin Yen, Tian Tian (2020). Female Clothing E-commerce Review and Rating.