We developed an API, called by a Telegram bot, that lets the Rossmann CEO check the sales prediction for each store from his cellphone, speeding up business decisions about store renovation options.
- Project Overview
- Business Problem and Dataset
- Data Description
- Feature Engineering
- Exploratory Data Analysis
- Machine Learning Modeling
- Model Deployment
- Making requests (Telegram Bot)
A quick look at the top-level files and directories you'll see in rossmann-store-sales.
.
├── data
├── notebooks
├── .gitignore
├── README.md
├── deployment
├── .vscode
├── img
├── model
├── requirements.txt
├── parameter
- /data: This directory contains all of the unprocessed datasets.
- /notebooks: This directory contains all of the code related to analysis, EDA, modeling, and so on.
- .gitignore: This file tells git which files it should not track / not maintain a version history for.
- README.md: A text file containing useful reference information about the project.
- deployment: This folder contains productionized code ready for deployment.
- .vscode: Configuration of the environments used within the VS Code editor.
- img: Folder containing the images from the analysis and other images.
- model: This folder contains the model serialized with pickle.
- requirements.txt: List of packages used in the project, for reproducibility.
- parameter: This folder contains the parameters serialized with pickle.
Rossmann is a private, German drug store chain founded in 1972 and is a key player in the European pharmacy market, with operations in healthcare and beauty retail industries. According to Bloomberg, Rossmann offers a wide range of products including baby and body care, hygiene, sun protection, cosmetics, dental hygiene, household, pets, hair care, perfume, fragrances, and food products.
Aside from its more than 2,000 on-site stores in Germany (see stores location here), Rossmann also operates in Poland, Czech Republic, Turkey, Albania, and Hungary, for a total of more than 4,100 on-site stores.
Rossmann is also active in e-commerce for Germany-based customers, with around 30 million EUR in online revenues per year, making up 15.2% of market share in Germany. Rossmann's 2018 annual revenue was approximately 9 billion EUR (Dun & Bradstreet). Further financial information is not publicly available.
In this project, a machine learning model was trained to predict sales revenues for Rossmann. The following setup was utilized:
- Project Methodology: CRISP-DM was used as the main project management methodology. Two cycles were completed within the total project length of two months; a log of each CRISP-DM cycle can be accessed here.
- Business Problem and Solution: a fictitious business problem was created to motivate the project. Following an upper-management request, a 6-week sales prediction for each Rossmann store will be delivered to the business. Predictions will be available through a Telegram bot, where stakeholders can retrieve them on their smartphones. The request came from the CEO of Rossmann, who wanted the predictions to evaluate whether store revenue could fund renovations; a machine learning model is a feasible way to produce them.
- Data Collection: Data was acquired from Rossmann's Store Sales Kaggle competition.
- Data Dimensions (rows x columns):
- Train dataset: 969264 x 18
- Valid dataset: 47945 x 18
- Date Range: 2013-01-01 (first) / 2015-07-31 (last)
In this project, we split the data into training and validation parts and kept a separate test set:
- Training data corresponds to all entries from 2013-01-01 to 2015-06-19.
- Validation data contains entries from the last 6 weeks of available data, 2015-06-19 to 2015-07-31.
- Test data corresponds to entries from 2015-07-31 to 2015-09-16. This data doesn't have the target variable sales and will be used as the input to generate predictions in production.
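The time-based split above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the column names `date` and `sales`, the helper name `split_by_date`, and the toy rows are assumptions. Assigning the cutoff day itself to the validation part is also an assumption, since the ranges above share 2015-06-19.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, date_col: str = "date", weeks: int = 6):
    """Hold out the last `weeks` weeks of data for validation."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(weeks=weeks)
    train = df[dates < cutoff]    # everything strictly before the cutoff day
    valid = df[dates >= cutoff]   # the cutoff day and everything after it
    return train, valid


# Toy rows only; the real dataset has one row per store per day.
df = pd.DataFrame(
    {
        "date": ["2015-06-18", "2015-06-19", "2015-07-31"],
        "sales": [100, 200, 300],
    }
)
train, valid = split_by_date(df)
```

With the last available date at 2015-07-31, a 6-week window puts the cutoff at 2015-06-19, which matches the dates listed above.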
A little feature engineering was done at the beginning to deal with time variables:
- Extracted year, month, day, week of year, and year-week from the date provided
- Created a new variable called "competition since", which is the time since the competition started
- Created a new variable called "promo_since", which is the time since the last promo started
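A rough sketch of these time features with pandas is shown below. The column names (`date`, `competition_open_since_year`, `competition_open_since_month`), the helper name `add_time_features`, and the choice to measure elapsed time in 30-day months are all assumptions made for illustration; the notebooks may compute them differently. A promo variable would follow the same pattern from its own start date.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features and time since competition opened."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])

    # Calendar parts of the sale date.
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["day"] = out["date"].dt.day
    out["week_of_year"] = out["date"].dt.isocalendar().week.astype(int)
    out["year_week"] = out["date"].dt.strftime("%Y-%W")

    # Assumed columns: year and month in which the competitor opened.
    competition_start = pd.to_datetime(
        pd.DataFrame(
            {
                "year": out["competition_open_since_year"],
                "month": out["competition_open_since_month"],
                "day": 1,
            }
        )
    )
    # Elapsed time in approximate (30-day) months.
    out["competition_time_month"] = (out["date"] - competition_start).dt.days // 30
    return out


# Toy row for illustration only.
df = pd.DataFrame(
    {
        "date": ["2015-07-31"],
        "competition_open_since_year": [2015],
        "competition_open_since_month": [1],
    }
)
features = add_time_features(df)
```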
We generated hypotheses to confirm or reject during Exploratory Data Analysis
False, these stores sell more
False, these stores sell more
False, these stores sell less after a period of promotion
False, stores are selling less across the years
False, stores sell less during the second semester of the year
False, stores sell roughly the same
True, stores sell less on weekends
We trained five main models; their performance can be seen in the table below
We performed cross-validation with five folds to assess the model's real performance
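The text does not say how the five folds were built. For time-ordered sales data, one common option is an expanding window, where each fold validates on a later block and trains on everything before it. The sketch below assumes that scheme, with 6-week validation blocks ending at the last available date; the function name `expanding_window_folds` and the block length are assumptions, not the project's confirmed setup.

```python
from datetime import date, timedelta


def expanding_window_folds(last_date: date, k: int = 5, weeks: int = 6):
    """Return k (valid_start, valid_end) pairs, oldest fold first.

    For each fold, rows with valid_start < d <= valid_end are validation
    rows, and rows with d <= valid_start are training rows.
    """
    block = timedelta(weeks=weeks)
    folds = []
    for i in range(k, 0, -1):
        valid_start = last_date - i * block
        valid_end = last_date - (i - 1) * block
        folds.append((valid_start, valid_end))
    return folds


folds = expanding_window_folds(date(2015, 7, 31))
for valid_start, valid_end in folds:
    print(f"train: d <= {valid_start} | validate: {valid_start} < d <= {valid_end}")
```

Each fold would then fit the model on its training rows and score it on its validation rows, and the five scores would be averaged.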
The chosen model was the XGBoost regressor; the error plot is as follows:
The model was deployed on the Heroku platform. The production code consists of a handler that creates the API, a tester that calls the API, and a class defined in Rossmann.py, along with all the models and parameters serialized with pickle.
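As a framework-free sketch of what such a handler does per request, the code below parses a JSON body, rescales one input with stored parameters, calls the model, and returns JSON. Everything named here is hypothetical: the function names, the `store` and `competition_distance` fields, the parameter keys, and the stub model are illustrations, not the contents of the real handler or of Rossmann.py.

```python
import json
import pickle


def load_artifact(path: str):
    """Load a pickled model or parameter object from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def handle_request(body: str, model, params: dict) -> str:
    """Turn a JSON request body into a JSON list of predictions."""
    rows = json.loads(body)
    if isinstance(rows, dict):  # accept a single store or a list of stores
        rows = [rows]

    # Min-max rescaling with stored parameters (hypothetical keys).
    lo, hi = params["distance_min"], params["distance_max"]
    features = [[(r["competition_distance"] - lo) / (hi - lo)] for r in rows]

    preds = model.predict(features)
    return json.dumps(
        [{"store": r["store"], "prediction": float(p)} for r, p in zip(rows, preds)]
    )


class ConstantModel:
    """Stub used only to exercise the handler; not the trained model."""

    def predict(self, features):
        return [1000.0 for _ in features]


response = handle_request(
    '{"store": 22, "competition_distance": 500}',
    ConstantModel(),
    {"distance_min": 0, "distance_max": 1000},
)
```

In a deployed app, the model and parameters would come from `load_artifact` calls on the pickled files rather than from the stub and the inline dictionary used here.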