Grocery Repo

This repo contains scripts to do web scraping of e-commerce grocery web pages.
This was done purely as a school project, and we adhere to each site's robots.txt and to the ethics of being a good web scraper.
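
As an illustration of what respecting robots.txt can look like in code, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from the repo's scripts. The rules and URLs below are hypothetical placeholders; in practice the text could be read from the files saved under `robots/`.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


# Hypothetical rules for illustration only (not the real rules of any site).
EXAMPLE_RULES = """\
User-agent: *
Disallow: /checkout/
"""

parser = build_parser(EXAMPLE_RULES)

# Ask before fetching: is this URL allowed for a generic user agent?
allowed = parser.can_fetch("*", "https://example.com/product/123")
blocked = parser.can_fetch("*", "https://example.com/checkout/cart")
```

A scraper would call `can_fetch` before each request and skip any URL for which it returns `False`.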

Project Directory

The project directory would look something like this after a few rounds of scraping.

├── config
├── data
│   ├── fairprice
│   │   └── raw
│   │       └── 20180820_links.csv
│   └── redmart
│       ├── processed
│       │   └── data.csv
│       └── raw
│           ├── 20180809_data.json
│           ├── 20180811_data.json
│           ├── 20180812_data.json
│           ├── 20180813_data.json
│           ├── 20180814_data.json
│           ├── 20180818_data.json
│           └── 20180820_data.json
├── environment.yml
├── LICENSE
├── log
├── notebook
│   ├── FairPrice Exploratory.ipynb
│   ├── MongoDB.ipynb
│   ├── narrative-python.ipynb
│   └── Redmart Exploratory .ipynb
├── pictures
├── README.md
├── report
├── robots
│   ├── fairprice_robots.txt
│   └── redmart_robots.txt
├── seleniumdrivers
│   └── chromedriver
└── src
    ├── fairprice.py
    ├── make_dataset.py
    ├── make_mongodb_redmart.py
    ├── redmart.py
    └── utils.py

Requirements

In order to run the code, you would need to have Anaconda3 installed.

Setup

  1. Clone the repo
git clone https://github.com/notha99y/Grocery.git
  2. Set up the conda environment
cd Grocery
conda env create -f=environment.yml

With this, you are set to do web scraping and some simple data analysis.

Dockerfile

TODO

RedMart (How to use)

Currently, these are some of the things you can do with the RedMart scripts:

  • Scrape RedMart
    • This creates a project directory called data/raw, scrapes the RedMart web pages and saves the result into a .json file
    • This takes roughly 10 minutes (depending on your internet speed and the RedMart servers). The .json file is roughly 200 MB in size
python src/redmart.py
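
The date-stamped file names in the project directory (for example `20180820_data.json`) suggest one raw snapshot per scraping day. A minimal sketch of that saving step follows. It assumes the layout shown in the directory tree; `save_raw_snapshot` is a hypothetical helper, not a function from `src/redmart.py`.

```python
import json
from datetime import date
from pathlib import Path


def save_raw_snapshot(items, root="data/redmart/raw", today=None):
    """Write scraped items to <root>/<YYYYMMDD>_data.json and return the path.

    The default root follows the directory tree in this README (an assumption).
    """
    today = today or date.today()
    out_dir = Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)  # create data/.../raw if missing
    out_path = out_dir / f"{today:%Y%m%d}_data.json"
    with out_path.open("w", encoding="utf-8") as fh:
        json.dump(items, fh, ensure_ascii=False)
    return out_path
```

Naming the file after the scrape date keeps earlier snapshots intact, so several days of raw data can sit side by side.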

The redmart collection is roughly 60 MB in size and has ~62,000 unique items.

  • Extract data into MongoDB
    • This would extract the relevant information from the raw json data and add it into MongoDB.
    • The extracted data will be saved into a db called Grocery and in a collection called redmart
python src/make_mongodb_redmart.py
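
The extraction step can be pictured as "keep a few fields per item, then bulk insert". The sketch below assumes hypothetical field names (`title`, `price`), since the real schema is not shown here, and it is not the code in `src/make_mongodb_redmart.py`. The collection argument is anything with an `insert_many` method, such as a pymongo collection.

```python
def extract_fields(raw_item, keep=("title", "price")):
    """Keep only the listed keys of one raw item (field names are hypothetical)."""
    return {key: raw_item[key] for key in keep if key in raw_item}


def load_into_collection(collection, raw_items):
    """Insert the trimmed documents and return how many were inserted."""
    docs = [extract_fields(item) for item in raw_items]
    if docs:  # insert_many rejects an empty list in pymongo
        collection.insert_many(docs)
    return len(docs)


# Usage with pymongo (assumes pymongo is installed and MongoDB runs locally):
#   from pymongo import MongoClient
#   collection = MongoClient()["Grocery"]["redmart"]
#   load_into_collection(collection, raw_items)
```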

Tableau Analysis

Those who are familiar with Tableau can connect MongoDB to Tableau.

How to connect MongoDB to Tableau

Interactive Dashboard

We have come up with an Interactive Dashboard ( <-- Click )

Go Full Screen for maximum viewing pleasure.

Screenshot of Dashboard

redmart_analysis

FairPrice

The FairPrice script takes roughly 4 hours to run. It outputs a .csv file called $TODAY_DATE_links.csv, which contains all the product listing links of FairPrice, and creates a collection called fairprice of ~6 MB in size with ~6,500 unique items.
To run the script, you need to do the following:

  • Download the Selenium ChromeDriver
  • Unzip the chromedriver
  • Set the chrome_driver variable in line 65 of utils.py to the path of your downloaded chromedriver

With these, you are ready to run. The script gives you the following outputs:

  • date_links.csv (36 mins): csv file which contains all the product links
  • A Grocery MongoDB with a fairprice collection of product documents
python src/fairprice.py
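
For the links file, one possible shape of the writing step is sketched below. It removes repeated links while keeping their order and writes one link per row. The `link` header and the `save_links` helper are assumptions made for illustration; the actual column layout produced by `src/fairprice.py` is not shown in this README.

```python
import csv
from datetime import date
from pathlib import Path


def save_links(links, root="data/fairprice/raw", today=None):
    """Write de-duplicated product links to <root>/<YYYYMMDD>_links.csv."""
    today = today or date.today()
    unique_links = list(dict.fromkeys(links))  # drops repeats, keeps first-seen order
    out_dir = Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{today:%Y%m%d}_links.csv"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["link"])  # hypothetical header name
        writer.writerows([link] for link in unique_links)
    return out_path
```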

Interactive Dashboard

We have come up with an Interactive Dashboard ( <-- Click )

Go Full Screen for maximum viewing pleasure.

Cold Storage

TODO

MongoDB

We can run some commands using the MongoDB shell.
Alternatively, Compass, the default MongoDB GUI, can do the job.

Screenshot of the GroceryDB and its collections

mongodb_preview

  1. Open MongoDB shell
mongo
  2. Show databases
    • You should see a db called Grocery
show dbs
  3. Show collections
    • This would display the two collections:
      1. fairprice
      2. redmart
use Grocery
show collections
  4. Count Query
    • We can count the number of items in the collections
db.redmart.find().count()
  5. Get Data Size
db.redmart.dataSize()
  6. Drop collections
db.redmart.drop()
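
The same checks can also be done from Python. The sketch below mirrors the show-collections, count and data-size steps with pymongo-style calls (`list_collection_names`, `count_documents`, and the `collstats` command, whose `size` field reports the data size in bytes). `summarize_collection` is a hypothetical helper, and the usage lines assume pymongo is installed and a local MongoDB server is running.

```python
def summarize_collection(db, name):
    """Return the collection names, document count and data size for one collection."""
    stats = db.command("collstats", name)  # same source as dataSize() in the shell
    return {
        "collections": sorted(db.list_collection_names()),
        "count": db[name].count_documents({}),
        "data_size_bytes": stats["size"],
    }


# Usage with pymongo (assumption: default host and port):
#   from pymongo import MongoClient
#   db = MongoClient()["Grocery"]
#   print(summarize_collection(db, "redmart"))
#   db["redmart"].drop()  # equivalent of db.redmart.drop()
```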

Contributors

derekchia, notha99y
