PH_miner

A ProductHunt.com miner in Python3.

Installation

Execute the following commands:

$ git clone https://github.com/collab-uniba/PH_miner.git
$ cd PH_miner
$ git submodule init
$ git submodule update

Setup

  1. Register two apps, PH_miner and PH_updater, using the Product Hunt API dashboard.

  2. For the first app, in the root folder, create the file credentials_miner.yml with the following structure:

api:
  key: CLIENT_KEY
  secret: CLIENT_SECRET
  redirect_uri: APP_REDIRECT_URI
  dev_token: DEVELOPER_TOKEN
  3. For the second app, follow the same steps as above to create the file credentials_updater.yml.
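
Since both credentials files share the same structure, a quick sanity check before running the miner can catch a missing key early. The following is a minimal sketch, not part of PH_miner itself: the function name validate_credentials and the REQUIRED_KEYS tuple are hypothetical, and the check works on an already-parsed mapping (for instance, the result of PyYAML's yaml.safe_load on credentials_miner.yml).

```python
REQUIRED_KEYS = ("key", "secret", "redirect_uri", "dev_token")


def validate_credentials(config):
    """Return the `api` mapping, or raise ValueError naming what is missing."""
    api = config.get("api") if isinstance(config, dict) else None
    if not isinstance(api, dict):
        raise ValueError("missing top-level 'api' section")
    missing = [k for k in REQUIRED_KEYS if not api.get(k)]
    if missing:
        raise ValueError("missing or empty keys: " + ", ".join(missing))
    return api


# Example with an already-parsed mapping that mirrors the structure above.
creds = validate_credentials({
    "api": {
        "key": "CLIENT_KEY",
        "secret": "CLIENT_SECRET",
        "redirect_uri": "APP_REDIRECT_URI",
        "dev_token": "DEVELOPER_TOKEN",
    }
})
```

The same check applies unchanged to credentials_updater.yml.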

  4. Create the folder db/cfg/ and, inside it, the file dbsetup.yml to set up the connection to the MySQL database:

mysql:
    host: 127.0.0.1
    user: root
    passwd: *******
    db: producthunt
    recycle: 3600

NOTE: With MySQL, the default pool_recycle value for resetting the database connection is fine, since wait_timeout defaults to 28800 seconds. With MariaDB, however, wait_timeout is set to 600 seconds by default. In that case, edit the my.cnf file and raise wait_timeout to a value larger than the one chosen for pool_recycle.
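
The relationship between the two settings can be checked in a few lines. This is a sketch under stated assumptions: the helper names build_db_url and recycle_is_safe are hypothetical, the mysql+pymysql driver prefix is only an example, and the recycle key is assumed to be the value passed on as pool_recycle.

```python
from urllib.parse import quote_plus


def build_db_url(cfg, driver="mysql+pymysql"):
    """Build a SQLAlchemy-style URL from the `mysql` section of dbsetup.yml.

    The driver name is an assumption; use whichever MySQL driver is installed.
    """
    m = cfg["mysql"]
    # Percent-encode the password so characters like '@' or '/' stay intact.
    return "{}://{}:{}@{}/{}".format(
        driver, m["user"], quote_plus(str(m["passwd"])), m["host"], m["db"]
    )


def recycle_is_safe(recycle, wait_timeout):
    """True when connections are recycled before the server drops them."""
    return recycle < wait_timeout


# Mapping that mirrors dbsetup.yml, with a made-up password.
cfg = {"mysql": {"host": "127.0.0.1", "user": "root", "passwd": "p@ss",
                 "db": "producthunt", "recycle": 3600}}
url = build_db_url(cfg)
```

With recycle set to 3600, recycle_is_safe(3600, 28800) holds for the MySQL default, while recycle_is_safe(3600, 600) does not, which is why wait_timeout needs raising in the 600-second case. If the project connects through SQLAlchemy, the two values would typically be handed over as create_engine(url, pool_recycle=3600).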

  5. Install packages via pip:
$ pip install -r requirements.txt
  6. Enable execution via crontab:
$ crontab -e

Add the following lines, replacing the placeholder paths with the actual location of PH_miner.

SHELL=/bin/bash
# New products are uploaded at 12:01 AM PST (just past midnight, i.e. 9:01 AM CET):
# minute hour day-of-month month day-of-week command
    35     8       *          *       *       /path/.../to/PH_miner/cronjob.sh >> /var/log/ph_miner.log 2>&1
    05    20       *          *       *       /path/.../to/PH_miner/cronjob.sh --update -c credentials_updater.yml >> /var/log/ph_miner_updates.log 2>&1
    */30   *       *          *       *       /path/.../to/PH_miner/cronjob.sh --newest -c credentials_updater.yml >> /var/log/ph_miner.log 2>&1
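
The time conversion mentioned in the crontab comment can be verified with a short calculation. The sketch below uses fixed winter offsets (UTC-8 for PST, UTC+1 for CET) rather than a timezone database, picks an arbitrary date, and assumes the server clock runs on CET.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets, used here to avoid depending on a tz database.
PST = timezone(timedelta(hours=-8), "PST")
CET = timezone(timedelta(hours=1), "CET")

# 12:01 AM Pacific on an arbitrary winter date.
upload_pst = datetime(2024, 1, 15, 0, 1, tzinfo=PST)
upload_cet = upload_pst.astimezone(CET)

# Daily mining job from the crontab above, on a server assumed to run on CET.
daily_job = upload_cet.replace(hour=8, minute=35)
minutes_before_upload = int((upload_cet - daily_job).total_seconds() // 60)
```

Under these assumptions the upload time falls at 09:01 CET on the same calendar day, and the 08:35 job starts 26 minutes before it.
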
  7. Enable the rotation of the log files:
$ sudo ln -s /fullpath/to/../ph_miner.logrotate /etc/logrotate.d/ph_miner
  8. Install the Chromium browser and ChromeDriver:

This step depends on the OS. On Ubuntu boxes, run:

$ sudo apt-get install chromium-browser chromium-chromedriver
$ sudo ln -s /usr/lib/chromium-browser/chromedriver /usr/bin/chromedriver

Resources & Libraries

  • Product Hunt API
  • ph_py - ProductHunt.com API wrapper in Python
  • Scrapy - A scraping and web-crawling framework
  • Selenium - A suite of tools for automating web browsers
  • ChromeDriver - Tool to connect to Chromium web browser
  • Beautiful Soup 4 - HTML parser

License

The project is licensed under the MIT license.

ph_miner's People

Contributors

bateman, dependabot[bot], gianfrancobrescia, snyk-bot, yashovardhansingh

ph_miner's Issues

Extract today's popular

The list of today's popular apps must be retrieved and stored in a local CSV file named popular_YYYY-mm-dd.csv.
The following information must be mined for each app:

  • name
  • tagline
  • URL of the product logo image
  • URL of the product page
  • list of topics
  • number of upvotes
  • number of comments
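
One way to produce such a file is sketched below with the standard csv module. The column names in FIELDS, the function name write_popular_csv, and the choice of ';' to join topics into a single cell are illustrative assumptions, not decisions taken by the project.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["name", "tagline", "logo_url", "product_url", "topics",
          "upvotes", "comments"]


def write_popular_csv(products, day, out_dir="."):
    """Write one row per product to popular_YYYY-mm-dd.csv and return its path."""
    path = Path(out_dir) / "popular_{}.csv".format(day.strftime("%Y-%m-%d"))
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for p in products:
            row = dict(p)
            # Flatten the topic list into one cell; ';' is an arbitrary choice.
            row["topics"] = ";".join(p["topics"])
            writer.writerow(row)
    return path
```

Calling write_popular_csv(products, date.today()) would then create the file for the current day in the working directory.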

Extract reviews

Only comments can be retrieved via the API. Mining the product reviews requires developing a custom scraper with Scrapy and Selenium.

Extract basic info from product pages

Given the link to the specific page of a 'product of the day', extract the following pieces of info:

  • Links to featured pictures and videos
  • Description
  • Links to 'Around the Web' articles talking about the app (if any)
  • Name of the Hunter and link to their personal profile page
  • Names of the maker(s) and link to their personal profile page(s)
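
Links scraped from a page are often relative, so they usually need resolving before being stored. A minimal sketch with the standard library follows; the helper name absolute_url and the sample hrefs are hypothetical, and the actual markup of a product page is not assumed here.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.producthunt.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site root; absolute URLs pass through."""
    return urljoin(base, href)


# Hypothetical hrefs, as they might appear in a product page's markup.
hunter_link = absolute_url("/@some_hunter")
article_link = absolute_url("https://example.com/review-of-the-app")
```

Profile links are resolved against the site root, while 'Around the Web' links that are already absolute come back unchanged.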

Extract discussion from product pages

From a product page, extract the discussion thread, retrieving specifically:

  • the name of the comment author, link to the profile page, and role (PRO, MAKER, or nothing)
  • time when the comment was posted
  • text of the comment
  • the number of upvotes received
  • TODO: are replies and comments treated differently?!?
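
One possible answer to the open question above is to store replies and comments as the same kind of record, distinguished only by an optional parent reference. The sketch below illustrates that option; the Comment class, its field names, and the depth helper are hypothetical and do not reflect a decision made in the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    comment_id: int
    author: str
    profile_url: str
    role: Optional[str]               # "PRO", "MAKER", or None
    posted_at: str                    # timestamp as scraped
    text: str
    upvotes: int
    parent_id: Optional[int] = None   # None for top-level comments


def depth(comment, by_id):
    """Number of ancestors: 0 for a top-level comment, 1 for a direct reply."""
    d = 0
    while comment.parent_id is not None:
        comment = by_id[comment.parent_id]
        d += 1
    return d


# Made-up thread: one top-level comment and one reply from a maker.
top = Comment(1, "alice", "/@alice", None, "2018-03-01T10:00", "Nice app!", 5)
reply = Comment(2, "bob", "/@bob", "MAKER", "2018-03-01T10:05", "Thanks!", 2,
                parent_id=1)
by_id = {c.comment_id: c for c in (top, reply)}
```

With this layout a thread can be rebuilt from a flat table, and depth(reply, by_id) tells a reply apart from a top-level comment without a separate record type.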

Extract reviews from product pages

From a product page, extract the reviews received, storing:

  • Reviewer name and link to their profile page
  • Review date
  • Review text
  • Recommended or not
  • Pros and cons
  • Helpful counts
  • Comments count
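
Because pros and cons are lists, a review does not fit neatly into a single flat CSV row. The sketch below shows one way to hold the fields above and serialize them as a JSON line; the Review class, its field names, and the JSON-lines storage format are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Review:
    reviewer: str
    profile_url: str
    date: str
    text: str
    recommended: bool
    pros: List[str] = field(default_factory=list)
    cons: List[str] = field(default_factory=list)
    helpful_count: int = 0
    comments_count: int = 0


def to_json_line(review):
    """Serialize one review as a single JSON line, keeping the lists intact."""
    return json.dumps(asdict(review), ensure_ascii=False, sort_keys=True)


# Made-up review used only to show the round trip.
sample = Review("carol", "/@carol", "2018-03-02", "Works well.", True,
                pros=["fast", "simple"], cons=["no dark mode"],
                helpful_count=3)
line = to_json_line(sample)
```

Each line can be appended to a file and read back with json.loads, one review at a time.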
