Giter VIP home page Giter VIP logo

bns's Introduction

Bitcoin News Scraper

Requirements

  • docker (>= 1.13)
  • docker-compose (>= 1.8)
  • MongoDB (optionally)

Credits

Screenshot

Screenshot

Building the docker image

Go to project directory.

cd bns/

Build docker image

docker-compose build

Running the docker container

From the project directory execute and wait ~20 secs for all services to boot up

docker-compose up

Stop the docker container with CTRL+C or CTRL+BREAK or CMD+C (OS X). Wait around 10 secs for all services to stop running. Make sure there are no running scrape jobs when stopping the container to avoid possible data corruption.

Usage

Make sure that port 5000 is free before running the docker. Administration is done through web panel (AdminUI). Navigate to http://localhost:5000/ or http://127.0.0.1:5000/ in your browser.

Configuration

First setup your MongoDB connection details, go to Settings section in the AdminUI, and set up the Connection URI and optionally DB name to your preferred DB name. You can change other scrape related settings from here but most of these settings can be changed also during the creation of scrape jobs.

Scrape job

Go to Spiders section, select at least one spider and click Run. You can change default settings (scrape type, results save target and proxy usage) for current job.

Periodic Scrape Job

Go to Periodic jobs section, select at least one spider and click Add Periodic Job. Adjust scrape details and choose pause time between two scrapes. Periodic job for each spider will spawn one regular scrape job after each repeat time/delay end. Take note: periodic jobs are run in delay->job->delay->job... order.

Saving scraped data

JSON

Data is saved to bns/results/feeds/SPIDERNAME.json

MongoDB

Data is by default stored in bitcoin_news database, Articles collection. MongoDB database files are stored in data/mongo directory. Replica set files are stored in data/mongo2 and data/mongo3 directories. It is recommended to change provided keyfile. From project directory issue (on *nix compatible OS-es):

openssl rand -base64 756 > config/mongo.keyfile
chmod 400 config/mongo.keyfile
Images

Images are stored in bns/results/images/SPIDERNAME/full/.

Avoiding scraper blocking

User agents

Scraper automatically loads the user agent list from data/user_agents.txt, and changes user agent for every request. You can change list of user agents from the User Agents section.

Proxies

List of proxies is located in data/proxies.txt, one proxy per line in format:

PROTOCOL://HOST:PORT/

Example proxy list:

http://127.0.0.1:80/
http://127.0.0.1:8080/
http://127.0.0.1:8888/

You can change proxies from the Proxies section. Each proxy will be disabled/removed after number of retries. If there are no valid proxies, spider will stop scraping.

Delays

Change delay between scrape requests.

Reliability

Sometimes scrapes will fail because of connection issues or because target server is down or overloaded. Lower these values to increase speed and decrease reliability or increase them get less speed but more reliable scraping.

Retries

Change number of scrape attempts on one url after failed scrape before giving up.

Timeout

Number of seconds before giving up or another retry.

Scrape mode

Default scraping mode is to try to scrape the all articles. Scrape new will try to fetch the most recent ones (all articles from categories first page). Spider will try to scrape from all collected article urls for which there are no data stored in chosen target storage (JSON and/or MongoDB).

Twitter

To harvest Twitter data files data/twitter_usernames.txt and data/twitter_keywords.txt needs to be populated, one item per line. To access API auth details should be provided from Settings->Twitter API page in webUI.

bns's People

Contributors

fuzzy69 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.