
scrapy-bigquery


A Scrapy item pipeline for storing scraped items in Google BigQuery.

Dependencies

Installation 📥

This is a Python package hosted on PyPI, so to install it simply run:

pip install scrapy-bigquery

Settings

BIGQUERY_DATASET (Required)

The name of the BigQuery dataset to post items to.

BIGQUERY_TABLE (Required)

The name of the BigQuery table within the dataset to post items to.

BIGQUERY_SERVICE_ACCOUNT (Required)

The base64-encoded JSON of the Google service account used to authenticate with Google BigQuery. You can generate it from a service account key file like so:

cat service-account.json | jq . -c | base64
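If you prefer to produce the value from Python rather than the shell, a minimal sketch of the same minify-then-encode step (assuming your key file is named service-account.json) is:

import base64
import json

# Read the service account key, minify the JSON, then base64-encode it.
with open("service-account.json") as f:
    minified = json.dumps(json.load(f), separators=(",", ":"))

# Paste this output into the BIGQUERY_SERVICE_ACCOUNT setting.
print(base64.b64encode(minified.encode("utf-8")).decode("ascii"))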

BIGQUERY_ADD_SCRAPED_TIME (Optional)

Whether to add the time the item was scraped when posting it to BigQuery. If enabled, the current datetime is added to the scraped_time column of the BigQuery table.

BIGQUERY_ADD_SCRAPER_NAME (Optional)

Whether to add the name of the scraper when posting items to BigQuery. If enabled, the scraper's name is added to the scraper column of the BigQuery table.

BIGQUERY_ADD_SCRAPER_SESSION (Optional)

Whether to add the scraper's session ID when posting items to BigQuery. If enabled, the session ID is added to the scraper_session_id column of the BigQuery table.

BIGQUERY_ITEM_BATCH (Optional)

The number of items to batch together per insert into BigQuery. The higher this number, the faster the pipeline will process items.
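Taken together, the optional settings might look like this in your settings.py. The values are purely illustrative (and treating the ADD_* settings as plain booleans is an assumption based on the descriptions above):

# settings.py -- illustrative values, adjust to your project
BIGQUERY_ADD_SCRAPED_TIME = True      # adds a scraped_time column to each row
BIGQUERY_ADD_SCRAPER_NAME = True      # adds a scraper column to each row
BIGQUERY_ADD_SCRAPER_SESSION = True   # adds a scraper_session_id column to each row
BIGQUERY_ITEM_BATCH = 100             # number of items per insert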

Usage example 👀

To use this plugin, add the following settings to your project's settings.py and substitute your own values:

BIGQUERY_DATASET = "my-dataset"
BIGQUERY_TABLE = "my-table"
BIGQUERY_SERVICE_ACCOUNT = "eyJ0eX=="
ITEM_PIPELINES = {
    "bigquerypipeline.pipelines.BigQueryPipeline": 301
}

The pipeline will attempt to create the dataset/table if they do not exist, inferring the schema from the dictionaries it processes. Be aware that this inference can be flaky (especially if your dictionaries contain nulls), so it is recommended that you create the table before running.
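If you want to create the dataset and table up front with an explicit schema, a minimal sketch using the google-cloud-bigquery client could look like the following. The dataset, table, and field names are placeholders; replace them with your item's fields (note that BigQuery dataset IDs may only contain letters, numbers, and underscores):

from google.cloud import bigquery

# Authenticate with the same service account used by the pipeline.
client = bigquery.Client.from_service_account_json("service-account.json")

# Create the dataset if it does not already exist.
client.create_dataset("my_dataset", exists_ok=True)

# Define an explicit schema instead of relying on type inference.
schema = [
    bigquery.SchemaField("title", "STRING"),        # hypothetical item field
    bigquery.SchemaField("price", "FLOAT"),         # hypothetical item field
    bigquery.SchemaField("scraped_time", "TIMESTAMP"),
]
table = bigquery.Table(f"{client.project}.my_dataset.my_table", schema=schema)
client.create_table(table, exists_ok=True)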

If you want to send a specific item to a different table, add the keys "BIGQUERY_DATASET" and "BIGQUERY_TABLE" to the item you yield back to the pipeline. This overrides where the item is posted, allowing a single scraper to handle more than one item type. These keys/values will not be part of the final row written to the table.
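For example, a spider that routes one item type to a different table might yield something like this (the spider, URL, and field names are purely illustrative):

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Regular item: goes to the dataset/table configured in settings.py.
        yield {"title": response.css("h1::text").get()}

        # Review item: routed to a different table via the override keys,
        # which the pipeline strips before inserting the row.
        yield {
            "rating": 5,
            "BIGQUERY_DATASET": "my-dataset",
            "BIGQUERY_TABLE": "my-reviews-table",
        }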

License

The project is available under the MIT License.
