This is a recipe for running a streaming pipeline in Python: subscribe to a Pub/Sub topic, extract the data, and store it in a BigQuery table.
- Python 3.7
- Apache Beam
- A GCP project
- Dataflow and Pub/Sub APIs enabled
- Access to BigQuery
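If the APIs are not yet enabled on your project, one way to turn them on from the command line (assuming you have the gcloud CLI installed and authenticated):

```bash
# Enable the services this recipe depends on (run once per project).
gcloud services enable dataflow.googleapis.com \
    pubsub.googleapis.com \
    bigquery.googleapis.com
```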
First, clone this repo:

```bash
git clone https://github.com/meirelon/streaming_data.git
cd streaming_data
```
Next, create a virtual environment with Python 3.7 and install the Apache Beam dependency:

```bash
virtualenv -p python3.7 py37
source py37/bin/activate
pip install -r requirements.txt
```
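The repo's requirements.txt should pull in everything needed. If you ever need to install the core dependency by hand, Beam with its GCP extras is published on PyPI (the extras name below is the standard one, not something specific to this repo):

```bash
pip install "apache-beam[gcp]"
```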
Then create a new bucket for streaming logs:

```bash
gsutil mb gs://[PROJECT ID]-streaming
```

Create a folder within this bucket called tmp; it will be used as the pipeline's --temp_location later in the process.
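Note that GCS has no true folders: the tmp/ prefix comes into existence as soon as an object is written under it, which Dataflow will do on its own. If you'd rather materialize it up front, one way is to copy an empty placeholder object there (the object name is arbitrary):

```bash
# Optional: create an empty placeholder object under the tmp/ prefix.
echo "" | gsutil cp - gs://[PROJECT ID]-streaming/tmp/placeholder
```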
For this example, we are going to use the NYC Taxi Rides Pub/Sub topic provided by GCP. Now we deploy! The command looks like the following:
```bash
python -m example --runner DataflowRunner \
    --project [PROJECT ID] \
    --temp_location gs://[PROJECT ID]-streaming/tmp/ \
    --input_topic "projects/pubsub-public-data/topics/taxirides-realtime" \
    --output_table "[PROJECT ID]:[DATASET NAME].[TABLE NAME]" \
    --streaming
```
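The `example` module itself isn't reproduced here, but the core of a pipeline like this is small. Below is a minimal sketch of what it might look like, assuming the messages are JSON; the schema fields (ride_id, latitude, longitude, timestamp) are illustrative guesses for the taxi ride payload, not confirmed against the repo:

```python
# example.py -- minimal sketch of a streaming Pub/Sub-to-BigQuery pipeline.
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic", required=True,
                        help="Pub/Sub topic to read from.")
    parser.add_argument("--output_table", required=True,
                        help="BigQuery table as PROJECT:DATASET.TABLE.")
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Remaining args (--runner, --project, --temp_location, --streaming, ...)
    # are handed straight to Beam.
    options = PipelineOptions(pipeline_args, streaming=True,
                              save_main_session=True)

    with beam.Pipeline(options=options) as p:
        (p
         # Read raw bytes from the topic; Beam manages its own subscription.
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
               topic=known_args.input_topic)
         # Each message is assumed to be a JSON object.
         | "ParseJson" >> beam.Map(json.loads)
         # Stream rows into BigQuery, creating the table if needed.
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               known_args.output_table,
               schema=("ride_id:STRING,latitude:FLOAT,"
                       "longitude:FLOAT,timestamp:TIMESTAMP"),
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == "__main__":
    run()
```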
More information about subscribing to the NYC Taxi Rides Pub/Sub topic is available here. Please read about how Pub/Sub works here.
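If you want to see what the raw messages look like before deploying the pipeline, one option is to attach a throwaway subscription to the public topic and pull a message (the subscription name below is arbitrary):

```bash
gcloud pubsub subscriptions create taxi-peek \
    --topic projects/pubsub-public-data/topics/taxirides-realtime
gcloud pubsub subscriptions pull taxi-peek --limit 1 --auto-ack
```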
I am happy to review PRs, and if you liked this example, I also gladly accept cryptocurrency!