
consultingprojatinsight's Introduction

Technologies/Programming Languages used:

1. Python
2. Spark
3. Google Cloud Platform

Project setup:

1. Set up a cluster
    a. The cluster can be set up through the Dataproc console on GCP. The cluster that runs the PySpark jobs
        on GCP is a four-node cluster: one master and three worker nodes.
    b. The cluster can also be set up via gcloud commands. The sampleworkflow.sh script in the tools folder
        can be used as a reference for setting up a cluster this way.
        (Note: using gcloud on an instance requires the right set of permissions for the service account.)
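As a rough sketch, a four-node cluster like the one described in step 1 could be created with a command along these lines (the cluster name, region, and machine types below are placeholders, not the project's actual values — see sampleworkflow.sh for the real setup):

```shell
# Hypothetical cluster name, region, and machine types; substitute your own.
# Creates one master and three workers (four nodes total).
gcloud dataproc clusters create insight-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --num-workers=3 \
    --worker-machine-type=n1-standard-4
```

Running this requires an authenticated gcloud session with Dataproc permissions on the project.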
2. Set up a workflow template to run the job
    a. The shell script (workflow.sh) in the tools folder sets up a workflow template.
    b. A workflow template is already set up. You can click "RUN" at (omitted for public github) to run the job.
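The template setup in step 2a can be sketched with gcloud as follows (template, cluster, bucket, and script names here are hypothetical — workflow.sh in the tools folder is the authoritative version):

```shell
# Hypothetical names throughout; see tools/workflow.sh for the project's actual values.
# Create the template, attach a managed cluster, and add the PySpark step.
gcloud dataproc workflow-templates create insight-template --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster insight-template \
    --region=us-central1 --cluster-name=insight-cluster --num-workers=3
gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/transform.py \
    --step-id=transform --workflow-template=insight-template --region=us-central1
```

With a managed cluster, Dataproc creates the cluster when the template is instantiated and tears it down when the job finishes.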
3. Behind the scenes:
    a. The Python scripts that do the reading and transformation of the data are copied to the storage bucket.
    b. The PySpark job is set up to run the script from the above-mentioned bucket.
    c. The final output of the script is saved to a table on BigQuery.
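Steps 3a and 3b amount to copying the script to Cloud Storage and pointing a PySpark job at it; a minimal sketch with placeholder bucket, script, and cluster names (not the project's actual ones):

```shell
# Hypothetical bucket/script/cluster names.
# 3a: copy the transformation script to the storage bucket.
gsutil cp transform.py gs://my-bucket/transform.py

# 3b: run the script from the bucket as a one-off PySpark job on the cluster.
gcloud dataproc jobs submit pyspark gs://my-bucket/transform.py \
    --cluster=insight-cluster --region=us-central1
```

In this project the job runs through the workflow template rather than as a one-off submit, but the moving parts (bucket copy, PySpark job, BigQuery output) are the same.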
4. How can the whole process be automated? (work in progress)
    a. Run a cron job from a Compute Engine instance.
        i. Check out the code from GitHub and run the job.sh script from the tools folder.
            This script instantiates the workflow template that is already set up.
        ii. Blocker: the service account does not have the right credentials.
    b. Set up a "serverless" mode that uses GCP's Cloud Scheduler and Pub/Sub to run the batch job.
        i. AWS has AWS Batch to run batch jobs; GCP does not have a direct equivalent.
        ii. Instructions to set up the batch flow on GCP:
            https://medium.com/google-cloud/running-a-serverless-batch-workload-on-gcp-with-cloud-scheduler-cloud-functions-and-compute-86c2bd573f25
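For option 4a, the cron entry could look something like this (the checkout path and schedule are hypothetical, and this remains blocked on the service-account credentials noted above):

```shell
# Hypothetical crontab entry: nightly at 02:00, pull the latest code and
# run job.sh, which instantiates the existing workflow template.
0 2 * * * cd /home/user/consultingprojatinsight && git pull && ./tools/job.sh
```

The instance's service account would need Dataproc permissions (e.g. a role allowing workflow-template instantiation) for job.sh to succeed.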

Testing:

1. The job reads the data from three different sources and puts the final set of data into the analytics table.
2. There should be exactly one row per id. Hence, the following query should yield zero results:
    select a.id, count(*) from `analytics` a
    group by a.id having count(*) > 1;
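The uniqueness check can also be run from the command line with the bq CLI (the full table name is omitted here, as above); the sort | uniq -d pipeline below illustrates the same duplicate test on toy data:

```shell
# Run the check against BigQuery (requires credentials; table name is a placeholder):
# bq query --use_legacy_sql=false \
#   'SELECT id, COUNT(*) AS n FROM `analytics` GROUP BY id HAVING COUNT(*) > 1'

# Toy illustration of the same idea: uniq -d prints only duplicated ids,
# so an empty result means every id is unique.
printf '101\n102\n102\n103\n' | sort | uniq -d   # prints 102
```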
