Giter VIP home page Giter VIP logo

brahimmade2016 / stockitup Goto Github PK

View Code? Open in Web Editor NEW

This project forked from govardhan1194/stockitup

0.0 0.0 0.0 888 KB

A unified streaming pipeline to process historical as well as real time stock data to provide insightful dashboards to rookie investors and to also provide clean,processed, critical features to data scientists on demand. Utilizing the tiered storage feature to eliminate the need for a datawarehouse and using s3 as an offload storage for historical data.

Python 100.00%

stockitup's Introduction

StockItUp

Introduction:

Stock data is one of the resources that is widely used and is in demand. It can be utilized in several ways to provide insightful information in various aspects. There are numerous stand-alone approaches by individuals to address each use case. StockItUp aims to bridge this gap by providing an easy to use single infrastructure, where data can be manipulated in any way which would facilitate any use case.

StockItUp is a unified streaming pipeline to process historical as well as real-time stock data to provide insightful dashboards to rookie investors and also to provide clean, processed, critical features to data scientists on-demand. Another important business aspect that StockItUp addresses is that it eliminates the need for an external data warehouse, making this a data warehouse-less infrastructure.

The Unified Approach:

StockItUp utilizes a single tool called Apache Pulsar, a pub/sub messaging system where data from the data source is ingested and making use of Pulsar functions, the data is processed and stored in another topic. By setting up an appropriate retention period threshold either by setting up a time or ledger size, the data is then offloaded to an external object storage service like s3. Towards the end, utilizing pulsar SQL, data can be queried from the cluster as well as from s3, to satisfy front end query demands. This facilitates the data requirements for data scientists and also for a front end which provides an insightful dashboard for rookie investors.

Typical tech stack:

TypicalTechStack

A typical tech stack for a project like this would look like this where there is a data source and the data is ingested by Kafka and processed by spark streaming and so forth. But if you look at this infrastructure, there are a lot of tools interacting with one another and there are a lot of complications. However, Pulsar eliminated all these complications with its inbuilt features.

Tech stack that was implemented:

TechStackImplemented

The data source that I am using here is called XETRA, which is the reference market for exchange trading in German shares and exchange-traded funds.

Data is ingested from the data source into the Pulsar cluster. Pulsar functions are utilized to process the raw data ingested, for example, create new columns which indicate profit or loss, have data only during the trading hours, etc., and store them in another topic called processedData within the cluster.

A retention period of a month is set, after which data from the cluster is offloaded into s3 utilizing the Tiered storage feature of Pulsar.

Complex queries can be executed on the available data through Pulsar SQL according to the front end queries as the data is being streamed in. Also, queries can be executed to retrieve historical data from s3 for data scientists on-demand.

Breaking down the infrastructure:

The hero of this infrastructure is Apache Pulsar, however, multiple nodes within the pulsar cluster enable the functioning of the entire pipeline. This section shows how many nodes are used for this project.

  • 3 pulsar broker nodes/ BookKeeper nodes
  • 3 Zookeeper nodes
  • 3 Pulsar SQL (Presto) nodes
  • 1 Producer node
  • 1 Consumer node

Key Takeaways:

  • Scalability and high throughput can be achieved by utilizing a cluster of Pulsar brokers. More number of nodes can improve the performance.
  • Historical data and real-time data can be made available using Pulsar.
  • Data warehouse-less warehouse can be implemented by utilizing the tiered storage feature provided by Pulsar
  • Apache pulsar is really powerful and has a lot of features!

Appendix:

Data set: Deutsche Börse Public Dataset - https://registry.opendata.aws/deutsche-boerse-pds/

Presentation Slides: https://bit.ly/3hR3GBl

stockitup's People

Contributors

govardhan1194 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.