Giter VIP home page Giter VIP logo

markoshlima / positional-file-process Goto Github PK

View Code? Open in Web Editor NEW
1.0 2.0 0.0 894 KB

This project is based for legacy applications that works with positional files to process data. The objetive is read these positional files when they arrives in AWS S3, and then send to a dataware-house like AWS Redshift, and finally read the results with a Business Intelligence tool as AWS QuickSight.

aws-lambda aws-sqs aws-sns aws-athena aws-quicksight aws-redshift python analytics aws-s3 aws-glue cloudformation

positional-file-process's Introduction

Introduction

This project is based for legacy applications that works with positional files to process data. The objetive is read these positional files when they arrives in AWS S3, and then send to a dataware-house like AWS Redshift, and finally read the results with a Business Intelligence tool as AWS QuickSight. It's a simple flow, but very rich and usefull for these kind of application.

To simulate this scenario, it is using Soccer Brazilian Championship. The files in folder '/data' contains three positional files that can be read and loaded into Redshift throught AWS Glue.

The files contains positional data like this: 2708202002305Neo Química Arena Rodolpho Toski Marques Corinthians Fortaleza Tiago Nunes Rogério Ceni 101501010501141211130206000521101007

And the target is the folowing: { "dia":"27", "mes":"08", "ano":"2020", "hora":"02", "minuto":"30", "rodada":"5", "colocacao_mandante":"10", "colocacao_visitante":"15", "gols_mandante":"0", "gols_visitante":"1", "escanteios_mandante":"01", "escanteios_visitante":"05", "faltas_mandante":"01", "faltas_visitante":"14", "chutes_bola_parada_mandante":"12", "chutes_bola_parada_visitante":"11", "desefas_mandante":"13", "desefas_visitante":"02", "impedimentos_mandante":"06", "impedimentos_visitante":"00", "chutes_mandante":"05", "chutes_visitante":"21", "chutes_fora_mandante":"10", "chutes_fora_visitate":"10", "estadio":"Neo Química Arena", "arbitro":"Rodolpho Toski Marques", "time_mandante":"Corinthians", "time_visitante":"Fortaleza", "tecnico_mandante":"Tiago Nunes", "tecnico_visitate":"Rogério Ceni" }

Positions

Below is showed the contract positions about the file.

name begin end size
dia 1 2 2
mês 3 4 2
ano 5 8 4
hora 9 10 2
minuto 11 12 2
rodada 13 13 1
estadio 14 53 40
arbitro 54 83 30
time mandante 84 96 13
time visitante 97 109 13
técnico mandante 110 124 15
técnico visitante 125 139 15
colocação mandante 140 141 2
colocação visitante 142 143 2
gols mandante 144 144 1
gols visitante 145 145 1
escanteios mandante 146 147 2
escanteios visitante 148 149 2
faltas mandante 150 151 2
faltas visitante 152 153 2
chutes bola parada mandante 154 155 2
chutes bola parada visitante 156 157 2
defesas mandante 158 159 2
defesas visitante 160 161 2
impedimentos mandante 162 163 2
impedimentos visitante 164 165 2
chutes mandante 166 167 2
chutes visitante 168 169 2
chutes fora mandante 170 171 2
chutes fora visitante 172 173 2

Application Architecture

alt text

  • The files can be uploaded into S3
  • S3 triggers a Lambda Function that will starts a Glue Job
  • The Glue Job will read this file and process with PySpark to do the ETL Job.
  • After transformation, the DataFrame will be streamed into a SQS queue.
  • SQS will trigger a lambda funtion to do transational operations about each line/event like data normalization
  • Lambda will save these normalized data into Redshift
  • The data from Redshift could be readen from QuickSight
  • The data from S3 could be readen from Athena
  • Every trigger there is an exception flow

Quicksight Results

After all process has been executed with sucess, we could see the results inside Redshift. In this example was created some graphs and reports about the Brasileirão, and it will changing according new files been processed. Here it was listed:

  • The current game round
  • Total rounds
  • More gols kicks
  • More gols about all teams
  • The coach that his team had done more fouls.
  • The referee who whistled more games

alt text

alt text

alt text

Pricing

All resources were priced in AWS Calculator, the following link, the final price of this architecture, excluding Lambda, SQS, S3 and Cloudfront services, because it is Free Tier Elegible or the value is low to input in this pricing cotation.

Click here for Pricing Project

Addition Information & Setup

The project is for lab and studies, it is not operating so as not to keep costs in production. That's a reason not existing a Continuos Develivy implemented too.

To setup the environment:

  • Use folder /Iaac in this repository (Infraestructure as a Code) to publish the S3 buckets, SNS, Redshift and also SQS queues in AWS CloudFormation.
  • In Lambda functions there is the SAM Template for creating (CloudFormation) the backend applications as well.
  • In folder /sql there are two scripts: DDL - RedShift Tables.sq that is DDL for redshift tables and Quicksight Custom Query.sql that is the custom query used in Quicksight to explore some graphs.

positional-file-process's People

Contributors

markoshlima avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.