DWL-Project_SCJ

Repository for the DWL project of group SCJ. Users of this repository are expected to have basic knowledge of:

Python
AWS services: RDS, Lambda, S3
Apache Airflow
Docker (depending on the operating system)

!! DWL_02 !!

For details on the second part of the project, for the second module of Data Warehouse and Data Lake Systems, please refer to the folder DWL_02.

About the Project

Most people understand relatively early in life that a smart investment strategy can add value to their finances. Many, however, do not decide to invest a portion of their assets until later. The main purpose of this project is to analyse and present the performance of various investment opportunities over the last few months, in order to give newcomers an overview of investment strategies and opportunities.

Data Source

Data was extracted from four APIs:

Binance: prices and other key figures of the top 10 cryptocurrencies (by market cap)
YahooFinance: prices and other key figures of four indices and two precious metals
Reddit: posts in which the investment assets are mentioned
Twitter: count of tweets containing hashtags of the investment assets

TwitterAPI_HistoricalData - File

Code for extracting and loading Twitter API data for the period 1/1/2021 - 5/4/2022. The goal of this script was to load data into the RDS from the start of the analysis period until the start of the daily data load. The code was executed once; the database tables were created as part of the script.

Requirements:

Packages mentioned in the first cell should be installed
Access to a Twitter Academic research account
Database is prepared
Database credentials and Twitter Bearer token are stored in an .env file
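
As an illustration of the kind of request such a script might send, here is a minimal sketch that builds a daily tweet-count query for one hashtag. It assumes the Twitter API v2 full-archive counts endpoint (available to Academic Research accounts) and a hypothetical environment variable name `BEARER_TOKEN`; the actual notebook may use other names and endpoints.

```python
import os

# Full-archive tweet counts endpoint of the Twitter API v2 (assumed here,
# because the README mentions tweet counts and an Academic Research account).
COUNTS_URL = "https://api.twitter.com/2/tweets/counts/all"


def build_counts_request(hashtag, start_time, end_time, bearer_token):
    """Return (url, headers, params) for a daily tweet count of one hashtag."""
    headers = {"Authorization": f"Bearer {bearer_token}"}
    params = {
        "query": f"#{hashtag}",
        "start_time": start_time,  # ISO 8601, e.g. "2021-01-01T00:00:00Z"
        "end_time": end_time,
        "granularity": "day",
    }
    return COUNTS_URL, headers, params


# BEARER_TOKEN is a hypothetical variable name. A package such as
# python-dotenv could copy it from the .env file into os.environ first.
token = os.environ.get("BEARER_TOKEN", "")
url, headers, params = build_counts_request(
    "bitcoin", "2021-01-01T00:00:00Z", "2022-04-05T00:00:00Z", token
)
```

The returned values can be passed to any HTTP client, for example `requests.get(url, headers=headers, params=params)`.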

TwitterAPI_Lambda_dailyload - File

Code of an AWS Lambda function that extracts and loads the previous day's Twitter API data. The function is executed every day.

Requirements:

Packages mentioned in the first cell should be part of a layer in the Lambda function
Access to a Twitter developer account
Database is prepared
Database credentials and Twitter Bearer token are stored as environment variables of the Lambda function
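
The daily-load pattern described above can be sketched as follows. This is not the project's actual handler: the environment variable names `BEARER_TOKEN` and `DB_HOST` are hypothetical, and the API call and database insert are only indicated by a comment.

```python
import os
from datetime import datetime, timedelta, timezone


def previous_day_window(now):
    """Return (start, end) ISO timestamps covering the full UTC day before `now`."""
    today = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = today - timedelta(days=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), today.strftime(fmt)


def lambda_handler(event, context):
    # Hypothetical variable names; they are set in the Lambda configuration.
    bearer_token = os.environ["BEARER_TOKEN"]
    db_host = os.environ["DB_HOST"]
    start, end = previous_day_window(datetime.now(timezone.utc))
    # The API request for [start, end) and the INSERT into RDS would go here.
    return {"start_time": start, "end_time": end}
```

Computing the window from the invocation time keeps the function stateless, so a daily trigger is enough to cover each day exactly once.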

YahoofinanceAPI_HistoricalData - File

Code for extracting and loading YahooFinance API data for the period 1/1/2021 - 31/3/2022. The goal of this script was to load data into the RDS from the start of the analysis period until the start of the daily data load. The code was executed once; the database tables were created as part of the script.

Requirements:

Packages mentioned in the first cell should be installed
Database is prepared
Database credentials are stored in an .env-file
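
Because the script both creates the tables and loads the data, it helps if both steps are safe to repeat. The sketch below shows one way to do that. It uses an in-memory SQLite database purely as a stand-in for the RDS instance, and the table name, columns and rows are made-up placeholders, not the project's schema or real prices.

```python
import sqlite3


def create_and_load(conn, rows):
    """Create the table if it is missing, then insert rows, skipping duplicates."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS yahoo_prices (
            symbol TEXT NOT NULL,
            trade_date TEXT NOT NULL,
            close REAL,
            PRIMARY KEY (symbol, trade_date)
        )
        """
    )
    # INSERT OR IGNORE is SQLite syntax; PostgreSQL would use
    # ON CONFLICT DO NOTHING and MySQL would use INSERT IGNORE.
    conn.executemany(
        "INSERT OR IGNORE INTO yahoo_prices (symbol, trade_date, close) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the RDS connection
rows = [("EXAMPLE", "2021-01-04", 1.0), ("EXAMPLE", "2021-01-05", 2.0)]
create_and_load(conn, rows)
create_and_load(conn, rows)  # running it again does not duplicate rows
```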

YahoofinanceAPI_Lambda_dailyload - File

Code of an AWS Lambda function that extracts and loads the previous day's YahooFinance API data. The function is executed every day.

Requirements:

Packages mentioned in the first cell should be part of a layer in the Lambda function
Database is prepared
Database credentials are stored as environment variables of the Lambda function

BinanceAPI_HistoricalData - File

Code for extracting and loading historical cryptocurrency data from Binance for the period 1/1/2017 - today. The goal of this script is the same as for YahooFinance: load the data into the RDS from the beginning of the period and create the database tables.

Requirements:

Packages mentioned in the first cells of code may need to be installed
Database is prepared
A Binance account is required, including an API connection created in that account
Credentials are stored in an .env file
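
Loading several years of daily prices usually takes more than one API call. The sketch below assumes the public Binance klines endpoint with daily candles and its limit of 1000 candles per request, and splits a long period into consecutive request windows. The symbol `BTCUSDT` in the usage note is only an example, and the actual script may fetch the data differently.

```python
from datetime import datetime, timezone

KLINES_URL = "https://api.binance.com/api/v3/klines"
DAY_MS = 24 * 60 * 60 * 1000
MAX_LIMIT = 1000  # maximum number of candles per klines request


def to_ms(dt):
    """Convert a timezone-aware datetime to a millisecond Unix timestamp."""
    return int(dt.timestamp() * 1000)


def daily_kline_requests(symbol, start, end):
    """Yield one params dict per request, each covering at most MAX_LIMIT days."""
    start_ms, end_ms = to_ms(start), to_ms(end)
    while start_ms < end_ms:
        window_end = min(start_ms + MAX_LIMIT * DAY_MS, end_ms)
        yield {
            "symbol": symbol,
            "interval": "1d",
            "startTime": start_ms,
            "endTime": window_end - 1,  # stop just before the next window
            "limit": MAX_LIMIT,
        }
        start_ms = window_end
```

Each yielded dict can be sent as query parameters to `KLINES_URL`, for example for `daily_kline_requests("BTCUSDT", datetime(2017, 1, 1, tzinfo=timezone.utc), datetime.now(timezone.utc))`.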

BinanceAPI_Lambda_dailyload - File

Code of an AWS Lambda function that extracts and loads the previous day's Binance API data. The function is executed every day.

Requirements:

Packages mentioned in the first cell should be part of a layer in the Lambda function
Database is prepared
A Binance account is required, including an API connection created in that account
Database credentials are stored as environment variables of the Lambda function

Reddit_HistoricalData - File

Code for extracting, lightly transforming and loading comment data from Reddit into an S3 bucket, from 1/1/2021 to circa 5/4/2022.

Requirements:

Packages mentioned in the first cells of code may need to be installed. Alternative:
- Use the requirements.txt in the airflow-docker folder
An .env file in the same folder as the script, containing at least the AWS credentials (Reddit credentials are only necessary if the praw library is used, e.g. to look up certain subreddits). The variable names can be taken from the script.
An S3 bucket on AWS (IMPORTANT: enter the correct bucket name in Line 106)
Enough time if you're looking for bitcoin comments over a long period 😉
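
The "light transformation, then load to S3" step can be sketched as below. The comment field names and the JSON Lines output format are assumptions for illustration, not the project's actual layout. The function takes the S3 client as an argument, so a `boto3.client("s3")` object can be passed in; `put_object` with `Bucket`, `Key` and `Body` is the boto3 call used.

```python
import json


def to_record(comment):
    """Keep a few fields of a raw comment dict (field names are illustrative)."""
    return {
        "id": comment["id"],
        "created_utc": int(comment["created_utc"]),
        "subreddit": comment.get("subreddit", ""),
        # Collapse line breaks and repeated spaces in the comment text.
        "body": " ".join(comment.get("body", "").split()),
    }


def upload_comments(s3_client, bucket, key, comments):
    """Write a list of comments as JSON Lines to s3://bucket/key."""
    body = "\n".join(json.dumps(to_record(c)) for c in comments)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return len(comments)
```

Passing the client in, rather than creating it inside the function, keeps the credentials handling in one place and makes the upload easy to try out with a stub object.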

ApacheAirflow / Reddit_PeriodicalData_Airflow - Folder

All files needed to fetch periodical Reddit data with Apache Airflow. The current configuration was run on a Windows 10 operating system inside a Docker container. It is also possible to run the DAGs outside of a Docker container.
The code is intended to run every three days. If a different period is desired, Line 25 and Line 116 have to be changed.
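
When a period has to be changed in two places, one option is to derive both values from a single constant. The sketch below is hypothetical: it assumes that one of the two lines sets the schedule and the other sets the look-back window of the Reddit query, which the README does not confirm. `PERIOD_DAYS` and `fetch_window` are illustrative names.

```python
from datetime import datetime, timedelta, timezone

PERIOD_DAYS = 3  # hypothetical single place to change the period

# A timedelta like this can be passed to an Airflow DAG as its schedule interval.
SCHEDULE_INTERVAL = timedelta(days=PERIOD_DAYS)


def fetch_window(run_time):
    """Return (start, end) of the data window that one run should cover."""
    return run_time - SCHEDULE_INTERVAL, run_time


start, end = fetch_window(datetime(2022, 4, 8, tzinfo=timezone.utc))
```

With this layout, the schedule and the query window cannot drift apart when the period is changed.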

Requirements: (assuming Apache Airflow will be run in Docker on Windows 10) The installation of Docker + Airflow is very well explained here (text) and here (video).

Install Docker Engine (incl. Docker Compose)
Copy the project folder airflow-docker to the desired location
Run the command below to ensure the container and host computer have matching file permissions:

echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Run docker-compose with the following command. Apache Airflow will be started and missing packages should be installed (using these two files: requirements.txt and Dockerfile):

docker-compose up

Specify the following variables in Apache Airflow:
- AWS credentials (ACCESS_KEY, SECRET_KEY, SESSION_TOKEN)
- S3 bucket name (bucket = 'Bucketname of S3-Bucket')
Run the DAG Reddit_PeriodicalData_Airflow.py
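
The file-permission step above relies on `id -u`, which is only available in a Unix-like shell (for example WSL or Git Bash on Windows). As a hedged alternative, the same .env file could be written with a few lines of Python. The fallback value 50000 is the default user ID suggested in the Airflow Docker documentation for systems without a Unix user ID; treat it as an assumption and check it against the Airflow version in use.

```python
import os
from pathlib import Path


def airflow_env_text(default_uid=50000):
    """Build the .env content; fall back to default_uid where os.getuid is missing."""
    uid = os.getuid() if hasattr(os, "getuid") else default_uid
    return f"AIRFLOW_UID={uid}\nAIRFLOW_GID=0\n"


def write_env(folder="."):
    """Write the .env file into `folder`, replacing any existing .env there."""
    path = Path(folder) / ".env"
    # Write bytes so that line endings stay "\n" on Windows as well.
    path.write_bytes(airflow_env_text().encode("utf-8"))
    return path
```

Calling `write_env("airflow-docker")` would create the file next to the docker-compose file, assuming the folder layout described above.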
