
Health Record ETL Big Data application

An automated pipeline that ingests data into a big-data platform and computes aggregates over it, using Hadoop HDFS, Hive, and Spark.

Prerequisites

Download the Hortonworks Sandbox (VirtualBox / Docker) from
https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html
Ensure VirtualBox (v6.0) / Docker is installed
Note that a machine with at least 10 GB of RAM dedicated to VirtualBox / Docker is required

Environment setup

  1. Import Hortonworks Sandbox

Open VirtualBox and navigate to File -> Import Appliance. Select the sandbox image you downloaded and click Open.

Make sure to allocate at least 10 GB (10240 MB) of RAM to the sandbox.

Click Import and wait for VirtualBox to import the sandbox.

  2. Start Hortonworks Sandbox

Once the sandbox has been imported, start it by selecting it and clicking "Start" in the VirtualBox menu.

  3. A console window opens and displays the boot process. This takes a few minutes; once booting finishes, you can begin using the sandbox.

  4. Map the sandbox IP to the desired hostname in the hosts file by executing the following command in a Linux shell:
echo '127.0.0.1 sandbox-hdp.hortonworks.com' | sudo tee -a /etc/hosts

SSH on to HDP sandbox

ssh root@sandbox-hdp.hortonworks.com -p 2201

The default credentials are root / hadoop.

Remember to change the password after the first login!

Run the 'ambari-admin-password-reset' command to reset the default Ambari admin password (e.g. admin/admin).

Access Shell Web Client

Alternatively, you can access the sandbox from the web shell client (remember to add the hostname sandbox-hdp.hortonworks.com to your hosts file): http://sandbox-hdp.hortonworks.com:4200/ or http://localhost:4200

Log in with the default credentials: root / hadoop

You must change the password at first login. The new password is then used for accessing the sandbox through both ssh and the web client.

Access HDP sandbox UI

Access the HDP sandbox Ambari UI at http://sandbox-hdp.hortonworks.com:8080 or http://localhost:8080

User credentials:

raj_ops / raj_ops
maria_dev / maria_dev
admin / admin

Data Ingestion

SSH to the sandbox: ssh root@sandbox-hdp.hortonworks.com -p 2201

HDFS

  1. Download the source data file:
wget https://chronicdata.cdc.gov/views/735e-byxc/rows.csv
  2. Create an HDFS directory:
hdfs dfs -mkdir /assignment
  3. Copy the file from the local file system into HDFS:
hdfs dfs -put rows.csv /assignment
  4. Verify the file was copied into HDFS:
hdfs dfs -ls /assignment
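The four manual steps above can be scripted. A minimal sketch in Python — the URL, file name, and HDFS path are taken from the steps above, but this helper is illustrative, not part of the project; the `wget` and `hdfs` binaries must be on the sandbox's PATH:

```python
import subprocess

def ingestion_commands(src_url, local_file, hdfs_dir):
    """Build the shell commands mirroring the manual HDFS ingestion steps."""
    return [
        ["wget", "-O", local_file, src_url],                  # 1. download source CSV
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],            # 2. create HDFS directory
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],  # 3. copy into HDFS
        ["hdfs", "dfs", "-ls", hdfs_dir],                     # 4. verify
    ]

def run_ingestion(src_url="https://chronicdata.cdc.gov/views/735e-byxc/rows.csv",
                  local_file="rows.csv", hdfs_dir="/assignment", dry_run=True):
    cmds = ingestion_commands(src_url, local_file, hdfs_dir)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))            # preview only
        else:
            subprocess.run(cmd, check=True)  # stop on the first failure
    return cmds
```

Using `-mkdir -p` and `-put -f` makes the script safe to re-run; drop those flags to reproduce the exact commands listed above.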

Hive

  1. Create the Hive table - this creates the table in the default schema:
CREATE TABLE IF NOT EXISTS health_record(YearStart INT,YearEnd INT,LocationAbbr STRING,LocationDesc STRING,Datasource STRING,Class STRING,Topic STRING,Question STRING,Data_Value_Unit STRING,Data_Value_Type STRING,Data_Value DOUBLE,Data_Value_Alt DOUBLE,Data_Value_Footnote_Symbol STRING,Data_Value_Footnote STRING,Low_Confidence_Limit DOUBLE,High_Confidence_Limit  DOUBLE,Sample_Size INT,Total STRING,`Age(months)` STRING,Gender STRING,`Race/Ethnicity` STRING,GeoLocation STRING,ClassID STRING,TopicID STRING,QuestionID STRING,DataValueTypeID STRING,LocationID INT,StratificationCategory1 STRING,Stratification1 STRING,StratificationCategoryId1 STRING,StratificationID1 STRING)
COMMENT 'This is health record table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
tblproperties("skip.header.line.count"="1"); 

  2. Describe the table:
describe health_record;

  3. Load the data from HDFS into the Hive table:
LOAD DATA INPATH '/assignment/rows.csv' OVERWRITE INTO TABLE health_record;

  4. Check the row count to verify the data loaded correctly:
select count(*) from health_record;

  5. Query the table data:
SELECT * FROM health_record limit 5;

Hive queries can also be viewed in the Hive Query UI.

Spark

Running Spark application

  1. Copy the PySpark script health_avg_calculation.py to the HDP sandbox (note that scp takes the port with a capital -P):
scp -P 2222 ./health_avg_calculation.py root@sandbox-hdp.hortonworks.com:/root
  2. Execute spark-submit to run the program. The spark-submit options must come before the script name, otherwise they are passed to the script as its own arguments:
spark-submit \
    --master yarn \
    --deploy-mode client \
    --num-executors 1 \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    health_avg_calculation.py

Navigate to the Spark UI to view the Spark job history.

The calculated average results are persisted in Hive.
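The contents of health_avg_calculation.py are not shown here. The aggregation it presumably performs — the average Data_Value per year and question, with column names taken from the Hive DDL above — can be sketched in plain Python:

```python
from collections import defaultdict

def average_by_year_question(rows):
    """rows: iterable of dicts with YearStart, Question, Data_Value keys.
    Returns {(year, question): average of the non-null Data_Value values}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        value = r.get("Data_Value")
        if value is None:
            continue  # skip suppressed / missing measurements
        key = (r["YearStart"], r["Question"])
        sums[key] += float(value)
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

# In the actual PySpark job this grouping would be roughly:
#   df.groupBy("YearStart", "Question").agg(avg("Data_Value"))
```

This is a sketch of the logic only; the real script runs the same grouping distributed over Spark executors and writes the result back to a Hive table.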

REST API

API to access results stored in Hive

Developed RESTful API service using Spring Boot to access results from Hive

Go to the rest-api directory and run mvn install to build the project.

Ensure that the port (7755) specified in src/main/resources/application.properties is available.
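The relevant entries in application.properties would look roughly like this. server.port is a standard Spring Boot property; the Hive JDBC settings shown are an assumption based on sandbox defaults (HiveServer2 listens on port 10000), not taken from the project:

```properties
# Port the REST API listens on (must be free on the host)
server.port=7755
# Assumed Hive JDBC connection to the sandbox; adjust to the project's actual datasource config
spring.datasource.url=jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default
spring.datasource.driver-class-name=org.apache.hive.jdbc.HiveDriver
```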

Once the build succeeds, it generates restapi-1.0-SNAPSHOT.jar in the target directory.

Copy it anywhere you like and execute java -jar 'path of the restapi jar'. The command below brings up the REST API server, listening on port 7755 for API requests.

java -jar target/restapi-1.0-SNAPSHOT.jar

REST API Request through curl

Open a new terminal window and call the API using curl:

To get the list of databases:

curl http://localhost:7755/api/hive/database

To get the list of tables:

curl http://localhost:7755/api/hive/table

To get the average across all age groups:

curl http://localhost:7755/api/hive/average

To get the average for females only:

curl http://localhost:7755/api/hive/femaleAverage

Alternatively, access the APIs from a REST client or a browser URL to get the API response:

http://localhost:7755/api/hive/femaleAverage

http://localhost:7755/api/hive/average

The REST API response contains year, question, and average fields.
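The exact JSON shape is not documented here. Assuming each record carries the three fields named above, a client could consume the response like this (the sample payload is illustrative, not captured from the running service):

```python
import json

def parse_average_response(payload):
    """Parse a (hypothetical) /api/hive/average JSON response into tuples."""
    records = json.loads(payload)
    return [(r["year"], r["question"], r["average"]) for r in records]

# Illustrative payload only; field names follow the description above.
sample = '[{"year": 2019, "question": "Obesity", "average": 31.5}]'
print(parse_average_response(sample))
```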

health-record-etl-big-data's People

Contributors

surajagopal, aisuresh

