Automation pipeline to ingest and process data on a Big Data platform using Hadoop HDFS, Hive, and Spark.
Download the Hortonworks Sandbox (VirtualBox / Docker) from
https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html
Ensure VirtualBox (v6.0 or later) or Docker is installed.
Please note that a computer with a minimum of 10 GB RAM dedicated to VirtualBox / Docker is required.
- Import Hortonworks Sandbox
Open VirtualBox and navigate to File -> Import Appliance. Select the sandbox image you downloaded and click Open.
Make sure to allocate at least 10 GB (10240 MB) of RAM for the sandbox.
Click Import and wait for VirtualBox to import the sandbox.
- Start Hortonworks Sandbox. Once the import has finished, start the sandbox by selecting it and clicking "Start" in the VirtualBox menu.
- A console window opens and displays the boot process, which takes a few minutes. When the boot completes, you may begin using the sandbox.
- Map the sandbox IP to the desired hostname in your hosts file. Run the following command in a Linux shell:
echo '127.0.0.1 sandbox-hdp.hortonworks.com' | sudo tee -a /etc/hosts
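You can verify that the hostname resolves:
ping -c 1 sandbox-hdp.hortonworks.com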
ssh [email protected] -p 2201
The default credentials are root / hadoop.
Please remember to change the password after your first login!
Run the 'ambari-admin-password-reset' command to reset the default Ambari admin password (default: admin/admin).
Alternatively, you can access the sandbox from the shell web client (remember to add the hostname sandbox-hdp.hortonworks.com to your hosts file) at http://sandbox-hdp.hortonworks.com:4200/ or http://localhost:4200
The default login credentials are root / hadoop.
You must change the password when you log in for the first time; the new password is then used for accessing the sandbox through both SSH and the web client.
Access the HDP sandbox Ambari UI at http://sandbox-hdp.hortonworks.com:8080 or http://localhost:8080
User credentials:
raj_ops / raj_ops
maria_dev / maria_dev
admin / admin
SSH to the sandbox: ssh [email protected] -p 2222
- Download the source data file:
wget https://chronicdata.cdc.gov/views/735e-byxc/rows.csv
- Create an HDFS directory:
hdfs dfs -mkdir /assignment
- Copy the file from the local file system to HDFS:
hdfs dfs -put rows.csv /assignment
- Verify the file was copied to HDFS:
hdfs dfs -ls /assignment
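Optionally, compare the file size with the local copy:
hdfs dfs -du -h /assignment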
- Create the Hive table. This creates the table in the default schema:
CREATE TABLE IF NOT EXISTS health_record (
  YearStart INT, YearEnd INT, LocationAbbr STRING, LocationDesc STRING,
  Datasource STRING, Class STRING, Topic STRING, Question STRING,
  Data_Value_Unit STRING, Data_Value_Type STRING, Data_Value DOUBLE,
  Data_Value_Alt DOUBLE, Data_Value_Footnote_Symbol STRING,
  Data_Value_Footnote STRING, Low_Confidence_Limit DOUBLE,
  High_Confidence_Limit DOUBLE, Sample_Size INT, Total STRING,
  `Age(months)` STRING, Gender STRING, `Race/Ethnicity` STRING,
  GeoLocation STRING, ClassID STRING, TopicID STRING, QuestionID STRING,
  DataValueTypeID STRING, LocationID INT, StratificationCategory1 STRING,
  Stratification1 STRING, StratificationCategoryId1 STRING,
  StratificationID1 STRING
)
COMMENT 'This is health record table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1");
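Note that OpenCSVSerde stores every column as STRING regardless of the declared types, so numeric fields need an explicit cast when queried, for example:
SELECT Gender, AVG(CAST(Data_Value AS DOUBLE)) AS avg_value
FROM health_record
GROUP BY Gender;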
- Describe the table:
describe health_record;
- Load the data from HDFS into the Hive table (note: LOAD DATA INPATH moves the file out of /assignment into the Hive warehouse):
LOAD DATA INPATH '/assignment/rows.csv' OVERWRITE INTO TABLE health_record;
- Check the row count to verify the data loaded correctly:
select count(*) from health_record;
- Query the table data:
SELECT * FROM health_record limit 5;
Executed Hive queries can be viewed in the Hive Query UI.
- Copy the PySpark script health_avg_calculation.py to the HDP sandbox:
scp -P 2222 ./health_avg_calculation.py [email protected]:/root
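For reference, a minimal sketch of what health_avg_calculation.py might contain (assumed contents for illustration; the actual script may differ, and the results table name health_avg_results is hypothetical):

# health_avg_calculation.py - minimal sketch (assumed contents)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

# Create a Spark session with Hive support so Hive tables can be read and written
spark = (SparkSession.builder
         .appName("HealthAvgCalculation")
         .enableHiveSupport()
         .getOrCreate())

# OpenCSVSerde exposes every column as a string, so cast before aggregating
records = spark.table("default.health_record") \
    .withColumn("data_value", col("data_value").cast("double"))

# Average Data_Value per year and question
averages = (records
            .groupBy("yearstart", "question")
            .agg(avg("data_value").alias("average")))

# Persist the results back to Hive (hypothetical table name)
averages.write.mode("overwrite").saveAsTable("default.health_avg_results")

spark.stop()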
- Execute spark-submit to run the program (note that spark-submit options must come before the application script, otherwise they are passed to the script as arguments):
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
health_avg_calculation.py
Navigate to the Spark UI to view the Spark job history (typically the Spark2 History Server at http://sandbox-hdp.hortonworks.com:18081 on HDP).
The calculated average results are persisted in Hive.
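If the script persists its output to a Hive table (health_avg_results in the sketch above), the results can be verified from the Hive shell:
SELECT * FROM health_avg_results LIMIT 5;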
A RESTful API service was developed using Spring Boot to access the results from Hive.
Go to the rest-api directory and run mvn install to build the project.
Ensure that the port (7755) configured in src/main/resources/application.properties is available.
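For reference, the relevant properties might look like the following minimal sketch (the Hive JDBC settings are assumptions to adjust for your environment; HiveServer2 listens on port 10000 by default):

# Port the REST API listens on
server.port=7755
# Hive JDBC connection (assumed values; adjust for your sandbox)
spring.datasource.url=jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default
spring.datasource.driver-class-name=org.apache.hive.jdbc.HiveDriver
spring.datasource.username=raj_ops
spring.datasource.password=raj_ops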
Once the build is successful, it generates restapi-1.0-SNAPSHOT.jar in the target directory.
Copy it anywhere you want and execute java -jar 'path of the restapi jar'. The command below brings up the REST API server, listening on port 7755 for API requests.
java -jar target/restapi-1.0-SNAPSHOT.jar
Open a new terminal window and call the APIs using curl:
- To get the list of databases:
curl http://localhost:7755/api/hive/database
- To get the list of tables:
curl http://localhost:7755/api/hive/table
- To get the average for all age groups:
curl http://localhost:7755/api/hive/average
- To get the average for females only:
curl http://localhost:7755/api/hive/femaleAverage
Alternatively, access the APIs from a REST client or browser URL to get the API response:
http://localhost:7755/api/hive/femaleAverage
http://localhost:7755/api/hive/average
Each REST API response contains the year, question, and average value.
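For example, a response might be shaped as follows (placeholder values; the actual payload depends on the controller implementation):

[
  {
    "year": "<year>",
    "question": "<question text>",
    "average": "<average value>"
  }
]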