The project creates a distributed MapReduce program for parallel processing of randomly generated log messages, to extract insights about the log levels based on different characteristics. The problem is broken down into smaller tasks that can be executed in parallel using MapReduce in Apache Hadoop. The job is run on Amazon EMR, which provides an elastic, low-cost way to execute the program quickly.
- Install SBT to build the jar.
- A terminal to SSH and SCP into the VM to execute Hadoop commands.
- Install VMware Workstation Pro and run the Hortonworks Sandbox, which ships with an Apache Hadoop installation.
- An AWS account to execute the jar file in Amazon EMR.
- Clone the Git repository using git clone https://github.com/Gnanamanickam/LogFileGenerator.git
- Run the following commands in the console:
sbt clean compile test
sbt clean compile run
- The above commands build, compile, and test the project.
- If you are using IntelliJ, clone the repository via "Check out from Version Control" and then "Git".
- The Scala version should be set in Global Libraries under File > Project Structure.
- Add an SBT configuration via Edit Configurations; the simulations can then be run in the IDE.
- Now build the project and run the MapReduce job in Hadoop to get the output.
sbt clean compile assembly
The above command generates the jar file in the target folder under the Scala version path; this jar has to be moved into the HDFS file system to execute the MapReduce job in the Hadoop environment.
Run the main method in the log-data generator class to generate random log values. For this project, 20k records were generated to be used as the input.
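As a rough sketch of what the generator does (the actual generator class and its configuration in this project may differ; the object and method names below are hypothetical), random log lines could be produced like this:

```scala
import scala.util.Random

// Hypothetical sketch of a random log-line generator; the real
// generator class in this project may format lines differently.
object LogSketch {
  private val levels = Vector("DEBUG", "INFO", "WARN", "ERROR")

  // Produce one random log line: "HH:mm:ss LEVEL message <n>"
  def randomLine(rng: Random): String = {
    val h = rng.nextInt(24); val m = rng.nextInt(60); val s = rng.nextInt(60)
    val level = levels(rng.nextInt(levels.size))
    f"$h%02d:$m%02d:$s%02d $level random message ${rng.nextInt(100)}"
  }

  // Generate n lines; a fixed seed makes the output reproducible.
  def generate(n: Int, seed: Long = 42L): Seq[String] = {
    val rng = new Random(seed)
    Seq.fill(n)(randomLine(rng))
  }
}
```

With `n = 20000` this mirrors the 20k-record input used for the job.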
MapReduce is used for parallel processing of big data over distributed systems. It consists of a Mapper and a Reducer.
The Mapper's job is to take a set of key-value pairs and produce an intermediate set of key-value pairs, while the Reducer's job is to take the intermediate key-value pairs produced by the Mappers and reduce them by grouping values that share the same key.
We will execute MapReduce in the Hadoop environment.
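Conceptually, the Mapper emits a `(logLevel, 1)` pair per line and the Reducer sums the counts per key. A minimal in-memory sketch of that flow in plain Scala (not the actual Hadoop job, which would use the `org.apache.hadoop.mapreduce` Mapper/Reducer classes):

```scala
// In-memory sketch of the Mapper -> shuffle -> Reducer flow.
object MapReduceSketch {
  private val levels = Set("DEBUG", "INFO", "WARN", "ERROR")

  // Mapper: one log line -> an intermediate (level, 1) key-value pair
  def mapLine(line: String): Option[(String, Int)] =
    line.split("\\s+").find(levels).map(_ -> 1)

  // Shuffle + Reducer: group intermediate pairs by key and sum the values
  def reduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def run(lines: Seq[String]): Map[String, Int] =
    reduce(lines.flatMap(mapLine))
}
```

In Hadoop the grouping step is performed by the framework's shuffle phase between the map and reduce tasks; here `groupBy` stands in for it.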
- Start the Hortonworks Sandbox in VMware Workstation Pro. The URL to use for SSH will be displayed on the screen once it starts.
- Use the URL to SSH and log in to the Hortonworks Sandbox:
ssh -p 2222 [email protected]
The default password is hadoop. Set the password you require after that. Make sure to use port 2222 to log in and execute Hadoop commands.
su - hdfs
Switch to the hdfs user to execute the MapReduce job in the Hadoop environment.
scp -P 2222 Path\LogFileGenerator-assembly-0.1.jar [email protected]:/home/hdfs
scp -P 2222 Path\input.txt [email protected]:/home/hdfs
Copy the jar and the input file to the VM. SCP copies files from the local system to the VM.
cd /home/hdfs
hdfs dfs -mkdir /LogFileGenerator
Create an HDFS directory into which the local files will be copied.
hdfs dfs -copyFromLocal LogFileGenerator-assembly-0.1.jar /LogFileGenerator
The above command copies the file from the local file system to the HDFS file system.
hdfs dfs -rm -R /LogFileGenerator/LogFileGenerator-assembly-0.1.jar
In case we need to delete the file, we can use the above command.
hadoop jar LogFileGenerator-assembly-0.1.jar ClassName inputfile outputfile
To execute a MapReduce job, use the above command to run the particular class from the given jar. The input file is taken and parsed in the code.
hdfs dfs -rm -R outputfile
The above command deletes the output file in case we want to rerun the jar and produce a new output.
hdfs dfs -text outputFile/part-r-00000
To view the output file, use the above command.
DEBUG,781
ERROR,75
INFO,5284
WARN,1407
The above output is a CSV produced by the MapReduce job, showing the log-level distribution between a given start and end time interval in the input file produced by the log generator.
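Assuming the interval bounds are passed as HH:mm:ss strings and the timestamp is the first field of each line (both assumptions; the real job reads them from its configuration), the interval filter plus level count can be sketched as:

```scala
// Sketch of the time-interval log-level distribution; field layout
// ("HH:mm:ss LEVEL message") is an assumption about the input format.
object IntervalDistribution {
  // Count log levels for lines whose HH:mm:ss timestamp falls in [start, end].
  // HH:mm:ss strings compare correctly with plain lexicographic ordering.
  def countInInterval(lines: Seq[String], start: String, end: String): Map[String, Int] =
    lines
      .map(_.split("\\s+"))
      .collect {
        case Array(ts, level, _*) if ts.take(8) >= start && ts.take(8) <= end =>
          level
      }
      .groupBy(identity)
      .map { case (lvl, xs) => lvl -> xs.size }
}
```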
01:53:49 5
01:54:04 4
01:54:38 3
01:54:47 3
01:54:05 2
01:53:59 2
The above output is the ERROR log count per second (timestamps split to second precision), printed in descending order of count.
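Again assuming the timestamp is the first field of each line, the per-second ERROR count above can be sketched as:

```scala
// Sketch of the per-second ERROR count, sorted by count descending.
object ErrorsPerSecond {
  def countErrors(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .map(_.split("\\s+"))
      .collect { case Array(ts, level, _*) if level == "ERROR" => ts.take(8) }
      .groupBy(identity)                        // bucket by HH:mm:ss
      .map { case (ts, xs) => ts -> xs.size }   // count per bucket
      .toSeq
      .sortBy(-_._2)                            // descending by count
}
```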
DEBUG 2060
ERROR 200
INFO 13987
WARN 3756
The above output is the count of each log level in the given input file.
DEBUG 87
ERROR 79
INFO 105
WARN 97
The above output is the number of characters in the longest log message in the input file for each log level. Only messages matching the regex pattern with two consecutive digits are considered.
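A sketch of that computation, assuming the "two consecutive digits" pattern is `\d{2}` and the message is everything after the level field (both assumptions; the project's actual regex and line layout may differ):

```scala
// Sketch: longest message length per level, restricted to messages
// containing two consecutive digits. The regex is an assumption.
object LongestMessage {
  private val twoDigits = "\\d{2}".r

  def longestPerLevel(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.split("\\s+", 3))                  // timestamp, level, rest-of-line
      .collect {
        case Array(_, level, msg) if twoDigits.findFirstIn(msg).isDefined =>
          level -> msg.length
      }
      .groupBy(_._1)
      .map { case (lvl, xs) => lvl -> xs.map(_._2).max }
}
```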
- Create an account on aws.amazon.com and create an IAM user.
- Create an S3 bucket and upload the jar and the input file to it.
- Go to Amazon EMR and create a cluster, which is used for rapid processing and analysis of big data in AWS.
- Configure the steps as required and run the job.
- On completion, check the output folder, which will contain the output of the MapReduce class executed in the Hadoop environment on EMR.
Youtube Link : https://youtu.be/0_RzY-82LeQ