The project creates a distributed MapReduce program for parallel processing of randomly generated log messages, to extract insights about the log levels based on different characteristics. The problem is broken down into smaller tasks that can be executed in parallel using MapReduce in Apache Hadoop. The job is run on Amazon EMR, which provides an elastic, low-cost way to execute the program quickly.
- Install SBT to build the jar.
- A terminal to SSH and SCP into the VM to execute Hadoop commands.
- Install VMware Workstation Pro and run the Hortonworks Sandbox, which ships with an Apache Hadoop installation.
- An AWS account to execute the jar file in Amazon EMR.
- Clone the Git repository using git clone https://github.com/Gnanamanickam/LogFileGenerator.git
- Run the following commands in the console:
sbt clean compile test
sbt clean compile run
- The above commands build, compile, and test the project.
- If you are using IntelliJ, clone the repository via "Check out from Version Control" and then "Git".
- The Scala version should be set in Global Libraries under File > Project Structure.
- Add an SBT configuration via Edit Configurations; the simulations can then be run in the IDE.
- Now build the project and run the MapReduce job in Hadoop to get the output.
sbt clean compile assembly
The above command generates the jar file in the target folder under the Scala version path; this jar has to be moved into the HDFS file system to execute the MapReduce job in the Hadoop environment.
Run the main method in the log-data generator class to generate random log values. For this project, 20k records were generated to be used as the input.
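As a rough sketch of what the generator does (the actual generator class and its configuration in this project may differ; the object and method names below are hypothetical), random log lines could be produced like this:

```scala
import scala.util.Random

// Hypothetical sketch of a random log-line generator; the real
// generator class in this project may format lines differently.
object LogSketch {
  private val levels = Vector("DEBUG", "INFO", "WARN", "ERROR")

  // Produce one random log line: "HH:mm:ss LEVEL message <n>"
  def randomLine(rng: Random): String = {
    val h = rng.nextInt(24); val m = rng.nextInt(60); val s = rng.nextInt(60)
    val level = levels(rng.nextInt(levels.size))
    f"$h%02d:$m%02d:$s%02d $level random message ${rng.nextInt(100)}"
  }

  // Generate n lines; a fixed seed makes the output reproducible.
  def generate(n: Int, seed: Long = 42L): Seq[String] = {
    val rng = new Random(seed)
    Seq.fill(n)(randomLine(rng))
  }
}
```

With `n = 20000` this mirrors the 20k-record input used for the job.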
MapReduce is used for parallel processing of big data over distributed systems. It consists of a Mapper and a Reducer.
The Mapper's job is to take a set of key-value pairs and produce an intermediate set of key-value pairs, while the Reducer's job is to take the intermediate key-value pairs produced by the Mappers and reduce them by grouping values that share the same key.
We will execute MapReduce in the Hadoop environment.
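Conceptually, the Mapper emits a `(logLevel, 1)` pair per line and the Reducer sums the counts per key. A minimal in-memory sketch of that flow in plain Scala (not the actual Hadoop job, which would use the `org.apache.hadoop.mapreduce` Mapper/Reducer classes):

```scala
// In-memory sketch of the Mapper -> shuffle -> Reducer flow.
object MapReduceSketch {
  private val levels = Set("DEBUG", "INFO", "WARN", "ERROR")

  // Mapper: one log line -> an intermediate (level, 1) key-value pair
  def mapLine(line: String): Option[(String, Int)] =
    line.split("\\s+").find(levels).map(_ -> 1)

  // Shuffle + Reducer: group intermediate pairs by key and sum the values
  def reduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def run(lines: Seq[String]): Map[String, Int] =
    reduce(lines.flatMap(mapLine))
}
```

In Hadoop the grouping step is performed by the framework's shuffle phase between the map and reduce tasks; here `groupBy` stands in for it.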
- Start the Hortonworks Sandbox in VMware Workstation Pro. The URL to use for SSH will be displayed on the screen once it starts.
- Use the URL to SSH and log in to the Hortonworks Sandbox:
ssh -p 2222 [email protected]
The default password is hadoop. Set the password you require after that. Make sure to use port 2222 to log in and execute Hadoop commands.
su - hdfs
Switch to the hdfs user to execute the MapReduce job in the Hadoop environment.
scp -P 2222 Path\LogFileGenerator-assembly-0.1.jar [email protected]:/home/hdfs
scp -P 2222 Path\input.txt [email protected]:/home/hdfs
Copy the jar and the input file to the VM. SCP copies files from the local system to the VM.
cd /home/hdfs
hdfs dfs -mkdir /LogFileGenerator
Create an HDFS directory into which the local files will be copied.
hdfs dfs -copyFromLocal LogFileGenerator-assembly-0.1.jar /LogFileGenerator
The above command copies the file from the local file system to the HDFS file system.
hdfs dfs -rm -R /LogFileGenerator/LogFileGenerator-assembly-0.1.jar
In case we need to delete the file, we can use the above command.
hadoop jar LogFileGenerator-assembly-0.1.jar ClassName inputfile outputfile
To execute a MapReduce job, use the above command to run the particular class from the given jar. The input file is taken and parsed in the code.
hdfs dfs -rm -R outputfile
The above command deletes the output file in case we want to rerun the jar and produce a new output.
hdfs dfs -text outputFile/part-r-00000
To view the output file, use the above command.
DEBUG,781
ERROR,75
INFO,5284
WARN,1407
The above output is a CSV produced by the MapReduce job, showing the log-level distribution between a given start and end time interval in the input file produced by the log generator.
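Assuming the interval bounds are passed as HH:mm:ss strings and the timestamp is the first field of each line (both assumptions; the real job reads them from its configuration), the interval filter plus level count can be sketched as:

```scala
// Sketch of the time-interval log-level distribution; field layout
// ("HH:mm:ss LEVEL message") is an assumption about the input format.
object IntervalDistribution {
  // Count log levels for lines whose HH:mm:ss timestamp falls in [start, end].
  // HH:mm:ss strings compare correctly with plain lexicographic ordering.
  def countInInterval(lines: Seq[String], start: String, end: String): Map[String, Int] =
    lines
      .map(_.split("\\s+"))
      .collect {
        case Array(ts, level, _*) if ts.take(8) >= start && ts.take(8) <= end =>
          level
      }
      .groupBy(identity)
      .map { case (lvl, xs) => lvl -> xs.size }
}
```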
01:53:49 5
01:54:04 4
01:54:38 3
01:54:47 3
01:54:05 2
01:53:59 2
The above output is the ERROR log count per second (timestamps split to second precision), printed in descending order of count.
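Again assuming the timestamp is the first field of each line, the per-second ERROR count above can be sketched as:

```scala
// Sketch of the per-second ERROR count, sorted by count descending.
object ErrorsPerSecond {
  def countErrors(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .map(_.split("\\s+"))
      .collect { case Array(ts, level, _*) if level == "ERROR" => ts.take(8) }
      .groupBy(identity)                        // bucket by HH:mm:ss
      .map { case (ts, xs) => ts -> xs.size }   // count per bucket
      .toSeq
      .sortBy(-_._2)                            // descending by count
}
```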
DEBUG 2060
ERROR 200
INFO 13987
WARN 3756
The above output is the count of each log level in the given input file.
DEBUG 87
ERROR 79
INFO 105
WARN 97
The above output is the number of characters in the longest log message in the input file for each log level. Only messages matching the regex pattern with two consecutive digits are considered.
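A sketch of that computation, assuming the "two consecutive digits" pattern is `\d{2}` and the message is everything after the level field (both assumptions; the project's actual regex and line layout may differ):

```scala
// Sketch: longest message length per level, restricted to messages
// containing two consecutive digits. The regex is an assumption.
object LongestMessage {
  private val twoDigits = "\\d{2}".r

  def longestPerLevel(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.split("\\s+", 3))                  // timestamp, level, rest-of-line
      .collect {
        case Array(_, level, msg) if twoDigits.findFirstIn(msg).isDefined =>
          level -> msg.length
      }
      .groupBy(_._1)
      .map { case (lvl, xs) => lvl -> xs.map(_._2).max }
}
```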
- Create an account on aws.amazon.com and create an IAM user.
- Create an S3 bucket and upload the jar and the input file to it.
- Go to Amazon EMR and create a cluster, which is used for rapid processing and analysis of big data in AWS.
- Configure the steps as required and run the job.
- On completion, check the output folder, which will contain the output of the MapReduce class executed in the Hadoop environment on EMR.
Youtube Link : https://youtu.be/0_RzY-82LeQ