
LogFileGenerator

Gnanamanickam Arumugaperumal

Overview

The project creates a distributed MapReduce program that processes randomly generated log messages in parallel to gain insights about the log levels based on different characteristics. The problem is broken down into smaller tasks that can be executed in parallel using MapReduce on Apache Hadoop. The job runs on Amazon EMR, which provides an elastic, low-cost way to execute the program quickly.

Prerequisites

  • Install SBT to build the jar.
  • A terminal to SSH and SCP into the VM to execute Hadoop commands.
  • Install VMware Workstation Pro and run the Hortonworks Sandbox, which ships with an Apache Hadoop installation.
  • An AWS account to execute the jar file on Amazon EMR.

Installation

sbt clean compile test
sbt clean compile run
  • The first command builds the project and runs the tests; the second compiles and runs it.

  • If you are using IntelliJ, clone the repository using "Check out from Version Control" and then "Git".

  • The Scala version should be set in the Global Libraries under Project Structure in the File menu.

  • The SBT configuration should be added via Edit Configurations; the simulations can then be run in the IDE.

  • Now build the project and run the MapReduce job in Hadoop to get the output.

Execution

sbt clean compile assembly

The above command generates the jar file in the target folder under the Scala path in the project root; this jar file has to be moved into the HDFS file system to execute the MapReduce job in the Hadoop environment.

Run the main method of the log-generation class to generate the random log values. For this project, 20k records were generated to be used as the input.
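The generator class itself is not shown in this README; as an illustration only, random log-line generation might look like the following sketch (the line format, the class name LogGeneratorSketch, and the fixed seed are assumptions, not the project's actual code):

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LogGeneratorSketch {
    static final String[] LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"};

    // Produce n random log lines of the (assumed) form "HH:MM:SS LEVEL message-i".
    static List<String> generate(int n, long seed) {
        Random rnd = new Random(seed); // fixed seed so runs are reproducible
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String level = LEVELS[rnd.nextInt(LEVELS.length)];
            LocalTime ts = LocalTime.ofSecondOfDay(rnd.nextInt(86_400));
            lines.add(String.format("%s %s message-%d", ts, level, i));
        }
        return lines;
    }

    public static void main(String[] args) {
        generate(5, 42).forEach(System.out::println);
    }
}
```

The same seed always yields the same lines, which makes the generated input easy to reproduce while debugging the jobs.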

MapReduce

MapReduce is used for parallel processing of big data over distributed systems. It consists of a Mapper and a Reducer.

The Mapper's job is to take a set of key-value pairs and produce an intermediate set of key-value pairs, while the Reducer's job is to take the intermediate key-value pairs produced by the Mappers and reduce them by grouping values that share the same key.

We will execute MapReduce in the Hadoop environment.

Steps

  • Start the Hortonworks sandbox in VMware Workstation Pro. The URL to use for SSH will be displayed on screen once it starts.
  • Use the URL to SSH and log in to the Hortonworks sandbox:
 ssh -p 2222 [email protected]

The default password is hadoop. Set the password you require after that. Make sure to use port 2222 to log in and execute Hadoop commands.

 su - hdfs

Switch to the hdfs user to execute the MapReduce job in the Hadoop environment.

scp -P 2222 Path\LogFileGenerator-assembly-0.1.jar [email protected]:/home/hdfs
scp -P 2222 Path\input.txt [email protected]:/home/hdfs

Copy the jar and the input file to the VM. SCP is used to copy files from the local system to the VM.

cd /home/hdfs
hdfs dfs -mkdir /LogFileGenerator

Create an HDFS directory for copying the local file into HDFS.

hdfs dfs -copyFromLocal LogFileGenerator-assembly-0.1.jar /LogFileGenerator

The above command copies the file from the local file system to the HDFS file system.

hdfs dfs -rm -R /LogFileGenerator/LogFileGenerator-assembly-0.1.jar

In case we need to delete the file, we use the above command.

hadoop jar LogFileGenerator-assembly-0.1.jar ClassName inputfile outputfile

To execute the MapReduce job, use the above command to run the particular class from the given jar. The input file is taken and parsed at the code level.

hdfs dfs -rm -R outputfile

The above command deletes the output directory in case we want to rerun the jar and produce a new output.

hdfs dfs -text outputfile/part-r-00000

To view the output file, use the above command.

Output

Task1 -> Log distribution between time intervals

DEBUG,781
ERROR,75
INFO,5284
WARN,1407

The above output is a CSV generated by executing the MapReduce job to compute the log-level distribution between the given start and end times in the randomly generated input file.
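The mapper for this task can be thought of as emitting a (level, 1) pair only when a line's timestamp falls inside the requested window; the following single-process sketch illustrates that filter (the line format "HH:MM:SS LEVEL message" and the class name are assumptions):

```java
import java.time.LocalTime;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IntervalCountSketch {
    // Count log levels only for lines whose timestamp lies in [start, end].
    static Map<String, Integer> countInInterval(List<String> lines,
                                                LocalTime start, LocalTime end) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ", 3);    // timestamp, level, message
            LocalTime ts = LocalTime.parse(parts[0]);
            if (!ts.isBefore(start) && !ts.isAfter(end)) {
                counts.merge(parts[1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("01:53:49 ERROR boom",
                                     "01:54:04 INFO ok",
                                     "02:10:00 WARN late");
        System.out.println(countInInterval(lines,
                LocalTime.parse("01:50:00"), LocalTime.parse("02:00:00")));
        // {ERROR=1, INFO=1} -- the WARN line is outside the window
    }
}
```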

Task2 -> Log distribution between time intervals to print error logs in descending order

01:53:49        5
01:54:04        4
01:54:38        3
01:54:47        3
01:54:05        2
01:53:59        2

The above output shows the timestamps split by second, with the error log count calculated for each second and printed in descending order.
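Ordering the reduced counts by value is the final step of this task; a sketch of that ordering logic (the method and class names are assumptions, and a distributed job might instead use a second MapReduce pass with a custom sort comparator):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DescendingSortSketch {
    // Sort (timestamp -> error count) entries so the largest counts come first.
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("01:53:49", 5, "01:54:04", 4, "01:54:38", 3);
        for (Map.Entry<String, Integer> e : byCountDescending(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue()); // largest first
        }
    }
}
```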

Task3 -> Log level count for the given log input

DEBUG   2060
ERROR   200
INFO    13987
WARN    3756

The above output is the count of each log level in the given input file.

Task4 -> Longest log messages that matches a regex pattern

DEBUG   87
ERROR   79
INFO    105
WARN    97

The above output is the number of characters in the longest log message for each log level in the input file, where the message matches the regex pattern of two consecutive digits.
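This task can be sketched as a mapper that emits (level, message length) for matching messages and a reducer that keeps the maximum per level; the regex \d{2} is inferred from the "two continuous numbers" description above, and the class name and line format are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class LongestMatchSketch {
    // Two consecutive digits, per the "two continuous numbers" description.
    static final Pattern TWO_DIGITS = Pattern.compile("\\d{2}");

    // For each level, keep the length of the longest message matching the pattern.
    static Map<String, Integer> longestMatching(List<String> lines) {
        Map<String, Integer> longest = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ", 2);    // level, message
            if (parts.length < 2) continue;
            if (TWO_DIGITS.matcher(parts[1]).find()) {
                longest.merge(parts[0], parts[1].length(), Math::max);
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("INFO request 42 handled in 17ms",
                                     "INFO ok", "ERROR code 500");
        System.out.println(longestMatching(lines)); // {ERROR=8, INFO=26}
    }
}
```

Because max is associative, the same merge can also run as a combiner on each mapper before the final reduce.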

To run on AWS EMR

  • Create an account at aws.amazon.com and create an IAM user.
  • Create an S3 bucket and upload the jar and the input file to it.
  • Go to Amazon EMR and create a cluster, which enables rapid processing and analysis of big data in AWS.
  • Configure the steps displayed as required and run the job.
  • On completion, check the output folder, which will contain the output of the MapReduce class executed in the Hadoop environment on EMR.

Youtube Link : https://youtu.be/0_RzY-82LeQ

Contributors

gnanamanickam, vineet77
