
hadoop-examples's Introduction

Hadoop Examples

Some simple, kinda introductory projects based on Apache Hadoop, meant to be used as guides that make the MapReduce model look less weird or boring.

Preparations & Prerequisites

  • The latest stable version of Hadoop, or at least the one used here (3.3.0).
  • A single-node setup is enough. You can also run the applications on a local cluster or a cloud service, with the needed changes to the map splits and the number of reducers, of course.
  • Of course, having a (somewhat recent) version of Java installed. I have OpenJDK 11.0.5 on a 32-bit Ubuntu 16.04 system, and if I can do it, so can you.

Projects

Each project comes with its very own:

  • input data (.csv, .tsv, or plain text files in a folder, ready to be copied to HDFS).
  • an execution guide (found in the source code of each project; it depends heavily on your Java setup and environment variables, so if the guide doesn't work, you can always google/yahoo/bing/altavista your way to a working run).

The projects featured in this repo are:

Calculating the average price of houses for sale by zipcode.
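The core of that job can be sketched in plain Java, stripped of Hadoop boilerplate; the `zipcode,price` record format here is an assumption for illustration, not taken from the project's dataset.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AverageByZip {
    // Group records by zipcode and average the prices, mirroring what the
    // reducer does with the values collected for each zipcode key.
    public static Map<String, Double> averageByZip(List<String> records) {
        return records.stream()
                .map(r -> r.split(","))
                .collect(Collectors.groupingBy(
                        fields -> fields[0],
                        Collectors.averagingDouble(fields -> Double.parseDouble(fields[1]))));
    }

    public static void main(String[] args) {
        Map<String, Double> avg = averageByZip(List.of(
                "10001,300000", "10001,500000", "94110,900000"));
        System.out.println(avg.get("10001")); // (300000 + 500000) / 2 = 400000.0
    }
}
```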

A typical "sum-it-up" example where, for each bank, we calculate the number and the total amount of its transfers.
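Both aggregates can be carried in a single pass, the way a combiner/reducer pair would accumulate them; the `bank,amount` record format is assumed here for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BankTransfers {
    // For each bank, accumulate {count, sum} of its transfers in one pass.
    public static Map<String, double[]> countAndSum(List<String> records) {
        Map<String, double[]> out = new HashMap<>(); // bank -> {count, sum}
        for (String r : records) {
            String[] fields = r.split(",");
            double[] acc = out.computeIfAbsent(fields[0], k -> new double[2]);
            acc[0] += 1;                               // number of transfers
            acc[1] += Double.parseDouble(fields[1]);   // total amount
        }
        return out;
    }

    public static void main(String[] args) {
        double[] acc = countAndSum(List.of("alpha,100", "alpha,250", "beta,40")).get("alpha");
        System.out.println(acc[0] + " transfers, total " + acc[1]); // 2.0 transfers, total 350.0
    }
}
```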

A typical case of finding the max recorded temperature for every city.
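The reducer logic is a max over each city's readings, which in plain Java collapses to a merge function; the tab-separated `city<TAB>temperature` format is an assumption, not the project's actual schema.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MaxTemperature {
    // Keep the maximum temperature seen for each city key.
    public static Map<String, Integer> maxPerCity(List<String> lines) {
        return lines.stream()
                .map(l -> l.split("\t"))
                .collect(Collectors.toMap(
                        fields -> fields[0],
                        fields -> Integer.parseInt(fields[1]),
                        Math::max)); // merge duplicate keys by taking the max
    }

    public static void main(String[] args) {
        System.out.println(maxPerCity(List.of("Athens\t41", "Athens\t38", "Oslo\t25")).get("Athens")); // 41
    }
}
```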

An interesting application working on Olympic Games stats in order to compute each athlete's total gold, silver, and bronze medal wins.

Just a plain old normalization example for a bunch of students and their grades.
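Min-max scaling to [0, 1] is one common reading of "plain old normalization"; the project may use a different scheme, so treat this as an illustrative sketch only.

```java
import java.util.Arrays;

public class NormalizeGrades {
    // Min-max normalization: map each grade to (g - min) / (max - min).
    public static double[] normalize(double[] grades) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double g : grades) {
            min = Math.min(min, g);
            max = Math.max(max, g);
        }
        double[] out = new double[grades.length];
        for (int i = 0; i < grades.length; i++)
            out[i] = (grades[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new double[]{5, 7.5, 10}))); // [0.0, 0.5, 1.0]
    }
}
```

Note that in MapReduce this takes two passes (or a shared min/max), since no single reducer call sees all grades unless they share a key.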

Finding the oldest tree per city district. Child's play.

A bit more challenging than the rest. Every key character (A-E) has 3 numbers as values: two negative and one positive. We calculate the score for every character based on the following expression: character_score = pos / (-1 * (neg_1 + neg_2)).
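The expression translates directly to Java; this is just the per-key arithmetic the reducer would apply once it has gathered all three values.

```java
public class CharacterScore {
    // Direct translation of: character_score = pos / (-1 * (neg_1 + neg_2))
    public static double score(double pos, double neg1, double neg2) {
        return pos / (-1 * (neg1 + neg2));
    }

    public static void main(String[] args) {
        // neg_1 + neg_2 = -5, so the denominator is -1 * -5 = 5.
        System.out.println(score(10, -2, -3)); // 10 / 5 = 2.0
    }
}
```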

A simple way to calculate the symmetric difference between the records of two files, based on each record's ID.
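In set terms the job computes (A ∪ B) \ (A ∩ B) over the two files' record IDs, i.e. the IDs that appear in exactly one file; a minimal sketch over in-memory sets:

```java
import java.util.HashSet;
import java.util.Set;

public class SymmetricDifference {
    // IDs present in exactly one of the two record sets.
    public static Set<String> symmetricDiff(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);                  // union of both ID sets
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);               // intersection
        result.removeAll(common);          // union minus intersection
        return result;
    }

    public static void main(String[] args) {
        System.out.println(symmetricDiff(Set.of("1", "2", "3"), Set.of("2", "3", "4")));
    }
}
```

In the MapReduce version the same effect falls out of the shuffle: each ID becomes a key, and the reducer emits only keys that arrived with a single value.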

Filtering out patient records where the PatientCycleNum column is equal to 1 and the Counseling column is equal to No.
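The filter predicate looks like this in plain Java; the column positions (PatientCycleNum and Counseling as the second and third CSV fields) are assumed for illustration and will differ from the project's real schema.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterPatients {
    // Drop rows where PatientCycleNum == 1 AND Counseling == "No";
    // keep everything else. A map-only job needs no reducer for this.
    public static List<String> keep(List<String> rows) {
        return rows.stream()
                .filter(r -> {
                    String[] fields = r.split(",");
                    return !(fields[1].equals("1") && fields[2].equals("No"));
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(keep(List.of("p1,1,No", "p2,1,Yes", "p3,2,No"))); // [p2,1,Yes, p3,2,No]
    }
}
```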

Reading a number of multi-line files and converting them into key-value pairs, with each file's name as the key and that file's content as the value.
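The effect is that of a whole-file record reader; as a plain-Java equivalent (with no Hadoop input format involved), mapping each file name in a directory to its full content looks like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class WholeFileKV {
    // Map each file name in a directory to the file's entire content.
    public static Map<String, String> filesAsPairs(Path dir) {
        Map<String, String> pairs = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files)
                pairs.put(f.getFileName().toString(), Files.readString(f));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pairs;
    }

    // Self-contained demo over a temporary directory.
    public static Map<String, String> demo() {
        try {
            Path dir = Files.createTempDirectory("kv");
            Files.writeString(dir.resolve("a.txt"), "first file");
            Files.writeString(dir.resolve("b.txt"), "second file");
            return filesAsPairs(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {a.txt=first file, b.txt=second file}
    }
}
```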

The most challenging yet. Term frequency is calculated over 5 input documents. The goal is to find the document with the max TF for each word, along with how many documents contain that word.
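The max-TF half of that computation can be sketched as follows; raw word counts stand in for TF here, and the project may normalize or tokenize differently.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxTf {
    // For each word, find the document where it has the highest raw
    // term frequency (occurrence count).
    public static Map<String, String> maxTfDoc(Map<String, String> docs) {
        Map<String, String> best = new HashMap<>();       // word -> doc with max TF
        Map<String, Integer> bestCount = new HashMap<>(); // word -> that max TF
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> tf = new HashMap<>();
            for (String w : doc.getValue().split("\\s+"))
                tf.merge(w, 1, Integer::sum);             // per-document counts
            for (Map.Entry<String, Integer> e : tf.entrySet())
                if (e.getValue() > bestCount.getOrDefault(e.getKey(), 0)) {
                    bestCount.put(e.getKey(), e.getValue());
                    best.put(e.getKey(), doc.getKey());
                }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of("d1", "a a b", "d2", "a b b b");
        System.out.println(maxTfDoc(docs).get("b")); // d2 (three occurrences vs. one)
    }
}
```

Document frequency (how many documents contain each word) falls out of the same loop by incrementing a counter once per word per document.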

A simple merge of WordCount and TopN examples to find the 10 most used words in 5 input documents.
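The same merge, as in-memory Java: count words, then keep the n most frequent. In the MapReduce version the TopN step runs over the WordCount output rather than a single string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopWords {
    // WordCount followed by TopN: count words, sort by count descending,
    // keep the n most frequent.
    public static List<String> topN(String text, int n) {
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topN("to be or not to be that is the question to", 2)); // [to, be]
    }
}
```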


Check out the equivalent Spark Examples here.

hadoop-examples's People

Contributors

coursal


hadoop-examples's Issues

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:

Hi, I am a new Hadoop learner. When I use your code to run Hadoop, I get this problem. I worked through the Word Count example and it ran normally (https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Partitioner). But your code is not working when I run the command hadoop jar Bank_Transfers.jar Bank_Transfers. Please help me understand. Thanks, have a good day.

 2023-03-18 00:45:31,378 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /127.0.0.1:8032

2023-03-18 00:45:31,626 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

2023-03-18 00:45:31,671 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/bigdata/.staging/job_1679072275142_0004

2023-03-18 00:45:31,946 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/bigdata/.staging/job_1679072275142_0004

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/bigdata/bank_dataset

	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:340)

	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:279)

	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:404)

	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)

	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)

	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)

	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1571)

	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1568)

	at java.base/java.security.AccessController.doPrivileged(Native Method)

	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)

	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)

	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1568)

	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1589)

	at Bank_Transfers.main(Bank_Transfers.java:113)

	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

	at java.base/java.lang.reflect.Method.invoke(Method.java:566)

	at org.apache.hadoop.util.RunJar.run(RunJar.java:323)

	at org.apache.hadoop.util.RunJar.main(RunJar.java:236)

Caused by: java.io.IOException: Input path does not exist: hdfs://localhost:9000/user/bigdata/bank_dataset

	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:313)

	... 19 more

