MapReduce-Kmeans

An implementation of the k-means algorithm using Hadoop and HDFS written in Java.
The program was developed and tested on a Windows 10 machine using hadoop-3.3.5 and a Maven project structure, with K = 3.

  1. Description
  2. Installing & Configuring Hadoop Locally
  3. Running K-Means on Hadoop
  4. Results
  5. Notes

This project implements the k-means clustering algorithm on Hadoop, using synthetic data as a sample. The data can be found at src/main/resources/data.txt and was generated by the DataGenerator.java component, biased towards the 3 initial centers stored in src/main/resources/centroid.txt. A visual representation of the data can be obtained by running the DataPlotter.java file.
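
To make the structure of the algorithm concrete, here is a minimal plain-Java sketch of a single k-means iteration, split the way a MapReduce job would typically split it: a "map" step that assigns each point to its nearest centroid and a "reduce" step that averages the points of each group. This is not the repository's code. The class and method names (KMeansStepSketch, nearestCentroid, mean, step) are hypothetical, points are assumed to be 2D (x,y) pairs, and no Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of one k-means iteration; all names are hypothetical, not the repository's classes.
final class KMeansStepSketch {

    // "Map" step: index of the centroid closest to the point (squared Euclidean distance).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dx = point[0] - centroids[i][0];
            double dy = point[1] - centroids[i][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // "Reduce" step: the new centroid is the mean of the points assigned to it.
    static double[] mean(List<double[]> points) {
        double sumX = 0;
        double sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / points.size(), sumY / points.size() };
    }

    // One full iteration: group points by nearest centroid, then average each group.
    static double[][] step(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(nearestCentroid(p, centroids), k -> new ArrayList<>()).add(p);
        }
        double[][] next = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) {
            List<double[]> assigned = groups.get(i);
            // A centroid with no assigned points keeps its previous position.
            next[i] = (assigned == null) ? centroids[i].clone() : mean(assigned);
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {2, 2}, {9, 9}, {10, 10} };
        double[][] centroids = { {0, 0}, {8, 8} };
        for (double[] c : step(points, centroids)) {
            System.out.println(c[0] + "," + c[1]);
        }
    }
}
```

In a Hadoop job the grouping done here with a HashMap would be handled by the shuffle between mappers and reducers, with the centroid index acting as the key.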

Windows

  1. Watch this video and follow the steps closely.
  2. Open the Windows cmd as an administrator.
  3. Navigate to the folder where you installed Hadoop, e.g. C:\hadoop-3.3.5.
  4. Navigate to its sbin subfolder.
  5. Type start-all.cmd to start all the Hadoop services (daemons).
  6. To confirm that it is working, open http://localhost:9870/ in your browser. Keep this tab open; it will come in handy later.

Warnings!

  1. When setting environment variables, make sure the JAVA_HOME and HADOOP_HOME paths don't contain any spaces.
  2. Hadoop requires Java 8 or later; check which Java versions your Hadoop release supports.
  3. If you still get errors, especially Java exceptions, try searching for them on the web.
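
As a quick sanity check for the first warning, a small Java program along these lines could report whether the two variables are set and free of whitespace. It is only a sketch: the class name EnvPathCheck and the helper isUsablePath are made up for illustration, and the check covers nothing beyond "set, non-empty, no whitespace".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the repository: flags unset or space-containing paths.
final class EnvPathCheck {

    // True when the value is set, non-empty and contains no whitespace character.
    static boolean isUsablePath(String value) {
        return value != null
                && !value.isEmpty()
                && value.chars().noneMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("JAVA_HOME", "HADOOP_HOME");
        for (String name : names) {
            String value = System.getenv(name);
            String status = isUsablePath(value) ? "OK" : "missing or contains spaces";
            System.out.println(name + " = " + value + " -> " + status);
        }
    }
}
```

A path such as C:\Program Files\Java\... would be flagged by this check, while C:\hadoop-3.3.5 would pass.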

Ubuntu Linux

You can install Hadoop on Ubuntu by following this article.

Before you start

Put the data.txt and centroid.txt files from the resources folder into the same directory in HDFS. You can do that by opening a terminal and running:

$ hdfs dfs -copyFromLocal <path-to-data.txt> <destination-folder-in-hdfs>
$ hdfs dfs -copyFromLocal <path-to-centroid.txt> <destination-folder-in-hdfs>

1. Clone this repository and navigate to the folder:

$ git clone https://github.com/nickkatsios/MapReduce-Kmeans.git
$ cd MapReduce-Kmeans

2. Build project using Maven:

$ mvn install

A target folder should be generated with the MapReduce-Kmeans-1.0-SNAPSHOT.jar file inside.

3. Run the k-means algorithm using:

$ cd target
$ hadoop jar KmeansTest-1.0-SNAPSHOT.jar gr.aueb.dmst.nickkatsios.KMeans <input-hdfs-directory> <output-hdfs-directory>

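The input directory is the HDFS directory where you put your data.txt and centroid.txt files. The output directory is the base name from which the per-iteration output folders are derived. If the jar that Maven produced in your target folder has a different name from the one shown in the command, use that name instead.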

You are done! With the example data and centroid files, convergence should be reached after ~10 iterations.
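
For readers wondering what "convergence" means in practice, one common stopping rule is to compare the centroids of two consecutive iterations and stop once none of them has moved more than a small threshold. The sketch below illustrates that rule in plain Java. The exact criterion and threshold used by this project are not stated here, so the names (ConvergenceSketch, hasConverged) and the threshold values are assumptions for illustration only.

```java
// Hypothetical convergence check; the repository's actual stopping rule may differ.
final class ConvergenceSketch {

    // True when every centroid moved less than `threshold` (Euclidean distance)
    // between two consecutive iterations.
    static boolean hasConverged(double[][] previous, double[][] current, double threshold) {
        for (int i = 0; i < previous.length; i++) {
            double dx = previous[i][0] - current[i][0];
            double dy = previous[i][1] - current[i][1];
            if (Math.sqrt(dx * dx + dy * dy) >= threshold) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] before = { {1.0, 1.0}, {5.0, 5.0} };
        double[][] after = { {1.0, 1.001}, {5.0, 5.0} };
        // The first centroid moved by about 0.001, the second did not move.
        System.out.println(hasConverged(before, after, 0.01));
        System.out.println(hasConverged(before, after, 0.0001));
    }
}
```

A driver would typically run one MapReduce job per iteration, read the new centroids back, and repeat until a check like this one succeeds.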

  1. In the browser tab where http://localhost:9870/ (the NameNode) is open, navigate to Utilities --> Browse the file system.
  2. After convergence, one output folder per iteration should have been generated, each named after the output path specified when running the jar. Navigate to the most recent one.
  3. Download the part-r-00000 file and open it with a text editor. It should contain the final centers (x,y).

The cmd output for each iteration (one iteration = one MapReduce job).

The state of the filesystem after running the jar.

The final directory with the final centers in the part-r-00000 file.

The part-r-00000 file opened in Notepad.

This project was made as an assignment for the Big Data Management Systems course at DMST AUEB.

Team members
Nikolaos Katsios 8200071
Theodoros Skondras Mexis 8200156
