map-reduce's Introduction

Using Map-Reduce to calculate Diversity Index

DivIndex by Tushar Iyer

The United States Census Bureau (USCB) estimates the number of people in each county in each state. These estimates are categorized by gender, age, race, and other factors.

The diversity index D for a population is the probability that two randomly chosen people from that population are of different races. It is calculated with the following formula, where N_i is the number of individuals in racial category i and T is the total number of individuals:

Diversity Index Formula
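The formula image is not reproduced in this text, so what follows is a sketch under a stated assumption, not necessarily the exact expression the project uses. If the two people are drawn independently (that is, with replacement), the definition above works out to

$$D = \frac{1}{T^2} \sum_{i} N_i \,(T - N_i)$$

If they are instead drawn without replacement, the denominator becomes T(T - 1). The class and method names below are hypothetical, and the code sticks to Java 7 syntax to match the JDK used in the compilation steps.

```java
// Sketch only: assumes the with-replacement form D = (1/T^2) * sum_i N_i * (T - N_i).
// DiversityIndexSketch and diversityIndex are hypothetical names, not part of the project.
class DiversityIndexSketch {

    // counts[i] is N_i, the number of individuals in racial category i.
    static double diversityIndex(long[] counts) {
        long total = 0;                       // T, the total number of individuals
        for (long n : counts) {
            total += n;
        }
        if (total == 0) {
            return 0.0;                       // an empty population has no diversity to measure
        }
        double sum = 0.0;
        for (long n : counts) {
            sum += (double) n * (total - n);  // N_i * (T - N_i)
        }
        return sum / ((double) total * total);
    }

    public static void main(String[] args) {
        // Illustrative counts, not census data: 50, 30 and 20 people in three categories.
        long[] counts = {50, 30, 20};
        System.out.println(diversityIndex(counts));
    }
}
```

For the illustrative counts 50, 30 and 20, the sum is 2500 + 2100 + 1600 = 6200 and T^2 is 10000, so D = 0.62.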

This project works with the census dataset sourced here.

The program was tested on a multicore cluster computer at RIT, but can be used on other cluster machines that have Parallel Java 2 (PJ2) installed. Parallel Java 2 was developed by Alan Kaminsky in the Department of Computer Science at the Rochester Institute of Technology. The link includes a description of PJ2 and its installation guide; documentation for PJ2 can be found on the same webpage. PJ2 is distributed under the terms of the GNU General Public License as published by the Free Software Foundation.

Compilation

The project comes with the three .java files necessary for this program and can be compiled on a machine with PJ2 installed using the following steps:

  • Navigate to the directory where the .java source files are located
  • Add JDK 1.7 to the PATH: export PATH=/usr/local/dcs/versions/jdk1.7.0_51/bin:$PATH
  • Add PJ2 to the classpath: export CLASSPATH=.:/var/tmp/parajava/pj2/pj2.jar
  • Create a build directory: mkdir build
  • Compile the source code: javac -d ./build *.java
  • Enter the build directory: cd build
  • Build the jar: jar cvf <name>.jar * (where <name> is the name you want to give the jar)

Assuming that PJ2's tracker is set up correctly on the machine, that the names of all nodes are known, and that the census dataset has been downloaded and split properly among all nodes, you are ready to run the program.

Execution

Programs written for PJ2 are run using PJ2 as the launcher, so it is important to get the command-line arguments right. The DivIndex program is launched with the following parameters:

java pj2 debug=<debug> timelimit=<s> jar=<name>.jar threads=<thr> DivIndex <nodes> <path/to/dataset> <year> <states>

  • <debug> is none to suppress job-specific output, or makespan to print job-related information and running times
  • <s> is the number of seconds the program is allowed to run before timing out
  • <thr> is the number of threads you want to devote to this task. Defaults to 1 if omitted.
  • <nodes> is the list of nodes to use, separated by commas
  • <path/to/dataset> is the relative path to the CSV file that holds each node's partition of the census dataset
  • <year> is an integer argument from 1 to 10, with 1 referring to 2007, and 10 referring to 2017
  • <states> is an optional argument. If no states are provided, the program calculates the diversity index for every county in all 50 states as well as the District of Columbia. Otherwise, pass each state as a quoted string, separated by single spaces.
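To make the program-level arguments concrete, here is a minimal sketch of how <nodes>, <path/to/dataset>, <year>, and the optional <states> could be parsed and validated. It is not the project's actual parsing code: the class and field names are hypothetical, and it assumes the program receives only the arguments that follow the class name on the command line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: DivIndexArgsSketch is a hypothetical helper, not part of the project or of PJ2.
class DivIndexArgsSketch {
    final String[] nodes;       // node names split from the comma-separated <nodes> argument
    final String datasetPath;   // <path/to/dataset>
    final int year;             // <year>, an integer from 1 to 10
    final List<String> states;  // empty means "all states"

    DivIndexArgsSketch(String[] nodes, String datasetPath, int year, List<String> states) {
        this.nodes = nodes;
        this.datasetPath = datasetPath;
        this.year = year;
        this.states = states;
    }

    static DivIndexArgsSketch parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "Usage: DivIndex <nodes> <path/to/dataset> <year> [<states>...]");
        }
        String[] nodes = args[0].split(",");
        String datasetPath = args[1];
        int year;
        try {
            year = Integer.parseInt(args[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("<year> must be an integer", e);
        }
        if (year < 1 || year > 10) {
            throw new IllegalArgumentException("<year> must be between 1 and 10");
        }
        // Each quoted state arrives as a single argument, so "New York" stays in one piece.
        List<String> states =
            new ArrayList<String>(Arrays.asList(args).subList(3, args.length));
        return new DivIndexArgsSketch(nodes, datasetPath, year, states);
    }

    public static void main(String[] args) {
        // Illustrative arguments; the node names and path are made up.
        DivIndexArgsSketch parsed = parse(
            new String[] {"node1,node2", "data/census.csv", "10", "New York", "Ohio"});
        System.out.println(parsed.nodes.length + " nodes, year " + parsed.year
            + ", states " + parsed.states);
    }
}
```

Because the shell passes each quoted state as a single argument, multi-word names such as "New York" need no extra handling beyond the quoting described above.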

The program prints states in alphabetical order; within each state, counties are listed in descending order of diversity index.
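The output ordering can be expressed as a two-level comparison: state name first, then diversity index from high to low. The sketch below illustrates that ordering only. The CountyResult type and all names in it are hypothetical, the values are made up, and the project's own reduction and printing code may be organized differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch only: OutputOrderSketch and CountyResult are hypothetical, not the project's classes.
class OutputOrderSketch {

    static final class CountyResult {
        final String state;
        final String county;
        final double index;   // diversity index for this county

        CountyResult(String state, String county, double index) {
            this.state = state;
            this.county = county;
            this.index = index;
        }
    }

    // States ascending (alphabetical); within a state, diversity index descending.
    static final Comparator<CountyResult> OUTPUT_ORDER = new Comparator<CountyResult>() {
        @Override
        public int compare(CountyResult a, CountyResult b) {
            int byState = a.state.compareTo(b.state);
            if (byState != 0) {
                return byState;
            }
            return Double.compare(b.index, a.index);  // operands reversed for descending order
        }
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<CountyResult> sorted(List<CountyResult> results) {
        List<CountyResult> copy = new ArrayList<CountyResult>(results);
        Collections.sort(copy, OUTPUT_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // Made-up values for illustration; these are not real census results.
        List<CountyResult> results = new ArrayList<CountyResult>();
        results.add(new CountyResult("Ohio", "County A", 0.20));
        results.add(new CountyResult("Alabama", "County B", 0.10));
        results.add(new CountyResult("Ohio", "County C", 0.50));
        results.add(new CountyResult("Alabama", "County D", 0.30));
        for (CountyResult r : sorted(results)) {
            System.out.println(r.state + "\t" + r.county + "\t" + r.index);
        }
    }
}
```

An anonymous Comparator is used instead of a lambda so the sketch compiles under the JDK 1.7 toolchain named in the compilation steps.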

Screenshots

Below are three screenshots of the program running with different sets of parameters:

Screenshot One

Screenshot Two

Screenshot Three
