
norwegiandemographics's Introduction

Norwegian Census Data Analysis using Hadoop

DAT500 MSc course

Configuring the cluster (tested on Ubuntu 16.04 and 18.04)

cd into setup

Shared

  • Ensure that setup_env.sh has the correct addresses for /etc/hosts.
  • Run sudo setup.sh. This sets up the environment and installs the necessary libraries and Hadoop.
  • Copy files & folders in /scripts/* to /usr/local/hadoop/
  • Ensure that Python 3 is the default Python interpreter and that its version is earlier than 3.8.0.
  • Ensure that Java 8 is the default Java version. Check ~/.bashrc and ~/.profile for potential Java 11 overrides.
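The two version requirements above can be checked with a short script. This is only a sketch: the helper names (python_ok, parse_java_major, java_major) are hypothetical and not part of this repository, and it assumes java is on the PATH. It avoids features newer than Python 3.6 so it runs on the interpreter this setup expects.

```python
import re
import subprocess
import sys


def python_ok(version=sys.version_info):
    """Return True if the interpreter is Python 3 and older than 3.8.0."""
    return (3, 0) <= tuple(version[:2]) < (3, 8)


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Java 8 reports itself as "1.8.0_xxx", while Java 9+ reports "11.0.2" etc.
    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major, minor = match.group(1), match.group(2)
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


def java_major():
    """Run `java -version` (it prints to stderr) and return the major version."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None  # java is not installed or not on the PATH
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Python 3 and older than 3.8:", python_ok())
    print("Java major version (expected 8):", java_major())
```

If the reported Java version is 11, the overrides in ~/.bashrc or ~/.profile mentioned above are the first place to look.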

Master Node Only

  • Run sudo setup_spark.sh. This installs Spark.
  • Copy files & folders in /master-only/ to /usr/local/hadoop/
  • Copy files & folders in /spark/ to /usr/local/spark/
  • Update /usr/local/hadoop/etc/hadoop/workers to list your worker (slave) nodes.

Slave Node Only

  • Copy files & folders in /slave-only/ to /usr/local/hadoop/

Finally

Format the namenode by running hdfs namenode -format in a terminal.

Running jobs @ master

Starting/Stopping

  • To start Hadoop and Spark, run start-dfs.sh, start-yarn.sh, and start-history-server.sh.
  • To stop Hadoop and Spark, run stop-history-server.sh, stop-yarn.sh, and stop-dfs.sh.
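The stop sequence above is the start sequence reversed. A minimal sketch of a wrapper that keeps the two in sync is shown here; the function names are hypothetical, and it assumes the Hadoop and Spark sbin scripts are on the PATH.

```python
import subprocess

# Start order as listed above; the stop order is derived from it.
START_SCRIPTS = ["start-dfs.sh", "start-yarn.sh", "start-history-server.sh"]


def stop_scripts(start_scripts=START_SCRIPTS):
    """Return the matching stop-* scripts in reverse start order."""
    return [name.replace("start-", "stop-", 1) for name in reversed(start_scripts)]


def run_all(scripts, dry_run=True):
    """Run each script in turn and abort on the first failure.

    With dry_run=True (the default) the commands are only printed.
    """
    for script in scripts:
        if dry_run:
            print("would run:", script)
        else:
            subprocess.run([script], check=True)


if __name__ == "__main__":
    run_all(START_SCRIPTS)
    run_all(stop_scripts())
```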

Testing

To test if Hadoop is working, simply run setup/setup_runtest.sh.

When Hadoop and Spark are up

  • Regular MapReduce, /run/hadoop-streaming.sh <file_mapper.py> <file_reducer.py>
  • With MRJob, python3 some_file.py --hadoop-streaming-jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -r hadoop hdfs:///data/input_data.csv --output-dir hdfs:///output/xyz --no-output
  • With Spark, spark-submit --master yarn some_file.py
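For the regular MapReduce case, a mapper and a reducer for Hadoop Streaming read lines from stdin and write tab-separated key/value lines to stdout. The sketch below counts rows per value of the first CSV column. The column layout of the dataset is not described here, so the choice of key is purely an assumption, as are the function names and the map/reduce argument. Both steps are shown in one block for brevity; since the wrapper script takes two files, in practice they would be split into a mapper file and a reducer file.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit "key<TAB>1" for the first CSV field of each non-empty line.

    Assumption: the first column is a usable grouping key. A plain split(",")
    does not handle quoted fields that contain commas.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key = line.split(",")[0]
        yield "{}\t1".format(key)


def reduce_lines(lines):
    """Reducer: sum the counts per key.

    Hadoop Streaming delivers the mapper output sorted by key, so equal keys
    arrive on consecutive lines and groupby can be used directly.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "{}\t{}".format(key, total)


if __name__ == "__main__":
    # Hypothetical convention: pass "map" or "reduce" as the first argument.
    step = map_lines if sys.argv[1:] == ["map"] else reduce_lines
    for output_line in step(sys.stdin):
        print(output_line)
```

The same logic can be tried locally without a cluster by piping a CSV file through the map step, sort, and the reduce step.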

To generate the results for this project (assuming you have access to our dataset)

cd into src (~/dat500/src/) and execute run.sh.

Visualizing the results locally

  • Sync results retrieved from Hadoop & Spark
  • cd into src
  • conda install geopandas
  • conda install geoplot -c conda-forge
  • conda install -c conda-forge cartopy
  • pip install -r requirements.txt
  • Open visualize.ipynb to visualize results

(the CSV dataset was generated from the original census dataset by running src/merge_data.ipynb)

Troubleshooting

  • Check if /usr/local/hadoop/etc/hadoop/hadoop-env.sh has any faulty paths.
  • Python 3.6.9 was built from source and should be the default Python version. However, if a different version is preferred, update /usr/local/spark/conf/spark-env.sh to use python3.X for its Python drivers.
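One way to look for faulty paths in hadoop-env.sh is to scan its export lines and report absolute paths that do not exist on disk. This is a rough sketch with a hypothetical helper name. It only handles simple export NAME=/path lines and skips values that contain variable expansions or colon-separated lists, so it will not catch every problem.

```python
import os
import re
import sys

# Matches lines such as: export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
EXPORT_RE = re.compile(r'^\s*export\s+(\w+)=["\']?([^"\'#\s]+)')


def missing_paths(env_file):
    """Return (variable, path) pairs for exported absolute paths that do not exist."""
    missing = []
    with open(env_file) as handle:
        for line in handle:
            match = EXPORT_RE.match(line)
            if not match:
                continue  # comments, blank lines, and non-export statements
            name, value = match.groups()
            # Only check literal absolute paths; skip ${VAR} expansions and PATH-like lists.
            is_literal_path = value.startswith("/") and "$" not in value and ":" not in value
            if is_literal_path and not os.path.exists(value):
                missing.append((name, value))
    return missing


if __name__ == "__main__":
    default_file = "/usr/local/hadoop/etc/hadoop/hadoop-env.sh"
    env_file = sys.argv[1] if len(sys.argv) > 1 else default_file
    for name, value in missing_paths(env_file):
        print("{} points to a missing path: {}".format(name, value))
```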

