ASEIED 2023 project

Terrain tiles analysis for the western hemisphere
Krzysztof Dymanowski, Bartosz Janicki, Alan Bejnarowicz

Table of Contents
  1. Project Overview
  2. Solution process
  3. Summary

Project Overview

This is a group project for the 2023 ASEIED class (Autonomous systems for exploring and analyzing data) at Gdańsk Tech. The goal of the course was to gain hands-on experience with AWS, especially EMR and other big data tools used in industry.

Problem formulation

"Perform data analysis containing information about the terrain elevation diversity, selecting groups of areas with the highest increase (North and South America continent). The elevation increase in a given location should be measured based on at least 10 measurement points. Determine 6 groups of areas based on the average value of elevation increase. Please plot the detected areas on the map."
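One possible reading of "determine 6 groups" is to split the areas into six roughly equally populated groups by their average elevation increase. The sketch below does this with quantile edges in NumPy. The function name `assign_groups` and the choice of quantiles as group boundaries are assumptions made for illustration, not necessarily the grouping used in this project.

```python
import numpy as np

def assign_groups(mean_increase, n_groups=6):
    """Label each area 0..n_groups-1 by the quantile bin of its mean increase.

    Hypothetical helper: quantile-based grouping is an assumption.
    """
    values = np.asarray(mean_increase, dtype=np.float64)
    # Interior quantile edges, e.g. 1/6, 2/6, ..., 5/6 for six groups
    probs = np.linspace(0.0, 1.0, n_groups + 1)[1:-1]
    edges = np.quantile(values, probs)
    # digitize returns the index of the bin each value falls into
    return np.digitize(values, edges)

# Twelve made-up per-area averages, for illustration only
example = np.arange(12, dtype=np.float64)
labels = assign_groups(example)
```

Group 5 then holds the areas with the highest average increase and group 0 those with the lowest.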

Tech stack

  • Python
  • Spark
  • AWS

The project ran in the AWS Learning Lab environment, in which each of us had $100 to spend on Amazon Web Services.

Dataset

The dataset used was Terrain Tiles (https://registry.opendata.aws/terrain-tiles/), described as: "A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3."
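To make the tile addressing concrete, here is a minimal sketch of building an object path for one tile. It assumes the dataset lives in the `elevation-tiles-prod` bucket and uses the `terrarium/{z}/{x}/{y}.png` layout. Both are assumptions that should be checked against the registry page, and `tile_url` is a hypothetical helper, not the project's `generate_links`.

```python
# Assumed bucket name and tile format; verify against the dataset documentation
TILE_BUCKET = "elevation-tiles-prod"
TILE_FORMAT = "terrarium"

def tile_url(zoom, x, y, scheme="s3"):
    """Build the path of a single PNG tile (hypothetical helper)."""
    return f"{scheme}://{TILE_BUCKET}/{TILE_FORMAT}/{zoom}/{x}/{y}.png"

example_url = tile_url(3, 1, 2)
```

On EMR an `s3://` path can usually be passed straight to Spark, while other environments may need a different scheme such as `s3a`.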

The bounding box used for the analysis was:

Coordinates   From          To
Latitude      72.711037     -55.554805
Longitude     -172.964981   -21.269288

The bounding box was found using the http://bboxfinder.com website.

Solution process

Our approach was to use the Mercator projection (https://en.wikipedia.org/wiki/Mercator_projection) to map the selected region's surface onto a 2D plane, which in turn allowed us to analyze the elevation tile by tile.
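As an illustration of the projection step, the sketch below converts longitude and latitude to tile indices with the standard Web Mercator ("slippy map") tiling formula and lists the tiles covering a bounding box. The names `lonlat_to_tile` and `tiles_in_bounds` are hypothetical and are not the project's `get_tiles`. The bounds and zoom level are the ones stated in this README.

```python
import math

ZOOM = 3
# (west, south, east, north), from the bounding box given in this README
BOUNDS = (-172.964981, -55.554805, -21.269288, 72.711037)

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile index containing a point (Web Mercator tiling)."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = math.floor((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge still map to a valid tile
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tiles_in_bounds(zoom, west, south, east, north):
    """List (zoom, x, y) for every tile that intersects the bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # y grows southwards
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [(zoom, x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]

tiles = tiles_in_bounds(ZOOM, *BOUNDS)
```

The formula is only defined up to roughly 85 degrees of latitude, which comfortably covers the bounding box used here.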

Here are the most important steps of our pipeline:

  1. We generate tile coordinates from the geographic bounds and the zoom level (bounds as specified above, zoom level = 3): tiles: List[Tile] = get_tiles(ZOOM, *BOUNDS)
  2. From the resulting list of (zoom level, x coordinate, y coordinate) tuples we generate the S3 links used to fetch the elevation data: data_urls: List[str] = generate_links(tiles)
  3. We load the data into a Spark DataFrame, specifying the image format: df = spark.read.format("image").load(data_urls)
  4. The DataFrame is pruned to include only the 'origin' and 'data' columns (metadata and actual images respectively): df = df.select("image.origin", "image.data")
  5. The DataFrame is converted to an RDD of numpy arrays for easier manipulation: tiles_rdd = df.rdd.map(lambda img: np.reshape(img, (TILE_HEIGHT, TILE_WIDTH, CHANNELS_NUM)))
  6. The elevation of each tile is then calculated from the RGB values of the image, and the terrain gradients are computed with numpy: elevation_tiles = tiles_rdd.map(get_elevation); grad_arr = elevation_tiles.map(np.gradient)
  7. Finally, we populate an empty numpy array according to the elevation level for each tile and display the results: plt.imshow(world_map, cmap=plt.get_cmap("terrain"))
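The per-tile computations in steps 6 and 7 can be sketched in plain NumPy, without Spark. The sketch assumes the tiles use the Terrarium PNG encoding, where elevation in metres is `red * 256 + green + blue / 256 - 32768`, and that pixels arrive as an RGB array. Spark's image source documents its data as row-wise BGR in most cases, so the channels may need reordering first. The functions `decode_terrarium`, `mean_slope`, and `assemble_mosaic` are hypothetical stand-ins, not the project's `get_elevation` or plotting code.

```python
import numpy as np

def decode_terrarium(rgb):
    """Convert an (H, W, 3) RGB array to elevation in metres.

    Assumes the Terrarium encoding; other encodings need another formula.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return r * 256.0 + g + b / 256.0 - 32768.0

def mean_slope(elevation):
    """Average gradient magnitude of a 2D elevation array (per pixel step)."""
    dy, dx = np.gradient(elevation)  # derivatives along rows, then columns
    return float(np.mean(np.hypot(dx, dy)))

def assemble_mosaic(tiles, tile_size):
    """Place per-tile 2D arrays, keyed by (x, y), into one large array."""
    xs = [x for x, _ in tiles]
    ys = [y for _, y in tiles]
    x0, y0 = min(xs), min(ys)
    height = (max(ys) - y0 + 1) * tile_size
    width = (max(xs) - x0 + 1) * tile_size
    mosaic = np.zeros((height, width))
    for (x, y), arr in tiles.items():
        row, col = (y - y0) * tile_size, (x - x0) * tile_size
        mosaic[row:row + tile_size, col:col + tile_size] = arr
    return mosaic

# Synthetic 4x4 tile whose elevation rises by 2 m per pixel to the east
ramp = np.tile(np.arange(4) * 2.0, (4, 1))
ramp_slope = mean_slope(ramp)
sea_level = decode_terrarium(np.array([[[128, 0, 0]]], dtype=np.uint8))
demo = assemble_mosaic({(0, 0): np.ones((2, 2)), (1, 0): np.full((2, 2), 2.0)}, 2)
```

The resulting mosaic can be passed to `plt.imshow` in the same way as `world_map` in step 7.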

AWS setup

In order to reproduce results:

  1. Create a cluster using the cluster creation command in the file clone_cluster_command.txt (you can use any other cluster configuration, but we suggest at least 5 m5.xlarge instances).
  2. Attach notebook (or workspace in the new console) to the cluster and run all cells of the "raw" notebook.

Alternatively, you can link this repository to your notebook (cluster) and then run the "raw" notebook.

Obstacles

Our first big obstacle was attempting the project in Scala and Spark. The only Scala plotting library we found, Vegas, was unmaintained and incompatible with the Spark version installed on our cluster. Halfway through the project we decided to switch to Python and PySpark, as far more tutorials, documentation, example code, and already-solved problems were available for it than for Scala. Another obstacle was understanding the data format of the terrain tiles, which required a notable amount of research into both the dataset and ways of processing geographical data.

Results

We consider our calculation method plausible: plotting the obtained elevation map with matplotlib's terrain color map yields an image similar to the maps found in geography books and other sources.

Summary

Our initial attempt was to write this project in Scala (Spark), but along the way we pivoted to PySpark. The experience we gained was more or less the same, but we were spared many technicalities where achieving the same thing in Scala was much harder than in Python. For example, setting up Vegas to work in the notebook was a mountain to overcome compared to PySpark's sc.install_pypi_package("matplotlib"). Nonetheless, we obtained hands-on experience with Scala and Spark, and transferred that knowledge to PySpark.
