Giter VIP home page Giter VIP logo

nearest-airport-locator's Introduction

Approximate Nearest Airport

The objective of this code is to find the nearest airport(possibly best) available to user based on user's coordinates and airport's coordinates

How to Run

  1. Run Tests

test (from sbt-shell) or sbt "test" (from terminal)

  1. Build the project

clean;assembly (from sbt-shell) or sbt "clean;assembly" (from terminal)

  1. Run the application
java -jar target/scala-2.12/nearest-airport-locator.jar --input-files "/path-to/optd-airports-sample.csv,/path-to/user-geo-sample.csv" --output-file "/path-to/nearest_airport/"

Implementation

Infrastructure

The implementation is on Apache Spark as it is known for handling huge volumes of data(or events with spark streaming) in a distributed manner with resilience. This can be also be implemented in Apache Beam/Flink data streams but as I have already worked on spark, I went with that.

Assumptions

  • Airport coordinates might contain invalid latitudes and longitudes.
  • The given dataset is the complete set of coordinates and up-to-date.
  • A maximum difference in distance to the nearest airport is 50 Kms.

Logic

The logic is as follows,

  • Load Airport coordinates as a dataframe and sent to broadcast for lookup operations.
  • Load user coordinates as dataframe and using a UDF to calculate the distance between coordinates.
  • Do a search on the airport coordinates to narrow down by fetching the areas using the latitude and longitude upper and lower limits. This gives us the possible nearby airports. The search happens in two levels, one with the latitude initially and with the longitudes next. Once we have the nearby airports, we use Haversine Formula to calculate the distance between the points from the list of nearby airports and reduce the output to the nearest airport.
  • The above operation will be performed by UDF and for each user event in the dataset.

Haversine vs other formulae

I looked over the internet for the best method to calculate the distance between two points. There are quite a few distance calculating methods like Euclidean, Manhattan etc but for this use-case Haversine might be more suitable because Haversine method gives the distance based on the great-circle along the surface of the earth and for instance Euclidean method gives the distance based on straight-line.

Design for production

Currently, the airport coordinates file is loaded and cleansed inside the same application. But for real-life production scenarios, I would have a nightly job that updates the master Airport coordinates with new coordinates(if there are), clean the data and store the dataset. Then have a scheduled executor(for e.g., akka actor) inside the spark application to refresh the table/file and re-broadcast the airport Coordinates.

Trade-offs/Limitations

  • As of today, spark support objects of size up to 8Gb to broadcast. So, we can safely assume that the master airport coordinates will not exceed even if there are new airports and provided the metadata stays the same. But this also comes at the cost of choosing between the computing cost and efficiency (if we need to go with the sql joins which can also become overly complicated in terms of correctness and maintenance)
  • Due to time limitation, am not able to calculate the error margin in distance calculation and could not precisely test the logic in unit tests.

nearest-airport-locator's People

Contributors

prasanna-ds avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.