planet-stream's Introduction

Planet-Stream

Works with the Protocol Buffer Binary Format (PBF) of OpenStreetMap planetfiles to break up a large file into many smaller, standalone files that can be used like their larger counterparts for import and other data pipeline uses.

This can greatly speed up planetfile import, as it allows for multiple chunks to be imported through several different threads or on different machines running parallel operations.

Additionally, this package provides multiple streaming modes for remote servers that support HTTP Range requests, allowing data to be imported concurrently with its transfer.
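As a rough illustration of what "HTTP Range support" means here, the sketch below issues a single ranged GET with Go's standard library and expects a 206 Partial Content response. The `fetchRange` helper and its signature are placeholders for illustration, not part of this package's API:

```go
import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange asks the server for `length` bytes starting at `start`,
// so only the span we actually need ever crosses the network.
func fetchRange(url string, start, length int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// "bytes=first-last" is an inclusive byte range.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, start+length-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// A server with Range support answers 206 Partial Content; anything
	// else means we'd be downloading more than we asked for.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("no Range support: got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```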

This package only deals with PBFs at the Block level, and does not directly interact with or modify more sophisticated datatypes of the OSMPBF format corresponding to the map data itself, such as PrimitiveGroups, Nodes, Ways, or Relations.

If you'd like to interact with the map-related data using Go, you can do so using the OSMPBF library.

For more information on the Protocol Buffer Binary Format for OSM Planetfiles, check out the OSM wiki.

Project Status

This is very much a Work-in-Progress designed for a very specific use case.


planet-stream's Issues

Request Optimizations

I'm trying not to optimize this too much before I actually use it in the project I need it for, but here are some patterns that could be interesting low-hanging fruit for later optimization.

We don't want to store GBs of these files directly in RAM (we don't even have GBs of RAM), so it's not feasible to hold entire blocks in memory by default when we fetch them.

The naive approach would be to make a new set of Range requests each time we want to write a block, which gives us an overhead of:

  1. Grabbing size of next BlockHeader (1 request; 4 bytes)
  2. Grabbing and unmarshaling the BlockHeader (1 request; "always" 13 bytes)
  3. Grabbing the Blob from the size indicated in the BlockHeader (1 request; n bytes)

This means that, for every chunk we save to its own file, retrieving and dumping the FileHeader block creates an additional overhead of three requests per new Block written.
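Expressed as code, the naive per-block flow looks roughly like the sketch below. It assumes the hypothetical `fetchRange` helper from the introduction above, `encoding/binary` for the four-byte size prefix, and a hypothetical `blobSizeFromHeader` that unmarshals the header protobuf and returns the size of the following Blob; none of these names are the package's actual API:

```go
// Naive per-block flow: three separate Range requests for every Block.
func readBlockNaive(url string, offset int64) (blob []byte, next int64, err error) {
	// 1. The four-byte, big-endian size of the next BlockHeader.
	sizeBuf, err := fetchRange(url, offset, 4)
	if err != nil {
		return nil, 0, err
	}
	headerSize := int64(binary.BigEndian.Uint32(sizeBuf))

	// 2. The BlockHeader itself ("always" 13 bytes in practice).
	headerBuf, err := fetchRange(url, offset+4, headerSize)
	if err != nil {
		return nil, 0, err
	}
	blobSize, err := blobSizeFromHeader(headerBuf) // hypothetical: proto unmarshal + datasize
	if err != nil {
		return nil, 0, err
	}

	// 3. The Blob, using the size the BlockHeader reported.
	blob, err = fetchRange(url, offset+4+headerSize, blobSize)
	if err != nil {
		return nil, 0, err
	}
	return blob, offset + 4 + headerSize + blobSize, nil
}
```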

For large files with thousands of chunks, this could lead to ten or twenty thousand extraneous HTTP requests during an import session. Current Amazon prices for S3 GET requests put this at $0.01 per 10,000 requests, meaning something like two cents of unnecessary overhead plus waiting time for whatever latency is introduced.

For HTTP requests on third-party servers, at best we're being a bit disrespectful of other people's resources, and at worst we're impacting service for others.

Additionally, a standard OSM-exported BlockHeader is always 13 bytes in practice, but it would be bad form to require it to be 13 bytes. We could save one request per chunk by trying to unmarshal a 13-byte BlockHeader first (skipping the four-byte size block that indicates the BlockHeader's size), then falling back to the original procedure if that fails.

This secondary optimization would require both redesigning that code path and implementing a smart fallback in case the format ever changes; otherwise we'd end up making a useless request for every chunk before making the two proper requests.
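A sketch of that speculative read, reusing the same hypothetical helpers: fetch 13 bytes where the header should be, try to unmarshal, and only fall back to the three-request procedure when that fails. Making the fallback genuinely "smart" (for example, also sanity-checking the size prefix rather than trusting the unmarshal alone) is the part that would need real design work:

```go
// likelyHeaderSize is the BlockHeader size seen in practice for standard
// OSM exports; it is a guess, not something the format guarantees.
const likelyHeaderSize = 13

func readBlockSpeculative(url string, offset int64) (blob []byte, next int64, err error) {
	// Skip the four-byte size prefix and optimistically grab 13 header bytes.
	headerBuf, err := fetchRange(url, offset+4, likelyHeaderSize)
	if err != nil {
		return nil, 0, err
	}
	blobSize, err := blobSizeFromHeader(headerBuf)
	if err != nil {
		// The header wasn't 13 bytes after all (or didn't parse): fall back
		// to the original procedure, having wasted one request on this chunk.
		return readBlockNaive(url, offset)
	}

	blob, err = fetchRange(url, offset+4+likelyHeaderSize, blobSize)
	if err != nil {
		return nil, 0, err
	}
	return blob, offset + 4 + likelyHeaderSize + blobSize, nil
}
```

On the happy path this makes two requests per block instead of three, which is the one-request-per-chunk saving described above.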

Optimizing the FileHeader writes is easy, though, and I'll be pushing that in the next commit.

Testing

Right now, tests are done manually by running a chunking and then a re-import for each of the three input types (file, HTTP, S3).

If we wanted to do a proper test suite, we'd have to do a few things:

  • Dummy HTTP server supporting Range requests (see the sketch below)
  • S3 Mockup supporting Range requests

We'd also probably want to open up the actual Blobs to determine if the data matches between import and export, and keep track of the file size to ensure that the entire file gets imported.
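For the dummy HTTP server, Go's standard library already covers the Range handling: `http.ServeContent` serves any `io.ReadSeeker` with full Range support. A test fixture could be as small as the sketch below (the package name, fixture path, and function names are placeholders; an S3 mockup would need more than this, since real S3 clients also expect signing and bucket semantics):

```go
package planetstream_test

import (
	"bytes"
	"net/http"
	"net/http/httptest"
	"os"
	"testing"
	"time"
)

// newRangeServer serves data with full HTTP Range support, which is what the
// HTTP (and, roughly, the S3) streaming modes expect from a remote host.
func newRangeServer(t *testing.T, data []byte) *httptest.Server {
	t.Helper()
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ServeContent honors Range headers and answers 206 Partial Content.
		http.ServeContent(w, r, "planet.osm.pbf", time.Now(), bytes.NewReader(data))
	}))
}

func TestChunkAndReimportOverHTTP(t *testing.T) {
	data, err := os.ReadFile("testdata/small-planet.osm.pbf") // placeholder fixture
	if err != nil {
		t.Skip("no test fixture available")
	}
	srv := newRangeServer(t, data)
	defer srv.Close()

	// Chunk from srv.URL, re-import the chunks, then compare the Blobs and
	// total byte count against the original data.
	_ = srv.URL
}
```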
