planet-stream's Introduction

Planet-Stream

Works with the Protocol Buffer Binary Format (PBF) of OpenStreetMap planetfiles to break up a large file into many smaller, standalone files that can be used like their larger counterparts for import and other data pipeline uses.

This can greatly speed up planetfile import, as it allows for multiple chunks to be imported through several different threads or on different machines running parallel operations.

Additionally, this package provides multiple streaming modes for remote servers that support HTTP Range requests, allowing data to be imported concurrently with its transfer.
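As a rough illustration of what "HTTP Range support" means here, the sketch below issues a single ranged GET with Go's standard library and expects a 206 Partial Content response. The `fetchRange` helper and its signature are placeholders for illustration, not part of this package's API:

```go
import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange asks the server for `length` bytes starting at `start`,
// so only the span we actually need ever crosses the network.
func fetchRange(url string, start, length int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// "bytes=first-last" is an inclusive byte range.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, start+length-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// A server with Range support answers 206 Partial Content; anything
	// else means we'd be downloading more than we asked for.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("no Range support: got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```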

This package only deals with PBFs at the Block level, and does not directly interact with or modify more sophisticated datatypes of the OSMPBF format corresponding to the map data itself, such as PrimitiveGroups, Nodes, Ways, or Relations.

If you'd like to interact with the map-related data using Go, you can do so using the OSMPBF library.

For more information on the Protocol Buffer Binary Format for OSM Planetfiles, check out the OSM wiki.

Project Status

This is very much a Work-in-Progress designed for a very specific use case.


planet-stream's Issues

Request Optimizations

I'm trying not to optimize this too much before I actually use it in the project I need it for, but here are some patterns that could be interesting low-hanging fruit for later optimization.

We don't want to store GBs of these files directly in RAM (we don't even have GBs of RAM), so it's not feasible to hold entire blocks in memory by default when we fetch them.

The naive approach would be to make a new set of Range requests each time we want to write a block, which gives us an overhead of:

  1. Grabbing size of next BlockHeader (1 request; 4 bytes)
  2. Grabbing and unmarshaling the BlockHeader (1 request; "always" 13 bytes)
  3. Grabbing the Blob from the size indicated in the BlockHeader (1 request; n bytes)

This means that, for every chunk we save to its own file, retrieving and dumping the FileHeader block creates an additional overhead of three requests per new Block written.
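Expressed as code, the naive per-block flow looks roughly like the sketch below. It assumes the hypothetical `fetchRange` helper from the introduction above, `encoding/binary` for the four-byte size prefix, and a hypothetical `blobSizeFromHeader` that unmarshals the header protobuf and returns the size of the following Blob; none of these names are the package's actual API:

```go
// Naive per-block flow: three separate Range requests for every Block.
func readBlockNaive(url string, offset int64) (blob []byte, next int64, err error) {
	// 1. The four-byte, big-endian size of the next BlockHeader.
	sizeBuf, err := fetchRange(url, offset, 4)
	if err != nil {
		return nil, 0, err
	}
	headerSize := int64(binary.BigEndian.Uint32(sizeBuf))

	// 2. The BlockHeader itself ("always" 13 bytes in practice).
	headerBuf, err := fetchRange(url, offset+4, headerSize)
	if err != nil {
		return nil, 0, err
	}
	blobSize, err := blobSizeFromHeader(headerBuf) // hypothetical: proto unmarshal + datasize
	if err != nil {
		return nil, 0, err
	}

	// 3. The Blob, using the size the BlockHeader reported.
	blob, err = fetchRange(url, offset+4+headerSize, blobSize)
	if err != nil {
		return nil, 0, err
	}
	return blob, offset + 4 + headerSize + blobSize, nil
}
```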

For large files with thousands of chunks, this could lead to ten or twenty thousand extraneous HTTP requests during an import session. Current Amazon prices for S3 GET requests put this at $0.01 per 10,000 requests, meaning something like two cents of unnecessary overhead plus waiting time for whatever latency is introduced.

For HTTP requests on third-party servers, at best we're being a bit disrespectful of other people's resources, and at worst we're impacting service for others.

Additionally, a standard OSM-exported BlockHeader is always 13 bytes in practice, but it would be bad form to require it to be 13 bytes. We could save one request per chunk by trying to unmarshal a 13-byte BlockHeader first (skipping the four-byte size block that indicates the BlockHeader's size), then falling back to the original procedure if that fails.

This secondary optimization would require both redesigning that code path and implementing a smart fallback in case the format ever changes; otherwise we'd end up making a useless request for every chunk before making the two proper requests.
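A sketch of that speculative read, reusing the same hypothetical helpers: fetch 13 bytes where the header should be, try to unmarshal, and only fall back to the three-request procedure when that fails. Making the fallback genuinely "smart" (for example, also sanity-checking the size prefix rather than trusting the unmarshal alone) is the part that would need real design work:

```go
// likelyHeaderSize is the BlockHeader size seen in practice for standard
// OSM exports; it is a guess, not something the format guarantees.
const likelyHeaderSize = 13

func readBlockSpeculative(url string, offset int64) (blob []byte, next int64, err error) {
	// Skip the four-byte size prefix and optimistically grab 13 header bytes.
	headerBuf, err := fetchRange(url, offset+4, likelyHeaderSize)
	if err != nil {
		return nil, 0, err
	}
	blobSize, err := blobSizeFromHeader(headerBuf)
	if err != nil {
		// The header wasn't 13 bytes after all (or didn't parse): fall back
		// to the original procedure, having wasted one request on this chunk.
		return readBlockNaive(url, offset)
	}

	blob, err = fetchRange(url, offset+4+likelyHeaderSize, blobSize)
	if err != nil {
		return nil, 0, err
	}
	return blob, offset + 4 + likelyHeaderSize + blobSize, nil
}
```

On the happy path this makes two requests per block instead of three, which is the one-request-per-chunk saving described above.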

Optimizing the FileHeader writes is easy, though, and I'll be pushing that in the next commit.

Testing

Right now, tests are done manually by running a chunking and then a re-import for each of the three input types (file, HTTP, S3).

If we wanted to do a proper test suite, we'd have to do a few things:

  • Dummy HTTP server supporting Range requests (see the sketch below)
  • S3 Mockup supporting Range requests

We'd also probably want to open up the actual Blobs to determine if the data matches between import and export, and keep track of the file size to ensure that the entire file gets imported.
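For the dummy HTTP server, Go's standard library already covers the Range handling: `http.ServeContent` serves any `io.ReadSeeker` with full Range support. A test fixture could be as small as the sketch below (the package name, fixture path, and function names are placeholders; an S3 mockup would need more than this, since real S3 clients also expect signing and bucket semantics):

```go
package planetstream_test

import (
	"bytes"
	"net/http"
	"net/http/httptest"
	"os"
	"testing"
	"time"
)

// newRangeServer serves data with full HTTP Range support, which is what the
// HTTP (and, roughly, the S3) streaming modes expect from a remote host.
func newRangeServer(t *testing.T, data []byte) *httptest.Server {
	t.Helper()
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ServeContent honors Range headers and answers 206 Partial Content.
		http.ServeContent(w, r, "planet.osm.pbf", time.Now(), bytes.NewReader(data))
	}))
}

func TestChunkAndReimportOverHTTP(t *testing.T) {
	data, err := os.ReadFile("testdata/small-planet.osm.pbf") // placeholder fixture
	if err != nil {
		t.Skip("no test fixture available")
	}
	srv := newRangeServer(t, data)
	defer srv.Close()

	// Chunk from srv.URL, re-import the chunks, then compare the Blobs and
	// total byte count against the original data.
	_ = srv.URL
}
```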
