mapbox / tile-reduce
mapreduce vector tile processing
License: ISC License
Howdy folks - I just noticed a small error in the example code, where the roads layer from mapbox-streets is named incorrectly:
layers: ['roads']
should in fact be layers: ['road']
Similarly, in buffer.js it should be:
module.exports = function (tileLayers, opts, done) {
  var road = tileLayers.streets.road;
  var bufferedRoad = turf.buffer(road, 20, 'meters');
  done(null, bufferedRoad);
};
We definitely need to document raw - we don't even mention it right now.
Optionally, we may want to add a section about how to optimize (use raw, use rbush for lots of intersections, etc.), talk about the effect of tiles with buffers, and explain how to generate custom mbtiles without buffers.
This may end up being irrelevant depending on new architecture, but if we keep data source configuration centralized, we should think about supporting more sources. Maybe use tilelive, so supporting other sources is trivial / zero changes to tile reduce internals
Background: I'm trying to do OSM stats by country. This means that I either:
Using geojson-vt and Natural Earth Admin 0 boundaries, I can easily figure out which countries are present in a tile, but currently that requires me to do geojsonvt + fs.readFile inside every single worker.
I think it would be pretty straightforward, using tilelive + node-mbtiles & node-tilejson, to generalize getVectorTile to either backend source.
I think we should make scoping jobs extremely flexible:
tilebelt.getParent and tilebelt.getChildren
This will be pretty simple to support with tile-cover and tilebelt, and the type of cover can be implicitly classified automatically (given this list anyway).
It seems pretty obvious that bbox and polygon should be supported. Does it make sense to support arbitrary geojson objects (given that tile-cover can handle these already), and tiles which will provide granular control + index caching?
It should be possible to make this work in modern browsers via web workers. To do this, use xhr instead of request (note: I think the browser will handle the gzip transparently), and use web workers instead of child_process.fork.
When you're not accumulating results on reduce events, the memory consumption still creeps up, so you can easily run out of memory on a large number of tiles. This doesn't look right; there's probably a big memory leak somewhere.
Couple possibilities I could see being useful: options.tiles as a direct list of tiles, or all the tiles in an mbtiles file (since getting such a list is possible with mbtiles).
We need this. Atlas.
I started here: #28
I'm thinking about using tile-reduce to power a little utility module, where I'd need to pass options along from the main process to the workers. Could the map module get passed the serialized tile reduce options as one of its initial args?
In README.md, there is an example of a URL source:
sources: [
  {
    name: 'streets',
    url: 'https://b.tiles.mapbox.com/v4/mapbox.mapbox-streets-v6/{z}/{x}/{y}.vector.pbf',
    layers: ['roads'],
    maxrate: 10
  }
]
I expected only the roads layer to be transferred. However, I still get all layers in streets. The layers option seems to have no effect.
This feature would allow for caching or pre-downloading a region, which would speed up jobs that use tons of HTTP requests. I'm thinking that a file path with the usual {x}, {y} and {z} placeholders would suffice.
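A minimal sketch of the path templating this would need, assuming tiles are [x, y, z] arrays as elsewhere in tile-reduce. The function name and the example path are hypothetical, not part of the library:

```javascript
// Hypothetical helper: expand {x}, {y} and {z} in a file path template.
// Tiles are assumed to be [x, y, z] arrays.
function tilePath(template, tile) {
  var values = { x: tile[0], y: tile[1], z: tile[2] };
  return template.replace(/\{([xyz])\}/g, function (match, key) {
    return String(values[key]);
  });
}

console.log(tilePath('/data/tiles/{z}/{x}/{y}.vector.pbf', [9378, 12535, 15]));
```

Reusing the same placeholder syntax as the URL sources would let a pre-downloaded region be swapped in by changing only the source definition.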
I think the reducer should be fired when all of the map operations are complete. This will slow things down a tiny amount, but not significantly in most cases. It will also eliminate race conditions, and will allow for much better reliability (internet goes down during a job, event gets "lost" for whatever reason, etc.). This will also allow for anonymous reducers off the client's machine that can be run whenever necessary, or even incrementally updated.
There are a few possibilities for how this should be stored, but I am leaning towards dynamo (or dynalite for local jobs, if it is robust enough. If it's not, then we can use leveldb).
I am still thinking through whether or not we should still have the reduce event at all. It could be useful in some form for progress updates, but if that's all we use it for, it could simply send back the percent complete and the tile processed.
cc @rclark
https://travis-ci.org/mapbox/tile-reduce/jobs/93201091
It looks like a build failed, but the associated PR passed.
cc @tcql
Currently TileReduce uses all available cores. We should add an option to set this explicitly so people can use fewer resources if necessary.
cc @tcql
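One way such an option could be resolved, as a sketch only. The option name maxWorkers and the helper below are assumptions for illustration, not the current API:

```javascript
var os = require('os');

// Hypothetical option handling: cap the worker count at the number of CPUs,
// and fall back to one worker per CPU when no valid limit is given.
function resolveWorkerCount(cpuCount, maxWorkers) {
  if (typeof maxWorkers !== 'number' || !(maxWorkers >= 1)) return cpuCount;
  return Math.min(cpuCount, Math.floor(maxWorkers));
}

var workers = resolveWorkerCount(os.cpus().length, 2);
console.log('Starting up ' + workers + ' workers...');
```

Keeping the default at one worker per core preserves today's behavior for anyone who does not set the option.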
Processors may need to use async resources (eg: tile buffer crawling, c++ libs, etc.). For this to be possible, we need a standard node callback interface, instead of the current sync interface.
I think this will make queueing easier in multi-level deployments. per https://github.com/mapbox/reducer/blob/master/job-config-test.json#L3
cc @MateoV
Per this code:
for (var i = 0; i < results.length; i++) {
  data[sources[i].name] = results[i];
  if (!results[i]) return process.send({reduce: true});
}
the worker bails out and returns a reduce event if any source doesn't have data for the requested tile. This is usually great, but in some cases you want to compare disparate data sources and rely on reduce events to report whether each source does or doesn't have data in a tile. In those cases you end up losing information.
For example, say I want to find the length of roads in San Francisco that are matched by GPS datapoints. I would like to keep a tally of the total length of road in the bbox, as well as how much is matchable by GPS points. Right now, if there is no GPS data in the tile, we bail out, so I'm missing some of the total length information.
To maintain compatibility and keep the optimization for the usual cases where you want this bail-out behavior, I'm proposing we add a tile-reduce option for this, maybe requireAllSources: false (defaulting to true).
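A sketch of how the loop above might behave with the proposed option. The function is a stand-alone illustration: it returns null where the worker would send the bare reduce event, and requireAllSources is the proposed (not yet existing) option name:

```javascript
// Sketch of the collection loop with a hypothetical requireAllSources option.
// Returns null when the tile should be skipped, otherwise a map of
// source name -> tile data (left falsy for sources that have no data).
function collectSources(sources, results, requireAllSources) {
  var data = {};
  for (var i = 0; i < results.length; i++) {
    if (!results[i] && requireAllSources !== false) return null;
    data[sources[i].name] = results[i];
  }
  return data;
}

var sources = [{ name: 'streets' }, { name: 'gps' }];
var results = [{ road: [] }, null]; // no GPS data in this tile

console.log(collectSources(sources, results, true));
console.log(collectSources(sources, results, false));
```

With the option off, the map function would have to check each source for missing data itself before using it.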
I downloaded the latest planet mbtiles from https://s3.amazonaws.com/mapbox/osm-qa-tiles/latest.planet.mbtiles.gz
The count example works with the included data set, but not with the 22 GB planet mbtiles.
The example code counts by layer key, e.g. it counts the buildings layer.
{
"vector_layers": [
{
"id": "buildings",
"description": "",
"minzoom": 15,
"maxzoom": 15,
"fields": {
"id": "Number",
"osm_id": "Number",
"type": "String",
"name": "String"
}
},
{
"id": "roads",
"description": "",
"minzoom": 15,
"maxzoom": 15,
"fields": {
"id": "Number",
"osm_id": "Number",
"type": "String",
"name": "String",
"tunnel": "Number",
"bridge": "Number",
"oneway": "Number",
"z_order": "Number",
"class": "String",
"access": "String",
"service": "String",
"ref": "String"
}
}
]
}
However, in the planet mbtiles the layer key always seems to be osm, and the fields contain the tags. I'm not able to figure out how to convert the examples to work with the full dataset because the data structure is different, i.e. building is a field, not a key.
{
"vector_layers": [
{
"id": "osm",
"description": "",
"minzoom": 12,
"maxzoom": 12,
"fields": {
"_osm_way_id": "Number",
"_version": "Number",
"_changeset": "Number",
"_uid": "Number",
"_user": "String",
"_timestamp": "Number",
"hires": "String",
"hires:checkdate": "String",
"hires:imagery": "String",
"source": "String",
"boat": "String",
"highway": "String",
"name": "String",
"note": "String",
"name:en": "String",
"waterway": "String",
"natural": "String",
"width": "String",
"boundary": "String",
"maritime": "String",
"admin_level": "String",
"border_type": "String",
"water": "String",
"source:name": "String",
"power": "String",
"building": "String",
I'd appreciate any guidance. I'm by no means a developer, but I'm really interested in using this code for some data analysis.
Let's build a simple script that benches count & road diff on a small/moderate area. Then we can keep an eye on general perf, and feel out whether newer node versions will bump our perf.
Currently, writing to the write stream does not give the worker any indication about back pressure, nor does it handle back pressure itself.
https://github.com/mapbox/tile-reduce/blob/master/src/worker.js#L57-L58
This means that the worker will continue to write to the stream even though the data will not make it through to output. This can result in only a small fraction of total results being written.
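For reference, the usual Node.js pattern for respecting back pressure is to check the return value of write() and wait for the 'drain' event before continuing. A minimal sketch, with a hypothetical helper name, not the code in worker.js:

```javascript
// Write one chunk and call `next` only once the stream can accept more data.
// `out` is any Writable-like object: write() returns false when its internal
// buffer is full, and 'drain' fires when it is safe to write again.
function writeWithBackpressure(out, chunk, next) {
  if (out.write(chunk)) {
    next();
  } else {
    out.once('drain', next);
  }
}

writeWithBackpressure(process.stdout, 'tile result\n', function () {
  // Safe to process and write the next tile here.
});
```

In a worker, `next` could be the point at which the next tile is requested, so a slow consumer naturally slows tile processing instead of dropping output.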
We should have an option for max worker tile requests per second, along with a conservative default. If we set this to 50/sec, we could safely say that the max with compositing + 4 cores would be ~1k total per second.
We are starting to use tile-reduce for more than just vector tile processing. To accomplish this, we have to remove the VT parsing code and add whatever parsing code we want. Can tile-reduce simply act as a fetching mechanism and pass raw data to the processors for custom parsing?
/cc @lbud
These tests always write new files to disk - they'll always pass regardless of whether a change changes their output.
I think I have a good plan for the interleaved output bug. It's clear that we need to pipe processes to the main thread so that the output is done by a single process. My luck with diff on Node 0.12 was probably due to its new feature of stream corking/uncorking (buffering writes) by default, which possibly made the actual writes to stdout happen less often.
Even when piping, many worker streams are still piped to stdout at the same time and each worker pipes buffer chunks instead of logical pieces of output, so interleaved output still happens. To fix it, we need to make sure that we pipe to stdout in logical bits so that output from one tile is never split into several chunks.
We can do that by splitting each stream on a tile-by-tile basis before piping to main stdout. Splitting by linebreaks is not ideal since you may not have linebreaks at all (e.g. if you use process.stdout.write in each tile), and you may have many linebreaks in each tile output which we don't want to split by (it can get interleaved). Additionally, after you split, you have to re-add a linebreak to each chunk, which is an additional performance overhead.
Instead, we could manually write an RS ASCII character (0x1e, borrowed from the JSON text sequences spec) after each map fn run in worker.js, and then split by that character. This way we split only per tile, and do not have to append anything to each chunk. Additionally, we can minimize the performance overhead of splitting by using binary-split instead of split, since we don't need string conversion to control the output.
The only limitation that we'd have to impose with this approach is stating in the docs that you MUST output anything just before calling the done callback (and not in a different process tick if the map function is async).
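To make the RS-delimiter idea concrete, here is a small hand-rolled splitter. It is a stand-in for binary-split written only for illustration (the function name is hypothetical), and it assumes each worker writes a single 0x1e byte after every tile's output:

```javascript
var RS = 0x1e; // ASCII record separator

// Returns a function that accepts Buffer chunks in arrival order and calls
// onRecord once per complete, RS-terminated record, however it was chunked.
function createRecordSplitter(onRecord) {
  var pending = [];
  return function (chunk) {
    var start = 0;
    var idx;
    while ((idx = chunk.indexOf(RS, start)) !== -1) {
      pending.push(chunk.slice(start, idx));
      onRecord(Buffer.concat(pending));
      pending = [];
      start = idx + 1;
    }
    // Keep the unterminated tail until the rest of the record arrives.
    if (start < chunk.length) pending.push(chunk.slice(start));
  };
}

var feed = createRecordSplitter(function (record) {
  process.stdout.write(record); // one whole tile's output at a time
});
feed(Buffer.from('tile A, pa'));
feed(Buffer.from('rt 2\n\x1etile B\n\x1e'));
```

Because records stay as Buffers throughout, no string conversion happens on the main process, which is the same property that motivates binary-split over split.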
Alternatively, we could introduce a special API, e.g. another argument to done like this:
module.exports = function(data, tile, done) {
  done(null, data.osm.osm.length, "My output");
};
Another future problem that may arise is when you want to stream binary output (which may contain a 0x1e byte), e.g. streaming PNG raster files. But you could probably deal with this in an alternative way, e.g. providing an option to split by a different sequence of characters (each PNG starts with a unique set of bytes).
This is a tricky problem to tackle, but this seems like an acceptable solution.
We should have tests for:
What else should we test?
Simple question, but we should nail it down: Should we CamelCase or just do two words? I've been using Tile Reduce in blog posts but it would make sense to follow the lead of MapReduce - as much as I'm not sure why the world ever started to CamelCase outside of programming languages ;-)
In some cases I'm seeing invalid GeoJSON Polygons passed to the map step. It looks like features that consist of multiple exterior polygons are being converted from vector tiles to a GeoJSON Polygon instead of a GeoJSON MultiPolygon.
Then, when these Polygons are used with turf.intersect(), it throws a "TopologyError: side location conflict" exception.
Here's a test case that shows the problem.
Input file: dc.json https://gist.github.com/jamesbursa/2026d7338b7a3d227732#file-dc-json
Convert to MBTiles using Tippecanoe:
$ tippecanoe -f -o dc.mbtiles -Z 15 -z 15 -b 0 -ps dc.json
Decode one tile with tippecanoe-decode:
$ tippecanoe-decode dc.mbtiles 15 9378 12535
{ "type": "FeatureCollection", "features": [
{ "type": "Feature", "properties": { "STFIPS": "11", "CTFIPS": "11001", "STATE": "District of Columbia", "COUNTY": "District of Columbia" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -76.965699, 38.897320 ], [ -76.968470, 38.893357 ], [ -76.968760, 38.892036 ], [ -76.968237, 38.891032 ], [ -76.970214, 38.891032 ], [ -76.970214, 38.899582 ], [ -76.965852, 38.899582 ], [ -76.965699, 38.897320 ] ] ], [ [ [ -76.965710, 38.891032 ], [ -76.966973, 38.891032 ], [ -76.966501, 38.892122 ], [ -76.966000, 38.894021 ], [ -76.965710, 38.891032 ] ] ], [ [ [ -76.962199, 38.897320 ], [ -76.962100, 38.896821 ], [ -76.963985, 38.891032 ], [ -76.965222, 38.891032 ], [ -76.965699, 38.897320 ], [ -76.964200, 38.899582 ], [ -76.962481, 38.899582 ], [ -76.962199, 38.897320 ] ] ], [ [ [ -76.959227, 38.891032 ], [ -76.962204, 38.891032 ], [ -76.961099, 38.896618 ], [ -76.961132, 38.896812 ], [ -76.961199, 38.897217 ], [ -76.961703, 38.898319 ], [ -76.961169, 38.899509 ], [ -76.961137, 38.899582 ], [ -76.959227, 38.899582 ], [ -76.959227, 38.891032 ] ] ] ] } }
] }
Note that the output is correctly a MultiPolygon (containing 4 Polygons). See https://gist.github.com/jamesbursa/2026d7338b7a3d227732#file-tile-json
Run through a test TileReduce:
https://gist.github.com/jamesbursa/2026d7338b7a3d227732#file-tilereduce_test-js
https://gist.github.com/jamesbursa/2026d7338b7a3d227732#file-tilereduce_test_map-js
This simply processes the one tile of interest (15 9378 12535), outputs the feature in the tile, and attempts a turf.intersect() which throws an exception. Note that the feature as passed to the map function is a Polygon, not a MultiPolygon as tippecanoe-decode produces for the same tile.
Converting the Polygon to a MultiPolygon manually allows the turf.intersect() to work.
$ ./tilereduce_test.js
Starting up 8 workers... Job started.
Processing 1 tiles.
1 tiles processed in 0s.
map tile [9378,12535,15]
---------------------------------------------------------------
feature = { type: 'Feature',
geometry:
{ type: 'Polygon',
coordinates:
[ [ [ -76.96570068597794, 38.89732062336043 ],
[ -76.96847140789032, 38.89335845766496 ],
[ -76.96876108646393, 38.89203699076319 ],
[ -76.96823805570602, 38.89103282648847 ],
[ -76.97021484375, 38.89103282648847 ],
[ -76.97021484375, 38.89958342598271 ],
[ -76.96585357189178, 38.89958342598271 ],
[ -76.96570068597794, 38.89732062336043 ] ],
[ [ -76.965711414814, 38.89103282648847 ],
[ -76.96697473526001, 38.89103282648847 ],
[ -76.96650266647339, 38.8921225841508 ],
[ -76.9660010933876, 38.89402231327574 ],
[ -76.965711414814, 38.89103282648847 ] ],
[ [ -76.9622004032135, 38.89732062336043 ],
[ -76.96210116147995, 38.89682171160487 ],
[ -76.96398675441742, 38.89103282648847 ],
[ -76.96522325277328, 38.89103282648847 ],
[ -76.96570068597794, 38.89732062336043 ],
[ -76.96420133113861, 38.89958342598271 ],
[ -76.96248203516006, 38.89958342598271 ],
[ -76.9622004032135, 38.89732062336043 ] ],
[ [ -76.959228515625, 38.89103282648847 ],
[ -76.96220576763153, 38.89103282648847 ],
[ -76.9611006975174, 38.89661922340707 ],
[ -76.96113288402557, 38.89681336158753 ],
[ -76.96119993925095, 38.89721833629872 ],
[ -76.96170419454575, 38.89832052381641 ],
[ -76.96117043495178, 38.89951036613891 ],
[ -76.9611382484436, 38.89958342598271 ],
[ -76.959228515625, 38.89958342598271 ],
[ -76.959228515625, 38.89103282648847 ] ] ] },
properties:
{ STFIPS: '11',
CTFIPS: '11001',
STATE: 'District of Columbia',
COUNTY: 'District of Columbia' } };
square = { type: 'Feature',
geometry:
{ type: 'Polygon',
coordinates:
[ [ [ -76.965, 38 ],
[ -76, 38 ],
[ -76, 38.895 ],
[ -76.965, 38.895 ],
[ -76.965, 38 ] ] ] },
properties: {} };
*** turf.intersect exception: TopologyError: side location conflict [ (-76.96570068597794, 38.89732062336043) ]
---------------------------------------------------------------
converting to MultiPolygon
feature = { type: 'Feature',
geometry:
{ type: 'MultiPolygon',
coordinates:
[ [ [ [ -76.96570068597794, 38.89732062336043 ],
[ -76.96847140789032, 38.89335845766496 ],
[ -76.96876108646393, 38.89203699076319 ],
[ -76.96823805570602, 38.89103282648847 ],
[ -76.97021484375, 38.89103282648847 ],
[ -76.97021484375, 38.89958342598271 ],
[ -76.96585357189178, 38.89958342598271 ],
[ -76.96570068597794, 38.89732062336043 ] ] ],
[ [ [ -76.965711414814, 38.89103282648847 ],
[ -76.96697473526001, 38.89103282648847 ],
[ -76.96650266647339, 38.8921225841508 ],
[ -76.9660010933876, 38.89402231327574 ],
[ -76.965711414814, 38.89103282648847 ] ] ],
[ [ [ -76.9622004032135, 38.89732062336043 ],
[ -76.96210116147995, 38.89682171160487 ],
[ -76.96398675441742, 38.89103282648847 ],
[ -76.96522325277328, 38.89103282648847 ],
[ -76.96570068597794, 38.89732062336043 ],
[ -76.96420133113861, 38.89958342598271 ],
[ -76.96248203516006, 38.89958342598271 ],
[ -76.9622004032135, 38.89732062336043 ] ] ],
[ [ [ -76.959228515625, 38.89103282648847 ],
[ -76.96220576763153, 38.89103282648847 ],
[ -76.9611006975174, 38.89661922340707 ],
[ -76.96113288402557, 38.89681336158753 ],
[ -76.96119993925095, 38.89721833629872 ],
[ -76.96170419454575, 38.89832052381641 ],
[ -76.96117043495178, 38.89951036613891 ],
[ -76.9611382484436, 38.89958342598271 ],
[ -76.959228515625, 38.89958342598271 ],
[ -76.959228515625, 38.89103282648847 ] ] ] ] },
properties:
{ STFIPS: '11',
CTFIPS: '11001',
STATE: 'District of Columbia',
COUNTY: 'District of Columbia' } };
intersect = { type: 'Feature',
properties: {},
geometry:
{ type: 'MultiPolygon',
coordinates:
[ [ [ [ -76.96142100332834, 38.895 ],
[ -76.959228515625, 38.895 ],
[ -76.959228515625, 38.89103282648847 ],
[ -76.96220576763153, 38.89103282648847 ],
[ -76.96142100332834, 38.895 ] ] ],
[ [ [ -76.965, 38.89103282648847 ],
[ -76.965, 38.895 ],
[ -76.96269454111474, 38.895 ],
[ -76.96398675441742, 38.89103282648847 ],
[ -76.965, 38.89103282648847 ] ] ] ] } };
---------------------------------------------------------------
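The manual conversion used in the log above can be sketched as follows. This is an illustrative workaround, not the code in the linked gist, and the function name is hypothetical. It treats every ring as an exterior ring, which matches this feature but would be wrong for a Polygon with holes:

```javascript
// Naive conversion: wrap every ring of a Polygon in its own polygon.
// This treats ALL rings as exterior rings, which matches the feature in the
// log above but would be wrong for a Polygon that has holes.
function ringsToMultiPolygon(feature) {
  if (feature.geometry.type !== 'Polygon') return feature;
  return {
    type: 'Feature',
    properties: feature.properties,
    geometry: {
      type: 'MultiPolygon',
      coordinates: feature.geometry.coordinates.map(function (ring) {
        return [ring];
      })
    }
  };
}
```

A robust fix in the vector tile to GeoJSON conversion would need to classify rings as exterior or interior (for example by winding order) rather than assume all are exterior.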
What I thought before was spherical geometry precision problems I now think is really tile-encoding problems.
Example: http://www.openstreetmap.org/node/1004264211. The real location is lat="38.9347951" lon="-77.0533697"
In the through way http://www.openstreetmap.org/way/38132834 it is encoded at z12 as [-77.05332040786743,38.93477700153804], [1247,1476]
In the way that ends there http://www.openstreetmap.org/way/6054333 it is encoded at z12 as [-77.05336332321167,38.93479369264057], [1245,1475]
Or at least that's what it looks like. I would have expected tile encoding to drop nodes but not to relocate them.
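To put the size of the discrepancy in context, the sketch below computes the width of one tile-grid unit in degrees of longitude. The extent of 4096 units per tile is an assumption (the common vector tile default), consistent with the encoded values above being below 4096:

```javascript
// Width of one tile-grid unit in degrees of longitude.
// EXTENT = 4096 is an assumption (the common vector tile default).
var EXTENT = 4096;
function unitInDegreesLon(zoom) {
  return 360 / (Math.pow(2, zoom) * EXTENT);
}

var unit = unitInDegreesLon(12);
// Longitudes of the same node as decoded from the two ways above.
var throughWay = -77.05332040786743;
var endingWay = -77.05336332321167;
console.log(unit);
console.log((throughWay - endingWay) / unit);
```

Under that assumption the two decoded longitudes differ by about two grid units, which matches the reported x values of 1247 and 1245. So the node is not only snapped to the grid but snapped to different grid cells in the two ways.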
This would be a breaking change. At the moment mapOptions are sent to workers as global objects. I am pretty sure this is safe, since workers should not share globals across processes, however, I think it is better to be explicit. I propose we remove the global assignment and add another parameter to our worker functions. The new interface would look like this:
function(data, tile, write, opts, done) {}
Thoughts?
Looks like the example code is missing an access token in the vtile url that prevents it from being usable.
I ran npm test on tile-reduce 3.0. It fails on Windows 10 with node v5.1.0, while it succeeds on Ubuntu, which is strange. Here is the error message:
1) test/test.count.js count implementation, mbtiles cover found all features in overlapping mbtiles:
Error: found all features in overlapping mbtiles
+ expected - actual
-0
+36597
at EventEmitter.<anonymous> (test\test.count.js:53:7)
at shutdown (src\index.js:136:8)
at reduce (src\index.js:126:36)
at ChildProcess.handleMessage (src\index.js:47:25)
at handleMessage (internal/child_process.js:686:10)
at Pipe.channel.onread (internal/child_process.js:440:11)
2) test/test.count.js count implementation, explicit mbtiles cover found all features in overlapping mbtiles:
Error: found all features in overlapping mbtiles
+ expected - actual
-0
+36597
at EventEmitter.<anonymous> (test\test.count.js:72:7)
at shutdown (src\index.js:136:8)
at reduce (src\index.js:126:36)
at ChildProcess.handleMessage (src\index.js:47:25)
at handleMessage (internal/child_process.js:686:10)
at Pipe.channel.onread (internal/child_process.js:440:11)
3) test/test.count.js count implementation, tileStream cover found all features in listed tiles:
Error: found all features in listed tiles
+ expected - actual
-0
+16182
at EventEmitter.<anonymous> (test\test.count.js:91:7)
at shutdown (src\index.js:136:8)
at reduce (src\index.js:126:36)
at ChildProcess.handleMessage (src\index.js:47:25)
at handleMessage (internal/child_process.js:686:10)
at Pipe.channel.onread (internal/child_process.js:440:11)
Hi, I am looking for suggestions. Recently, I have been using tile-reduce to do some statistics on osm-qa-tiles. I want to calculate a metric (e.g. road density) on each tile at zoom 12, then output it to Mapbox Studio for visualization. The problem is how to store the result. As zoom 12 has about 16 million tiles, the output may be too large to store as GeoJSON, MBTiles, or UTF8Grids. So I hope to get some advice from you, as you are experts in this field. :pray:
The example has a confusing invocation of tilereduce:
var TileReduce = new require('tile-reduce');
...
var tilereduce = TileReduce(bbox, opts);
A more common expectation would be
var TileReduce = require('tile-reduce');
...
var tilereduce = new TileReduce(bbox, opts);
https://github.com/mapbox/tile-reduce/blob/master/index.js#L86-L88
This checks the first tile number 3 times, when it should check 0,1,2.
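A sketch of what checking all three indices could look like. This is a hypothetical validator written for illustration, not the code at the link above, and the exact condition tested there may differ:

```javascript
// Hypothetical validator (not the code at the link above): check each of the
// three tile indices, 0, 1 and 2, rather than index 0 three times.
function isValidTile(tile) {
  if (!Array.isArray(tile) || tile.length !== 3) return false;
  for (var i = 0; i < 3; i++) {
    if (typeof tile[i] !== 'number' || isNaN(tile[i])) return false;
  }
  return true;
}

console.log(isValidTile([9378, 12535, 15]));
console.log(isValidTile([9378, 'oops', 15]));
```

Looping over the indices avoids the copy-paste slip that leads to the same index being tested repeatedly.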
I think there should be a list of awesome tile-reduce projects, such as osm-coverage, osm-sidewalker.
There is an example now, but there should also be explicit docs.
Algorithms that involve crawling tiles may make extra requests. This option would allow for throttling below the default limit of 200 requests per second to account for this.
@aaronlidman can you sketch out what to-fix input would look like? My understanding is that if we output a csv where each row holds geometry (WKT?), to-fix should be able to handle this. Do we need to make a custom plugin for each type of task with its own UI?
For the purpose of this example, let's say we had a tile-reduce job that output a collection of geojson points where there were disconnected major roads identified. What would be the best way to get this data into to-fix?
We have docs for passing mapOptions, but not consuming them.
cc @tcql
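A sketch of what the consuming side could look like, assuming the options are exposed to map scripts as global.mapOptions and that the map script receives (data, tile, writeData, done). The minLength option is purely illustrative and not defined by tile-reduce:

```javascript
// Hypothetical map script. It reads options from global.mapOptions, which is
// where the main process is assumed to place them; `minLength` is an
// illustrative option name, not one defined by tile-reduce.
function map(data, tile, writeData, done) {
  var opts = global.mapOptions || {};
  var minLength = opts.minLength || 0;
  done(null, { tile: tile, minLength: minLength });
}

module.exports = map;
```

Falling back to an empty object keeps the script usable when the job is started without any mapOptions.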