
geobuf's Introduction

Geobuf


Geobuf is a compact binary encoding for geographic data.

Geobuf provides nearly lossless compression of GeoJSON data into protocol buffers. Advantages over using GeoJSON alone:

  • Very compact: typically makes GeoJSON 6-8 times smaller.
  • 2-2.5x smaller even when comparing gzipped sizes.
  • Very fast encoding and decoding — even faster than native JSON parse/stringify.
  • Can accommodate any GeoJSON data, including extensions with arbitrary properties.

The encoding format also potentially allows:

  • Easy incremental parsing — get features out as you read them, without the need to build in-memory representation of the whole data.
  • Partial reads — read only the parts you actually need, skipping the rest.

Think of this as an attempt to design a simple, modern Shapefile successor that works seamlessly with GeoJSON. Unlike Mapbox Vector Tiles, it aims for nearly lossless compression of datasets — without tiling, projecting coordinates, flattening geometries or stripping properties.

Note that the encoding schema is not stable yet — it may still change as we get community feedback and discover new ways to improve it.

"Nearly" lossless means coordinates are encoded with precision of 6 digits after the decimal point (about 10cm).

Sample compression sizes

Data             JSON       JSON (gz)   Geobuf     Geobuf (gz)
US zip codes     101.85 MB  26.67 MB    12.24 MB   10.48 MB
Idaho counties   10.92 MB   2.57 MB     1.37 MB    1.17 MB

API

encode

var buffer = geobuf.encode(geojson, new Pbf());

Given a GeoJSON object and a Pbf object to write to, returns a Geobuf as a Uint8Array of bytes. In Node, you can use Buffer.from to convert the result back to a Buffer.
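
For example, a minimal Node sketch of encoding a GeoJSON file to disk (file names are placeholders, not part of the library):

var fs = require('fs');
var Pbf = require('pbf');
var geobuf = require('geobuf');

// read a GeoJSON file, encode it, and write the resulting bytes out
var geojson = JSON.parse(fs.readFileSync('data.json', 'utf8'));
var buffer = geobuf.encode(geojson, new Pbf()); // Uint8Array
fs.writeFileSync('data.pbf', Buffer.from(buffer));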

decode

var geojson = geobuf.decode(new Pbf(data));

Given a Pbf object with Geobuf data, returns a GeoJSON object. When loading Geobuf data over XMLHttpRequest, you need to set responseType to arraybuffer.
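
For example, a browser sketch of loading Geobuf over XMLHttpRequest (the URL is a placeholder):

var xhr = new XMLHttpRequest();
xhr.open('GET', 'data.pbf', true);
xhr.responseType = 'arraybuffer'; // without this the binary payload would be mangled into a string
xhr.onload = function () {
    var geojson = geobuf.decode(new Pbf(xhr.response));
    // use the GeoJSON object here
};
xhr.send();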

Install

Node and Browserify:

npm install geobuf

Browser build CDN links:

Building locally:

npm install
npm run build-dev # dist/geobuf-dev.js (development build)
npm run build-min # dist/geobuf.js (minified production build)

Command Line

npm install -g geobuf

Installs these nifty binaries:

  • geobuf2json: turn Geobuf from stdin or specified file to GeoJSON on stdout
  • json2geobuf: turn GeoJSON from stdin or specified file to Geobuf on stdout
  • shp2geobuf: given a Shapefile filename, send Geobuf on stdout

json2geobuf data.json > data.pbf
shp2geobuf myshapefile > data.pbf
geobuf2json data.pbf > data.json

Note that for big files the geobuf2json command can be pretty slow; the bottleneck is not the decoding itself but the native JSON.stringify call on the decoded object, which is needed to pipe it as a string to stdout. On some files, this step can take 40 times longer than the actual decoding.

See Also

  • geojsonp — the prototype that led to this project
  • pygeobuf — Python implementation of Geobuf
  • twkb — a geospatial binary encoding that doesn't support topology and doesn't encode any non-geographic properties besides id
  • vector-tile-spec
  • topojson — an extension of GeoJSON that supports topology
  • WKT and WKB — popular in databases
  • EWKB — a popular superset of WKB


geobuf's Issues

Command line conversion loses the properties of a FeatureCollection

It looks to me from the code that the properties of a FeatureCollection are never written.

So I'm thinking this might be a bug:

echo '{"type": "FeatureCollection", "properties": {"name": "collection"}, "features": []}' \
 | ./node_modules/.bin/json2geobuf \
 | ./node_modules/.bin/geobuf2json \
 | jq .
# -> {"type":"FeatureCollection","features":[]}

Implementation guide.

I'm wondering whether there is any example implementation guide for geobuf, anything that could guide someone like me (who's totally lost as to how to use it). Ideally some type of Leaflet/Mapbox-related implementation example would be great.

I understand that you first need to encode your GeoJSON file, and then include the geobuf browser build in your Leaflet page.

For leaflet, I get that you can convert the encoded file to geojson in the browser like so:

var layer = L.geoJson( geobuf.decode( new Pbf(data) ) ).addTo(map);

However, I'm totally lost as to how to bring my .pbf file into Leaflet. How do I bring it in? Can I just include it the same way I would a normal GeoJSON layer, i.e.:

<script src="json_County201602090.pbf"></script>

And 'data' would be a variable inside my pbf file?

Can someone shed some light on how to properly use geobuf?

Apologies if my issue isn't that sophisticated. I'd really appreciate some kind of guide.

Thanks.
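
For reference, a minimal sketch of one way this can be done (not an official example; the map variable and file name come from the question above, everything else is an assumption): a .pbf file is binary data, so it can't be loaded with a script tag; instead, fetch the bytes with XMLHttpRequest and decode them.

var xhr = new XMLHttpRequest();
xhr.open('GET', 'json_County201602090.pbf', true);
xhr.responseType = 'arraybuffer'; // get raw bytes, not a string
xhr.onload = function () {
    var geojson = geobuf.decode(new Pbf(xhr.response));
    L.geoJson(geojson).addTo(map); // map is an existing L.Map instance
};
xhr.send();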

Leaflet geobuf example

Hi, is it possible to get a Leaflet example map, perhaps using the US zip code data you mention, as an example of how well geobuf performs and to show how it is implemented?

Thanks,

Conor.

Feature set or feature?

Should this encode single features, like WKT, or feature sets, like GeoJSON? What's the overhead of always doing feature sets?

Does not retain feature.id

Not sure if this is by design or not?

var assert = require('assert');
var geobuf = require('geobuf');
var f = {
    type: 'Feature',
    id: 'hello there',
    properties: { some: 'thing' },
    geometry: {
        type: 'Point',
        coordinates: [ 0, 0 ]
    }
};

// throws
assert.equal(f, geobuf.geobufToFeature(geobuf.featureToGeobuf(f).toBuffer()));

Browser version cannot be created

I downloaded the latest release, ran npm install and then npm run build-dev (or build-min), and it errors out because it cannot find the build-dev or build-min scripts.

What is the correct way of producing browser js?

Geobuf Index

Let's discuss indexing. Previous discussion: #27 (comment)

I think the solution I'd like to see here is a PBF-based index format that would come as a separate file alongside the Geobuf file and would store:

  • a serialized R-tree (rbush) with leaves pointing to feature offsets in the Geobuf PBF
  • a map of feature ids to feature offsets for fast single-feature seeking

The R-tree serialization should avoid embedded messages because they are hard to decode lazily. I'd imagine one possible solution to be nodes stored as a flat set of messages, with references to children implemented as offset pointers.
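
As a rough JavaScript illustration of the idea (assuming rbush's toJSON/fromJSON serialization and a hypothetical featureOffsets array of {bbox, offset} records produced by a lower-level scan of the Geobuf buffer):

var RBush = require('rbush');

// build an in-memory R-tree whose leaf entries carry feature byte offsets
var tree = new RBush();
featureOffsets.forEach(function (f) {
    tree.insert({
        minX: f.bbox[0], minY: f.bbox[1],
        maxX: f.bbox[2], maxY: f.bbox[3],
        offset: f.offset // where the feature starts in the Geobuf PBF
    });
});

// rbush can round-trip its internal tree structure as plain JSON; a real index
// format would instead encode this (plus an id -> offset map) as flat PBF messages
var serialized = JSON.stringify(tree.toJSON());
var restored = new RBush().fromJSON(JSON.parse(serialized));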

Support missing coordinates

[[10, 10, 5], [10, 10]] currently roundtrips to [[10, 10, 5], [10, 10, 0]]. Ideally we should support things like that.

Ditch TopoJSON support?

@mourner: I also had some thoughts about whether I made a mistake by pushing TopoJSON support instead of keeping things simple and limited to GeoJSON
The size compression benefits are good but it makes the format significantly more complex, this will harm adoption rate
well, not very complex currently, but it'll be much more complex when we make it streamable
@tmcw: hm, i think yes, we should dump it.
i'm divided on topojson because of the dual purpose of topology in it
like, a really good open source implementation of a topology-supporting geometry system... super useful
but topojson is mainly doing it to save bytes

Inaccurate floating point arithmetics produce invalid Geometries (non-closed LinearRings)

How to reproduce

Input GeoJSON file created with ogr2ogr (polygon.geojson)

{
  "type": "FeatureCollection",
  "features": [
    { "type": "Feature", "properties": { }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 5425435.733081569895148, 2012689.63544030720368 ], [ 5425333.066045090556145, 2012658.8061882276088 ], [ 5425324.357915714383125, 2012693.518385621719062 ], [ 5425426.5193927353248, 2012720.238697179593146 ], [ 5425435.733081569895148, 2012689.63544030720368 ] ] ] } }
  ]
}

1. json2geobuf data/polygon.geojson > data/polygon.geobuf

2. geobuf2json data/polygon.geobuf

Output:

{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[5425435.733082,2012689.63544],[5425333.066046,2012658.806188],[5425324.357917,2012693.518385],[5425426.519394,2012720.238697],[5425435.733083,2012689.63544]]]},"properties":{}}]}

Note that the first and last vertices are not equal.

re: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/round

Geobuf Format: IDs

IDs can be of type string or sint32. What's the rationale behind this? Can we support 64-bit IDs? I think that in this day and age 32-bit IDs are too limiting, and OSM data would need 64-bit IDs.

Geobuf format and streaming writes

/cc @mourner @springmeyer @artemp

The problem

The geobuf format as it is currently defined can't be written in a stream. Instead, the whole data has to be assembled in memory first and then written out. This directly follows from the way Protobuf encodes its messages. A good description of the problem can be found in the header comments of the protobuf writer of the UPB library. In short the problem is that Protobuf uses nested messages to encode the data and each message has a length header which is encoded as a Varint. But we can't write out the length (or even know how long the length field is, because a Varint is of variable length) before we have the whole message assembled.

This is, of course, a major problem for a format that is intended for huge files.

A possible solution: Remove outermost message

Because this is an inherent limitation of the Protobuf format, we have to look outside the format for a solution. Of course we could throw away the whole Protobuf format, but that's not needed. What's needed is a wrapper around it so that the Protobuf encoder/decoder only sees part of the data. In the simplest case we encode the data in pieces:

We remove the outermost message Data and then write each of the other data pieces as its own protobuf message. We might want to move the keys, dimensions, and precision into a message Header or so. The oneof data_type construct doesn't work any more; instead we just have to keep reading messages until EOF. But we don't know which kinds of messages will be in there (Feature, FeatureCollection, etc.), so we have to add this information to the newly introduced message Header in some way and then parse accordingly. This should certainly be doable and doesn't require a lot of change. But I think there is...

A better solution: Chunking the data

This looks slightly more complicated at first but has many advantages, so bear with me. Let's encode the data in chunks; each chunk gets a length field and the data:

CHUNK
    LENGTH (4 bytes)
    DATA (LENGTH bytes)
CHUNK
    LENGTH (4 bytes)
    DATA (LENGTH bytes)
...

Each DATA block is a complete Protobuf message which can be parsed on its own. This idea is, of course, not new; it is what the OpenStreetMap OSM PBF format does.

A typical DATA field will contain maybe a few thousand geometries or features. Note that this does not mean that the contents of the different chunks are somehow logically distinct. Logically this is still one data stream. The chunking is purely an encoding issue and files with the same data split up into different sized chunks would still represent the same data.
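
A minimal Node sketch of this kind of length-prefixed framing (illustrative only; the 4-byte big-endian length is an assumption matching the layout above):

// write one chunk: a 4-byte length header followed by the protobuf-encoded data
function writeChunk(stream, data) {
    var header = Buffer.alloc(4);
    header.writeUInt32BE(data.length, 0);
    stream.write(header);
    stream.write(data);
}

// read one chunk back: first the length, then exactly that many bytes
function readChunk(buffer, pos) {
    var length = buffer.readUInt32BE(pos);
    return {
        data: buffer.slice(pos + 4, pos + 4 + length), // one complete protobuf message
        next: pos + 4 + length                         // position of the following chunk
    };
}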

This format has some added advantages:

  • the LENGTH field can tell us how much memory to allocate for buffering the DATA part
  • reading and writing can be done in parallel, because several threads can work on encoding/decoding different chunks at the same time. In fact that's what Libosmium does when parsing OSM PBF.
  • concatenating two files is trivial: deal with the headers, then just copy data chunks

Now it gets a little bit more complicated than that. (This is again based on experience with the OSM PBF format.) The first chunk should probably contain some kind of header. This could include metadata such as the dimensions setting and the keys. All other chunks contain the data itself. So chunks (OSM PBF calls them Blobs) need to contain some kind of type identifier to differentiate between a header chunk and a data chunk. OSM PBF does this by adding another level of Protobuf encoding (see the BlobHeader and Blob messages), which seems like overkill to me. It makes the implementation rather confusing and probably slower. Instead we can just add a type field:

CHUNK
    HEADER (fixed size)
        TYPE (1 byte, first chunk always =META)
        LENGTH (4 bytes)
    DATA (LENGTH bytes)
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, following chunks always =GEOMDATA)
        LENGTH (4 bytes)
    DATA (LENGTH bytes)
...

Strictly speaking we can live without that TYPE field, because the header always has to be the first chunk and the following chunks are data, but having the type seems cleaner and gives us more flexibility. And maybe we want to have different types of data? This is something that has to be explored.

OSM PBF adds another useful feature: encoding chunks (or Blobs) with zlib or other compression formats. Each chunk can be optionally compressed, and the type of compression is noted in the chunk header. This can squeeze out the last bytes from the resulting files, but it is still possible to encode and decode the file in chunks and in parallel. To add this we need, again, some type field. And we should also store the size of the uncompressed data, because that lets us give the decompressor a buffer of the correct size.

Note that I have used 4-byte length fields in my description. This is probably enough for each chunk; in fact, chunks should not get too big, because each one has to fit into memory after all (several of them if we encode/decode in parallel). OSM PBF has some extra constraints on the sizes of different structures, which can help implementations because fixed-size buffers can be used.

Note also that there is no overall length field for the whole file. That's important, because it allows us to stream-write the data without knowing beforehand how many features the file will contain or how big it will be. (We might want to end the file with some END chunk that marks the end of the file to guard against truncation. Optionally it could contain a checksum. This is something that OSM PBF is missing, but it could be useful to detect data corruption.)

And while we are at it, I suggest adding a fixed 4-byte (or so) magic header that is always the same but can be used by tools such as file to determine the file type easily, and a fixed-size version field for future-proofing the format.

This brings us to something like this:

MAGIC (fixed size)
VERSION (=1, fixed size)
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, first chunk always =META)
        COMPRESSION_TYPE (1 byte)
        RAW_LENGTH (4 bytes)
        ENCODED_LENGTH (4 bytes)
    DATA (LENGTH bytes)
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, following chunks always =GEOMDATA)
        COMPRESSION_TYPE (1 byte)
        RAW_LENGTH (4 bytes)
        ENCODED_LENGTH (4 bytes)
    DATA (LENGTH bytes)
...
CHUNK
    HEADER (fixed size)
        TYPE (1 byte, last chunk always =END)
        COMPRESSION_TYPE (1 byte)
        RAW_LENGTH (4 bytes)
        ENCODED_LENGTH (4 bytes)
    DATA (LENGTH bytes)
        CHECKSUM

Some padding might be necessary to have length fields on 4-byte boundaries etc. And all length fields should probably be encoded in network byte order. Those details can be worked out.

Inside the DATA we'd still use the Protobuf-encoded data (nearly) as before. No big change there. For some things, such as the keys, we have to discuss whether they fit better in the META header or the DATA part. In the META and END blocks we can use Protobuf too, or any other encoding. Because that's not a lot of data, it isn't that important to pack it tightly, and using a simpler format might allow simpler access to the metadata. On the other hand, Protobuf is tried and true and allows for easy extensibility.

Purpose, use cases, and priorities?

Perhaps you are already planning to address this in #3 but I'd like to know:

What is the primary purpose of this? Why come up with a new format in a landscape dominated by shapefiles, GeoJSON, and (suboptimal) OGC specs? Don't get me wrong, I think applying the ideas of vector tiles to a more open-ended format is brilliant and I want to see where this goes; I just think we'd all benefit from a little more insight into the justification.

What are the primary use cases for this? Is it intended to provide a compact means of transferring data from server to client (i.e., browser), from file storage to server (i.e., vector tiles to mapnik), etc?

What are your main priorities for this? How would you rank things like the following?

  • speed of encoding / decoding
  • file size
  • support for streaming parser (i.e., random access to feature at a time) vs parsing entire file into a local data representation that allows random access

Configurable precision

1e6 rounding is hardcoded in geobuf, but some datasets don't lose much with a lower precision like 1e4. Perhaps we could make this configurable and also encode it as a property in the format to give more room for geometry compression.

This should become more relevant with delta encoding (#23), because lower-precision data will have much smaller deltas.
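
For illustration only, the effect of the rounding at different precisions looks roughly like this (precision here is the number of decimal digits kept):

// precision 6 keeps roughly 10 cm of detail; precision 4 roughly 10 m
function roundCoord(x, precision) {
    var factor = Math.pow(10, precision);
    return Math.round(x * factor) / factor;
}
roundCoord(12.3456789, 6); // 12.345679
roundCoord(12.3456789, 4); // 12.3457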

Error: Unimplemented type: 3

I'm making a fairly simple call to the Mapbox API in a Node script, and it fails to decode the response when following the geobuf example. Logging the body shows that a valid, still-encoded response has come back from the API, but attempting to decode it throws this:

/Users/wboykinm/github/tribes/processing/water/node_modules/pbf/index.js:204
        else throw new Error('Unimplemented type: ' + type);
                   ^
Error: Unimplemented type: 3
    at Object.Pbf.skip (/Users/wboykinm/github/tribes/processing/water/node_modules/pbf/index.js:204:20)
    at Object.Pbf.readFields (/Users/wboykinm/github/tribes/processing/water/node_modules/pbf/index.js:40:45)
    at Object.decode (/Users/wboykinm/github/tribes/processing/water/node_modules/geobuf/decode.js:17:19)
    at Request._callback (/Users/wboykinm/github/tribes/processing/water/get.js:29:26)
    at Request.self.callback (/Users/wboykinm/github/tribes/processing/water/node_modules/request/request.js:199:22)
    at Request.emit (events.js:110:17)
    at Request.<anonymous> (/Users/wboykinm/github/tribes/processing/water/node_modules/request/request.js:1036:10)
    at Request.emit (events.js:129:20)
    at IncomingMessage.<anonymous> (/Users/wboykinm/github/tribes/processing/water/node_modules/request/request.js:963:12)
    at IncomingMessage.emit (events.js:129:20)

Am I missing some basic preprocessing of the API response?
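
One likely cause (an assumption, not confirmed in this report) is that the response body has been decoded into a UTF-8 string before reaching Pbf, corrupting the binary data. With the request library, passing encoding: null keeps the body as a raw Buffer (apiUrl is a placeholder):

var request = require('request');
var Pbf = require('pbf');
var geobuf = require('geobuf');

var apiUrl = '...'; // placeholder for the Mapbox API URL being queried

// encoding: null makes request return the raw bytes instead of a UTF-8 string
request({url: apiUrl, encoding: null}, function (err, res, body) {
    if (err) throw err;
    var geojson = geobuf.decode(new Pbf(body));
});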

mapbox/pbf as an encoding/decoding alternative?

As an alternative to the Protobuf.js and protocol-buffers libraries, could we use Konstantin's pbf? It's much more low-level and has no .proto reading, but it's also simpler and gives us more control over encoding and decoding, potentially making it faster.

shp2geobuf fails with shapefile containing null geometries

node node_modules\geobuf\bin\shp2geobuf contour_5.shp > contour_5.geobuf

node_modules\geobuf\encode.js:51
    if (obj.type === 'FeatureCollection') {
           ^
TypeError: Cannot read property 'type' of null
    at analyze (node_modules\geobuf\encode.js:51:12)
    at analyze (node_modules\geobuf\encode.js:56:9)
    at analyze (node_modules\geobuf\encode.js:52:51)
    at encode (node_modules\geobuf\encode.js:26:5)
    at node_modules\geobuf\bin\shp2geobuf:9:26
    at node_modules\geobuf\node_modules\shapefile\index.js:15:5
    at node_modules\geobuf\node_modules\shapefile\read.js:27:11
    at notify (node_modules\geobuf\node_modules\shapefile\node_modules\queue-async\queue.js:47:18)
    at node_modules\geobuf\node_modules\shapefile\node_modules\queue-async\queue.js:39:16
    at FSReqWrap.oncomplete (fs.js:95:15)

Logging obj at the start of the analyze function shows this before the TypeError occurs:

-------obj:
 { type: 'Feature',
  properties: { ID: 51, ELEV: 375 },
  geometry: null }
-------obj:
 null

Include projection info

Since this deals in native projections, it should encode something about the projection in the protobuf. How about a proj4 string?

shp2geobuf fails with shapefile containing one feature

shp2geobuf data/polygon.shp > data/polygon.geobuf

/Users/artem/Projects/mapbox/geobuf/encode.js:50
    if (obj.type === 'FeatureCollection') {
           ^
TypeError: Cannot read property 'type' of undefined
    at analyze (/Users/artem/Projects/mapbox/geobuf/encode.js:50:12)
    at encode (/Users/artem/Projects/mapbox/geobuf/encode.js:26:5)
    at /Users/artem/Projects/mapbox/geobuf/bin/shp2geobuf:8:26
    at /Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/index.js:14:23
    at /Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/read.js:26:29
    at notify (/Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/node_modules/queue-async/queue.js:45:26)
    at /Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/node_modules/queue-async/queue.js:35:11
    at /Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/read.js:17:33
    at /Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/index.js:70:27
    at readRecordHeader (/Users/artem/Projects/mapbox/geobuf/node_modules/shapefile/shp.js:34:40)

Nested properties are not handled properly

Nested properties, e.g.,

var feat = {
    type: 'Feature',
    geometry: {
        type: 'Point',
        coordinates: [0, 0]
    },
    properties: {
        nested: {nope: 'yep'}
    }
};

are not handled properly; they are lost during encode / decode.
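
One possible workaround sketch (not part of the library; stringifyNested and parseNested are hypothetical helpers): serialize nested values to JSON strings before encoding and parse them back after decoding.

function stringifyNested(feature) {
    for (var key in feature.properties) {
        var value = feature.properties[key];
        if (value !== null && typeof value === 'object') {
            feature.properties[key] = JSON.stringify(value); // store as a plain string property
        }
    }
    return feature;
}

function parseNested(feature) {
    for (var key in feature.properties) {
        var value = feature.properties[key];
        if (typeof value === 'string' && (value.charAt(0) === '{' || value.charAt(0) === '[')) {
            try { feature.properties[key] = JSON.parse(value); } catch (e) { /* leave as-is */ }
        }
    }
    return feature;
}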

Stream multiple geobuf messages

I have a toolchain where I pipe GeoJSON documents as line delimited JSON from one to another. It would be nice to encode the single GeoJSON instances as geobuf and pipe those around too.

My first try was to use binary-split to split a stream into several Geobuf instances. Unfortunately you have to pass a special splitOn byte sequence, which I don't know, since the default line break could also appear inside a Geobuf.

So is there any byte sequence that reliably marks the start or end of a Geobuf?
I know of #37, but there only the streaming mode of a single GeoJSON document has been discussed.

Fix 1.0.x release tags

Tags for 1.0.0 and 1.0.1 should be prefixed with a v for consistency with other tags on this project.

Using geobuf in browser, require not found

I am trying to use geobuf in the browser.
I ran
npm run build-min
and got a geobuf-min.js file, which I have put in the same folder as the HTML file.
I keep getting this error when trying to use the decode function:
Uncaught ReferenceError: require is not defined
Am I perhaps doing something wrong with the browserify part?
Thank you very much

geobuf does not play well with pbf >= 1.3.6

I have a server -> client setup where I encode and transfer GeoJSON using geobuf. In the last 24 hours a couple of new releases of pbf have been made (1.3.6 and 2.0.0). When upgrading to either of these, geobuf seems to generate (or just parse) malformed GeoJSON.

This is when I encode with 1.3.6 or 2.0.0 on the server side and use a matching pbf version on the client end. If I downgrade to 1.3.5, everything looks fine again.

Not sure if this is a pbf bug or just an interplay problem between pbf and geobuf.

A revised breaking protobuf schema for Geobuf

The more ideas I find to improve Geobuf sizes, the more I realize that we would need to completely rewrite the schema to support the improvements, breaking compatibility. So I'm opening this ticket to start a discussion about what a perfect Geobuf schema would look like (not to say this is a priority, but it's still a good thing to discuss).

I wrote a prototype schema with all the improvements I could think of here: https://gist.github.com/mourner/3c6ddca04c9772593302

The main difference is utilizing the power and flexibility of the new oneof statement to create a better and more compact schema.

Features:

  • the data itself contains information whether it's Feature, FeatureCollection, Geometry or GeometryCollection, so you don't need to guess this when decoding
  • keys and values for properties are stored separately in the top-level Data object, and features only store indexes to them (like vector-tile-spec does); keys and possibly values are reduced to unique values #26 — for much better properties packing
  • geometry coordinates are a oneof set of different fields (depending on type), which solves the ambiguity with null values since a oneof field can be empty (without a default value), and also makes it easier to understand and work with
  • feature has a oneof of either geometry or geometry collection instead of repeated geometries
  • coordinates are stored as delta-encoded sint32 to take full advantage of varint and zig-zag encoding #24 — for much more compact geometries
  • Value message is also a oneof of different value types
  • the property value int type is sint32 instead of int64 — it's more compact and JS doesn't actually handle int64 values; in addition, uint type becomes uint32
  • the data contains optional flag indicating whether it contains altitude (third coordinate), since this is a global-level setting
  • the data also contains optional precision information (6 by default) #25

@tmcw @springmeyer what do you think?

Delta encoding

geobuf uses delta encoding by virtue of using protobuf

Turns out it actually doesn't, which leaves great room for geometry compression. Going to look into this and open a PR.
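
For context, a rough sketch of what delta encoding of integer-scaled coordinates looks like: only the differences between consecutive values are stored, and since those differences stay small for dense geometries they pack into short zig-zag varints.

function deltaEncode(values) {
    var deltas = [];
    var prev = 0;
    for (var i = 0; i < values.length; i++) {
        deltas.push(values[i] - prev); // small numbers for nearby coordinates
        prev = values[i];
    }
    return deltas;
}

deltaEncode([12345678, 12345680, 12345675]); // [12345678, 2, -5]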

Switch to sint32 for coords

Since we use 1e6 encoding, it might make sense to switch coords from int64 to int32 for better geometry compression.

The int32 range is -2,147,483,648 through 2,147,483,647; since 180 × 1e6 = 180,000,000, that is plenty of headroom for the usual -180..180 range plus a handful of repeating worlds (roughly ±2147 degrees). We probably should not care about other CRSs because CRS support is going to be dropped from the GeoJSON spec.

Geobuf Format: Too many options

Looking at the proto file, if I interpret it correctly, I think there are too many options for how the file format can actually look. The data_type seems to suggest the outermost structure can be either a FeatureCollection, a Feature, a Geometry or a Topology. Is that necessary? Can't it always be a FeatureCollection, possibly with just one feature in it, which in turn contains one Geometry? I am concerned that different implementors will implement slightly different subsets of the format, making implementations incompatible.

What's the difference between Value, properties and custom_properties? Repeating those fields in Feature and Geometry doesn't look good to me. Do we really need that? In my understanding a feature is a geometry plus some attributes. If we can put attributes on the geometry, why is there a distinction between feature and geometry?

Why does Topology have no properties, just Value and custom_properties? (Unlike Feature and Geometry, which have all three.)

Fancy encoding for properties

Since @mourner is checking this library out, might as well:

Could we encode properties in a more efficient way? Right now we're re-encoding the property name and value for every single feature. We could save a lot of space if (a) we encode names once and then use an array, or (b) we do some magic around structs.
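
A rough sketch of option (a), with illustrative names only (this is not the eventual schema): collect each distinct key once and have every feature reference it by index.

function buildKeys(featureCollection) {
    var keys = [];
    var lookup = {};
    featureCollection.features.forEach(function (feature) {
        for (var key in feature.properties) {
            if (!(key in lookup)) {
                lookup[key] = keys.length;
                keys.push(key);
            }
        }
    });
    // each feature would then store pairs of [keyIndex, valueIndex]
    // instead of repeating the key string itself
    return {keys: keys, lookup: lookup};
}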

License

Please add a license file.

I submitted a PR without first checking against this, but assume that you'll use the same license as other mapbox repos.

column-major order

For overnoded shapefiles, we might be able to gain a big advantage by ordering coordinates by dimension rather than in tuples. But there's a parsing and generation overhead, and random access is harder.

Specification

As with vector-tile-spec, we should have a SPEC.md specification for this.

Compatibility with "classic GIS file formats"

I think we need to think about compatibility with "classic GIS file formats" like Shapefiles. What I mean by that is all those formats that only support one type of geometry per layer and maybe even only one layer per file. It should be well defined how those files map to Geobuf files. I am not saying we should limit ourselves to what those files support. But I think it would be useful to define a subset of the Geobuf format that is guaranteed to map well to those formats. Maybe even define some flag that can be set promising that the file contents behave in that way.

Large integer attribute values are altered

For example, using the Census TIGER states dataset, the attributes ALAND and AWATER are large integers (e.g. 62266581604).

Here they are getting encoded as float types and thus get altered (e.g., 62266580992).

For testing I was doing this: shp -> geobuf -> geojson and comparing the property values.

I tried adding the ability to detect and set integer types around line 149 but that produced odd property values in geojson (e.g., "ALAND":{"low":2137039460,"high":14,"unsigned":true}). Presumably this is because Long is not used (but should be)?

Simply setting the values as double type preserves proper values, but produces a bigger geobuf file (as expected).

Presumably the proper fix would be to detect the proper numeric type and encode using that, e.g.,

        switch (typeof v) {
            case 'number':
                // (v | 0) === v is only true for integers that fit in 32 bits,
                // so very large integers like ALAND would still need a wider type
                if ((v | 0) === v) {
                    val.set(v > 0 ? 'uint_value' : 'int_value', v);
                } else {
                    val.set('float_value', v);
                }
                break;
            case 'boolean':
                val.set('bool_value', v);
                break;
            case 'string':
                val.set('string_value', v.toString());
                break;
        }

Not sure how to detect which to use (float vs. double), but in this case we wanted uint anyway.
