osm-p2p-db

p2p database for open street map data

Install

npm install osm-p2p-db

Usage

// db.js
var hyperlog = require('hyperlog')

// two leveldbs: one backs the append-only log, one holds index data
var level = require('level')
var db = {
  log: level('/tmp/osm-p2p/log'),
  index: level('/tmp/osm-p2p/index')
}

// chunk store backing the spatial index
var fdstore = require('fd-chunk-store')
var storefile = '/tmp/osm-p2p/kdb'

var osmdb = require('osm-p2p-db')
var osm = osmdb({
  log: hyperlog(db.log, { valueEncoding: 'json' }),
  db: db.index,
  store: fdstore(4096, storefile)
})

if (process.argv[2] === 'create') {
  var value = JSON.parse(process.argv[3])
  osm.create(value, function (err, key, node) {
    if (err) console.error(err)
    else console.log(key)
  })
} else if (process.argv[2] === 'query') {
  var q = process.argv.slice(3).map(csplit)
  osm.query(q, function (err, pts) {
    if (err) console.error(err)
    else pts.forEach(function (pt) {
      console.log(pt)
    })
  })
}

function csplit (x) { return x.split(',').map(Number) }

Save the program above as db.js. Now we can create a few nodes and search with a bounding box query:

$ mkdir /tmp/osm-p2p
$ node db.js create '{"type":"node","lat":64.6,"lon":-147.8}'
11892499690884077339
$ node db.js create '{"type":"node","lat":64.3,"lon":-148.2}'
1982521011513780909
$ node db.js create '{"type":"node","lat":64.5,"lon":-147.3}'
14062704270722785878
$ node db.js query 64.1,64.6 -148,-147
{ type: 'node',
  lat: 64.5,
  lon: -147.3,
  id: '14062704270722785878',
  version: 'e635d07b9fc0a9d048cdd5d9e97a44a19ba3a0b2a51830d1e3e0fadcb80935fc' }

We can make a way document that refers to a list of node documents:

$ node db.js create '{"type":"way","refs":
["11892499690884077339","1982521011513780909","14062704270722785878"]}'
14666931246975765366

When we query, any ways that have one or more nodes within the bounding box will turn up in the results:

$ node db.js query 64.1,64.6 -148,-147
{ type: 'node',
  lat: 64.5,
  lon: -147.3,
  id: '14062704270722785878',
  version: 'e635d07b9fc0a9d048cdd5d9e97a44a19ba3a0b2a51830d1e3e0fadcb80935fc' }
{ type: 'way',
  refs: [ '11892499690884077339', '1982521011513780909', '14062704270722785878' ],
  id: '14666931246975765366',
  version: 'f4fc0045e298ca4f9373fab78dee4f0561b4056dcd7975eb92f21d0a05e0eede' }

Terminology

  • document: a map element, such as a node or way.
  • document id, osm id: an identifier of a document at its latest known version.
  • version, version id: an identifier of a document at a specific point in time.

API

var osmdb = require('osm-p2p-db')

var osm = osmdb(opts)

Create a new osm instance with:

  • opts.log - a hyperlog to store the OSM document data, created with valueEncoding: 'json'
  • opts.db - a levelup instance to store index data
  • opts.store - a chunk store instance (such as fd-chunk-store) to hold the spatial index

You may optionally pass in a hyperkv instance as opts.kv; otherwise one will be created from opts.log and opts.db.

You may safely delete the index database whenever you like. The index data is automatically regenerated. This is very useful if there are breaking changes to the index code or if the data becomes corrupted. The hyperlog contains the source of truth.

osm.create(doc, opts={}, cb)

Store a new document from doc. cb(err, id, node) fires with the generated OSM id and the node from the underlying hyperlog.

Elements are node, way, and relation. Each element should have a type property that contains the element type as a string.

  • Nodes should have doc.lat and doc.lon coordinates.
  • Ways should have an array of OSM ids as doc.refs.
  • Relations should have an array of member objects as doc.members. Each member object should have a member.type naming the type of the document pointed at by member.ref, and optionally a member.role.

Another type of document is a changeset. Each element should have a changeset property that refers to the id of a changeset document.

It is recommended to use tags.comment to store free-form text describing the changeset.
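
For example, a sketch of creating one of each element type, assuming the osm instance from the Usage section above (the 'outer' role is purely illustrative):

osm.create({ type: 'node', lat: 64.5, lon: -147.3 }, function (err, id, node) {
  if (err) return console.error(err)
  // id is the generated OSM id; node.key is the version hash
  osm.create({ type: 'way', refs: [id] }, function (err, wayId) {
    if (err) return console.error(err)
    osm.create({
      type: 'relation',
      members: [{ type: 'way', ref: wayId, role: 'outer' }]
    }, function (err, relationId) {
      if (err) console.error(err)
    })
  })
})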

osm.put(id, doc, opts={}, cb)

Replace a document at id with doc.

If the document didn't exist previously, it will be created.

The options opts are passed to the underlying hyperkv instance.

By default, hyperkv will merge the most recent known forks into a single fork. To add modifications to a fork without merging the changes into other forks, set opts.links to an array of only the single key you want to update.
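
For example, a sketch of updating a single fork, assuming versionA holds the hyperlog key of the fork to extend:

// update one fork in place without merging the others
osm.put(id, { type: 'node', lat: 64.51, lon: -147.31 }, { links: [versionA] },
  function (err, node) {
    if (err) console.error(err)
  })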

osm.del(id, opts={}, cb)

Delete a document at id.

The options opts are passed to the underlying hyperkv instance.

cb(err, node) fires with the underlying node in the hyperlog.
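
For example:

osm.del(id, function (err, node) {
  if (err) console.error(err)
  // node.key is the hyperlog hash of the deletion record
  else console.log('deleted', id, 'as version', node.key)
})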

osm.batch(rows, opts={}, cb)

Atomically apply an array of put and del operations given by rows.

Each row in rows should have:

  • row.type - 'put' or 'del'
  • row.key or row.id - the id of the document (generated if not specified)
  • row.links - array of links to ancestor keys
  • row.value - for puts, the value to store
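
A minimal sketch of an atomic batch, reusing a node id from the Usage transcript above:

// atomically put a new node and delete an existing document
osm.batch([
  { type: 'put', value: { type: 'node', lat: 64.2, lon: -147.9 } },
  { type: 'del', id: '1982521011513780909' }
], function (err, nodes) {
  if (err) console.error(err)
})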

osm.get(id, opts={}, cb)

Get a document as cb(err, docs) by its OSM id.

docs is an object mapping hyperlog hashes to current document values. If a document has been deleted, it will only have the properties { id: <osm-id>, version: <osm-version>, deleted: true}.
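
For example:

osm.get('14062704270722785878', function (err, docs) {
  if (err) return console.error(err)
  // docs maps each fork's version hash to its current document value
  Object.keys(docs).forEach(function (version) {
    console.log(version, docs[version])
  })
})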

osm.getByVersion(version, cb)

Fetch a specific OSM element by its version string. cb(err, doc) fires with null if the version is not found, otherwise the single element.

osm.query(q, opts, cb)

Query for all nodes, ways, and relations in the query given by the array q. Queries are arrays of [[minLat,maxLat],[minLon,maxLon]] specifying a bounding box.

cb(err, res) fires with an array of results res. Each result is the document augmented with an id property and a version property that is the hash key from the underlying hyperlog. If a document has been deleted, it will only have the properties { id: <osm-id>, version: <osm-version>, deleted: true}.

Optionally:

  • opts.order - set to 'type' to order by type: node, way, relation
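
For example, to receive nodes before the ways and relations that refer to them:

osm.query([[64.1, 64.6], [-148, -147]], { order: 'type' }, function (err, res) {
  if (err) return console.error(err)
  res.forEach(function (doc) {
    console.log(doc.type, doc.id)
  })
})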

osm.getReferrers(id, cb)

Fetch a list of all OSM ways and relations that refer to the element with ID id. For a node, this can be ways or relations. For a way or relation, this can only be relations.

Objects of the following form are returned:

{
  id: '...',
  version: '...'
}
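
For example, a sketch that loads the full document for each referrer:

osm.getReferrers('11892499690884077339', function (err, refs) {
  if (err) return console.error(err)
  refs.forEach(function (ref) {
    // look up the referring way/relation at the exact version returned
    osm.getByVersion(ref.version, function (err, doc) {
      if (!err && doc) console.log(doc.type, ref.id)
    })
  })
})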

osm.ready(cb)

Runs the callback cb once all of osm's internal indexes are caught up to the latest data. cb is called exactly once.
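
For example:

osm.ready(function () {
  // indexes are caught up, so queries reflect all data in the log
  osm.query([[64.1, 64.6], [-148, -147]], function (err, res) {
    if (!err) console.log(res.length + ' results')
  })
})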

osm.close(cb)

Closes the Level and chunk-store backends associated with the database. cb is called upon completion.

var rstream = osm.queryStream(q, opts)

Return a readable object stream rstream of query results contained in the query q. Queries are arrays of [[minLat,maxLat],[minLon,maxLon]] specifying a bounding box.

Each object in the stream is a document augmented with an id property and a version property that is the hash key from the underlying hyperlog.

Optionally:

  • opts.order - set to 'type' to order by type: node, way, relation
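
For example:

osm.queryStream([[64.1, 64.6], [-148, -147]], { order: 'type' })
  .on('data', function (doc) { console.log(doc.type, doc.id, doc.version) })
  .on('error', console.error)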

var rstream = osm.getChanges(changeset, [cb])

Get the list of document version ids in a changeset by a changeset id changeset.

If a callback is provided, the version ids are returned as cb(err, versions). Without callback, the versions are provided by the returned readable object stream rstream.
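
For example, a sketch of both forms, assuming changesetId holds a changeset's OSM id:

// callback form
osm.getChanges(changesetId, function (err, versions) {
  if (!err) versions.forEach(function (v) { console.log(v) })
})

// stream form
osm.getChanges(changesetId).on('data', function (v) { console.log(v) })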

osm.on('error', function (err) {})

Handle errors from the underlying indexes with the 'error' event.

osm.kv

You can get at the hyperkv instance directly to perform more operations using osm.kv.

For example, you can use osm.kv.createReadStream() to list all the id/value pairs in the database.
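
A sketch of listing every document this way; the exact row shape comes from hyperkv, so treat the row.values access as an assumption:

osm.kv.createReadStream()
  .on('data', function (row) {
    // row.key is the OSM id; row.values maps version hashes to documents
    console.log(row.key, Object.keys(row.values))
  })
  .on('error', console.error)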

osm.log

The hyperlog instance passed in as opts.log is available as osm.log if you need to get at it directly later.

Browser

To use this module in the browser, use level-browserify to provide the opts.db instance and idb-chunk-store as the opts.store. Each of these is backed by IndexedDB, a native browser storage interface.
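
A minimal sketch of a browser setup, bundled with browserify (the database names are arbitrary):

var level = require('level-browserify')
var idbstore = require('idb-chunk-store')
var hyperlog = require('hyperlog')
var osmdb = require('osm-p2p-db')

var osm = osmdb({
  log: hyperlog(level('osm-p2p/log'), { valueEncoding: 'json' }),
  db: level('osm-p2p/index'),
  store: idbstore(4096)
})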

Replication

If you have two hyperlogs log0 and log1, pipe them together and back again to replicate:

var r0 = log0.replicate()
var r1 = log1.replicate()
r0.pipe(r1).pipe(r0)

Insert additional streams as necessary for network transports if the logs live in different processes or machines.
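
For example, a sketch of replication over TCP; the port and host are arbitrary:

var net = require('net')

// on the listening peer
net.createServer(function (socket) {
  var r = osm.log.replicate()
  socket.pipe(r).pipe(socket)
}).listen(5000)

// on the connecting peer
var socket = net.connect(5000, 'other-peer.local')
var r = osm.log.replicate()
socket.pipe(r).pipe(socket)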

If both logs have made edits to the same IDs, multiple records will appear for the same ID in the results. To merge these "conflicts" back into a single value, use osm.put(id, doc, cb) to store the desired document value.

Architecture

Read about the internal architecture.

Contribute

If you would like to support our work, or if you have ideas about how to use and adapt osm-p2p for your own project, then please dive in. Open an issue with a bug report or feature request, or send us a pull request with a bug-fix or new feature.

We need help right now adding tests and fixing edge-cases with osm-p2p-server and increasing compatibility with other OSM tools such as JOSM.

License

BSD (c) 2016, Digital Democracy.

osm-p2p-db's People

Contributors

gmaclennan, hackergrrl, lpww, noffle


osm-p2p-db's Issues

Store presets in db

iD Editor uses JSON presets to define a UI for categories of features to map: which symbol to use, which fields to show, and how these map to tag key-value pairs.

We currently maintain these as a separate repo, build them into a single JSON file that iD can use with id-presets-builder, and import that into Mapeo by bundling the presets and the translations file with mapeo-settings-builder and then manually importing this file: https://github.com/digidem/mapeo-desktop/blob/master/browser/main.js#L70-L84 and https://github.com/digidem/mapeo-desktop/blob/master/lib/user-config.js#L49-L62

We should store these presets in the DB and sync them. In the future users should be able to update presets themselves, but I think we need an admin override. Dealing with forks of presets starts to get really complicated for the user to understand, so I think we should have a fork/conflict resolution method for this.

v4 ideas - simplifying the API

I think we can simplify some of the API and move higher-level API calls into a separate module, something like osm-p2p-api. This is where all the checks can take place to maintain data integrity (e.g. not deleting nodes that are part of a way). Much of this work is already in digidem/osm-p2p-server/api.

  • Remove osm.create() - functionality is in osm.put().
  • Remove osm.getChanges() - this is not particularly useful, and the higher-level functionality is in the api from osm-p2p-server.
  • Merge osm.query() and osm.queryStream() to a single method that returns a stream when no callback is passed.
  • Change the property refs on ways to nodes to match OSM JSON format.

Add .close() method

We have a very specific use case for the osm-p2p-db. The main process that runs the server creates a database. Subsequently, when we try to connect to the database from a different node process, we get a 'Resource temporarily unavailable' error because of LevelDB's LOCK file.

Looking through the documentation I wasn't able to find any way to close the db connection. What would be the best way to go about this?

Cc @olafveerman @gmaclennan

Spatial and date indexes on Changesets

We need to implement GET /api/0.6/changesets in order to create an interface for reviewing recent changes.

This would require both a spatial index on changesets and a date index. Is there a way to cheaply get an ordered list of changesets? e.g. if we can't rely on clocks being set correctly can we just pull the most recent changesets off the db?

relations

The current implementation treats relations like ways. Relations may entail more lookups.

published version of osm-p2p-db contains benchmark outputs

While investigating the size of mapeo-desktop, I found that the osm-p2p-db/benchmark/copy folder is 59 MB.

How do we prevent this? Will simply adding a .gitignore for benchmark/copy be enough, or will we need to make sure it is removed before publishing?

How to index deleted points?

Consider the following scenario:

For a point A, one user changes a tag, and another user moves the point, creating a fork. The point is then deleted:

        /--- A1 <---\
A0 <----             ---- A3 (deleted)
        \--- A2 <---/

If we are going to return deleted points, what should be returned from a bbox query?

My thoughts: A3 should be returned if either A1 or A2 is in the bbox. The client should be responsible for choosing what to do next (e.g. read the two linked points A1 and A2 and display them on the map). This would mean there should be two references in the kdb index, both pointing to A3. We would probably need to dedupe the results of a bbox query that included both A1 and A2.

How to handle deletions of documents referenced by others in a distributed system?

The OSM API does a check when you attempt to delete a node/way/relation to ensure that it is not referenced by any other documents, and if it is referenced it returns a 412 error: http://wiki.openstreetmap.org/wiki/API_v0.6#Delete:_DELETE_.2Fapi.2F0.6.2F.5Bnode.7Cway.7Crelation.5D.2F.23id

The changeset upload API has an option if-unused that silently ignores deletions of elements used by others: http://wiki.openstreetmap.org/wiki/API_v0.6#Diff_upload:_POST_.2Fapi.2F0.6.2Fchangeset.2F.23id.2Fupload

In a p2p distributed system a document could be unused on the current peer, but in-use on another peer.

E.g. Clients α and β have a way A with nodes B,C,D. Client α deletes the way, and deletes the nodes B,C,D since they are not used by any document in the db on α. Client β modifies a tag on the way, but does not modify any of the nodes B,C,D. After replication way A is forked, but nodes B,C,D have a single head which is a deletion. The undeleted fork of A now points to deleted nodes.

Modifying a fork without merging conflict

From the readme 'To merge these "conflicts" back into a single value, use osm.put(id, doc, cb) to store the desired document value.'

How do you update a doc that has a fork without merging? I can think of several scenarios where a user would need to update a forked doc (i.e. update one of the forks) and postpone dealing with merging / resolving conflicts.

Search index

We need a way of searching for existing features by name (i.e. value of the tag name:), preset name, or potentially the value of any tag.

A place to start might be search-index and hook it up with the hyperlog the way hyperlog-index works to add / modify / delete text index records as we go.

listing conflicts (forks)

Can we add an example for listing forked records? Should we add a method osm.conflicts() for this?

Close gracefully

We need a method to close all the open stores gracefully, both level and chunk stores. There are many of them.

I'm unclear right now whether Electron waits for any pending leveldb writes when it quits; we should make sure to close osm-p2p-db in the quit handler.

We need this function for cleanup of test dbs when running in the browser.

opts.size === chunkLength?

Should the opts.size argument be equal to chunkLength, which is passed when creating the chunk-store? Maybe rename it to opts.chunkLength? Can we read this from the chunkStore rather than requiring it? What would happen if the two values differed?

Performance/optimization opportunities

We're probably too early to start thinking about optimizations, but these ideas/questions were on my mind so leaving this here for discussion.

  1. As far as I understand the code, for the indexes we are creating look-ups by id to the original value. What performance might be gained from creating complete 'materialized views' with a copy of the original data, at the expense of increased disk space, and, perhaps, longer write times? i.e. looking up a value from the index would only require a single read, rather than two, one for the id(s), and then for the original record.
  2. For the spatial index, what might we get by creating a quadtile index as opposed to a kdb tree? My idea would be to create a quad-tile index at a single zoom level (e.g. z16) and respond to bounding box queries by approximating the bbox to the overlapping zoom 16 tiles. If we stored this as a materialized view as opposed to an index, bbox queries could be very fast, since it could be a single batch read from leveldb bounded by the min-max of the tiles we are looking for. Would it matter if we responded to bounding box queries with a close approximation (e.g. rounded to z16 quadtiles) as opposed to the exact bbox boundaries? iD Editor for example only makes bbox requests to the OSM API for bounding boxes equivalent to z16 tiles.

new types

This feature would support adding custom new types on top of the existing entities. For example, an observation type might describe a fact at a particular time and place by linking to a node, but observations would be immutable whereas nodes and other entities can mutate.

e.g. when mapping a coffee shop that does not currently exist in the db, the first step would be to make an observation with a GPS/mobile app: "I was here and there is a coffee shop here", and on the basis of that we can create a coffee shop either automatically or manually (this is actually the current OSM workflow using GPX files from a GPS). On a subsequent visit to the coffee shop you might observe that the name has changed, or that it has closed down, and you would make an additional observation linked to the coffee shop node or way indicating its current status.

Aside from reading refs and members from ways and relations, the database doesn't care what type a document is, so observation types could be added right now. Observations would likely need their own index over the hyperlog, which is also possible to do externally before the log is passed to osm-p2p-db.

I think this use case warrants an example. If the example is messy or difficult, I'll make a separate module or plugin mechanism to make custom types easier to deal with.

media replication

this should either be a separate module or a plugin, using hyperdrive

changesets

Needed upstream in osm-p2p-server: digidem/osm-p2p-server#5

Reposting here. This is currently in-progress using hyperlog-join to handle the changeset foreign key relations. I'm also refactoring the refs/members relation code to use hyperlog-join.

Add ready() method

We should have a ready() method that executes a function when indexes are fully caught up, essentially exposing the same method that is on the hyperlog-index.

Relations

Relation members in OSM have type and role properties. How do we store that data in osm-p2p-db?

database isolation

We need to have:

  • a way to set a database ID
  • a way to refuse replication with a peer whose ID doesn't match the current ID
  • a way to set the database ID after replication if an ID isn't set
  • a way to force replication when the IDs don't match

Defaults - hyperlog

Are there potential alternatives for hyperlog, or will this always be used? I am thinking we should make hyperlog a hard dependency and just pass the leveldb? If so we could just pass a single leveldb in the options and use sublevel? Are there performance gains from having two separate leveldbs?

Do we need opts.size or can this be read from the fd-store?

I'm thinking how we can reduce the complexity of the options.

Quantization of lon/lat coordinates

We should quantize the coordinates to a fixed precision to avoid database bloat and potential floating point rounding errors at the spatial index. Decimal degrees precision:

decimal places   degrees       @ 0° lat     @ 30° lat    @ 60° lat
2                0.01°         ~ 1 km       ~ 0.9 km     ~ 0.5 km
3                0.001°        ~ 100 m      ~ 90 m       ~ 50 m
4                0.0001°       ~ 10 m       ~ 9 m        ~ 5 m
5                0.00001°      ~ 1 m        ~ 0.9 m      ~ 0.5 m
6                0.000001°     ~ 0.1 m      ~ 0.09 m     ~ 0.05 m
7                0.0000001°    ~ 0.01 m     ~ 0.009 m    ~ 0.005 m
8                0.00000001°   ~ 0.001 m    ~ 0.0009 m   ~ 0.0005 m

8 decimal places would get us 1mm precision; I think that's more than enough. The GeoJSON spec recommends 6 decimal places for <10cm precision. There may be use-cases for greater precision, e.g. drawing precise building footprints. Perhaps the precision should be configurable at a DB level? Would it need to be kept consistent across the whole db?
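
For illustration, a sketch of quantizing before a put; the helper is hypothetical, not part of the API:

// hypothetical helper: round coordinates to 8 decimal places (~1mm)
function quantize (deg) {
  return Math.round(deg * 1e8) / 1e8
}

osm.put(id, {
  type: 'node',
  lat: quantize(64.512345678912),  // -> 64.51234568
  lon: quantize(-147.312345678912) // -> -147.31234568
}, function (err) {
  if (err) console.error(err)
})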

user/uid/device id in the data model

We will need:

  • an index to associate uids with user names so that users can change their names with a continuity of editing history
  • a device id based on a keypair to sign the log messages
  • an index to show the list of devices with the time of the last edit to make sure datasets are up to date

Store version as well as id on way.nodes and relation.members

Moving from #29 (comment) since this should be its own issue:

The issue of ways and relations referring to nodes/members only by OsmID is a big problem, not just for osm-p2p but for anybody dealing with historic OSM data. I know the Mapbox data team has hit this in their need to review data changes. I think the workaround they use is to reconstruct from timestamps which version of nodes/members was referenced by a particular way/relation. This is obviously fragile and costly, especially in a p2p system.

I think we can add versions to the way/relations in a way that remains compatible with existing clients:

Internally we should store both the id and the version of nodes within a way and of members within a relation.
We should prefer version when doing a lookup in the index, but fall back to id (see also #48).

When a non-p2p-aware client submits a change, use the version of the nodes/members from the changeset if present; if not, set the version by selecting the most recent fork, using the same algorithm that would have been used to present the most recent fork to the client.

Some issues we would hit:

  1. iD editor does not include any unchanged nodes in a changeset if you change the tags on a way. Likewise, if you move a node in a way, the way itself is not included in the changeset.
  2. The version number of a way would need to change every time a node was changed. iD might need a patch to ensure it pulls down the updated way after only a node was updated.

For relations, this would make it hard to avoid forks: if we store version numbers on relations, any update to a member would need to update the version of the relation. This would mean we would not be able to use relations for long rivers to avoid forks - any edit to any segment of the river would create a new version of the relation.

edge cases with forks

Here are some edge cases regarding ways and forks of them and their nodes that we ought to consider & test for, pulled from @gmaclennan in IRC:

If a point is forked and one fork is inside the query and the other outside, does it just return the one inside? It probably should and probably does.

Here's a reason for some ghost points: when there is a forked way, I believe the query lookup will return the points associated with both forks of the way. But the points aren't forked, so we return them all, yet we choose only one of the forks of the way to return.

Also: if a way is forked and one fork is a delete, does the indexer correctly delete the reverse lookup? We should ensure we have a test for that.

Switch query to use bbox [west, south, east, north]

Currently osm.query(q) expects q to be a bounding box of the format [[minLat,maxLat],[minLon,maxLon]]. This is not a format commonly used elsewhere and it is not documented how this is interpreted at the poles and across the meridian.

The new GeoJSON standard RFC 7946 has a clearly defined bbox in the format [west, south, east, north], with clear explanations of how this is defined at the poles and at the meridian. I think it makes more sense to use this bbox format which is more commonly used, and it matches the format expected in the OSM REST API 0.6.

This does not need to be a breaking change, as we can continue to support both formats as needed with a simple test Array.isArray(q[0]).
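
A sketch of that compatibility check (hypothetical, not implemented):

// hypothetical: normalize either bbox format to the current internal one
function normalizeQuery (q) {
  if (Array.isArray(q[0])) return q // legacy [[minLat,maxLat],[minLon,maxLon]]
  var west = q[0], south = q[1], east = q[2], north = q[3]
  return [[south, north], [west, east]]
}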

batch changesets

Right now items in changesets are written individually, so if the process crashes mid-way through an upload, there could be nodes without their way. It makes sense to expose batches through the existing changeset system in OSM.

Investigate indexing performance

I'd really like to see faster indexing. Multiple people have brought up how slow it can be. Reminder to myself to do some profiling on the indexing pipeline and find the bottlenecks.

bulk import/export

It should be possible to import an OSM extract for offline editing and to export a changeset or sequence of changesets back to the mainline OSM.

deterministic keys for osm data when importing

Debugging the issue we've had with exporting observation data, I found that the primary issue was that the osm-p2p key for the same OSM node would be different on the desktop and mobile apps.

This issue came up because we are doing an import from xml in two separate apps, and create is assigning random values for the key of the same data.

In the create function, instead of:

var key = hex2dec(randomBytes(8).toString('hex'))

We could do something like

var key = hex2dec(Buffer(value.id).toString('hex'))

Where value.id is the id from OSM, so that no matter who creates the data the key is always the same.

There are situations where no id is present, so we could do something like this:

if (value.id) {
  key = hex2dec(Buffer(value.id).toString('hex'))
} else {
  key = hex2dec(randomBytes(8).toString('hex'))
}

An alternate solution could be that when importing we use put and provide the key. But it seems like having create handle existing OSM data by using its id would be reasonable.

I'd also be tempted to just use the OSM id in its original form when importing instead of a new key or a hash of the id. I'm not sure if there's a reason not to do that.

deforking abstraction

I think osm-p2p-defork may be a level of abstraction too far. There are two ways of getting data out of osm-p2p-db that would need deforking: .queryStream() and .get(). I think it makes more sense to build in deforking to these methods, essentially merging the work done on osm-p2p-server into osm-p2p-db. Deforking is very tied to internal implementation details of osm-p2p-db, and I think having it as an external module could make it more difficult to maintain and for others to understand.
