osm-p2p-db's Issues

Store version as well as id on way.nodes and relation.members

Moving from #29 (comment) since this should be its own issue:

The issue of ways and relations referring to nodes/members only by OSM id is a big problem, not just for osm-p2p but for anybody dealing with historic OSM data. I know the Mapbox data team has hit this in their need to review data changes. The workaround they use, I think, is to use timestamps to reconstruct which versions of nodes/members were referenced by a particular way/relation. This is obviously fragile and costly, especially in a p2p system.

I think we can add versions to the way/relations in a way that remains compatible with existing clients:

Internally we should store both id and version of nodes within a way and members within a relation.
We should prefer version when doing a lookup in the index, but fall back to id. -- see also #48
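
For illustration, the stored way could keep the existing refs array of ids and carry the versions in a parallel field, so existing clients still see plain ids. A rough sketch (the refVersions field name is hypothetical):

{
  type: 'way',
  tags: { highway: 'path' },
  refs: [ '1849347953', '1849347954' ],
  refVersions: [ 'd7a72dff', '0f1b2c3d' ] // hypothetical: version hash of each ref, in the same order as refs
}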

When a non-p2p-aware client submits a change, use the versions of the nodes / members if they are present in the changeset; if not, set the version by selecting the most recent fork, using the same algorithm that would have been used to present that fork to the client.
Some issues we would hit:

  1. iD editor does not include any unchanged nodes in a changeset if you change the tags on a way.
    If you move a node in a way, the way itself is not included in the changeset.
  2. The version number of a way would need to change every time a node was changed. iD might need a patch to ensure it pulls down the updated way after only a node was updated.

For relations, this would make it hard to avoid forks: if we store version numbers on relations, any update to a member would need to update the version of the relation. This would mean we would not be able to use relations for long rivers to avoid forks - any edit to any segment of the river would create a new version of the relation.

Switch query to use bbox [west, south, east, north]

Currently osm.query(q) expects q to be a bounding box in the format [[minLat,maxLat],[minLon,maxLon]]. This is not a format commonly used elsewhere, and it is not documented how it is interpreted at the poles and across the antimeridian.

The new GeoJSON standard RFC 7946 has a clearly defined bbox in the format [west, south, east, north], with clear explanations of how it is defined at the poles and at the antimeridian. I think it makes more sense to use this bbox format, which is more commonly used and matches the format expected by the OSM REST API 0.6.

This does not need to be a breaking change, as we can continue to support both formats as needed with a simple test Array.isArray(q[0]).
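
A minimal sketch of accepting both formats (the function name is hypothetical), normalizing to the current internal form:

function normalizeBbox (q) {
  if (Array.isArray(q[0])) return q // legacy [[minLat,maxLat],[minLon,maxLon]]
  // RFC 7946 [west, south, east, north]
  var west = q[0], south = q[1], east = q[2], north = q[3]
  return [[south, north], [west, east]]
}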

Search index

We need a way of searching for existing features by name (i.e. the value of the name tag), preset name, or potentially the value of any tag.

A place to start might be search-index, hooked up to the hyperlog the way hyperlog-index works, adding / modifying / deleting text index records as we go.
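
A very rough sketch of that wiring, with the caveats noted in the comments:

var indexer = require('hyperlog-index')

// log: the osm-p2p hyperlog; indexDb, nameDb: leveldb instances (setup omitted).
// Guessing at row.value.v as the wrapped document shape; a real version might
// feed these records to search-index instead of a plain leveldb.
var dex = indexer({
  log: log,
  db: indexDb,
  map: function (row, next) {
    var doc = row.value && row.value.v
    if (!doc || !doc.tags || !doc.tags.name) return next()
    // lowercased name -> version key, so name lookups become a prefix scan
    nameDb.put(doc.tags.name.toLowerCase() + '!' + row.key, row.key, next)
  }
})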

Relations

Relation members in OSM have type and role properties. How do we store that data in osm-p2p-db?
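
For reference, OSM's JSON output represents a member with all three pieces of information in one object, which is one candidate shape for storage (the values are placeholders):

{ type: 'node', ref: '1849347953', role: 'stop' }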

How to index deleted points?

Consider the following scenario:

For a point A, one user changes a tag, and another user moves the point, creating a fork. The point is then deleted:

        /--- A1 <---\
A0 <----             ---- A3 (deleted)
        \--- A2 <---/

If we are going to return deleted points, what should be returned from a bbox query?

My thoughts: A3 should be returned if either A1 or A2 is in the bbox. The client should be responsible for choosing what to do next (e.g. read the two linked points A1 and A2 and display them on the map). This would mean there should be two references in the kdb index, both pointing to A3, and we would probably need to dedupe the results of a bbox query that included both A1 and A2.
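
A minimal dedupe sketch, assuming each result carries a version property (as I believe query results already do):

function dedupe (results) {
  var seen = {}
  return results.filter(function (doc) {
    if (seen[doc.version]) return false
    seen[doc.version] = true
    return true
  })
}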

How to handle deletions of documents referenced by others in a distributed system?

The OSM API does a check when you attempt to delete a node/way/relation to ensure that it is not referenced by any other documents, and if it is referenced it returns a 412 error: http://wiki.openstreetmap.org/wiki/API_v0.6#Delete:_DELETE_.2Fapi.2F0.6.2F.5Bnode.7Cway.7Crelation.5D.2F.23id

The changeset upload API has an option if-unused that silently ignores deletions of elements used by others: http://wiki.openstreetmap.org/wiki/API_v0.6#Diff_upload:_POST_.2Fapi.2F0.6.2Fchangeset.2F.23id.2Fupload

In a p2p distributed system a document could be unused on the current peer, but in-use on another peer.

E.g. Clients α and β have a way A with nodes B,C,D. Client α deletes the way, and deletes the nodes B,C,D since they are not used by any document in the db on α. Client β modifies a tag on the way, but does not modify any of the nodes B,C,D. After replication way A is forked, but nodes B,C,D have a single head which is a deletion. The undeleted fork of A now points to deleted nodes.

Spatial and date indexes on Changesets

We need to implement GET /api/0.6/changesets in order to create an interface for reviewing recent changes.

This would require both a spatial index on changesets and a date index. Is there a way to cheaply get an ordered list of changesets? e.g. if we can't rely on clocks being set correctly, can we just pull the most recent changesets off the db?
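
For the date side, a rough sketch (names and fields are assumptions, and it inherits the clock caveat above) would be a hyperlog-index pass writing changesets into a leveldb keyed by a packed timestamp:

var indexer = require('hyperlog-index')
var lexint = require('lexicographic-integer')

// log: the osm-p2p hyperlog; indexDb, changesetDb: leveldb instances (setup omitted).
var dex = indexer({
  log: log,
  db: indexDb,
  map: function (row, next) {
    var doc = row.value && row.value.v // guessing at the wrapped value shape
    if (!doc || doc.type !== 'changeset') return next()
    var t = new Date(doc.created_at || 0).valueOf() // assumes a created_at field
    changesetDb.put(lexint.pack(t, 'hex') + '!' + row.key, row.key, next)
  }
})

// most recent first: changesetDb.createReadStream({ reverse: true, limit: 20 })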

media replication

this should either be a separate module or a plugin, using hyperdrive

user/uid/device id in the data model

We will need:

  • an index to associate uids with user names so that users can change their names with a continuity of editing history
  • a device id based on a keypair to sign the log messages
  • an index to show the list of devices with the time of the last edit to make sure datasets are up to date

new types

This feature would support adding custom new types on top of the existing entities. For example, an observation type might describe a fact at a particular time and place by linking to a node, but observations would be immutable whereas nodes and other entities can mutate.

e.g. when mapping a coffee shop that does not currently exist in the db, the first step would be to make an observation with a GPS/mobile app: "I was here and there is a coffee shop here", and on the basis of that we can create a coffee shop either automatically or manually (this is actually the current OSM workflow using GPX files from a GPS). On a subsequent visit to the coffee shop you might observe that the name has changed, or it has closed down, and you would make an additional observation linked to the coffee shop node or way indicating its current status.

Aside from reading refs and members from ways and relations, the database doesn't care what type a document is, so observation types could be added right now. Observations would likely need their own index over the hyperlog, which is also possible to do externally before the log is passed to osm-p2p-db.

I think this use case warrants an example. If the example is messy or difficult, I'll make a separate module or plugin mechanism to make custom types easier to deal with.
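
As a starting point for that example, creating an observation with the current API could look roughly like the following; osm is an osm-p2p-db instance, and the observation fields and the link property are hypothetical:

osm.create({
  type: 'observation',
  lat: 64.5,
  lon: -147.3,
  timestamp: new Date().toISOString(),
  tags: { amenity: 'cafe', name: 'New Coffee Shop' },
  link: coffeeShopId // hypothetical reference to the node/way this observation is about, if any
}, function (err, id, node) {
  if (err) return console.error(err)
  console.log('created observation', id)
})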

Performance/optimization opportunities

We're probably too early to start thinking about optimizations, but these ideas/questions were on my mind so leaving this here for discussion.

  1. As far as I understand the code, the indexes store look-ups from id to the original value. What performance might be gained from creating complete 'materialized views' with a copy of the original data, at the expense of increased disk space and perhaps longer write times? i.e. looking up a value from the index would only require a single read, rather than two (one for the id(s) and one for the original record).
  2. For the spatial index, what might we get by creating a quadtile index as opposed to a kdb tree? My idea would be to create a quadtile index at a single zoom level (e.g. z16) and respond to bounding box queries by approximating the bbox to the overlapping zoom 16 tiles (sketched below). If we stored this as a materialized view as opposed to an index, bbox queries could be very fast, since it could be a single batch read from leveldb bounded by the min-max of the tiles we are looking for. Would it matter if we responded to bounding box queries with a close approximation (e.g. rounded to z16 quadtiles) as opposed to the exact bbox boundaries? iD Editor, for example, only makes bbox requests to the OSM API for bounding boxes equivalent to z16 tiles.
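
A rough sketch of the quadtile idea (an illustration only, not the current implementation): convert lon/lat to a z16 tile, then interleave the tile x/y bits into a quadkey string so that a bbox can be approximated by a small set of leveldb key ranges.

var ZOOM = 16

function lonLatToTile (lon, lat, z) {
  var n = Math.pow(2, z)
  var x = Math.floor((lon + 180) / 360 * n)
  var latRad = lat * Math.PI / 180
  var y = Math.floor((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * n)
  return { x: x, y: y }
}

// Standard quadkey interleaving: one digit (0-3) per zoom level.
function tileToQuadkey (x, y, z) {
  var key = ''
  for (var i = z; i > 0; i--) {
    var mask = 1 << (i - 1)
    var digit = 0
    if (x & mask) digit += 1
    if (y & mask) digit += 2
    key += digit
  }
  return key
}

var tile = lonLatToTile(-122.27, 37.8, ZOOM)
var quadkey = tileToQuadkey(tile.x, tile.y, ZOOM) // 16-digit key usable as a leveldb key prefix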

opts.size === chunkLength?

Should the opts.size argument be equal to chunkLength which is passed when creating the chunk-store? Maybe rename to opts.chunkLength? Can we read this from the chunkStore rather than requiring it? What would happen if it was different?

Modifying a fork without merging conflict

From the readme 'To merge these "conflicts" back into a single value, use osm.put(id, doc, cb) to store the desired document value.'

How do you update a doc that has a fork without merging? I can think of several scenarios where a user would need to update a forked doc (i.e. update one of the forks) and postpone dealing with merging / resolving conflicts.

bulk import/export

It should be possible to import an OSM extract for offline editing and to export a changeset or sequence of changesets back to the mainline OSM.

deforking abstraction

I think osm-p2p-defork may be a level of abstraction too far. There are two ways of getting data out of osm-p2p-db that would need deforking: .queryStream() and .get(). I think it makes more sense to build in deforking to these methods, essentially merging the work done on osm-p2p-server into osm-p2p-db. Deforking is very tied to internal implementation details of osm-p2p-db, and I think having it as an external module could make it more difficult to maintain and for others to understand.

listing conflicts (forks)

Can we add an example for listing forked records? Should we add a method osm.conflicts() for this?

v4 ideas - simplifying the API

I think we can simplify some of the API, and move higher-level API calls into a separate module, something like osm-p2p-api. This is where all the checks can take place to maintain data integrity (not deleting nodes that are part of a way etc.). Much of this work is already in digidem/osm-p2p-server/api.

  • Remove osm.create() - functionality is in osm.put().
  • Remove osm.getChanges() - this is not particularly useful, and the higher-level functionality is in the api from osm-p2p-server.
  • Merge osm.query() and osm.queryStream() into a single method that returns a stream when no callback is passed (see the sketch after this list).
  • Change the property refs on ways to nodes to match OSM JSON format.
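
For the query/queryStream merge, a minimal sketch of the usual callback-or-stream pattern, assuming queryStream stands for the existing stream implementation:

function query (q, opts, cb) {
  if (typeof opts === 'function') { cb = opts; opts = null }
  var stream = queryStream(q, opts)
  if (!cb) return stream
  var results = []
  stream.on('data', function (doc) { results.push(doc) })
  stream.once('error', cb)
  stream.on('end', function () { cb(null, results) })
}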

Quantization of lon/lat coordinates

We should quantize coordinates to a fixed precision to avoid database bloat and potential floating point rounding errors in the spatial index. Decimal degrees precision:

decimal places   decimal degrees   @ 0° Lat    @ 30° Lat    @ 60° Lat
2                0.01°             ~ 1 km      ~ 0.9 km     ~ 0.5 km
3                0.001°            ~ 100 m     ~ 90 m       ~ 50 m
4                0.0001°           ~ 10 m      ~ 9 m        ~ 5 m
5                0.00001°          ~ 1 m       ~ 0.9 m      ~ 0.5 m
6                0.000001°         ~ 0.1 m     ~ 0.09 m     ~ 0.05 m
7                0.0000001°        ~ 0.01 m    ~ 0.009 m    ~ 0.005 m
8                0.00000001°       ~ 0.001 m   ~ 0.0009 m   ~ 0.0005 m

8 decimal places would give us 1 mm precision, which I think is more than enough; the GeoJSON spec recommends 6 decimal places for <10 cm precision. There may be use cases for greater precision, e.g. drawing precise building footprints. Perhaps the precision should be configurable at a DB level? Would it need to be kept consistent across the whole db?
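
A minimal quantization sketch; the 7 decimal places here are only an example of whatever precision we settle on:

var PRECISION = 1e7 // 7 decimal places, ~1 cm

function quantize (deg) {
  return Math.round(deg * PRECISION) / PRECISION
}

quantize(-77.03653499999) // => -77.036535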

deterministic keys for osm data when importing

Debugging the issue we've had with exporting observation data, I found that the primary issue was that the osm-p2p key for the same OSM node would be different on the desktop and mobile apps.

This issue came up because we are doing an import from xml in two separate apps, and create is assigning random values for the key of the same data.

In the create function, instead of:

var key = hex2dec(randomBytes(8).toString('hex'))

We could do something like

var key = hex2dec(Buffer(value.id).toString('hex'))

Where value.id is the id from OSM, so that no matter who creates the data the key is always the same.

There are situations where no id is present, so we could do something like this:

if (value.id) {
  // derive the key deterministically from the existing OSM id
  key = hex2dec(Buffer(value.id).toString('hex'))
} else {
  // no OSM id present: fall back to a random key as before
  key = hex2dec(randomBytes(8).toString('hex'))
}

An alternate solution could be that when importing we use put and provide the key. But it seems like having create handle existing OSM data by using its id would be reasonable.

I'd also be tempted to just use the OSM id in its original form when importing instead of a new key or a hash of the id. I'm not sure if there's a reason not to do that.

Add .close() method

We have a very specific use case for the osm-p2p-db. The main process that runs the server creates a database. Subsequently, when we try to connect to the database from a different node process, we get a "Resource temporarily unavailable" error because of leveldb's LOCK file.

Looking through the documentation I wasn't able to find any way to close the db connection. What would be the best way to go about this?

Cc @olafveerman @gmaclennan

relations

The current implementation treats relations like ways. Relations may entail more lookups.

database isolation

We need to have:

  • a way to set a database ID
  • a way to refuse replication with a peer whose ID doesn't match the current ID
  • a way to set the database ID after replication if an ID isn't set
  • a way to force replication when the IDs don't match

Investigate indexing performance

I'd really like to see faster indexing. Multiple people have brought up how slow it can be. Reminder to myself to do some profiling on the indexing pipeline and find the bottlenecks.

Close gracefully

We need a method to close all the open stores gracefully, both level and chunk stores. There are many of them.

I'm unclear right now whether electron waits for any pending leveldb writes when it quits; we should make sure to close osm-p2p-db in the quit handler.

We need this function for cleanup of test dbs when running in the browser.

Defaults - hyperlog

Are there potential alternatives for hyperlog, or will this always be used? I am thinking we should make hyperlog a hard dependency and just pass the leveldb? If so we could just pass a single leveldb in the options and use sublevel? Are there performance gains from having two separate leveldbs?
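
A rough sketch of the single-leveldb option using level-sublevel (the sublevel names are hypothetical):

var level = require('level')
var sublevel = require('level-sublevel')

var db = sublevel(level('./osm.db'))
var logDb = db.sublevel('log')     // would back the hyperlog
var indexDb = db.sublevel('index') // would back the indexes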

Do we need opts.size or can this be read from the fd-store?

I'm thinking how we can reduce the complexity of the options.

Add ready() method

We should have a ready() method that executes a function when the indexes are fully caught up, essentially exposing the same method that is on hyperlog-index.
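
A rough sketch, where indexes stands in for whichever hyperlog-index instances the db creates internally (names hypothetical); each of them already exposes .ready(fn):

function ready (indexes, cb) {
  var pending = indexes.length
  if (pending === 0) return process.nextTick(cb)
  indexes.forEach(function (dex) {
    dex.ready(function () {
      if (--pending === 0) cb()
    })
  })
}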

edge cases with forks

Here are some edge cases regarding ways, their forks, and their nodes that we ought to consider & test for, pulled from @gmaclennan in IRC:

If a point is forked and one fork is inside the query and the other outside, does it just return the one inside? It probably should and probably does.

Here's a reason for some ghost points: when there is a forked way, the query lookup I believe will return the points associated with both forks of the way. But the points aren't forked, so we return them all, but we choose only one of the forks of the way to return.

Also: if a way is forked and one fork is a delete, does the indexer correctly delete the reverse lookup? We should ensure we have a test for that.

published version of osm-p2p-db contains benchmark outputs

While investigating the size of mapeo-desktop, I found that the osm-p2p-db/benchmark/copy folder size is 59mb

How do we prevent this? Will simply adding a .gitignore for benchmark/copy be enough, or will we need to make sure it is removed before publishing?
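
For what it's worth, npm falls back to .gitignore only when no .npmignore file exists, so a .gitignore entry should be enough in that case, but an explicit .npmignore entry (or a "files" whitelist in package.json) is more robust:

# .npmignore
benchmark/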

batch changesets

Right now items in changesets are written individually, so if the process crashes mid-way through an upload, there could be nodes without their way. It makes sense to expose batches through the existing changeset system in OSM.

changesets

Needed upstream in osm-p2p-server: digidem/osm-p2p-server#5

Reposting here. This is currently in-progress using hyperlog-join to handle the changeset foreign key relations. I'm also refactoring the refs/members relation code to use hyperlog-join.

Store presets in db

iD Editor uses JSON presets to define a UI for the categories of features to map, which symbol to use, which fields to show, and how these map to tag key-value pairs.

We currently maintain these as a separate repo and build them into a single JSON file that iD can use with id-presets-builder. We import that into Mapeo by bundling the presets and the translations file with mapeo-settings-builder and then manually importing this file: https://github.com/digidem/mapeo-desktop/blob/master/browser/main.js#L70-L84 and https://github.com/digidem/mapeo-desktop/blob/master/lib/user-config.js#L49-L62

We should store these presets in the DB and sync them. In the future users should be able to update presets themselves, but I think we need an admin override. Dealing with forks of presets gets really complicated for the user to understand, so I think we should have a fork/conflict resolution method for this.
