opentraffic / architecture Goto Github PK

OTv1 overview

architecture's Introduction

The documentation and issues in this repository describe the OTv1 platform. The basic goals remain the same for Open Traffic, although some of the specifics of the OTv2 architecture have changed. For more information, see opentraffic/otv2-platform.

Concept

Real-time and historical traffic speed data is a critical input for most transportation and transport analysis applications. Currently traffic speed data is absent from OpenStreetMap (OSM), undermining its value as basemap for transport applications. Commercial data sets are available for some locations, however, licensing terms and prices are prohibitive for many potential applications. And many areas of the world, particularly developing countries do not have commercial traffic data sets.

This project aims to develop a non-commercial global traffic speed data set linked to OpenStreetMap and built on open source software. Speed data are derived from GPS probe data pooled from fleet operators and app and device makers. The GPS location data is converted into OSM segment-linked speed measurements and strippped of any identifying information about the source and/or journey. The speed data is archived to support real-time and historical analysis applications.

As currently conceived the traffic pool will be operated by a non-profit entity with a mission to improve access to traffic data and with the responsibility for coordinating activities of pool stakeholders. The exact structure of this entity will be determined in collaboration with stakeholders.

Components

https://docs.google.com/a/conveyal.com/drawings/d/1sGqv2nPg9K1uWwD846W7mb1anQO1ZyugMjpKYT_Pl-4/edit

This project is designed as a data pool that connects entities and individuals with real-time location data with processing, data storage, and routing and analysis applications. A primary design goal is enabling data contributors to share derived traffic statistics without sharing underlying GPS location data or fleet information. Additionally contributors to the pool gain access to routing and analysis tools that enable them to immediately utilize derived traffic data.

The pool consists of several related components:

Traffic Engine

The Traffic Engine (TE) translates vehicle location to OSM-linked speed estimates. By design the TE can be run inside a fleet operator allowing internal conversion from GPS location data to to traffic statistics. This ensures that the only data to leave the data provider’s network are fully anonymized traffic statistics.

Similarly, a version of the TE SDK can be embedded into consumer applications or GPS-enabled devices allowing direct calculation and sharing of traffic statics on the device. This helps address privacy concerns for sharing location data and significantly reduces power consumption and data transfer by contributing traffic statistics.

Traffic Data Pool

The Traffic Data Pool is a central storage repository for speed observations. The pool is operated by a non-commercial entity ensuring security of the data and continuity of operations.

The pool collects traffic observations from contributors and provides access to an aggregate real-time snapshot of traffic conditions, as well a historical archive of observation data in support of analytic applications.

OSM-linked Traffic Data Set

The pool will create a static archival snapshot of traffic data and a real-time feed for use by third-party application developers. This will enable developers to incorporate traffic data sets into routing, mapping and analytic applications without restriction for any location where data is available. This data set will include one or more linear referencing methods to link traffic data to OSM or other non-OSM basemaps.

Real-time Routing API

The pool provides multiple interfaces to conditions data, including a real-time routing API available for use by pool contributors. This enables contributors to derive direct benefit from shared data by generating routing and arrival time estimates.

OSM Trace Data + Map Dust?

In addition to storing GPS-derived traffic data there may be value in storing and analyzing trace data to improve the basemap. These traces could include existing OSM GPX trace data sets as well as new trace data collected via the Traffic Engine to the extent that data privacy considerations allow. These traces could be processed to generate OSM “map dust” (missing links, poorly documented turn/directional restrictions etc.) and incorporated into OSM data improvement workflows.

architecture's People

Contributors

Stargazers

Watchers

Forkers

gruppopotente surfcao lkngin ktngoykalolo kristaum gautham-kandela gautam1858 mauritsri niminmptechonly sunmer

architecture's Issues

Anonymization method

As stated in the architecture README:

That is to say, the series of GPS positions identified with individual vehicles are fed into the traffic engine, but the only information that is pushed out of it into the shared database is identified with map features. It is of course possible to reconstruct paths in places where the number of observations is very low (one or two taxis moving through a sparse residential area) but the traffic engine has a threshold number of observations below which it will not report any data. We may even assume that a place with so few observations has negligible congestion or is rarely traveled through.

This reporting threshold is configurable by the organization running a particular instance. So in sum, each contributor runs their own traffic engine, that traffic engine never shares vehicle identifiers with the outside world, and it only exports speed/congestion data under conditions set freely by the contributor.

This basic architecture should go a long way toward eliminating the risk of tracking any one probe vehicle, but some questions still remain: how high should the observation threshold be set, and might there be other subtle details that would allow a sophisticated consumer of this data to reconstruct trajectories?

The French Open Taxi Data project, which is interested in contributing to our congestion/speed database, is undergoing review by the CNIL and has statisticians available for counsel on user privacy issues. Like all of us they are interested in anonymization, but they have some specific strict guidelines to adhere to. I would welcome any comments here from @l-vincent-l @odtvince or their anonymization advisors.

simulator to generate test traces from OSM geometries

The simulator produces test GPS traces with know speed and trace characteristics. We can compare TE outputs to inputs to measure error. Also can introduce GPS trace noise to detect implications for traffic data quality. (There's an existing simulator implementation for OTP version that needs to be ported to new TE OSM)

make sure traffic-engine can run continuously on streaming data

benchmark/tune cpu and memory performance for streaming operation

need to know per thread, per core, and per machine throughput
need to know per vehicle memory overhead and make sure we don't leak

benchmark/tune spatial extent

Can we load planet.pbf into the system? is that the right way to run multiple cities?

build server wrapper for traffic-engine appliance

We need a way to stand up traffic engine and expose an end-point that allows GPS streams to be pointed at it. This endpoint could accept REST based GPS events, a blob of CSV or a Protobuff message with lots of data. We defined a protobuf format in the old project that could be reused. CSV and REST based stuff may not be great for high-volume applications b/c of parsing overhead but could be still be useful alternatives for some users.

Traffic Data Exchange Format

How do we exchange traffic data in bulk and/or real-time?

How do we stream LR data with time/speed info?
Does the OpenLR binary spec work for this or should we use something more modern/well defined. E.g. web mercator tiles via protocol buffers?
What produces the stream? How to consumers attach to it and/or define extent of the stream that they care about? Can we build on the success of vector mercator tiles to segment/distribute data?
How do we store archival time/speed data? Our pilot uses an in-memory OLAP cube store. Fun for getting quick time slices, but probably not scalable. Could we move this to an offline S3-based system? * Are there good object store-centric models for archiving temporal data at planet.pbf scale?

Storage architecture for traffic data (the pool)

Need to store temporal data for speeds, attached to OSM ways/segments. Ideally data would be stored (or at minimum viewed) as a histogram of speed observations for a given window of time.

In the past we've used an in-memory key/value store with a list of values per segment & time window. As we scale up the geographic area we'll need to make sure the storage architecture can accommodate a global data set.

Currently considering a tile-based store that's bounded by time (/data/x/y/z/time-range.datafile). Files can be archived to an object store (e.g. S3). But question remains about which file format is best? PBF or something more application specific?

Are there formats that are well suited for querying (e.g. SQLite?) or should we use simple data storage and move the computation into the downstream application?

Compactness, strong client-lib support, and read/write/seek latency seem to be key considerations.

Hosting arrangements for traffic data applications

One potential benefit of the pool would be to make it easier for TNCs to provide a valuable service to government counterparts through the traffic data applications.

If these applications were hosted by the pool, some level of effort would be necessary to support admin, technical support, map-centering, etc. One thought may be to charge a fee for hosted applications, based on a sliding scale relative to a country’s GDP. The fees could be used to support app hosting costs, and could, conceivably, be paid back to the TNCs, in proportion to the amount of data they contribute to a particular government’s jurisdiction.The TNCs logos could also appear on the application.

A possible benefit of central hosting would be to concentrate global developers’ efforts on the improvement of a refined pool of applications?

Alternatively, we could look at models where the hosting and support arrangements are worked out locally by the data contributors and third parties – something to discuss.

compare against previous (old implementation) results for existing data sets

create time binning for statsitics

Can we develop a stats bucket that grows the time window to capture data in a reasonable way?

vehicles don't travel down pedestrian ways

Disallow speed emissions for ways that disallow vehicle travel.

This raises a question about small TNP vehicles that in fact do travel on small or pedestrian ways, such as motorcycle taxis or rickshaws.

store speed distributions

store speed distributions on a way segment basis.

investigate triplines out of order

In some cases if a road segment is short enough, the standoff between an intersection and its associated tripline will result in the triplines crossing the road in reverse order.

minimum tripline pair segment length

Specify a minimum tripline pair segment length. If a segment is below the minimum length, merge it with a neighbor. The goal is to clean up the tripline clutter in complicated intersections, which tend to interfere with tripline pair completions.

Distributed servers

The nature of this project (crowd-sourced, collaborative, big data, geospatial, append-only) leads me to think of using a distributed architecture, a bit like how DVCS works.

A distributed architecture could be used in the following scenarios:

Load balancing: several identical servers, answering the same requests;
Geographical balancing: several distinct servers, each taking care of a region (possibly overlapping);
Anonymization: a server taking care of data import for a commercial source (a taxi company for example), merging data only once collected;
Open/closed data: servers containing only open-data, servers containing closed and open data;
...

In the following, I'm assuming:

A new atomic data can only be created by one server;
The ID mapping can be done independentely (for example only depends on the OSM data the server use);
The stored data is simply an append-only list of atomic data, where the primary key is (ID/timestamp/creator).

Each server would be referenced by an unique ID (for example, the server URL). When creating a new atomic data, a new entry is generated (mapped ID, timestamp, creator ID). We could potentially use an internal indexing for the creator ID (in case of URL) for optimizing memory usage.

Each server keep a list of last updated timestamps for a list of server. By default this list is empty.

When syncing data, the client pull data from the server: it sends this list of timestamps with the geographical zone it is interested in to the server in the request. The server send back the list of data that have been modified later than the last updated timestamp, alongside it's own list of last updated timestamps, adding itself using "now".

Once synced, the client update the list of last updated timestamps for each server, using the list sent by the server: for each server timestamp, a greater than the current one is incremented.

This procedure could be made incremental to allow for initial syncing, which could transfer large amount of data (in that case the timestamp would increase little by little), the server only answer with a partial list of data.

This procedure is meant to handle gracefully various scenarios:

A sync with B, B with C
A sync with B, B with C, C with D, A with C
etc..

I maybe over-simplifying things, and there are probably subtles issues that would arise, but this maybe a useful feature.

limit tripline-based speed estimates to arterials

Limit tripline-based speed estimates to arterials.

Non-arterial roads tend to have poor speed estimate accuracy as a result of noise and crosstalk with nearby heavily-used roads.

Estimate speed depending on turn direction

For some ways the speed profile (at least before an intersection) depends a lot on the direction taken at the following intersection. For example, a left turn being on average much slower that a right turn. Or the time profile for each direction is very different (from 8 to 9 turning left is slow, from 16-18 turning right is slow...)

A GPS trace contains implicitly this information; for a given speed segment we usually know a more complete path, that is which segment will be the next one. We could then encode this data to get better speed estimates (that's a place were OpenLR shortest path encoding may be helpful in exporting this kind of contextual data).

A simple solution could be to split each segment, one for each next direction, a bit like the "turn-edges" graph approach taken by OpenTripPlanner some time ago to solve the turn restrictions (but now deprecated). Another solution could be to encode the next turn (using some index or ID) alongside the stored data. Or maybe a more generic mechanism where the next turn is treated in a same way as vehicle type information (taxi, bike, bus, etc...).

When the amount of data is low, we could apply ("smear out") speed profiles from some turns to others. For example if a right turn to a very seldom used street do not have enough data, we should be able to fallback on the main ("go straight") speed data.

Is this feature useful? How can we detect if and where this is useful? What's the impact in term of data storage? Code complexity? Export format?

disallow travel wrong way down a one-way road

Disallow travel wrong day down a one-way road.

tripline segment completions aren't valid if they're interrupted

Right now if a vehicle trips A1, then B1, then A2, then B2, it will emit two different speed estimates, at the completion of both segment A and segment B. The idea there is that the vehicle may have actually traversed segment A or segment B, but probably not both. Since we have no way to say which one it was, we might as well emit a speed estimate for both.

Considering that such a case is actually impossible, one simple solution is to only emit a speed estimate only if the last two tripline crossings are adjacent on a way.

Speed emissions for non-consecutive node id segments

Right now the traffic engine only emits a speed sample for traversal of triplines associated with consecutive nodes on a way. In fact a speed should be emitted when a vehicle completes a traversal from consecutive tripline clusters on the same way, where a tripline cluster is the pair of triplines flanking an intersection.

Apparent memory leak in traffic engine

At about 43 million points, the thing consumes 4gb of memory and grinds to a halt. There's no good reason that should happen.

Using OSM GPX planet as a data-source

Announcing the GPX Planet. 2.6 Trillion GPX points.

It could be nice to be able to use this (or part of it) as a GPS source.

create time binning for statsitics

Can we develop a stats bucket that grows the time window to capture data in a reasonable way?

Shapefile output

Command line traffic-engine outputs shapfile. The shapefile shall be a linestring shapefile, where objects are tagged with speed statistics.

One significant challenge of this is that each way segment will have 336 attributes - a count and a mean for each hour of the week. 225 is the maximum number of fields for a shapefile. The shapefile would either need to store more coarse statistics - average speed overall, for example - or the shapefile could ship with a variety of id-keyed statistical extract CSVs which one could join to the shapefile in the GIS application of their choice.

Self-validation using sub-trace O/D pairs

As discussed with @abyrd we can self-validate the traffic model (+routing) by comparing observed O/D travel times against estimated travel times using the traffic model.

By sampling O/Ds from within traces we can avoid privacy issues since the O/D paris don't correspond to actual movements.

Concerns were raised by @mattwigway about sample bias as sub-trace O/Ds may include stopovers or non-direct routes since the sub-trace sections do not correspond with purposeful movements

I think there's a possibility for overcoming this by storing O/D, time and distance. Sub traces where distances aren't in-line with typical shortest path distances could be discarded, eliminating many non-direct routes. Also trips where times don't correspond with other similar observations could be discounted.

Detecting and Storing Map Dust from GPS data

How do we create Map Dust? (http://www.mapdust.com/)

Clear opportunities for producing an automated side-stream of OSM improvements (missing links, incorrect directional/turn restrictions, etc.) from probe data sources.
How do we collect GPS data (different from GPS-derived traffic data) that protects privacy and makes data sharable with OSM community?
Can we make this an opt-in feature of Traffic Engine?
Where does this data live? Obviously different from traffic data but is it conceptually related enough to make part of the same project?

full switch to MapDB for data storage

Deciding what gets included in the archived traffic statistics

In addition to storing the average travel time by road segment and by time period, it would be very useful if we could find a way to also include the number of observations associated with the average travel times -- both for the purposes of establishing the reliability of the results, as well as for use in other applications that may rely on such data. Including observations makes the pool more valuable. Of course, if there is only one data contributor in a given region, this may impinge on their commercial data security concerns.

Thus, a technical challenge may be posed. Would it possible to make the # of observations accessible only in in cases where there are at least two operators covering roughly the same geographic area?

Linear referencing + stable IDs for OSM edges

How do we solve linear referencing in OSM?

Possible paths:

OpenLR

Existing standard. Who’s using besides TomTom & INRIX?
Advantage of not requiring new IDs/curation of IDs as basemap changes
Disadvantage is that it’s completely car-centric. Do we need bike/ped LR features? Can we extend OpenLR to include them?
- Are we creating a new standard if we augment OpenLR given hardware device interoperability?
Does the OpenLR binary format work or are we better served with standards-based/future-proof formats like PBF etc.?

Telenav TTLs

http://wiki.openstreetmap.org/w/images/1/16/SotM2012_Telenav_Traffic_Locations.pdf
Not widely adopted. Is it really “open”?
Requires ID curation
Appears to be car/traffic centric

OpenStreetMapLR

Is there an opportunity to build something that takes advantage of having a truly shared/open basemap? Or taking advantage of specific OSM characteristics/features?
Do we use the ideas/features of OpenLR but build around OSM conventions?
Do we build an ID catalog system for TTLs that works with real-time map updates?
Do the upsides of building in multimodal functionality and better exchange formats outweigh the downsides of “yet another standard”? http://xkcd.com/927/

OpenStreetMapLR + OpenLR?

Can we mitigate downsides by producing both as output streams. Compliant OpenLR for backward compatibility and OSMLR for full feature set?