
datastore's People

Contributors

dgearhart, dnesbitt61, drewda, gknisely, heffergm, kdiluca, kevinkreiser, stvno, zerebubuth

datastore's Issues

Consider serving tiles where gzip is enabled

From a test I did earlier today, a server that gzips tiles on-the-fly gives us a >90% savings on the transferred file size:

[screenshot: browser network panel showing transferred vs. uncompressed tile sizes, 2017-07-27 16:28:51]

If you look on the right side, the top number is the actual size transferred over the wire; the bottom number is the uncompressed size. Note that these were the original JSON files with unmangled properties. I did a similar test with the mangled properties, and the resulting file sizes remained pretty close, because of the nature of the gzip algorithm.

This is significantly better for download performance (though it says nothing yet about memory or processing performance). In practice, a user who previously had to wait roughly 3 minutes to download 600MB now waits roughly 20-30 seconds for 60MB. This is excellent. My goal was to get our download sizes to about 60MB per request. Even if we spent a bunch of time rejiggering how we roll up data, or determining which properties to include, that would probably, at best, get us 50% of the way there. By gzipping the tiles, we get 90% savings immediately with only a tweak to the infrastructure.

Please note that this does not mean serving files that were gzipped manually. The browser automatically decompresses responses that were transmitted over the wire with gzip content encoding, but if the files are simply stored and served as gzipped files, the browser does not decompress them automatically, and you would then need JavaScript to fetch and gunzip each file client-side, which is not optimal. Therefore, the server must serve files with gzip compression turned on, which is not the same as having the export process create gzipped files.
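
As a quick sanity check of the savings on any candidate setup, something like the following rough sketch (using Python's requests library against a hypothetical tile URL) compares bytes on the wire with the decompressed payload:

# Rough check of on-the-wire gzip savings for a single tile.
# The tile URL is hypothetical; substitute a real extract URL.
import requests

url = "https://example-tiles.s3.amazonaws.com/0/000/747.json"  # hypothetical
resp = requests.get(url, headers={"Accept-Encoding": "gzip"})

# requests transparently decompresses gzip responses, so len(resp.content) is
# the uncompressed size; Content-Length (when present) is bytes on the wire.
wire_size = int(resp.headers.get("Content-Length", 0))
raw_size = len(resp.content)
if wire_size and raw_size:
    savings = 100.0 * (1 - float(wire_size) / raw_size)
    print("transferred: %d bytes, uncompressed: %d bytes (%.0f%% savings)"
          % (wire_size, raw_size, savings))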

In summary, here are my recommendations for now:

  • Store raw data exports with unmangled properties (more user-friendly to work with)
  • Serve the S3 files from behind something like CloudFront so that we get gzipped transfers.

Design static data product pipeline

The static data product pipeline will read the Datastore's internal histogram tile files and produce:

  • data extracts for public use (see #36)
  • historical routing tiles for use by Valhalla (see #33)

Requirements:

  • run jobs at scheduled intervals (no need to process constant streams)
  • read ORC files [*]
  • output a variety of formats (PBF for Valhalla routing tiles, as well as whatever formats are selected for #36)
  • easily scale available memory, disk, and other resources so that close supervision and tuning is not required -- either scaling "vertically" with larger instances/machines or "horizontally" by distributing data across instances/machines
  • run multiple jobs at once in parallel
  • monitoring of job performance and resource usage
  • handle job failures and retries
  • logging for development and debug purposes
  • ... others?

Given those overall requirements, a few options to evaluate more closely:

  • AWS Glue: fully managed; appears to use Spark; only available in preview
  • AWS Data Pipeline: fully managed; appears that it will be replaced by AWS Glue; can run either EMR jobs or arbitrary shell commands
  • Apache Spark on AWS EMR: could load histogram tile files from S3 using the "EMR File System" (EMRFS)
  • AWS Batch: can run Docker container-based processes on on-demand and spot EC2 instances
  • ... others?

[*] The histogram tile files are available in both ORC and FlatBuffer formats. Based on the performance found in https://github.com/opentraffic/histogram-format-tests, it appears that ORC will perform better for producing data products that involve reading in an entire tile file (rather than subsets of many tile files). So, we're planning to use FlatBuffer files to power the ad-hoc query API and ORC files to power this pipeline for producing static data products.
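
To make the shape of such a job concrete, here is a minimal PySpark sketch of the kind of EMR batch job implied by the requirements above; the bucket names, partition layout, and column names are all hypothetical, and the real aggregation will depend on the histogram tile schema.

# Hypothetical PySpark job: read ORC histogram tiles from S3, write an extract.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("histogram-to-extract").getOrCreate()

# Read one week's worth of ORC histogram tiles from S3 (via EMRFS s3:// URIs).
histograms = spark.read.orc("s3://example-datastore-histograms/2017/week=30/")

# Aggregate to an average speed per segment per hour of the week.
extract = (histograms
           .groupBy("segment_id", "hour_of_week")
           .agg(F.avg("speed_kph").alias("avg_speed_kph")))

# Write the output; the real pipeline would pick formats per #36.
extract.write.mode("overwrite").parquet("s3://example-public-extracts/2017/week=30/")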

How to prevent multiple uploads of archived traces

A potential problem with a central Datastore shared by numerous providers is that the same archived data could be uploaded multiple times, leading to over-weighting of statistics from that set of archived trace data.

There does not seem to be a way to identify any single report as being uploaded previously (especially since we want to maintain privacy).

One possible protection could be an extra control at the data-provider key (or ID) level that rejects data over X days old. For some initial period, a data provider would be permitted to upload archived statistics; after that initial period, only "recent" traces would be permitted. This might require some coordination between the Datastore maintainer and the data provider.
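
A minimal sketch of that age check, with hypothetical field names and a hypothetical per-provider grace-period lookup:

from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)  # "X days"; actual value to be coordinated per provider

def accept_report(provider_id, report_end_time, grace_period_over):
    """Reject reports older than MAX_AGE once the provider's initial
    back-fill window has closed."""
    if not grace_period_over(provider_id):
        return True  # initial period: archived uploads still allowed
    return datetime.utcnow() - report_end_time <= MAX_AGE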

Collect match failure statistics from Reporter instances

Each Reporter instance now logs statistics on match failures (see opentraffic/reporter#34). We should aggregate this at the Datastore for debugging and evaluation purposes.

An option to consider:

  • send from Reporter to an S3 bucket (either directly or through Kinesis Firehose)
  • store in structured files in S3 (likely CSV or JSON)
  • whenever it's necessary to run queries on the stats, load the files from S3 into AWS Athena

We should also consider any privacy-related concerns around the statistics being aggregated.
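
As a rough illustration of the direct-to-S3 option listed above (the real path might go through Kinesis Firehose instead), the Reporter-side upload could look something like this; the bucket name and key layout are hypothetical:

import json
import boto3

s3 = boto3.client("s3")

def upload_match_failure_stats(reporter_id, stats, date):
    # One JSON-lines object per reporter per day, partitioned by date so
    # Athena can query it efficiently later.
    key = "match-failures/date=%s/%s.json" % (date, reporter_id)
    body = "\n".join(json.dumps(rec) for rec in stats)
    s3.put_object(Bucket="example-reporter-stats", Key=key, Body=body.encode("utf-8"))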

Produce data extracts for public use

The Open Traffic platform will regularly produce data extracts for public use. These will contain historical traffic speeds and be available for download by the public from S3.

Individual privacy concerns: speeds will only be reported for OSMLR segments for which the number of original observations in the Reporter passed privacy thresholds -- that is, it will not be possible to recreate the journeys of individuals using one of these public data products.

Data-provider competitive concerns: absolute counts of vehicles/observations (which are stored in Datastore's internal histogram tile files) will not be included in publicly accessible data extracts.

Overall tasks:

  • Spec out the formats
  • Implement jobs using the static data product pipeline (see #37) to turn internal histogram tile files into the public data extract formats

Contents:

  • geometries: embed geometries from OSMLR GeoJSON tiles?
  • vehicle/observation counts: Absolute counts of the number of observations used to produce an average speed cannot be made public. Include an index to indicate the relative number of counts used to produce each average? If so, what should be the range and resolution of this index?
  • TODO: @kpwebb to fill in more details

Format Requirements:

  • TODO: @kpwebb to fill in more details about which consumer apps and languages to support

Produce Reference Speed Tiles

Need a process to periodically query histogram tiles to create "reference" speed information.

Reference speeds will present the speed at various percentiles within a sorted list of speed observations. Having several percentile values will allow users to see how variable speeds are across all hours of the week. An analysis application can choose one of the percentile readings to use as a reference speed when comparing different hourly speed values.
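
A minimal sketch of the percentile computation, assuming the observations for one segment have already been pulled out of the histogram tiles as a flat list of speeds; the particular percentiles shown are placeholders, not a decided set:

import numpy as np

PERCENTILES = [20, 40, 60, 80]  # placeholder choice

def reference_speeds(observed_speeds_kph):
    """Return percentile -> speed so an analysis application can pick
    which percentile to treat as the reference speed."""
    return {p: float(np.percentile(observed_speeds_kph, p)) for p in PERCENTILES}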

Produce historical Valhalla Routing Tiles

Need a batch process to convert histogram tiles to historical Valhalla routing tiles.

  • Define tile format:

  • Per-segment speed (indexed by segment ID within a tile) - perhaps 168 bytes (one byte for speed in kph for each hour of the week); see the sketch after this list.

  • Segment-to-segment transition times - these cannot be indexed as easily, since each traffic segment may have a variable number of transitions. Each segment ID (a single ID within the tile) may have several fixed-size records, each including a "to_segment_id" (a full Valhalla GraphId, since it may not reside in the same tile) and 168 bytes for the transition time at each hour of the week.

  • Create Batch Tile Generation Script to read histogram tiles over some range of times.
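
A rough sketch of packing the fixed-size per-segment speed record described above (one byte per hour of the week); this is illustrative only, not the final tile format:

import struct

HOURS_PER_WEEK = 168

def pack_segment_speeds(hourly_speeds_kph):
    """Pack 168 hourly speeds into a 168-byte record (values clamped to 0-255)."""
    assert len(hourly_speeds_kph) == HOURS_PER_WEEK
    clamped = [min(max(int(s), 0), 255) for s in hourly_speeds_kph]
    return struct.pack("%dB" % HOURS_PER_WEEK, *clamped)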

Strategies for creating historical speed tiles may vary and some configuration control should be supported, for example:

  • Aggregated over all weeks of the year over the last 2 years
  • Aggregated over a recent time period (e.g. the last 8 weeks)
  • Weighted averages over similar time periods from prior years (e.g. to show seasonal averages) along with recent week's data.

Block in transaction: possible bug

So in the "production" (and dev) hosted env, I've managed to get things into a state where datastore requests are always returning:

{"response": "{\"response\": \"current transaction is aborted, commands ignored until end of transaction block\\n\"}"}

A bit of googling suggests:

Summary:

The reason you get this error is because you have entered a transaction and one of your SQL
Queries failed, and you gobbled up that failure and ignored it. But that wasn't enough, THEN you
used that same connection, using the SAME TRANSACTION to run another query. The exception
gets thrown on the second, correctly formed query because you are using a broken transaction to
do additional work. Postgresql by default stops you from doing this.
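
If the service talks to Postgres through psycopg2 (the ProgrammingError in other logs suggests it does), the usual pattern is to roll the connection back when a statement fails instead of swallowing the error, so the next query is not run inside an aborted transaction. A minimal sketch, not necessarily the exact fix needed here:

import psycopg2

def run_report(conn, sql, params):
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    except psycopg2.Error:
        conn.rollback()  # clears the aborted transaction so the connection stays usable
        raise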

Include a "prevalence" property in public data extracts

Datastore's internal histogram tile files store counts for the number of vehicles/observations within each bin. Public data extracts turn these accurate counts into a coarser "prevalence" property that can be shared publicly.

The goal is to share a measurement that conveys the rough confidence of speed estimates on segments and the approximate relative magnitude of vehicles on different segments -- but not to share counts so accurate that they could be used by competitors to understand a data provider's business.

As a temporary placeholder, we round counts to the nearest 10:

def prevalence(val):
    return int(round(val / 10.0) * 10)

Let's consider better alternatives.

Run the datastore and the whole framework

Hi,
I've been checking your repos lately and I'm very interested in the idea. I was trying to run some tests locally on my own PC, but I couldn't find in the docs the right way to make everything work together (reporter, datastore, ui-analyst, etc.). Is there a way to run the whole project locally?

explore alternatives for lessening the bandwidth/memory/disk size requirements on clients

right now we are writing extracts in json format. they are human readable and everyone knows json, plus it plays well with the browser... that is, until a single file gets into the 1gb range...

so we need to look at alternatives. currently we are talking about two ways to tackle this.

  1. a format change from clear text to binary. binary allows us to pack data structures more efficiently but does increase the burden of the client when it comes to understanding/working with the format

  2. lessening the geographic extents of a given tile, like by a quarter... level 0 looks like level 1, level 1 looks like level 2, level 2 looks like 1/4 its current size...

modify flatbuffer orc histogram writer

we need to make modifications to #34 to support the work detailed in #38. the thinking is since we are going to wrap this in a little bit of python that calls it we need to make a few changes:

  • add queue length to the measurements in the histogram
  • only read/write to/from disk as the python will orchestrate with sqs and s3
  • be able to merge new histograms with the already written ones

this should simplify the code by removing s3 and different output options but complicate it in that it will have to parse back in flatbuffers to do the merging.

i've started work on all of these. the queue-length one is just about done and the only-use-disk one is about halfway done

the final point there about merging could be handled by keeping the input files around forever, but that would grow quickly. so the real way to do it will be to read from previously written flatbuffers and merge the data together before writing the updated files. this is the only open part of this work remaining
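
ignoring the flatbuffer encoding itself, that merge step boils down to summing bin counts. a rough sketch (the keying scheme here is just illustrative):

from collections import Counter

def merge_histograms(existing_bins, new_bins):
    """Both arguments map (segment_id, hour, speed_bucket) -> count."""
    merged = Counter(existing_bins)
    merged.update(new_bins)  # adds counts for overlapping keys
    return dict(merged)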

datastore reporting prepare error

datastore is reporting a prepare error, but everything runs ok

postgres_1 | ERROR: prepared statement "report" already exists
postgres_1 | STATEMENT: PREPARE report AS INSERT INTO segments (segment_id,prev_segment_id,mode,start_time,end_time,length,provider) VALUES ($1,$2,$3,$4,$5,$6,$7);
datastore_1 | Connected to db
datastore_1 | Created prepare statement.
datastore_1 | Exception in thread Thread-8:
datastore_1 | Traceback (most recent call last):
datastore_1 | File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
datastore_1 | self.run()
datastore_1 | File "/usr/lib/python2.7/threading.py", line 754, in run
datastore_1 | self.__target(*self.__args, **self.__kwargs)
datastore_1 | File "/datastore/datastore_service.py", line 73, in process_request_thread
datastore_1 | self.make_thread_locals()
datastore_1 | File "/datastore/datastore_service.py", line 70, in make_thread_locals
datastore_1 | raise Exception("Can't check for prepare statement: %s" % repr(e))
datastore_1 | Exception: Can't check for prepare statement: Exception('Can't create prepare statement: ProgrammingError('prepared statement "report" already exists\n',)',)
datastore_1 |
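
One possible workaround (an assumption, not a confirmed fix) would be to check PostgreSQL's pg_prepared_statements view before issuing PREPARE, so the statement is only created once per session. Sketched with a psycopg2-style cursor:

def ensure_report_statement(cursor):
    # pg_prepared_statements lists prepared statements for the current session.
    cursor.execute("SELECT 1 FROM pg_prepared_statements WHERE name = 'report'")
    if cursor.fetchone() is None:
        cursor.execute(
            "PREPARE report AS INSERT INTO segments "
            "(segment_id,prev_segment_id,mode,start_time,end_time,length,provider) "
            "VALUES ($1,$2,$3,$4,$5,$6,$7)")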

produce a GeoJSON file that shows where in the world data coverage is available

Goal: The analyst UI could use a way to show where in the world traffic stats are available for querying. (The POC UI did this by having a hard-coded drop-down select with the names of metro regions. That is no longer an option, since OTv2 is built as a single platform for global coverage.)

Either when Datastore creates histogram tile files (#30) or when it produces data extracts for public use (#36), let's consider generating a GeoJSON file that includes coarse polygons around all the tile extents. This file could be placed on S3, alongside the tile files.

Perhaps this could be similar to how @kevinkreiser has Valhalla produce GeoJSON output to show multimodal transit tile coverage.
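
A minimal sketch of what that generation step could look like, assuming a hypothetical helper that yields each populated tile's ID and bounding box:

import json

def coverage_geojson(tile_bounds):
    """tile_bounds: iterable of (tile_id, (min_lon, min_lat, max_lon, max_lat))."""
    features = []
    for tile_id, (w, s, e, n) in tile_bounds:
        features.append({
            "type": "Feature",
            "properties": {"tile_id": tile_id},
            "geometry": {
                "type": "Polygon",
                # one coarse rectangle per tile extent
                "coordinates": [[[w, s], [e, s], [e, n], [w, n], [w, s]]],
            },
        })
    return json.dumps({"type": "FeatureCollection", "features": features})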

Datastore as a service and thoughts on deployment

With the work in #34 we have a process that takes input files from the reporter and produces a tile in the given format. This is basically a run-once command. That isn't exactly perfect for use in Docker, but we have a plan around that.

The current plan is that we'll set up a Lambda that monitors SQS. The Lambda will run as a cron and wake up every 5 minutes. If there is nothing in SQS, it will enumerate everything in S3 (that the reporters reported) and turn those into jobs. Each job will be concerned with one or more reporter output files, but those files will only touch a single output file (a single geographic region and time). It will move those files into a working location under S3, push the jobs onto the queue, and exit.

Meanwhile we'll have the datastore farm of ECS containers running a hybrid Python/Java service. The Python will listen to SQS for jobs, pulling them off the queue continuously. For each job it will unpack the job message, pull down the relevant tiles from S3, merge them into one by invoking the Java program, and then upload the result back to S3 and delete the job from the queue.

Meanwhile the Lambda cron from above wakes up and sees that there are still jobs in the queue. At this point it just exits. If the queue has finished burning down, it will know that all the files in the working S3 location are done, so it will delete that. It can then move the latest files to the working location and enqueue them as new jobs.

We'll also be able to set up auto-scaling for the datastore workers, so that if the queue gets a large influx of data we can spin up more workers to handle the jobs.
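
A very rough sketch of the Lambda cron described above, using boto3; the queue URL is hypothetical and the enumeration/move/enqueue steps are elided:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/datastore-jobs"  # hypothetical

def handler(event, context):
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"])["Attributes"]
    outstanding = (int(attrs["ApproximateNumberOfMessages"]) +
                   int(attrs["ApproximateNumberOfMessagesNotVisible"]))
    if outstanding:
        return  # previous batch still burning down; just exit
    # Otherwise: delete the finished working prefix in S3, move the latest
    # reporter output into the working location, and enqueue one message per
    # output tile (details elided here).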

Datastore documentation

Items to document in README.md and other Markdown files within this repo:

  • Datastore scripts: expected inputs, controllable parameters, expected outputs
  • public data extract file formats and contents, for consumers
    • weekly speed tiles
    • reference speed tiles
    • GeoJSON data coverage maps
  • ...

Items to document in the private https://github.com/opentraffic/wiki/wiki:

  • basic overview of Datastore batch jobs and their configuration
  • how to set up a new data-provider; the authentication required for their Reporter instances to upload to Datastore S3
  • ...

Request for canned or other suitable endpoint to service ELB healthchecks

Grant Heffernan [11:57 AM] 
in semi-related news, we’re going to want either some sort of canned health check endpoint for these, or I just need something else that will work when things are healthy and break when they aren’t. Want me to just put it in an issue?

Kevin Kreiser [11:58 AM] 
issue is good. i think it has to be canned and we can do some special code to handle it

[11:58]  
to make sure the system is running etc

[11:58]  
without actually making data

Grant Heffernan [11:58 AM] 
k

Kevin Kreiser [11:58 AM] 
i mean actually

[11:58]  
a trace of a single point could be enough

[11:59]  
it will never get a non partial match

[11:59]  
and it will check redis and all of that stuff

How to access .spd or .nex data ?

This is probably not an issue but a question.
My objective is to get average speed data for a particular tile in JSON format. I can access the tile data using https://s3.amazonaws.com/osmlr/ and then, to get data for a specific tile, I can use https://s3.amazonaws.com/osmlr/v1.1/geojson/0/000/747.json
In a similar way I was trying to get the average speed data for different roads and highways in the form of a tile. I found the listing at https://s3.amazonaws.com/speedtiles-prod/, which was working fine, but among the keys there is, for example, <key>2017/31/0/001/071.spd.0.gz</key>, and when I tried https://s3.amazonaws.com/speedtiles-prod/2017/31/0/001/071.spd.0.gz it returned the following response:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>11FD8511BA7B8A52</RequestId>
<HostId>
hxj/2l5qRwpgyE04UXesOe8f69//FhrxWZnkVboLhK0SBrnPloHKJipJhMv5XDijk3x3Maw1hpk=
</HostId>
</Error>

How can I get average speed data for a specific tile?

OR IN OTHER WORDS
The URL described in the datastore docs is https://<Prefix URL>/1/037/740.spd.0.gz. What is <Prefix URL> here?
