opentraffic / datastore
OTv2: centralized ingest and aggregation of anonymous traffic data
License: GNU Lesser General Public License v3.0
From a test I did earlier today, a server that gzips tiles on-the-fly gives us a >90% savings on the transferred file size:
If you look on the right side, the top number is the actual size transferred over the wire; the bottom number is the uncompressed size. Note that these were the original JSON files with unmangled properties. I did a similar test with the mangled properties, and the resulting file sizes remained pretty close, because of the nature of the gzip algorithm.
This is significantly better for download performance (though it says nothing yet about memory or processing performance). Concretely: a user who previously waited ~3 minutes to download 600MB now waits ~20-30 seconds for 60MB. This is excellent. My goal was to get our download sizes to about 60MB per request. Even if we spent a lot of time rejiggering how we roll up data, or choosing which properties to include, we would probably, at best, get 50% of the way there. By gzipping the tiles, we get 90% savings immediately with only a tweak to the infrastructure.
Please note that this does not mean serving files that were gzipped ahead of time. The browser automatically decompresses responses transmitted over the wire with gzip content encoding; if we instead served pre-gzipped files as ordinary downloads, the browser would not decompress them, and we would need JavaScript to gunzip each file client-side, which is not optimal. Therefore, we must have the server serve files with gzip compression turned on, which is not the same as having the export process create gzipped files.
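The savings are easy to reproduce. A minimal sketch (the tile payload here is made up, but it is repetitive JSON like the real tiles, which is exactly why gzip does so well):

```python
import gzip
import json

# Hypothetical tile payload: highly repetitive JSON, like the real tiles.
tile = json.dumps(
    [{"segment_id": i, "speeds": [55] * 168} for i in range(1000)]
).encode("utf-8")

compressed = gzip.compress(tile)
savings = 1 - len(compressed) / len(tile)
print(f"raw={len(tile)}B gzip={len(compressed)}B savings={savings:.0%}")
```

On payloads with this much structural repetition the savings easily exceed 90%, matching the transfer-size numbers observed in the test.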
In summary, here are my recommendations for now:
The static data product pipeline will read the Datastore's internal histogram tile files and produce:
Requirements:
Given those overall requirements, a few options to evaluate more closely:
[*] The histogram tile files are available in both ORC and FlatBuffer formats. Based on the performance found in https://github.com/opentraffic/histogram-format-tests, it appears that ORC will perform better for producing data products that involve reading in an entire tile file (rather than subsets of many tile files). So, we're planning to use FlatBuffer files to power the ad-hoc query API and ORC files to power this pipeline for producing static data products.
A potential problem with a central Datastore shared by numerous providers is that archived data could be uploaded multiple times, leading to over-weighting of statistics from that set of archived trace data.
There does not seem to be a way to identify any single report as being uploaded previously (especially since we want to maintain privacy).
One possible protection could be an extra control at the data-provider key (or Id) level that rejects data more than X days old. For some initial period, a data provider would be permitted to upload archived statistics; after that initial period, only "recent" traces would be permitted. This might require some coordination between the Datastore maintainer and the data provider.
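A sketch of that control, with hypothetical names and thresholds (backfill_days and max_age_days are assumptions, and a per-provider onboarding timestamp is assumed to be available):

```python
from datetime import datetime, timedelta, timezone

def accept_trace(trace_end_time, provider_onboarded, now=None,
                 backfill_days=30, max_age_days=7):
    """Reject archived data once a provider's initial backfill window closes."""
    now = now or datetime.now(timezone.utc)
    if now - provider_onboarded < timedelta(days=backfill_days):
        return True  # initial period: archived statistics are permitted
    # afterwards: only "recent" traces are permitted
    return now - trace_end_time <= timedelta(days=max_age_days)
```

This does not detect a duplicate upload per se (which the privacy constraints make hard), but it bounds how far back a provider can re-submit history once the backfill window closes.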
Moved to opentraffic/api#2
Moved to opentraffic/api#1
Each Reporter instance now logs statistics on match failures (see opentraffic/reporter#34). We should aggregate this at the Datastore for debugging and evaluation purposes.
An option to consider:
We should also consider any privacy-related concerns around the statistics being aggregated.
The Open Traffic platform will regularly produce data extracts for public use. These will contain historical traffic speeds and be available for download by the public from S3.
individual privacy concerns: Speeds will only be reported for OSMLR segments for which the number of original observations in the Reporter passed privacy thresholds -- that is, it will not be possible to recreate individuals' journeys using one of these public data products.
data-provider competitive concerns: Absolute counts of vehicles/observations (which are stored in Datastore's internal histogram tile files) will not be included in publicly accessible data extracts.
Overall tasks:
Contents:
Format Requirements:
Logging output should be structured and error conditions documented. In production, logs will be streamed to Cloudwatch Logs, where filters will pick up the structured data and turn it into Cloudwatch metrics, which can then be used to trigger alarms.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
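For example, a minimal structured-logging helper (the field names here are illustrative, not a decided schema) that emits one JSON object per line, which a Cloudwatch Logs metric filter such as { $.event = "match_failure" } could then count:

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")

def log_event(event, **fields):
    """Emit one JSON log line so metric filters can match on its fields."""
    line = json.dumps({"event": event, **fields})
    logging.info(line)
    return line
```

One JSON object per line is the shape Cloudwatch's JSON filter patterns expect.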
Need a process to periodically query histogram tiles to create "reference" speed information.
Reference speeds will present the speed at various percentiles within a sorted list of speed observations. Having several percentile values will let users see how variable speeds are across all hours of the week. An analysis application can then choose one of the percentile readings as a reference speed when comparing different hourly speed values.
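A sketch of the idea using a simple nearest-rank percentile (the particular percentile choices here are assumptions, not decided values):

```python
def reference_speeds(observations, percentiles=(20, 50, 80)):
    """Nearest-rank percentiles over all speed observations for a segment."""
    s = sorted(observations)
    out = {}
    for p in percentiles:
        # nearest-rank index into the sorted observations
        idx = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
        out[p] = s[idx]
    return out
```

An application comparing hourly speeds against, say, the 80th-percentile value can judge both typical congestion and how variable the segment is.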
Need a batch process to convert histogram tiles to historical Valhalla routing tiles.
Define tile format:
Per segment speed (indexed by segment Id within a tile) - perhaps 168 bytes (one byte for speed in kph for each hour of the week).
Segment-to-segment transition times - these cannot be indexed as easily, since each traffic segment may have a variable number of transitions. Each segment Id (single Id within the tile) may have several fixed-size records, each including a "to_segment_id" (full Valhalla GraphId, since it may not reside in the tile) and 168 bytes for the transition time at each hour of the week.
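As a sketch, one such fixed-size transition record could be packed like this (the exact field layout is an assumption for illustration, not the defined tile format):

```python
import struct

# One transition record: 8-byte "to" GraphId + 168 one-byte hourly times.
TRANSITION_RECORD = struct.Struct("<Q168B")  # 176 bytes total, no padding

def pack_transition(to_segment_id, hourly_times):
    """Pack a full GraphId plus one transition time per hour of the week."""
    assert len(hourly_times) == 168  # 24 hours x 7 days
    return TRANSITION_RECORD.pack(to_segment_id, *hourly_times)
```

Because every record is the same size, a per-segment count is enough to walk a segment's variable-length list of transitions.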
Create Batch Tile Generation Script to read histogram tiles over some range of times.
Strategies for creating historical speed tiles may vary and some configuration control should be supported, for example:
So in the "production" (and dev) hosted env, I've managed to get things into a state where datastore requests are always returning:
{"response": "{\"response\": \"current transaction is aborted, commands ignored until end of transaction block\\n\"}"}
A bit of googling suggests:
Summary: The reason you get this error is because you have entered a transaction and one of your SQL queries failed, and you gobbled up that failure and ignored it. But that wasn't enough: THEN you used that same connection, using the SAME TRANSACTION, to run another query. The exception gets thrown on the second, correctly formed query because you are using a broken transaction to do additional work. Postgresql by default stops you from doing this.
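The usual fix is to roll the transaction back as soon as a statement fails, rather than swallowing the error and reusing the connection. A sketch against any DB-API-style connection (e.g. psycopg2):

```python
def run_query(conn, sql, params=()):
    """Run one statement; roll back on failure so the connection stays usable."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        conn.commit()
    except Exception:
        # Without this rollback, every later statement on this connection
        # fails with "current transaction is aborted, commands ignored ...".
        conn.rollback()
        raise
    finally:
        cur.close()
```

Re-raising (instead of gobbling the error) also surfaces the original failure rather than the confusing downstream one.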
Datastore's internal histogram tile files store counts for the number of vehicles/observations within each bin. Public data extracts turn these accurate counts into a coarser "prevalence" property that can be shared publicly.
Goal is to share a measurement that can be used to tell the rough confidence of speed estimates on segments and to convey the approximate relative magnitude of vehicles on different segments -- but not to share counts that are so accurate that they can be used by competitors to understand a data-provider's business.
As a temporary placeholder, we round counts to the nearest 10 (see datastore/scripts/make_speeds.py, lines 189 to 190 at 308c8b4).
Let's consider better alternatives.
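The current placeholder, plus one possible coarser alternative, sketched (the power-of-two bucket scheme is only an idea to evaluate, not a decision):

```python
import math

def round_to_tens(count):
    """Current placeholder: round counts to the nearest 10."""
    return int(round(count / 10.0) * 10)

def log_bucket(count):
    """Alternative: power-of-two buckets keep only the order of magnitude,
    hiding exact counts while preserving relative prevalence."""
    return 0 if count <= 0 else 2 ** int(math.log2(count))
```

Nearest-10 rounding still leaks a lot at small counts (a bucket of 10 vs 20 is a 2x difference); logarithmic buckets degrade gracefully as counts grow.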
Hi,
I was checking your repos lately and I'm very interested in the idea. I tried to run some tests locally on my own PC, but I couldn't find in the docs the right way to make everything work together (reporter, datastore, ui-analyst, etc.). Is there a way to run the whole project locally?
right now we are writing extracts in json format. they are human readable and everyone knows json, plus it plays well with the browser... that is, until a single file gets into the 1gb range...
so we need to look at alternatives. currently we are talking about two ways to tackle this:
a format change from clear text to binary. binary allows us to pack data structures more efficiently, but it does increase the burden on the client when it comes to understanding/working with the format
lessening the geographic extent of a given tile, e.g. by a quarter: level 0 tiles become the size of current level 1 tiles, level 1 the size of level 2, and level 2 a quarter of its current size...
we need to make modifications to #34 to support the work detailed in #38. since we are going to wrap this in a little bit of python that calls it, we need to make a few changes:
this should simplify the code by removing s3 and the different output options, but complicate it in that it will have to parse the flatbuffers back in to do the merging.
i've started work on all of these; the queue-length one is just about done and the only-use-disk one is about halfway done
the final point about merging could be done by keeping the input files around forever, but that will grow quickly. so the real way to do it is to read the previously written flatbuffers back in and merge the new data before writing the updated files. this is the only open part of this work remaining
datastore is reporting a prepare error, but everything runs ok
postgres_1 | ERROR: prepared statement "report" already exists
postgres_1 | STATEMENT: PREPARE report AS INSERT INTO segments (segment_id,prev_segment_id,mode,start_time,end_time,length,provider) VALUES ($1,$2,$3,$4,$5,$6,$7);
datastore_1 | Connected to db
datastore_1 | Created prepare statement.
datastore_1 | Exception in thread Thread-8:
datastore_1 | Traceback (most recent call last):
datastore_1 | File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
datastore_1 | self.run()
datastore_1 | File "/usr/lib/python2.7/threading.py", line 754, in run
datastore_1 | self.__target(*self.__args, **self.__kwargs)
datastore_1 | File "/datastore/datastore_service.py", line 73, in process_request_thread
datastore_1 | self.make_thread_locals()
datastore_1 | File "/datastore/datastore_service.py", line 70, in make_thread_locals
datastore_1 | raise Exception("Can't check for prepare statement: %s" % repr(e))
datastore_1 | Exception: Can't check for prepare statement: Exception('Can't create prepare statement: ProgrammingError('prepared statement "report" already exists\n',)',)
datastore_1 |
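One likely fix, sketched: consult Postgres's pg_prepared_statements view before issuing PREPARE, so a reused pooled connection doesn't trip over an already-existing statement. The PREPARE SQL is taken from the log above; the helper name and stub are made up:

```python
PREPARE_SQL = (
    "PREPARE report AS INSERT INTO segments "
    "(segment_id,prev_segment_id,mode,start_time,end_time,length,provider) "
    "VALUES ($1,$2,$3,$4,$5,$6,$7)"
)

def ensure_prepared(cursor):
    """Only PREPARE if this session doesn't already have the statement."""
    cursor.execute(
        "SELECT 1 FROM pg_prepared_statements WHERE name = %s", ("report",)
    )
    if cursor.fetchone() is None:
        cursor.execute(PREPARE_SQL)
```

Since prepared statements are per-session, this check makes the setup idempotent across threads that share a connection.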
Goal: The analyst UI could use a way to show where in the world traffic stats are available for querying. (The POC UI did this by having a hard-coded drop-down select with the names of metro regions. That is no longer an option, since OTv2 is built as a single platform for global coverage.)
Either when Datastore creates histogram tile files (#30) or when it produces data extracts for public use (#36), let's consider generating a GeoJSON file that includes coarse polygons around all the tile extents. This file could be placed on S3, alongside the tile files.
Perhaps this could be similar to how @kevinkreiser has Valhalla produce GeoJSON output to show multimodal transit tile coverage.
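A sketch of what that generation could look like, assuming each tile's extent can be reduced to a lon/lat bounding box (the tile-id-to-bbox mapping is omitted here and would follow the real tiling scheme):

```python
import json

def bbox_feature(min_lon, min_lat, size_deg):
    """One coarse polygon (a square) covering a tile's extent."""
    ring = [
        [min_lon, min_lat],
        [min_lon + size_deg, min_lat],
        [min_lon + size_deg, min_lat + size_deg],
        [min_lon, min_lat + size_deg],
        [min_lon, min_lat],  # close the ring
    ]
    return {"type": "Feature", "properties": {},
            "geometry": {"type": "Polygon", "coordinates": [ring]}}

def coverage_geojson(tile_bboxes):
    """FeatureCollection of tile extents, suitable for dropping on S3."""
    return json.dumps({
        "type": "FeatureCollection",
        "features": [bbox_feature(*t) for t in tile_bboxes],
    })
```

The analyst UI could fetch this one small file to draw global coverage, instead of a hard-coded region list.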
With the work in #34 we have a process that takes input files from the reporter and produces a tile in the given format. This is basically a run-once command. That isn't exactly perfect for use in docker, but we have a plan around that.
The current plan is we'll setup a lambda that monitors sqs. The lambda will run as a cron and wake up every 5 minutes. If there is nothing in sqs, it will enumerate everything in s3 (that the reporters reported), and turn those into jobs. Each job will be concerned with 1 or more reporter output files but those files will only touch a single output file (single geographic region and time). It will move those files into a working location under s3, push those jobs into the queue and exit.
Meanwhile we'll have the datastore farm of ecs containers running a hybrid python/java service. The python will listen to sqs for jobs, pulling them off the queue continuously. For each job it will unpack the job message, pull down the relevant tiles from s3, merge them into one by invoking the java program, then upload the result back to s3 and delete the job from the queue.
When the lambda cron from above wakes up and sees that there are still jobs in the queue, it just exits. If the queue has finished burning down, it knows that all the stuff in the working s3 location is done, so it deletes that. It can then move the latest files into the working location and enqueue them as new jobs.
We'll also be able to set up auto-scaling for the datastore workers, so that if the queue gets a large influx of data we can spin up more workers to handle the jobs.
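The worker side of that plan, sketched with stubbed queue/storage objects (production code would use boto3 SQS and S3 clients and poll forever rather than stopping):

```python
def drain_queue(queue, storage, merge_tiles):
    """Process jobs until the queue is empty (the real service keeps polling)."""
    while True:
        job = queue.receive()
        if job is None:
            break  # prod would sleep and poll again instead of stopping
        inputs = [storage.get(key) for key in job["input_keys"]]
        merged = merge_tiles(inputs)  # prod invokes the java merge program
        storage.put(job["output_key"], merged)
        queue.delete(job)  # delete only after a successful upload
```

Deleting the message only after the upload succeeds means a crashed worker's job reappears after the SQS visibility timeout, which is what makes the auto-scaled farm safe.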
Items to document in README.md and other Markdown files within this repo:
Items to document in the private https://github.com/opentraffic/wiki/wiki:
Grant Heffernan [11:57 AM]
in semi-related news, we're going to want some sort of canned health check endpoint for these, or I just need something else that will work when things are healthy and break when they aren't. Want me to just put it in an issue?
Kevin Kreiser [11:58 AM]
issue is good. i think it has to be canned and we can do some special code to handle it
[11:58]
to make sure the system is running etc
[11:58]
without actually making data
Grant Heffernan [11:58 AM]
k
Kevin Kreiser [11:58 AM]
i mean actually
[11:58]
a trace of a single point could be enough
[11:59]
it will never get a non partial match
[11:59]
and it will check redis and all of that stuff
This is probably not an issue but a question.
My objective is to get average speed data for a particular tile in a json format. I can access the tiles data using https://s3.amazonaws.com/osmlr/
and then if I have to get data for a specific tile I can use https://s3.amazonaws.com/osmlr/v1.1/geojson/0/000/747.json
In a similar way I was trying to get the average speed data for different roads and highways in the form of a tile. I found this link https://s3.amazonaws.com/speedtiles-prod/
which was working fine but in the keys for example there is <key>2017/31/0/001/071.spd.0.gz</key>
so when I tried this https://s3.amazonaws.com/speedtiles-prod/2017/31/0/001/071.spd.0.gz
it returns the following response
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>11FD8511BA7B8A52</RequestId>
<HostId>
hxj/2l5qRwpgyE04UXesOe8f69//FhrxWZnkVboLhK0SBrnPloHKJipJhMv5XDijk3x3Maw1hpk=
</HostId>
</Error>
How can I get average speed data for a specific tile? Or, in other words: the datastore docs describe the URL https://<Prefix URL>/1/037/740.spd.0.gz. What is <Prefix URL> here?