
ichnaea's Introduction


Ichnaea

Ichnaea is an application that provides geolocation coordinates from other sources of data (Bluetooth, cell or WiFi networks, GeoIP, etc.).

For more information look at the full docs.

Please use our github tracker to report issues.

License

ichnaea is offered under the Apache License 2.0.

ichnaea's People

Contributors

alexcottner, almet, ashernor, boostrack, bsieber-mozilla, ckolos, crankycoder, dantescode, dependabot-preview[bot], dependabot[bot], djmitche, graydon, hannosch, jaredlockhart, jwhitlock, lonnen, morrisjobke, pdehaan, pyup-bot, rajpratik71, rajreet, requires, rfk, rtilder, szjozsef, tarekziade, therewillbecode, tyagi-data-wizard, uvinduperera, willkg


ichnaea's Issues

Mandate sufficiently different bssid's in "you need to know two AP's" restriction

Many modern wifi AP's have multiple network interfaces, for example one for the 2.4GHz and one for the 5GHz band or a "fake" address for a guest network. These typically have very similar mac addresses, most often one being just "plus one" the other, a real example being:

00:1a:1e:12:70:00
00:1a:1e:12:70:01
00:1a:1e:12:70:02

Since we want to disallow "single device" tracking, knowing two very similar addresses shouldn't be enough to get a location lookup.
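One way such a check could look is sketched below. It compares the numeric values of two BSSIDs and rejects pairs that are too close together. The function names and the `min_distance` threshold are hypothetical and not part of ichnaea; a real rule would need tuning against observed hardware.

```python
def bssid_to_int(bssid: str) -> int:
    """Convert a colon-delimited BSSID such as 00:1a:1e:12:70:00 to an integer."""
    return int(bssid.replace(":", ""), 16)


def sufficiently_different(a: str, b: str, min_distance: int = 8) -> bool:
    """Return True only if the two BSSIDs differ numerically by at least
    min_distance (a hypothetical threshold, chosen here for illustration)."""
    return abs(bssid_to_int(a) - bssid_to_int(b)) >= min_distance
```

With the example addresses above, `00:1a:1e:12:70:00` and `00:1a:1e:12:70:02` differ by only 2, so they would count as a single device under this sketch.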

Create load tests

We had a couple of bugs which only presented themselves under concurrent load (mainly bugs due to concurrent access to the same SQL rows).

We should make sure to have tests for those, as well as general load tests for trying out new configuration / deployment options and identifying bottlenecks.

Production logging / exception handling

We need to capture and ship logging output from the app; probably via heka-py.

In addition we should capture any Python tracebacks and ship them via "raven" or some other new heka equivalent.

Data validation and sanitizing tips

Got this useful feedback from an email a while back:

About the cell id: GSM uses a value of up to 5 digits (0-65535) as the cell id. Android may sometimes report 65534 as a placeholder for an invalid cell id. (I have seen this before, but I haven't confirmed that this is the case.)

UMTS cell ID that Android captures is at the mercy of the radio chip that comes with the phone.
You usually want a 9 digit value, which in UMTS represents a UC-ID (universal cell id).

A universal cell id is a 32 bit value made up of a RNC and a cell id concatenated together. (similar to the one in GSM). I think the RNC is like the first two bytes, and the cell ID is the last two.

Some phone radios misinterpret this and return a 5 digit cell ID for UMTS. I have seen this before, because manufacturers didn't realize the spec had changed.
Google knew about this, that's why they said if you send them a 5 digit cell id, it may not be accurate.
https://developers.google.com/maps/documentation/business/geolocation/#cell_tower_object
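As a rough illustration of the layout described in the email, the sketch below splits a UC-ID into an RNC part and a cell id part. It assumes the 16/16 bit split that the email guesses at ("I think"), which has not been checked against the spec. The helper names are hypothetical.

```python
from typing import Tuple


def split_ucid(ucid: int) -> Tuple[int, int]:
    """Split a UC-ID into (rnc, cid), ASSUMING the upper 16 bits hold the
    RNC and the lower 16 bits hold the cell id, as guessed in the email."""
    rnc = ucid >> 16
    cid = ucid & 0xFFFF
    return rnc, cid


def looks_like_short_cid(value: int) -> bool:
    """True if the value fits in 16 bits, i.e. it could be a bare 5 digit
    cell id rather than a full UC-ID."""
    return 0 <= value <= 0xFFFF
```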


About the open cell id data, http://dump.opencellid.org/cells.txt.gz, it's a good set to start with. However, there are known issues with parts of the data, mainly ones that were found by others.
http://code.google.com/p/opencellid/issues/detail?id=14

Most notably, it has to do with issues such as encoding mobile network codes in hex.
For the U.S., AT&T networks are encoded as 310,1040 instead of 310,410. The 1040 is the decimal value of 410 read as a hexadecimal number (0x410 = 1040).
In Canada, a lot of the details are documented here.
http://code.google.com/p/opencellid/issues/detail?id=14
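The hex mix-up can be undone mechanically, as in the sketch below. The function name is hypothetical, and the code cannot tell on its own whether a given value was mis-encoded in the first place; it only shows the conversion for a value already known to be affected.

```python
from typing import Optional


def decode_hex_mnc(mnc: int) -> Optional[int]:
    """Undo the hex mix-up: write the stored value in hexadecimal and, if
    every digit is a decimal digit, read those digits as the real MNC.
    Returns None when the hex form contains a-f and cannot be an MNC."""
    hex_digits = format(mnc, "x")
    if not hex_digits.isdigit():
        return None
    return int(hex_digits)
```

For the AT&T case above, `decode_hex_mnc(1040)` gives back 410.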

better storage for coordinates

For the db we can store the coordinates as long integers (value * 10000) instead of REAL, so we don't introduce floating point errors.

In the view we're using decimals quantized to 1.00000, so we're sending the right value back to the JSON clients.
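A minimal sketch of this round trip follows, using the factor of 10000 and the 1.00000 quantization mentioned in this issue. The function names are hypothetical. Note that a factor of 10000 keeps four decimal places, so the fifth place in the quantized output is always zero.

```python
from decimal import Decimal

SCALE = 10000  # factor taken from the issue text


def to_db(value) -> int:
    """Convert a coordinate to a scaled integer for storage.
    Going through str() avoids inheriting binary float noise."""
    return int((Decimal(str(value)) * SCALE).to_integral_value())


def from_db(stored: int) -> Decimal:
    """Convert a stored integer back to a Decimal with five decimal places."""
    return (Decimal(stored) / SCALE).quantize(Decimal("1.00000"))
```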

Implement basic nickname support

After we got a "token" to identify users, we want a publicly visible nickname.

Suggested approach: Add two optional headers on the submit API X-Token and X-Nickname (case-independent).

Then use the nickname on the leaderboard at /static/stats.html

Add a "locate me" option to the website

We could add a way to the website to locate yourself on a map. The input would be the same as the underlying geolocate/search APIs and result in a marker/radius shown on a map.

A first step would be to have a text field with a JSON dump as input. Later that could be extended to have some actual widgets.

Don't associate measurements with users

Currently incoming measurements are directly associated with tokens. We shouldn't do that and only store aggregated metrics about tokens (aka. users) to minimize the personal tracking potential.

In order to do this, we need to define what metrics or scores we want per user. Maybe:

  1. Number of measurements
  2. Number of AP's
  3. Number of Wifis
  4. Number of new/undiscovered AP's
  5. Number of new/undiscovered Wifis

At a later point we might want to extend this, with scores like "covered new area" or others.

Disable /map

It is really scary, you can find my house and university on it, because nobody else in my area uses the stumbler. I really like the view for myself, but with this amount of users it's rather scary.

Switch to a proper queue for celery

We are currently abusing MySQL as the queue for celery tasks. It's not meant for production and we ran into a couple of problems. It seems celery can only tolerate a certain number of open tasks in the queue, or else the workers and app processes get stuck on a long running SQL query.

We are investigating using either Redis / Amazon ElastiCache, RabbitMQ or Amazon SQS. Dean is looking into the ops side, @hannosch is looking into the required config changes.

log exceptions in redis workers

For some reason, exceptions occurring in the retools worker are ignored and the job is marked as done.

We need to make sure that this raises an error and that the job is kept.

Move location info out of the POST to /v1/location

The API currently looks like:

POST /v1/location/{lat}/{lon}

From a privacy point of view it is better to move these parameters to the POST data so that they will not accidentally end up in log files that we or a third party owns.

It will also make it easier to add extra parameters, such as altitude, without having to change the API endpoint signature.

Expose stats via heka-py

We should expose some "production stats" via the heka-py / statsd output.

For example:

  • Number of HTTP requests
  • Request timings
  • HTTP requests by status code

Add "tiling" to the map

Loading the map for the entire world has gotten too slow, thanks to too much data.

Some approach to "tiling" is needed. Maybe at first it's enough to distinguish "America" vs. "the rest" and load either one or the other, via some links or layer controls in the map.

Clean up kombu_message table

celery doesn't do automatic cleanup of the kombu_message table for the sqlalchemy backend. At this point the table contains about 50k entries and is steadily growing. Once a job is processed and finished, there's no need to keep track of the job message. celery already cleans up the celery_taskmeta table with a daily scheduled task, so we should do something similar here.
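A cleanup task could be as small as the sketch below. It assumes that processed messages are flagged with `visible = 0` in kombu_message; that column name and its meaning are assumptions that should be checked against the actual schema before running anything like this. The example uses the standard library `sqlite3` module purely for illustration, and the function name is hypothetical.

```python
import sqlite3


def purge_processed_messages(conn: sqlite3.Connection) -> int:
    """Delete rows that are ASSUMED to be processed (visible = 0) and
    return how many rows were removed."""
    cur = conn.execute("DELETE FROM kombu_message WHERE visible = 0")
    conn.commit()
    return cur.rowcount
```

Such a function could then be run from a daily scheduled task, similar to the existing celery_taskmeta cleanup.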

Create and define a DB schema upgrade process

Currently we are lacking any DB schema upgrade / evolution process. We need to define a way to do automatic code / DB schema changes. And some way to schedule or run one-off tasks / scripts on the data itself.

Guess missing lac/cid for neighboring cells

On Android we often get "incomplete" cell records for neighboring cells. They usually only have the mcc/mnc and psc fields but lack the lac/cid. While psc isn't unique at all (it's only 512 different values worldwide), it is unique in a certain area. Or rather neighboring cells cannot have the same psc values. So based on the lat/lon and mcc/mnc/psc we should be able to identify the lac/cid -> if we got at least one full record for that cell.
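The lookup could be sketched roughly as follows: among known full records with the same mcc/mnc/psc, pick the one closest to the measurement, and give up if none is nearby. The record layout (plain dicts with `lat`/`lon`), the function names and the 10 km cut-off are all assumptions made for this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def guess_lac_cid(partial, known_cells, max_km=10.0):
    """Return (lac, cid) of the nearest known cell with the same
    mcc/mnc/psc, or None if no candidate lies within max_km."""
    best = None
    best_dist = max_km
    for cell in known_cells:
        if (cell["mcc"], cell["mnc"], cell["psc"]) != (
            partial["mcc"], partial["mnc"], partial["psc"]
        ):
            continue
        dist = haversine_km(partial["lat"], partial["lon"], cell["lat"], cell["lon"])
        if dist <= best_dist:
            best, best_dist = cell, dist
    return (best["lac"], best["cid"]) if best else None
```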

New API to get per user scores / rank?

The client side application might want to show scores inside the app. So we might want an API to expose scores on a per-user basis.

Maybe things like:

  1. Total score / rank
  2. Filter by date (day, week, month, quarter, year) -> for example "you are second best today, you need only X points to get to the top spot"

... a lot more unknowns, keep it simple :)

Document canonical format of BSSID and SSID used for hashing

Clients will need to know how to canonicalize the BSSIDs in their requests. The comments in views.py say:

    The `key` is a SHA1 hash of the concatenated BSSID and SSID of the wifi
    network. So for example for a bssid of `01:23:45:67:89:ab` and a
    ssid of `network name`, the result should be:
    `3680873e9b83738eb72946d19e971e023e51fd01`.

The documentation should clarify whether the canonical format of the BSSID uses upper- or lowercase hexadecimal digits and whether the : colon delimiters are included. I've seen Wi-Fi scanning apps that display BSSIDs using uppercase, lowercase, colon delimiters, - dash delimiters, or no delimiters.

Also, the comment's example code encodes the SSID as UTF-8: ssid = 'network name'.encode('utf-8'). Technically, SSIDs have no character encoding and are defined as simply 0-32 octets, which could contain random data or NUL characters. We should at least document whatever character encoding is already used by Firefox.
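To make the question concrete, here is one possible canonicalization, sketched under the assumption that the canonical form is lowercase hex with colon delimiters and that the SSID is passed as raw bytes. Neither choice is confirmed by the project, the function names are hypothetical, and this sketch is not claimed to reproduce the example hash from views.py.

```python
import hashlib
import re


def canonical_bssid(bssid: str) -> str:
    """Normalize a BSSID to lowercase, colon-delimited form
    (an ASSUMED canonical format)."""
    digits = re.sub(r"[^0-9a-fA-F]", "", bssid).lower()
    if len(digits) != 12:
        raise ValueError("expected 12 hex digits, got %r" % bssid)
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def wifi_key(bssid: str, ssid: bytes) -> str:
    """SHA1 hex digest of the canonical BSSID followed by the raw SSID bytes."""
    return hashlib.sha1(canonical_bssid(bssid).encode("ascii") + ssid).hexdigest()
```

Taking the SSID as bytes sidesteps the encoding question: a client that has a text SSID would encode it first, for example `wifi_key("01-23-45-67-89-AB", "network name".encode("utf-8"))`.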

Comply with French privacy laws

To comply with French rules, we would have to:

  • Declare the database to the CNIL (https://en.wikipedia.org/wiki/CNIL)
  • Provide a page, in French, explaining this data collection and how to ask for removal of such data
  • Make sure the collected data does not remain stored for more than 5 years, as CNIL requires.

http://www.cnil.fr/linstitution/actualite/article/article/geolocalisation-et-collecte-dinformations-issues-des-points-dacces-wi-fi-les-regles-a-respec/

Also, is github the right place to have these discussions or do we have a bugzilla space for this project as well?

Simplify token/nickname handling

Currently the token is the primary user identifier and a nickname is associated with each token. We want to simplify this and use the nickname as the primary identifier.

This probably means getting rid of the user table and storing nickname directly in the score table.

Fix deployment docs

The deploy docs don't yet mention the need for a MySQL database.

They also refer to "download the database and server" - but there is no initial database for download.

Create simulation framework to measure accuracy of geolocation algorithms

Because we have a log of all the submitted reports (with their APs and GPS locations), we can compare the accuracy of different server algorithms by simulating geolocation requests for each report and comparing the server result with the report's GPS location. A statistically-minded individual can then compare the error measurements to tune our algorithms.
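The core loop of such a framework might look like the sketch below. The report layout (dicts with `lat`, `lon` and a `query`) and the `locate` callable standing in for a server algorithm are assumptions made for this example, not existing ichnaea interfaces.

```python
import math
from statistics import mean, median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def evaluate(reports, locate):
    """Replay each report's query through `locate` (a callable returning
    (lat, lon) or None) and summarize the distance to the GPS position."""
    errors = []
    misses = 0
    for report in reports:
        result = locate(report["query"])
        if result is None:
            misses += 1
            continue
        errors.append(haversine_km(report["lat"], report["lon"], result[0], result[1]))
    return {
        "count": len(errors),
        "misses": misses,
        "mean_km": mean(errors) if errors else None,
        "median_km": median(errors) if errors else None,
    }
```

Running `evaluate` once per candidate algorithm over the same reports gives directly comparable error summaries.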

Document nickname header

The stumbler app sends the nickname via a header.

The server / API docs don't mention any of this - they should.

Filter submissions by GeoIP

Use GeoIP on the server to ignore measurements submitted from a different region. This would break the 0.000001% case where a user scans access points in one country then flies to another country to upload the data.

Create a blackbox testing plan

I wrote some very crude instructions on how to do blackbox / integration testing. We should formalize that a bit and put it into a more permanent place:

    Repeat a couple times (at least twice, but 10 times or more also works):

    $ curl -i -k -XPOST -H "Content-Type: application/json" -H "X-Nickname: foo" http://localhost/v1/submit -d '{"items": [{"lat": 1.23, "lon": 3.45, "altitude": 10, "accuracy": 10, "radio": "gsm", "cell": [{"mcc": 1, "mnc": 1, "lac": 2, "cid": 3, "signal": -70}]}]}'
    HTTP/1.1 204 No Content
    Server: gunicorn/0.14.6
    Date: Mon, 25 Nov 2013 18:51:04 GMT
    Connection: close


    Right away there's no result found yet:

    $ curl -i -k -XPOST -H "Content-Type: application/json" http://localhost/v1/search -d '{"radio": "gsm", "cell": [{"mcc": 1, "mnc": 1, "lac": 2, "cid": 3}]}'
    HTTP/1.1 200 OK
    Server: gunicorn/0.14.6
    Date: Mon, 25 Nov 2013 18:53:06 GMT
    Connection: close
    Content-Type: application/json; charset=UTF-8
    Content-Length: 23

    {"status": "not_found"}


    After the async task has run (~6 every minutes), there should be a result:

    $ curl -i -k -XPOST -H "Content-Type: application/json" http://localhost/v1/search -d '{"radio": "gsm", "cell": [{"mcc": 1, "mnc": 1, "lac": 2, "cid": 3}]}'
    HTTP/1.1 200 OK
    Server: gunicorn/0.14.6
    Date: Mon, 25 Nov 2013 18:54:08 GMT
    Connection: close
    Content-Type: application/json; charset=UTF-8
    Content-Length: 71

    {"status": "ok", "lat": 1.2300000, "lon": 3.4500000, "accuracy": 35000} 

Moar leaderboards

Some ideas:

  • Add daily/weekly/monthly leaderboards
  • Add different boards for "locations", "wifi", "cell", "new unique cell/wifi"

Basically things to allow different people to reach the top 1 spot.

Add FAQ / help text about batch uploading existing data

Some people have existing wardriving data - we should have a FAQ to describe how to contribute that data to us and under what terms we can accept it (data was gathered based on GPS, and not from a service which limits the data to personal use).

Provide real user registration / authentication

At first we cheat and use no real authentication, but rely on a random uuid as the secret token / key.

But really we should expand this to get real auth. For example web-based persona login, server-side "secret key" generation (and re-generation) and a proper http-auth scheme like macauth to sign requests.

Other ideas/options welcome :)

Guess radio type for old data

Up until just now, we didn't capture the radio field in each cell record, but only the measure-level radio type. And we defaulted to 0, the same value as a valid 'gsm' record. This led to all records having radio=0.

I've changed the code to use -1 for "unknown" / "none" (like tablets/laptops) and capture the cell.radio field. New data is coming in and writing radio=2 records (umts).

Maybe there's a way to fix/add the radio field for old records based on matching mcc/mnc/lac/cid records with corrected radio fields.
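A rough sketch of that matching step is shown below. It uses the radio codes mentioned in this issue (0 for gsm, 2 for umts, -1 for unknown) and only overwrites an old record when the newer data agrees on a single radio type for the same mcc/mnc/lac/cid. The dict-based record layout and function names are assumptions for illustration.

```python
UNKNOWN = -1  # code for "unknown" / "none" from the issue text


def cell_key(rec):
    """Identify a cell by its mcc/mnc/lac/cid tuple."""
    return (rec["mcc"], rec["mnc"], rec["lac"], rec["cid"])


def backfill_radio(old_records, corrected_records):
    """Return copies of old_records with `radio` replaced wherever the
    corrected records show exactly one known radio type for that cell."""
    seen = {}
    for rec in corrected_records:
        if rec["radio"] != UNKNOWN:
            seen.setdefault(cell_key(rec), set()).add(rec["radio"])
    result = []
    for rec in old_records:
        fixed = dict(rec)
        radios = seen.get(cell_key(rec), set())
        if len(radios) == 1:
            fixed["radio"] = next(iter(radios))
        result.append(fixed)
    return result
```

Old records without an unambiguous match are left unchanged in this sketch, so they would still need separate handling.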

Add a way to upload stumbler logs

Many people might have existing stumbler logs from earlier wardriving efforts, which they'd be willing to share with us. It would be nice to have a way to accept these.

One option might be an email address like [email protected] + a gpg key and manual processing. Or an upload form as part of the website.

Offer data dumps

Hi, since most similar location projects offer their data under an open license, Mozilla should also release all the data (raw and processed) to allow everybody to use/analyse/visualize it.

Add one or two more zoom levels to map

When you switched the map from the heat map back to the numbered map markers, I think you restricted the number of zoom levels. It would be helpful if users could see more details about which neighborhoods have few measurements. Then users would know where they need to stumble more. Currently, all of San Francisco is represented by a single "136" marker!

Also, markers with just one measurement are shown as blue pin markers instead of numbered circles. The blue pins are a little confusing because they are different. Can we just show a circle with the number "1"?

Move map data generation to a scheduled async task

Currently the map.csv data is generated live every time. The mysql slow log shows the query time often going up to 10 seconds. That's too damn long :)

So we need to move the data gathering part to some async task and update it at certain intervals. Maybe daily is enough, though it would be nice to get it more often.

Date seems to be off

location-services-2013-11-03

It could be a localization issue, but at around 1383677700 (2013-11-05 13:55 EST) the attached screenshot was taken, indicating data from 2013-11-03.
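The timestamp in the report can be checked with a few lines of Python. The sketch below converts it to UTC and to a fixed UTC-5 offset standing in for EST; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no DST handling


def describe(ts: int):
    """Return the timestamp as (UTC, EST) datetimes."""
    utc = datetime.fromtimestamp(ts, tz=timezone.utc)
    return utc, utc.astimezone(EST)


utc, est = describe(1383677700)
print(utc.isoformat())                 # 2013-11-05T18:55:00+00:00
print(est.strftime("%Y-%m-%d %H:%M"))  # 2013-11-05 13:55
```

Since the moment falls at 18:55 UTC on 2013-11-05, no time zone offset alone would turn it into 2013-11-03, which suggests the displayed date comes from somewhere other than a simple time zone conversion.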
