mozilla / activedata Goto Github PK

Provide high speed filtering and aggregation over data

License: Mozilla Public License 2.0

CSS 6.09% HTML 3.32% JavaScript 78.24% Python 9.22% Shell 0.42% Batchfile 0.08% Dockerfile 0.03% Less 1.30% SCSS 1.32%

activedata's Introduction

ActiveData

Provide high speed filtering and aggregation over data see ActiveData Wiki Page for project details

Branch	Status	Coverage
master
dev
v1.7

Use it now!

ActiveData is a service! You can certainly setup your own service, but it is easier to use Mozilla's!

curl -XPOST -d "{\"from\":\"unittest\"}" http://activedata.allizom.org/query

Requirements

Python2.7 installed
Elasticsearch version 6.x

Elasticsearch Configuration

Elasticsearch has a configuration file at config/elasticsearch.yml. You must modify it to handle a high number of scripts

script.painless.regex.enabled: true
script.max_compilations_rate: 10000/1m

We enable compression for faster transfer speeds

http.compression: true

And it is a good idea to give your cluster a unique name so it does not join others on your local network

cluster.name: lahnakoski_dev

then you can run Elasticsearch:

c:\elasticsearch>bin\elasticsearch

Elasticsearch runs off port 9200. Test it is working

curl http://localhost:9200

you should expect something like

{
  "status" : 200,
  "name" : "dev",
  "cluster_name" : "lahnakoski_dev",
  "version" : {
    "number" : "1.7.5",
    "build_hash" : "00f95f4ffca6de89d68b7ccaf80d148f1f70e4d4",
    "build_timestamp" : "2016-02-02T09:55:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

Installation

There is no PyPi install. Please clone master branch off of Github:

git clone https://github.com/mozilla/ActiveData.git
git checkout master

and install your requirements:

pip install -r requirements.txt

Configuration

The ActiveData service requires a configuration file that will point to the default Elasticsearch index. You can find a few sample config files in resources/config. simple_settings.json is simplest one:

    {
        "flask":{
             "host":"0.0.0.0",
             "port":5000,
             "debug":false,
             "threaded":true,
             "processes":1
         },
        "constants":{
            "mo_http.http.default_headers":{"From":"https://wiki.mozilla.org/Auto-tools/Projects/ActiveData"}
        },
        "elasticsearch":{
            "host":"http://localhost",
            "port":9200,
            "index":"unittest",
            "type":"test_result",
            "debug":true
        }
        ...<snip>...
    }

The elasticsearch property must be updated to point to a specific cluster, index and type. It is used as a default, and to find other indexes by name.

Run

Jump to your git project directory, set your PYTHONPATH and run app.py:

    cd ~/ActiveData
    export PYTHONPATH=.:vendor
    python active_data/app.py --settings=resources/config/simple_settings.json

Verify

If you have no records in your Elasticsearch cluster, then you must add some before you can query them.

Make a table in Elasticsearch, with one record:

curl -XPUT "http://localhost:9200/movies/movie/1" -d "{\"name\":\"The Parent Trap\",\"released\":\"29 July` 1998\",\"imdb\":\"http://www.imdb.com/title/tt0120783/\",\"rating\":\"PG\",\"director\":{\"name\":\"Nancy Meyers\",\"dob\":\"December 8, 1949\"}}"

Assuming you used the defaults, you can verify the service is up if you can access the Query Tool at http://localhost:5000/tools/query.html. You may use it to send queries to your instance of the service. For example:

    {"from":"movies"}

Tests

The Github repo also included the test suite, and you can run it against your service if you wish. The tests will create indexes on your cluster which are filled, queried, and destroyed

Linux

    cd ~/ActiveData
    export PYTHONPATH=.:vendor
    python -m unittest discover -v -s tests

Windows

    cd ActiveData
    SET PYTHONPATH=.:vendor
    python -m unittest discover -v -s tests

activedata's People

Contributors

Stargazers

Watchers

Forkers

davehunt rohit-rk krishnamadgula maggienj alexandrasp javatreble dhara159 brokenpeace archaeopteryx la0 gmierz kerlynnkep madinab mars-f mozilla-github-standards jinseopim doc22940 team-githubs klahnakoski

activedata's Issues

Fix tests in class TestSchemaMerging(BaseTestCase):

Nested filter is confused

In an attempt to query treeherder.job_log.failure_line, we can see that the query appears to a placing the filters inside the nested query instead at top level

{
	"sort":{"action.start_time":"desc"},
	"select":["test","created","status","expected"],
	"from":"treeherder.job_log.failure_line",
	"limit":1,
	"where":{"and":[
		{"eq":{"failure.notes.failure_classification":"fixed by commit"}},
		{"exists":"failure.notes.text"},
		{"in":{"build.branch":["mozilla-inbound","autoland"]}},
		{"gte":{"created":{"date":"today-week"}}},
		{"prefix":{"job.type.name":"test-"}}
	]}
}

{
	"from":0,
	"query":{"bool":{"filter":[
		{"bool":{"filter":[
			{"terms":{"build.branch.~s~":["mozilla-inbound","autoland"]}},
			{"prefix":{"job.type.name.~s~":"test-"}}
		]}},
		{"nested":{
			"inner_hits":{"size":100000},
			"path":"job_log.~N~.failure_line.~N~",
			"query":{"bool":{"filter":[
				{"term":{"failure.notes.failure_classification.~s~":"fixed by commit"}},
				{"exists":{"field":"failure.notes.text.~s~"}}
			]}}
		}}
	]}},
	"size":1,
	"sort":[{"action.start_time.~n~":"desc"}],
	"stored_fields":["job_log.status.~n~"]
}

Boolean `in` clause not working

     {
       "in":{
         "result.ok":[
           "F"
         ]
       }
     }

neq should be supported

So, from moz-sql-parser (not sure if this is a bug in that or here)

>>> parse("""
... select build.product from tasks where foo != "firefox"
... """)
{'select': {'value': 'build.product'}, 'from': 'tasks', 'where': {'neq': ['foo', 'firefox']}}

While activedata supports "ne".. https://github.com/mozilla/ActiveData/blob/dev/docs/jx_expressions.md#ne-operator
Specifically in this case I'd have expected it to be ... 'where': {'ne': {'foo', 'firefox'}}}

@klahnakoski thoughts?

Err while executing - python27 active_data/app.py --settings=resources/config/simple_settings.json

Commands Executed:

198  rm -rf ActiveData
 199  git clone https://github.com/klahnakoski/ActiveData.git
 200  git checkout master
 201  cd ActiveData
 202  git checkout master

 206  python27 -m pip install -r requirements.txt
 207  export PYTHONPATH=.
 208  export PYTHONPATH=.
 209  python27 active_data/app.py --settings=resources/config/simple_settings.json

Result:

$ python27 active_data/app.py --settings=resources/config/simple_settings.json
kabalidaa - 2017-04-08 18:45:34 - Main Thread - "threads.py:499" (join) - "Main Thread" waiting on thread "log thread"
Traceback (most recent call last):
File "active_data/app.py", line 192, in
setup()
File "active_data/app.py", line 125, in setup
Log.error("Serious problem with ActiveData service construction! Shutdown!", cause=e)
File "C:\ActiveData\pyLibrary\debugs\logs.py", line 375, in error
raise e
pyLibrary.debugs.exceptions.Except: ERROR: Serious problem with ActiveData service construction! Shutdown!
File "active_data/app.py", line 125, in setup
File "active_data/app.py", line 192, in
caused by
ERROR: Problem with call to http://localhost:9200/active_data_requests20170408_184534
{"mappings": {"request_log": {"properties": {"content_length": {"index": "not_analyzed", "type": "string"}, "http_user_agent": {"index": "not_analyzed", "type": "string"}, "from": {"index": "not_analyzed", "type": "string"}, "remote_addr": {"index": "not_analyzed", "type": "string"}, "timestamp": {"type": "double"}, "error": {"type": "object", "enabled": false, "store": "yes", "index": "no"}, "query": {"type": "object", "enabled": false, "store": "yes", "index": "no"}, "path": {"index": "not_analyzed", "type": "string"}, "data": {"index": "not_analyzed", "type": "string"}, "http_accept_encoding": {"index": "not_analyzed", "type": "string"}}, "dynamic_templates": [{"default_strings": {"match_mapping_type": "string", "mapping": {"index": "not_analyzed", "type": "string"}, "match": "*"}}], "_source": {"compress": true}}}, "settings": {"index": {"number_of_replicas": 0, "store": {"throttle": {"max_bytes_per_sec": "2mb", "type": "merge"}}, "number_of_shards": 3}}}
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 760, in post
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 635, in create_index
File "C:\ActiveData\pyLibrary\meta.py", line 140, in wrapper
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 509, in get_or_create_index
File "C:\ActiveData\pyLibrary\meta.py", line 144, in wrapper
File "active_data/app.py", line 97, in setup
File "active_data/app.py", line 192, in
caused by
ERROR: Bad Request: No handler found for uri [/active_data_requests20170408_184534] and method [POST]
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 738, in post
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 635, in create_index
File "C:\ActiveData\pyLibrary\meta.py", line 140, in wrapper
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 509, in get_or_create_index
File "C:\ActiveData\pyLibrary\meta.py", line 144, in wrapper
File "active_data/app.py", line 97, in setup
File "active_data/app.py", line 192, in

Verify push_to_es prefers nested insertion over inner insertion

Since a column can have a nested and an inner version, verify that insertion uses the nested version for inner singleton objects.

Perf query fails

{
	"sort":"date",
	"from":"perf",
	"edges":[{
		"domain":{
			"max":"tomorrow",
			"interval":"day",
			"type":"time",
			"min":"today-month"
		},
		"name":"date",
		"value":"run.timestamp"
	}],
	"limit":2000,
	"where":{"and":[
		{"gte":{"run.timestamp":{"date":"today-month"}}},
		{"eq":{"run.framework.name":"vcs"}},
		{"eq":{"run.suite":"clone"}}
	]},
	"select":[
		{
			"aggregate":"count",
			"name":"count",
			"value":"result.stats.s1"
		},
		{
			"aggregate":"median",
			"name":"median",
			"value":"result.stats.s1"
		},
		{
			"aggregate":"percentile",
			"percentile":0.9,
			"name":"90th",
			"value":"result.stats.s1"
		}
	],
	"meta":{"save":true}
}

fix backup to ops

There is a small machine that keeps a backup of saved_queries and repo, fix it.

Document quering nested documents

Cross-origin request fails in browser due to preflighting

Since Flask uses whitelisting for HTTP methods ActiveData doesn't respond correctly to CORS protections that come from a browser that attempt to preflight the request. Namely ActiveData doesn't include the OPTIONS method in its catchall route.

CORS headers are new to me so this is my source:
https://stackoverflow.com/questions/1256593/why-am-i-getting-an-options-request-instead-of-a-get-request#13030629

Curl script generated from my browser that reproduces the issue:

curl 'http://activedata.allizom.org/query' -X OPTIONS -H 'Pragma: no-cache' -H 'Access-Control-Request-Method: POST' -H 'Origin: http://localhost:8080' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'Referer: http://localhost:8080/' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Access-Control-Request-Headers: content-type' --compressed -I

Gives you a response headers of

HTTP/1.1 200 OK
Server: nginx
Date: Fri, 14 Oct 2016 07:52:06 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive

Which as you can tell does not contain the "Access-Control-Allow-Origin" header which is entered on this line

I believe the patch needs to be made to the accepted HTTP methods for this route.

It may also be worth adding all the rest of the well known headers to the catch all at the same time:
http://flask.pocoo.org/docs/0.11/quickstart/#routing

LASTLY:
You get style points if you change the header to "Access-Control-Allow-Origin"

I would have wrote the patch myself but I'm not sure just adding OPTIONS is going to be sufficient and I wont be able to test the code on my side.

Expose ActiveData through https+password with elasticsearch-readonlyrest-plugin

This Elasticsearch java plugin elasticsearch-readonlyrest-plugin could replace esFrontline, and offer several advantages:

actively tested & maintained
removes the gunicorn perf bottleneck
directly integrated into ES frontend server
Allows SSL without nginx as frontend
Support users/groups/indexes limitation (we could have a codecoverage user allowing RO access to coverage & repo only)

@klahnakoski What do you think ? I can test this solution on a local cluster & report back with a documentation & integration in your deployment

stop old cluter ingestion

The Orange dashboards must first be fixed

Security Checklist

Risk Management

The service must have performed a Rapid Risk Assessment and have a Risk Record bug
The service must be registered via a New Service issue

Infrastructure

Access and application logs must be archived for a minimum of 90 days
Use Modern or Intermediate TLS
Set HSTS to 31536000 (1 year)
- strict-transport-security: max-age=31536000
- If the service is not hosted under services.mozilla.com, it must be manually added to Firefox's preloaded pins. This only applies to production services, not short-lived experiments.
If service has an admin panels, it must:
- only be available behind Mozilla VPN (which provides MFA)
- require Auth0 authentication

Development

Ensure your code repository is configured and located appropriately:
- Application built internally should be hosted in trusted GitHub organizations (mozilla, mozilla-services, mozilla-bteam, mozilla-conduit, mozilla-mobile, taskcluster). Sometimes we build and deploy applications we don't fully control. In those cases, the Dockerfile that builds the application container should be hosted in its own repository in a trusted organization.
- Secure your repository by implementing Mozilla's GitHub security standard.
Sign all release tags, and ideally commits as well
- Developers should configure git to sign all tags and upload their PGP fingerprint to https://login.mozilla.com
- The signature verification will eventually become a requirement to shipping a release to staging & prod: the tag being deployed in the pipeline must have a matching tag in git signed by a project owner. This control is designed to reduce the risk of a 3rd party GitHub integration from compromising our source code.
enable security scanning of 3rd-party libraries and dependencies
- Use npm audit for node.js (see usage in FxA) (NB: there are open issues for handling exceptions)
- For Python, enable pyup security updates:
  - Add a pyup config to your repo (example config: https://github.com/mozilla-services/antenna/blob/master/.pyup.yml)
  - Enable branch protection for master and other development branches. Make sure the approved-mozilla-pyup-configuration team CANNOT push to those branches.
  - From the "add a team" dropdown for your repo /settings page
    - Add the "Approved Mozilla PyUp Configuration" team for your github org (e.g. for mozilla and mozilla-services)
    - Grant it write permission so it can make pull requests
  - notify [email protected] to enable the integration in pyup
Keep 3rd-party libraries up to date (in addition to the security updates)
- For NodeJS applications, use renovate or GreenKeeper
- For Python, use pip list --outdated or requires.io or pyup outdated checks
- For Rust, use cargo update and cargo upgrade when changing versions
Integrate static code analysis in CI, and avoid merging code with issues
- Javascript applications should use ESLint with the Mozilla ruleset
- Python applications should use Bandit
- Go applications should use the Go Meta Linter
- Use whitelisting mechanisms in these tools to deal with false positives

Dual Sign Off

Services that push data to Firefox clients must require a dual sign off on every change, implemented in their admin panels
- This mechanism must be reviewed and approved by the Firefox Operations Security team before being enabled in production

Logging

Publish detailed logs in mozlog format (APP-MOZLOG)
- Business logic must be logged with app specific codes (see FxA)
- Access control failures must be logged at WARN level

Web Applications

Security Features

Authentication of end-users should be via FxA. Authentication of Mozillians should be via Auth0/SSO. Any exceptions must be approved by the security team.
Session Management should be via existing and well regarded frameworks. In all cases you should contact the security team for a design and implementation review
- Store session keys server side (typically in a db) so that they can be revoked immediately.
- Session keys must be changed on login to prevent session fixation attacks.
- Session cookies must have HttpOnly and Secure flags set and the SameSite attribute set to 'strict' or 'lax' (which allows external regular links to login).
- For more information about potential pitfalls see the OWASP Session Management Cheat Sheet
Form that change state should use anti CSRF tokens. Anti CSRF tokens can be dropped for internal sites using SameSite session cookies where we are sure all users will be on Firefox 60+. Forms that do not change state (e.g. search forms) should use the 'data-no-csrf' form attribute.
Access Control should be via existing and well regarded frameworks. If you really do need to roll your own then contact the security team for a design and implementation review.
If you are building a core Firefox service, consider adding it to the list of restricted domains in the preference extensions.webextensions.restrictedDomains. This will prevent a malicious extension from being able to steal sensitive information from it, see bug 1415644.

Databases

All SQL queries must be parameterized, not concatenated
Applications must use accounts with limited GRANTS when connecting to databases
- In particular, applications must not use admin or owner accounts, to decrease the impact of a sql injection vulnerability.

Common issues

User data must be escaped for the right context prior to reflecting it
- When inserting user generated html into an html context:
  - Python applications should use Bleach
  - Javascript applications should use DOMPurify
Apply sensible limits to user inputs, see input validation
- POST body size should be small (<500kB) unless explicitly needed
When managing permissions, make sure access controls are enforced server-side
If caching is used then make sure that any data cached does not incorrectly allow allow access to data protected by access control
If handling cryptographic keys, must have a mechanism to handle quarterly key rotations
- Keys used to sign sessions don't need a rotation mechanism if destroying all sessions is acceptable in case of emergency.
Do not proxy requests from users without strong limitations and filtering (see Pocket UserData vulnerability). Don't proxy requests to link local, loopback, or private networks or DNS that resolves to addresses in those ranges (i.e. 169.254.0.0/16, 127.0.0.0/8, 10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16, 198.18.0.0/15).
Do not use target="_blank" in external links unless you also use rel="noopener noreferrer" (to prevent Reverse Tabnabbing)

json formatter link

i guess the subdomain people.mozilla.org doesn't exist. So, json formatter link isn't working in this page, https://github.com/klahnakoski/ActiveData/blob/master/docs/GettingStarted.md

metadata is not updating

build.type and run.type are multivalued, and the metadata is not updating.

Logging is corrupted

Logging to ES has serialization problem, It has been disabled in prod, but it is left on in dev so we can see it.

Fix null column on scripted values

The following query shows a count > 0 in the null row when it should not. The cause appears to be in the missing ES filter; which is not inverting scripted values properly.

Please

Make a test
Check if the problem still exists on the new cluster
Fix it

{
	"from":"coverage",
	"where":{"and":[
		{"prefix":{"source.file.name":"mfbt/"}},
		{"eq":{"repo.changeset.id12":"752465b44c79"}}
	]},
	"groupby":[{
		"name":"subdir",
		"value":{
			"then":{"between":{"source.file.name":["mfbt/","/"]}},
			"when":{"start":5,"find":{"source.file.name":"/"}},
			"else":{"not_left":{"source.file.name":5}}
		}
	}],
	"select":[
		{"aggregate":"count"},
	],
	"limit":10000
}

Ensure `pull` and `put` not in response

All the select properties are from the internal parser. They should be from the original expression.

		{
			"name":".",
			"pull":"<function output at 0x7fca8a164de8>",
			"put":{"child":"etl.source.machine.os","index":0,"name":"."},
			"value":"etl.source.machine.os.~s~"
		}

Faster Startup

ActiveData has slow startup. It is caused by the metadata scan it does; specifically pulling the cardinality and "many"ness of the various columns. This is required to perform queries correctly, and to provide caps on query resources (not yet implemented).

Startup can be made much faster by storing the metadata in a Sqlite database. The database can be shared with sibling instances (gunicorn creates multiple AD instances to serve requests) and future instances of ActiveData.

The metadata is managed in meta.py. The current ES metadata, which upon which the latter is based, is held in pylibrary.env.elasticsearch._meta.

Ensure proper versioning of requirements via exact version specifiers

'future==0.16.0' vs. 'future' for example

Range query not working

{
    "from": "unittest",
    "format": "cube",
    "edges": [
        {
            "domain": {
                "type": "range",
                "key": "name",
                "partitions": [
                    {
                        "max": 1,
                        "min": 0,
                        "dataIndex": 0,
                        "name": "1sec"
                    },
                    {
                        "max": 2,
                        "min": 1,
                        "dataIndex": 1,
                        "name": "2sec"
                    },
                    {
                        "max": 5,
                        "min": 2,
                        "dataIndex": 2,
                        "name": "5sec"
                    },
                    {
                        "max": 10,
                        "min": 5,
                        "dataIndex": 3,
                        "name": "10sec"
                    },
                    {
                        "max": 20,
                        "min": 10,
                        "dataIndex": 4,
                        "name": "20sec"
                    },
                    {
                        "max": 30,
                        "min": 20,
                        "dataIndex": 5,
                        "name": "30sec"
                    },
                    {
                        "max": 45,
                        "min": 30,
                        "dataIndex": 6,
                        "name": "45sec"
                    },
                    {
                        "max": 60,
                        "min": 45,
                        "dataIndex": 7,
                        "name": "60sec"
                    },
                    {
                        "max": 90,
                        "min": 60,
                        "dataIndex": 8,
                        "name": "90sec"
                    },
                    {
                        "max": 120,
                        "min": 90,
                        "dataIndex": 9,
                        "name": "120sec"
                    },
                    {
                        "max": 150,
                        "min": 120,
                        "dataIndex": 10,
                        "name": "150sec"
                    },
                    {
                        "max": 600,
                        "min": 150,
                        "dataIndex": 11,
                        "name": "600sec"
                    }
                ]
            },
            "value": "result.duration"
        }
    ],
    "limit": 10000,
    "where": {
        "and": [
            {
                "in": {
                    "repo.branch.name": [
                        "mozilla-central"
                    ]
                }
            },
            {
                "gte": [
                    "repo.push.date",
                    {
                        "date": "today-week"
                    }
                ]
            },
            {
                "lte": [
                    "repo.push.date",
                    {
                        "date": "eod"
                    }
                ]
            },
            {
                "eq": {
                    "build.type": "opt"
                }
            },
            {
                "eq": {
                    "run.machine.platform": "windows10-64"
                }
            },
            {
                "regex": {
                    "result.test": ".*/.*"
                }
            },
            {
                "eq": {
                    "result.ok": "T"
                }
            }
        ]
    },
    "select": [
        {
            "aggregate": "cardinality",
            "value": "result.test"
        }
    ]
}

changeset.description is returning null

This is cloned from: mozilla/active-data-recipes#32

The try_usage recipe does not work using the new cluster. The data it needs is there, but {"select":"changeset.description"} is returning null.

{
"from":"repo",
"select":["push.user","changeset"],
"where":{"and":[
{"eq":{"branch.name":"try"}},
{"gte":{"push.date":{"date":"today-week"}}}
]},
"sort":{"push.user":"desc"},
"limit":10
}

The repo is not storing the description separate from the whole document.

post to put issue in elasticsearch.py

changes made to fix_put branch

metadata database gets corrupted

The database (or the code interacting with the database) will get corrupted. Deleted the database to solve the problem, but more research is needed.

Problem with nested query

The docs mention a query into unittest.run.files https://github.com/mozilla/ActiveData/blob/dev/docs/jx_tutorial.md#select-clause

{
    "from": "unittest.run.files",
    "select": ["run.stats.bytes","run.files.url"],
    "where": {"and": [
        {"eq": {"build.platform": "linux64"}},
        {"gt": {"run.stats.bytes": 600000000}}
    ]}
}

which appears to fail

Call to ActiveData failed
	File ESQueryRunner.js, line 33, in ActiveDataQuery
	File thread.js, line 247, in Thread_prototype_resume
	File thread.js, line 226, in retval
	File Rest.js, line 46, in Rest.send/ajaxParam.error
	File Rest.js, line 100, in Rest.send/request.onreadystatechange
caused by Error while calling /query
caused by Bad response (400)
caused by Should not happen
	File __init__.py, line 156, in _index
	File __init__.py, line 123, in __init__
	File jx_usingES.py, line 85, in __init__
	File __init__.py, line 64, in wrapper
	File __init__.py, line 92, in wrap_from
	File jx.py, line 60, in jx_query
	File __init__.py, line 55, in output
	File app.py, line 1625, in dispatch_request
	File app.py, line 1639, in full_dispatch_request
	File app.py, line 1988, in wsgi_app
	File app.py, line 2000, in __call__
	File serving.py, line 181, in execute
	File serving.py, line 193, in run_wsgi
	File serving.py, line 251, in handle_one_request
	File BaseHTTPServer.py, line 340, in handle
	File serving.py, line 216, in handle
	File SocketServer.py, line 655, in __init__
	File SocketServer.py, line 334, in finish_request
	File SocketServer.py, line 599, in process_request_thread
	File threading.py, line 766, in run
	File threading.py, line 813, in __bootstrap_inner
	File threading.py, line 786, in __bootstrap
caused by run.files
	File __init__.py, line 152, in _index
	File __init__.py, line 123, in __init__
	File jx_usingES.py, line 85, in __init__
	File __init__.py, line 64, in wrapper
	File __init__.py, line 92, in wrap_from
	File jx.py, line 60, in jx_query
	File __init__.py, line 55, in output
	File app.py, line 1625, in dispatch_request
	File app.py, line 1639, in full_dispatch_request
	File app.py, line 1988, in wsgi_app
	File app.py, line 2000, in __call__
	File serving.py, line 181, in execute
	File serving.py, line 193, in run_wsgi
	File serving.py, line 251, in handle_one_request
	File BaseHTTPServer.py, line 340, in handle
	File serving.py, line 216, in handle
	File SocketServer.py, line 655, in __init__
	File SocketServer.py, line 334, in finish_request
	File SocketServer.py, line 599, in process_request_thread
	File threading.py, line 766, in run
	File threading.py, line 813, in __bootstrap_inner
	File threading.py, line 786, in __bootstrap

Please check if this is a problem with the new cluster, and if so, ensure a test is made

_normalize and related methods are useless

The _normalize function is past its prime; it used to simplify Boolean expressions, but not that ES filter language has changed, it no longer works. Remove it, and remove any calls using it. Use .partial_eval() on an expression before converting to_esfilter.

default on ?script? not working?

verify the default is respected

		{
			"aggregate":"sum",
			"default":0,
			"name":"failures",
			"value":{"case":[{"then":1,"when":{"eq":{"result.ok":"F"}}}]}
		}

that is not the case right now

https://activedata.allizom.org/tools/query.html#query_id=WeUIHfSj

Tests - steps as specified in Readme.MD raised these errs

Tests - steps as specified in Readme.MD raised these errs
Errors are captured in
http://elasticsearchpy.blogspot.com/2017/03/running-tests-python-27.html

Ensure moves are replicated

example: u'6e545e9b883997bfacbdd20889f7bd6b4b8916b2'

Make es5 config file for running ActiveData in development

ActiveData's config file for development (and testing) may be different than the one used for v1.7

redash sql not working

This query does not work. The biggest problem being the (?parse?) error is not getting back to redash user

select 
    count("result.stats.s1") 
from 
    perf 
where 
    "run.timestamp">=date('today-month') and 
    "run.framework.name" = 'vcs' and 
    "run.suite"='clone' 
group by 
    floor("run.timestamp", 86400) as "day"

Verify {"eq":{"result.ok":"T"}} still works

Fix test_select_array_as_value

fix test_get_nested_columns

SQL query fails

Make tests and fix this problem. We suspect it's in the parsing of the in (...)

curl -XPOST http://activedata.allizom.org/sql -d "{\"sql\":\"select task.worker.type, repo.branch.name, run.machine.aws_instance_type from task where repo.branch.name in ('try', 'mozilla-central') LIMIT 10\"}"

Bad error message (?bad logic?) on query

{
	"from":"treeherder",
	"where":{"and":[
		{"in":{"run.result":["busted","exception","testfailed"]}},
		{"neq":{"failure.notes.failure_classification":"autoclassified intermittent"}}
	]},
	"limit":1
}

results in

caused by Expecting a Mapping
	File datas.py, line 546, in _iadd

here is full error

Call to ActiveData failed
	File ESQueryRunner.js, line 33, in ActiveDataQuery
	File thread.js, line 247, in Thread_prototype_resume
	File thread.js, line 226, in retval
	File Rest.js, line 46, in Rest.send/ajaxParam.error
	File Rest.js, line 100, in Rest.send/request.onreadystatechange
caused by Error while calling /query
caused by Bad response (400)
caused by problem
	File __init__.py, line 161, in query
	File jx.py, line 77, in run
	File query.py, line 63, in jx_query
	File flask_wrappers.py, line 55, in output
	File app.py, line 1461, in dispatch_request
	File app.py, line 1475, in full_dispatch_request
	File app.py, line 1817, in wsgi_app
	File app.py, line 1836, in __call__
	File sync.py, line 176, in handle_request
	File sync.py, line 135, in handle
	File sync.py, line 30, in accept
	File sync.py, line 68, in run_for_one
	File sync.py, line 124, in run
	File base.py, line 131, in init_process
	File arbiter.py, line 578, in spawn_worker
	File arbiter.py, line 611, in spawn_workers
	File arbiter.py, line 544, in manage_workers
	File arbiter.py, line 202, in run
	File base.py, line 72, in run
	File base.py, line 203, in run
	File wsgiapp.py, line 74, in run
	File gunicorn, line 11, in <module>
caused by Expecting a Mapping
	File datas.py, line 546, in _iadd
	File datas.py, line 189, in __iadd__
	File expressions.py, line 1513, in split_expression_by_depth
	File expressions.py, line 1550, in split_expression_by_path
	File setop.py, line 61, in es_setop
	File __init__.py, line 154, in query
	File jx.py, line 77, in run
	File query.py, line 63, in jx_query
	File flask_wrappers.py, line 55, in output
	File app.py, line 1461, in dispatch_request
	File app.py, line 1475, in full_dispatch_request
	File app.py, line 1817, in wsgi_app
	File app.py, line 1836, in __call__
	File sync.py, line 176, in handle_request
	File sync.py, line 135, in handle
	File sync.py, line 30, in accept
	File sync.py, line 68, in run_for_one
	File sync.py, line 124, in run
	File base.py, line 131, in init_process
	File arbiter.py, line 578, in spawn_worker
	File arbiter.py, line 611, in spawn_workers
	File arbiter.py, line 544, in manage_workers
	File arbiter.py, line 202, in run
	File base.py, line 72, in run
	File base.py, line 203, in run
	File wsgiapp.py, line 74, in run
	File gunicorn, line 11, in <module>

Verify adr test_durations is working

See mozilla/active-data-recipes#67

python27 active_data/app.py --settings=resources/config/simple_settings.json - raises err in ElasticSearch 5.2.2

Can not pull failure.notes.text value

Problem seen here:

{
	"from":"treeherder",
	"limit":50000,
	"select":["build.date","failure.notes.text"],
	"where":{"and":[
		{"lte":{"repo.push.date":{"date":"2018-10-07"}}},
		{"gte":{"repo.push.date":{"date":"2018-09-30"}}},
		{"in":{"build.branch":["mozilla-inbound","autoland"]}},
		{"in":{"job.type.group.symbol":["M","M-e10s","X"]}},
		{"neq":{"build.type":"asan"}},
		{"eq":{"run.machine.platform":"linux64"}},
		{"eq":{"failure.classification":"fixed by commit"}}
	]}
}

https://activedata.allizom.org/tools/query.html#query_id=rZSE3D7M

Fix Orange dashboard queries

The ES query language is not as flexible in this new version, activedata is busted for the Fresh and Neglected Oranges dashboards

treeherder still has duplicate redcords

This returns two records, it should return just one

{
	"from":"treeherder",
	"select":[
		"repo.index",
		"job.id",
		"job.type.name",
		"repo.push.date",
		"failure.notes.failure_classification",
		"failure.notes.created",
		"action.start_time",
		"action.end_time",
		"last_modified"
	],
	"orderby":"repo.index, job.type.name, action.start_time",
	"where":{"and":[
		{"eq":{"run.state":"completed"}},
		{"in":{"repo.branch.name":[
			"autoland",
			"mozilla-inbound",
			"mozilla-central",
			"mozilla-beta"
		]}},
		{"eq":{"run.tier":1}},
		{"in":{"run.result":["busted","exception","testfailed"]}},
		{"neq":{"failure.notes.failure_classification":"autoclassified intermittent"}},
		{"neq":{"failure.classify":"not classified"}},
		{"eq":["action.start_time",1541086422]},
		{"eq":{"repo.changeset.id12":"2659d4da0d78"}}
	]},
	"limit":1000000
}

fx-test is broken because of query

Boolean typed queries seem to have a problem

{
    "from": "fx-test",
    "edges": [
        {
            "name": "ok",
            "value": "result.ok"
        }
    ],
    "limit": 1000,
    "format": "table"
}

select _id is not working?

Used for replication:

    "select": ["_id", {"name": "_source", "value": "."}],
    "from": config.source.index

Pre-populate the query textbox with a value

Allow third parties to provide a URL which will help pre-populate the query textbox.

Example:

query.html?query_str=foobar

Same setup with Elastic Search 5.2.2 - to check how it works in ES 5.

Unit Test runs fine in Elastic Search V1.7.1 along with Python 2.7.12

Just upgraded Elastic Search to V5.2.2

Modified the yml file and the service doesn't start with these two config lines on.
Elasticsearch has a configuration file at config/elasticsearch.yml. You must modify it to turn on scripting. Add these two lines at the top of the file:
script.inline: on
script.indexed: on

Just removed these two lines for the sake of testing to test if the elastic service would start and it started.
Re-plugging these two lines in config/elasticsearch.yml doesn't let the service start.

So, for now... these two lines are commented out in config/elasticsearch.yml..

( I do understand that ES5.2.2 is the latest... just wanted to see if it works fine in the latest and thats why installed the latest release )

We should be able to query cardinality

...even if it is null

{
    "from": "meta.columns",
    "select": "cardinality",
    "where": {
        "and": [
            {
                "eq": {
                    "table": "fx-test"
                }
            },
            {
                "eq": {
                    "name": "result.ok"
                }
            }
        ]
    }
}