mozilla / activedata
Provide high speed filtering and aggregation over data
License: Mozilla Public License 2.0
The description.raw is not stored, plus the old cluster must be brought over to the new cluster
When querying treeherder.job_log.failure_line, we can see that the generated ES query appears to be placing the filters inside the nested query instead of at the top level
{
"sort":{"action.start_time":"desc"},
"select":["test","created","status","expected"],
"from":"treeherder.job_log.failure_line",
"limit":1,
"where":{"and":[
{"eq":{"failure.notes.failure_classification":"fixed by commit"}},
{"exists":"failure.notes.text"},
{"in":{"build.branch":["mozilla-inbound","autoland"]}},
{"gte":{"created":{"date":"today-week"}}},
{"prefix":{"job.type.name":"test-"}}
]}
}
{
"from":0,
"query":{"bool":{"filter":[
{"bool":{"filter":[
{"terms":{"build.branch.~s~":["mozilla-inbound","autoland"]}},
{"prefix":{"job.type.name.~s~":"test-"}}
]}},
{"nested":{
"inner_hits":{"size":100000},
"path":"job_log.~N~.failure_line.~N~",
"query":{"bool":{"filter":[
{"term":{"failure.notes.failure_classification.~s~":"fixed by commit"}},
{"exists":{"field":"failure.notes.text.~s~"}}
]}}
}}
]}},
"size":1,
"sort":[{"action.start_time.~n~":"desc"}],
"stored_fields":["job_log.status.~n~"]
}
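One way to reason about where a filter belongs is to compare the column it names against the nested path being queried. The sketch below is a hypothetical illustration, not ActiveData's implementation: `filter_column` and `split_filters_by_path` are invented names, and it assumes every clause is a single-operator JX clause whose first operand names the column.

```python
def filter_column(clause):
    """Return the column name a simple, single-operator JX clause refers to."""
    operand = next(iter(clause.values()))
    if isinstance(operand, dict):            # {"eq": {"column": value}}
        return next(iter(operand))
    if isinstance(operand, (list, tuple)):   # {"eq": ["column", value]}
        return operand[0]
    return operand                           # {"exists": "column"}


def split_filters_by_path(clauses, nested_path):
    """Partition clauses into (top_level, nested) for one nested path.

    A clause is treated as nested only when its column lives under
    nested_path; everything else stays at the top level.
    """
    top_level, nested = [], []
    prefix = nested_path + "."
    for clause in clauses:
        column = filter_column(clause)
        if column == nested_path or column.startswith(prefix):
            nested.append(clause)
        else:
            top_level.append(clause)
    return top_level, nested
```

Under these assumptions, with `nested_path="job_log.failure_line"`, a clause such as `{"exists": "failure.notes.text"}` is classified as top-level because its column is not under the nested path.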
example: u'6e545e9b883997bfacbdd20889f7bd6b4b8916b2'
Unit tests run fine with Elasticsearch v1.7.1 and Python 2.7.12.
I just upgraded Elasticsearch to v5.2.2.
I modified the yml file, and the service doesn't start with these two config lines on.
Elasticsearch has a configuration file at config/elasticsearch.yml. You must modify it to turn on scripting. Add these two lines at the top of the file:
script.inline: on
script.indexed: on
I removed these two lines to test whether the Elasticsearch service would start, and it did.
Putting the two lines back in config/elasticsearch.yml prevents the service from starting.
So, for now, these two lines are commented out in config/elasticsearch.yml.
(I do understand that ES 5.2.2 is the latest; I just wanted to see if it works with the latest release, and that's why I installed it.)
When metadata management is on, calls to ES start timing out. This includes calls made by the ETL pipeline.
Metadata management is off until this is fixed.
{
"from":"treeherder",
"where":{"and":[
{"in":{"run.result":["busted","exception","testfailed"]}},
{"neq":{"failure.notes.failure_classification":"autoclassified intermittent"}}
]},
"limit":1
}
results in
caused by Expecting a Mapping
File datas.py, line 546, in _iadd
Here is the full error:
Call to ActiveData failed
File ESQueryRunner.js, line 33, in ActiveDataQuery
File thread.js, line 247, in Thread_prototype_resume
File thread.js, line 226, in retval
File Rest.js, line 46, in Rest.send/ajaxParam.error
File Rest.js, line 100, in Rest.send/request.onreadystatechange
caused by Error while calling /query
caused by Bad response (400)
caused by problem
File __init__.py, line 161, in query
File jx.py, line 77, in run
File query.py, line 63, in jx_query
File flask_wrappers.py, line 55, in output
File app.py, line 1461, in dispatch_request
File app.py, line 1475, in full_dispatch_request
File app.py, line 1817, in wsgi_app
File app.py, line 1836, in __call__
File sync.py, line 176, in handle_request
File sync.py, line 135, in handle
File sync.py, line 30, in accept
File sync.py, line 68, in run_for_one
File sync.py, line 124, in run
File base.py, line 131, in init_process
File arbiter.py, line 578, in spawn_worker
File arbiter.py, line 611, in spawn_workers
File arbiter.py, line 544, in manage_workers
File arbiter.py, line 202, in run
File base.py, line 72, in run
File base.py, line 203, in run
File wsgiapp.py, line 74, in run
File gunicorn, line 11, in <module>
caused by Expecting a Mapping
File datas.py, line 546, in _iadd
File datas.py, line 189, in __iadd__
File expressions.py, line 1513, in split_expression_by_depth
File expressions.py, line 1550, in split_expression_by_path
File setop.py, line 61, in es_setop
File __init__.py, line 154, in query
File jx.py, line 77, in run
File query.py, line 63, in jx_query
File flask_wrappers.py, line 55, in output
File app.py, line 1461, in dispatch_request
File app.py, line 1475, in full_dispatch_request
File app.py, line 1817, in wsgi_app
File app.py, line 1836, in __call__
File sync.py, line 176, in handle_request
File sync.py, line 135, in handle
File sync.py, line 30, in accept
File sync.py, line 68, in run_for_one
File sync.py, line 124, in run
File base.py, line 131, in init_process
File arbiter.py, line 578, in spawn_worker
File arbiter.py, line 611, in spawn_workers
File arbiter.py, line 544, in manage_workers
File arbiter.py, line 202, in run
File base.py, line 72, in run
File base.py, line 203, in run
File wsgiapp.py, line 74, in run
File gunicorn, line 11, in <module>
'future==0.16.0' vs. 'future' for example
The ES query language is not as flexible in this new version; ActiveData is broken for the Fresh and Neglected Oranges dashboards.
Boolean typed queries seem to have a problem
{
"from": "fx-test",
"edges": [
{
"name": "ok",
"value": "result.ok"
}
],
"limit": 1000,
"format": "table"
}
fix this
Commands Executed:
rm -rf ActiveData
git clone https://github.com/klahnakoski/ActiveData.git
git checkout master
cd ActiveData
git checkout master
python27 -m pip install -r requirements.txt
export PYTHONPATH=.
export PYTHONPATH=.
python27 active_data/app.py --settings=resources/config/simple_settings.json
Result:
$ python27 active_data/app.py --settings=resources/config/simple_settings.json
kabalidaa - 2017-04-08 18:45:34 - Main Thread - "threads.py:499" (join) - "Main Thread" waiting on thread "log thread"
Traceback (most recent call last):
File "active_data/app.py", line 192, in
setup()
File "active_data/app.py", line 125, in setup
Log.error("Serious problem with ActiveData service construction! Shutdown!", cause=e)
File "C:\ActiveData\pyLibrary\debugs\logs.py", line 375, in error
raise e
pyLibrary.debugs.exceptions.Except: ERROR: Serious problem with ActiveData service construction! Shutdown!
File "active_data/app.py", line 125, in setup
File "active_data/app.py", line 192, in
caused by
ERROR: Problem with call to http://localhost:9200/active_data_requests20170408_184534
{"mappings": {"request_log": {"properties": {"content_length": {"index": "not_analyzed", "type": "string"}, "http_user_agent": {"index": "not_analyzed", "type": "string"}, "from": {"index": "not_analyzed", "type": "string"}, "remote_addr": {"index": "not_analyzed", "type": "string"}, "timestamp": {"type": "double"}, "error": {"type": "object", "enabled": false, "store": "yes", "index": "no"}, "query": {"type": "object", "enabled": false, "store": "yes", "index": "no"}, "path": {"index": "not_analyzed", "type": "string"}, "data": {"index": "not_analyzed", "type": "string"}, "http_accept_encoding": {"index": "not_analyzed", "type": "string"}}, "dynamic_templates": [{"default_strings": {"match_mapping_type": "string", "mapping": {"index": "not_analyzed", "type": "string"}, "match": "*"}}], "_source": {"compress": true}}}, "settings": {"index": {"number_of_replicas": 0, "store": {"throttle": {"max_bytes_per_sec": "2mb", "type": "merge"}}, "number_of_shards": 3}}}
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 760, in post
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 635, in create_index
File "C:\ActiveData\pyLibrary\meta.py", line 140, in wrapper
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 509, in get_or_create_index
File "C:\ActiveData\pyLibrary\meta.py", line 144, in wrapper
File "active_data/app.py", line 97, in setup
File "active_data/app.py", line 192, in
caused by
ERROR: Bad Request: No handler found for uri [/active_data_requests20170408_184534] and method [POST]
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 738, in post
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 635, in create_index
File "C:\ActiveData\pyLibrary\meta.py", line 140, in wrapper
File "C:\ActiveData\pyLibrary\env\elasticsearch.py", line 509, in get_or_create_index
File "C:\ActiveData\pyLibrary\meta.py", line 144, in wrapper
File "active_data/app.py", line 97, in setup
File "active_data/app.py", line 192, in
The Orange dashboards must first be fixed
There is a small machine that keeps a backup of saved_queries and repo; fix it.
Problem seen here:
{
"from":"treeherder",
"limit":50000,
"select":["build.date","failure.notes.text"],
"where":{"and":[
{"lte":{"repo.push.date":{"date":"2018-10-07"}}},
{"gte":{"repo.push.date":{"date":"2018-09-30"}}},
{"in":{"build.branch":["mozilla-inbound","autoland"]}},
{"in":{"job.type.group.symbol":["M","M-e10s","X"]}},
{"neq":{"build.type":"asan"}},
{"eq":{"run.machine.platform":"linux64"}},
{"eq":{"failure.classification":"fixed by commit"}}
]}
}
https://activedata.allizom.org/tools/query.html#query_id=rZSE3D7M
Used for replication:
"select": ["_id", {"name": "_source", "value": "."}],
"from": config.source.index
So, from moz-sql-parser (not sure if this is a bug in that or here)
>>> parse("""
... select build.product from tasks where foo != "firefox"
... """)
{'select': {'value': 'build.product'}, 'from': 'tasks', 'where': {'neq': ['foo', 'firefox']}}
ActiveData, however, supports "ne": https://github.com/mozilla/ActiveData/blob/dev/docs/jx_expressions.md#ne-operator
Specifically, in this case I'd have expected it to be ... 'where': {'ne': {'foo': 'firefox'}}}
@klahnakoski thoughts?
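Until the two projects agree on a spelling, a caller could rename the operator before sending the parsed query. This is a minimal sketch, assuming the operator name is the only mismatch: `rename_neq` is a hypothetical helper, not part of moz-sql-parser or ActiveData, and it leaves the operands untouched so the meaning of the comparison is not changed.

```python
def rename_neq(expr):
    """Return a copy of expr with every "neq" key renamed to "ne".

    Caveat of this sketch: a column literally named "neq" used as a
    key in object form would also be renamed.
    """
    if isinstance(expr, dict):
        return {("ne" if k == "neq" else k): rename_neq(v) for k, v in expr.items()}
    if isinstance(expr, list):
        return [rename_neq(v) for v in expr]
    return expr
```

Applied to the parser output above, only the `where` clause changes: `{'neq': ['foo', 'firefox']}` becomes `{'ne': ['foo', 'firefox']}`.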
ActiveData's config file for development (and testing) may be different than the one used for v1.7
{
"in":{
"result.ok":[
"F"
]
}
}
verify the default is respected
{
"aggregate":"sum",
"default":0,
"name":"failures",
"value":{"case":[{"then":1,"when":{"eq":{"result.ok":"F"}}}]}
}
that is not the case right now
https://activedata.allizom.org/tools/query.html#query_id=WeUIHfSj
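The expected behaviour can be written down as a small reference function. This is a sketch of the intended semantics only, not ActiveData's code: `failure_flag` and `sum_with_default` are hypothetical names, and it assumes a `case` with no matching `when` yields null (`None` here).

```python
def failure_flag(record):
    """Mirror the case expression: 1 when result.ok == "F", otherwise None."""
    return 1 if record.get("result.ok") == "F" else None


def sum_with_default(values, default=None):
    """Sum the non-null values; return `default` when every value is null."""
    present = [v for v in values if v is not None]
    return sum(present) if present else default
```

With `default=0`, a group that contains no failures would report `0` rather than null, which is what the `"default":0` in the select clause asks for.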
All the select properties are from the internal parser. They should be from the original expression.
{
"name":".",
"pull":"<function output at 0x7fca8a164de8>",
"put":{"child":"etl.source.machine.os","index":0,"name":"."},
"value":"etl.source.machine.os.~s~"
}
strict-transport-security: max-age=31536000
services.mozilla.com, it must be manually added to Firefox's preloaded pins. This only applies to production services, not short-lived experiments.
npm audit for node.js (see usage in FxA) (NB: there are open issues for handling exceptions)
pip list --outdated or requires.io or pyup outdated checks
cargo update and cargo upgrade when changing versions
/__cspreport__ endpoint
default-src 'none'; frame-ancestors 'none'; base-uri 'none'; report-uri /__cspreport__ to disallowing all content rendering, framing, and report violations
none, frame-src, and object-src should be none or only allow specific origins
extensions.webextensions.restrictedDomains. This will prevent a malicious extension from being able to steal sensitive information from it, see bug 1415644.
target="_blank" in external links unless you also use rel="noopener noreferrer" (to prevent Reverse Tabnabbing)
This Elasticsearch java plugin elasticsearch-readonlyrest-plugin could replace esFrontline, and offer several advantages:
codecoverage user allowing RO access to coverage & repo only)
@klahnakoski What do you think? I can test this solution on a local cluster and report back with documentation and integration in your deployment.
Because Flask whitelists the HTTP methods accepted by each route, ActiveData does not respond correctly to the CORS preflight request that a browser sends before a cross-origin request. Namely, ActiveData doesn't include the OPTIONS method in its catch-all route.
CORS headers are new to me, so this is my source:
https://stackoverflow.com/questions/1256593/why-am-i-getting-an-options-request-instead-of-a-get-request#13030629
Curl script generated from my browser that reproduces the issue:
curl 'http://activedata.allizom.org/query' -X OPTIONS -H 'Pragma: no-cache' -H 'Access-Control-Request-Method: POST' -H 'Origin: http://localhost:8080' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'Referer: http://localhost:8080/' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Access-Control-Request-Headers: content-type' --compressed -I
This gives the following response headers:
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 14 Oct 2016 07:52:06 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
As you can tell, this does not contain the "Access-Control-Allow-Origin" header, which is entered on this line
I believe the patch needs to be made to the accepted HTTP methods for this route.
It may also be worth adding all the rest of the well-known headers to the catch-all at the same time:
http://flask.pocoo.org/docs/0.11/quickstart/#routing
LASTLY:
You get style points if you change the header to "Access-Control-Allow-Origin"
I would have written the patch myself, but I'm not sure just adding OPTIONS is going to be sufficient, and I won't be able to test the code on my side.
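One framework-independent way to answer preflight requests is to wrap the WSGI application. The sketch below is an illustration under assumptions, not the ActiveData patch: `CorsMiddleware`, `ALLOWED_METHODS`, `ALLOWED_HEADERS`, and the `*` origin are all hypothetical choices, and the wrapper replaces any CORS headers the wrapped app sets itself.

```python
# Assumed values for illustration only; a real service should choose its own.
ALLOWED_METHODS = "GET, POST, OPTIONS"
ALLOWED_HEADERS = "Content-Type"


class CorsMiddleware(object):
    """WSGI wrapper that answers OPTIONS preflights and adds CORS headers."""

    def __init__(self, app, allow_origin="*"):
        self.app = app
        self.allow_origin = allow_origin

    def _cors_headers(self):
        return [
            ("Access-Control-Allow-Origin", self.allow_origin),
            ("Access-Control-Allow-Methods", ALLOWED_METHODS),
            ("Access-Control-Allow-Headers", ALLOWED_HEADERS),
        ]

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") == "OPTIONS":
            # Preflight: reply directly; the wrapped app is never called.
            start_response("200 OK", self._cors_headers() + [("Content-Length", "0")])
            return [b""]

        def cors_start_response(status, headers, exc_info=None):
            # Drop CORS headers set by the app (any letter case), then add ours.
            kept = [(k, v) for k, v in headers
                    if not k.lower().startswith("access-control-")]
            return start_response(status, kept + self._cors_headers(), exc_info)

        return self.app(environ, cors_start_response)
```

With a Flask application object named `app`, the wrapper would be installed with `app.wsgi_app = CorsMiddleware(app.wsgi_app)`. Whether a wildcard origin is acceptable depends on the deployment.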
Logging to ES has a serialization problem. It has been disabled in prod, but it is left on in dev so we can see it.
This returns two records; it should return just one.
{
"from":"treeherder",
"select":[
"repo.index",
"job.id",
"job.type.name",
"repo.push.date",
"failure.notes.failure_classification",
"failure.notes.created",
"action.start_time",
"action.end_time",
"last_modified"
],
"orderby":"repo.index, job.type.name, action.start_time",
"where":{"and":[
{"eq":{"run.state":"completed"}},
{"in":{"repo.branch.name":[
"autoland",
"mozilla-inbound",
"mozilla-central",
"mozilla-beta"
]}},
{"eq":{"run.tier":1}},
{"in":{"run.result":["busted","exception","testfailed"]}},
{"neq":{"failure.notes.failure_classification":"autoclassified intermittent"}},
{"neq":{"failure.classify":"not classified"}},
{"eq":["action.start_time",1541086422]},
{"eq":{"repo.changeset.id12":"2659d4da0d78"}}
]},
"limit":1000000
}
Running the tests with the steps specified in Readme.MD raised errors.
The errors are captured in
http://elasticsearchpy.blogspot.com/2017/03/running-tests-python-27.html
{
"from": "unittest",
"format": "cube",
"edges": [
{
"domain": {
"type": "range",
"key": "name",
"partitions": [
{
"max": 1,
"min": 0,
"dataIndex": 0,
"name": "1sec"
},
{
"max": 2,
"min": 1,
"dataIndex": 1,
"name": "2sec"
},
{
"max": 5,
"min": 2,
"dataIndex": 2,
"name": "5sec"
},
{
"max": 10,
"min": 5,
"dataIndex": 3,
"name": "10sec"
},
{
"max": 20,
"min": 10,
"dataIndex": 4,
"name": "20sec"
},
{
"max": 30,
"min": 20,
"dataIndex": 5,
"name": "30sec"
},
{
"max": 45,
"min": 30,
"dataIndex": 6,
"name": "45sec"
},
{
"max": 60,
"min": 45,
"dataIndex": 7,
"name": "60sec"
},
{
"max": 90,
"min": 60,
"dataIndex": 8,
"name": "90sec"
},
{
"max": 120,
"min": 90,
"dataIndex": 9,
"name": "120sec"
},
{
"max": 150,
"min": 120,
"dataIndex": 10,
"name": "150sec"
},
{
"max": 600,
"min": 150,
"dataIndex": 11,
"name": "600sec"
}
]
},
"value": "result.duration"
}
],
"limit": 10000,
"where": {
"and": [
{
"in": {
"repo.branch.name": [
"mozilla-central"
]
}
},
{
"gte": [
"repo.push.date",
{
"date": "today-week"
}
]
},
{
"lte": [
"repo.push.date",
{
"date": "eod"
}
]
},
{
"eq": {
"build.type": "opt"
}
},
{
"eq": {
"run.machine.platform": "windows10-64"
}
},
{
"regex": {
"result.test": ".*/.*"
}
},
{
"eq": {
"result.ok": "T"
}
}
]
},
"select": [
{
"aggregate": "cardinality",
"value": "result.test"
}
]
}
The database (or the code interacting with the database) will get corrupted. Deleted the database to solve the problem, but more research is needed.
{
"sort":"date",
"from":"perf",
"edges":[{
"domain":{
"max":"tomorrow",
"interval":"day",
"type":"time",
"min":"today-month"
},
"name":"date",
"value":"run.timestamp"
}],
"limit":2000,
"where":{"and":[
{"gte":{"run.timestamp":{"date":"today-month"}}},
{"eq":{"run.framework.name":"vcs"}},
{"eq":{"run.suite":"clone"}}
]},
"select":[
{
"aggregate":"count",
"name":"count",
"value":"result.stats.s1"
},
{
"aggregate":"median",
"name":"median",
"value":"result.stats.s1"
},
{
"aggregate":"percentile",
"percentile":0.9,
"name":"90th",
"value":"result.stats.s1"
}
],
"meta":{"save":true}
}
ActiveData has slow startup. It is caused by the metadata scan it does, specifically pulling the cardinality and "many"ness of the various columns. This is required to perform queries correctly, and to provide caps on query resources (not yet implemented).
Startup can be made much faster by storing the metadata in a Sqlite database. The database can be shared with sibling instances (gunicorn creates multiple AD instances to serve requests) and future instances of ActiveData.
The metadata is managed in meta.py. The current ES metadata, upon which the latter is based, is held in pylibrary.env.elasticsearch._meta.
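A minimal sketch of such a cache, using only the standard `sqlite3` module. Everything here is an assumption made for illustration: the `column_meta` table, its columns, and the function names are hypothetical and do not reflect the actual layout of meta.py.

```python
import sqlite3
import time


def open_metadata_cache(path):
    """Open (or create) the cache; `path` may be a file shared by sibling processes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS column_meta ("
        " es_index TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " cardinality INTEGER,"
        " multi INTEGER,"
        " last_updated REAL NOT NULL,"
        " PRIMARY KEY (es_index, name))"
    )
    db.commit()
    return db


def save_column(db, es_index, name, cardinality, multi, now=None):
    """Insert or overwrite the cached statistics for one column."""
    db.execute(
        "INSERT OR REPLACE INTO column_meta VALUES (?, ?, ?, ?, ?)",
        (es_index, name, cardinality, multi, time.time() if now is None else now),
    )
    db.commit()


def load_columns(db, es_index, max_age_seconds, now=None):
    """Return cached columns for an index that are newer than max_age_seconds."""
    now = time.time() if now is None else now
    rows = db.execute(
        "SELECT name, cardinality, multi FROM column_meta"
        " WHERE es_index = ? AND last_updated >= ? ORDER BY name",
        (es_index, now - max_age_seconds),
    )
    return {name: {"cardinality": c, "multi": m} for name, c, m in rows}
```

At startup, an instance could call `load_columns` first and scan ES only for the columns that are missing or stale, which is where the time saving would come from.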
This is cloned from: mozilla/active-data-recipes#32
The try_usage recipe does not work using the new cluster. The data it needs is there, but {"select":"changeset.description"} is returning null.
{
"from":"repo",
"select":["push.user","changeset"],
"where":{"and":[
{"eq":{"branch.name":"try"}},
{"gte":{"push.date":{"date":"today-week"}}}
]},
"sort":{"push.user":"desc"},
"limit":10
}
The repo is not storing the description separate from the whole document.
build.type and run.type are multivalued, and the metadata is not updating.
...even if it is null
{
"from": "meta.columns",
"select": "cardinality",
"where": {
"and": [
{
"eq": {
"table": "fx-test"
}
},
{
"eq": {
"name": "result.ok"
}
}
]
}
}
I guess the subdomain people.mozilla.org doesn't exist anymore, so the JSON formatter link isn't working on this page: https://github.com/klahnakoski/ActiveData/blob/master/docs/GettingStarted.md
The _normalize function is past its prime; it used to simplify Boolean expressions, but now that the ES filter language has changed, it no longer works. Remove it, and remove any calls using it. Use .partial_eval() on an expression before converting with to_esfilter.
The docs mention a query into unittest.run.files
https://github.com/mozilla/ActiveData/blob/dev/docs/jx_tutorial.md#select-clause
{
"from": "unittest.run.files",
"select": ["run.stats.bytes","run.files.url"],
"where": {"and": [
{"eq": {"build.platform": "linux64"}},
{"gt": {"run.stats.bytes": 600000000}}
]}
}
which appears to fail
Call to ActiveData failed
File ESQueryRunner.js, line 33, in ActiveDataQuery
File thread.js, line 247, in Thread_prototype_resume
File thread.js, line 226, in retval
File Rest.js, line 46, in Rest.send/ajaxParam.error
File Rest.js, line 100, in Rest.send/request.onreadystatechange
caused by Error while calling /query
caused by Bad response (400)
caused by Should not happen
File __init__.py, line 156, in _index
File __init__.py, line 123, in __init__
File jx_usingES.py, line 85, in __init__
File __init__.py, line 64, in wrapper
File __init__.py, line 92, in wrap_from
File jx.py, line 60, in jx_query
File __init__.py, line 55, in output
File app.py, line 1625, in dispatch_request
File app.py, line 1639, in full_dispatch_request
File app.py, line 1988, in wsgi_app
File app.py, line 2000, in __call__
File serving.py, line 181, in execute
File serving.py, line 193, in run_wsgi
File serving.py, line 251, in handle_one_request
File BaseHTTPServer.py, line 340, in handle
File serving.py, line 216, in handle
File SocketServer.py, line 655, in __init__
File SocketServer.py, line 334, in finish_request
File SocketServer.py, line 599, in process_request_thread
File threading.py, line 766, in run
File threading.py, line 813, in __bootstrap_inner
File threading.py, line 786, in __bootstrap
caused by run.files
File __init__.py, line 152, in _index
File __init__.py, line 123, in __init__
File jx_usingES.py, line 85, in __init__
File __init__.py, line 64, in wrapper
File __init__.py, line 92, in wrap_from
File jx.py, line 60, in jx_query
File __init__.py, line 55, in output
File app.py, line 1625, in dispatch_request
File app.py, line 1639, in full_dispatch_request
File app.py, line 1988, in wsgi_app
File app.py, line 2000, in __call__
File serving.py, line 181, in execute
File serving.py, line 193, in run_wsgi
File serving.py, line 251, in handle_one_request
File BaseHTTPServer.py, line 340, in handle
File serving.py, line 216, in handle
File SocketServer.py, line 655, in __init__
File SocketServer.py, line 334, in finish_request
File SocketServer.py, line 599, in process_request_thread
File threading.py, line 766, in run
File threading.py, line 813, in __bootstrap_inner
File threading.py, line 786, in __bootstrap
Please check if this is a problem with the new cluster, and if so, ensure a test is made
This query does not work. The biggest problem is that the (?parse?) error is not getting back to the redash user.
select
count("result.stats.s1")
from
perf
where
"run.timestamp">=date('today-month') and
"run.framework.name" = 'vcs' and
"run.suite"='clone'
group by
floor("run.timestamp", 86400) as "day"
changes made to fix_put branch
The following query shows a count > 0 in the null row when it should not. The cause appears to be in the missing ES filter, which is not inverting scripted values properly.
Please
{
"from":"coverage",
"where":{"and":[
{"prefix":{"source.file.name":"mfbt/"}},
{"eq":{"repo.changeset.id12":"752465b44c79"}}
]},
"groupby":[{
"name":"subdir",
"value":{
"then":{"between":{"source.file.name":["mfbt/","/"]}},
"when":{"start":5,"find":{"source.file.name":"/"}},
"else":{"not_left":{"source.file.name":5}}
}
}],
"select":[
{"aggregate":"count"},
],
"limit":10000
}
Make tests and fix this problem. We suspect it's in the parsing of the in (...)
curl -XPOST http://activedata.allizom.org/sql -d "{\"sql\":\"select task.worker.type, repo.branch.name, run.machine.aws_instance_type from task where repo.branch.name in ('try', 'mozilla-central') LIMIT 10\"}"
Since a column can have a nested and an inner version, verify that insertion uses the nested version for inner singleton objects.