gmcquillan / firetower
Error Classification and Aggregation System
License: MIT License
We store metadata on the backend about each category. Some of it is meant to be modified by the user, so we need to provide form elements that let the following fields be changed:
I'd like to include time to classify on each line along with the method of classification.
It'd be nice to group categories together which are related to the same failure. That way you can view them among the other categories as an aggregate, but also break them up within the group view to see which sub-category has how many errors.
It seems handy for simplifying what's going on.
We need some method for determining how internally consistent a category is. My first thought is to compare every event within a category against the category signature and take the arithmetic mean of the similarity scores. Our efficacy metric could be as simple as the standard deviation of those scores.
Ideally, this would be an offline process so that the classification efforts aren't encumbered. Results could be saved within the category_ids key, which already houses most of the other category metadata.
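A minimal sketch of that metric, assuming some similarity() callable that scores an event against the signature in [0, 1] (the token-overlap function below is just a stand-in for whatever comparison Firetower actually uses):

```python
import statistics

def category_consistency(signature, events, similarity):
    """Mean and standard deviation of event-vs-signature similarity.

    `similarity` is any callable returning a score in [0, 1]; a high
    mean with a low standard deviation suggests a tight category.
    """
    scores = [similarity(signature, event) for event in events]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev

# Toy similarity: fraction of shared tokens (illustrative only).
def token_overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

mean, stdev = category_consistency(
    "error connecting to db",
    ["error connecting to db", "error connecting to db host1"],
    token_overlap,
)
```

Since this only reads the events and the signature, it can run offline on a schedule and write the pair of numbers back alongside the other category metadata.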
I'd like to be able to process events from Kafka as well as logs from disk and emails from imap.
For an example check out: https://gist.github.com/7db14f27c4d115b355b3
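One way to support several sources is to put a common interface in front of them, so the queue-feeding loop doesn't care where events come from. A sketch (the class names are illustrative, not Firetower's actual API):

```python
from abc import ABC, abstractmethod

class EventSource(ABC):
    """Anything that can yield raw event strings for the firetower queue."""

    @abstractmethod
    def events(self):
        """Yield raw event strings, one per event."""

class LogFileSource(EventSource):
    """Reads events line by line from a log file on disk."""

    def __init__(self, path):
        self.path = path

    def events(self):
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")

# Hypothetical KafkaSource and ImapSource classes would implement the
# same interface, wrapping a Kafka consumer and an IMAP mailbox.
def feed_queue(source, enqueue):
    """Drain one source into the queue via the given enqueue callable."""
    for event in source.events():
        enqueue(event)
```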
Put together some ideas for fancy Firetower logo & page design. (I've got some stuff in the works already, but it's still a little rough - will definitely have something to show you in the next couple days, though.)
We at least want the ability to keep a category relatively stable. Letting them wander was an interesting experiment; but I think it'd be better to just stick to the category signatures.
The latest set of changes to the archive_feature branch seem to be passing a Redis object where we expect a Redis.Connection() instance. At least, that's what I suspect. See below.
I did pass in a Redis.Connection instance directly into the static method, which is what the constructor comments lead me to believe it expects. List output of the categories does include three category objects (which is about average for this level of accuracy).
In [1]: from firetower import category
In [2]: import redis
In [3]: conn = redis.Redis()
In [4]: cats = category.Category.get_all_categories(conn)
In [5]: for cat in cats:
...: cat.timeseries.range(0, -1)
...:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (31, 0))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Users/gavin/src/firetower/<ipython console> in <module>()
/Users/gavin/src/firetower/firetower/category.pyc in range(self, start, end)
12
13 def range(self, start, end):
---> 14 return self.redis_conn.zrevrangebyscore(
15 "ts_%s" % self.cat_id, end, start, withscores=True
16 )
AttributeError: 'Redis' object has no attribute 'zrevrangebyscore'
This is a bit of a controversial proposal, but at some point we need to figure out how to move old events out of redis-server, since Redis requires all of its data to fit in memory.
We can go two routes with this:
I'm strongly leaning toward the second procedure, but I'm looking for comments.
Right now, Firetower at UA is taking up about 4G of memory, and generating about as much in log files every day.
Some categories are mostly duplicate. To fix this:
Niall committed this feature, but I've had trouble getting the regex module installed.
Right now the error data is sent to the front end by populating a page template variable. I think it would be more flexible and powerful to make an ajax call that returns the JSON instead. (If nothing else, it makes things easier for me; this is how I'm used to pulling in chart data.) I'll take a stab at setting this up if there are no objections.
I noticed something strange on my larger installation of Firetower (which has about 3380 categories) -- the server is consistently taking up about 16% of a CPU, even when no new events are coming in to be categorized.
This is the strace output for that process:
write(4, "[2012-01-25 20:42] DEBUG: Fireto"..., 108) = 108
sendto(3, "_2\r\n$7\r\nHGETALL\r\n$48\r\ncounter_9d"..., 72, 0, NULL, 0) = 72
recvfrom(3, "_0\r\n", 8192, 0, NULL, NULL) = 4
write(4, "[2012-01-25 20:42] DEBUG: Fireto"..., 108) = 108
sendto(3, "_2\r\n$7\r\nHGETALL\r\n$48\r\ncounter_bb"..., 72, 0, NULL, 0) = 72
recvfrom(3, "*0\r\n", 8192, 0, NULL, NULL) = 4
select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Clearly, it's spending all its time doing HGETALL calls for the various categories to check and see if they have any data to archive. This is hardcoded to happen every 2 seconds. The problem now is that this process takes more than 2 seconds to complete, so it's pretty much constant.
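One piece of a fix is to stop sleeping a fixed 2 seconds regardless of how long the pass took: time each pass and only sleep for whatever remains of the interval. A sketch, where archive_pass is a stand-in for the real per-category archiving work:

```python
import time

def run_archiver(archive_pass, interval=2.0, max_passes=None):
    """Run archive_pass repeatedly, at most once per `interval` seconds.

    If a pass overruns the interval, the next pass starts immediately
    rather than stacking a fixed sleep on top of an already-slow loop.
    `max_passes` exists only so this can be run a bounded number of times.
    """
    passes = 0
    while max_passes is None or passes < max_passes:
        started = time.time()
        archive_pass()
        passes += 1
        elapsed = time.time() - started
        time.sleep(max(0.0, interval - elapsed))
```

This alone won't cut the HGETALL volume; the pass itself also needs to touch fewer categories (e.g. only ones known to have new data), but it keeps the scheduler honest about its own cost.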
Two pronged solution:
So far, we have a couple of input methods for inserting items into the firetower queue.
It'd be nice to have some sort of general purpose service to run on client machines which intercepts events emitted by various logging systems and sends them to Firetower.
An alternative viewpoint would be to create Firetower-specific logging options for each language we'd like to support. I'm not certain which would be more work, but this problem needs to get solved one way or another.
Right now the form values at the bottom of the page are just sitting there without any styling or any protective JS for values (e.g. to warn the user before a value crashes her browser).
Feel free to add more items, but the things that could really use help are these:
Right now, when the statistics are calculated for a series of events, comparisons are run for each and every event in that series. We could save lots of CPU cycles by just sampling a significant fraction of the events in the series randomly.
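A sketch of that sampling using random.sample from the stdlib; the 10% fraction and the floor of 30 events are illustrative numbers, not tuned values:

```python
import random

def sample_events(events, fraction=0.1, minimum=30):
    """Return a random subset of events for statistics calculations.

    Series smaller than the minimum sample size are returned whole,
    so small categories still get exact statistics.
    """
    k = max(minimum, int(len(events) * fraction))
    if k >= len(events):
        return list(events)
    return random.sample(events, k)
```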
Right now, we have some automatic reductions in accuracy depending on the size of a signature. We do this by using a faster, but less accurate algorithm. The result is that we maintain performance, but large signatures basically get placed more or less randomly into large categories.
To fix this, I'm proposing that we also increase the default similarity threshold at each step of the size escalation. Default of 0.5 gets turned into 0.75 for quick, and 0.86 for really_quick.
To make sure we're not making things worse, I'd like to have the category efficacy issue completed first.
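The proposal boils down to a lookup from signature size to an (algorithm, threshold) pair. The thresholds below are the proposed ones (0.5 default, 0.75 for quick, 0.86 for really_quick); the size cutoffs are hypothetical placeholders:

```python
def pick_strategy(signature_length, default_threshold=0.5):
    """Choose comparison algorithm and similarity threshold by signature size.

    As the algorithm gets faster and less accurate, the similarity
    threshold rises to compensate. The 500/2000 character cutoffs
    are illustrative, not the real escalation points.
    """
    if signature_length < 500:
        return ("full", default_threshold)
    elif signature_length < 2000:
        return ("quick", 0.75)
    else:
        return ("really_quick", 0.86)
```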
It'd be useful to know a few pieces of metadata about how the system is running:
More?
Here is a tentative overview of the data structures to be exposed by the firetower api, with corresponding URLs:
Firetower API URLs:
/api/
categories/ - dictionary of categories, indexed by category id, with a dictionary of metadata for each category
categories/timeseries/ - all categories, last 5 minutes' worth of data for each category
categories/timeseries/<start/end get params> - all categories, number of instances returned per category over a specified time range
categories/timeseries/?all=1 - all categories, all data (computationally intensive, should not be default behavior)
category/<cat_id> - category metadata
category/<cat_id>/timeseries/ - last 5 minutes' worth of data for specified category
category/<cat_id>/timeseries/<start/end get params> - specified category, specified time range
category/<cat_id>/events/ - raw text of events/errors/tracebacks for specified category
Data to be exposed by API:
(returned as JSON objects)
/api/categories/ - dictionary of categories, indexed by category id, with a dictionary of metadata for each category
{
    'cat1': {
        "human_readable": "Name of the Category",
        "signature": "Full text of most recent event/traceback classified into this category",
        "threshold": Threshold defined for this category (number)
    },
    'cat2': {
        "human_readable": "Name of the Category",
        "signature": "Full text of most recent event/traceback classified into this category",
        "threshold": Threshold defined for this category (number)
    }
}
/api/categories/timeseries/ - all categories, last 5 minutes' worth of data for each category
Dictionary with cat id's as keys, listing number of events that were classified in that category over the last 5 minutes,
indexed by timestamp (currently per second)
5 minutes is an arbitrary amount of time, to be defined as a default (could be 30 minutes, or any other value).
{
    'cat1': [ (1, 10), (2, 5), (3, 15) ],
    'cat2': [ (1, 7), (2, 14), (3, 8) ]
}
For category 1:
---
cat1: category id
1, 2, 3: timestamps (indexed by second) - will include 5 minutes' worth of data
10, 5, 15: number of instances of errors returned per second for this category
/api/categories/timeseries/<start/end get params> - all categories, number of instances returned per category over a specified time range
Dictionary with cat id's as keys, listing number of events that were classified in that category between specified start and end points,
indexed by timestamp (currently per second) - same format as categories/timeseries/, different time range
If time range specified included 5 seconds:
{
    'cat1': [ (1, 10), (2, 5), (3, 15), (4, 20), (5, 2) ],
    'cat2': [ (1, 7), (2, 14), (3, 8), (4, 7), (5, 12) ]
}
For category 1:
---
cat1: category id
1, 2, 3, 4, 5: timestamps (indexed by second)
10, 5, 15, 20, 2: number of instances of errors returned per second for this category
/api/categories/timeseries/?all=1 - all categories, all data (computationally intensive, should not be default behavior)
Same format as above, just with way more data. :)
/api/category/<cat_id> - dictionary of metadata for a single, specified category
{
    "human_readable": "Name of the Category",
    "signature": "Full text of most recent event/traceback classified into this category",
    "threshold": Threshold defined for this category (number)
}
/api/category/<cat_id>/timeseries/ - last 5 minutes' worth of data for specified category
For a single category, specified in url by category id, returns 5 minutes' worth of data as a series of tuples
of timestamp (currently seconds) and number of instances returned in given category per timestamp.
[ (1, 10), (2, 5), (3, 15) ]
1, 2, 3: timestamps (indexed by second)
10, 5, 15: number of instances of events in given category returned per second
5 minutes is an arbitrary amount of time, to be defined as a default (could be 30 minutes, or any other value).
/api/category/<cat_id>/timeseries/<start/end get params> - number of events returned in specified category over a specified time range
For a single category, specified in url by category id, returns a series of tuples of timestamp (currently seconds)
and number of instances returned in given category per timestamp, between start and end points as defined by get params.
category x (specified in url with a cat_id)
5pm to 8pm (specified in url with start/end timestamp get params)
^^^ if no start/end time defined, returns some set default time range (e.g. 5min)
how many instances were returned per second
[ (1, 10), (2, 5), (3, 15) ]
1, 2, 3: timestamps (indexed by second)
10, 5, 15: number of instances of errors of type x returned per second
/api/category/<cat_id>/events/ - dictionary of raw text of all events/errors/tracebacks for specified category
{
    'cat1': [ 'I am some error text for cat1', 'I am some related error text in the same category' ],
    'cat2': [ 'I am some error text for cat2', 'I am some related error text in the same category' ],
    'cat3': [ 'I am some error text for cat3', 'I am some related error text in the same category' ]
}
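A front end (or any other client) consuming the timeseries shape described above could fold the per-second counts into totals like this; the payload here is just the sample data from the spec, not a live API response:

```python
# Sample payload in the shape of /api/categories/timeseries/.
payload = {
    "cat1": [(1, 10), (2, 5), (3, 15)],
    "cat2": [(1, 7), (2, 14), (3, 8)],
}

def total_events(series):
    """Sum the per-timestamp counts in one category's timeseries."""
    return sum(count for _ts, count in series)

totals = {cat_id: total_events(series) for cat_id, series in payload.items()}
```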
We should allow people to use the same redis installation as other projects, or other firetower installs. There needs to be a 'db' key that specifies which integer db value we're using for our keyspace.
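A sketch of how that could look on the connection side, assuming a hypothetical config dict with the proposed 'db' key (the other key names are illustrative too); the resulting kwargs would be passed straight to redis.Redis(...):

```python
# Hypothetical Firetower config with the proposed 'db' key.
config = {"redis_host": "localhost", "redis_port": 6379, "db": 3}

def connection_kwargs(config):
    """Translate the config into keyword args for redis.Redis(...).

    Defaulting db to 0 preserves the current single-tenant behavior
    for installs that don't set the new key.
    """
    return {
        "host": config.get("redis_host", "localhost"),
        "port": config.get("redis_port", 6379),
        "db": config.get("db", 0),
    }
```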