gmcquillan / firetower
Error Classification and Aggregation System
License: MIT License
We store metadata on the backend about each category. Some of it is meant to be modified by the user, so we need to provide form elements that let the following fields be changed:
I'd like to include time to classify on each line along with the method of classification.
It'd be nice to group categories together which are related to the same failure. That way you can view them among the other categories as an aggregate, but also break them up within the group view to see which sub-category has how many errors.
It seems handy for simplifying what's going on.
We need some method for determining how internally consistent a category is. My first thought is to compare every event within a category against the category signature and take the arithmetic mean of the similarity scores. Our efficacy metric could be as simple as the standard deviation of those scores.
Ideally, this would be an offline process so that the classification efforts aren't encumbered. Results could be saved within the category_ids key, which already houses most of the other category metadata.
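A minimal sketch of that metric, assuming some similarity() callable that scores an event against the signature in [0, 1] (the token-overlap function below is just a stand-in for whatever comparison Firetower actually uses):

```python
import statistics

def category_consistency(signature, events, similarity):
    """Mean and standard deviation of event-vs-signature similarity.

    `similarity` is any callable returning a score in [0, 1]; a high
    mean with a low standard deviation suggests a tight category.
    """
    scores = [similarity(signature, event) for event in events]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev

# Toy similarity: fraction of shared tokens (illustrative only).
def token_overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

mean, stdev = category_consistency(
    "error connecting to db",
    ["error connecting to db", "error connecting to db host1"],
    token_overlap,
)
```

Since this only reads the events and the signature, it can run offline on a schedule and write the pair of numbers back alongside the other category metadata.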
I'd like to be able to process events from Kafka as well as logs from disk and emails from imap.
For an example check out: https://gist.github.com/7db14f27c4d115b355b3
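One way to support several sources is to put a common interface in front of them, so the queue-feeding loop doesn't care where events come from. A sketch (the class names are illustrative, not Firetower's actual API):

```python
from abc import ABC, abstractmethod

class EventSource(ABC):
    """Anything that can yield raw event strings for the firetower queue."""

    @abstractmethod
    def events(self):
        """Yield raw event strings, one per event."""

class LogFileSource(EventSource):
    """Reads events line by line from a log file on disk."""

    def __init__(self, path):
        self.path = path

    def events(self):
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")

# Hypothetical KafkaSource and ImapSource classes would implement the
# same interface, wrapping a Kafka consumer and an IMAP mailbox.
def feed_queue(source, enqueue):
    """Drain one source into the queue via the given enqueue callable."""
    for event in source.events():
        enqueue(event)
```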
Put together some ideas for fancy Firetower logo & page design. (I've got some stuff in the works already, but it's still a little rough - will definitely have something to show you in the next couple days, though.)
We at least want the ability to keep a category relatively stable. Letting them wander was an interesting experiment; but I think it'd be better to just stick to the category signatures.
The latest set of changes to the archive_feature branch seem to be passing a Redis object where we expect a Redis.Connection() instance. At least, that's what I suspect. See below.
I did pass in a Redis.Connection instance directly into the static method, which is what the constructor comments lead me to believe it expects. List output of the categories does include three category objects (which is about average for this level of accuracy).
In [1]: from firetower import category
In [2]: import redis
In [3]: conn = redis.Redis()
In [4]: cats = category.Category.get_all_categories(conn)
In [5]: for cat in cats:
...: cat.timeseries.range(0, -1)
...:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (31, 0))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Users/gavin/src/firetower/<ipython console> in <module>()
/Users/gavin/src/firetower/firetower/category.pyc in range(self, start, end)
12
13 def range(self, start, end):
---> 14 return self.redis_conn.zrevrangebyscore(
15 "ts_%s" % self.cat_id, end, start, withscores=True
16 )
AttributeError: 'Redis' object has no attribute 'zrevrangebyscore'
This is a bit of a controversial proposal, but at some point we need to figure out how to move old events out of redis-server, since Redis requires all of its data to fit in memory.
We can go two routes with this:
I'm strongly leaning toward the second procedure, but I'm looking for comments.
Right now, Firetower at UA is taking up about 4G of memory, and generating about as much in log files every day.
Some categories are mostly duplicate. To fix this:
Niall committed this feature, but I've had trouble getting the regex module installed.
Right now the error data is sent to the front end by populating a page template variable. I think it would be more flexible and powerful to make an ajax call that returns the JSON instead. (If nothing else, it makes things easier for me; this is how I'm used to pulling in chart data.) I'll take a stab at setting this up if there are no objections.
I noticed something strange on my larger installation of Firetower (which has about 3380 categories) -- the server is consistently taking up about 16% of a CPU, even when no new events are coming in to be categorized.
This is the strace output for that process:
write(4, "[2012-01-25 20:42] DEBUG: Fireto"..., 108) = 108
sendto(3, "_2\r\n$7\r\nHGETALL\r\n$48\r\ncounter_9d"..., 72, 0, NULL, 0) = 72
recvfrom(3, "_0\r\n", 8192, 0, NULL, NULL) = 4
write(4, "[2012-01-25 20:42] DEBUG: Fireto"..., 108) = 108
sendto(3, "_2\r\n$7\r\nHGETALL\r\n$48\r\ncounter_bb"..., 72, 0, NULL, 0) = 72
recvfrom(3, "*0\r\n", 8192, 0, NULL, NULL) = 4
select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Clearly, it's spending all its time doing HGETALL calls for the various categories to check and see if they have any data to archive. This is hardcoded to happen every 2 seconds. The problem now is that this process takes more than 2 seconds to complete, so it's pretty much constant.
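One piece of a fix is to stop sleeping a fixed 2 seconds regardless of how long the pass took: time each pass and only sleep for whatever remains of the interval. A sketch, where archive_pass is a stand-in for the real per-category archiving work:

```python
import time

def run_archiver(archive_pass, interval=2.0, max_passes=None):
    """Run archive_pass repeatedly, at most once per `interval` seconds.

    If a pass overruns the interval, the next pass starts immediately
    rather than stacking a fixed sleep on top of an already-slow loop.
    `max_passes` exists only so this can be run a bounded number of times.
    """
    passes = 0
    while max_passes is None or passes < max_passes:
        started = time.time()
        archive_pass()
        passes += 1
        elapsed = time.time() - started
        time.sleep(max(0.0, interval - elapsed))
```

This alone won't cut the HGETALL volume; the pass itself also needs to touch fewer categories (e.g. only ones known to have new data), but it keeps the scheduler honest about its own cost.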
Two pronged solution:
So far, we have a couple of input methods for inserting items into the firetower queue.
It'd be nice to have some sort of general purpose service to run on client machines which intercepts events emitted by various logging systems and sends them to Firetower.
An alternative viewpoint would be to create Firetower-specific logging options for each language we'd like to support. I'm not certain which would be more work, but this problem needs to get solved one way or another.
Right now the form values at the bottom of the page are just sitting there without any styling or any protective JS for values (e.g. to warn the user before a value crashes her browser).
Feel free to add more items, but the things that could really use help are these:
Right now, when the statistics are calculated for a series of events, comparisons are run for each and every event in that series. We could save lots of CPU cycles by just sampling a significant fraction of the events in the series randomly.
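A sketch of that sampling using random.sample from the stdlib; the 10% fraction and the floor of 30 events are illustrative numbers, not tuned values:

```python
import random

def sample_events(events, fraction=0.1, minimum=30):
    """Return a random subset of events for statistics calculations.

    Series smaller than the minimum sample size are returned whole,
    so small categories still get exact statistics.
    """
    k = max(minimum, int(len(events) * fraction))
    if k >= len(events):
        return list(events)
    return random.sample(events, k)
```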
Right now, we have some automatic reductions in accuracy depending on the size of a signature. We do this by using a faster, but less accurate algorithm. The result is that we maintain performance, but large signatures basically get placed more or less randomly into large categories.
To fix this, I'm proposing that we also increase the default similarity threshold at each step of the size escalation. Default of 0.5 gets turned into 0.75 for quick, and 0.86 for really_quick.
To make sure we're not making things worse, I'd like to have the category efficacy issue completed first.
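The proposal boils down to a lookup from signature size to an (algorithm, threshold) pair. The thresholds below are the proposed ones (0.5 default, 0.75 for quick, 0.86 for really_quick); the size cutoffs are hypothetical placeholders:

```python
def pick_strategy(signature_length, default_threshold=0.5):
    """Choose comparison algorithm and similarity threshold by signature size.

    As the algorithm gets faster and less accurate, the similarity
    threshold rises to compensate. The 500/2000 character cutoffs
    are illustrative, not the real escalation points.
    """
    if signature_length < 500:
        return ("full", default_threshold)
    elif signature_length < 2000:
        return ("quick", 0.75)
    else:
        return ("really_quick", 0.86)
```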
It'd be useful to know a few pieces of metadata about how the system is running:
More?
Here is a tentative overview of the data structures to be exposed by the firetower api, with corresponding URLs:
Firetower API URLs:
/api/
categories/ - dictionary of categories, indexed by category id, with a dictionary of metadata for each category
categories/timeseries/ - all categories, last 5 minutes' worth of data for each category
categories/timeseries/<start/end get params> - all categories, number of instances returned per category over a specified time range
categories/timeseries/?all=1 - all categories, all data (computationally intensive, should not be default behavior)
category/<cat_id> - category metadata
category/<cat_id>/timeseries/ - last 5 minutes' worth of data for specified category
category/<cat_id>/timeseries/<start/end get params> - specified category, specified time range
category/<cat_id>/events/ - raw text of events/errors/tracebacks for specified category
Data to be exposed by API:
(returned as JSON objects)
/api/categories/ - dictionary of categories, indexed by category id, with a dictionary of metadata for each category
{
    'cat1': {
        "human_readable": "Name of the Category",
        "signature": "Full text of most recent event/traceback classified into this category",
        "threshold": Threshold defined for this category (number)
    },
    'cat2': {
        "human_readable": "Name of the Category",
        "signature": "Full text of most recent event/traceback classified into this category",
        "threshold": Threshold defined for this category (number)
    }
}
/api/categories/timeseries/ - all categories, last 5 minutes' worth of data for each category
Dictionary with cat id's as keys, listing number of events that were classified in that category over the last 5 minutes,
indexed by timestamp (currently per second)
5 minutes is an arbitrary amount of time, to be defined as a default (could be 30 minutes, or any other value).
{
    'cat1': [ (1, 10), (2, 5), (3, 15) ],
    'cat2': [ (1, 7), (2, 14), (3, 8) ]
}
For category 1:
---
cat1: category id
1, 2, 3: timestamps (indexed by second) - will include 5 minutes' worth of data
10, 5, 15: number of instances of errors returned per second for this category
/api/categories/timeseries/<start/end get params> - all categories, number of instances returned per category over a specified time range
Dictionary with cat id's as keys, listing number of events that were classified in that category between specified start and end points,
indexed by timestamp (currently per second) - same format as categories/timeseries/, different time range
If time range specified included 5 seconds:
{
    'cat1': [ (1, 10), (2, 5), (3, 15), (4, 20), (5, 2) ],
    'cat2': [ (1, 7), (2, 14), (3, 8), (4, 7), (5, 12) ]
}
For category 1:
---
cat1: category id
1, 2, 3, 4, 5: timestamps (indexed by second)
10, 5, 15, 20, 2: number of instances of errors returned per second for this category
/api/categories/timeseries/?all=1 - all categories, all data (computationally intensive, should not be default behavior)
Same format as above, just with way more data. :)
/api/category/<cat_id> - dictionary of metadata for a single, specified category
{
    "human_readable": "Name of the Category",
    "signature": "Full text of most recent event/traceback classified into this category",
    "threshold": Threshold defined for this category (number)
}
/api/category/<cat_id>/timeseries/ - last 5 minutes' worth of data for specified category
For a single category, specified in url by category id, returns 5 minutes' worth of data as a series of tuples
of timestamp (currently seconds) and number of instances returned in given category per timestamp.
[ (1, 10), (2, 5), (3, 15) ]
1, 2, 3: timestamps (indexed by second)
10, 5, 15: number of instances of events in given category returned per second
5 minutes is an arbitrary amount of time, to be defined as a default (could be 30 minutes, or any other value).
/api/category/<cat_id>/timeseries/<start/end get params> - number of events returned in specified category over a specified time range
For a single category, specified in url by category id, returns a series of tuples of timestamp (currently seconds)
and number of instances returned in given category per timestamp, between start and end points as defined by get params.
category x (specified in url with a cat_id)
5pm to 8pm (specified in url with start/end timestamp get params)
^^^ if no start/end time defined, returns some set default time range (e.g. 5min)
how many instances were returned per second
[ (1, 10), (2, 5), (3, 15) ]
1, 2, 3: timestamps (indexed by second)
10, 5, 15: number of instances of errors of type x returned per second
/api/category/<cat_id>/events/ - dictionary of raw text of all events/errors/tracebacks for specified category
{
    'cat1': [ 'I am some error text for cat1', 'I am some related error text in the same category' ],
    'cat2': [ 'I am some error text for cat2', 'I am some related error text in the same category' ],
    'cat3': [ 'I am some error text for cat3', 'I am some related error text in the same category' ]
}
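A front end (or any other client) consuming the timeseries shape described above could fold the per-second counts into totals like this; the payload here is just the sample data from the spec, not a live API response:

```python
# Sample payload in the shape of /api/categories/timeseries/.
payload = {
    "cat1": [(1, 10), (2, 5), (3, 15)],
    "cat2": [(1, 7), (2, 14), (3, 8)],
}

def total_events(series):
    """Sum the per-timestamp counts in one category's timeseries."""
    return sum(count for _ts, count in series)

totals = {cat_id: total_events(series) for cat_id, series in payload.items()}
```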
We should allow people to use the same redis installation as other projects, or other firetower installs. There needs to be a 'db' key that specifies which integer db value we're using for our keyspace.
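A sketch of how that could look on the connection side, assuming a hypothetical config dict with the proposed 'db' key (the other key names are illustrative too); the resulting kwargs would be passed straight to redis.Redis(...):

```python
# Hypothetical Firetower config with the proposed 'db' key.
config = {"redis_host": "localhost", "redis_port": 6379, "db": 3}

def connection_kwargs(config):
    """Translate the config into keyword args for redis.Redis(...).

    Defaulting db to 0 preserves the current single-tenant behavior
    for installs that don't set the new key.
    """
    return {
        "host": config.get("redis_host", "localhost"),
        "port": config.get("redis_port", 6379),
        "db": config.get("db", 0),
    }
```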