opentsdb's Introduction

       ___                 _____ ____  ____  ____
      / _ \ _ __   ___ _ _|_   _/ ___||  _ \| __ )
     | | | | '_ \ / _ \ '_ \| | \___ \| | | |  _ \
     | |_| | |_) |  __/ | | | |  ___) | |_| | |_) |
      \___/| .__/ \___|_| |_|_| |____/|____/|____/
           |_|    The modern time series database.

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on
top of HBase.  OpenTSDB was written to address a common need: store, index
and serve metrics collected from computer systems (network gear, operating
systems, applications) at a large scale, and make this data easily accessible
and graphable.

Thanks to HBase's scalability, OpenTSDB allows you to collect thousands of
metrics from tens of thousands of hosts and applications, at a high rate
(every few seconds). OpenTSDB will never delete or downsample data and can
easily store hundreds of billions of data points.

OpenTSDB is free software and is available under both LGPLv2.1+ and GPLv3+.
Find out more about OpenTSDB at http://opentsdb.net

opentsdb's People

Contributors

anonthing, benesch, cannium, dieken, erlyfall, eswdd, filippog, hynd, jesse5e, johann8384, johnseekins, jwestfall69, kev009, lburg, looztra, louxiu, manolama, neilfordyce, nickman, oozie, qiubz, qudongfang, rajeshal, thatsafunnyname, tkrajca, tsuna, vitaliyf, vonbirdie, wicknicks, xmj


opentsdb's Issues

my "make" couldn't work with error "No rule to make target"

[root@phxrueidb04 stumbleupon-opentsdb-33dff14]# make
make: *** No rule to make target `.git/HEAD', needed by `src/BuildData.java'.  Stop.

[root@phxrueidb04 stumbleupon-opentsdb-33dff14]# git --version
git version 1.7.5.4
[root@phxrueidb04 stumbleupon-opentsdb-33dff14]# java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
[root@phxrueidb04 stumbleupon-opentsdb-33dff14]# uname -a
Linux phxrueidb04 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@phxrueidb04 stumbleupon-opentsdb-33dff14]# gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)

Add a metric for number of UniqueIds created

Whenever a TSD creates a UniqueId, it should increment a counter for this kind of ID so we can keep track of how many IDs are being created. UniqueId needs an AtomicInteger and a collectStats method that gets called during stats requests.
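
A minimal sketch of what this could look like; the class name, field, and stats output format below are illustrative assumptions, not the actual OpenTSDB code:

import java.util.concurrent.atomic.AtomicInteger;

// Per-kind counter bumped on every UID assignment and reported during stats
// collection (sketch only).
final class UniqueIdStatsSketch {
  private final String kind;  // e.g. "metrics", "tagk" or "tagv"
  private final AtomicInteger created = new AtomicInteger();

  UniqueIdStatsSketch(final String kind) {
    this.kind = kind;
  }

  /** Called whenever a new UID is assigned. */
  void recordCreation() {
    created.incrementAndGet();
  }

  /** Called during a stats request. */
  void collectStats(final StringBuilder buf) {
    buf.append("uid.ids-created ")
       .append(System.currentTimeMillis() / 1000).append(' ')
       .append(created.get())
       .append(" kind=").append(kind).append('\n');
  }
}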

Can't do a graceful restart after in-place update of the .jar

I don't know if this is required by the JLS, but at least on the Sun JVM, classes are loaded lazily. I know that classes are initialized lazily (i.e. their <clinit> method gets called when the class is first referenced by the code), but the Sun JVM seems to read the class definition from the .jar lazily too. The problem is that, apparently, the Sun JVM doesn't keep a file descriptor open on the .jar, so when the .jar gets unlinked and re-created with different contents, the JVM attempts to read from the new file. Because the file is probably different, the JVM will fail to load the classes. If you're lucky, you'll get an unexpected java.lang.NoClassDefFoundError during the graceful shutdown. If you're not so lucky, the JVM will just outright segfault (yay).

In order to work around this JVM bug, the code needs to make sure it references all the classes that are needed for the shutdown sequence when it's doing the initialization sequence. For instance, after updating a .jar and doing a graceful restart, I get a java.lang.NoClassDefFoundError: net/opentsdb/tsd/RpcHandler$DieDieDie$1ShutdownTSDB, so presumably that class needs to be referenced during the startup sequence in order to make sure it gets loaded early on. I think by default the Sun JVM will not unload / garbage collect unused classes, IIRC an explicit flag is required to trigger this behavior.

I think asynchbase needs a similar workaround.
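
As a hedged sketch of the workaround described above, the startup sequence could force-load the classes needed for shutdown while the original .jar is still on disk; the class name below is taken from the error message, and the helper itself is only an illustration:

// Force-load shutdown-path classes early, while the .jar is still intact.
final class ShutdownClassPreloader {
  static void preloadShutdownClasses() {
    try {
      Class.forName("net.opentsdb.tsd.RpcHandler$DieDieDie$1ShutdownTSDB");
    } catch (ClassNotFoundException e) {
      throw new RuntimeException("Failed to preload a shutdown class", e);
    }
  }
}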

Add an API and RPC to empty the UniqueId caches

When renaming a UniqueId, one needs to make sure it wasn't already in use by a TSD, otherwise the TSD is going to dislike the fact that a UID is shared by two different names. In such a situation, one is forced to restart the TSD, which is !cool. The UniqueId class needs a method to empty the cache, and the TSD needs an RPC to call this method for a given kind of UniqueId.
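
A hedged sketch of the cache-dropping method; the field names are illustrative and do not reflect the actual UniqueId internals:

import java.util.concurrent.ConcurrentHashMap;

// Sketch of a UID cache that can be emptied on demand so a renamed UID is
// re-read from HBase the next time it is looked up.
final class UidCacheSketch {
  private final ConcurrentHashMap<String, byte[]> nameToId =
      new ConcurrentHashMap<String, byte[]>();
  private final ConcurrentHashMap<String, String> idToName =
      new ConcurrentHashMap<String, String>();

  /** Empties both in-memory caches for this kind of UID. */
  void dropCaches() {
    nameToId.clear();
    idToName.clear();
  }
}

The TSD-side RPC would then just call this method for the requested kind of UniqueId (metrics, tag names or tag values).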

Switch to HTML5 doctype

Right now the TSD uses this doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
It should instead be changed to use the HTML5 doctype:
<!DOCTYPE html>

Improper error handling when there's no data on one of two axes

When plotting a graph involving 2 axes, Gnuplot will fail with an error if there's no data on either axis (e.g. "all points y2 value undefined!"). When there's no data at all, the code properly serves an empty graph with a "No Data" label in the middle. The code should be modified to gracefully handle the cases where only one of the two axes has no data. Ideally we'd want to see the graph with whatever data there is for the first axis.

Put data points over HTTP

Some people would like to have the choice to "put" their data points over HTTP instead of using the simple telnet-like protocol of the TSD. This issue is about adding a /put HTTP handler to receive data points.

To be defined: what format to put the data points in? Should we JSONify them? Should we just use the body of a POST request with the same format as we use in the telnet-like protocol?

For the sake of simplicity and code re-use, I'm leaning towards the last option. Discussion open.
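
For illustration, a hedged sketch of what a client could look like with the last option, reusing the telnet-format line as the POST body; the /put endpoint is the proposal being discussed here, not an existing API:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public final class HttpPutSketch {
  public static void main(final String[] args) throws Exception {
    final String body = "sys.cpu.user 1288946927 42.5 host=web01 cpu=0\n";
    final HttpURLConnection conn =
        (HttpURLConnection) new URL("http://tsd-host:4242/put").openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    final OutputStream out = conn.getOutputStream();
    out.write(body.getBytes("UTF-8"));
    out.close();
    System.out.println("HTTP " + conn.getResponseCode());  // hypothetical endpoint
  }
}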

Customize the format of the X axis

Right now the TSD automatically picks a format string based on the time range covered by the query. While this is a nice default, we should also allow users to pass in a custom format string.

Automatically fix URLs mangled by Gmail

It seems that Gmail is mangling certain TSD links when people compose emails in HTML. Specifically it seems that it urlencodes characters after the first { and the last ], so m=my.metric{host=foo}&yrange=[0:]&wxh=1475x585&png erroneously becomes m=my.metric{host%3Dfoo}%26yrange%3D[0%3A]&wxh=1475x585&png. The goal of this issue is to automatically detect links mangled by Gmail and 301 them to the correct URL.
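
A naive, hedged sketch of the detection (one way it could work, not the actual TSD handler): undo the specific percent-encodings Gmail introduces and redirect to the cleaned-up URL with a 301.

public final class GmailUrlFixSketch {
  /** Returns the cleaned query string, or null if no mangling was detected. */
  static String unmangle(final String query) {
    if (!query.contains("%3D") && !query.contains("%26") && !query.contains("%3A")) {
      return null;  // nothing to fix, serve the request normally
    }
    return query.replace("%3D", "=").replace("%26", "&").replace("%3A", ":");
  }

  public static void main(final String[] args) {
    System.out.println(unmangle("m=my.metric{host%3Dfoo}%26yrange%3D[0%3A]&wxh=1475x585&png"));
    // prints: m=my.metric{host=foo}&yrange=[0:]&wxh=1475x585&png
  }
}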

lower boundary (1200) on query-based datapoints retrieval

When performing the following query:

final HashMap<String, String> tagsAny = new HashMap<String, String>();
tagsAny.put("type", "*");
final Query qAAAAny = tsdb.newQuery();
qAAAAny.setTimeSeries("AAA", tagsAny, null, false);
qAAAAny.setStartTime(1199);
qAAAAny.setEndTime(11000000);
final DataPoints[] dps = qAAAAny.run();

With a start time of 1200 or greater, the query returns the correct set of datapoints. However, with a start time of 1199 (as above), the query does not return any datapoints.

Graph beginning and ending can have strange values

[starting to document known bugs]

When graphing data, TSD grabs datapoints before and after the graph interval in order to correctly handle aggregation and downsampling. For some graphs, though, this doesn't seem to be working correctly: what should be plotted as a flat line instead looks like a butte, i.e. with a ramp up at the beginning and a large ramp down at the end.

Note that if you're plotting against "now" as your right edge, it's very hard to do the right thing anyway, since not all of your datapoints will have arrived yet. For example, if you are plotting hits on your 10 webservers that report data every minute, you won't yet have all the datapoints for the final minute. With Aggregator=avg it won't look too bad, but with Aggregator=sum the most current minute will fall off toward zero on the right side. That is not a bug. However, we do sometimes see the right end of the graph spike up high, which looks like a possibly separate bug.

Browser history support

It would be nice if the UI created a new history token so that people could pass around the URL to the UI instead of the URL to the graph itself.

In addition to this, when no format is requested on a query (no &json or &png in the URL), the TSD should 302 (or maybe 301) back to the UI with the appropriate history token set. This way it would be possible to fire up the dashboard of a graph based on the URL to the image.

Aggregation of specific tag combinations

Right now if you specify tag=v1|v2 you always get two lines, one for v1 and one for v2. It would be nice to be able to specify specific tag combinations but still have them be aggregated. So sum:foo{tag=v1|v2} would somehow still produce a single line, the sum of foo{tag=v1} and foo{tag=v2}. We'd need some way to express this in a query.

It was suggested on the mailing list to use a semicolon to separate tag values that need to be aggregated together. Quoting the email:

On Sat, Jan 22, 2011 at 3:17 PM, pgillan wrote:

As far as the syntax goes, my thought was that you wouldn't necessarily need a ${MAGIC} variable, you could just use a different delimiter between the tags to represent how the tags should be graphed. m=avg:cpu{server_id=server1|server2|server3} would result in three distinct lines, while m=avg:cpu{server_id=server1;server2;server3} would give you a single line that represents all three servers, where the avg could also be sum. That also opens up the possibility of doing m=avg:cpu{server_id=server1;server2|server3;server4}, which would give you two lines, one that is the average of server1 and server2, and another that is the average of server3 and server4. I don't think there'd be issues with precedence as long as ; is always processed before |. Note that I like + better than ;, but + in URLs gets interpreted as a space, so it probably wouldn't work right.
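
To illustrate the precedence the email proposes (; binds tighter than |), here is a hedged parsing sketch; this syntax is only a proposal and is not implemented:

import java.util.Arrays;

public final class TagGroupParseSketch {
  /** Splits a proposed tag value such as "server1;server2|server3;server4" into groups. */
  static String[][] parseGroups(final String value) {
    final String[] groups = value.split("\\|");  // | separates the plotted lines
    final String[][] result = new String[groups.length][];
    for (int i = 0; i < groups.length; i++) {
      result[i] = groups[i].split(";");           // ; members are aggregated together
    }
    return result;
  }

  public static void main(final String[] args) {
    for (final String[] group : parseGroups("server1;server2|server3;server4")) {
      System.out.println(Arrays.toString(group));  // [server1, server2] then [server3, server4]
    }
  }
}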

Rate per variable unit of time

When querying the data points for a specific metric, one has the ability to indicate that the metric represents a rate.
The results then represent the rate per second and currently one can't customize this behavior.

It would be very useful to be able to also specify the time unit and interval to apply for the rate calculation.
Examples of rate parameters:

  • rate per 15 minutes
  • rate per hour
  • rate per day
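
For example, assuming the per-second rate stays the underlying calculation, a "rate per 15 minutes" would presumably just be that value scaled by 900 (3,600 for an hour, 86,400 for a day).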

Comments welcome.

Bug in the server-side filtering code for UIDs containing the byte 0x5C

Whenever a UniqueId happens to contain the byte 0x5C, the server-side scanner filter created by TsdbQuery.addId mistakenly escapes it. It shouldn't do that, because that's not how \Q...\E sequences work. There is no escaping possible in them. The only tricky case to handle is when there's a literal \E inside the sequence to escape.
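
For illustration, a hedged sketch of the standard way to embed arbitrary text in a \Q...\E block: escape nothing except a literal \E, which is handled by closing the quote, emitting it, and reopening the quote (the same trick Pattern.quote uses). This is shown as an example of the technique, not as the actual TsdbQuery.addId fix:

public final class RegexQuoteSketch {
  static String quoteForRegex(final String s) {
    final StringBuilder buf = new StringBuilder("\\Q");
    int start = 0;
    int idx;
    while ((idx = s.indexOf("\\E", start)) >= 0) {
      // Close the quote, emit a literal \E, and reopen the quote.
      buf.append(s, start, idx).append("\\E\\\\E\\Q");
      start = idx + 2;
    }
    return buf.append(s.substring(start)).append("\\E").toString();
  }

  public static void main(final String[] args) {
    System.out.println(quoteForRegex("foo\\Ebar"));  // \Qfoo\E\\E\Qbar\E
  }
}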

Cache query data separately from image

The current cache seems to cover the full query, including the custom labels, positioning of the key, size of the image, etc. Can we cache the result set obtained after summarizing the HBase result set, so that things like repositioning the legend don't query HBase again?

incorrect X-axis value plotted

In my case, the start time is set to 2011/02/06-12:00:00 and the To is "now".

However, in the result graph, the X-axis is plotted like 2:00, 3:00, 4:00 instead of 12:00, 13:00 and 14:00.

Add a "compare with <n> <period> ago" feature to the UI

Main feature: Often when I look at a graph I want to do something like compare the current graph with the same period 1 day/week/month/year ago to either compare growth, look for patterns, see whether the graph I'm looking at shows abnormal behaviour, ...

It should probably work similarly to the tags on the UI: when you fill in one period to compare, it creates another box where you can add a second period. There would be two fields: one for the period and one for the count (the "1" in my basic description).

I'm not sure anything less than "day" is useful, but I guess it depends on the graphs people look at.

Supplementary feature: Possibly less useful, but probably easy to implement while doing the above, is being able to easily select multiple comparison periods, so you could compare a graph for today with the same graph 1 week ago, 2 weeks ago, ... If you select "7" "days" you get 7 comparison overlays, one for each day, while if you choose "1" "weeks" you get one comparison overlay, the same graph 1 week ago.

Maybe from a UI minimalist perspective this would be a checkbox "repeat" (default off) that changes the interpretation of the fields for period and count?

ERROR HBaseClient: Lost connection with the -ROOT- region

I cloned the newest version and ran it with --zkquorum; it fails with "ERROR HBaseClient: Lost connection with the -ROOT- region".

The old version (forked from this repo about 3 months ago) shows no error and runs fine.
Some version info:

  • hbase cluster: hadoop-hbase-0.90.1
  • hadoop-zookeeper-server-3.3.3+12.1-1
  • hadoop-zookeeper-3.3.3+12.1-1

OS platform: CentOS 5.4 x86_64

Here is the error info:

2011-06-24 14:21:14,826 INFO  [main] ZooKeeper: Initiating client connection, connectString=loghub sessionTimeout=5000 watcher=org.hbase.async.HBaseClient$ZKClient@7d95d4fe
2011-06-24 14:21:14,841 INFO  [main-SendThread()] ClientCnxn: Opening socket connection to server loghub/10.65.10.112:2181
2011-06-24 14:21:14,847 INFO  [main-SendThread(loghub:2181)] ClientCnxn: Socket connection established to loghub/10.65.10.112:2181, initiating session
2011-06-24 14:21:14,896 INFO  [main-SendThread(loghub:2181)] ClientCnxn: Session establishment complete on server loghub/10.65.10.112:2181, sessionid = 0x1303e6e1cd50033, negotiated timeout = 5000
2011-06-24 14:21:14,913 INFO  [main-EventThread] HBaseClient: Connecting to -ROOT- region @ 10.65.10.112:60020
2011-06-24 14:21:14,967 INFO  [main-EventThread] ZooKeeper: Session: 0x1303e6e1cd50033 closed
2011-06-24 14:21:14,968 INFO  [New I/O client worker #1-1] HBaseClient: Added client for region RegionInfo(table=".META.", region_name=".META.,,1", stop_key=""), which was added to the regions cache.  Now we know that RegionClient@1748234462(chan=[id: 0x2e00e753, /10.65.10.112:62873 => /10.65.10.112:60020], #pending_rpcs=0, #edits=0, #rpcs_inflight=0) is hosting 1 region.
2011-06-24 14:21:14,970 INFO  [New I/O client worker #1-1] HBaseClient: Added client for region RegionInfo(table="tsdb", region_name="tsdb,,1307527656400.5c3932cb176044d6898f315f0244383e.", stop_key=[0, 0, 7, 77, -36, -41, -48, 0, 0, 1, 0, 0, 58, 0, 0, 4, 0, 0, 4, 0, 0, 5, 0, 0, 6]), which was added to the regions cache.  Now we know that RegionClient@1748234462(chan=[id: 0x2e00e753, /10.65.10.112:62873 => /10.65.10.112:60020], #pending_rpcs=0, #edits=0, #rpcs_inflight=0) is hosting 2 regions.
2011-06-24 14:21:14,971 INFO  [New I/O client worker #1-1] HBaseClient: Lost connection with the -ROOT- region
2011-06-24 14:21:15,399 INFO  [Hashed wheel timer #1] HBaseClient: Need to find the -ROOT- region
2011-06-24 14:21:15,399 INFO  [Hashed wheel timer #1] ZooKeeper: Initiating client connection, connectString=loghub sessionTimeout=5000 watcher=org.hbase.async.HBaseClient$ZKClient@7d95d4fe
2011-06-24 14:21:15,400 INFO  [Hashed wheel timer #1-SendThread()] ClientCnxn: Opening socket connection to server loghub/10.65.10.112:2181
2011-06-24 14:21:15,401 INFO  [Hashed wheel timer #1-SendThread(loghub:2181)] ClientCnxn: Socket connection established to loghub/10.65.10.112:2181, initiating session
2011-06-24 14:21:15,432 INFO  [Hashed wheel timer #1-SendThread(loghub:2181)] ClientCnxn: Session establishment complete on server loghub/10.65.10.112:2181, sessionid = 0x1303e6e1cd50034, negotiated timeout = 5000
2011-06-24 14:21:15,433 INFO  [Hashed wheel timer #1-EventThread] HBaseClient: Connecting to -ROOT- region @ 10.65.10.112:60020
2011-06-24 14:21:15,436 INFO  [New I/O client worker #1-2] HBaseClient: Added client for region RegionInfo(table=".META.", region_name=".META.,,1", stop_key=""), which was added to the regions cache.  Now we know that RegionClient@1687192366(chan=[id: 0x36f0b7f8, /10.65.10.112:62875 => /10.65.10.112:60020], #pending_rpcs=0, #edits=0, #rpcs_inflight=0) is hosting 1 region.
2011-06-24 14:21:15,437 INFO  [New I/O client worker #1-2] HBaseClient: Added client for region RegionInfo(table="tsdb", region_name="tsdb,,1307527656400.5c3932cb176044d6898f315f0244383e.", stop_key=[0, 0, 7, 77, -36, -41, -48, 0, 0, 1, 0, 0, 58, 0, 0, 4, 0, 0, 4, 0, 0, 5, 0, 0, 6]), which was added to the regions cache.  Now we know that RegionClient@1687192366(chan=[id: 0x36f0b7f8, /10.65.10.112:62875 => /10.65.10.112:60020], #pending_rpcs=0, #edits=0, #rpcs_inflight=0) is hosting 2 regions.
2011-06-24 14:21:15,438 INFO  [New I/O client worker #1-2] HBaseClient: Lost connection with the -ROOT- region
2011-06-24 14:21:15,446 INFO  [Hashed wheel timer #1-EventThread] ZooKeeper: Session: 0x1303e6e1cd50034 closed
2011-06-24 14:21:16,457 INFO  [Hashed wheel timer #1] HBaseClient: Need to find the -ROOT- region

Allow returning data points in JSON / JSONP

Right now the only way to extract data from a TSD is to make an &ascii query. This bug is about adding a way of returning data in JSON and JSONP to make it easier to integrate the TSD with other web apps.

Arithmetic expressions in queries

OK this one is non-trivial but much wanted. Instead of plotting one time series, I'd like to be able to do some simple arithmetic operations on multiple time series. For instance divide the rate of errors by the rate of requests to get the ratio of failures. Or find the percentage of disk space used using the disk space available and the disk space used.

This requires major changes in the query handling path, so it's going to happen in some sort of "v1.5".

Custom color palettes and other style customizations

We should allow people to explicitly pass a list of color names (e.g. red,green,blue,cyan) to cycle through. In addition, Gnuplot offers various style customizations (e.g. various thickness levels for the lines, lines without the little symbols, dashed/dotted lines etc.) and we should expose these settings.

Caching bug for queries with an end time in the future

Scenario we ran into today:

  • Yesterday, at 7pm, someone requested: /q?start=2011/04/19-00:00:00&end=2011/04/20-00:00:00&m=my.metric&ascii
  • TSD fetched the data up until 7pm and wrote it to its disk cache.
  • Because the end time (midnight) was 5h "in the future", the TSD marked the ASCII response as un-cacheable.
  • Today, someone made the same request again.
  • This time, the end time was "in the past", so the max_age was set to 86400 seconds (1 day).
  • GraphHandler.isDiskCacheHit found the cached .txt file and called staleCacheFile to verify whether it could use it.
  • GraphHandler.staleCacheFile saw that there was an end date specified and incorrectly returned false.

The problem here is that GraphHandler.staleCacheFile should do something like this:

  • If the end time of the query is in the past, ensure that the mtime of the cached file is greater than or equal to the end time of the query.
  • If the end time of the query is in the future, ensure that the current time minus the mtime of the file is less than max_age.
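
A hedged sketch of those two rules; the signature is simplified and the parameter names are illustrative, not the actual GraphHandler code:

final class StaleCacheSketch {
  /** Returns true if the cached file can no longer be used for this query. */
  static boolean staleCacheFile(final long end_time,  // query end time (seconds)
                                final long max_age,   // allowed staleness (seconds)
                                final long mtime) {   // cached file mtime (seconds)
    final long now = System.currentTimeMillis() / 1000;
    if (end_time <= now) {
      // End time in the past: only usable if written after the whole range elapsed.
      return mtime < end_time;
    }
    // End time in the future: fall back to a plain freshness check.
    return now - mtime >= max_age;
  }
}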

Thanks to Anoakie Turner for reporting the bug.

OpenTSDB has no proper way to get installed: need to autotoolize it

Right now the "recommended" way of deploying OpenTSDB is to clone the repo from GitHub, compile it and run some startup script... Far from ideal :-/

I would like to add a small configure script (generated by GNU Autoconf, obviously) and move the Makefile to a GNU Automake Makefile.am in order to re-use all the existing generated rules in Automake (e.g. make install, etc). This will also make it trivial to build distro packages (such as .deb or .rpm packages) since most distros can automagically package projects that are properly autotoolized.

The goals of this issue are to:

  • Add a configure script (to detect where java, javac, md5 or md5sum and whatnot are).
  • Convert the Makefile to GNU Automake.
  • Make sure that make install Does The Right Thing.

Bonus points for jarjar'ing all the dependencies into one .jar, although this could be done as part of a separate issue.

Bugs when different types of UniqueIds use different width

Right now, the 3 types of unique IDs (managed by the UniqueId class) are hard-coded to be on 3 bytes. We ran into some bugs at StumbleUpon when using 8 bytes for the tag value IDs. This issue is about merging the bugfixes into the public tree. Eventually each unique ID type should be configurable through command-line flags.

Return uncached results

Please provide a way for consumers to request uncached results, i.e. re-process the data from HBase and return the result instead of serving it from the cache.

Need better throttling logic during batch imports

TextImporter has some logic to throttle itself when HBase isn't keeping up. This typically happens when a region splits, because a whole key range that's actively being written to goes offline for several seconds. The existing logic in TextImporter is flawed because when asynchbase throws a PleaseThrottleException, it waits for the next edit to complete successfully before moving forward. The problem with that approach is that the next edit might not hit the region that's being split, so the next edit might complete almost immediately. The other issue is that with high write throughput (e.g. over 150k edits/s), throttling needs to kick in immediately, otherwise the application will run itself out of memory by buffering too many edits.

One mitigation strategy is to throttle the whole import until the RPC that triggered the PleaseThrottleException completes, instead of until whichever one is the next edit completes. This strategy isn't applicable for a user-facing application server, but it's suitable for TextImporter as stalling the whole batch import is acceptable.
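
A hedged sketch of that mitigation in asynchbase-callback style; the exact signatures should be checked against the asynchbase release in use, and error handling is omitted:

import com.stumbleupon.async.Callback;
import com.stumbleupon.async.Deferred;
import org.hbase.async.HBaseClient;
import org.hbase.async.PleaseThrottleException;
import org.hbase.async.PutRequest;

final class ImportThrottleSketch {
  private volatile Deferred<?> throttle_on;  // non-null means: pause the import loop

  void importEdit(final HBaseClient client, final PutRequest edit) throws Exception {
    final Deferred<?> pending = throttle_on;
    if (pending != null) {
      pending.join();  // wait for the RPC that asked us to throttle, not the next edit
      throttle_on = null;
    }
    client.put(edit).addErrback(new Callback<Object, Exception>() {
      public Object call(final Exception e) throws Exception {
        if (e instanceof PleaseThrottleException) {
          // Remember the deferred of the RPC that triggered throttling;
          // the import loop will stall on it before sending the next edit.
          throttle_on = ((PleaseThrottleException) e).getDeferred();
          return null;
        }
        throw e;
      }
    });
  }
}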

Putting time series values with decimals and without decimals causes exception

Request failed: Internal Server Error java.lang.ClassCastException: value #14 is not a float in RowSeq(0, 0, 4, 77, 2, 17, -112, 0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 0, 9, base_time=1291981200 (Fri Dec 10 03:40:00 PST 2010), [+468:float(84.2750015258789), +469:float(90.9000015258789), +470:float(78.1500015258789), +471:float(69.3499984741211), +472:float(67.66666412353516), +474:float(59.849998474121094), +475:float(62.04999923706055), +476:float(51.900001525878906), +477:float(76.5250015258789), +478:float(91.88333129882812), +479:float(94.07777404785156), +480:float(83.0999984741211), +481:float(68.36666870117188), +482:float(81.5199966430664), +483:long(77), +484:float(68.5111083984375), +485:float(33.29999923706055), +486:float(68.43333435058594), +487:float(77.9800033569336), +488:float(81.5), +489:float(83.71111297607422), +490:float(94.19999694824219), +491:float(75.19999694824219), +492:float(76.45999908447266), +493:float(81.32857513427734), +494:float(77.9888916015625), +495:float(97.4000015258789), +496:long(77), +497:float(95.0999984741211), +498:float(92.78333282470703), +499:float(87.82499694824219), +500:float(84.1500015258789), +501:float(67.67500305175781), +502:float(64.80000305175781), +504:float(73.88999938964844), +506:float(110.19999694824219), +507:float(106.82499694824219), +508:float(88.63333129882812), +509:float(84.2125015258789), +510:float(76.7699966430664), +511:float(52.75), +512:float(51.32500076293945), +513:float(69.25), +514:float(67.0999984741211), +516:float(82.5), +517:float(88.18000030517578), +518:float(83.80000305175781), +519:float(87.69999694824219), +520:float(116.5), +521:float(113.30000305175781), +522:long(98), +523:float(94.04285430908203), +524:float(91.25555419921875), +525:float(74.5), +526:float(68.5999984741211), +527:float(63.959999084472656), +528:float(72.77143096923828), +529:float(71.34444427490234)])

at net.opentsdb.core.RowSeq.doubleValue(RowSeq.java:288) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.core.Span$DownsamplingIterator.nextDoubleValue(Span.java:508) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.core.Aggregators$Avg.runDouble(Aggregators.java:172) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.core.Span$DownsamplingIterator.next(Span.java:418) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.core.SpanGroup$SGIterator.<init>(SpanGroup.java:462) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.core.SpanGroup.iterator(SpanGroup.java:224) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.core.SpanGroup.iterator(SpanGroup.java:49) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.graph.Plot.dumpToFiles(Plot.java:169) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.tsd.GraphHandler.runGnuplot(GraphHandler.java:590) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.tsd.GraphHandler$RunGnuplot.execute(GraphHandler.java:244) ~[tsdb-1.0.jar:50d2c35]
at net.opentsdb.tsd.GraphHandler$RunGnuplot.run(GraphHandler.java:231) ~[tsdb-1.0.jar:50d2c35]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_22]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_22]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_22]

To reproduce:

Start a new metric, and put values like 1.1135, 5.135, etc. Then put a value like 5 (no decimal). If you try to graph it you get the above error. This is an issue if you have a collector pushing stats which sends floats when there is a fractional part but doesn't append .0 to whole numbers. It's also very hard to recover from without blowing away the whole metric.
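
A hedged sketch of a collector-side workaround (not a TSD fix): always emit values with a decimal point so a series never mixes integer and float encodings. The metric name and format are illustrative:

import java.util.Locale;

public final class PutFormatSketch {
  /** Formats a value so it always carries a decimal point, e.g. 5 becomes "5.0000". */
  static String formatValue(final double value) {
    return String.format(Locale.ROOT, "%.4f", value);
  }

  public static void main(final String[] args) {
    System.out.println("put my.metric 1291981200 " + formatValue(5) + " host=web01");
  }
}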

source file layout

A lot of classes are in the net.opentsdb.tsd package but the files reside in src/tsd/*.

Creating src/net/opentsdb and having the files in there would make the layout more standard and also IDE friendly.

inverse coloration support

Please add support for white (or another arbitrary color) graphs and labels on a black background, for low-energy / low-brightness use.

Allow counter increments over TSDs

People frequently need to count discrete observations over time. Let's say you want to track ad clicks. The recommended way of doing this is to have your application server maintain these counters and write the value of the counter every N seconds to TSDB (preferably through tcollector). In certain cases, this is impractical. For instance, people writing stuff in PHP have trouble maintaining state across requests to achieve something like this. They end up maintaining the counter in another system (such as memcache) and having an out-of-band mechanism to collect the counters from that other system and hand them to a TSD every N seconds. This feature request is about allowing people to do without that separate system: let them send counter increments directly to a TSD (any TSD) and let the TSD figure it out and efficiently store the counters in HBase.

What's needed here is a new RPC type for counter increments. The RPC needs to specify the usual stuff (metric, tags) but, in addition to the timestamp, the TSD needs to know, somehow, what interval between data points is wanted. This way the TSD will maintain the counter (in HBase, using atomic increments) and store its value as a data point every N seconds. We need to define exactly how we want this to work.
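
A hedged sketch of the storage side of this idea: maintain the counter in HBase with atomic increments and snapshot it as a regular data point every N seconds. The table, family, and row-key layout below are purely illustrative; the actual design is still to be defined:

import com.stumbleupon.async.Deferred;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.HBaseClient;

final class CounterIncrementSketch {
  private final HBaseClient client;

  CounterIncrementSketch(final HBaseClient client) {
    this.client = client;
  }

  /** Handles one "increment" RPC for a given metric + tags combination. */
  Deferred<Long> increment(final byte[] counter_row, final long amount) {
    return client.atomicIncrement(new AtomicIncrementRequest(
        "tsdb-counters".getBytes(),  // hypothetical table
        counter_row,                 // hypothetical key built from metric + tag UIDs
        "t".getBytes(),              // hypothetical column family
        "value".getBytes(),          // hypothetical qualifier
        amount));
  }
}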

Allow the UI to give up on a slow query

In the UI, when a query takes too long to run, you can't tell the UI to give up and run another query instead. If you change networks and your browser loses the connection, it seems that the AJAX call never completes or that no callback is triggered to tell the UI about the failure or that something somewhere is eating the error. When this happens, you have to refresh the entire page, very annoying.

The idea is that whenever a query takes more than a few seconds to run, add a little button next to the "Loading" message that will cancel the query. Note the cancellation is only for the UI, there's no way to tell a remote HTTP server to cancel a query (you can always close the socket on your side, but I don't even know if it's possible to ask the browser to close the socket on which we made the AJAX call, and the server might not realize that the socket is closed until it's ready to send the response anyway).

Don't drop data points with duplicate but identical tags

The TSD checks that a tag only appears once on each data point. In the event that a data point comes in with a conflicting duplicate tag (e.g. my.metric 1234567890 42 foo=bar foo=qux), the TSD will throw an IllegalArgumentException (which is relayed back to the client) and discard the data point. But when the duplicate is entirely identical (e.g. my.metric 1234567890 42 foo=bar foo=bar), the TSD could simply log a warning every once in a while (not for every data point, as that risks flooding the logs) and still keep the data point.
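
A hedged sketch of the relaxed check: reject a conflicting duplicate tag, tolerate an identical one. The map type and method name are illustrative:

import java.util.HashMap;

public final class DuplicateTagSketch {
  static void putTag(final HashMap<String, String> tags,
                     final String name, final String value) {
    final String previous = tags.put(name, value);
    if (previous != null && !previous.equals(value)) {
      throw new IllegalArgumentException("duplicate tag " + name
          + " with conflicting values: " + previous + " vs " + value);
    }
    // previous == null: first occurrence.  previous.equals(value): identical
    // duplicate, keep the data point (and occasionally log a warning).
  }

  public static void main(final String[] args) {
    final HashMap<String, String> tags = new HashMap<String, String>();
    putTag(tags, "foo", "bar");
    putTag(tags, "foo", "bar");  // tolerated
    putTag(tags, "foo", "qux");  // throws IllegalArgumentException
  }
}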

Counter handling is incomplete

The way TSD graphs rate counters right now is suboptimal. It's kind of a hack. The Y-range of the graph is started at 0, so negative values are not seen. The problem is that if a counter resets, the rate of that time series will go hugely negative for that data point. If you have a bunch of time series sum-aggregated together and one resets, the aggregation will result in a huge false drop in value. This is especially bad for counters on services you often restart in a rolling fashion (e.g. restarting all your apache web servers).

Basically "Rate" is being abused here.. it is intended to be used to say "calculate the rate of change of this metric", but it also sort of works also as a quick way to plot a counter.

What I think the ideal proposed solution is to have two buttons in the UI

One which does 'Rate', which keeps the current behavior.

Add a 'Counter' button which says "this metric is a counter". If set, during the pass through the datapoints, if any datapoint is less than the previous one, it replaces that value with "NaN" or throws away the datapoint so it is missing.
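
A hedged sketch of that 'Counter' pass: mask any sample whose value is lower than its predecessor before rates are computed. This illustrates the idea only, not the TSD's actual iterator code:

public final class CounterResetSketch {
  static double[] maskResets(final double[] values) {
    final double[] out = values.clone();
    for (int i = 1; i < out.length; i++) {
      if (values[i] < values[i - 1]) {
        out[i] = Double.NaN;  // counter reset detected: hide this point
      }
    }
    return out;
  }

  public static void main(final String[] args) {
    final double[] counter = {100, 150, 210, 3, 60};  // reset after 210
    for (final double v : maskResets(counter)) {
      System.out.println(v);  // 100.0, 150.0, 210.0, NaN, 60.0
    }
  }
}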

The Rate calculator thing is still handy, as it allows you to plot the rate of change of any metric. Adding counter handling first allows you to plot the rate of change of even counter metrics. For example, you can plot the rate of change of apache hits, or network packets, etc.

This doesn't fix all of the usual problems with handling counters, but does handle most of them. The rest are probably best handled by the collector. For example, a common problem with SNMP counters with things like network devices set in failover mode is that you can't always guarantee that you're polling the same device. Some network devices are brain-dead and can't let you configure a management IP that stays fixed per device (Cisco ASA firewalls, I'm looking at you). The only proper fix is to prefix every SNMP counter lookup with a deviceid lookup (and tag the time series with deviceid instead of IP address). There's still a race condition here, though. I should probably add this to an FAQ.

Graphing functions

Currently there's no way to plot one time series against another. There are some obvious use cases for this. For example iostat's method of computing disk utilization involves comparing disk io time against idle cpu time.

The plan is to have a general-purpose way to do functional plots of one metric against another.

Annotated timelines

We would like to be able to annotate time series and show annotations on the graphs somehow. We need both a programmatic API (so that for instance release tools can automatically label on the graphs whenever things are pushed) and a bit of UI in the TSD's web interface to allow people to add annotations.

TBD:

  • Where and how to store the annotations (could be stored along with the time series data to improve locality)
  • Do the annotations annotate one specific time series (one combination of metric and tags) or to one metric (regardless of the tags)?

Separate data sink from web interface

Currently, the same port (default 4242) is used both for the HTTP interface and also for uploading data. There needs to be a way to start either of these services independently.

Inconsistent/odd behavior with aggregating to 'max'

This may affect other aggregation methods, but I noticed it with max. In my data series, I have one timeseries that has a value of 9 at a certain time. This value occurs once.

I use: Aggregation method max, downsample for 1 minute intervals with the max function.

The data returned shows that the value of 9 is repeated for 45 minutes. This is wrong, because the data point only existed once. When you downsample with 1m intervals, only one such bucket should show a value of 9.

Derived iostat metrics

iostat derives a few values from the counters we already collect in iostat.py such as (from the iostat manual):

  • await: The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
  • svctm: The average service time (in milliseconds) for I/O requests that were issued to the device.
  • %util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

These are to be taken with a grain of salt (a lot of people really believe that %util accurately reflects the percentage of utilization of the I/O subsystem), but they would be useful to have anyway.

For reference, the code of iostat is here.

Allow the character `/' in tag values.

There's no reason why / cannot be allowed in tag values, and there are a few places where it would be useful (e.g. the df collector where we store mount points). To keep the implementation simple, we'll probably also allow / in metric names and tag names, as I believe they share the same validation regexp.

SuggestBox won't show all the metric names when those metrics only differ in case

Currently, if there exist two metrics that differ only in case, the latter one will NEVER get suggested in TSD's UI.

For example, proc.meminfo.active from tcollector's procstat.py collector was mistakenly created as proc.meminfo.Active (with a capital A). TSD's UI will then always suggest the first metric, ignoring that there is another metric in lowercase; the only way to plot the graph is to type the full metric name, which is quite inconvenient.

Axis labelling, 2nd axis

The current graphing stuff makes it hard to distinguish when a given plot is using the first or second Y axis if you're just looking at the resulting image.

To make this easier to understand, the plan is to let you specify a text label to apply to each axis so that based on the metric name in the legend you can figure out which axis is the appropriate one.

Tags.validateString gives unnecessary work to the GC

The call to s.toCharArray() in Tags.validateString is causing lots of memory copies in the fast path. Instead, take the length of the string and use charAt(i) to avoid copying all the strings to char arrays.
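
A hedged sketch of the suggested change: iterate with charAt instead of materializing a char[] copy. The validation rule shown (letters, digits, and a few punctuation characters) is only illustrative; the real rule lives in Tags.validateString:

public final class ValidateStringSketch {
  static void validateString(final String what, final String s) {
    final int n = s.length();
    for (int i = 0; i < n; i++) {
      final char c = s.charAt(i);  // no char[] copy, so nothing extra for the GC
      if (!(Character.isLetterOrDigit(c) || c == '-' || c == '_' || c == '.')) {
        throw new IllegalArgumentException(
            "Invalid " + what + " (\"" + s + "\"): illegal character: " + c);
      }
    }
  }

  public static void main(final String[] args) {
    validateString("metric name", "proc.loadavg.1min");  // OK
    validateString("metric name", "bad metric");          // throws (space)
  }
}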
