sourcegraph / appdash
Application tracing system for Go, based on Google's Dapper.
Home Page: https://sourcegraph.com
License: Other
Anywhere key/value pairs of spans are displayed, they are ordered randomly.
Hi, not really an issue, but we have created a Ruby client and were wondering if you wanted to mention that in your documentation. Cheers, dim
I've yet to try it out for real, but if the OpenTracing API turns out to be nice in real applications, we could in theory make Appdash operate purely as an implementation of the OpenTracing API. I.e. there would not be a separate Appdash API; Appdash would just be a tracer designed specifically for OpenTracing.
I've marked this issue as long-term, because this would be a serious change and would probably be done in multiple phases (unexporting the current API, simplifying the internals w.r.t. marshaling events, etc). I'm also not 100% convinced yet that this would be the most ideal or appropriate approach, but it is something I am considering longer term.
AggregateStore is the most complex of the Appdash storage backends. Unlike MemoryStore, which just collects traces, AggregateStore both collects traces and aggregates some data to provide useful stats within the Appdash Dashboard page (like slowest average trace, etc).
It became clear to me after @bg451's comment that I have not done a great job conveying the overall direction or problems of storage backends in Appdash. The major issues seen today are:

- MemoryStore is very simple / lightweight, but doesn't support the Dashboard page (no aggregated data / metrics about traces). It's good, for example, in testing or within a lightweight CLI application.
- AggregateStore has a number of serious problems:
In contrast, InfluxDBStore:

- It can be embedded within your Go process entirely, like AggregateStore (no external InfluxDB setup is required); this is the default setup.

Due to the above reasons, and after much hard thought, I can only come to the conclusion that AggregateStore
would slow us down by making the codebase more complex, would mislead new users into using it and thinking Appdash isn't for real-world work, etc.
The intent is to bring this project forward for all Appdash users, and make app tracing better than ever before. I don't take the decision to remove existing code in an incompatible way lightly, but do find this to be the best path forward.
We are not currently using v2 of the client API.
Every once in a while Travis CI fails due to a flaky test; we should investigate why. It seems that most PRs fail due to this right off the bat, but requeuing the build seems to fix it.
I noticed when visiting the /traces page that the order of traces changes randomly upon refreshing the page. It would be nice if it were consistent across multiple reloads, so long as the data doesn't change.
This is probably due to a map (which has no specific iteration order) being used somewhere within the template. Sorting would be a good fix.
Right now there are no details or examples about how apptrace works and how it is implemented.
It would be good to have some docs so people like me can check how useful the package would be for our use cases and what it would take to integrate it.
Ideally, InfluxDBStore.Collect is as fast as reasonably possible. Right now, with examples/cmd/webapp-influxdb, I've noticed Collect times in the range of 60-200ms (just eyeballing it, I could be off by a bit).
If Collect cannot complete in under 50ms we lose trace data, because we cannot let trace data build up in memory forever (a memory leak), nor can it block pending HTTP requests. 50ms is, ideally, an upper time bound for Collect (hopefully most Collect calls are much quicker).
To measure this, I've added some hacky timing debug information to influxdb_store.go and changed the webapp-influxdb command. You can try my test branch issue131, or see my changes here: https://github.com/sourcegraph/appdash/compare/issue131 (note: this branch is just PR #127 and #131 merged, then f552611 applied on top).
Run the example app cleanly:
go install ./examples/cmd/webapp-influxdb/ && rm -rf ~/.influxdb/ && webapp-influxdb
Then, using the vegeta HTTP load-testing tool, perform 1 HTTP request/sec for 8s:
echo "GET http://localhost:8699/" | vegeta attack -duration=8s -rate=1 | vegeta report
You should observe some logs that look like:
InfluxDBStore.Collect -> in.con.Write took 364.948424ms
InfluxDBStore.Collect -> took 367.329577ms
appdash: 2016/04/05 10:55:54 ChunkedCollector: queue entirely dropped (trace data will be missing)
appdash: 2016/04/05 10:55:54 ChunkedCollector: queueSize:3 queueSizeBytes:2133
...
InfluxDBStore.Collect -> in.con.Write took 65.655258ms
InfluxDBStore.Collect -> took 68.063186ms
appdash: 2016/04/05 10:56:00 ChunkedCollector: queue entirely dropped (trace data will be missing)
appdash: 2016/04/05 10:56:00 ChunkedCollector: queueSize:3 queueSizeBytes:3440
Note that in.con.Write takes most of the time spent during Collect, i.e. the Collect function itself is not very expensive, but writing to InfluxDB via in.con.Write is!
I think this is because Collect is inherently a very small operation; at most it will write a single InfluxDB data point. Consider our code:
// A single point represents one span.
pts := []influxDBClient.Point{*p}
bps := influxDBClient.BatchPoints{
	Points:   pts,
	Database: in.dbName,
}
_, writeErr := in.con.Write(bps)
if writeErr != nil {
	return writeErr
}
return nil
We only write a single point to InfluxDB, and this becomes very expensive because InfluxDB does not handle small writes well, adding a large overhead (50-200ms) to each one. However, from my tests InfluxDB can write a very large batch of points (500+) in almost the same amount of time (50-200ms).
I think the solution here is to make InfluxDBStore.Collect append to an internal slice, so that it queues up an entire batch of points, and then after some period of time writes them to InfluxDB in a background goroutine. Important aspects would be:
The apptrace project is being renamed to appdash, to avoid naming conflicts with other things out there.
CCing everyone who's posted an issue or contributed so far. Sorry for the abrupt change, but we figured it's better to do it quick and early instead of waiting.
The new import path is sourcegraph.com/sourcegraph/appdash.
I will close this issue when the rename is complete (in an hour or so).
/cc @ernesto-jimenez @thoward @slimsag @samertm @beyang @gbbr
If my assumptions are correct that both webapp and webapp-influxdb do the same thing and should generate a basically identical trace structure, I think the hierarchy that InfluxDBStore reassembles traces/spans in is incorrect.
Steps to reproduce:
- Run webapp and visit http://localhost:8699, click on the trace, then click on the Verbose Data View tab.
- Stop webapp, run webapp-influxdb.
What happens?
webapp-influxdb displays Client.Foo fields when viewing the root of the trace (at e.g. http://localhost:8700/traces/419d201ab0ce01d4).
What should happen?
It should only display Client.Foo fields if you have clicked to view a sub-span of the trace, e.g. by clicking localhost:8699/endpoint (290ms) on the http://localhost:8700/traces/<trace-id>/<sub-span-id> page.
Notes:
I just tested to ensure fuzzy searching works with most HTTP-based metadata, and it looks like it's broken when keys contain periods, e.g. Response.StatusCode. It seems to be because Fuse identifies those keys as accessors into sub-data, i.e. obj.Response.StatusCode in JS.
We probably need to replace periods in the keys with underscores, e.g. Response_StatusCode, so that Fuse doesn't misinterpret our intent.
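The proposed sanitization is a one-liner wherever keys are prepared for the search index; a sketch (fuseSafeKey is a hypothetical helper name, not an existing function):

```go
package main

import (
	"fmt"
	"strings"
)

// fuseSafeKey rewrites dotted annotation keys so that Fuse.js does
// not treat them as property-path accessors (obj.Response.StatusCode).
func fuseSafeKey(key string) string {
	return strings.Replace(key, ".", "_", -1)
}

func main() {
	fmt.Println(fuseSafeKey("Response.StatusCode")) // Response_StatusCode
}
```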
I haven't run very far with this idea, so this is just a thought dump for now. But I've been considering how Appdash could reach the most languages, and it stands to reason that we could gather support for a large number of languages by exposing a gRPC service which would literally be the OpenTracing API itself.
Then, from a user's perspective, they could either directly use this gRPC client from their language of choice OR we could even implement opentracing-python, opentracing-java, etc by simply calling out to this gRPC service.
This is interesting, because it could give many other tracers, which do not wish to spend significant time implementing tracing clients in various languages, automatic support if they were to expose the same OpenTracing-compatible gRPC service. This could also be done over e.g. HTTP or other transports; I just chose gRPC because I am most familiar with it.
The httptrace server middleware writes to the HTTP response at line 85 in 9dd479d.
Track issue reported by @chris-ramon, with a fix already sent upstream at influxdata/influxdb#6202 -- close once merged.
Hi there. I think it would be helpful to have a BUILD (or similar) file that could give folks who are new to Go a quick start on how to build this. E.g. check that your Go version is at least X, watch out for Y (like package internal), etc.
Currently it's very tricky to utilize Appdash from a Martini application, because httptrace primarily exposes a Negroni HTTP middleware, which Martini doesn't seem to support at all.
We need to determine: what is the best way to use a Negroni HTTP middleware from within a Martini app? I imagine many others will run into this question in the future.
A hacky workaround for now: https://gist.github.com/slimsag/a7e1de60844656ec6a65
EDIT: Updated (Mar 7) with recent findings.
Running appdash demo or examples/cmd/webapp and clicking on a sub-span from the trace brings you to the page for that sub-span. Strangely, there appear to be double results for the metadata present:
I tracked down why and how this occurs, and in fact it's not duplicate data at all (it's data from the client sending the request and the server responding to it) -- it's mostly working as intended:
- The client requests /endpoint with a Span-ID header (correct).
- /endpoint checks for the Span-ID header and records directly to it (correct).
The issue is that the keys do not make a distinction between Client versus Server, which is confusing when looking at the data. Consider trying to track down a bug in which the Go HTTP client for some unknown reason always reports that the server responded with a 404 status code:
Key | Value |
---|---|
Response.StatusCode | 200 |
Response.StatusCode | 404 |
It's not clear to the reader whether the server responded with 404 or the client received a 404. A better presentation would be:
Key | Value |
---|---|
Server.Response.StatusCode | 200 |
Client.Response.StatusCode | 404 |
Aha! Now we know the server responded with 200, but the client for some reason got a 404.
I've put the Verbose Data View into a gist for prying eyes (see here).
TL;DR: We should probably prefix client vs. server annotations to make this distinction clear.
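A sketch of what such prefixing could look like (prefixAnnotations is a hypothetical helper; Appdash's real annotation types differ):

```go
package main

import "fmt"

// prefixAnnotations disambiguates identical keys recorded by both
// sides of an HTTP call by prepending "Client." or "Server.".
func prefixAnnotations(side string, annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations))
	for k, v := range annotations {
		out[side+"."+k] = v
	}
	return out
}

func main() {
	server := prefixAnnotations("Server", map[string]string{"Response.StatusCode": "200"})
	client := prefixAnnotations("Client", map[string]string{"Response.StatusCode": "404"})
	// Now a reader can tell which side reported which status code.
	fmt.Println(server, client)
}
```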
In Go 1.5, the internal/ package rule is enforced. When trying to build appdash, the build fails with this message:
package sourcegraph.com/sourcegraph/appdash/cmd/appdash
imports github.com/cznic/mathutil
imports github.com/elazarl/go-bindata-assetfs
imports github.com/gogo/protobuf/io
imports github.com/gogo/protobuf/proto
imports github.com/gorilla/context
imports github.com/gorilla/mux
imports github.com/jessevdk/go-flags
imports sourcegraph.com/sourcegraph/appdash
imports sourcegraph.com/sourcegraph/appdash/internal/wire
imports sourcegraph.com/sourcegraph/appdash/internal/wire
imports sourcegraph.com/sourcegraph/appdash/internal/wire: use of internal package not allowed
I think this means you'd have to move cmd/appdash somewhere that is a root of internal. Perhaps the easiest thing is to move ./internal to ./cmd/appdash/internal, although it's not very elegant.
I use gRPC in my microservices; can appdash trace gRPC requests?
Would it be possible to implement a tracing client in other languages? e.g. Could I use this in a Scala/Finatra app? What would it take to implement that sort of client?
Observation: A lot of traces we have in Sourcegraph originate from machine actions (bots), rather than user actions. Machine traces are less valuable than what happens from a direct user action (i.e. a pageload/XHR request trace is usually more valuable than a trace of a worker asking for a job). Machine traces are also higher volume. We get into situations where the appdash service degrades or traces expire too soon, and dashboards / lists of traces can be overpopulated with traces originating from machines.
What can we do so we get a better experience when investigating or discovering traces from user actions?
sourcegraph.com/sourcegraph/appdash/cmd/appdash $ go build
../../../../../github.com/influxdata/influxdb/tsdb/engine/tsm1/int.go:260: cannot use d.values[:](type []uint64) as type *[240]uint64 in argument to simple8b.Decode
../../../../../github.com/influxdata/influxdb/tsdb/engine/tsm1/timestamp.go:204: cannot use simple8b.NewDecoder(nil) (type *simple8b.Decoder) as type simple8b.Decoder in field value
Perhaps I'm misunderstanding the code, or for some other reason it's not a big deal, but it seems to me that the model being used to collect traces and keep track of spans means that connection pooling with persistent connections would not be easy to do.
I'm looking at the example application at https://github.com/sourcegraph/appdash/blob/master/examples/cmd/webapp/main.go#L103
httpClient := &http.Client{
	Transport: &httptrace.Transport{
		Recorder: appdash.NewRecorder(span, collector),
		SetName:  true,
	},
}
and a new client is being created each time the handler is called. I don't see any way around this, because the span is tightly tied to the transport.
We never had pagination on the Traces page because our DB was so naive; with InfluxDBStore we should be able to add this easily.
Depending on how it evolves, it'd be great to adhere to the gokit tracing spec. Looks like appdash hits all of the points so far except Zipkin compat, which should be doable.
This issue can remain open as a tracking issue.
Use a continuous query to downsample our data and inherently make the Dashboard much much faster.
Steps to reproduce:
- Extract the .influxdb directory (download influxdb.tar.gz) into your user home directory. This DB was generated by starting a Sourcegraph server instance and visiting the homepage of the app, i.e. it contains legitimate trace data (which works with MemoryStore etc).
- Run webapp-influxdb and visit http://localhost:8700/traces
What is seen?
A "maximum number of retries" error from InfluxDBStore.
What is expected?
Notes:
- findTraceParent returns nil: https://github.com/sourcegraph/appdash/blob/master/influxdb_store.go#L628
- root.Sub does not yet exist! root.Sub does not yet exist because that is what addChildren is trying to solve :) (a contradiction)

It was not clear to me what timespans meant.
When running appdash on the command line, custom events don't render. This is because nothing in the application can invoke RegisterEvent with the new event type.
I have a few ideas on how we can fix this.
As a fast-and-easy hack for now: you can copy the source for cmd/appdash (or just modify it directly) so that a call to RegisterEvent for your type is made.
Reported by @joeshaw over Slack
Here is what running cmd/appdash should look like:
And here is what it ends up looking like (because the custom event type is not registered):
Hi, if I'm not misreading anything, the httptrace package only supports Negroni? Is there a way to directly get an HTTP HandlerFunc, or do I have to wrap it up myself?
Thanks!
It would be very useful if AggregateStore also provided functionality similar to LimitStore/RecentStore.
I want to try out appdash but encountered a bug in influxdb influxdata/influxdb#6445
It would be great if there were release binaries available and a container up on quay.io for me to use as well.
Thanks for the great looking project, looking forward to trying it out.
From CI logs:
--- FAIL: TestCancelRequest (0.02s)
client_test.go:142: got &url.Error{Op:"Get", URL:"http://example.com/foo", Err:(*http.httpError)(0xc82000b860)}, want Get http://example.com/foo: net/http: request canceled while waiting for connection
If it pops up again we can investigate.
The example demonstrates a 1d retention policy, but the Dashboard needs 72hr (as in, it has a hard-coded default of a 72-hour timeline / GUI widgets).
When the slider reads "X-Y hours ago", the query that is sent to the server is really "(72-X)-(72-Y)" hours ago. So for example, when the slider is set to "0-2 hours ago" the data that is reported is from 72 to 70 hours ago.
This is a possible enhancement, non-critical and low priority.
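To make the inversion concrete, a tiny sketch (function names are mine) of the buggy mapping versus the direct mapping for a 72-hour timeline:

```go
package main

import "fmt"

const timelineHours = 72

// buggyRange reproduces the behavior described above: a slider
// reading of "X to Y hours ago" is inverted against the 72-hour
// timeline before being sent to the server.
func buggyRange(x, y int) (int, int) { return timelineHours - x, timelineHours - y }

// fixedRange passes the slider values through unchanged.
func fixedRange(x, y int) (int, int) { return x, y }

func main() {
	bx, by := buggyRange(0, 2)
	fmt.Printf("slider 0-2h ago, buggy query: %d-%dh ago\n", bx, by) // 72-70
	fx, fy := fixedRange(0, 2)
	fmt.Printf("slider 0-2h ago, fixed query: %d-%dh ago\n", fx, fy) // 0-2
}
```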
Right now it seems there's no handling of the case where Flash is not available in the user's browser (either not installed, not available, or disabled).
Clicking "Copy as JSON" does nothing in that scenario.
If Flash is not available, perhaps something else should happen. Disable copying as JSON? Display a message saying "Sorry, Flash is needed for this functionality"? Display a text box popup that the user can copy/paste from? Something else?
As an example of one way to handle this scenario, this is how GitHub looks when you don't have Flash:
And this is how it appears when you do:
GET /dashboard/data?start=0&end=72: error: handler panic
runtime error: invalid memory address or nil pointer dereference
goroutine 22 [running]:
runtime/debug.Stack(0x0, 0x0, 0x0)
	C:/Go/src/runtime/debug/stack.go:24 +0x87
sourcegraph.com/sourcegraph/appdash/traceapp.handlerFunc.ServeHTTP.func1(0x16a6ff0, 0xc08219f860, 0xc0824d89a0)
	E:/RDLL/Golang/src/sourcegraph.com/sourcegraph/appdash/traceapp/handler.go:18 +0x5a
panic(0xa95e60, 0xc082008060)
	C:/Go/src/runtime/panic.go:426 +0x4f7
sourcegraph.com/sourcegraph/appdash/traceapp.(*App).serveDashboardData(0xc08200fa90, 0x16a7070, 0xc082194000, 0xc0824d89a0, 0x0, 0x0)
	E:/RDLL/Golang/src/sourcegraph.com/sourcegraph/appdash/traceapp/dashboard.go:60 +0x295
sourcegraph.com/sourcegraph/appdash/traceapp.(*App).(sourcegraph.com/sourcegraph/appdash/traceapp.serveDashboardData)-fm(0x16a7070, 0xc082194000, 0xc0824d89a0, 0x0, 0x0)
	E:/RDLL/Golang/src/sourcegraph.com/sourcegraph/appdash/traceapp/app.go:65 +0x53
sourcegraph.com/sourcegraph/appdash/traceapp.handlerFunc.ServeHTTP(0xc08215a640, 0x16a6ff0, 0xc08219f860, 0xc0824d89a0)
	E:/RDLL/Golang/src/sourcegraph.com/sourcegraph/appdash/traceapp/handler.go:22 +0xbf
github.com/gorilla/mux.(*Router).ServeHTTP(0xc08200f220, 0x16a6ff0, 0xc08219f860, 0xc0824d89a0)
	E:/RDLL/Golang/src/github.com/gorilla/mux/mux.go:103 +0x277
sourcegraph.com/sourcegraph/appdash/traceapp.(*App).ServeHTTP(0xc08200fa90, 0x16a6ff0, 0xc08219f860, 0xc0824d89a0)
	E:/RDLL/Golang/src/sourcegraph.com/sourcegraph/appdash/traceapp/app.go:76 +0x4c
net/http.serverHandler.ServeHTTP(0xc0820b3080, 0x16a6ff0, 0xc08219f860, 0xc0824d89a0)
	C:/Go/src/net/http/server.go:2081 +0x1a5
net/http.(*conn).serve(0xc0821ac980)
	C:/Go/src/net/http/server.go:1472 +0xf35
created by net/http.(*Server).Serve
	C:/Go/src/net/http/server.go:2137 +0x455
Golang 1.6
Windows 10
PowerShell
We're currently running CI with just Go 1.3 and 1.4; we can drop 1.3 now that it's old.
The top-level appdash package is vendoring a bunch of libraries, including InfluxDB. Since the appdash package is a library, and not a command, it shouldn't vendor libraries. Vendoring is the sole responsibility of the project owner. For example, it would be fine for appdash/cmd/appdash to vendor the libraries it needs.
The problem with libraries that vendor code is that one dependency might be introduced with different versions to the final binary. This results in two possible symptoms.
Now, it's of course possible for me to pull your vendored libs into my vendor directory, and delete yours. However that means that I'm not just tracking your repository anymore, but that I am maintaining a patched version of it. This creates a maintenance burden for me.
It would be nice to be able to add important annotations via the opentracing API.
I haven't spent much time thinking about it, but it would work by prefixing a tag's key, i.e. span.SetTag("impt:status", resp.StatusCode). This would add an important annotation with key="status".
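A sketch of how a tracer could recognize that convention (splitImportantTag is a hypothetical helper; the "impt:" prefix is this proposal, not an existing Appdash or OpenTracing feature):

```go
package main

import (
	"fmt"
	"strings"
)

// splitImportantTag implements the proposed convention: a tag key
// prefixed with "impt:" marks the annotation as important, and the
// prefix is stripped from the stored key.
func splitImportantTag(key string) (cleanKey string, important bool) {
	if strings.HasPrefix(key, "impt:") {
		return strings.TrimPrefix(key, "impt:"), true
	}
	return key, false
}

func main() {
	k, imp := splitImportantTag("impt:status")
	fmt.Println(k, imp) // status true
}
```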
(inclusive of continuous queries)
The traces on the traces page are ordered randomly and change their order after refreshes. This was fixed once (issue #47) but it's broken again. If I had to guess, I'd start by looking at 1ddd075, which removed the call to sort.Sort.
Related, I'm not sure if sorting by ID is the best approach, or if it should be sorted by date instead?
Reported on Slack by @rhastamasta who had trouble running Appdash.
Currently we load the template data from disk by default, but it doesn't seem to work in all cases, as he was getting:
template [root.html layout.html]: Asset root.html can't read by error: Error reading asset root.html at ../go/src/sourcegraph.com/sourcegraph/appdash/traceapp/tmpl/root.html: open ../go/src/sourcegraph.com/sourcegraph/appdash/traceapp/tmpl/root.html: no such file or directory
MacBook-Pro:src adu$ pwd
/Users/adu/Workspace/go/src
MacBook-Pro:src adu$ l
total 0
0 drwxr-xr-x 6 adu 1955793164 204B Apr 3 10:47 .
0 drwxr-xr-x 5 adu 1955793164 170B Mar 11 14:20 ..
0 drwxr-xr-x 3 adu 1955793164 102B Mar 11 14:20 bitbucket.org
0 drwxr-xr-x 3 adu 1955793164 102B Mar 11 14:20 code.google.com
0 drwxr-xr-x 21 adu 1955793164 714B Apr 6 10:09 github.com
0 drwxr-xr-x 3 adu 1955793164 102B Apr 3 10:47 sourcegraph.com
The solution in this case is to determine why Appdash doesn't load assets relative to the proper $GOPATH (/Users/adu/Workspace/go above), or simply require go generate to be run during development / modification of template files.
I'm trying to get a handle on what the timeline UI looks like when there are a lot of cascading spans and sub-spans; it seems that running:
apptrace serve --sample-data
does not work (the timeline doesn't show at all). Chrome's console outputs:
Uncaught TypeError: Cannot read property 'forEach' of null d3-timeline.js:82
(anonymous function) d3-timeline.js:73
(anonymous function) d3.js:884
(anonymous function) d3.js:890
d3_selection_each d3.js:883
d3_selectionPrototype.each d3-timeline.js:72
timeline d3.js:897
d3_selectionPrototype.call 00b74083bb122981:119
timelineHover 00b74083bb122981:122 (anonymous function)
I will report back here with more info as I find it.
Tracking issue for potentially using InfluxDB as a storage backend for Appdash.
/cc @chris-ramon who will be investigating this