
tapper's Introduction

Tapper - Zipkin client for Elixir.

Implements an interface for recording traces and sending them to a Zipkin server.


Synopsis

See also tapper_demo

A Client

A client making a request:

# start a new, sampled, trace, and root span;
# creates a 'client send' annotation on root span
# (defaults to type: :client) and a 'server address' (sa)
# binary annotation (because we pass the remote option with
# an endpoint)

# prepare remote endpoint metadata
service_host = %Tapper.Endpoint{service_name: "my-service"}

id = Tapper.start(name: "fetch", sample: true, remote: service_host, annotations: [
  Tapper.http_host("my.server.co.uk"),
  Tapper.http_path("/index"),
  Tapper.http_method("GET"),
  Tapper.tag("some-key", "some-value"),
  Tapper.client_send()
])

# ... do remote call ...

# add response details to span
Tapper.update_span(id, [
    Tapper.http_status_code(status_code),
    Tapper.client_receive()
])

# finish the trace (and the top-level span), with some detail about the operation
Tapper.finish(id, annotations: [
    tag("result", some_result)
])

A Server

A server processing a request (usually handled via an integration, e.g. Tapper.Plug):

# use propagated trace context (e.g. from Plug integration) and incoming Plug.Conn;
# adds a 'server receive' (sr) annotation (defaults to type: :server)
id = Tapper.join(trace_id, span_id, parent_span_id, sample, debug, annotations: [
  Tapper.client_address(%Tapper.Endpoint{ip: conn.remote_ip}), # equivalent to 'remote:' option
  Tapper.http_path(conn.request_path)
])

# NB because the server joined the trace, rather than starting it, 
# it must always start child spans for tracing anything it does, 
# rather than using the incoming span

# call another service in a child span, now as a client
id = Tapper.start_span(id, name: "fetch-details", annotations: [
    Tapper.http_path("/service/xx"),
    Tapper.http_host("a-service.com")
])
# ...
Tapper.update_span(id, Tapper.client_send())

# ... call service ...

Tapper.update_span(id, Tapper.client_receive())

# finish child span with some details about response
id = Tapper.finish_span(id, annotations: [
    Tapper.tag("userId", 1234),
    Tapper.http_status_code(200)
])

# perform some expensive local processing in a named local span:
id = Tapper.start_span(id, name: "process", local: "compute-result") # adds 'lc' binary annotation

# ... do processing ...

id = Tapper.finish_span(id)

# ... send response to client ...

# finish trace as far as this process is concerned
Tapper.finish(id, annotations: Tapper.server_send())

NB Tapper.start_span/2 and Tapper.finish_span/2 return an updated id, whereas all other functions return the same id. You therefore don't need to propagate the id backwards down a call-chain just to add annotations, but you should propagate it forwards when adding spans, and pair each finish_span/2 with the id returned by the corresponding start_span/2. Parallel spans can share the same starting id, as in the sketch below.
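
For example, parallel child spans can each be started from the same parent id in separate tasks (a sketch: the span name, paths and use of Task are illustrative):

tasks =
  for path <- ["/a", "/b"] do
    Task.async(fn ->
      # each task starts, and finishes, its own child span from the shared parent id
      span_id = Tapper.start_span(id, name: "fetch", annotations: [Tapper.http_path(path)])
      # ... call the remote service ...
      Tapper.finish_span(span_id)
    end)
  end

Enum.map(tasks, &Task.await/1)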

The Alternative Contextual API

The above API is the functional API: you need the Tapper.Id on-hand whenever you use it. You may complain that this pollutes your API, or creates difficulties for integrations.

Whilst you may mitigate this yourself using process dictionaries, ETS, or pure functional approaches using closures, the Tapper.Ctx interface provides a version of the API that tracks the Tapper.Id for you, using Erlang's process dictionary. Erlang purists might hate it, but it does get the id out of your mainstream code:

def my_main_function() do
  # ...
  Tapper.Ctx.start(name: "main", sample: true)
  # ...
  x = do_something_useful(a_useful_argument)
  # ...
  Tapper.Ctx.finish()
end

def do_something_useful(a_useful_argument) do  # no Tapper.Id!
  Tapper.Ctx.start_span(name: "do-something", annotations: Tapper.tag("arg", a_useful_argument))
  # ...
  Tapper.Ctx.update_span(Tapper.wire_receive())
  # ...
  Tapper.Ctx.finish_span()
end

It's nearly identical to the functional API, but without explicitly passing the Tapper.Id around.

Behind the scenes, the Tapper.Id is managed using Tapper.Ctx.put_context/1 and Tapper.Ctx.context/0. Use these functions directly to propagate the Tapper.Id across process boundaries.
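
For example (a sketch: the task and span name are illustrative), the contextual id can be carried into a spawned task like this:

# capture the current contextual Tapper.Id
id = Tapper.Ctx.context()

Task.async(fn ->
  Tapper.Ctx.put_context(id)            # re-establish the id in this process
  Tapper.Ctx.start_span(name: "work")   # the contextual API now works as usual
  # ... do the work ...
  Tapper.Ctx.finish_span()
end)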

See the Tapper.Ctx module for details, including details of options for debugging the inevitable incorrect usage in your code!

API Documentation

The API documentation can be found at https://hexdocs.pm/tapper.

See also

Tapper.Plug - Plug integration: decodes incoming B3 trace headers, joining or sampling traces.

tapper_demo - a simple client-server application using Tapper.

Implementation

The Tapper API starts, and communicates with, a supervised GenServer process (Tapper.Tracer.Server), with one server started per trace; all traces are thus isolated from each other.

Once a trace has been started, all span operations and updates are performed asynchronously by sending a message to the server, so there is minimal processing on the client side. One message is sent per Tapper.start_span/2, Tapper.finish_span/2 or Tapper.update_span/2, tagged with the timestamp taken at the point of the call.

When a trace is terminated with Tapper.finish/2, the server sends the trace to the configured collector (e.g. a Zipkin server), and exits normally.

If a trace is not terminated by an API call, Tapper will time out after a pre-determined period since the last API operation (the ttl option on trace creation, default 30s), and terminate the trace as if Tapper.finish/2 had been called, annotating any unfinished spans with a timeout annotation. A timeout will also happen if the client process exits before finishing the trace.

If the API client starts spans in, or around, asynchronous processes and may exit before they have finished, it should call Tapper.start_span/2 or Tapper.update_span/2 with a Tapper.async/0 annotation, or Tapper.finish/2 with the async: true option or annotation; async spans should then be closed as normal by Tapper.finish_span/2, otherwise they will eventually be terminated by the TTL behaviour (see the sketch below).
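
The following sketch (the span name and use of Task are illustrative) marks a child span as asynchronous and finishes the trace before the background work completes:

# mark the child span as asynchronous
span_id = Tapper.start_span(id, name: "background-work", annotations: [Tapper.async()])

Task.start(fn ->
  # ... long-running work ...
  Tapper.finish_span(span_id)
end)

# the originating process can finish its part of the trace immediately
Tapper.finish(id, async: true)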

The API client is not affected by the termination, normal or otherwise, of a trace-server, and the trace-server is likewise isolated from the API client, i.e. there is a separate supervision tree. Thus if the API client crashes, the span can still be reported. The trace-server monitors the API client process for abnormal termination, and annotates the trace with an error (TODO). If the trace-server crashes, any child spans and annotations registered with the server will be lost, but subsequent spans and the trace itself will still be reported, since the supervisor will restart the trace-server using the initial data from Tapper.start/1 or Tapper.join/6.

The id returned from the Tapper API tracks the trace id, so that messages can be sent to the right server, and the span nesting, so that annotations are added to the correct span.

Tapper ids have an additional, unique, identifier, so if a server receives parallel requests within the same client span, the traces are recorded separately: each will start its own trace-server. In practice this should not happen, since clients should use a separate span for each remote call; however, this protects against non-conformant clients.

Installation

For the latest pre-release (and unstable) code, add the GitHub repo to your mix dependencies:

def deps do
  [{:tapper, github: "Financial-Times/tapper"}]
end

For release versions, the package can be installed by adding tapper to your list of dependencies in mix.exs:

def deps do
  [{:tapper, "~> 0.6"}]
end

Under Elixir 1.4+ the :tapper application will be auto-discovered from your dependencies, so there is no need to add :tapper to your application's extra_applications etc.

Configuration

Tapper looks for the following application configuration settings under the :tapper key:

system_id (String.t): This application's id; used for service_name in the default endpoint host used in annotations.
ip (tuple): This application's principal IPv4 or IPv6 address, as a 4- or 8-tuple of integers; defaults to the IP of the first non-loopback interface, or {127, 0, 0, 1} if none.
port (integer): This application's principal service port, for the endpoint port in annotations; defaults to 0.
reporter (atom | {atom, any} | function/1): Module implementing Tapper.Reporter.Api [1], or a function of arity 1, to use for reporting spans; defaults to Tapper.Reporter.Console.

All keys support the Phoenix-style {:system, var} format, to allow lookup from shell environment variables, e.g. {:system, "PORT"} to read PORT environment variable[2].

[1] If the reporter is given as {module, arg} it is expected to specify an OTP server to be started under Tapper's main supervisor.
[2] Tapper uses the DeferredConfig library to resolve all configuration under the :tapper key, so see its documentation for more resolution options.
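
For example (a sketch: the environment variable names are illustrative), in config.exs:

config :tapper,
    system_id: {:system, "SERVICE_NAME"},
    port: {:system, "PORT"},
    reporter: Tapper.Reporter.Console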

Zipkin Reporter

The Zipkin reporter (Tapper.Reporter.Zipkin) has its own configuration:

collector_url: full URL of the Zipkin server API for receiving spans.
client_opts: additional options for the HTTPoison client; see HTTPoison.Base.request/5.

e.g. in config.exs (or prod.exs etc.)

config :tapper,
    system_id: "my-application",
    reporter: Tapper.Reporter.Zipkin

config :tapper, Tapper.Reporter.Zipkin,
    collector_url: "http://localhost:9411/api/v1/spans"

Other Reporters

Tapper.Reporter.AsyncReporter: collects spans before sending them on to another reporter.
Tapper.Reporter.Console: just logs JSON spans.
Tapper.Reporter.Null: reports and logs nothing.

Custom Reporters

You can implement your own reporter module by implementing the Tapper.Reporter.Api behaviour.

This defines a function ingest/1 that receives spans in the form of Tapper.Protocol.Span structs, with timestamps and durations in microseconds. For JSON serialization, see Tapper.Encoder.Json which encodes to a format compatible with Zipkin server.

The configuration's reporter property is usually either an atom specifying a simple module, or a supervisor-child-style {module, args} tuple specifying an OTP server to be started under Tapper's main supervisor. Additionally, it may be a 1-argument function which is useful for testing.
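
As a minimal sketch (assuming only the ingest/1 callback is required; the module name and logging are illustrative), a custom reporter could simply log the encoded spans instead of sending them anywhere:

defmodule MyApp.LogReporter do
  @behaviour Tapper.Reporter.Api

  require Logger

  @impl true
  def ingest(spans) when is_list(spans) do
    # spans are Tapper.Protocol.Span structs; encode them with the bundled
    # JSON encoder and log them, rather than sending them to a collector
    Logger.info(fn -> Tapper.Encoder.Json.encode!(spans) end)
    :ok
  end
end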

Logging

Tapper adds a trace_id key to the Logger metadata on Tapper.start/1 or Tapper.join/6, so if you want this in your logs, configure your logger formatter/backend to output this key, e.g.

config :logger,
  format: "[$level] $metadata$message\n",
  metadata: [:trace_id]

This will output something like:

[info] trace_id=b1db8e59c0f02152130c3fbb317d57fb  Something to log home about

Note that trace_id metadata is added regardless of whether the trace is sampled, so when you propagate the trace context for unsampled traces, you can still at least see the trace id in the logs, and track it across your system, which may be useful!

Erlang and Time

It is recommended that you run the Erlang VM in multi-time-warp mode for greater timing accuracy. This is achieved by setting the +C multi_time_warp command line option, e.g. via the ERL_FLAGS environment variable or erl_opts in your Distillery release.

The default time mode (no_time_warp) works well enough, but may introduce an error of up to 1% in timestamp and duration measurements, due to the way it keeps the Erlang monotonic clock in sync with the system clock.

Why 'Tapper'?

Dapper (Dutch - original Google paper) - Brave (English - Java client library) - Tapper (Swedish - Elixir client library)

Because Erlang, Ericsson 🇸🇪.

RIP Joe Armstrong - a glorious exception to the rule that you should never meet your heroes.

tapper's People

Contributors

chad-g-adams, ellispritchard, indrekj, tkasekamp, wheresrhys


tapper's Issues

Any chances to port it to Erlang or how to go about it?

Hi,

I would like to see whether this could be ported to Erlang.

Currently there is no instrumentation library for Erlang in Zipkin. There is one for OpenTracing: https://github.com/Bluehouse-Technology/otter
and one for OpenCensus: https://github.com/census-instrumentation/opencensus-erlang
(but again, not Zipkin).

Also quoting Adrian Cole from the Zipkin gitter:

@akamanocha I would raise an issue in https://github.com/Financial-Times/tapper to ask to extract the erlang library. this will save you time, if they accept

What I am asking is whether this can be done, and how easy or difficult it would be. And where to start; maybe I can help...

Use process dictionary to make Tapper.Id passing less onerous

Keeping it nice and functional means your code ends up littered with tapper_id parameters being passed to down-stream functions, and reduces the normal 'hackability' of Elixir in IEx (since you now have to provide an id for everything, and not all shims in applications are written to work with :ignore, which is a bug in their implementation rather than Tapper's).

Without something like Scala's implicit parameters, the only option seems to be using the process dictionary, which is roughly equivalent to a Java thread-local, but is scoped to a process; and as most processes in Elixir request processing are short-lived (at least under Cowboy/Phoenix), it doesn't have quite the same sticky/icky semantics. Although Erlang programmers apparently hate it, the process dictionary is cheap and useful.

The API could be changed to look for a process dictionary entry when no Tapper Id is supplied, e.g.

Tapper.update_span(id, annotations)

could then also be called as:

Tapper.update_span(annotations)

which produces the Tapper.Id from the process dictionary, or is a no-op if there is none there.

We'd also add an API for submerging/retrieving the Tapper.Id when passing into processes, e.g.

id = Tapper.contextual_id()
pid = spawn(fn -> 
   Tapper.contextual_id(id)
   Tapper.start_span(annotations: [...])
   ...
end)

etc.

This also opens the door for the use of module attributes (as annotations) to allow easier instrumentation of code, since we can now 'magically' get the Tapper Id anywhere in the code, if it exists.

We'd probably want a config option to throw an exception or log an error if the id is not present, for getting rid of the gremlins.

Binary Annotations Decode Error

Hi all, I was trying to send Zipkin V1 traces and my Jaeger server is balking at something in the binary annotations being emitted from the application. Nothing really crazy in mine besides endpoints (IPv4 and IPv6, ports, that sort of thing) in the emitted JSON in the request (IPs filtered to localhost):

    "binaryAnnotations" => [
      %{
        "endpoint" => %{
          "ipv4" => "127.0.0.1",
          "port" => 0,
          "serviceName" => "healthchecks"
        },
        "key" => "http.status_code",
        "type" => "I16",
        "value" => "200"
      },
      %{
        "endpoint" => %{
          "ipv4" => "127.0.0.1",
          "port" => 0,
          "serviceName" => "healthchecks"
        },
        "key" => "http.path",
        "value" => "/versions"
      },
      %{
        "endpoint" => %{
          "ipv4" => "127.0.0.1",
          "port" => 0,
          "serviceName" => "healthchecks"
        },
        "key" => "http.method",
        "value" => "GET"
      },
      %{
        "endpoint" => %{
          "ipv4" => "127.0.0.1",
          "port" => 0,
          "serviceName" => "healthchecks"
        },
        "key" => "http.host",
        "value" => "localhost"
      },
      %{
        "endpoint" => %{"ipv6" => "::1", "port" => 0, "serviceName" => ""},
        "key" => "ca",
        "type" => "BOOL",
        "value" => true
      }
    ],

Jaeger complains here about:

{"level":"error","ts":1622732435.007283,"caller":"recoveryhandler/zap.go:33","msg":"interface conversion: interface {} is string, not float64", ...}

and results in a 500.

In my sender, dropping the binary annotations makes traces appear as expected, but without the data I should be propagating, for obvious reasons:

  def process_request_body(spans) do
    spans
    |> Enum.map(fn span -> %{span | binary_annotations: []} end)
    |> Tapper.Encoder.Json.encode!()
  end

Am I doing something incorrect here?

Consider using ETS instead of GenServer

@fishcakez on Gitter:

@ellispritchard Did you consider using ETS and heirs for more efficiency and failure handling?
it would require OTP 20 though
so i guess tapper had no chance
In OTP 20 allows almost as many ETS tables as you want and they all gets a reference (if unnamed), so unique, this would mean an ETS table could be started per trace id, with heir set as {recovery_pid, start_monotonic+ ttl} - so the recovery pid can use :erlang.start_timer(...., [abs: true]) to know when to cleanup/report the spans
it would be possible to report out of band using :ets.giveaway too on Tapper.finish

Integer value for binary annotations incompatible with many zipkin collectors

Generating a trace with Tapper results in a JSON payload whose binary annotations have an integer value field (example: http.status_code). See a full example from Tapper here:
https://gist.github.com/chad-g-adams/213aaeca0a9912b7adaf5314ed999038

Sending such a trace to zipkin collectors is giving issues:

  • Zipkin UI ignores all integer annotations (simply not showing any of them in the UI)
  • Jaeger UI is showing large random negative numbers instead of the proper http status code
  • Honeycomb's opentracing proxy is reporting errors: error="json: cannot unmarshal number into Go struct field binaryAnnotation.value of type string"

More details copied from the gitter chat room:

Could it mean Tapper is not fully compatible with the zipkin v1 API?

"unknown" is being sent as serviceName

According to the Thrift documentation, "unknown" should be sent when you don't know the service name of, say, an incoming trace, leaving it to be filled in using data from the originating service. Now that we're using Istio, we can see that this is incorrect behaviour (at least for the JSON format), since you end up with "unknown" as the service name, rather than the upstream name. Java Brave seems to handle this correctly. In the v2 JSON API, this field is specifically marked as being able to be omitted:

 Endpoint:
    type: object
    title: Endpoint
    description: The network context of a node in the service graph
    properties:
      serviceName:
        type: string
        description: |
                    Lower-case label of this node in the service graph, such as "favstar". Leave
                    absent if unknown.
                  
                    This is a primary label for trace lookup and aggregation, so it should be
                    intuitive and consistent. Many use a name from service discovery.

implement "b3 single" header format

As discussed on openzipkin/b3-propagation#21 and first implemented here: https://github.com/openzipkin/brave/blob/master/brave/src/main/java/brave/propagation/B3SingleFormat.java https://github.com/openzipkin/brave/blob/master/brave/src/test/java/brave/propagation/B3SingleFormatTest.java

Let's support at least reading the "b3" header from a single string, most commonly traceid-spanid-1.
It would also be nice to support optionally writing this, especially in message providers or others with constrained environments.

Brave currently has a property like this, but its name could change with feedback:

    /**
     * When true, only writes a single {@link B3SingleFormat b3 header} for outbound propagation.
     *
     * <p>Use this to reduce overhead. Note: normal {@link Tracing#propagation()} is used to parse
     * incoming headers. The implementation must be able to read "b3" headers.
     */
    public Builder b3SingleFormat(boolean b3SingleFormat) {
      this.b3SingleFormat = b3SingleFormat;
      return this;
    }

Post-fact sampling

Tracing in Tapper is cheap since the instrumented code is basically just sending a message or two (erm, I should probably write some benchmarks), so it's no biggie if we trace everything and then choose to report some of it.

Sampling is currently only available via the Plug integration, i.e. it's not even in the core Tapper library, because it's done up-front when the request comes in.

Adding a post-fact sampling stage to Tapper would allow some module to be called before sending a trace to the Reporter, which could then see if the trace was 'interesting' enough to sample.

This might include a default "have I sent 10% of traces in the last minute" to maintain current behaviour, but also could somehow categorise spans by annotations or span duration, and spot anomalies.

Adding the 'hook' is easy; implementing something that is actually generically useful, and that does the sampling on anything other than a percentage basis, is a bit harder. It would be good anyway as a general way of dynamically turning up (or down) the percentage of reported spans.

This could currently be implemented using 'debug' mode and a custom Reporter, but it would be better to add this as a separate concern (we ideally need a batching Reporter implementation anyway, so we don't want to complicate that).

Foundation Observability Working Group

Hi, I wanted to see if you had interest in taking part in the work we are doing in the Erlang Foundation's Observability working group, https://github.com/erlef/eef-observability-wg/.

For tracers like Tapper in particular, check out OpenCensus (https://github.com/census-instrumentation/opencensus-erlang) -- being merged with OpenTracing and renamed OpenTelemetry (https://opentelemetry.io/). We've been working with the Spandex (https://github.com/spandex-project) team to consolidate the community on one underlying tracer library. See https://github.com/opencensus-beam/ for existing instrumentation.

For events, whether for metrics or traces, we are focusing on https://github.com/beam-telemetry

Parsing tapper generated json logs throws an exception

By default, Tapper Trace IDs are encoded to JSON as an object, e.g. {"value":[169513317762981674876240685479724028020,25469]}. Such an object can't be parsed by Fluentd (the number is too long: Caused by: com.fasterxml.jackson.core.JsonParseException: Numeric value (13237579633808872962) out of range of long (-9223372036854775808 - 9223372036854775807)) and log messages are being discarded en masse.

As a workaround we have been adding this to our projects:

# Proper JSON encoding of TraceId generated by Tapper
for protocol <- [Poison.Encoder, Jason.Encoder] do
  defimpl protocol, for: Tapper.TraceId do
    def encode(trace, options) do
      trace
      |> String.Chars.to_string()
      |> @protocol.encode(options)
    end
  end
end

The TraceId seems to be put into the metadata in https://github.com/Financial-Times/tapper/blob/master/lib/tapper/tracer.ex#L139 . Would it make more sense to use to_hex and make it usable with different loggers and parsers?

Pluggable trace/span id generation for Datadog

Hi!

Thanks for providing the Tapper library! I have to say that I really like the architecture and how easy it is to understand and dive in!

I'm currently looking into using tapper for collecting traces and spans to be sent to Datadog APM instead of Zipkin... and for the most part, it looks like it is relatively straightforward, i.e. a custom reporter which formats traces in a datadog-compatible way and then forwards them to the Datadog agent seems to work well enough.

The only problem I've stumbled on so far is the format of trace and span IDs, because Datadog expects them to be of type int64 and does not accept the 128-bit hex IDs which are currently used...

I'm assuming that the 128-bit hex format is something that works well with Zipkin, but for Datadog I would simply generate two random int64 numbers...

Would you be open to making the generation of ids pluggable (e.g. via an IdGenerator behaviour + a configurable implementation)? I'd be happy to work on a respective PR if that's something that makes sense for Tapper.
