

slog-agent

A log agent designed to process and filter massive amounts of logs in real-time and forward them upstream (to fluentd)

What we built this for

We have hundreds of thousands of application logs per second, per server, that need to be processed or filtered as quickly as possible.

At the target rate of one million logs per second, every step can become a bottleneck, and conventional log processors are not designed to handle that sort of traffic. This agent is built to be extremely efficient in both memory and CPU, and to scale up across multiple CPU cores efficiently, at the cost of everything else.

A possibly biased and unfair comparison against a Lua transform in fluent-bit: roughly 0.5M logs/s from network input, processed and gzipped at a 1:20-50 ratio (2 cores), versus 50K logs/s from file, uncompressed (one core), for the same processing steps. We also tested Vector, with similar but worse results.

What you need to adopt this

You need a basic understanding of Go, and to be ready to write new transforms and dig into profiling reports.

Things are slow in generic log processors for very good reasons. For example, simple matching by regular expression can be 50 times slower than a specialized glob pattern, and it allocates lots of buffers on the heap which then cost more CPU time to be GC'ed. The boundary-crossing scripting interface is another bottleneck: marshalling and unmarshalling each record can cost more than the script execution itself.
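
As a rough illustration of the difference (illustrative code, not the agent's own), the same check written as a regular expression and as a specialized prefix/suffix match:

package matchdemo

import (
    "regexp"
    "strings"
)

// The generic way: a regular expression. Correct, but far more work per call
// than such a simple pattern warrants.
var appLogPattern = regexp.MustCompile(`^app-.*\.log$`)

func matchByRegexp(name string) bool {
    return appLogPattern.MatchString(name)
}

// The specialized way: a fixed prefix/suffix check that compiles to a few
// comparisons and allocates nothing on the heap.
func matchByGlob(name string) bool {
    return strings.HasPrefix(name, "app-") && strings.HasSuffix(name, ".log")
}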

Without any such generic and flexible transforms and parsers, everything needs to be done in manually written code, or in blocks of code that can be assembled together - which is essentially what this log agent provides: a base and building blocks for high-performance log processors, but only if you need that kind of performance. The design is pluggable and the program is largely configurable, but you are going to run into situations that can only be solved by writing new code.

Features

  • Input: RFC 5424 Syslog protocol via TCP, with experimental multiline support
  • Transforms: field extraction and creation, drop, truncate, if/switch, email redaction
  • Buffering: hybrid disk+memory buffering - compressed and only persisted when necessary
  • Output: Fluentd Forward protocol, both compressed and uncompressed. Single output only.
  • Metrics: Prometheus metrics to count logs and log size by key fields (e.g. vhost + log level + filename)

Dynamic fields are not supported: all fields must be known in the configuration, because they are packed in arrays that can be accessed without hashmap lookups.
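
A minimal sketch of the idea (illustrative types, not the agent's actual schema code): the field set is fixed when the configuration is loaded, so a record is just a slice of values and transforms address fields by a precomputed index instead of a per-record map lookup.

package schemademo

// Schema fixes the set of field names at configuration time.
type Schema struct {
    fieldNames []string
}

// Record holds one value per schema field, in schema order.
type Record []string

// IndexOf resolves a field name to its index once, at config load time.
func (s *Schema) IndexOf(name string) int {
    for i, n := range s.fieldNames {
        if n == name {
            return i
        }
    }
    return -1
}

// truncateTransform keeps the resolved index and touches the slice directly
// on the hot path - no hashmap lookup per record.
type truncateTransform struct {
    fieldIndex int
    maxLen     int
}

func (t truncateTransform) Apply(r Record) {
    if v := r[t.fieldIndex]; len(v) > t.maxLen {
        r[t.fieldIndex] = v[:t.maxLen]
    }
}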

"tags" or similar concept doesn't exist here. Instead there are "if" and "switch-case" matching field values.

See the sample configurations for full features.

Performance and Backpressure

Logs are compressed and saved in chunk files if the output cannot clear them fast enough. The maximum number of pending chunks for each pipeline (key field set) is limited; the limits are defined in defs/params.go.

Input is paused if logs cannot be processed fast enough. Since RFC 5424 has no pause mechanism, this will likely cause internal errors in both the agent and the logging application, but it does not affect other applications' logging as long as pipelines are properly set up and isolated (e.g. by app-name and vhost).
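
A minimal sketch of the backpressure idea, assuming one bounded Go channel per pipeline (the actual limits live in defs/params.go; names here are illustrative):

package pipelinedemo

// Pipeline accepts records through a bounded channel. When the downstream
// transform/output stages fall behind, the channel fills up and Enqueue
// blocks, which stops the TCP reader from consuming more input - the
// "pause" described above.
type Pipeline struct {
    in chan []byte
}

func NewPipeline(maxPending int) *Pipeline {
    return &Pipeline{in: make(chan []byte, maxPending)}
}

// Enqueue blocks when the pipeline is full, propagating backpressure
// to the incoming syslog connection.
func (p *Pipeline) Enqueue(record []byte) {
    p.in <- record
}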

For a typical server CPU (e.g. Xeon, 2GHz), a single pipeline / core should be able to handle at least:

  • 300-500K log/s for small logs, around 100-200 bytes each including syslog headers
  • 200K log/s or 400MB/s for larger logs

Note that on servers with more than a few dozen CPU cores, an optimal GOMAXPROCS has to be measured and set for the production workload, until golang/go#28808 is resolved.

Build

Requires gotils, which provides the build tools:

make
make test

Operation manual

Configuration

See sample configurations.

Experimental configuration reloading is supported by starting with --allow_reload and sending SIGHUP; see testdata/config_sample.yml for details on which sections may be reconfigured. In general, everything after the inputs is reconfigurable. If reconfiguration fails, errors are logged and the agent continues to run with the old configuration, without side effects.

Note that after a successful reload, some previous logs may be sent upstream again if they had not been acknowledged in time.

The metric family slogagent_reloads_total counts successes and failures of reconfigurations.

Currently it is not possible to recover previously queued logs if the orchestration keys have been changed.

Runtime diagnosis

  • SIGHUP aborts and recreates all pipelines with new config loaded from the same file. Incoming connections are unaffected.
  • SIGUSR1 recreates all outgoing connections or sessions gracefully.
  • http://localhost:METRICS_PORT/ provides Go's built-in debug functions in addition to metrics, such as stack dumps and profiling.

Development

Mark inlinable code

Add an xx:inline comment on the same line as the function declaration:

func (s *xLogSchema) GetFieldName(index int) string { // xx:inline

If the function is too complex to be inlined, the build will fail with a warning.
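
The inline check presumably builds on the compiler's own diagnostics; the same information can be inspected manually with go build -gcflags='-m', which reports which functions can or cannot be inlined and why.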

Re-generate templated source (.tpl.go)

make gen

Re-generate expected output in integration tests

make test-gen

Runtime Diagnosis

The Prometheus listener address (default :9335) exposes Go's debug/pprof in addition to metrics, which can dump goroutine stacks.
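
A minimal sketch of how such a combined listener is typically wired in Go (assuming the standard net/http/pprof and Prometheus client packages; the agent's actual setup may differ):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    // e.g. `curl localhost:9335/debug/pprof/goroutine?debug=2` dumps all goroutine stacks
    log.Fatal(http.ListenAndServe(":9335", nil))
}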

Options:

  • --cpuprofile FILE_PATH: enable Go CPU profiling, with some overhead
  • --memprofile FILE_PATH: enable Go memory profiling
  • --trace FILE_PATH: enable Go tracing

Benchmark & Profiling

Example:

LOG_LEVEL=warn time BUILD/slog-agent benchmark agent --input 'testdata/development/*.log' --repeat 250000 --config testdata/config_sample.yml --output null --cpuprofile /tmp/agent.cpu --memprofile /tmp/agent.mem
go tool pprof -http=:8080 BUILD/slog-agent /tmp/agent.cpu

--output supports several formats:

  • `` (empty): default, forward to fluentd as defined in the config. Chunks may not be fully sent at shutdown; unsent chunks are saved for the next run.
  • null: no output. Results are compressed as in the normal routine, counted, and then dropped.
  • .../%s: create one fluentd forward message file per chunk at the given path (%s is the chunk ID). The directory must already exist.
  • .../%s.json: create one JSON file per chunk at the given path (%s is the chunk ID). The directory must already exist.

fluentd forward message files can be examined with fluentlibtool

Internals

See DESIGN

Key dependencies

  • fluentlib for the fluentd forward protocol, plus a fake server and dump tool for testing.
  • klauspost's compress library for fast gzip compression, which is absolutely critical to the agent: always benchmark before upgrading. Compression takes 1/3 to 1/2 of CPU time in our environments.
  • YAML v3, required for custom tags in configuration. KnownFields is still not working, so it cannot check for non-existent or misspelled YAML properties.

Authors

Special thanks to Henrik Sjöström for his guidance on Go optimization, integration testing and invaluable suggestions on performant design.

Contributors

jiping-s, sgtxn


slog-agent's Issues

config: Check for invalid keys in YAML config file

YAML v3 is used to support custom tags in the config file. Checking for invalid keys is still a work in progress upstream, so there is none for now. As a result, most transforms simply require all keys to be specified, to reduce the chance of human error.

This needs to be fixed, either in this repository or in the Go YAML library.

all: Limit amounts of logs from the agent itself

Background

Excessive amounts of logs from the agent itself can be triggered by I/O errors (e.g. disk quota) or broken input logs. In the worst case, assume that one malformed input log triggers at least one warning-level log from the agent. This is unacceptable, as it could easily produce TBs of logs in a few minutes.

What to do

Set up a log limit per component and apply it to the smallest unit possible: a TCP connection, a pipeline (not sub-pipelines) or a set of metric keys. The limit should be configurable, for example (a sketch of such a limiter follows the list):

  • max_logging_count: 500
  • max_logging_duration: 1h (reset counter afterwards)
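
A minimal sketch of such a per-component limiter (hypothetical names, not existing agent code), with one instance per connection / pipeline / metric key set:

package loglimit

import (
    "sync/atomic"
    "time"
)

// Limiter allows at most maxCount agent-side log lines per reset window.
type Limiter struct {
    maxCount int64
    count    int64
}

// New creates a limiter and resets its counter every resetEvery
// (the max_logging_duration idea above).
func New(maxCount int64, resetEvery time.Duration) *Limiter {
    l := &Limiter{maxCount: maxCount}
    go func() {
        for range time.Tick(resetEvery) {
            atomic.StoreInt64(&l.count, 0)
        }
    }()
    return l
}

// Allow reports whether another warning/error from the agent itself
// may be emitted (the max_logging_count idea above).
func (l *Limiter) Allow() bool {
    return atomic.AddInt64(&l.count, 1) <= l.maxCount
}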

all: Document full integration testing

Document our full integration testing once we have a prototype setup

The scope of full integration testing should cover:

  • Input: mixed normal log records (~200 bytes) and huge log records (~50K), with a sizable part of them to be dropped
  • Processing: hit every transform except deprecated or not-recommended-for-production ones
  • Output: network output
  • Upstream: fake server from https://github.com/relex/fluentlib with random failures to trigger recovery
  • Kill agent in the middle and restart to test shutdown & recovery: note this only works for catchable signals such as SIGTERM

Automatic verification needs to cover:

  • Final output logs from upstream in JSON
    • Order of output logs should be the same as input, per pipeline (key set)
      • Except when recovery happens: affected logs are resent, possibly duplicated, in the original order. For example logs 1 2 3 4 5 may be sent as 1 2 3 | 2 3 | 4 5
  • Metric values except buffer metrics regarding on-disk vs in-memory chunks

config: Reorganize defs/params and make them configurable

Background

defs/params.go contains advanced parameters. Some of the values are shared, while others are specific to certain packages or individual modules. Some of them also depend on others and cannot be set to arbitrary values.

What to do

  • Move parameters to the corresponding packages or individual units and add them to the configuration
  • For parameters with inter-dependencies, either calculate the values automatically or check them when loading the config

Simplify internal communication / flows, starting with baseoutput

Currently all internal communication and code flow is built around primitive Go channels and signals; as a result, the complexity has exploded, especially in baseoutput around the shutdown mechanism:

  • There are async sender and receiver to enable pipelining / reduce latency
  • There are soft stop, hard stop (abort) to reconnect, and hard stop to shut down.
  • Network reading and writing can fail independently, and the other side may keep working (unintended)
  • All pending chunks whether in queue or being processed in functions need to be collected for recovery
  • The code is difficult to test. Currently the only tests involving real output are in the run package, and they cannot cover any exceptional situations.

Instead of a free form where each piece of code decides how to use channels and signals, there could be a unified framework or library that helps ensure:

  • The top-level flow is visualized, in a declarative form. It needs to cover all exceptional situations, because dealing with them is the main cause of complexity.
  • The components (e.g. the acknowledger or sub-units) have clear inputs, outputs and responsibilities, and no implicit side effects

The internal communication in baseoutput is not performance-critical, as it works on compressed log chunks.

all: Upgrade to go 1.18 and generics

Same process as for the 1.16 upgrade:

  • Running benchmarks with fixed GOMAXPROCS=12 against different inputs to make sure there is no slowdown
  • Running benchmarks with GOMAXPROCS=1000 against different inputs to make sure there is no slowdown
  • Running benchmarks with default GOMAXPROCS on a server with at least 500 cores to make sure there is no slowdown
  • Try different defs.IntermediateBufferedChannelSize values to see if it should be changed.

And replace code generation with generics

all: Dump current stats and log samples on signal

Background

We need to be able to get very detailed stats of the program when it hangs or crashes

What to do

  • Make current stats and latest logs-in-process available: note log fields are backed by mutable arrays and they may be changed or dropped when accessed out of scope.
  • Dump the information on fatal/panic and specific signals

output: Repeated failures in poor network environment

Reproduce

Start the test server in fluentlibtool with random failure chances (30% for all stages is enough) to simulate a poor network environment, and run the benchmark with minimal repeats - sometimes the client fails to finish and keeps reconnecting only to send pings (which then fail), while there are still chunks queued in the buffer.

Expected behavior

The buffer & client should be able to push chunks through even in such a situation. Also, periodic pings are not supposed to start while there are still many chunks queued.

Notes

The feeding from the hybrid buffer to the forwarder client's channel may be delayed for an unknown reason, or maybe those chunks should go into the "leftover" queue, which gets priority processing.

output: Simplify tracking of chunks in acknowledger

The way chunks are tracked for acknowledgement and recovered at the end seems unnecessarily complicated, and it cannot survive a BUG situation.

A redesign is needed here. It is probably enough to track them in a sync map instead of having the acknowledger "return" them at the end.
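
A minimal sketch of the suggested sync-map approach (hypothetical helpers, not the current implementation):

package acktracking // hypothetical package, for illustration only

import "sync"

// pendingChunks tracks chunks that have been sent but not yet acknowledged.
// Whatever remains at shutdown is exactly the set of chunks to recover,
// so nothing needs to be "returned" through channels at the end.
var pendingChunks sync.Map // chunk ID -> raw chunk bytes

func markSent(id string, chunk []byte) { pendingChunks.Store(id, chunk) }

func markAcknowledged(id string) { pendingChunks.Delete(id) }

// collectUnacknowledged gathers the leftover chunks for recovery.
func collectUnacknowledged() map[string][]byte {
    leftover := make(map[string][]byte)
    pendingChunks.Range(func(k, v interface{}) bool {
        leftover[k.(string)] = v.([]byte)
        return true
    })
    return leftover
}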

output: Limit max duration of outgoing connection to help load balancing

Background

The outgoing connections from the fluentd forward client are designed to be persistent, unlike the async request approach with pooling used in fluent-bit. This works fine, except it is very poor for load balancing on the server side, as connections stick to the same server node forever until either side is restarted.

What to do

Automatically closing connections after a certain duration (e.g. 30m) should address this, and should also avoid pausing before the next connection attempt in such situations.

Note that a connection should not be closed while there are still pending acknowledgements, which could be a problem if traffic never stops (better ideas needed). Meanwhile, a new connection cannot be opened until the old one is closed (current behavior), or the order of logs cannot be guaranteed.
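
A minimal sketch of the age check (hypothetical fields, not the current client code):

package forwarddemo

import "time"

const maxConnectionAge = 30 * time.Minute // e.g. the 30m suggested above

// connState mirrors the bookkeeping an output connection would need.
type connState struct {
    openedAt    time.Time
    pendingAcks int
}

// shouldRotate reports whether the connection should be closed gracefully so
// that the next connection can land on a different server node. Connections
// with pending acknowledgements are left alone, per the note above.
func (c *connState) shouldRotate() bool {
    return c.pendingAcks == 0 && time.Since(c.openedAt) > maxConnectionAge
}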

output: Add fast shutdown and make it optional if possible

Background

The current shutdown process in the fluentd forward client attempts to wait for all sent chunks to be acknowledged by the server, in order to reduce the number of chunks re-sent on the next start.

When there is network trouble, it can take minutes until timeout, and during that whole time the input is already shut down, meaning nothing is listening on the incoming port to receive logs from applications. A minute of downtime could mean GBs of logs lost or queued up, applications' operations blocked, or even more logs triggered by logging failures.

What to do

One of the two:

  1. Make fast shutdown a global option and support it in the fluentd forward client: close everything and assume in-flight buffers are lost when the end signal is received - the buffer above the output clients should persist all of them for recovery next time.
  2. If making it optional is difficult, just replace the old approach completely. The current code may still be needed for #6

Not in Scope

Shutdown handling in other parts should be fine and needs no change.

orchestrate: Investigate bottleneck in parallel sub-pipelines

For unknown reasons, the sub-pipelines under the byKeySet orchestrator offer minimal performance improvement. In the most recent tests, num=2 gave about 30% time saving at the cost of +30% CPU time, while higher values had no effect.

The problem could be anywhere: in the Distributor, in the generic buffering in PipelineChannel (called from byKeySetOrchestrator), or maybe even in the test input - the current benchmark code feeds input records by itself and may be affected by or affect normal operations.

What is known is that the chunk locks used to ensure each resulting chunk is output in the original order are rarely touched, and disabling them changed nothing.

all: Upgrade to go 1.16

go 1.16 should offer a significant performance improvement in the scheduler when GOMAXPROCS is high (e.g. 1000), due to partial addressing of golang/go#28808.

What needs to be done:

  • Running benchmarks with fixed GOMAXPROCS=12 against different inputs to make sure there is no slowdown
  • Running benchmarks with GOMAXPROCS=1000 against different inputs to make sure there is no slowdown
  • Running benchmarks with default GOMAXPROCS on a server with at least 500 cores to make sure there is no slowdown
  • Try different defs.IntermediateBufferedChannelSize values to see if it should be changed.

Input log types:

  • Common log records of around 200 bytes each, including syslog headers
  • Huge log records for error dumps, 10-50K. The new errors-input.log may be good enough for this.

orchestrate: improve visibility on incoming connections

It has been impossible to tell which clients are logging through slog-agent and what they are logging.

The existing information:

  • client addresses
  • orchestration keys

is not sufficient in situations where the keys cannot differentiate clients (e.g. multi-instance services), and full information cannot be represented in metrics due to concerns about metric label cardinality (M inputs * N pipelines * T steps * U outputs).

A workaround is to log key fields when an incoming connection requests dispatching logs to a previously-unused pipeline (identified by orchestration keys). This was done in "major: Use Go 1.18 generics", but turned out to be useless since it logs only the orchestration keys.

One option is to log the metric keys in addition to the orchestration keys, and/or dump raw headers in one of the new logs.

A better solution would be to save samples of incoming logs along with the destination pipelines assigned by the orchestrator, and print them on certain signals, which may be done as part of #8.

output: datadog & HTTP2

HTTP/2 should offer built-in multiplexing and avoid the need for a custom pipelining implementation. In the ideal case, merely changing the protocol to HTTP/2 would remove the latency issue.

There might still be buffer size problem with Go's HTTP2 implementation: golang/go#47840

Load testing will be needed against the real Datadog API.
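
A minimal sketch of enabling HTTP/2 on a standard Go client (assuming plain net/http against an HTTPS endpoint; not tied to any existing agent code):

package datadogdemo // hypothetical, for illustration

import (
    "net/http"
    "time"
)

// Go's net/http negotiates HTTP/2 automatically over TLS when using the
// default transport; ForceAttemptHTTP2 keeps that behavior on a customized
// transport. Requests on this client share one multiplexed connection,
// which is what would replace the forward client's custom pipelining.
var httpClient = &http.Client{
    Transport: &http.Transport{ForceAttemptHTTP2: true},
    Timeout:   30 * time.Second,
}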
