

slog-agent

A log agent designed to process and filter massive amounts of logs in real-time and forward them upstream (to fluentd)

What we built this for

We have hundreds of thousands of application logs per second, per server, that need to be processed or filtered as quickly as possible.

At the target rate of one million logs per second, every step can become a bottleneck, and conventional log processors are not designed to handle that sort of traffic. This agent is built to be extremely efficient in both memory and CPU, and to scale up across multiple CPU cores efficiently, at the cost of everything else.

A possibly biased and unfair comparison against a Lua transform in fluent-bit: roughly 0.5M logs/s from network input, processed and gzipped at a 1:20-50 ratio (2 cores), versus 50K logs/s from file, uncompressed (one core), for the same processing steps. We also tested Vector, with similar but worse results.

What you need to adopt this

You need a basic understanding of Go, and to be ready to write new transforms and dig into profiling reports.

Things are slow in generic log processors for very good reasons. For example, simple matching by regular expression can be 50 times slower than a specialized glob pattern, and it allocates lots of buffers on the heap which then cost more CPU time to be GC'ed. The boundary-crossing scripting interface is another bottleneck: marshalling and unmarshalling each record can cost more than the script execution itself.
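
As a rough illustration of the difference (illustrative code, not the agent's own), the same check written as a regular expression and as a specialized prefix/suffix match:

package matchdemo

import (
    "regexp"
    "strings"
)

// The generic way: a regular expression. Correct, but far more work per call
// than such a simple pattern warrants.
var appLogPattern = regexp.MustCompile(`^app-.*\.log$`)

func matchByRegexp(name string) bool {
    return appLogPattern.MatchString(name)
}

// The specialized way: a fixed prefix/suffix check that compiles to a few
// comparisons and allocates nothing on the heap.
func matchByGlob(name string) bool {
    return strings.HasPrefix(name, "app-") && strings.HasSuffix(name, ".log")
}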

Without any such generic and flexible transforms and parsers, everything needs to be done in manually written code, or in blocks of code that can be assembled together - which is essentially what this log agent provides: a base and building blocks for high-performance log processors, but only if you need that kind of performance. The design is pluggable and the program is largely configurable, but you are going to run into situations that can only be solved by writing new code.

Features

  • Input: RFC 5424 Syslog protocol via TCP, with experimental multiline support
  • Transforms: field extraction and creation, drop, truncate, if/switch, email redaction
  • Buffering: hybrid disk+memory buffering - compressed and only persisted when necessary
  • Output: Fluentd Forward protocol, both compressed and uncompressed. Single output only.
  • Metrics: Prometheus metrics to count logs and log size by key fields (e.g. vhost + log level + filename)

Dynamic fields are not supported: all fields must be known in the configuration, because they are packed in arrays that can be accessed without hashmap lookups.
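
A minimal sketch of the idea (illustrative types, not the agent's actual schema code): the field set is fixed when the configuration is loaded, so a record is just a slice of values and transforms address fields by a precomputed index instead of a per-record map lookup.

package schemademo

// Schema fixes the set of field names at configuration time.
type Schema struct {
    fieldNames []string
}

// Record holds one value per schema field, in schema order.
type Record []string

// IndexOf resolves a field name to its index once, at config load time.
func (s *Schema) IndexOf(name string) int {
    for i, n := range s.fieldNames {
        if n == name {
            return i
        }
    }
    return -1
}

// truncateTransform keeps the resolved index and touches the slice directly
// on the hot path - no hashmap lookup per record.
type truncateTransform struct {
    fieldIndex int
    maxLen     int
}

func (t truncateTransform) Apply(r Record) {
    if v := r[t.fieldIndex]; len(v) > t.maxLen {
        r[t.fieldIndex] = v[:t.maxLen]
    }
}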

"tags" or similar concept doesn't exist here. Instead there are "if" and "switch-case" matching field values.

See the sample configurations for full features.

Performance and Backpressure

Logs are compressed and saved in chunk files if the output cannot clear them fast enough. The maximum number of pending chunks for each pipeline (key field set) is limited; the limits are defined in defs/params.go.

Input is paused if logs cannot be processed fast enough. Since RFC 5424 has no pause mechanism, this will likely cause internal errors in both the agent and the logging application, but it does not affect other applications' logging as long as pipelines are properly set up and isolated (e.g. by app-name and vhost).
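
A minimal sketch of the backpressure idea, assuming one bounded Go channel per pipeline (the actual limits live in defs/params.go; names here are illustrative):

package pipelinedemo

// Pipeline accepts records through a bounded channel. When the downstream
// transform/output stages fall behind, the channel fills up and Enqueue
// blocks, which stops the TCP reader from consuming more input - the
// "pause" described above.
type Pipeline struct {
    in chan []byte
}

func NewPipeline(maxPending int) *Pipeline {
    return &Pipeline{in: make(chan []byte, maxPending)}
}

// Enqueue blocks when the pipeline is full, propagating backpressure
// to the incoming syslog connection.
func (p *Pipeline) Enqueue(record []byte) {
    p.in <- record
}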

For a typical server CPU (e.g. Xeon, 2GHz), a single pipeline / core should be able to handle at least:

  • 300-500K log/s for small logs, around 100-200 bytes each including syslog headers
  • 200K log/s or 400MB/s for larger logs

Note that on servers with more than a few dozen CPU cores, an optimal GOMAXPROCS has to be measured and set for the production workload, until golang/go#28808 is resolved.

Build

Requires gotils, which provides the build tools:

make
make test

Operation manual

Configuration

See sample configurations.

Experimental configuration reloading is supported by starting with --allow_reload and sending SIGHUP; see testdata/config_sample.yml for details on which sections may be reconfigured. In general, everything after the inputs is reconfigurable. If reconfiguration fails, errors are logged and the agent continues to run with the old configuration, without side effects.

Note that after a successful reload, some previous logs may be sent upstream again if they had not been acknowledged in time.

The metric family slogagent_reloads_total counts successes and failures of reconfigurations.

Currently it is not possible to recover previously queued logs if the orchestration keys have been changed.

Runtime diagnosis

  • SIGHUP aborts and recreates all pipelines with new config loaded from the same file. Incoming connections are unaffected.
  • SIGUSR1 recreates all outgoing connections or sessions gracefully.
  • http://localhost:METRICS_PORT/ provides Go's built-in debug functions in addition to metrics, such as stack dumps and profiling.

Development

Mark inlinable code

Add an xx:inline comment on the same line as the function declaration:

func (s *xLogSchema) GetFieldName(index int) string { // xx:inline

If the function is too complex to be inlined, the build will fail with a warning.
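
The inline check presumably builds on the compiler's own diagnostics; the same information can be inspected manually with go build -gcflags='-m', which reports which functions can or cannot be inlined and why.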

Re-generate templated source (.tpl.go)

make gen

Re-generate expected output in integration tests

make test-gen

Runtime Diagnosis

The Prometheus listener address (default :9335) exposes Go's debug/pprof in addition to metrics, which can dump goroutine stacks.
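
A minimal sketch of how such a combined listener is typically wired in Go (assuming the standard net/http/pprof and Prometheus client packages; the agent's actual setup may differ):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    // e.g. `curl localhost:9335/debug/pprof/goroutine?debug=2` dumps all goroutine stacks
    log.Fatal(http.ListenAndServe(":9335", nil))
}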

Options:

  • --cpuprofile FILE_PATH: enable Go CPU profiling, with some overhead
  • --memprofile FILE_PATH: enable Go memory profiling
  • --trace FILE_PATH: enable Go tracing

Benchmark & Profiling

Example:

LOG_LEVEL=warn time BUILD/slog-agent benchmark agent --input 'testdata/development/*.log' --repeat 250000 --config testdata/config_sample.yml --output null --cpuprofile /tmp/agent.cpu --memprofile /tmp/agent.mem
go tool pprof -http=:8080 BUILD/slog-agent /tmp/agent.cpu

--output supports several formats:

  • `` (empty): default, forward to fluentd as defined in the config. Chunks may not be fully sent at shutdown; unsent chunks are saved for the next run.
  • null: no output. Results are compressed as in the normal routine, counted, and then dropped.
  • .../%s: create one fluentd forward message file per chunk at the given path (%s is the chunk ID). The directory must already exist.
  • .../%s.json: create one JSON file per chunk at the given path (%s is the chunk ID). The directory must already exist.

fluentd forward message files can be examined with fluentlibtool

Internals

See DESIGN

Key dependencies

  • fluentlib for the fluentd forward protocol, plus a fake server and dump tool for testing.
  • klauspost's compress library for fast gzip compression, which is absolutely critical to the agent: always benchmark before upgrading. Compression takes 1/3 to 1/2 of CPU time in our environments.
  • YAML v3, required for custom tags in configuration. KnownFields is still not working, so it cannot check for non-existent or misspelled YAML properties.

Authors

Special thanks to Henrik Sjöström for his guidance on Go optimization, integration testing and invaluable suggestions on performant design.

Contributors

jiping-s, sgtxn


slog-agent's Issues

config: Check for invalid keys in YAML config file

YAML v3 is used to support custom tags in the config file. Checking for invalid keys is still a work in progress upstream, so there is none for now. As a result, most transforms simply require all keys to be specified, to reduce the chance of human error.

This needs to be fixed, either in this repository or in the Go YAML library.

all: Limit amounts of logs from the agent itself

Background

Excessive amounts of logs from the agent itself can be triggered by I/O errors (e.g. disk quota) or broken input logs. In the worst case, assume that one malformed input log triggers at least one warning-level log from the agent. This is unacceptable, as it could easily produce TBs of logs in a few minutes.

What to do

Set up a log limit per component and apply it to the smallest unit possible: a TCP connection, a pipeline (not sub-pipelines) or a set of metric keys. The limit should be configurable, for example (a sketch of such a limiter follows the list):

  • max_logging_count: 500
  • max_logging_duration: 1h (reset counter afterwards)
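
A minimal sketch of such a per-component limiter (hypothetical names, not existing agent code), with one instance per connection / pipeline / metric key set:

package loglimit

import (
    "sync/atomic"
    "time"
)

// Limiter allows at most maxCount agent-side log lines per reset window.
type Limiter struct {
    maxCount int64
    count    int64
}

// New creates a limiter and resets its counter every resetEvery
// (the max_logging_duration idea above).
func New(maxCount int64, resetEvery time.Duration) *Limiter {
    l := &Limiter{maxCount: maxCount}
    go func() {
        for range time.Tick(resetEvery) {
            atomic.StoreInt64(&l.count, 0)
        }
    }()
    return l
}

// Allow reports whether another warning/error from the agent itself
// may be emitted (the max_logging_count idea above).
func (l *Limiter) Allow() bool {
    return atomic.AddInt64(&l.count, 1) <= l.maxCount
}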

all: Document full integration testing

Document our full integration testing once we have a prototype setup

The scope of full integration testing should cover:

  • Input: mixed normal log records (~200 bytes) and huge log records (~50K), with a sizable part of them to be dropped
  • Processing: hit every transform except deprecated or not-recommended-for-production ones
  • Output: network output
  • Upstream: fake server from https://github.com/relex/fluentlib with random failures to trigger recovery
  • Kill agent in the middle and restart to test shutdown & recovery: note this only works for catchable signals such as SIGTERM

Automatic verification needs to cover:

  • Final output logs from upstream in JSON
    • Order of output logs should be the same as input, per pipeline (key set)
      • Except when recovery happens: affected logs are resent, possibly duplicated, in the original order. For example logs 1 2 3 4 5 may be sent as 1 2 3 | 2 3 | 4 5
  • Metric values except buffer metrics regarding on-disk vs in-memory chunks

config: Reorganize defs/params and make them configurable

Background

defs/params.go contains advanced parameters. Some of the values are shared, while others are specific to certain packages or individual modules. Some of them also depend on others and cannot be set to arbitrary values.

What to do

  • Move parameters to the corresponding packages or individual units and add them to the configuration
  • For parameters with inter-dependencies, either calculate the values automatically or check them when loading the config

Simplify internal communication / flows, starting with baseoutput

Currently all internal communication and code flow is built around primitive Go channels and signals; as a result, the complexity has exploded, especially in baseoutput around the shutdown mechanism:

  • There are async sender and receiver to enable pipelining / reduce latency
  • There are soft stop, hard stop (abort) to reconnect, and hard stop to shut down.
  • Network reading and writing can fail independently, and the other side may keep working (unintended)
  • All pending chunks whether in queue or being processed in functions need to be collected for recovery
  • The code is difficult to test. Currently the only tests involving real output are in the run package, and they cannot cover any exceptional situations.

Instead of a free form where each piece of code decides how to use channels and signals, there could be a unified framework or library that helps ensure:

  • The top-level flow is visualized, in a declarative form. It needs to cover all exceptional situations, because dealing with them is the main cause of complexity.
  • The components (e.g. the acknowledger or sub-units) have clear inputs, outputs and responsibilities, and no implicit side effects

The internal communication in baseoutput is not performance-critical, as it works on compressed log chunks.

all: Upgrade to go 1.18 and generics

Same process as for the 1.16 upgrade:

  • Running benchmarks with fixed GOMAXPROCS=12 against different inputs to make sure there is no slowdown
  • Running benchmarks with GOMAXPROCS=1000 against different inputs to make sure there is no slowdown
  • Running benchmarks with default GOMAXPROCS on a server with at least 500 cores to make sure there is no slowdown
  • Try different defs.IntermediateBufferedChannelSize values to see if it should be changed.

And replace code generation with generics

all: Dump current stats and log samples on signal

Background

We need to be able to get very detailed stats of the program when it hangs or crashes

What to do

  • Make current stats and latest logs-in-process available: note log fields are backed by mutable arrays and they may be changed or dropped when accessed out of scope.
  • Dump the information on fatal/panic and specific signals

output: Repeated failures in poor network environment

Reproduce

Start the test server in fluentlibtool with random failure chances (30% for all stages is enough) to simulate a poor network environment, and run the benchmark with minimal repeats - sometimes the client fails to finish and keeps reconnecting only to send pings (which then fail), while there are still chunks queued in the buffer.

Expected behavior

The buffer & client should be able to push chunks through even in such a situation. Also, periodic pings are not supposed to start while there are still many chunks queued.

Notes

The feeding from the hybrid buffer to the forwarder client's channel may be delayed for an unknown reason, or maybe those chunks should go into the "leftover" queue, which gets priority processing.

output: Simplify tracking of chunks in acknowledger

The way chunks are tracked for acknowledgement and recovered at the end seems unnecessarily complicated, and it cannot survive a BUG situation.

A redesign is needed here. It is probably enough to track them in a sync map instead of having the acknowledger "return" them at the end.
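
A minimal sketch of the suggested sync-map approach (hypothetical helpers, not the current implementation):

package acktracking // hypothetical package, for illustration only

import "sync"

// pendingChunks tracks chunks that have been sent but not yet acknowledged.
// Whatever remains at shutdown is exactly the set of chunks to recover,
// so nothing needs to be "returned" through channels at the end.
var pendingChunks sync.Map // chunk ID -> raw chunk bytes

func markSent(id string, chunk []byte) { pendingChunks.Store(id, chunk) }

func markAcknowledged(id string) { pendingChunks.Delete(id) }

// collectUnacknowledged gathers the leftover chunks for recovery.
func collectUnacknowledged() map[string][]byte {
    leftover := make(map[string][]byte)
    pendingChunks.Range(func(k, v interface{}) bool {
        leftover[k.(string)] = v.([]byte)
        return true
    })
    return leftover
}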

output: Limit max duration of outgoing connection to help load balancing

Background

The outgoing connections from the fluentd forward client are designed to be persistent, unlike the async request approach with pooling used in fluent-bit. This works fine, except it is very poor for load balancing on the server side, as connections stick to the same server node forever until either side is restarted.

What to do

Automatically closing connections after a certain duration (e.g. 30m) should address this, and should also avoid pausing before the next connection attempt in such situations.

Note that a connection should not be closed while there are still pending acknowledgements, which could be a problem if traffic never stops (better ideas needed). Meanwhile, a new connection cannot be opened until the old one is closed (current behavior), or the order of logs cannot be guaranteed.
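
A minimal sketch of the age check (hypothetical fields, not the current client code):

package forwarddemo

import "time"

const maxConnectionAge = 30 * time.Minute // e.g. the 30m suggested above

// connState mirrors the bookkeeping an output connection would need.
type connState struct {
    openedAt    time.Time
    pendingAcks int
}

// shouldRotate reports whether the connection should be closed gracefully so
// that the next connection can land on a different server node. Connections
// with pending acknowledgements are left alone, per the note above.
func (c *connState) shouldRotate() bool {
    return c.pendingAcks == 0 && time.Since(c.openedAt) > maxConnectionAge
}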

output: Add fast shutdown and make it optional if possible

Background

The current shutdown process in the fluentd forward client attempts to wait for all sent chunks to be acknowledged by the server, in order to reduce the number of chunks re-sent on the next start.

When there is network trouble, it can take minutes until timeout, and during that whole time the input is already shut down, meaning nothing is listening on the incoming port to receive logs from applications. A minute of downtime could mean GBs of logs lost or queued up, applications' operations blocked, or even more logs triggered by logging failures.

What to do

One of the two:

  1. Make fast shutdown a global option and support it in the fluentd forward client: close everything and assume in-flight buffers are lost when the end signal is received - the buffer above the output clients should persist all of them for recovery next time.
  2. If making it optional is difficult, just replace the old approach completely. The current code may still be needed for #6

Not in Scope

Shutdown handling in other parts should be fine and needs no change.

orchestrate: Investigate bottleneck in parallel sub-pipelines

For unknown reasons, the sub-pipelines under the byKeySet orchestrator offer minimal performance improvement. In the most recent tests, num=2 gave about 30% time saving at the cost of +30% CPU time, while higher values had no effect.

The problem could be anywhere: in the Distributor, in the generic buffering in PipelineChannel (called from byKeySetOrchestrator), or maybe even in the test input - the current benchmark code feeds input records by itself and may be affected by or affect normal operations.

What is known is that the chunk locks used to ensure each resulting chunk is output in the original order are rarely touched, and disabling them changed nothing.

all: Upgrade to go 1.16

go 1.16 should offer a significant performance improvement in the scheduler when GOMAXPROCS is high (e.g. 1000), due to partial addressing of golang/go#28808.

What needs to be done:

  • Running benchmarks with fixed GOMAXPROCS=12 against different inputs to make sure there is no slowdown
  • Running benchmarks with GOMAXPROCS=1000 against different inputs to make sure there is no slowdown
  • Running benchmarks with default GOMAXPROCS on a server with at least 500 cores to make sure there is no slowdown
  • Try different defs.IntermediateBufferedChannelSize values to see if it should be changed.

Input log types:

  • Common log records of around 200 bytes each, including syslog headers
  • Huge log records for error dumps, 10-50K. The new errors-input.log may be good enough for this.

orchestrate: improve visibility on incoming connections

It has been impossible to tell which clients are logging through slog-agent and what they are logging.

The existing information:

  • client addresses
  • orchestration keys

is not sufficient in situations where the keys cannot differentiate clients (e.g. multi-instance services), and full information cannot be represented in metrics due to concerns about metric label cardinality (M inputs * N pipelines * T steps * U outputs).

A workaround is to log key fields when an incoming connection requests dispatching logs to a previously-unused pipeline (identified by orchestration keys). This was done in "major: Use Go 1.18 generics", but turned out to be useless since it logs only the orchestration keys.

One option is to log the metric keys in addition to the orchestration keys, and/or dump raw headers in one of the new logs.

A better solution would be to save samples of incoming logs along with the destination pipelines assigned by the orchestrator, and print them on certain signals, which may be done as part of #8.

output: datadog & HTTP2

HTTP/2 should offer built-in multiplexing and avoid the need for a custom pipelining implementation. In the ideal case, merely changing the protocol to HTTP/2 would remove the latency issue.

There might still be buffer size problem with Go's HTTP2 implementation: golang/go#47840

Load testing will be needed against the real Datadog API.
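
A minimal sketch of enabling HTTP/2 on a standard Go client (assuming plain net/http against an HTTPS endpoint; not tied to any existing agent code):

package datadogdemo // hypothetical, for illustration

import (
    "net/http"
    "time"
)

// Go's net/http negotiates HTTP/2 automatically over TLS when using the
// default transport; ForceAttemptHTTP2 keeps that behavior on a customized
// transport. Requests on this client share one multiplexed connection,
// which is what would replace the forward client's custom pipelining.
var httpClient = &http.Client{
    Transport: &http.Transport{ForceAttemptHTTP2: true},
    Timeout:   30 * time.Second,
}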
