
goprobe's Introduction

goProbe


This package comprises:

  • goProbe - A high-throughput, lightweight, concurrent network packet aggregator
  • goQuery - CLI tool for high-performance querying of goDB flow data acquired by goProbe
  • gpctl - CLI tool to interact with a running goProbe instance (for status and capture configuration)

Conversion tools:

  • goConvert - Helper binary to convert goProbe flow data stored in CSV files
  • legacy - DB conversion tool to convert .gpf files - needed for upgrade to a v4.x compatible format

Data backends:

  • goDB - A small, high-performance, columnar database for flow data (pkg)

As the name suggests, all components are written in Go.

Warning

Migrating to Version 4 - There are breaking changes for:

  • the database format
  • goProbe's configuration file format
  • goProbe's API endpoints
  • the JSON results format from goQuery

To convert your existing pre-v4 DB to a v4.x compatible format, please refer to the legacy conversion tool.

Introduction

Today, targeted analyses of network traffic patterns have become increasingly difficult due to the sheer amount of traffic encountered. To enable them, traffic needs to be captured, examined, and broken down into key descriptors which yield a condensed explanation of the underlying data.

The NetFlow standard was introduced to address this reduction. It uses the concept of flows, which combine packets based on a set of shared packet attributes. NetFlow information is usually captured on one device and collected in a central database on another device. Several software probes are available, implementing NetFlow exporters and collectors.

goProbe deviates from traditional NetFlow in that flow capturing and collection run on the same device and the set of flow fields is reduced. It was designed as a lightweight, standalone system, providing both optimized low-footprint packet capture and a storage backend tailored to the flow data, in order to enable lightning-fast analysis queries.

Quick Start

Refer to the Releases section to install the software suite.

To start capturing, configure goProbe. To query data produced by it, run goQuery. To query across a fleet of hosts, deploy global-query.

goDB

The database is a columnar block-storage. The raw attribute data is captured in .gpf (goProbe file) files.

goDB is a package which can be imported by other Go applications.

The .gpf File Structure

The database has two built-in partition dimensions: interfaces and time. These were chosen with the goal to drastically reduce the amount of data that has to be loaded during querying. In practice, most analyses are narrowed down to a time frame and a particular interface.

Time partitioning is done in two steps: per day, and within the files, per five-minute interval. The location of flow data for these intervals is specified (amongst other properties) in the .meta files.

.meta file

The .meta file can be thought of as a partition-index and a layout for how the data is stored. Next to storing the timestamps and positions of blocks of flow data, it also captures which compression algorithm was used and provides sizing information for block decompression.

The .meta files are vitally important: if deleted, corrupted or modified in any way, the corresponding day of data can no longer be read.

Compression

goDB natively supports compression. The design rationale was to sacrifice CPU cycles in order to decrease I/O load. This has proven to drastically increase performance, especially on queries involving several days and a high number of stored flow records.

Supported compression algorithms are LZ4 and ZSTD (plus a pass-through null encoder).

Check encoder.go for the enumeration of supported compression algorithms and the definition of the Encoder interface. Compression features are available by linking against system-level libraries (liblz4 and libzstd, respectively), so if CGO is used (the default) those must be available at runtime; consequently, their development libraries are required if the project is built from source.
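For orientation, an encoder abstraction for this purpose might look roughly as follows. This is a hypothetical sketch; the actual Encoder interface in encoder.go may differ in names and signatures.

// Encoder is a sketch of a pluggable compression backend
type Encoder interface {
	// Compress compresses the input and writes it to dst, returning the
	// number of compressed bytes written
	Compress(data []byte, dst io.Writer) (int, error)

	// Decompress reads n compressed bytes from src and decompresses them
	// into out, returning the number of decompressed bytes
	Decompress(out []byte, n int, src io.Reader) (int, error)
}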

Alternatively, native Go implementations can be used if CGO is unavailable, or by disabling individual or all C library dependencies (in favor of their respective native implementations) by means of the following build overrides:

| Build override | Effect |
|---|---|
| CGO_ENABLED=0 go build <...> | Use native compression (no external dependencies) |
| go build -tags=goprobe_noliblz4 | Use native compression for LZ4 |
| go build -tags=goprobe_nolibzstd | Use native compression for ZSTD |

All of the above can be combined arbitrarily.

Warning

Depending on OS / architecture, using native compression can incur a significant performance penalty (in particular for write operations). While it allows for greater portability / ease of use, it is not recommended in heavy-load / high-throughput production environments.

Bash autocompletion

goQuery has extensive support for bash autocompletion. To enable autocompletion, you need to tell bash to use the goquery_completion program for completing goquery commands. How to do this depends on your distribution. On Debian derivatives, it is recommended to create a file goquery in /etc/bash_completion.d with the following contents:

_goquery() {
    case "$3" in
        -d) # the -d flag specifies the database directory;
            # rely on bash's builtin directory completion
            COMPREPLY=( $( compgen -d -- "$2" ) )
        ;;

        *)
            if [ -x /usr/local/share/goquery_completion ]; then
                mapfile -t COMPREPLY < <( /usr/local/share/goquery_completion bash "${COMP_POINT}" "${COMP_LINE}" )
            fi
        ;;
    esac
}

# register the completion function for the goquery command
complete -F _goquery goquery

Supported Operating Systems

goProbe is currently set up to run on Linux-based systems only (this might change in the future). Tested distributions and their system-level library dependencies include (but are most likely not limited to):

  • Debian >= 7.0 [=> liblz4-1,libzstd1]
  • Fedora >= 28 [=> lz4-libs,libzstd]
  • Ubuntu >= 14.04 [=> liblz4-1,libzstd1]
  • Alpine >= 3.14 [=> lz4-dev,zstd-dev]

Authors & Contributors

  • Lennart Elsen
  • Fabian Kohn
  • Lorenz Breidenbach
  • Silvan Bitterli

This software was initially developed at Open Systems AG in close collaboration with the Distributed Computing Group at the Swiss Federal Institute of Technology.

This repository was forked from the Open Systems repository at the end of 2018 and was detached as a standalone project in September 2020. Bug fixes and development of new features happen in this repository.

It has undergone an almost complete re-write with version 4 in 2023.

Bug Reports & Feature Requests

Please use the issue tracker for bugs and feature requests (or any other matter).

Make sure to branch off the main branch with your feature branch.

License

See the LICENSE file for usage conditions.


goprobe's Issues

Protect raw/time queries

v4 Release

For queries involving the time attribute, set default -f to 1 day in the past to reduce

  • load on DB
  • size of output (terminal spam)

Post-v4 Release

  • Support building slices of results in special path of DBWorkManager for raw queries
  • Consider extending the printer interface to support streaming chunks of results
  • ...

Improve CI / CD pipeline

With #59 we have a great basis for CI / CD already. However, there are a few things that could be improved:

  • Removal of dummy tag v3.0.1
  • Automatic upload of binaries only (as suggested here)
  • Automatic generation of better release comment (there should be a way to create a list of all commits since the last release, a.k.a. CHANGELOG)
  • Build / upload of rpm packages (and binaries for Fedora / CentOS)
  • Cross-check if tags (bug/feature) can be used in action, otherwise enforce
  • Add PR naming rule action
  • Re-test CD / release and confirm correct output

Logger Overhaul

I have strong reason to believe that the current logging implementation isn't thread-safe. Also, Google's team is developing https://pkg.go.dev/golang.org/x/exp/slog, which will probably make it into the standard library. Why not get ahead and use that as opposed to zap?

DoD:

  • synchronize map access
  • replace the zap logger with slog (see the sketch below)
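For reference, slog has since landed in the Go standard library as log/slog (Go 1.21). A minimal sketch of what call sites could look like after the switch (field names are illustrative):

package main

import (
	"log/slog"
	"os"
)

func main() {
	// a JSON handler writing structured records to stderr, replacing zap
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	logger.Info("writeout completed", "iface", "eth0", "flows", 1024)
}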

Panic during DB writeout (invalid IPv4 / IPv6 key)

When building goProbe from branch 43-global-query (commit 6821b8b414f4afea98a6bd84ef1eb0576249724a) it crashed during (first) writeout:

panic: key is neither ipv4 nor ipv6

goroutine 38 [running]:
github.com/els0r/goProbe/pkg/types.Key.IsIPv4(...)
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/types/keyval.go:106
github.com/els0r/goProbe/pkg/types.Key.GetSip(...)
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/types/keyval.go:164
github.com/els0r/goProbe/pkg/types/hashmap.List.Sort.func1(0xc0001c2000?, 0xc000212f00?)
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/types/hashmap/list.go:43 +0x42b
sort.order2_func(...)
        /usr/local/go/src/sort/zsortfunc.go:299
sort.median_func({0xc0001432e8?, 0xc000212f00?}, 0x9e, 0x9f, 0xa0, 0xc000143188)
        /usr/local/go/src/sort/zsortfunc.go:308 +0x4a
sort.medianAdjacent_func(...)
        /usr/local/go/src/sort/zsortfunc.go:316
sort.choosePivot_func({0xc0001432e8?, 0xc000212f00?}, 0x0, 0x3c?)
        /usr/local/go/src/sort/zsortfunc.go:281 +0x110
sort.pdqsort_func({0xc0001432e8?, 0xc000212f00?}, 0x7f12e76065b8?, 0x18?, 0xc0001c2000?)
        /usr/local/go/src/sort/zsortfunc.go:89 +0xdb
sort.Slice({0x7eb240?, 0xc00021e330?}, 0xd5?)
        /usr/local/go/src/sort/slice.go:26 +0xfa
github.com/els0r/goProbe/pkg/types/hashmap.List.Sort({0xc0000bc000, 0xd5, 0xd5})
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/types/hashmap/list.go:39 +0x76
github.com/els0r/goProbe/pkg/goDB.dbData({0xc0001a61e0?, 0xb82a48?}, 0xc0001437a0?, 0x44dc35?)
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/db_writer.go:104 +0xd3
github.com/els0r/goProbe/pkg/goDB.(*DBWriter).Write(0xc000212ae0, 0xc000143d60?, {0xc00011f2f0?}, 0x4?)
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/db_writer.go:53 +0x23d
github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).HandleWriteouts.func1()
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/handler.go:217 +0x74d
created by github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).HandleWriteouts
        /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/handler.go:179 +0xaa

It's unclear where that comes from (and there have been a lot of changes to slimcap that are not yet merged / matched in goProbe), but this should still be investigated.

In addition, upon restarting goProbe without removing the DB directory there was an error upon next (and all consecutive) writeouts:

2023-04-05T11:30:00.091+0200    error   writeout/handler.go:222 Error during writeout: seek /tmp/goprobe_test/eth0/2023/04/1680652800/proto.gpf: invalid argument       {"app_name": "goProbe_alpine", "app_version": "872abf81"}

This indicates that the failure protection implemented in #60 doesn't work as intended (maybe because it failed on the first attempt, leaving an empty file that now can't be seeked):

fw-2:~# ll /tmp/goprobe_test/eth0/2023/04/1680652800/proto.gpf
-rw-r--r--    1 root     root             0 Apr  5 11:30 /tmp/goprobe_test/eth0/2023/04/1680652800/proto.gpf

DoD

  • Reproduce write failure on first attempt (and add case to tests) and handle accordingly
  • Track potential issue with invalid keys (as the panic indicates, this should not happen (tm))
  • Fix discovered hashmap iterator issue

Remove custom libpcap / gopacket dependency

We should consider options to get rid of the custom-built static libpcap / gopacket dependency in favor of simply using the vanilla gopacket library. As a workaround for the incoming/outgoing flag provided via the custom library, we could split up capturing for each interface into two goroutines, each limiting traffic via BPF filter to either incoming or outgoing (see the sketch below). Consequently, capturing must (opaquely) "duplicate" each interface and finally merge the results into combined flows.
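A rough sketch of that split-capture idea using vanilla gopacket; note it uses pcap's direction filter (Handle.SetDirection) rather than a BPF expression, and omits error handling and the actual flow merging:

package main

import (
	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

// captureDirection opens a handle on iface restricted to one traffic
// direction and forwards packets to out. Sketch only: error handling,
// shutdown and flow merging are omitted.
func captureDirection(iface string, dir pcap.Direction, out chan<- gopacket.Packet) {
	handle, err := pcap.OpenLive(iface, 65535, true, pcap.BlockForever)
	if err != nil {
		return
	}
	defer handle.Close()

	if err := handle.SetDirection(dir); err != nil {
		return
	}
	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		out <- pkt
	}
}

func main() {
	packets := make(chan gopacket.Packet, 1024)
	go captureDirection("eth0", pcap.DirectionIn, packets)  // incoming
	go captureDirection("eth0", pcap.DirectionOut, packets) // outgoing
	for range packets {
		// merge both directions into combined flows here
	}
}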

Layer goDB paths with year and month subdirectories

This is about making the folder structure more readable and less cluttered (imagine 10 years' worth of data capture: an ls on any interface directory will yield >3500 entries with the default DB write interval, all of them unreadable epoch timestamps in one large list). In addition, it would make accesses slightly leaner, because the time range selection could already reduce the number of directories that require a GetDirent() call (and subsequent sorting by filename).

DoD

  • Extend goDB disk structure to include year and month, i.e. <interface>/<YYYY>/<MM>/<EPOCH> (see the sketch below)
  • Ensure year and month are "flipped" correctly upon transition
  • Allow for workload reduction in DBWorkManager based on time ranges (extend loop)
  • Add tests
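A minimal sketch of the proposed layout (a hypothetical helper, not the actual goDB code); the epoch component stays day-aligned, as in the current structure:

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// dirPath builds the proposed <interface>/<YYYY>/<MM>/<EPOCH> path for a
// given writeout timestamp
func dirPath(dbRoot, iface string, ts time.Time) string {
	day := ts.UTC().Truncate(24 * time.Hour) // day-aligned epoch directory
	return filepath.Join(dbRoot, iface,
		fmt.Sprintf("%d", day.Year()),
		fmt.Sprintf("%02d", int(day.Month())),
		fmt.Sprintf("%d", day.Unix()),
	)
}

func main() {
	ts := time.Date(2023, time.April, 5, 11, 30, 0, 0, time.UTC)
	fmt.Println(dirPath("/usr/local/goProbe/db", "eth0", ts))
	// Output: /usr/local/goProbe/db/eth0/2023/04/1680652800
}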

DB summary not being written

It seems that after the recent changes (directory restructure?), the DB summary isn't written / updated anymore. I can see the DB files getting updated, and a query yields results for e.g. the last five minutes, but the summary.json has remained unchanged since the update to the newest version of goProbe.
A quick grep in the code shows that the functions from Summary.go do not seem to be called from anywhere:

[fako@fako-x1 goProbe]$ grep -r DBSummary pkg/
pkg/goDB/Summary.go:type DBSummary struct {
pkg/goDB/Summary.go:func NewDBSummary() *DBSummary {
pkg/goDB/Summary.go:    summ := new(DBSummary)
pkg/goDB/Summary.go:// LockDBSummary tries to acquire a lockfile for the database summary.
pkg/goDB/Summary.go:func LockDBSummary(dbpath string) (acquired bool, err error) {
pkg/goDB/Summary.go:// LockDBSummary removes the lockfile for the database summary.
pkg/goDB/Summary.go:func UnlockDBSummary(dbpath string) (err error) {
pkg/goDB/Summary.go:func ReadDBSummary(dbpath string) (*DBSummary, error) {
pkg/goDB/Summary.go:    result := NewDBSummary()
pkg/goDB/Summary.go:            result = NewDBSummary()
pkg/goDB/Summary.go:func WriteDBSummary(dbpath string, summ *DBSummary) error {
pkg/goDB/Summary.go:func ModifyDBSummary(dbpath string, timeout time.Duration, modify func(*DBSummary) (*DBSummary, error)) (modErr error) {
pkg/goDB/Summary.go:            acquired, err := LockDBSummary(dbpath)
pkg/goDB/Summary.go:                    if err := UnlockDBSummary(dbpath); err != nil {
pkg/goDB/Summary.go:            summ, err := ReadDBSummary(dbpath)
pkg/goDB/Summary.go:                            summ = NewDBSummary()
pkg/goDB/Summary.go:            return WriteDBSummary(dbpath, summ)
pkg/goDB/Summary.go:func (s *DBSummary) Update(u InterfaceSummaryUpdate) {

@els0r : Did I miss something here, or is it possible that the calls to these functions were forgotten when adapting to the restructuring?

Micro-optimization collection issue

This issue tracks potential micro- (assuming those are the only ones left in the end) optimizations to add prior to release (probably in conjunction with #74 ).

Rationale: The linked issue already improves performance by about 14% with only a few changes. It should be noted though that the zero-allocation approach using NextPacketFn would be even better; unfortunately, using that via the capture.Source interface prevents the compiler from keeping the GPPacket on the stack (by design, there's nothing we can do about it), which kills any potential performance improvement by almost an order of magnitude. Maybe this could be avoided by skipping GPPacket entirely and interacting directly with the flow map (see also below).

  • (Re-)Add ZeroCopy() option / mode for ring buffer source (issue)

Rationale: Currently, data is copied from the ring buffer, no matter if a buffer has been provided to NextPacket() / NextIPPacket(). Since the data in the ring buffer is invalidated only upon the next call to those methods, we can re-add a real zero-copy mode and then do:

// Populate the packet
data = s.curTPacketHeader.payloadNoCopyAtOffset(uint32(s.ipLayerOffset), uint32(snapLen))
//s.curTPacketHeader.payloadCopyPutAtOffset(data, uint32(s.ipLayerOffset))

where

func (t tPacketHeader) payloadNoCopyAtOffset(offset, to uint32) []byte {
	// tp_mac (at byte offset 24 of the tpacket header) holds the offset of
	// the packet's MAC header within the frame; slice the ring buffer
	// directly instead of copying
	mac := uint32(*(*uint16)(unsafe.Pointer(&t.data[t.ppos+24])))
	return t.data[t.ppos+mac+offset : t.ppos+mac+to]
}
  • Calculate reverse EPHash on demand only

Rationale: Right now we determine both EPHash and EPHashReverse in parallel when populating GPPacket. However, not only is it probably faster to just swap the relevant parts of the hash to "convert" the forward hash to the reverse hash (in place), it could also be done only if required (i.e. if there is no match for the forward hash in the flow map; otherwise there's no reason to calculate it at all). We of course can't know which one is more likely to occur (at least not without additional tracking), but statistically speaking we can skip the reverse hash calculation in ~50% of the cases (ok, that's probably not the right number, but we can save a substantial amount of calculations / memory operations).

  • Remove capture source interface

Rationale: We are currently using a capture.Source (respectively capture.SourceZeroCopy) interface type for the capture.Capture type. While being (semi-)generic, it also adds interface overhead to each call to any method of the underlying source. At the level we are currently at, it might be reasonable to use an explicit type instead (maybe wrapped in a local named type in order to support future changes quickly while allowing for specific compiler-level checks & optimizations).

  • Remove main capture routine select{} statement

Rationale: Prior to implementing the two-way lock for the capture routine, we had to continuously check if a lock was pending / had to be acted on. Technically this is not required anymore, because the lock is signified via a channel with depth 1: although being atomic, the command can be consumed out-of-band in the capture routine, making a synchronous select{} statement obsolete.

  • Remove debugging statements

Rationale: Even though final removal of debugging statements will be done in fako1024/slimcap#6 at a later point in time, it would be good to know how this change will impact performance (positively, I assume) once completed.

...

Make OS integration leaner

Currently, OS integration is not up to date with the latest changes, e.g. the goprobe.service systemd script executes lots of pre/post tasks, which are unnecessary within systemd (no PID file is required, systemd takes care of that by itself), plus we don't have any shared libraries anymore, hence the LD_LIBRARY_PATH stuff is unneeded as well.
For me, this simple file works splendidly:

[Unit]
Description=Network Traffic Monitoring
After=network.target syslog.target

[Service]
ExecStart=/bin/goProbe -config /opt/ntm/goProbe/etc/goprobe.conf
StandardOutput=syslog
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=goprobe.service

Now, obviously there is one issue with that: the socket file (which might remain stale). However, I don't think we should handle that in the init script (as it is OS-specific and just "cleans up" a mess that shouldn't even be there). Plus, systemd doesn't support custom service verbs (our current "info" verb doesn't work), so the whole socket interaction doesn't hold properly within that concept anyway. There are other considerations that bother me with the current socket / info concept:

  • We depend on netcat, which strongly differs for each OS
  • We depend on perl, just to interact with the daemon and get a status information (OMG!)
  • Sockets are fun, but not very 2018 (let's face it, it's in there for OSAG)
  • On several occasions, the info target locks up on some machines (it's sporadic, but sometimes it takes minutes until I get even the header line), so stability is an issue as well

I have the following proposals:

  • We remove the whole socket logic and provide an HTTP endpoint instead on lo. This can be parsed / extracted by any script / tool on any OS and will get rid of all other dependencies. Plus, go is made for that shit. Plus, we can extend it with more functionality in the future (as per your suggestion), to e.g. allow querying as well.
  • We provide an additional option to the goProbe binary (e.g. -info) that just queries that endpoint and prints the information (obviously not doing anything else, just exiting right away). That way we ensure that "info" can be called on any machine where goProbe is actually running without any additional dependencies / tools / whatever.

As a result, we would massively simplify OS integration and make the dependency stack much smaller (which is even more important now that we can just "go get" the project; it sucks a little when, in the end, the user figures out that there's a huge list of additional dependencies just to get it running).
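To make the idea concrete, here is a minimal sketch of both sides; endpoint path, port and output are illustrative assumptions, not an agreed-upon API:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// serveStatus exposes a status endpoint bound to loopback only,
// replacing the unix socket (path and port are made up for this sketch)
func serveStatus() error {
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok") // a real handler would render capture stats
	})
	return http.ListenAndServe("127.0.0.1:6060", nil)
}

// printInfo implements the proposed -info mode: query the endpoint,
// print the response, exit right away
func printInfo() error {
	resp, err := http.Get("http://127.0.0.1:6060/status")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "-info" {
		if err := printInfo(); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		return
	}
	if err := serveStatus(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}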

@els0r , I'd be interested in your thoughts about this...

goProbe not shutting down gracefully

There still seems to be an issue with the state machine somehow. I can (relatively) reliably reproduce the following behavior:

Shutting down a running instance of goProbe via Ctrl-C does not cause the state machine to terminate correctly (in fact, packets continue to be processed - I had some debugging statements in the code that were still popping up after I sent the signal). The signal is received, but even after a while the process still continues to run:

...
^C2023-03-06T14:29:20.333+0100	info	goProbe/goProbe.go:232	shutting down gracefully	{"app_name": "goProbe_slimcap", "app_version": "872abf81"}
...

I was able to abort it with a signal 6; maybe that gives a hint at where it's stuck: abort.txt

Switch goProbe's API to gin-gonic

We are currently using go-chi as the API framework. There are more widespread frameworks out there, and #43 will use gin-gonic under the hood.

At the same time, I would favor removing the whole v1 prefix from the API, or handling it differently. How often do we change versions? Don't we break every API imaginable with the switch to v4 anyhow? Does it really matter at this scale? Having a standard set of endpoints ready to go would also simplify the way we currently set up and configure the API.

So why not use the same framework for all APIs that this software suite has to offer?

DoD

  • replace go-chi with gin-gonic
  • simplify API instantiation
  • unify logic in the endpoints (selection for all interfaces and a /:interface sub-path; see the sketch below)
  • hook up the https://github.com/gin-contrib/pprof middleware when profiling is enabled (maybe the --profiling flag isn't so dumb after all)
  • investigate which parts of the code base can be re-used in #43 (e.g. request logging, etc.)
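A minimal sketch of what that could look like with gin-gonic and the pprof middleware (routes and listen address are illustrative):

package main

import (
	"net/http"

	"github.com/gin-contrib/pprof"
	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.Default()

	// status for all interfaces plus a /:iface sub-path
	r.GET("/status", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"ifaces": []string{"eth0", "eth1"}})
	})
	r.GET("/status/:iface", func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"iface": c.Param("iface")})
	})

	// expose /debug/pprof endpoints when profiling is enabled
	pprof.Register(r)

	_ = r.Run("127.0.0.1:8145")
}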

Support in-place sorting of hashmap-based lists

With hashmap.List, hashmap.Flatten(), and the list-based Sort(), we already have rudimentary functionality to convert the in-memory maps containing goQuery results into a flattened table structure. This could be leveraged to avoid the current approach of copying all map entries into a results structure after the aggregation step, and instead to use less data duplication (at least reusing the hashmap keys from memory).

DoD

  • Add more generic sorting concept for hashmap.List() (to match current features)
  • Support zero-copy for transfer from map to list (if possible, at least do not copy keys)
  • Add tests

Deadlock during shutdown when no traffic present on interface

It seems that there is some potential deadlock situation during shutdown when there is no traffic on an interface being monitored. When trying to stop the service / executing Ctrl-C when running interactively in a shell, goprobe simply stalls when trying to deactivate the handle for said interface, emitting the deactivation message for all prior interfaces and then hanging. Killing the binary with a trace signal reveals this (this is on release 2.1.1, built and executed on Debian):

PC=0x45bdb1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0xa0fc80, 0x80, 0x0, 0x0, 0xc000000000, 0x7ffd5b06bc58, 0x0, 0x0, 0x7ffd5b06bc78, 0x40d9c2, ...)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/sys_linux_amd64.s:531 +0x21
runtime.futexsleep(0xa0fc80, 0x7ffd00000000, 0xffffffffffffffff)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/os_linux.go:46 +0x4b
runtime.notesleep(0xa0fc80)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/lock_futex.go:151 +0xa2
runtime.stoplockedm()
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/proc.go:2165 +0x8a
runtime.schedule()
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/proc.go:2565 +0x2d9
runtime.park_m(0xc00006a600)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/proc.go:2676 +0xae
runtime.mcall(0x5a43d0)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/asm_amd64.s:299 +0x5b

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc00008a4e8)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc00008a4e0)
/home/fk/Develop/goProbe-2.1.1/go/src/sync/waitgroup.go:130 +0x64
OSAG/goProbe.(*RunGroup).Wait(0xc00008a4e0)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/rungroup.go:28 +0x2d
OSAG/goProbe.(*CaptureManager).disable(0xc00001e080, 0xc0000825e0, 0x2, 0x2)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture_manager.go:118 +0xfc
OSAG/goProbe.(*CaptureManager).DisableAll(0xc00001e080)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture_manager.go:169 +0x5f
main.main()
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:211 +0x871

goroutine 19 [syscall]:
os/signal.signal_recv(0x6700e0)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
/home/fk/Develop/goProbe-2.1.1/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
/home/fk/Develop/goProbe-2.1.1/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 4 [chan receive]:
main.handleWriteouts(0xc00005c060, 0xc000178000, 0x0)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:272 +0xef5
created by main.main
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:186 +0x6e3

goroutine 21 [IO wait]:
internal/poll.runtime_pollWait(0x7f83ddf12e30, 0x72, 0x0)
/home/fk/Develop/goProbe-2.1.1/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00016c018, 0x72, 0xc000058000, 0x0, 0x0)
/home/fk/Develop/goProbe-2.1.1/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc00016c018, 0xffffffffffffff00, 0x0, 0x0)
/home/fk/Develop/goProbe-2.1.1/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc00016c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/fk/Develop/goProbe-2.1.1/go/src/internal/poll/fd_unix.go:384 +0x1a0
net.(*netFD).accept(0xc00016c000, 0x0, 0xc0001780c0, 0xc000035fc8)
/home/fk/Develop/goProbe-2.1.1/go/src/net/fd_unix.go:238 +0x42
net.(*UnixListener).accept(0xc000010030, 0x0, 0x0, 0x0)
/home/fk/Develop/goProbe-2.1.1/go/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc000010030, 0xc000088300, 0xc00008a408, 0x66f5c0, 0xc0001780c0)
/home/fk/Develop/goProbe-2.1.1/go/src/net/unixsock.go:257 +0x47
main.handleControlSocket(0x670fe0, 0xc000010030, 0xc00005c060)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:366 +0xc8
created by main.main
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:197 +0x7f0

goroutine 7 [chan receive]:
OSAG/goProbe.(*Capture).process(0xc00016c200)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture.go:456 +0x2b6
created by OSAG/goProbe.NewCapture
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture.go:365 +0x174

goroutine 9 [syscall]:
github.com/google/gopacket/pcap._Cfunc_pcap_next_ex_escaping(0x7f83c4000a10, 0xc00016e310, 0xc00016e318, 0x0)
_cgo_gotypes.go:538 +0x4d
github.com/google/gopacket/pcap.(*Handle).getNextBufPtrLocked.func1(0x7f83c4000a10, 0xc00016e310, 0xc00016e318, 0xc00001e4f0)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/github.com/google/gopacket/pcap/pcap.go:403 +0x6a
github.com/google/gopacket/pcap.(*Handle).getNextBufPtrLocked(0xc00016e2d0, 0xc000040d20, 0x3, 0xc000020070)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/github.com/google/gopacket/pcap/pcap.go:403 +0x8a
github.com/google/gopacket/pcap.(*Handle).ReadPacketData(0xc00016e2d0, 0x647c68, 0x0, 0xc000040d30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/github.com/google/gopacket/pcap/pcap.go:340 +0x77
github.com/google/gopacket.(*PacketSource).NextPacket(0xc0000103c0, 0x646f98, 0xc00016c280, 0xc000040f10, 0x0)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/github.com/google/gopacket/packet.go:801 +0x6d
OSAG/goProbe.(*Capture).process.func1(0x0, 0x0)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture.go:398 +0x99
OSAG/goProbe.(*Capture).process(0xc00016c280)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture.go:440 +0xd0
created by OSAG/goProbe.NewCapture
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture.go:365 +0x174

goroutine 22 [chan receive]:
main.handleRotations(0xc00005c060)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:232 +0xb6
created by main.main
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/capture/cmd.go:200 +0x812

goroutine 25 [chan receive]:
OSAG/goProbe.(*Capture).Disable(0xc00016c280)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture.go:743 +0xdd
OSAG/goProbe.(*CaptureManager).disable.func1()
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/capture_manager.go:115 +0x4a
OSAG/goProbe.(*RunGroup).Run.func1(0xc00008a4e0, 0xc000082620)
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/rungroup.go:23 +0x4f
created by OSAG/goProbe.(*RunGroup).Run
/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goProbe/rungroup.go:21 +0x62

rax 0xca
rbx 0xa0fb40
rcx 0x45bdb3
rdx 0x0
rdi 0xa0fc80
rsi 0x80
rbp 0x7ffd5b06bc40
rsp 0x7ffd5b06bbf8
r8 0x0
r9 0x0
r10 0x0
r11 0x286
r12 0xc
r13 0xff
r14 0x66a556
r15 0x0
rip 0x45bdb1
rflags 0x286
cs 0x33
fs 0x0
gs 0x0

The issue is reproducible on any interface which encounters zero traffic (e.g. when no cable is connected and no outgoing connections are being established).
My first guess would be that the NextPacket() method is waiting for a new packet while the deactivation signal is received on the control channel, which cannot be acted upon because the capture method is still waiting...

Occasional "bad map state" during query

Non-reproducibly, a large query (-30d) on a relatively large DB (~182 MB compressed) crashes (see below). Based on the paths in the trace, the issue occurs with v2.1.1 (built with go 1.11.1).

fatal error: bad map state

goroutine 8 [running]:
runtime.throw(0x5b460b, 0xd)
	/home/fk/Develop/goProbe-2.1.1/go/src/runtime/panic.go:608 +0x72 fp=0xc000473920 sp=0xc0004738f0 pc=0x42bf92
runtime.evacuate(0x57e7e0, 0xc00046e1e0, 0x414)
	/home/fk/Develop/goProbe-2.1.1/go/src/runtime/map.go:1097 +0x657 fp=0xc0004739e0 sp=0xc000473920 pc=0x410927
runtime.growWork(0x57e7e0, 0xc00046e1e0, 0x414)
	/home/fk/Develop/goProbe-2.1.1/go/src/runtime/map.go:1043 +0x61 fp=0xc000473a08 sp=0xc0004739e0 pc=0x410281
runtime.mapassign(0x57e7e0, 0xc00046e1e0, 0xc000473c80, 0xc016485f68)
	/home/fk/Develop/goProbe-2.1.1/go/src/runtime/map.go:579 +0x4be fp=0xc000473a90 sp=0xc000473a08 pc=0x40f3fe
OSAG/goDB.(*DBWorkManager).readBlocksAndEvaluate(0xc000056a40, 0xc000010a00, 0xc00024049b, 0xa, 0xc0002ec000, 0x120, 0x200, 0xc00046e1e0, 0x0, 0x0)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:302 +0xb92 fp=0xc000473ef0 sp=0xc000473a90 pc=0x532be2
OSAG/goDB.(*DBWorkManager).grabAndProcessWorkload(0xc000056a40, 0xc00021cfc0, 0xc00021cf00, 0xc000296360)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:140 +0x127 fp=0xc000473fc0 sp=0xc000473ef0 pc=0x531db7
runtime.goexit()
	/home/fk/Develop/goProbe-2.1.1/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000473fc8 sp=0xc000473fc0 pc=0x457d91
created by OSAG/goDB.(*DBWorkManager).ExecuteWorkerReadJobs
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:159 +0xc8

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc000296368)
	/home/fk/Develop/goProbe-2.1.1/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc000296360)
	/home/fk/Develop/goProbe-2.1.1/go/src/sync/waitgroup.go:130 +0x64
OSAG/goDB.(*DBWorkManager).ExecuteWorkerReadJobs(0xc000056a40, 0xc00021cf00)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:168 +0x1de
main.main()
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/query/GPQuery.go:729 +0xae6

goroutine 6 [chan receive]:
main.main.func1(0xc00007eeb0, 0xc0000b0000, 0x414ea08e00000000)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/query/GPQuery.go:682 +0x76
created by main.main
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/query/GPQuery.go:680 +0x6c4

goroutine 7 [runnable]:
main.aggregate(0xc00021cf00, 0xc00021cf60)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/query/GPQuery.go:413 +0x3a4
created by main.main
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/query/GPQuery.go:725 +0xa77

goroutine 9 [runnable]:
OSAG/goDB.(*DBWorkManager).readBlocksAndEvaluate(0xc000056a40, 0xc000010a00, 0xc000246a9b, 0xa, 0xc0002dc000, 0x120, 0x200, 0xc00006e5d0, 0x0, 0x0)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:304 +0xc66
OSAG/goDB.(*DBWorkManager).grabAndProcessWorkload(0xc000056a40, 0xc00021cfc0, 0xc00021cf00, 0xc000296360)
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:140 +0x127
created by OSAG/goDB.(*DBWorkManager).ExecuteWorkerReadJobs
	/home/fk/Develop/goProbe-2.1.1/addon/gocode/src/OSAG/goDB/DBWorkManager.go:159 +0xc8

Replace gozstd with zstd system library (in line with lz4)

We are currently using gozstd as the primary zstandard implementation / package. Under the hood, it ships statically compiled libraries for all architectures. Along the lines of what we recently did with lz4, I propose to switch the zstd implementation in the same manner. That way we are independent of statically compiled components and have a unified way of interacting with these C libraries.
Since both use standard C libraries without customization, I assume there will be no performance impact (but potentially even better access to parameters like block sizes and / or checksum control).

DoD

  • Replace gozstd with direct link into system-wide libzstd (similar to lz4 interface)
  • Attempt to re-use compressor / decompressor to avoid repeated context allocation
  • Provide a global NullCompressor for shared use (as it isn't context-aware / just a wrapper around copy())
  • Perform some tests / benchmarks

Simplify the Table Printer

It's an exercise in boilerplate at the moment for something which is relatively simple: printing the elements of an array in different formats.

For json, this comes out of the box with annotations of the data structure.

Instead of having a complicated TablePrinter interface, let the list of Attribute/Value structs implement To<Format>.

This change would also be a chance to move away from a goDB internal type to something that has more appropriate type structure. After all, it stands at the end of the processing pipeline when the most performance-critical parts have been taken care of.

Something along the lines of:

type Result struct {
	Timestamp time.Time `json:"timestamp,omitempty"`
	Iface     string    `json:"iface,omitempty"`
	SrcIP     net.IP    `json:"src_ip,omitempty"`
	DstIP     net.IP    `json:"dst_ip,omitempty"`
	IPProto   string    `json:"proto,omitempty"`
	DstPort   uint16    `json:"dport,omitempty"`
}

Investigate move away from home-cooked bigendian package

Just a sidenote: we might reconsider replacing these with the Go standard library's little-endian storage / readout - in some recent benchmarks I realized that with recent Go versions there doesn't seem to be any performance benefit anymore (actually, in some cases even a penalty)...

Originally posted by @fako1024 in #12
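For reference, the standard library alternative alluded to above (a minimal sketch):

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, 8)

	// standard library little-endian writeout / readout; recent Go
	// compilers recognize this pattern and reduce it to plain
	// loads / stores on little-endian architectures
	binary.LittleEndian.PutUint64(buf, 0xdeadbeef)
	v := binary.LittleEndian.Uint64(buf)

	fmt.Printf("%#x\n", v)
}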

Flow direction heuristic breaks on long-running UDP flows (and other scenarios)

TL;DR: The direction detection heuristic stumbles on long-running UDP connections (e.g. VPN connections), causing significant inconsistencies / errors in measured ingress / egress traffic.

Longer story:
While implementing AF_PACKET in #34 I stumbled upon some flow mismatches when comparing the PCAP approach (by simply running two instances in parallel, one using PCAP, one using AF_PACKET). At first I thought it was related to, well, AF_PACKET, but it turns out that this can be reproduced by simply running two standard instances of goProbe. In the beginning, the output from goQuery on the two databases is identical, but after a few hours of running on my local device I obtained these results from the two DBs, respectively:

└─ $ ▶ goQuery -i any  -d /tmp/goprobe_db_pcap -f -10000d dport,proto -n 10

                              packets   packets             bytes      bytes       
        iface  dport  proto        in       out      %         in        out      %
  enp45s0u1u1    443    UDP    2.32 M  265.91 k  37.35    2.76 GB   25.99 MB  49.41
  enp45s0u1u1   3478    UDP  949.59 k    1.34 M  33.15  725.17 MB  871.80 MB  27.67
  enp45s0u1u1  50024    UDP  579.44 k  777.59 k  19.62  435.55 MB  477.46 MB  15.82
  enp45s0u1u1    443    TCP  114.95 k   96.88 k   3.06  192.06 MB   47.86 MB   4.16
  enp45s0u1u1   4500    UDP   89.55 k   63.22 k   2.21   58.12 MB   23.49 MB   1.41
  enp45s0u1u1  50016    UDP  184.82 k   65.19 k   3.61   35.90 MB   12.32 MB   0.84
  enp45s0u1u1  50059    UDP   33.80 k    3.45 k   0.54   29.29 MB  487.12 kB   0.52
  enp45s0u1u1    993    TCP    9.44 k    8.62 k   0.26    3.97 MB    1.26 MB   0.09
  enp45s0u1u1      0    UDP    1.00      2.08 k   0.03    1.47 kB    2.78 MB   0.05
  enp45s0u1u1  50050    UDP    1.44 k    2.26 k   0.05  173.56 kB    1.05 MB   0.02
                                  ...       ...               ...        ...       
                               4.29 M    2.63 M           4.21 GB    1.43 GB       
                                                                            
      Totals:                            6.92 M                      5.64 GB  

Timespan / Interface : [2022-08-18 11:31:19, 2022-08-18 17:56:19] / virbr0,lo,enp45s0u1u1

vs.

└─ $ ▶ goQuery -i any  -d /tmp/goprobe_db_pcap2 -f -10000d dport,proto -n 10

                              packets   packets             bytes      bytes       
        iface  dport  proto        in       out      %         in        out      %
  enp45s0u1u1    443    UDP    2.32 M  265.91 k  37.35    2.76 GB   25.99 MB  49.41
  enp45s0u1u1   3478    UDP  890.49 k    1.42 M  33.42  666.36 MB  913.75 MB  27.38
  enp45s0u1u1  50024    UDP  642.15 k  700.42 k  19.41  497.12 MB  435.56 MB  16.16
  enp45s0u1u1    443    TCP  114.95 k   96.88 k   3.06  192.06 MB   47.86 MB   4.16
  enp45s0u1u1   4500    UDP   89.55 k   63.22 k   2.21   58.12 MB   23.49 MB   1.41
  enp45s0u1u1  50016    UDP  184.79 k   65.22 k   3.61   35.90 MB   12.33 MB   0.84
  enp45s0u1u1  50059    UDP   30.21 k    3.02 k   0.48   26.54 MB  425.87 kB   0.47
  enp45s0u1u1    993    TCP    9.44 k    8.62 k   0.26    3.97 MB    1.26 MB   0.09
  enp45s0u1u1      0    UDP    1.00      2.08 k   0.03    1.47 kB    2.78 MB   0.05
  enp45s0u1u1  50050    UDP    1.44 k    2.26 k   0.05  173.67 kB    1.05 MB   0.02
                                  ...       ...               ...        ...       
                               4.29 M    2.63 M           4.21 GB    1.43 GB       
                                                                            
      Totals:                            6.92 M                      5.64 GB  

Timespan / Interface : [2022-08-18 11:31:19, 2022-08-18 17:56:19] / enp45s0u1u1,virbr0,lo

As can be seen, the overall sum of tracked packets is basically identical, but the two tuples for ports 3478 and 50024 show significant leakage. Most likely, at some point in time flow expiry differed slightly (because of raciness), causing packet direction to come out differently in the next time frame.

Structured logging for goProbe/goQuery

The way the logger is plastered all over the code base doesn't help readability and takes an unwieldy detour via the els0r/log package.

If we already go with an explicit dependency for logging, I suggest that we use the prevalent packages for structured logging in Go:

  • zap
  • logrus

Personally leaning towards the former due to its clearer API and performant logging when using explicit types.

Suggestion is to tackle this after the open PRs around AF_Packet and compression have been wrapped up.

Improved map key & in-memory data structure concept

After doing some tests as part of #37, it might be beneficial to revamp the internal data structures, in particular the in-memory flow aggregation map(s) (which use an expensive struct key right now and are not very flexible). The general idea would be to replace the map key by a string and then use raw data bytes (with string casts where required) all the way from capture until the moment information needs to be formatted for the user (a sketch follows the list below). Expected advantages:

  • Fewer allocations due to a reduced number of conversions / struct populations
  • Less time spent on map hashing (since a simple string can be hashed faster than a composite key), which currently is one of the most CPU-consuming tasks for larger queries
  • More flexibility w.r.t. keying, e.g. IPv4 vs. IPv6 traffic (could save space both in-memory and on-disk)
  • More straightforward data storage / retrieval (essentially just handling byte slices instead of copying back and forth between structs and bytes)
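A minimal sketch of the string-keyed idea; it leans on the Go compiler's optimization that a map lookup of the form m[string(b)] does not allocate:

package main

import "fmt"

func main() {
	flows := make(map[string]uint64)

	// raw key bytes straight from capture: sip, dip, dport, proto
	// (layout is illustrative)
	rawKey := []byte{10, 0, 0, 1, 10, 0, 0, 2, 0x01, 0xbb, 6}

	flows[string(rawKey)]++ // insert / update (key allocated once on insert)

	if count, ok := flows[string(rawKey)]; ok { // lookup: no allocation
		fmt.Println("packets:", count)
	}
}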

Refactor metadata concept

Currently, GPFile block metadata is stored in .gpf.meta files, one metadata file per GPFile. On interface level this is invisible to the caller (GPFile essentially handling its respective header under the hood). Unfortunately there's a few issues with the whole concept that need to be tackled:

  • Both goProbe and goQuery have to open one additional (tiny) file for each day that is accessed, effectively doubling the amount of IOPS
  • The metadata information is stored and processed as a single line protocol (read / written line by line), causing a huge amount of overhead
  • The whole concept of a "hidden" metadata file that gets handled behind the scenes feels slightly unclean

DoD

  • Switch GPFile metadata storage to a single metadata file per "day" / folder
  • Choose highly efficient serialization protocol (and potentially compression)
  • Reading: Load metadata upon access to anything in a folder and inject into GPFile constructor (or maybe even better: abstract "Folder" that lazy-loads "Columns" (i.e. GPFiles) upon access, acting as manager)
  • Writing: Update metadata upon successful write of all blocks to directory (caveat: what happens if there is a failure during write?)

Update .deb build Makefile target and instructions

The current way of building .deb packages is problematic in the sense that it uses the binaries compiled on the build host to generate Debian-specific packages. Due to the CGO dependency, this will cause issues if e.g. a different GLIBC version is present on the build host (or none at all, e.g. on Alpine).
Since we're relying on Docker already anyways and we've switched the build pipeline to support Go modules, performing the full build in a multi-stage Dockerfile should be almost trivial.

Make Metadata writes atomic

In order to minimize the likelihood of broken blocks or even full GPDir / GPFile instances, we should make the metadata writes atomic (or at least as atomic as possible). That way we can not only minimize the possibility of corruption, but also "recover" even if e.g. only some of the GPFiles in a GPDir have been written when an issue (e.g. a system halt or similar) occurs, by simply picking up the file offsets from the last successful write, losing only the block during which the error occurred.

DoD

  • Make metadata writes atomic (use a temporary file, then move it into place after successful serialization; see the sketch below)
  • Ensure that GPFile instances are opened at their last known offset (not the end) and overwriting is allowed
  • Add test(s) to reproduce a failure / recovery scenario
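A minimal sketch of the temp-file-plus-rename approach from the first DoD item (file naming and error handling are illustrative):

package main

import (
	"os"
	"path/filepath"
)

// writeAtomic writes data to a temporary file in the target directory and
// renames it into place; on POSIX filesystems the rename is atomic, so
// readers see either the old or the new metadata, never a partial write
func writeAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".meta-*")
	if err != nil {
		return err
	}
	// best-effort cleanup on failure; does nothing once the file
	// has been renamed away
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	_ = writeAtomic("/tmp/goprobe_test/eth0/.meta", []byte("serialized metadata"))
}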

Packet fragmentation handling

In order to assess slimcap, we are currently throwing a WARNING message upon every fragmented packet we encounter in GPPacket.go:

// return decoding error if the packet carries anything other than the
// first fragment, i.e. if the packet lacks a transport layer header
if fragOffset != 0 {
  return fmt.Errorf("Fragmented IP packet: offset: %d flags: %d", fragOffset, fragBits)
}

After testing is concluded, this should be removed / replaced by a simple nil return (as we are covering / measuring fragmented packets based on their first fragment).

DoD

  • Check if any further handling is required (e.g. for IPv6)
  • Remove error in favor of nil return
  • Add a simple test to GPPacket_test.go
  • Add E2E test case to cover fragmentation

Improve performance / throughput of packet population / classification

Since the packet population in GPPacket.go is the hot path of our capture processing logic in goProbe, we should maximize the achievable throughput of these functions (since during processing no further packets can be fetched from the wire).

Also relevant: fako1024/slimcap#1

DoD

  • Profile bottlenecks / assess potential for optimizations
  • Replace transport layer copy with a single byte (for now, could be extended selectively in the future) for any (protocol dependent) auxiliary information about a packet we need to carry for later classification to minimize the overhead (similar to what we had in place before with tcpFlags, but generalized)
  • Replace reset() functionality with a newly stack-allocated instance of GPPacket (preliminary tests show it's much faster and does not escape to the heap)
  • Remove capture.Packet interface in favor of specific struct (already implemented upstream in fako1024/slimcap#1)
  • Compare performance

Support skipping irrelevant data for IPv4/IPv6 only conditions

There are two scenarios where it could be useful to support reading / addressing only parts of the source files:

  • If there's a time range overlap with the current directory (i.e. "day") that would cause only parts of the file to be used (in particular when using the memory-buffered file), as mentioned by @els0r.
  • If there's a conditional that is either IPv4-only or IPv6-only. Since the stored flows are split at an IPv4/IPv6 boundary, it would significantly boost performance for such cases if GPFile only consumed / read the required part of the files and the DBWorkManager started / stopped only at the required ranges.

DoD:

  • Determine if a query has conditionals that limit it to IPv4 / IPv6 only
  • Support sub-addressing the MemFile (i.e. handing over a start / end option to GPFile that limits either the os.File or the MemFile to the required range, causing only a partial read)
  • Set the loop boundaries and read operations in DBWorkManager accordingly

goProbe panics when writing out data for an interface without flows

ts=2023-03-06T17:04:46Z level=debug caller=capture/capture.go:183 msg="there are currently no flow records available" app_name=goProbe app_version=8e02f0a7 iface=eth1 state=capturing
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x751a17]

goroutine 9 [running]:
github.com/els0r/goProbe/pkg/goDB.dbData({0xc0001a2000?, 0x88?}, 0x0?, 0xc000121830?)
        /usr/src/packages/src/sdwan/security/pkg/OSAGtmi/goProbe/pkg/goDB/db_writer.go:107 +0xb7
github.com/els0r/goProbe/pkg/goDB.(*DBWriter).Write(0xc0002c2930, 0xc000121d60?, {0xc00002d540?}, 0x4?)
        /usr/src/packages/src/sdwan/security/pkg/OSAGtmi/goProbe/pkg/goDB/db_writer.go:57 +0x23d
github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).HandleWriteouts.func1()
        /usr/src/packages/src/sdwan/security/pkg/OSAGtmi/goProbe/pkg/goprobe/writeout/handler.go:217 +0x790
created by github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).HandleWriteouts
        /usr/src/packages/src/sdwan/security/pkg/OSAGtmi/goProbe/pkg/goprobe/writeout/handler.go:179 +0xaa

Revisit HostID generation

Provide means to place an identifier into the HostID field which uniquely identifies the system the goDB is stored on.

See #64 for details.

DoD

  • Remove dependency on external machineid package
  • Implement use of /etc/machine-id + /var/lib/dbus/machine-id for Linux (see the sketch below)
  • Implement fallback random generation / storage of goDB internal ID (same format to be consistent)
  • Adapt logic in goQuery (if required)
  • Add tests
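A minimal sketch of the machine-id lookup with fallback (the random-ID generation from the DoD is omitted here):

package main

import (
	"fmt"
	"os"
	"strings"
)

// hostID reads the systemd machine ID, falling back to the dbus path,
// as proposed in the DoD above
func hostID() (string, error) {
	for _, path := range []string{"/etc/machine-id", "/var/lib/dbus/machine-id"} {
		if data, err := os.ReadFile(path); err == nil {
			return strings.TrimSpace(string(data)), nil
		}
	}
	return "", fmt.Errorf("no machine ID found")
}

func main() {
	id, err := hostID()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(id)
}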

Review compression concept

Since we are planning to perform a conversion step for v4 anyway it could be reasonable to have another look at the current compression concept, e.g.

  • Consider switching to (newest) default LZ4 build (allows direct linking against system-wide liblz4), which might have received significant updates (including performance related ones that might outweigh the benefits of this specific build)
  • Create dictionaries (as supported by LZ4, similar to ZStandard) for columns with low cardinality (e.g. Ports, Proto) to speed up their de(compression)
  • Consider skipping compression for columns with high entropy, e.g. bytes / packets (although at typical LZ4 read speeds it might not matter speed-wise)
  • Use zero-allocation compression (in-place) or at least explicitly re-use memory buffers

Provide interface for query Execution

As a precursor to #43, this issue aims to provide a more generic approach to obtaining query results. With #43, there will be two ways of getting them:

  • via accessing goDB data and aggregating it
  • via consolidated data from a query run on goDB

The latter being important when aggregating results across several hosts.

Proposal is an interface, which goDB and any other tools that call goQuery implement:

type Runner interface {
    Run(ctx context.Context, stmt *query.Statement) (result results.Result, err error)
}

Implications of this are a major restructuring (not necessarily a refactor) of the package structure inside pkg (mainly query and goDB) and having a goDBQueryRunner execute queries.

Note on naming: going for Runner vs. Executor because the latter just sounds off.

Provide compression interface for goDB

Currently, the compression algorithm and library is hard coded into goQuery. Since the number of methods needed to compress the gpf data files is small, an interface is the right option.

More importantly, this switch will open up goQuery to other compression algorithm and won't enforce LZ4 anymore.

Race condition in workload counter

The CI revealed a (minor) race condition in the workload counter of DBWorkManager (see run here).

DoD

  • Fix race condition (probably use an RWMutex or an atomic counter; see the sketch below)
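For illustration, the atomic-counter variant could look like this (a minimal sketch, not the actual DBWorkManager code):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// race-free workload counter shared by concurrent DB workers
	var numWorkloads atomic.Int64

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			numWorkloads.Add(1)
		}()
	}
	wg.Wait()

	fmt.Println("workloads:", numWorkloads.Load())
}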

Endpoint: /query

Aside from goQuery it should be possible to use the query API directly via goProbe's Web API.

Suggestion:

  • an endpoint that requires a query args definition and returns the json encoded result of the query

Implications for the backend:

  • OOM handling needs to be refactored, since we don't want to give the caller of the endpoint the ability to DoS goProbe via a memory-intensive query. This is necessary since the query now "lives" inside the capturing probe

@fako1024 : FYI.

Support PF_RING

Since gopacket supports the PF_RING module (see https://www.ntop.org/products/packet-capture/pf_ring/), we might consider supporting it via goProbe as well (refer to https://pkg.go.dev/github.com/google/gopacket/pfring for how to use it within the context of gopacket).
This way, we could support high-traffic scenarios without being limited to libpcap (which will, at some point, start dropping packets if rates get too high).
Ideally, we would support capturing via both libpcap and PF_RING without build constraints, but I'm unsure if that's possible (I wasn't yet able to test the module due to my kernel being too new, see ntop/PF_RING#817).

Race condition in slimcap capture shutdown

There seems to be a (minor) race condition when calling Close() and Free() on a capture source on shutdown:

2023-04-05T14:18:45.888+0200  info    goProbe/goProbe.go:232  shutting down gracefully        {"app_name": "goProbe_alpine", "app_version": "872abf81"}
2023-04-05T14:18:45.889+0200    info    capture/capture.go:437  closing capture handle  {"app_name": "goProbe_alpine", "app_version": "872abf81", "iface": "lo", "state": "closing"}
2023-04-05T14:18:45.889+0200    info    capture/capture.go:437  closing capture handle  {"app_name": "goProbe_alpine", "app_version": "872abf81", "iface": "eth0", "state": "closing"}
2023-04-05T14:18:45.889+0200    error   capture/capture.go:534  failed to free capture resources: cannot call Free() on open capture source, call Close() first {"app_name": "goProbe_alpine", "app_version": "872abf81", "iface": "lo", "state": "capturing"}
2023-04-05T14:18:45.893+0200    info    writeout/handler.go:261 completed all writeouts {"app_name": "goProbe_alpine", "app_version": "872abf81"}
2023-04-05T14:18:45.893+0200    info    goProbe/goProbe.go:251  graceful shut down completed    {"app_name": "goProbe_alpine", "app_version": "872abf81"}

Free() is obviously (sometimes) called before Close(), which makes sense as there currently is no "termination" state machine / handling in the process() method.

DoD

  • Ensure that Free() is only called after Close() has returned

Alpha test on sensor fleet(s)

Preliminary test done as part of #131.

| Host ID | # Interfaces | # Cores | GB memory | RB Block Size | RB Num Blocks | Classification | Comments |
|---|---|---|---|---|---|---|---|
| 85f74f66 | 75 | 8 | 32 | 1 MiB | 2 | Mid-range volume, mid-range number of interfaces | Details in #131 |
| 86c3efe2 | 6 | 32 | 128 | 1-2 MiB | 2-4 | High volume host | Drops experienced consistently on the lower ring buffer setting. Fewer observed when setting more blocks and a higher block size. Came at the expense of a higher memory footprint between 1.1 GB - 2.0 GB vs. 900 MB - 1.2 GB. Still seems to be an issue for traffic bursts. |
| 765bd9af | 337 | 16 | 32 | 2 MiB | 4 | Mid-range volume, high number of interfaces | Directly went to a larger ring buffer size and number of blocks due to many drops observed across the band. With the default setting of 1 MiB and 2 blocks, the drops were in the thousands. With the current setting, spot checks showed no drops. |
| 77c4f356 | 4 | 48 | 128 | 1 MiB | 4 | High volume host | No drops |

Next steps:

  • deploy and leave running on production hosts (3 flavors)

Minimize GC in custom hashmap implementation

In #44 there have been considerable improvements to performance. However, large queries generate equally large maps of intermediate / aggregate data, and the underlying map causes heavy GC pressure because its bucket implementation uses pointers to all those objects (keys + counters). This could probably be alleviated by using a flat memory structure with indexing instead of pointer logic, which is possible because:

  • We don't need to delete items from the map (it's add / update only)
  • We know the types involved and hence can choose a static memory / data structure

DoD:

  • Assess low hanging fruits, e.g. storing key + value in a single object (might cut the number of pointers in half)
  • Implement a "flat" data structure that can grow with the map and can be indexed instead of having to rely on pointers
  • Adapt bucket logic to store information in flat structure
  • Benchmark + optimize
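A minimal sketch of the flat, index-based idea (hashing and collision handling are omitted; the caller supplies the bucket directly):

package main

import "fmt"

// entry holds key and counter side by side in one flat slice element,
// so the GC sees a handful of slices instead of millions of pointers
type entry struct {
	key   [16]byte // fixed-size flow key
	count uint64
}

type flatMap struct {
	entries []entry // flat backing store, grows with the map
	buckets []int32 // index into entries, -1 if empty
}

func newFlatMap(n int) *flatMap {
	m := &flatMap{buckets: make([]int32, n)}
	for i := range m.buckets {
		m.buckets[i] = -1
	}
	return m
}

// add is illustrative: a real implementation needs hashing and
// collision handling
func (m *flatMap) add(bucket int, key [16]byte) {
	if idx := m.buckets[bucket]; idx >= 0 {
		m.entries[idx].count++ // update only, no deletes needed
		return
	}
	m.entries = append(m.entries, entry{key: key, count: 1})
	m.buckets[bucket] = int32(len(m.entries) - 1)
}

func main() {
	m := newFlatMap(1024)
	var key [16]byte
	m.add(42, key)
	m.add(42, key)
	fmt.Println(m.entries[m.buckets[42]].count) // 2
}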

goProbe rotation runs into deadlock

When running a current instance of goProbe (develop), I can see flows sporadically not being tracked properly. Compared to a legacy version of goProbe, several are missing (although the overall numbers are not crazily far off, so it must be an issue that does not affect all flows). Both queries were issued at 16:00, so the flows should have long been rotated for sure:

# goQuery_legacy -d /usr/local/goProbe/db -i eth1.10 -f "2023-03-07 15:10:00" -c "dip = 84.46.64.49" -n 10 raw

                                                                      packets   packets             bytes      bytes       
               time    iface         sip          dip  dport  proto        in       out      %         in        out      %
  23-03-07 15:23:49  eth1.10  10.0.0.102  84.46.64.49    443    TCP    2.37 k    5.63 k   4.15  155.81 kB    7.84 MB   4.03
  23-03-07 15:38:49  eth1.10  10.0.0.102  84.46.64.49    443    TCP  488.00      1.58 k   1.07   32.62 kB    2.18 MB   1.12
  23-03-07 15:48:49  eth1.10  10.0.0.102  84.46.64.49    443    TCP   49.94 k  132.69 k  94.77    3.17 MB  184.87 MB  94.85
  23-03-07 15:53:49  eth1.10  10.0.0.102  84.46.64.49    443    TCP    4.00      8.00     0.01  264.00  B  576.00  B   0.00
                                                                                                                           
                                                                      52.81 k  139.92 k           3.35 MB  194.89 MB       
                                                                                                                    
            Totals:                                                            192.72 k                    198.25 MB  

Timespan / Interface : [2023-03-07 15:07:45, 2023-03-07 15:58:49] / eth1.10
Sorted by            : first packet time
Query stats          : 4.00   hits in 89ms
Conditions:          : dip = 84.46.64.49

vs.

# goQuery_slimcap -d /tmp/test_db -i eth1.10 -f "2023-03-07 15:10:00" -c "dip = 84.46.64.49" -n 10 raw

                                                                       packets  packets              bytes    bytes        
                 time    iface         sip          dip  dport  proto       in      out       %         in      out       %
  2023-03-07 15:25:00  eth1.10  10.0.0.102  84.46.64.49    443    TCP   2.37 k   5.63 k  100.00  155.81 kB  7.84 MB  100.00
                                                                                                                           
                                                                        2.37 k   5.63 k          155.81 kB  7.84 MB        
                                                                                                                   
              Totals:                                                            8.01 k                     7.99 MB  

Timespan / Interface : [2023-03-07 15:10:00, 2023-03-07 15:30:00] / eth1.10
Sorted by            : first packet time
Query stats          : displayed top 1 hits out of 1 in 12ms
Conditions:          : dip = 84.46.64.49

global-query

Distributed querying for goQuery, aggregating results.Result structures. This allows running queries and flow aggregations against a global fleet on which goProbe/goQuery is deployed.

Proof-of-concept for slimcap

In order to properly assess feasibility / performance, we should implement a simple PoC for goProbe using github.com/fako1024/slimcap instead of gopacket.

DoD

  • Remove gopacket dependency from goProbe
  • Use slimcap to retrieve packets from the wire
  • Transfer data from the received payload to GPPackets / GPFlow (zero-copy)
  • Adapt state machine to accommodate the simpler interface of slimcap
  • Implement packet stats for slimcap (similar to pcap stats; there should be a socket option for that in AF_PACKET, see the sketch after this list)
  • Implement TPacket V3 handling to provide bulk packet handling instead of polling individual ones
  • Improve upon handling for buffer / block / frame sizes
  • Perform benchmarks and optimize
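
For the packet-stats item, a hedged sketch of reading the PACKET_STATISTICS socket option on Linux via golang.org/x/sys/unix (the raw socket setup is illustrative; slimcap manages its sockets internally):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// htons converts a value to network byte order, as required by socket(2).
func htons(v uint16) uint16 { return v<<8 | v>>8 }

func main() {
	// Raw AF_PACKET socket capturing all ethertypes (requires CAP_NET_RAW).
	fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, int(htons(unix.ETH_P_ALL)))
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// ... capture traffic ...

	// PACKET_STATISTICS reports packets seen and dropped since the last
	// read; the kernel resets the counters when they are queried.
	stats, err := unix.GetsockoptTpacketStats(fd, unix.SOL_PACKET, unix.PACKET_STATISTICS)
	if err != nil {
		panic(err)
	}
	fmt.Printf("packets: %d, drops: %d\n", stats.Packets, stats.Drops)
}
```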

CI/CD solution for goProbe

We had Drone CI in place a while back, but it crapped out and I removed it.

Shall we, for integrity's sake, re-introduce an easy-to-use, simple CI pipeline?

DoD

  • Create CI for builds / tests
  • Enable race detection for tests
  • Enable test coverage detection
  • Setup some CodeQL checks
  • Setup some basic release pipelines (e.g. .deb + .rpm packages)
  • Adjust badge in README.md

Inconsistent permissions for goDB / metadata

It turns out that there currently is a (trivial) conflict in permission handling for data stored by goProbe in the goDB:

-rw------- 1 root root 232 Mar 17 10:20 .blockmeta
-rw-r--r-- 1 root root 109 Mar 17 10:20 bytes_rcvd.gpf
-rw-r--r-- 1 root root 109 Mar 17 10:20 bytes_sent.gpf
-rw-r--r-- 1 root root 144 Mar 17 10:20 dip.gpf
-rw-r--r-- 1 root root  56 Mar 17 10:20 dport.gpf
-rw-r--r-- 1 root root  73 Mar 17 10:20 pkts_rcvd.gpf
-rw-r--r-- 1 root root  73 Mar 17 10:20 pkts_sent.gpf
-rw-r--r-- 1 root root  36 Mar 17 10:20 proto.gpf
-rw-r--r-- 1 root root  37 Mar 17 10:20 sip.gpf

When implementing #50 I adhered to the security recommendations and set permissions for the metadata file to 0600, while the attribute (.gpf) files are still written with 0644.
Given that traffic data can be considered sensitive, I don't really think that limiting access to 0600 is such a bad idea (assuming that whoever runs goProbe is usually also the user doing the querying, which is presumptuous of course). @els0r What do you think? We could also make this configurable (should we want to retain the option to keep access more "open")...
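
Either way, a minimal sketch of making the permissions consistent at file-creation time (constant name and mode handling are illustrative, not the actual goDB code):

```go
package godb

import "os"

// dbFileMode would apply to .blockmeta and .gpf files alike; it could be
// promoted to a config option should more "open" access be desired.
const dbFileMode os.FileMode = 0600

// createDBFile creates metadata and attribute files with one shared mode,
// resolving the 0600 vs. 0644 mismatch shown above.
func createDBFile(path string) (*os.File, error) {
	// The mode passed alongside O_CREATE is still subject to the process umask.
	return os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, dbFileMode)
}
```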

DoD

  • Make blockmeta more permissive
  • Add test
  • Ensure that legacy tool honors permissions

Use seeded hash for hashmap

In order to prevent hash collisions (and minimize potential attack vectors) we should use a seeded hash for the hashmap implementation. xxh3 should support this out of the box.
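
A minimal sketch of the seeded variant, assuming the github.com/zeebo/xxh3 package (which provides HashSeed) is the implementation in use:

```go
package hashing

import (
	"crypto/rand"
	"encoding/binary"

	"github.com/zeebo/xxh3"
)

// seed is drawn once per process at startup, so hash values differ across
// runs and crafting collisions against the hashmap becomes impractical.
var seed = func() uint64 {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err) // no usable entropy source
	}
	return binary.LittleEndian.Uint64(b[:])
}()

// hashKey replaces an unseeded xxh3.Hash(key) call.
func hashKey(key []byte) uint64 {
	return xxh3.HashSeed(key, seed)
}
```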

DoD

  • Switch xxh3 to seeded variant
  • Assess performance implications (should have none)

Make compression algorithm used for blocks configurable

Right now, we enforce the use of lz4 for block compression. There are other good compression algorithms out there, and we want to extend goQuery to choose the compression algorithm.

The proposal is to:

  • Implement a magic file header in the .gpf file to show that this file supports compression headers
  • Include a leading byte for each compressed block, which stores the compression algorithm used. The byte will be based on an enumeration of supported algorithms.

The additional storage overhead is minimal compared to the gained flexibility (see the sketch below).
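
A minimal sketch of the proposed per-block tagging (the magic value and the algorithm enumeration are illustrative assumptions):

```go
package gpfile

// Compression enumerates the supported block compression algorithms; its
// values back the leading byte of each compressed block.
type Compression byte

const (
	CompressionNone Compression = iota
	CompressionLZ4
	CompressionZSTD // hypothetical additional algorithm
)

// magicHeader would mark a v2 .gpf file that carries per-block compression
// tags (the value is illustrative).
var magicHeader = [4]byte{0xFE, 'G', 'P', 'F'}

// encodeBlock prepends the algorithm tag to a compressed payload, adding
// exactly one byte of overhead per block.
func encodeBlock(algo Compression, compressed []byte) []byte {
	out := make([]byte, 0, len(compressed)+1)
	out = append(out, byte(algo))
	return append(out, compressed...)
}

// decodeBlock splits the tag from the payload.
func decodeBlock(block []byte) (Compression, []byte) {
	if len(block) == 0 {
		return CompressionNone, nil
	}
	return Compression(block[0]), block[1:]
}
```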

This also entails providing a utility that converts all .gpf files from version 1 to version 2. As a consequence, older versions of goQuery will not be able to read the new format.

This should go hand-in-hand with release v4 of the software suite.

@fako1024 : will be fun 😀
