
Logprep

Introduction

Logprep allows you to collect, process and forward log messages from various data sources. Log messages are read and written by so-called connectors. Currently, connectors for Kafka, Opensearch, ElasticSearch, S3, HTTP and JSON(L) files exist.

The log messages are processed serially by a pipeline of processors, where each processor modifies an event that is passed through. The main idea is that each processor performs a simple task that is easy to carry out. Once the log message has passed through all processors in the pipeline, the resulting message is sent to a configured output connector.

Logprep is primarily designed to process log messages. Generally, Logprep can handle JSON messages, allowing further applications besides log handling.

This readme provides basic information about Logprep. More detailed information can be found in the Documentation.

About Logprep

Pipelines

Logprep processes incoming log messages with a configured pipeline that can be spawned multiple times via multiprocessing. The following chart shows a basic setup that represents this behaviour. The pipeline consists of three processors: the Dissector, the Geo-IP Enricher and the Dropper. Each pipeline runs concurrently and takes one event from its Input Connector. Once the log message is fully processed, the result is forwarded to the Output Connector, after which the pipeline takes the next message, repeating the processing cycle.

flowchart LR
A1[Input\nConnector] --> B
A2[Input\nConnector] --> C
A3[Input\nConnector] --> D
subgraph Pipeline 1
B[Dissector] --> E[Geo-IP Enricher]
E --> F[Dropper]
end
subgraph Pipeline 2
C[Dissector] --> G[Geo-IP Enricher]
G --> H[Dropper]
end
subgraph Pipeline n
D[Dissector] --> I[Geo-IP Enricher]
I --> J[Dropper]
end
F --> K1[Output\nConnector]
H --> K2[Output\nConnector]
J --> K3[Output\nConnector]

Processors

Every processor has one simple task to fulfill. For example, the Dissector can split long message fields into multiple subfields to facilitate structural normalization, the Geo-IP Enricher takes an IP address and adds its geolocation to the log message based on a configured GeoIP database, and the Dropper deletes fields from the log message.

A detailed overview of all processors can be found in the processor documentation.

To influence the behaviour of those processors, each can be configured with a set of rules. These rules define two things: first, when the processor should process a log message, and second, how to process it, for example which fields should be deleted or for which IP address the geolocation should be retrieved.

For performance reasons, all rules per processor are aggregated into a generic and a specific rule tree on startup. Instead of evaluating all rules independently for each log message, the message is checked against the rule tree. Each node in the rule tree represents a condition that has to be met, while the leaves represent changes that the processor should apply. If no condition is met, the processor simply passes the log event on to the next processor.

The following chart gives an example of such a rule tree:

flowchart TD
A[root]
A-->B[Condition 1]
A-->C[Condition 2]
A-->D[Condition 3]
B-->E[Condition 4]
B-->H(Rule 1)
C-->I(Rule 2)
D-->J(Rule 3)
E-->G(Rule 4)

To further improve performance, it is possible to prioritize specific nodes of the rule tree so that broader conditions sit higher up in the tree while more specific conditions are moved further down. The following JSON gives an example of such a rule tree configuration. This configuration leads to the prioritization of tags and the message field in the rule tree.

{
  "priority_dict": {
    "category": "01",
    "message": "02"
  },
  "tag_map": {
    "check_field_name": "check-tag"
  }
}

Instead of writing very specific rules that apply to single log messages, it is also possible to define generic rules that apply to multiple messages. A set of generic and a set of specific rules can be defined for each processor, resulting in two rule trees.

Connectors

Connectors are responsible for reading the input and writing the result to a desired output. The main connectors that are currently used and implemented are a kafka-input-connector and a kafka-output-connector, which allow receiving messages from a Kafka topic and writing messages into a Kafka topic. Additionally, you can use the Opensearch or Elasticsearch output connectors to ship the messages directly to Opensearch or Elasticsearch after processing.

The details regarding the connectors can be found in the input connector documentation and output connector documentation.

Configuration

To run Logprep, certain configurations have to be provided. Because Logprep is designed to run in a containerized environment like Kubernetes, these configurations can be provided via the file system or via HTTP. Providing the configuration via HTTP makes it possible to manage configuration changes through a flexible HTTP API, which enables Logprep to adapt quickly to changes in your environment.
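
For illustration, the configuration could then be passed either as a file path or as an HTTP URL when starting Logprep (both values are placeholders; the URL variant assumes the HTTP delivery described above):

logprep config/pipeline.yml
logprep https://your-api/config/pipeline.yml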

First, a general configuration describes the pipeline and the connectors; additionally, the processors need rules in order to process messages correctly.

The following YAML configuration shows an example configuration for the pipeline shown in the chart above:

process_count: 3
timeout: 0.1

pipeline:
  - dissector:
      type: dissector
      specific_rules:
        - https://your-api/dissector/
      generic_rules:
        - rules/01_dissector/generic/
  - geoip_enricher:
      type: geoip_enricher
      specific_rules:
        - https://your-api/geoip/
      generic_rules:
        - rules/02_geoip_enricher/generic/
      tree_config: artifacts/tree_config.json
      db_path: artifacts/GeoDB.mmdb
  - dropper:
      type: dropper
      specific_rules:
        - rules/03_dropper/specific/
      generic_rules:
        - rules/03_dropper/generic/

input:
  mykafka:
    type: confluentkafka_input
    bootstrapservers: [127.0.0.1:9092]
    topic: consumer
    group: cgroup
    auto_commit: true
    session_timeout: 6000
    offset_reset_policy: smallest
output:
  opensearch:
    type: opensearch_output
    hosts:
        - 127.0.0.1:9200
    default_index: default_index
    error_index: error_index
    message_backlog_size: 10000
    timeout: 10000
    max_retries:
    user: the username
    secret: the password
    cert: /path/to/cert.crt

The following YAML represents a dropper rule which, according to the previous configuration, should be located in the rules/03_dropper/generic/ directory.

filter: "message"
drop:
  - message
description: "Drops the message field"

The condition of this rule checks whether the field message exists in the log. If it does exist, the dropper deletes this field from the log message.

Details about the rule language and how to write rules for the processors can be found in the rule configuration documentation.

Getting Started

For installation instructions see: https://logprep.readthedocs.io/en/latest/getting_started.html#installation
For execution instructions see: https://logprep.readthedocs.io/en/latest/getting_started.html#run-logprep

Reload the Configuration

A config_refresh_interval can be set to periodically and automatically refresh the given configuration. This can be useful in containerized environments (such as Kubernetes), where pod volumes often change on the fly.
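
A minimal example, assuming the interval is given in seconds (the value is illustrative):

config_refresh_interval: 300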

If the configuration does not pass a consistency check, an error message is logged and Logprep keeps running with the previous configuration. The configuration should then be checked and corrected based on the error message.

Documentation

The documentation for Logprep is online at https://logprep.readthedocs.io/en/latest/ or it can be built locally via:

sudo apt install pandoc
pip install -e .[doc]
cd ./doc/
make html

The HTML documentation can then be found at doc/_build/html/index.html.

Contributing

Every contribution is highly appreciated. If you have ideas or improvements, feel free to create a fork and open a pull request. Issues and engagement in open discussions are also welcome.


logprep's Issues

Get dotted field value error if value contains key part as string

The method get_dotted_field_value(...) exists in four variants, and most of them do not check whether a field is a dict.
This can lead to an error if a field contains a plain value where a nested dict is expected.
The only implementation with a type check can be found in helpers.py.

Expected behavior
Values with the same name as a key do not cause errors in get_dotted_field_value.

Current behavior
Values with the same name as a key do cause errors in get_dotted_field_value.

Steps to reproduce
Create an event {"get": "dotted"} for any processor that needs to access the dotted field "get.dotted", but is expecting something like {"get": {"dotted": "some_value"}} (it is expected that "some_value" will be returned).
This will raise an exception.

Environment

Logprep version: 06148eb
Python version: 3.6-3.9

Possible solution
Use the version of get_dotted_field_value from helpers.py or change the others to work like in helpers.py.
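
A minimal sketch of such a type-checked lookup (the function body is illustrative and not copied from helpers.py):

def get_dotted_field_value(event: dict, dotted_field: str):
    """Return the value at a dotted field path, or None if a part is missing or not a dict."""
    current = event
    for key in dotted_field.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# For the event {"get": "dotted"} and the path "get.dotted" this returns None
# instead of raising an exception.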

Optimize git workflow with caching

Is your feature request related to a problem? Please describe.
Some git workflows do not utilize caching, which makes them unnecessarily slow.

Describe the solution you'd like
Add caching to workflows where appropriate. Caching could be added to installing dependencies in tests and code quality checks, similarly to how it's done for pex building.

Describe alternatives you've considered
Sharing dependencies across jobs might be an alternative, but it's not clear if this is possible or how complicated it would be.

Remove non-existing imports for labeler schema validation

The schema validation for the labeler expects imports that no longer exist and fails with an import error.

Expected behavior
The schema validation for the labeler should be performed without errors.

Current behavior
The schema validation for the labeler throws an import error.

Steps to reproduce
Run logprep (even without any configuration file).

Environment

Logprep version: 9713741
Python version: 3.6

Possible solution
Remove the imports that no longer exist.

Possible implementation
Remove DuplicateLabelInCategoryError, CategoryWithoutDesciptionInSchemaError and LabelWithoutDesciptionInSchemaError from schema_and_rule_checker.py.

Grok pattern counter stopped working

In some cases, Logprep does not start anymore if the grok pattern counter has been activated and grok patterns have been set.

Expected behavior
Logprep starts if the grok pattern counter has been activated.

Current behavior
In some cases, Logprep throws an exception if the grok pattern counter has been activated and grok patterns have been set.
It can't find the relative path for grok patterns.

Steps to reproduce
It has to be investigated when and why exactly this happens.

Environment

Logprep version: d4b03ff
Python version: 3.6-3.9

Config validation does not show processor anymore

The configuration validation does not show the processor name anymore if there is an error in the configuration.

Expected behavior
The configuration validation shows the processor name if there is an error in the configuration.

Current behavior
The configuration validation shows "__init__()" instead of the processor name if there is an error in the configuration ("Invalid processor config: __init__() ...").

Steps to reproduce

  1. Create a config that would be valid, but remove something, like "specific_rules"
  2. Run logprep
  3. See an error like "Invalid processor config: __init__() ..."

Environment

Logprep version: 8de3b07
Python version: 3.6-3.9

MySQL generic adder crash on missing config but existing rule

The generic adder crashes if the sql_config is disabled/commented out and there exist SQL rules for the generic adder.
It does not properly detect that sql was deactivated and crashes during runtime.

Expected behavior
The generic adder does not crash if the sql_config is disabled/commented out and there exist SQL rules for the generic adder.

Current behavior
The generic adder crashes if the sql_config is disabled/commented out and there exist SQL rules for the generic adder.

Steps to reproduce

  1. Configure the generic adder to not use sql.
  2. Add a SQL rule to the rule path.
  3. Run logprep
  4. Pass a document to logprep that would trigger a rule with SQL
  5. Watch it crash during runtime with a message like "'member_descriptor' object has no attribute 'get'"

Environment

Logprep version: d4b03ff
Python version: 3.6-3.9

Refactor Metric Tracking

Currently the tracking of metrics is centralized in the processor_stats.py module, where all metrics are gathered and aggregated into one big dictionary. To make this more extensible and easier to maintain, I would suggest splitting the processor_stats.py module into a more modular form that should be easier to test, maintain and extend. The main idea would be to use dataclasses that each track only specific metrics. Each Rule, RuleTree, Processor and Pipeline would get a corresponding object: RuleMetrics, RuleTreeMetrics, ProcessorMetrics and PipelineMetrics. Each of these would implement an expose-metrics method, so that the aggregator only has to collect the metrics and bring them into the desired output format: file or Prometheus. The following UML diagram draft outlines the idea without being the final solution yet.

classDiagram
direction TB
class Dataclass

class Metric
Metric: -str _prefix
Metric: -str _labels
Metric: -list _do_not_expose
Metric: expose(self)
Metric: reset_statistics(self)

class RuleMetrics
RuleMetrics: +int number_of_matches
RuleMetrics: +int mean_processing_time
RuleMetrics: +int ...

class RuleTreeMetrics
RuleTreeMetrics: +int number_of_rules
RuleTreeMetrics: +list[RuleMetrics] rules
RuleTreeMetrics: +int ...

class ProcessorMetrics
ProcessorMetrics: +int number_of_processed_events
ProcessorMetrics: +list[RuleTreeMetrics] generic_rule_trees
ProcessorMetrics: +list[RuleTreeMetrics] specific_rule_trees
ProcessorMetrics: +int ...

class ProcessorDomainResolverMetrics
ProcessorDomainResolverMetrics: +int total_urls
ProcessorDomainResolverMetrics: +int timeouts
ProcessorDomainResolverMetrics: +int ...

class PipelineMetrics
PipelineMetrics: +int number_of_errors
PipelineMetrics: +list[ProcessorMetrics] processors
PipelineMetrics: +int ...

class LogprepInstanceMetrics
LogprepInstanceMetrics: +int number_of_pipelines
LogprepInstanceMetrics: +list[PipelineMetrics] pipelines

%% Associations
Dataclass --|> Metric : inherits
Metric <|-- RuleMetrics : implements
Metric <|-- RuleTreeMetrics : implements
Metric <|-- ProcessorMetrics : implements
Metric <|-- PipelineMetrics : implements
Metric <|-- LogprepInstanceMetrics : implements
ProcessorMetrics <|-- ProcessorDomainResolverMetrics : implements
ProcessorMetrics <|-- ProcessorPreDetectorMetrics : implements

RuleMetrics o-- RuleTreeMetrics : has
RuleTreeMetrics o-- ProcessorMetrics : has
ProcessorMetrics o-- PipelineMetrics : has
PipelineMetrics o-- LogprepInstanceMetrics : has

These metric objects would then be integrated into the usual architecture such that each corresponding object has the correct metric object. The following reference implementation of the metrics highlights the idea further:

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Metric:
    _prefix: str
    _labels: dict

    def expose(self):
        ...

    def reset_statistics(self):
        ...

@dataclass
class RuleMetrics(Metric):
    number_of_matches: int = 0
    mean_processing_time: float = 0.0


@dataclass
class RuleTreeMetrics(Metric):
    rule_metrics: List[RuleMetrics]
    number_of_nodes: int = 0
    number_of_leaves: int = 0
    tree_width: int = 0
    tree_depth: int = 0

    @property
    def number_of_rules(self):
        return len(self.rule_metrics)

    @property
    def mean_rule_processing_time(self):
        return np.mean([rule.mean_processing_time for rule in self.rule_metrics])

Auto tester does not work for some processors after refactoring

The auto tester does not work for the pseudonymizer and the list comparison processor after some recent refactoring.
This could also affect future processors.

Expected behavior
The auto tester works for all processors. Rules are being initialized correctly.

Current behavior
The auto tester does not work for the pseudonymizer and the list comparison processor. Rules are not being initialized correctly.

Steps to reproduce

  1. Create rule passing auto tests for the pseudonymizer (with regex list keywords) or for the list comparison processor
  2. Run the auto tests
  3. The tests will fail

Environment

Logprep version: e67d0ed
Python version: 3.6-3.9

Possible solution
Perform additional initialization in the auto tester.

Elasticsearch output connector

Is your feature request related to a problem? Please describe.
It would be desirable to have the option to write documents directly into elasticsearch.

Describe the solution you'd like
An elasticsearch output connector that can write logs directly into elasticsearch should be implemented.
Furthermore, a factory for a connector that combines a confluentkafka input with an elasticsearch output should be implemented.

drop support for python 3.6 and bump dependencies to support python 3.11

Is your feature request related to a problem? Please describe.

Python 3.11 is going to improve performance significantly, but we still have dependencies in our project that are pinned to support Python 3.6.

Describe the solution you'd like

We should drop support for Python 3.6 and bump our dependencies far enough to support Python versions from 3.7 to 3.11.
If we can't support lower versions, we should favor the higher versions.

Allow running the auto-tester without rules

The auto-tester crashes if no rules have been loaded.

Expected behavior
The auto-tester should not crash even if no rules have been loaded.

Current behavior
The auto-tester crashes if all rule directories are empty, invalid or defined as an empty list.

Steps to reproduce
Run the Logprep auto-tester with a config that points to generic and specific rule directories that don't contain any rules.

Environment

Logprep version: b695bb0
Python version: 3.6

Possible solution
This problem is caused by the calculation of the test-coverage within the auto-tester. It causes a division by zero if no rules are loaded.
It should not fail if there are no rules, but it should also not print the test-coverage.

Possible implementation
Check if the test-coverage calculation would divide by zero and give appropriate feedback instead of calculating it.
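
A rough sketch of that guard, with hypothetical variable names:

def print_test_coverage(covered_rules: int, number_of_rules: int):
    """Print the rule test coverage, skipping the calculation if no rules are loaded."""
    if number_of_rules == 0:
        print("No rules loaded - skipping test coverage calculation.")
        return
    coverage = covered_rules / number_of_rules * 100
    print(f"Test coverage: {coverage:.1f}%")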

Make template replacer work even if target field does not exist

Is your feature request related to a problem? Please describe.
The template replacer only writes into an already existing target field.
This can be a problem if that field is dropped before the template replacer processes the event, while its result is still desired if it yields anything.
It would be useful if it worked regardless, since this behavior is not intuitive.

Describe the solution you'd like
The template replacer should write its results into the target field even if that field does not exist.
Replacing only already existing target fields can still be achieved by filtering for the existence of that field.

Describe alternatives you've considered
There are workarounds for this issue, like re-adding an empty field if it was dropped and then dropping it again if it is still empty after the replacer is done with it. However, those solutions add unnecessary complexity.

Refactor Processors to a common interface

As things stand, it is difficult to implement new features shared by all processors, as seen in #35. As a developer, I want to be able to implement such features in a more straightforward way without touching all the existing code again. We should refactor the processors to a common interface class Processor. In this issue we should also address refactoring the many per-processor factories into one factory for all processors. That way we can guarantee that all processors behave the same, and it should be easier to implement new processors.

A first draft could use the Normalizer and Pseudonymizer as examples:

classDiagram
    Processor <-- Normalizer : implements
    Processor <-- Pseudonymizer : implements
    dict --> ProcessorConfiguration : inherits
    Processor *-- ProcessorConfiguration
    ProcessorConfiguration <-- NormalizerConfiguration : implements
    ProcessorConfiguration <-- PseudonymizerConfiguration : implements
    ProcessorTestCase <-- NormalizerTestCase : implements
    ProcessorTestCase <-- PseudonymizerTestCase : implements
    class Processor{
        <<interface>>
        +String name
        +ProcessorConfiguration configuration
        +String describe()
        +add_rules_from_directories()
        +process()
        +apply_rules()*
    }
    class Normalizer{
        +apply_rules()
    }

    class Pseudonymizer{
        +apply_rules()
    }

    class ProcessorConfiguration{
        +__init__(dict)
        +list mandatory_fields()*
        +is_valid()
    }

    class NormalizerConfiguration{
        +list mandatory_fields()
    }

    class PseudonymizerConfiguration{
        +list mandatory_fields()
    }

    class ProcessorFactory{
        -_check_configuration(configuration)
        +Normalizer create_normalizer(name, configuration, logger)$
        +Pseudonymizer create_pseudonymizer(name, configuration, logger)$
    }

    class TestProcessorFactory{
        +test_check_configuration()
        +test_create_normalizer()
        +test_create_pseudonymizer()
    }

    class ProcessorTestCase{
        +test_describe()
        +test_add_rules_from_directories()
        +test_process()
        +test_apply_rules()*
    }

    class NormalizerTestCase{
        +test_apply_rules()
    }

    class PseudonymizerTestCase{
        +test_apply_rules()
    }
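
A minimal Python sketch of the proposed interface, loosely following the diagram above (method names and signatures are illustrative, not the final design):

from abc import ABC, abstractmethod


class Processor(ABC):
    """Common interface that all processors would implement."""

    def __init__(self, name: str, configuration: dict):
        self.name = name
        self.configuration = configuration

    def describe(self) -> str:
        return f"{type(self).__name__} ({self.name})"

    def process(self, event: dict):
        # shared logic lives here once; concrete processors only implement apply_rules
        self.apply_rules(event)

    @abstractmethod
    def apply_rules(self, event: dict):
        ...


class Normalizer(Processor):
    def apply_rules(self, event: dict):
        ...  # normalizer-specific rule application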

Make the generic resolver use hyperscan for regex matching

Is your feature request related to a problem? Please describe.
The generic resolver checks regex patterns sequentially, which does not scale well with the size of the lists.
Adding hyperscan to the generic resolver would make it scale better with list size.

Describe the solution you'd like
Add hyperscan to the generic resolver to match lists of regex patterns.
It was made for such a use case.
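
A rough sketch of how the python-hyperscan bindings could be used to match a whole list of patterns in one pass (the patterns and the scanned value are illustrative):

import hyperscan

patterns = [b"foo.*", b"bar[0-9]+"]  # the resolver's regex list
db = hyperscan.Database()
db.compile(expressions=patterns, ids=list(range(len(patterns))))

matched_ids = []

def on_match(pattern_id, start, end, flags, context):
    # collect the id of every pattern that matched the scanned value
    matched_ids.append(pattern_id)

db.scan(b"bar42 appeared in the field value", match_event_handler=on_match)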

Describe alternatives you've considered
Alternatively, the pattern lists could be sorted by the likelihood of a match. This could improve performance if some patterns are more likely to match than others.
This would not improve performance if the likelihood for pattern matches is uniform.
This can already be achieved by manually sorting the lists, but it is additional manual work, must be constantly maintained and is prone to human error.

Pseudonymizer has error with tld list

The pseudonymizer crashes if a tld list is configured.

Expected behavior
The pseudonymizer loads the tld list.

Current behavior
The pseudonymizer crashes if a tld list is configured.

Steps to reproduce

  1. Configure a tld list for the pseudonymizer
  2. Run logprep
  3. The pseudonymizer will crash, because the variable self._config.tld_list does not exist

Environment

Logprep version: e67d0ed
Python version: 3.6-3.9

Possible implementation
Rename the variable self._config.tld_list to self._config.tld_lists.

Implement GenericSpecificProcessStrategy

Is your feature request related to a problem? Please describe.
It is not.

Describe the solution you'd like
Currently only one rule processing strategy is available, namely SpecificGenericProcessStrategy, which processes specific rules before generic rules. In case someone prefers to process generic rules before specific rules, a second strategy is needed. It would also be necessary to make the strategy configurable so that a user can easily choose between the two.

Describe alternatives you've considered
/

Additional context
/

Provide Issue Templates

Providing structured templates for submitting issues helps us to deal with issues more efficiently.

For the creators of future tickets, these are useful because they pre-structure the level of information that should be included when submitting an issue.

Possible sources of inspiration are:
awesome-github-templates
github-issue-templates

Collect error messages if the config is incorrect

Is your feature request related to a problem? Please describe.
An incorrect configuration makes logprep fail, but it returns one error message at a time instead of showing everything that is wrong with the configuration.

Describe the solution you'd like
Collect error messages and give the user feedback of all incorrect configurations.

Describe alternatives you've considered
Fixing the configuration one error at a time is possible, but can be time consuming in certain situations.

Docs are available on readthedocs.io

Compiling the docs can be inconvenient, so it would be nice to have an up-to-date version on readthedocs.io, as many other open-source tools do.

Definition of Done:

  • Logprep docs are available on readthedocs.io
  • The docs are automatically updated when changed
  • A link to the docs is added to the readme file under "Documentation"

DryRun with input type json results in KeyError 0

If the dry run is started with the input type json, it results in a KeyError: 0.

Expected behavior
The dry-run should be executed on one json in the file.

Current behavior
A stacktrace with the error message KeyError: 0 is given.

Steps to reproduce

  1. Delete all but one json in the file quickstart/exampledata/input_logdata/test_input.jsonl
  2. (optional) Format the json into multiple lines
  3. Start logprep with logprep quickstart/exampledata/config/pipeline.yml --dry-run quickstart/exampledata/input_logdata/test_input.jsonl --dry-run-input-type json --dry-run-full-output

Environment

Logprep version: 2.0.1
Python version: 3.9.13

Add FieldConcat processor

Is your feature request related to a problem? Please describe.

As a user of logprep I want to have a processor to concatenate the value of different fields into one new field in one step.

  • I want to control what happens with the source fields on a per-rule basis (delete them?)
  • I want to ensure that no existing field is overwritten by the new concatenated field

implement semgrep rules to ensure code security and code quality

With a focus on security for this project, Semgrep rules should be implemented.
Semgrep is a tool for static code analysis. You can read more here: https://semgrep.dev/

The main focus of Semgrep is to avoid specific bad code patterns that could lead to security issues.

In addition, Semgrep can be used to establish and enforce our own code patterns. For example, with Semgrep it should be easy to ensure that new processors are implemented in the same way as the old ones without requiring a review (though it will of course not replace reviews).

Acceptance criteria:

  • Semgrep rules should be run without the --config auto switch.
  • All applied Semgrep rules should be part of this project.
  • Semgrep should be integrated into the CI pipeline.
  • Create a strategy to fix found Semgrep issues (the fixing is not in scope of this issue).

Add support for multiple input and output connectors

Is your feature request related to a problem? Please describe.

With #160 the connectors were split into separate input and output sections. As implemented, these sections are dictionaries.

Now I want to declare the connectors as a list to support multiple inputs and outputs.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

All outputs should provide a config field to mark them as a default output.
The goal should be to let Logprep ship all messages to all default outputs, but enable the predetector and the selective_extractor to ship to different output targets with store_custom.

The inputs should all be processed without any priority.

Format code with black

As discussed, we should format all code with the Black formatter.
To ensure reviews are not disturbed by too many reformatting changes, it would be a good idea to reformat all the code in one step.

Auto-tester should not require connector definition

Some changes to Logprep have caused the auto-tester to require a connector configuration to run. This behavior is unnecessary and should be fixed.

Expected behavior
Auto-tests can be run without defining a connector in the configuration.

Current behavior
Auto-tests require a connector in the configuration even though it is not used.

Steps to reproduce
Run auto-tests without defining a connector.

Environment

Logprep version: 9713741
Python version: 3.6

Clusterer slowed down after refactoring

The clusterer slowed down extremely after refactoring.
The slowdown increases with the number of rules and the amount of processed documents.
Thus, it might not appear to slow down if only a few rules and documents are being used.

Expected behavior
The clusterer processing time is similar to other processors.

Current behavior
The clusterer is extremely slow. The more rules it has the slower it becomes.

Steps to reproduce
Run the clusterer with many rules and compare it to other processors.

Environment

Logprep version: c831997
Python version: 3.6-3.9

pipeline should fail on code quality

To ensure code quality standards are met and to keep reviews focused on the main issues, the pipeline should fail if changed files do not conform to our pylint configuration, are not formatted according to our black configuration, or would decrease the test coverage if merged.

Split kafka connector into input and output connectors

Is your feature request related to a problem? Please describe.
Currently the confluentkafka connector is a joint connector with an input and an output.
This should be split into an input and an output connector so that they can be more easily combined with other connectors.

Describe the solution you'd like
An input and output connector should exist for kafka.
However, it should be possible to use confluentkafka with the same configuration as before.

Add the ability to the generic adder to enrich from SQL tables

Is your feature request related to a problem? Please describe.
SQL databases are a common way to store data. It would be useful if data present in SQL tables could be directly added via the generic adder.

Describe the solution you'd like
The generic adder should be able to connect to a SQL database, look for an event value in a column of a SQL table and then add the whole row of that table to that event.
Logprep should update its replacement data at runtime if the SQL table changes.

Application maximum poll interval (300000ms) exceeded by XXms

For a few days I have been encountering the problem that the Logprep application produces the following error at night:
"Application maximum poll interval (300000ms) exceeded by 13ms"

Currently my workaround is to restart the Logprep process entirely until the issue appears again, since the maximum poll interval is not yet configurable via the config file.

Expected behaviour
The expected behaviour would be to reconnect to kafka

Current behaviour
After the error logprep does nothing

Steps to reproduce

Not entirely sure; it looks as if there are not enough logs to work on and the gap between active connections is too big.

Environment

Logprep version: main version
Python version: 3.6

Possible solution
The error gets raised by the rdkafka library. The following issue is similar to mine:
confluentinc/confluent-kafka-python#552
The issue is that the rdkafka library does not reconnect to Kafka after raising the error, as stated in the issue above.
A solution would be to run the "poll" function once every maximum poll interval, since according to the issue above this reconnects the Kafka consumer to the cluster.

Alternatively, the maximum poll time could be made configurable.
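
A rough sketch of the suggested workaround, assuming a confluent_kafka Consumer (the configuration values mirror the example config above and the processing step is a stand-in):

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "127.0.0.1:9092",
    "group.id": "cgroup",
    "max.poll.interval.ms": 300000,
})
consumer.subscribe(["consumer"])

while True:
    # poll() has to be called at least once per max.poll.interval.ms,
    # otherwise the consumer is evicted from the consumer group
    message = consumer.poll(timeout=1.0)
    if message is None or message.error():
        continue
    print(message.value())  # stand-in for the actual processing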

Missing possibility to remove a whole nested object structure (json) at once

Is your feature request related to a problem? Please describe.

There is currently no way to remove (drop) the entire nested object structure without knowing and specifying all subfields in the dropper rule configuration.

Example:
to remove the following:

[{
  "nested": {
    "name1": "value1",
    "name2": "value2" 
  }
}]

you must drop both subfields:

drop: 
- nested.name1
- nested.name2

Describe the solution you'd like
There should be a possibility to remove all subfields of a nested object structure at once.
For the above example with nested.name1 and nested.name2 the syntax could be:

drop:
- nested.*

Describe alternatives you've considered
Another possibility is to use - nested instead of - nested.* within the dropper configuration (or any other syntax)

Additional context
Especially for temporary structures, which are often used by GROK patterns or timestamps, this option would be helpful. You could remove the entire nested object structure used by previous processors without knowing each subfield.

Building documentation fails

Building the documentation via tox fails.

Expected behavior
The documentation can be built via tox.

Current behavior
Building the documentation via tox fails.

Steps to reproduce
Run "tox -e py36-docs" or any other python version.

Environment

Logprep version: c831997
Python version: 3.6-3.9

Possible solution
Add missing dependency "sphinxcontrib-mermaid" to the tox file.

Make relative paths use a configurable base path

Is your feature request related to a problem? Please describe.
Some Logprep rules access data via relative paths. Logprep rules can be shared between different systems, and relative paths might differ on those systems.
It would be easier to change a base path for each processor instead of changing all relative paths in all rules.

Describe the solution you'd like
Add an option to give a base path to processors for relative paths in rule files, like it is done for the list comparison processor.

Describe alternatives you've considered
The only alternative would be to ensure the same directory structure on all systems or to adapt all paths by auto-replace, which is not always a viable solution.

Split output and input connectors

Is your feature request related to a problem? Please describe.

As a user of logprep I need to have independent input and output connectors to orchestrate my input and output connectors via configuration in pipeline.yml

Additional context

As of now it is not possible to combine two different connectors via pipeline.yml; such combinations have to be hard-coded instead. With the above flexibility, users could orchestrate the input and output connectors on their own.

Auto Rule Tester throws AttributeError after ProcessorStats Refactoring

The auto rule tester is currently not operational because it tries to access a processor attribute that was removed by the ProcessorStats refactoring.

Expected behavior
The auto rule tester should iterate over the rule tests and output the results.

Current behavior
An AttributeError is printed: AttributeError: 'Normalizer' object has no attribute 'ps'.

Steps to reproduce

  1. Create a rule test in the exampledata directory, for example exampledata/rules/normalizer/generic/example_rule_test.json
[
  {
    "raw": {
      "event": "foo"
    },
    "processed": {
      "event": "foo"
    }
  }
]
  2. Start logprep with PYTHONPATH='.' python logprep/run_logprep.py quickstart/exampledata/config/pipeline.yml --auto-test
  3. See error AttributeError: 'Normalizer' object has no attribute 'ps'

Environment

Logprep version: ba49d34
Python version: 3.9.13

Possible solution
Remove all references to the removed processor stats object and write more tests for the auto rule tester.

Logprep via pex fails silently if the config is incorrect

Running logprep via a pex file fails silently if the config is incorrect.

Expected behavior
A descriptive error message is thrown if the config is incorrect and logprep is started via pex, like it would be without pex.

Current behavior
Logprep via pex fails silently if the config is incorrect.

Steps to reproduce
Create an incorrect Logprep config and start it via pex (e.g. remove the type from a processor).
There will be no output and logprep will close.

Environment

Logprep version: b5c3439
Using pex

List comparison base paths are not specified by logprep config

The list comparison base paths are not specified in the logprep config anymore. Instead they are specified within each rule, which defeats the purpose of this feature.

Expected behavior
The list comparison base paths should be specified in the logprep config.

Current behavior
The list comparison base paths are specified in rules.

Steps to reproduce
Specify a base path for lists for the list comparison processor in the logprep config that is different from "./" and create rules with lists relative to this path.

Environment

Logprep version: 06148eb
Python version: 3.6 - 3.9

Logprep applies some rules multiple times if they match

Logprep applies some rules multiple times if they match:
once for each matching disjunction the rule filter contains.

Expected behavior
Logprep applies a rule only once if it matches.

Current behavior
Logprep applies a rule multiple times, once for each matching disjunction its rule filter contains.

Steps to reproduce

  1. Create a rule with a filter like "term_1 OR term_2", where both terms are true for some event X. This is most easily tested with a processor that does not allow duplication (i.e. the domain resolver).
  2. Run Logprep and send event X.
  3. See the duplication warning in the console.

Environment

Logprep version:
Python version: >=3.6

Possible solution
Adapt the rule tree to not match a rule multiple times or remove duplicates after the matching process if possible.

Add timestamp and delta time when events arrive in Logprep

Is your feature request related to a problem? Please describe.
Add a timestamp when events arrive in Logprep and calculate the delta to the event generation time, to be able to see the delay between event generation and processing by Logprep.

Describe the solution you'd like
Add it like it is done for the HMAC and the Logprep version.

Status_Logger crashes pipeline after status_logger period

When deploying the Quickstart environment provided in the main branch of this repository, one of the pipelines crashes after 30 minutes.

Expected Behavior

The pipeline should not crash on a regular basis.

Current Behavior

When starting Logprep using the provided Quickstart environment, whenever the rotating log file for the status_logger is written (every 30 minutes by default), one pipeline crashes.


Possible Solution

The culprit is util/processor_stats.py, and the issue seems to stem from the following static methods expecting self as an additional parameter, since the class StatusTracker is instantiated:

_get_sorted_output_dict
_add_derivative_data
_add_relevant_values_to_multiprocessing_pipelines
_aggregate_error_type
_remove_numpy_arrays

Steps to Reproduce

  1. Start Quickstart environment via docker-compose --profile logprep up -d (see README.md)
  2. Wait for 30 minutes.
  3. Look at Logprep's console output via the Docker CLI (Logprep container) and observe the stack trace.

Context (Environment)

Python 3.6 (including Logprep requirements), as deployed within the Quickstart docker environment.

Detailed Description

Possible Implementation

Removing the @staticmethod decorator and adding self as the first parameter to the above methods' definitions seems to solve the problem here.
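
A minimal sketch of that change on one of the methods (the parameter name and the omitted body are illustrative):

class StatusTracker:
    # before: defined as a static method, which conflicts with how the
    # instantiated class uses it
    #
    # @staticmethod
    # def _add_derivative_data(stats):
    #     ...

    # after: drop the decorator and add `self` as the first parameter
    def _add_derivative_data(self, stats):
        ...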

Consistency for applying rules

Since the last big processor refactoring, rules are not applied in the same order for all processors.

Expected behavior
Rules should be applied in the same order for all processors, and there should be a test in the BaseProcessorTestCase to ensure the same behaviour.

Environment

Logprep version: b695bb0
Python version: 3.6

Possible solution
Move the process method to a parent class for all rule-based processors and write tests for it, then delete the process method in all child processors.

Add ability to drop an event

Is your feature request related to a problem? Please describe.
As of now, we only have a dropper for fields. It would be nice to have a dropper that drops an event from further pipeline steps.

Describe the solution you'd like
The best approach could be to write a completely new processor for this, because it is new behavior that differs completely from the existing dropper (EventDropper vs. FieldDropper).

Describe alternatives you've considered
An alternative could be to give the existing dropper that feature.

Additional context

Hyperscan does not install on some systems

Hyperscan does not install on some systems, which makes Logprep unusable there.
The cause can be an unsupported Python version, architecture or OS for which Hyperscan is not compiled.
Furthermore, python-hyperscan does not seem to include the libraries after version 0.1.5, but 0.1.5 does not support Python 3.9.

Expected behavior
Logprep can be used with or without hyperscan.
Hyperscan should be optional if it is not supported by the system.

Current behavior
Hyperscan does not install on some systems. Logprep can't be used on those systems.

Steps to reproduce
Run logprep with hyperscan version 0.3.0 without having hyperscan libraries installed.

Environment

Logprep version: 2bfdf51

Possible solution
Only install and import hyperscan if it is supported by the system and only for python-hyperscan versions that include hyperscan libraries.
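
A minimal sketch of such an optional import (the flag name is illustrative):

try:
    import hyperscan
    HYPERSCAN_AVAILABLE = True
except ImportError:
    hyperscan = None
    HYPERSCAN_AVAILABLE = False

# code using hyperscan can then fall back to a pure Python implementation
# whenever HYPERSCAN_AVAILABLE is False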

Exists filter raises exception if value matches filter

The exists filter matches if a value matches the filter, but it should only match on keys.
It then raises an exception instead of simply not matching.

Expected behavior
The exists filter should only match for keys.

Current behavior
The exists filter matches if a value matches its filter and raises an exception.

Steps to reproduce

  1. Create an exists filter 'filter: "key1.key2"'
  2. Insert a document '{"key1": "key2"}'
  3. See error

Environment

Logprep version: c831997
Python version: 3.6-3.9

Add opensearch output connector

Is your feature request related to a problem? Please describe.

As a user of Logprep I want to have an Opensearch output connector to deliver events directly to Opensearch.

Additional context

The developed Elasticsearch output connector cannot be used for this, because the API seems to have changed.

Add processor that performs adaptive misuse detection on command-line events

Is your feature request related to a problem? Please describe.
Currently, the Logprep pre-detector enables misuse detection on incoming command-line events. Commonly known command-line obfuscation techniques make it possible to evade many command-line signatures.

Describe the solution you'd like
An additional processor should perform adaptive misuse detection on process creation events to also detect malicious command lines that are e.g. obfuscated. In case of a positive result, additional rule attribution shall be performed.

Add Dissecter Processor

Is your feature request related to a problem? Please describe.

In Logstash there is the dissect filter plugin, as described here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-dissect.html
I want to have the same functionality in Logprep.

Describe the solution you'd like
The dissection should be implemented with pure builtin python string manipulation and without the use of regex to keep the process fast.

Describe alternatives you've considered

Additional context

all processors should support specific and generic rules

Is your feature request related to a problem? Please describe.

  • not all processors use generic and specific rules
  • processors do not implement a common interface

Describe the solution you'd like

  • all processors should implement a common interface
  • all processors should implement the possibility to configure generic and specific rules
  • all processors should implement a rule tree

Describe alternatives you've considered

  • none

Additional context

  • It is hard to add further behaviour to all processors because they do not share a common interface.
