opensearch-project / data-prepper Goto Github PK

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.

Home Page: https://opensearch.org/docs/latest/clients/data-prepper/index/

License: Apache License 2.0

Java 99.48% Scilab 0.01% Shell 0.19% Dockerfile 0.01% ANTLR 0.07% Python 0.01% JavaScript 0.03% TypeScript 0.20%

ingestion logs observability opensearch traces java analytics metrics

data-prepper's People

Contributors

Stargazers

Watchers

Forkers

chenqi0805 wrijeff vachashah changsongyang stjordanis graytaylor0 ahopp laneholloway dlvenable sandhose vilisseranen jianghancn asifsmohammed sshivanii cmanning09 anjanaasok kadekm etienne-carriere stockholmux sbayer55 ryanbogan dapowers87 peterzhuamazon subbusquareshift alexrogalskiy franklinharry mohitsaxenaknoldus peternied jnschbrt risdenk frankfanslc der-ofenmeister kmssap sharp-pixel izhcong1997 mch2 joshuali925 isabella232 eric-sherrill dinujoh engechas finnroblin sid-nrn7 cwperks ylwu-amzn iquirino svana oeyh macohen yang-db flyingliuhub naarcha-aws bryan-aguilar honhimw sagar-rout deepdatta bionit derek-ho carolxob kkondaka sternadsoftware rafael-gumiero shishir84 lphanikumar arunx2 namansar norisk8787 dblock udaych20 umairdeqode erasmodominguezdc tashish4u deepaksahu562 krishnanandsingh mahesh724 ammoraiahk svok jannikbrand ashoktelukuntla ajeeshakd mann-fens cumpsty venkataraopasyavula prashantkolhe121 shellu111 sabirshaik786 jirachai mikey62 kline101 dil-akovacs roshan-dongre tobihille umairofficial manvendra84 qpc-github livekn eldius umayr-codes daixba tanqiuliu

data-prepper's Issues

Accept data from the Elasticsearch Output Plugins

Create a Grok processor.

Checkstyle errors seem to be ignored

Describe the bug
Noticed checkstyle errors ignored in https://github.com/opensearch-project/data-prepper/runs/2442523964#step:7:422, not sure it's on purpose.

Error: eckstyle] [ERROR] /home/runner/work/data-prepper/data-prepper/data-prepper-plugins/opensearch/src/main/java/com/amazon/dataprepper/plugins/sink/opensearch/IndexStateManagement.java:90: Line is longer than 140 characters (found 143). [LineLength]

Support correct HTTP methods on core API endpoints

Is your feature request related to a problem? Please describe.

All HTTP methods work on all the core APIs (shutdown, list, metrics).

Describe the solution you'd like

Only the following HTTP combinations should be supported:

Path	Method
shutdown	POST
list	GET
list	POST
metrics/sys	GET
metrics/sys	POST
metrics/prometheus	GET
metrics/prometheus	POST

Update dependencies from ODFE Data Prepper

Is your feature request related to a problem? Please describe.

We updated ODFE Data Prepper dependencies with known CVEs. This task is to apply that work into OpenSearch Data Prepper.

Describe the solution you'd like

Update any dependencies which were not updated by dependabot or other changes.

See opendistro-for-elasticsearch/data-prepper#877

Add throttling handler into http source

Is your feature request related to a problem? Please describe.
Currently http source plugin does not limit the task queue size in handling new requests, this will lead to memory overflow in high load.

Describe the solution you'd like
The source plugin needs to return 429 (Too many requests) once the blockingTaskExecutor queue size reaches max_pending_requests.

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Enable TLS/SSL for http source plugin

Is your feature request related to a problem? Please describe.
The http source only supports insecure http connection. We need to enable https

Describe the solution you'd like
As an initial open source release, we will support local certificate file path.

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Dataprepper Amazon Elasticsearch Service / OpenSearch Sink:

We are building support for customers to ingest and process log data through Data Prepper. We have identified a need for a plugin to output data from Data Prepper to Amazon Elasticsearch Service / OpenSearch (AES/OS). This task tracks the plan for a new Sink plugin to help output logging data from Data Prepper to OpenSearch.

Grok Prepper Basic Matching

This is a subtask of the issue for a grok processor: #256.

Basic pattern matching and capture functionality will be added to Grok Prepper using the existing java grok library: https://github.com/thekrakken/java-grok

Basic log HTTP source plugin

We will add a log-http-source Source plugin. This issue tracks the following features

Boiler plate plugin code
Add basic http server
Add the http service that parse the incoming http request data and push to buffer

OpenSearch Bulk API Source

Summary

This creates a new Data Prepper source which accepts data in the form of the OpenSearch Bulk API.

Configuration

source:
  opensearch_api:
    port: 9200
    path_prefix: opensearch/

Operations

The _bulk API supports:

index
create
update
delete

This source can do something similar to what the dynamodb source does. Specifically it should include the opensearch_action metadata.

Sample

POST opensearch/_bulk
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }

The above request is the simplest case since it is an index request.

It creates an Event with data such as:

{ "_id": "tt1979320" "title": "Rush", "year": 2013 }

Additionally, the event will need metadata that we can use in the opensearch sink.

opensearch_action: "index"
opensearch_index: "movies"
opensearch_id: "tt1979320"

Query parameters

The _bulk API supports a few query parameters. The source should also support most of these and provide some of them as metadata.

pipeline -> Sets metadata: opensearch_pipeline
routing -> Sets metadata: opensearch_routing
timeout -> Configures an alternate timeout for the request in the source. This probably doesn't need to be provided downstream.

Some other parameters that we may wish to support:

refresh
require_alias
wait_for_active_shards

Finally, we should not support these parameters as they are being deprecated.

type

Response

Being able to provide the _bulk API response may be more challenging. There are a few reasons:

Unless end-to-end acknowledgments are enabled, we won't have any knowledge of the writes.
Even when acknowledgments are enabled all the metadata needed in a typical response is still not available.

An initial version could provide responses that either have empty values (where appropriate) or use synthetic values.

Centralize SSL configuration and certificate provider factory

Is your feature request related to a problem? Please describe.
The TLS/SSL config has not been unified across plugins (otel-trace-source, http, peerforwarder). It should be unifiable so that all plugins could reuse the same config and CertificateProviderFactory.

Describe the solution you'd like
We need to sort out all necessary TLS/SSL config parameters and document them in a separate TLS/SSL module

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Plugin Class Refactoring

Is your feature request related to a problem? Please describe.

The current plugin classes are static utilities and inextensible. To support additional new sources of plugins, as in #321, will require either changing these static classes or creating a more flexible class structure.

Describe the solution you'd like

The goal of this design is to replace the following classes:

PluginFactory (note, this is being replaced by a new class with the same name)
PluginRepository
SourceFactory
BufferFactory
PrepperFactory
SinkFactory

The following two diagrams outline the proposed class design for Data Prepper plugins. These diagrams show Java packages, as well as the module in which those packages exist. Packages within the same module share the same color.

The com.amazon.dataprepper.plugin package is split into two different diagrams. The first diagram focuses on what it exposes to other packages and modules.

The following diagram outlines the internal details of the com.amazon.dataprepper.plugin package.

PluginFactory - Implementations of this class provide the ability to create new plugin instances. This interface exists in data-prepper-api so that custom plugins can use this class without depending on data-prepper-core.
DefaultPluginFactory - This design anticipates that we will only need one implementation of PluginFactory. This is that single implementation.
PluginProvider - An interface for finding plugin classes.
ClasspathPluginProvider - An implementation of PluginFactory which locates plugins in the classpath only.
RepositoryPluginProvider - An implementation of PluginFactory which locates plugins from a remote repository. This is not in scope for the initial work.

Additionally, this proposal changes to the @DataPrepperPlugin class:

@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE})
public @interface DataPrepperPlugin {
    /**
     *
     * @return Name of the plugin which should be unique for the type
     */
    String name();

    /**
     * @deprecated Remove in favor of {@link DataPrepperPlugin#pluginType()}
     * @return The plugin type enum
     */
    @Deprecated
    PluginType type();

    /**
     * The class type for this plugin.
     *
     * @return The Java class
     */
    Class<?> pluginType();
}

Additional Context

Existing classes will be deprecated and removed in a future major release of Data Prepper.

Accept OTel log data from the OTell Collector

Support ingest of OTel log data.

Specification: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/overview.md

All TLS/SSL certificate provider should accommodate encrypted key

Is your feature request related to a problem? Please describe.
Currently we have the following certificate providers on source plugin (server) and peer forwarder (client)

local file
s3
ACM
where only ACM supports encrypted key file with passphrase.

Describe the solution you'd like
We should support encrypted key with passphrase for all certificate provider since key file encryption standard is generic.

Approach:
Since Armeria server provides the API to configure TLS with encrypted key and passphrase, we only need to modify the Certificate model to accommodate encrypted key and passphrase.

Describe alternatives you've considered (Optional)
Alternatively, we could refactor the decrypt method existing in ACM provider into a common SSL utility method for reusage. This will work with servers that happen not to support encrypted private key. We do not have such use case so far.

Additional context
Add any other context or screenshots about the feature request here.

[RFC] OpenSearch Sink Updates Proposal

What kind of business use case are you trying to solve? What are your requirements?

A roadmap has been released to enable Data Prepper to ingest logs from telemetry data collection agents (such as FluentBit and Telemetry Collector) into OpenSearch. In accordance to this roadmap, we are proposing new updates on current OpenSearch Sink implementation to support following use cases.

Use Cases:

As a user, I want to be able to ingest log data in batches from Data Prepper into OpenSearch.
As a developer, I want to refactor OpenSearch Sink plugin to make it easily extensible to support more index types.

What is the problem? What is preventing you from meeting the requirements?

OpenSearch Sink doesn’t officially support ingestion of log data.
OpenSearch Sink’s source code needs to be refactored to make it easily extensible to support more index types. The existing code has some anti-pattern implementations. For example, in the IndexStateManagement class, if-else statements are used to switch between different index types.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

We are going to make the following two major updates on existing OpenSearch Sink implementation.

Refactoring for Better Extensibility

Create an IndexType enum list which maintains the complete list of supported index types.

public enum IndexType {
    TRACE_ANALYTICS_RAW("trace-analytics-raw"),
    TRACE_ANALYTICS_SERVICE_MAP("trace-analytics-service-map"),
    CUSTOM("custom"); //will be used to handle generic event ingestion, inlcuding log.
    
    private String indexTypeCode; // which will be passed in through the new index_type parameter in configuration.
}

Create IndexManager interface, which is implemented by subclasses corresponding to index types: TraceAnalyticsRawIndexManager, TraceAnalyticsServiceMapIndexManager, DefaultIndexManager.

interface IndexManager {
    
    boolean checkISMEnabled();
    Optional<String> checkAndCreatePolicy();
    void checkAndCreateIndexTemplate();
    void checkAndCreateIndex();

}

Delete IndexStateManagement class. The implementations in IndexStateManagement class will be moved to relevant IndexManager sub-classes. For example, implementations for Raw Trace Analytics in IndexStateManagement will be moved to TraceAnalyticsRawIndexManager.
DefaultIndexManager will be used to support general data ingestion, including log ingestion, into OpenSearch.
Move OpenSearchSink::checkAndCreateIndex() to corresponding IndexManager sub-classes because its implementation is specific to index type.
Move OpenSearchSink:: checkAndCreateIndexTemplate() to corresponding IndexManager sub-classes because its implementation is related to index type.
Add a new IndexManagerFactory which produces an IndexManager instance corresponding to the IndexType input.

public class IndexManagerFactory {

    Map<IndexType, IndexManager> indexManagers = new HashMap<>();

    public IndexManagerFactory() {
        indexManagers.put(IndexType.CUSTOM, DefaultIndexManager.getInstance());
        indexManagers.put(IndexType.TRACE_RAW, TraceRawIndexManager.getInstance());
        indexManagers.put(IndexType.TRACE_SERVICE_MAP, TraceRawIndexManager.getInstance());
        indexManagers.put(IndexType.LOG_SERVICE_MAP, LogServiceMapIndexManager.getInstance());
        //...
    }

    public IndexManager getIndexManager(IndexType indexType) {
        return indexManagers.get(indexType);
    }

}

OpenSearchSink class will be independent of any specific index types. It will depend on IndexManagerFactory to get specific IndexManager instance to perform index related operations.

Support of New Parameters in Configuration

Parameter	Required	Description	Additional Comments
index_type	No	Default: custom One string value from the index types that Data Prepper supports: trace-analytics-raw, trace-analytics-service-map, custom This parameter is used only when "trace_analytics_raw" and "trace_analytics_service_map" are not set.	We consider two parameters, "trace_analytics_raw" and "trace_analytics_service_map", as deprecated. If customers are using "trace_analytics_raw" and "trace_analytics_service_map", we will show a warning on console notifying customers that they are using a parameter which will be removed and that they should use index_type instead. We will remove the two deprecated parameters in the next major version release of Data Prepper, likely in 2022.
timeout	No	Default: 60 Set the timeout, in seconds, for network operations and requests sent OpenSearch. If a timeout occurs, the request will be retried.
ism_policy_file	No	This specifize a path to an index state machine policy file. If not provided, we will use a default one. This is only effective when index state management is enabled on OpenSearch.
number_of_shards	No	Default: 1 Number of shards for the index	if the parameter is set in configuration, the IndexConfiguration class will use the set value to override default.
number_of_replicas	No	Default: 1 Number of replicas for the index	if the parameter is set in configuration, the IndexConfiguration class will use the set value to override default.

Example of the OpenSearch Sink Configuration

sink:
  - opensearch:
        hosts: ["https://search-host1.example.com:9200", "https://search-host2.example.com:9200"]
        cert: "config/root-ca.pem"
        username: "ta-user"
        password: "ta-password"
        index_type: "custom" 
        index: "my-service-application-log"
        template_file: "/path/to/index_template"
        ism_policy_file: "/path/to/ism_policy"

What are your assumptions or prerequisites?

This aligns with the most recent blog post

What are remaining open questions?

We are considering supporting ingestion to OpenSearch hosts through a reversed proxy.
We are considering supporting date and time pattern in index names.

Support Proxies

Is your feature request related to a problem? Please describe.
We, like thousands of other organisations, use proxies to filter outbound internet traffic. Without proxy support, data-prepper is unusable in our environment.

Describe the solution you'd like
Ideally, I could use -Dhttp.proxyHost and other Java proxy properties on the command line like most Java applications. Failing that, support via configuration (e.g., data-prepper-config.yaml) would be OK.

Describe alternatives you've considered (Optional)
There are no alternatives for our environment. I'm actually really surprised Amazon is recommending this tool without this basic feature. If this support exists, documentation for using it is nonexistant.

Add backwards compatibility tests for automation

Add backwards compatibility tests for the plugin to automate end-to-end upgrade paths for faster releases. Developer documentation for implementing bwc tests and hooking them to CI: https://github.com/opensearch-project/opensearch-plugins/blob/main/TESTING.md#backwards-compatibility-testing

Configuration Consistency

Is your feature request related to a problem? Please describe.

Data Prepper configuration names inconsistently use snake_case or lowerCamelCase. Configuring Data Prepper is clearer when the naming convention is consistent.

Describe the solution you'd like

Most importantly Data Prepper should have a consistent naming convention for configuration names. The standard we are proposing is snake_case. This convention is used by most of the existing properties. Thus, users will have fewer properties to reconfigure.

Data Prepper will need to:

Have documentation clearly stating the standard.
Provide a migration for names which are not in compliance. The first part of this migration is adding snake_case properties for each camelCase property. The existing camelCase properties will remain as deprecated properties. The next major release of Data Prepper will remove the cameCase versions.

Tasks

Secure OTel gRPC API with HTTP Basic security

Is your feature request related to a problem? Please describe.

Data Prepper's OTel endpoint is currently unsecured. Thus, users must either add a proxy on the host, or leave them open to network access.

Describe the solution you'd like

Support HTTP Basic Authentication on the OTel endpoint. Additionally, we will turn this on by default with a preconfigured username and password. Users will be able to turn off the HTTP security.

This is similar to #312, but covers the OTel endpoint specifically.

[RFC] Plugin Redesign

This RFC proposes a new approach for supporting plugins with Data Prepper. Its main goals are to promote modularity and allow for decentralized plugins.

What is the problem? What is preventing you from meeting the requirements?
Data Prepper plugins should be split out so that they do not need to be part of every installation and Data Prepper. Additionally, other teams or individuals should be able to easily create their own plugins for Data Prepper. These remote plugins would be development and deployed independently of Data Prepper.

Currently, Data Prepper requires that all the plugins be embedded within the jar and in the current classpath.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

Data Prepper will support loading plugins from two different sources:

The Java classpath of Data Prepper
External plugins from remote repositories

This diagram outlines the division between the two types of plugins.

The Java classpath will be used for two scenarios:

Any core plugin which will always be included within Data Prepper
Custom Data Prepper distributions. Users of Data Prepper may be running Data Prepper without internet access (e.g. running in an enclosed network), and should be able to install custom versions of Data Prepper with all the plugins they need.

Loading plugins from remote repositories will be used for plugins which are not currently installed in Data Prepper. This proposal uses Maven Central as the mechanism for distribution of these plugins. Additionally, plugins will load within their own dedicated class loader to provide isolation between plugins. A future RFC or proposal will detail this approach.

This diagram shows the concept for supporting remote plugins which are external to Data Prepper.

Remote plugins will be downloaded into their own plugins directory within the Data Prepper directory structure.

The following outlines a possible directory structure. In this approach, each plugin has its own uber-jar file with all of its dependencies include. It may also be worth considering expanding a plugin uber jar as an alternative.

data-prepper-$VERSION/
  bin/
    data-prepper                    # Shell script to run Data Prepper on Linux/macOS
  plugins/
    my-plugin-a.jar
    my-plugin-b.jar
  logs/                             # Directory for log output
  LICENSE
  NOTICE
  README.md

What are your assumptions or prerequisites?

This work depends on the new directory structure being added as part of Directory Structure for Data Prepper.

Tasks

#1543
Support loading plugins from a remote repository

[RFC] Log HTTP source Plugin

What kind of business use case are you trying to solve? What are your requirements?

When ingesting logs into Opensearch for log analytics, users would like to use FluentBit as client application log data collector and expects Data-Prepper to receive log data from the FluentBit HTTP output, to transform and export data to Opensearch and Amazon Elasticsearch Service. Therefore we would like to present customers with a reliable FluentBit-DataPrepper-Opensearch pipeline that supports essential configuration and features for log data ingestion.

What is the problem? What is preventing you from meeting the requirements?

Log HTTP source plugin is the receiver component of the [Data-Prepper log analytics pipeline](TODO: link for log ingestion RFC) workflow that communicates with FluentBit or other HTTP client and pass its received data downstream for further processing.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

For data-prepper to receive log requests from FluentBit. We will implement a log HTTP source plugin that satisfies the following functional requirements:

The plugin shall be capable of receiving request from FluentBit HTTP output plugin
The plugin shall support json format from FluentBit HTTP output plugin.
The plugin shall be configurable via YAML, similar as other existing source plugin in data-prepper
The plugin shall push data to buffer in a unified format/model that
- facilitates the downstream preppers to do data-processing and transformation
- allows sinks to post to backend (Happy path).
The plugin shall support host verification (TLS/SSL)

and performance requirements:

The source plugin should manage thread pool counts, max connection counts, etc.
The source plugin should throttle gracefully.

The http source plugin will include the following configuration parameters:

port [int] (Optional) - The port number the source plugin is listening on, defaults to 2021.
threadCount [int] (Optional) - The number of threads of http request executor thread pool. Default to 200 maximum threads.
maxConnectionCount [int] (Optional) - The maximum allowed number of open connections. Default to 256.
maxPendingRequests [int] (Optional) - Maximum number of incoming requests to store in a temporary task queue to be processed by worker threads. If a request arrives and the queue is full a 429 response will be returned immediately. Default to 1024
ssl [bool] (Optional) - A boolean enables TLS/SSL security verification. Default to false
- certFilePath [String] (Optional): required when ssl flag is true.
- privateKeyFilePath [String] (Optional): required when ssl flag is true.
- privateKeyPassword [String] (Optional): required when ssl flag is true.

What are your assumptions or prerequisites?

The log data received from each http request is complete, i.e. for a multiline log, it is assumed to appear in a single request body instead of scattering around multiple requests. The assumption is based on FluentBit support for multiline filtering (https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace).
For initial implementation, we will only deal with json content type. More codecs can be supported for enhancement later on.

What are remaining open questions?

Batched log requests are submitted to the log http source's buffer w/o unwrapping the contents. Either has its pros and cons. We will address the data model, buffering behavior enhancement in a separate issue.

Remove support for ODFE < 1.13.0

We will no longer support ODFE < 1.13.0 in OpenSearchSink. This is expected to happen after release 1.0.0.0-beta1.

Support metrics ingestion

Currently I can't ingest metrics from my nodejs application using the OpenTelemetry node sdk, data-prepper does not support it.
Will be awesome if data-prepper could process the metrics collected by the OpenTelemetry collector so they can be ingested by OpenSearch.
An issue for this exists on the OpenDistro repo opendistro-for-elasticsearch/data-prepper#669

Ensure DCO Workflow Check

Coming from opensearch-project/project-meta#17

A Developer Certificate of Origin is required on commits in the OpenSearch-Project.

See doc.yml for an example workflow. Ensure CONTRIBUTING.md to has a section on the DCO per the project template.

DCO Check Workflow
CONTRIBUTING.md DCO Section

Secure core APIs with HTTP Basic security

Is your feature request related to a problem? Please describe.

Data Prepper provides a few core APIs such as shutdown, list, metrics/.... Currently, users of Data Prepper must currently secure them either through network security or a proxy on the host.

Describe the solution you'd like

Add support for HTTP Basic authentication on these endpoints. Additionally, this authentication mechanism will be turned on by default with a predefined username and password. Turning it on should encourage users to set their own username and password. We will support turning off HTTP Basic authentication entirely.

Accept data from the Logstash OpenSearch Output Plugin.

Write a source that accepts data from the Logstash OpenSearch output plugin.

[RFC] Directory Structure for Data Prepper

This RFC introduces a change from distributing Data Prepper as an uber-jar into a bundled directory structure. This approach is similar to how OpenSearch is distributed.

What is the problem? What is preventing you from meeting the requirements?

Data Prepper distributes its code in a single uber-jar. We are planning work to support extending Data Prepper with custom plugins. This requires that we have a location for loading additional jar files which are not part of the uber-jar. Additionally, those plugins will be decoupled from data-prepper-core, which means we will have multiple jar files.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

We propose distributing Data Prepper as a bundle which must be extracted into a directory structure. Data Prepper will now include a shell script for starting Java with the required classpath.

User Experience

Users can deploy Data Prepper using the following options:

Run the Docker container
Install Bundled Distribution
Build from source

Docker

Users deploying using Docker currently run the following commands.

docker pull opensearchproject/data-prepper:latest
docker run --name data-prepper --expose 21890 -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml opensearchproject/data-prepper:latest

This proposal will slightly alter the process by changing linked files.

Original Destination	Updated Destination
/usr/share/data-prepper/pipelines.yaml	/usr/share/data-prepper/pipelines/pipelines.yaml
/usr/share/data-prepper/data-prepper-config.yaml	/usr/share/data-prepper/config/data-prepper-config.yaml

The new commands will be:

docker pull opensearchproject/data-prepper:latest
docker run --name data-prepper --expose 21890 -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines/pipelines.yaml -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml opensearchproject/data-prepper:latest

Install Bundled Distribution

Users who installed our bundled distribution get a single uber-jar. Because this uber-jar has everything bundled, users run Data Prepper with the following command.

java -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml

With the proposed update, users will install Data Prepper by performing steps similar to the following.

cd path/to/parent/directory
wget https://.../path/to/data-prepper-$VERSION.tar.gz
tar -xf data-prepper-$VERSION.tar.gz

Users can then run Data Prepper with the following commands.

cd data-prepper-$VERSION
bin/data-prepper

Proposed Structure

Below is the initial directory structure which this change will introduce.

data-prepper-$VERSION/
  bin/
    data-prepper                    # Shell script to run Data Prepper on Linux/macOS
  config/
    data-prepper-config.yaml        # The Data Prepper configuration file
    log4j.properties                # Logging configuration
  pipelines/                             # New directory for pipelines
    trace-analytics.yaml
    log-ingest.yaml
  lib/
    data-prepper-core.jar
    ... any other jar files
  logs/                             # Directory for log output
  LICENSE
  NOTICE
  README.md

The proposed structure is similar to OpenSearch’s directory structure.

What are your assumptions or prerequisites?

This RFC is limited in scope to the change to deploying with a directory structure. It does not include details for any features which depend on this.

Future features may expand the proposed directory structure. This approach does not attempt to foresee every possible directory.

Users still continue to provide the pipeline configuration file to Data Prepper as a command-line argument.

We expect that users only need to run Data Prepper on Linux. This proposal does not include a Windows script, which would likely be bin/data-prepper.bat. However, this could be included into this approach if requested.

Additional changes

In addition to deploying with a directory structure, this includes other related changes.

Data Prepper can read the config file located at config/data-prepper-config.yaml rather than require it as a command-line argument.
Data Prepper will expose the Log4j logging configuration file. This can be a clearer approach than requiring users to override a Java property.
Data Prepper will write logs to the logs/ directory.

Tasks

Rename Prepper to Processor

Is your feature request related to a problem? Please describe.

Data Prepper currently supports prepper plugins. This term is not entirely clear. Additionally, it is somewhat ambiguous with Data Prepper.

Describe the solution you'd like

Rename prepper to processor.

This can be done in a phased approach. First, we will add processor as a supported name within the pipeline YAML configuration. We will continue to support either processor or prepper in the YAML. However, only one may be present per pipeline.

In a major release version of Data Prepper we will remove prepper so that only processor is allowed.

Additionally, we will add a new interface Processor with the same signature as Prepper. The Prepper interface will be made to inherit from Processor and will be deprecated. In a major release of Data Prepper, we will remove the Prepper interface.

Grok Prepper Configuration and Boilerplate

This is a subtask of the issue for a grok processor: https://github.com/opensearch-project/data-prepper/issues/256.

Grok prepper field names and default values need to be set up, as well as the reading of the configuration and boilerplate for implementing AbstractPrepper.

Common docs for Data-Prepper/Elasticsearch/OpenSearch/opentelemetry version compatibility reference

Is your feature request related to a problem? Please describe.
It would be nice to have a doc that tracks the above compatibility for each release of data-prepper.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Add a GeoIP processor.

Provide a new processor which can enrich Data Prepper events with location information using a provided IP address.

The minimal configuration is to provide a source_key with the JSON Pointer key path.

processor:
  - geoip:
      source_key: "peer/ip"

Additionally, this plugin should be able to use either a MaxMind GeoIP Lite2 database or the GeoIP2 Commercial Licensing database. The Data Prepper author must provide information for configuring the commercial license.

The pipeline author can also specify an optional target_key property to specify where the location fields are written. By default, this will be the root of the event.

Example 1 - Minimal Configuration

processor:
  - geoip:
      source_key: "peer/ip"

Input Event:

"peer" : {
  "ip" : "1.2.3.4"
  "host" : "example.org"
}
"status" : "success"

Output Event:

"peer" : {
  "ip" : "1.2.3.4"
  "host" : "example.org"
}
"status" : "success"
"country" : "United States"
"city_name" : "Seattle"
"latitude" : 47.64097
"longitude" : 122.25894
"zip_code" : "98115"

Example 2 - Target Key

processor:
  - geoip:
      source_key: "peer/ip"
      target_key: "location"

Input Event:

"peer" : {
  "ip" : "1.2.3.4"
  "host" : "example.org"
}
"status" : "success"

Output Event:

"peer" : {
  "ip" : "1.2.3.4"
  "host" : "example.org"
}
"location" : {
  "status" : "success"
  "country" : "United States"
  "city_name" : "Seattle"
  "latitude" : "47.64097"
  "longitude" : "122.25894"
  "zip_code" : "98115"
}

#3941
#3942

Define the log output naming conventions that will be used by Observability dashboard.

We need to define the naming conventions that we will use to send data to the indices for log analysis in the observability plugin.

[RFC] Internal Model Proposal

What kind of business use case are you trying to solve? What are your requirements?

Existing preppers consume and emit serialized JSON strings. This wastes CPU cycles when chaining preppers due to excessive de/serialization. Users of the DP have encountered runtime exceptions due to conflicting data requirements of their preppers. Model definitions are duplicated throughout prepper plugins. (e.g. otel-trace-group-prepper TraceGroup and otel-trace-raw-prepper)

Requirements:

Extendable - the model should scale beyond the existing trace analytics support.
Type Safety - the model should be prescriptive enough to enable type safety checks throughout the pipelines in the future.
Eliminate need to duplicate code between plugins
Allow preppers to operate on internal data in a generic way
Remove excessing serialization

What is the problem? What is preventing you from meeting the requirements?

Currently, data flows through Data Prepper as a Collection<Records>. Records are generic types that allow for any type to flow through. Trace events have been been defined as a Collection of Records as type String. The strings are serialized representations of JSON objects conforming to the OTEL Spec.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

We will deprecate Records and define explicit object models for Traces and Logs. Traces and Logs will implement a new interface called Event. Events will be the new data type flowing through Data Prepper.

Source plugins will be responsible for translating the external requests into Events. Sink plugins will be responsible for transforming the Events into the correct output schema. Preppers will only accept Events as inputs and outputs or subtypes. This will effectively create internal boundaries for our model between sources and sinks.

Event

Events will be managed through public putting, deleting and fetching methods. An additional method for generating a JSON String is included to support the sinks.

/**
* Add or update the key with a the given value in the Event
*/
void put(String key, Object value);

/**
* Retrieves the given key from the Event
*/
<T> T get(String key, Class<T> type);

/**
* Deletes the given key from the Event
*/
void delete(String key);

/**
* Generates a Json representation of the Event
*/
JsonNode toJsonNode();

/**
* Get Metadata
*/
Metadata getMetadata();

EventMetadata

Will be a class with a slight refactor of the RecordMetadata Class. Currently, RecordMetadata maintains a map of attributes and has one required recordType attribute. However, the recordType has been historically ignored. The new model will have required attributes recategorized as POJO fields. The eventType will help preserve the type (i.e log, span), for casting and type validation. The EventMetadata class will still maintain a mapping for custom metadata defined in attributes.

public class EventMetadata {

private String eventType;
private long timeReceived;
private Map<String, Object> attributes;

Span

Span will be a new model to support the traces. It will implement the Event interface and maintain the same attributes as the current RawSpan Object. This will ensure backwards compatibility with our existing preppers.

Phased Approach

This design includes breaking changes and will be broken into two phases. One phase allows us to build support for the new model and onboard log ingestion and trace analytics. The second phase will deprecate the old model and will be a part of the 2.0 release

What are your assumptions or prerequisites?

The design and changes to the pipelines to enforce type safety are out of scope and should addressed in a separate review. However, the output of this design should not hinder but enable type safety enforcement.

This aligns with the proposal for Log Ingestion RFC

What are remaining open questions?

Which library should we use to support the underlying interfaces? (JsonPath or Jackson) JsonPath is a library for reading and updating JSON documents. It natively supports the dot notation. Jackson is fast JSON library for parsing json objects and supports Json Pointers for managing objects. Both libraries will work. Jackson will be ideal to reduce dependencies.

Accept data from Kafka

Data Prepper can receive events from Kafka using a source which acts as a Kafka consumer.

This source should support most or all of the consumer configurations.

It should be able to support deserializing objects directly into Events. With the StringDeserializer it can create an Event with a single field, say message with the string as the value. It could possible have a JSON-based deserializer which maps JSON data directly into the fields of the Event.

Firewall Log Ingest

Is your feature request related to a problem? Please describe.
It would be nice to be able to ingest firewall logs into OpenSearch for Obervability / SIEM related activities.

Describe the solution you'd like
A firewall log source plugin that accepts firewall logs and normalizes them.

Describe alternatives you've considered (Optional)
N/A

Additional context
See this discussion for more detail.

BlockingTaskExecutor is never used in Otel-Trace-Source

Describe the bug
Although configured in ServerBuilder:

data-prepper/data-prepper-plugins/otel-trace-source/src/main/java/com/amazon/dataprepper/plugins/source/oteltrace/OTelTraceSource.java

Line 107 in 1cb44e1

sb.blockingTaskExecutor(

BlockingTaskExecutor is not used by default to execute GRPC service: https://armeria.dev/docs/server-grpc#blocking-service-implementation
This leads to blocking tasks executed directly on main EventLoop thread that blocks event handling in netty framework. Also, as a consequence, threads parameter never actually works in the source plugin.

To Reproduce
Do not have way to reproduce in data-prepper context. But it is expected to severely delay the new request handling.

Expected behavior
We should useBlockingExecutor in grpcServiceBuilder:

final GrpcServiceBuilder grpcServiceBuilder = GrpcService
                    .builder()
                    // explicitly use blocking task executor
                    .useBlockingTaskExecutor(true)
                    .addService(new OTelTraceGrpcService(
                            oTelTraceSourceConfig.getRequestTimeoutInMillis(),
                            buffer,
                            pluginMetrics
                    ))
                    .useClientTimeoutHeader(false);

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

OS: [e.g. Ubuntu 20.04 LTS]
Version [e.g. 22]

Additional context
Add any other context about the problem here.

Grok Prepper Match Timeout

This is a subtask of the issue for a grok processor: #256.

The grok prepper needs a configurable way to specify how long the prepper should attempt to match patterns against a log before it should move on to the next one. This will allow users to improve performance by not spending too much time matching an individual log.
The user should also be able to set the timeout to be disabled, which will be done when the timeout_millis value is set to 0.

Better input validation.

Describe the bug
Basic input validation is done on String inputs but correct formats are not checked for what they should be. For example, ARNs, S3 paths, etc. are not checked if they are a valid format or not until the time of use.

To Reproduce
Steps to reproduce the behavior:
Create a YAML configuration file with an incorrectly formatted ARN and run the code.

Expected behavior
When Data Prepper parses the file, it should throw errors and prevent Data Prepper startup due to misconfigurations.

Proxy support for AWS SDKs

Is your feature request related to a problem? Please describe.

Data Prepper running behind proxies is not currently supported.

Describe the solution you'd like

Follow the same approach as we are planning #300. Each AWS configuration can include a new proxy configuration.

For example, within peer-forwarder:

  prepper:
    - peer_forwarder:
        discovery_mode: aws_cloud_map
        proxy: http://my-proxy:9000

Describe alternatives you've considered (Optional)

We may wish to have a default AWS configuration available for users to configure. This could be used for any plugin which uses AWS. However, this would be a much larger change.

AWS Components Needing Updating

peer-forwarder Cloud Map (AWS SDK v2)
opensearch (AWS SDK v2)
CloudWatchMeterRegistryProvider (AWS SDK v2)
ACMCertificateProvider (multiple plugins; AWS SDK v1)
S3CertificateProvider (multiple plugins; AWS SDK v1)

Cannot build the project/jar - Data-prepper-plugins

Describe the bug
A clear and concise description of what the bug is.
Gradlew build command is failing to the output jar. Looks like the issue is with the opensearch project inside data-prepper-plugins project. Following is the error:
FAILURE: Build failed with an exception.

What went wrong:
A problem occurred configuring project ': opensearch project inside - data-prepper-plugins:opensearch'

Could not resolve all artifacts for configuration ':data-prepper-plugins:opensearch:classpath'.
Could not find org.opensearch.gradle:build-tools:1.0.0-alpha2.
Searched in the following locations:
- https://plugins.gradle.org/m2/org/opensearch/gradle/build-tools/1.0.0-alpha2/build-tools-1.0.0-alpha2.pom
- file:/home/nair/.m2/repository/org/opensearch/gradle/build-tools/1.0.0-alpha2/build-tools-1.0.0-alpha2.pom
Required by:
project :data-prepper-plugins:opensearch

To Reproduce
Steps to reproduce the behavior:

Clone the data-prepper repository from github
run - ./gradlew build
Error message:

Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could not be reused, use --status for details

FAILURE: Build failed with an exception.

What went wrong:
A problem occurred configuring project ':data-prepper-plugins:opensearch'.

Could not resolve all artifacts for configuration ':data-prepper-plugins:opensearch:classpath'.
Could not find org.opensearch.gradle:build-tools:1.0.0-alpha2.
Searched in the following locations:
- https://plugins.gradle.org/m2/org/opensearch/gradle/build-tools/1.0.0-alpha2/build-tools-1.0.0-alpha2.pom
- file:/home/nair/.m2/repository/org/opensearch/gradle/build-tools/1.0.0-alpha2/build-tools-1.0.0-alpha2.pom
Required by:
project :data-prepper-plugins:opensearch

Expected behavior
Jar file is build for the data-prepper-plugins project.

Screenshots
Error messages attached above

Environment (please complete the following information):

OS: Ubuntu 20.04 LTS
Version v1.0.0

Additional context
Following comment is included in the build.gradle file.
// TODO: replace local built OpenSearch artifact with the public artifact

Support more TLS/SSL certificate providers in http source

Is your feature request related to a problem? Please describe.
Currently only local file certificate provider is supported.

Describe the solution you'd like
We will expand support to s3 and ACM when necessary

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Accept data from FluentBit.

Accept data from FluentBit -- probably need to create an HTTP input source.

Receive log data from S3 as a Source

Use-Case

Many users have external systems which write their logs to Amazon S3. These users want to use OpenSearch to analyze these logs. Data Prepper is an ingestion tool which can aid teams in extracting these logs for S3 and sending them to OpenSearch or elsewhere.

This proposal is to receive events from S3 notifications, read the object from S3, and create log lines for these.

Basic Configuration

This plugin will be a single source plugin which:

Polls a configured SQS standard queue which should hold S3 Event messages.
Reads S3 objects which the message indicates as created.
Uses a configured codec to parse the S3 object into Log Events.
Writes the Log Events into the Data Prepper buffer.

The following example shows what a basic configure would look like.

source:
  s3:
    notification_type: sqs
    sqs:
      queue_url: "https://sqs.us-east-2.amazonaws.com/123456789012/MyS3EventQueue"
    codec:
      single-line:
  processor:
    grok:
      match:
        message:  [ "%{COMMONAPACHELOG}" ]

Detailed Process

The S3 Source will start a new thread for reading from S3. (The number of threads can be configured).

This thread will perform the following steps repeatedly until shutdown

Use the SQS ReceiveMessage API to receive messages from SQS.
For each Message from SQS, it will:
a. Parse the Message as an S3Event.
b. Download the S3 Object which the S3Event indicates was created.
c. Decompress the object if configured to do so.
d. Parse the decompressed file using the configured codec into a list of Log Event objects.
e. Write the Log objects into the Data Prepper buffer.
Perform a DeleteMessageBatch with all of the messages which were successfully processed.
Repeat

Error Handling

The S3 Source will suppress exceptions which occur during processing. Any Message which is not processed correctly will not be included in the DeleteMessageBatch request. Thus, the message will appear in the SQS again. Data Prepper expects that the SQS queue is correctly configured with a DLQ or MessageRetentionPeriod to prevent the SQS queue from filling up with invalid messages.

Codecs

The S3 Source will use configurable codecs to support multiple data formats in the S3 objects. Initially, two codecs are planned:

single-line - This is used for logs which should be separated by a newline.
json - A codec for parsing JSON logs

Single Line

The single-line codec has no configuration items.

Below is an example S3 object.

POST /search
POST /index
PUT /document/12345

With single-line, the S3 source will produce 3 Events, each with the following structure.

"bucket" : "my-bucket",
"key" : "application1/instance200/2022-05-11.log",
"message" : "POST /search"

"bucket" : "my-bucket",
"key" : "application1/instance200/2022-05-11.log",
"message" : "POST /index"

"bucket" : "my-bucket",
"key" : "application1/instance200/2022-05-11.log",
"message" : "PUT /document/12345"

JSON

The json codec supports reading a JSON file and will create Events for each JSON object in an array. This S3 plugin is starting with the expectation that the incoming JSON is formed as a large JSON array of JSON objects. Each JSON object in that array is an Event. Thus, this codec will find the first JSON array in the JSON. It will output the objects within that array as Events from the JSON.

Future iterations of this plugin could allow for more customization. One possibility is to use JSON Pointer. However, the first iteration should meet many use-cases and allows for streaming the JSON to support parsing large JSON objects.

Below is an example configuration. This configures the S3 Sink to read a JSON array from the items key.

s3:
  codec:
    json:

Given the following S3 Object:

{
  "http_requests" : [
    { "status" : 200, "path" : "/search", "method" : "POST" },
    { "status" : 200, "path" : "/index", "method" : "POST" },
    { "status" : 200, "path" : "/document/12345", "method" : "PUT" }
  ]
}

The S3 source will output 3 Log events:

"bucket" : "my-bucket",
"key" : "application1/instance200/2022-05-11.json",
"message" : { "status" : 200, "path" : "/index", "method" : "POST" }

"bucket" : "my-bucket",
"key" : "application1/instance200/2022-05-11.json",
"message" : { "status" : 200, "path" : "/search", "method" : "POST" }

"bucket" : "my-bucket",
"key" : "application1/instance200/2022-05-11.json",
"message" : { "status" : 200, "path" : "/document/12345", "method" : "PUT" }

Compression

The S3 Source will support three configurations for compression.

none - The object will be treated as uncompressed.
gzip - The object will be decompressed using the gzip decompression algorithm
automatic - The S3 Source will example the object key to guess if it is compressed or not. If the key ends with .gz the S3 Source will attempt to decompress it using gzip. It can support other heuristics to determine if the file is compressed in future iterations.

Full Configuration Options

Option	Type	Required	Description
notification_type	Enum: `sqs`	Yes	Only SQS is supported. SNS may be a future option
compression	Enum: `none`, `gzip`, `automatic`	No	Default is `none`
codec	Codec	Yes	See Codecs section above.
sqs.queue_url	String - URL	Yes	The queue URL of the SQS queue.
sqs.maximum_messages	Integer	No	Directly related to SQS input. Default is 10.
sqs.visibility_timeout	Duration	No	Directly related to SQS input. Default is TBD.
sqs.wait_time	Duration	No	Directly related to SQS input. Default is TBD.
sqs.poll_delay	Duration	No	An optional delay between iterations of the process. Default is 0 seconds.
sqs.thread_count	Integer	No	Number of threads polling S3. Default is 1.
region	String	Yes	The AWS Region. TBD.
sts_role_arn	String	No	Role used for accessing S3 and SQS
access_key_id	String	No	Static access to S3 and SQS
secret_key_id	String	No	Static access to S3 and SQS
buckets	String List	No	If provided, only read objects from the buckets provided in the list.
account_ids	String List	No	If provided, only read objects from the buckets owned by an accountId in this list.

S3 Events

The S3 Source will parse all SQS Messages according to the S3 Event message structure.

The S3 Source will also parse out any event types which are not s3:ObjectCreated:*. These events will be silently ignored. That is, the S3 Source will remove them from the SQS Queue, and will not create an Events for them.

Additionally, this source will have an optional buckets and account_ids lists. If supplied by the pipeline author, Data Prepper will only read objects for S3 events which are part of that list. For the buckets list, only S3 buckets in the list are used. For the account_ids list, only buckets owned by accounts with matching Ids are used. If this list is not provided, Data Prepper will read from any bucket which is owned by the accountId of the SQS queue. Use of this list is optional.

AWS Permissions Needed

The S3 Source will require the following permissions:

Action	Resource
`s3:GetObject`	The S3 bucket and key path for any object needed
`sqs:ReceiveMessage`	The ARN of the SQS queue specified by `sqs.queue_url`
`sqs:DeleteMessageBatch`	The ARN of the SQS queue specified by `sqs.queue_url`

Possible Future Enhancements

Direct SNS Notification

The notification_type currently only supports SQS. Some teams may want Data Prepper to receive notifications directly from SNS and thus remove the need for an SQS queue.

The notification_type could support an sns value in the future.

Additional Codecs

As needed, Data Prepper can support other codecs. Some possible candidates to consider are:

Multi-line
JSON List

Metrics

messagesReceived (Counter)
messagesDeleted (Counter)
messagesFailed (Counter)
eventsCreated (Counter)
requestsDuration (Timer)

Not Included

This proposal is focused only reading S3 objects starting with a notification. Thus any use-case for replay is not part of this scope. Also, use-cases for reading existing logs are not covered. These use-cases can have their own issue.
Updated S3 objects are not part of the scope. This work will only support use-cases when a log file is written once.
Configuration of SQS queue to receive SNS topics should be done externally. Data Prepper will not manage this.

Tasks

Add and implement writeAll API in Buffer

Is your feature request related to a problem? Please describe.
Current buffer offers write API that only allows write single item atomically. Once we migrated to the internal data model as mentioned in #306, the existing write API will lead to partially ingested request data when called in source plugin.

Describe the solution you'd like
We will provide a new writeAll(Collection items) API that atomically writes collection of raw data type as multiple items into the buffer.

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Rework getting started documentation

The getting started documentation needs to be better written and easier to understand so that getting data prepper up and running is very simple.

Update package naming to be org.opensearch.dataprepper from com.amazon.dataprepper

Is your feature request related to a problem? Please describe.
The project is in the opensearch project and the naming should likewise be adjusted.

Describe the solution you'd like
com.amazon will be replaced with org.opensearch in all files

Correct copyright notices to reflect Copyright OpenSearch Contributors

Coming from opensearch-project/.github#21.

The correct copyright for open-source projects in opensearch-project is "Copyright OpenSearch Contributors". Please correct any places that say otherwise, especially where it says copyright Amazon. Make sure NOTICE.txt and README match. See opensearch-project/.github#24 for an example.

Logstash Template support:

The goal is take an existing Logstash configuration file that we can plug with Data Prepper and perform identical transformations on the data being ingested by the data prepper Log pipeline.

[RFC] Log Ingestion for Data Prepper

This RFC introduces a proposal log ingestion in Data Prepper.

What kind of business use case are you trying to solve? What are your requirements?

Users would like to support processing unstructured log data and storing structured output in OpenSearch.

Use Cases:

As a user, I want to be able to send data from FluentBit to Data Prepper to be processed before ingest into OpenSearch.
As a user, I want to convert unstructured log lines into a structured data format to ingest into OpenSearch.

What is the problem? What is preventing you from meeting the requirements?

OpenSearch does not support the processing of unstructured data prior to saving the data to the index. There are numerous open source projects that support log ingestion however users are looking for an OpenSearch native
solution.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

We will build HTTP/S source and Grok prepper plugins for Data Prepper. The HTTP/S plugin will support handling JSON data. We will build a new Grok prepper plugin to convert unstructured log lines into a structured data. We will build a new internal model to improve existing system performance and extend the OpenSearch sink to support structured log data.
[Image: FluentBit and Grok Plugin-Proposed System Overview.jpg]

HTTP Source Plugin

This plugin will be responsible for receiving requests from a user's FluentBit clients. This plugin will accept JSON formatted data initially with the flexibility to add other data formats in the future.

Internal Model

We will migrate away from using serialized JSON Strings as our internal data structure and define a new internal model. This will improve our system’s performance by removing excessive de/serialization (something we are currently experiencing as part of the trace pipelines) while keeping the current flexibility of our existing design.

Grok Prepper Plugin

The grok filtering functionality will be supported through a new grok prepper plugin. This prepper will process the collection of records from the FluentBit source plugin and filter according to the data prepper grok configuration.

Sink

The Data Prepper sink OpenSearch configuration will be extended to support structured log data as it currently only supports raw trace data and service map data.

What are your assumptions or prerequisites?

We elected to support Fluent Bit and grok because they are widely used features for exporting and transforming log data. Other logs sources (FluentD, Beats, etc.) and processing functionality (dropping data, mutating, etc.) are out of scope at this time. We plan to expand support for other plugins based on feedback from the community.

Detailed designs for each sections will be proposed in separate RFCs.

This aligns with the most recent blog post

opensearch-project / data-prepper Goto Github PK

data-prepper's People

Contributors

Stargazers

Watchers

Forkers

data-prepper's Issues

Summary

Configuration

Operations

Sample

Query parameters

Response

What kind of business use case are you trying to solve? What are your requirements?

What is the problem? What is preventing you from meeting the requirements?

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

Refactoring for Better Extensibility

Support of New Parameters in Configuration

Example of the OpenSearch Sink Configuration

What are your assumptions or prerequisites?

What are remaining open questions?

What kind of business use case are you trying to solve? What are your requirements?

What is the problem? What is preventing you from meeting the requirements?

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

What are your assumptions or prerequisites?

What are remaining open questions?

What is the problem? What is preventing you from meeting the requirements?

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

User Experience

Docker

Install Bundled Distribution

Proposed Structure

What are your assumptions or prerequisites?

Tasks

Example 1 - Minimal Configuration

Example 2 - Target Key

What kind of business use case are you trying to solve? What are your requirements?

What is the problem? What is preventing you from meeting the requirements?

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

Event

EventMetadata

Span

Phased Approach

What are your assumptions or prerequisites?

What are remaining open questions?

Use-Case

Basic Configuration

Detailed Process

Error Handling

Codecs

Single Line

JSON

Compression

Full Configuration Options

S3 Events

AWS Permissions Needed

Possible Future Enhancements

Direct SNS Notification

Additional Codecs

Metrics

Not Included

Tasks

What kind of business use case are you trying to solve? What are your requirements?

What is the problem? What is preventing you from meeting the requirements?

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

HTTP Source Plugin

Internal Model

Grok Prepper Plugin

Sink

What are your assumptions or prerequisites?

Recommend Projects

Recommend Topics

Recommend Org