cogstack / cogstack-nifi

Building data processing pipelines for document processing with NLP using Apache NiFi and related services

Home Page: https://hub.docker.com/r/cogstacksystems/cogstack-nifi/

License: Other

Dockerfile 2.75% Groovy 5.79% Python 46.56% Shell 21.05% Makefile 1.35% Jupyter Notebook 21.85% R 0.20% TSQL 0.46%
apache-nifi nifi elasticsearch kibana data-pipelines nlp rest electronic-health-records data-integration

cogstack-nifi's Introduction

Introduction

This repository proposes a possible next step for the free-text data processing capabilities implemented as CogStack-Pipeline, shaping the solution more towards a Platform-as-a-Service model.

CogStack-NiFi contains example recipes using Apache NiFi as the key data workflow engine, together with a set of services for document processing with NLP. Each component implementing key functionality, such as text extraction or natural language processing, runs as a service, while Apache NiFi handles the data routing between the components and the data sources/sinks. Moreover, NLP services are expected to implement a uniform RESTful API, making it easy to plug any NLP application into existing document processing pipelines.
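As an illustration of the uniform API idea, here is a minimal client-side sketch. The endpoint path and the payload field names (`content`, `text`, `footer`, `result`, `annotations`) are assumptions for illustration, not the project's actual API specification.

```python
# Minimal sketch of a client for a uniform NLP service REST API.
# Endpoint path and field names are illustrative assumptions, not the
# project's actual API specification.

def build_nlp_request(text, footer=None):
    """Build a JSON-serialisable body for a hypothetical POST /api/process."""
    payload = {"content": {"text": text}}
    if footer is not None:
        # Optional pass-through metadata echoed back by the service.
        payload["content"]["footer"] = footer
    return payload

def extract_annotations(response_body):
    """Pull the annotation list out of a hypothetical response body."""
    return response_body.get("result", {}).get("annotations", [])
```

Because every service would speak the same request/response shape, NiFi could route documents to any of them with the same processor configuration.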

Important

Please note that the project is under constant improvement, bringing new features or services that might impact current deployments. This may affect you, the user, when making upgrades, so be sure to check the release notes and the documentation beforehand.

Asking questions

Feel free to ask questions on the GitHub issue tracker or on our Discourse website, which is frequently used by our development team!

Project organisation

The project is organised in the following directories:

  • nifi - custom Docker image of Apache NiFi with configuration files, drivers, example workflows and custom user resources.
  • security - scripts to generate SSL keys and certificates for Apache NiFi and related services (when needed) with other security-related requirements.
  • services - available services with their corresponding configuration files and resources.
  • deploy - an example deployment of Apache NiFi with related services.
  • scripts - helper scripts, including setup tools and samples for ingesting data into Elasticsearch or a database.
  • data - any data that you wish to ingest should be placed here.

Documentation and getting started

Knowledge requirements: Docker usage (mandatory), plus a working understanding of Python and Linux/UNIX.

Official documentation is now available here.

As a good starting point, the deployment guide walks through an example deployment with some workflow examples.

Known issues are tracked in the README; check that section before opening a bug report ticket.

Important news and updates

Please check IMPORTANT_NEWS for any major changes that might affect your deployment and security problems that have been discovered.

cogstack-nifi's People

Contributors

baixiac, kawsarnoor, lrog, sandertan, tomolopolis, vladd-bit


cogstack-nifi's Issues

Suggestions for simplifying Docker Compose

Hi @vladd-bit, in addition to #19 I think there are some more simplifications possible for services.yml that would make it easier to do custom deployments while also making it easy to regularly pull updates from this repository's master branch.

  1. Move env variables to YML files.
    Are there any specific reasons to keep some ENV vars in the docker-compose file, while having other configuration for the same services in the YML/properties files? The elasticsearch-1, elasticsearch-2 and kibana services have quite a number of environment variables that could also be specified in the respective YML files. The nifi service also contains some env variables that could probably be moved to nifi/conf/nifi.properties (although I've not tested this). Apparently some NiFi properties can only be set using ENV vars: https://stackoverflow.com/a/55266528/4141535.

  2. Create git tracked -EXAMPLE files for configuration files
    Just like I suggested earlier with deploy/.env-example (git tracked) and deploy/.env (git ignored), we can use this way of working for the OpenSearch YML and NiFi properties files as well. Custom deployments can copy the example file and tailor it to their needs. This makes it easy to pull new changes, and maintainers can inspect (e.g. using a diff-tool) the differences between the example and the used file to see whether properties are added/changed/deleted. This way of working was quite effective in previous projects I collaborated on (example).

  3. Remove container and network names, and rely on $COMPOSE_PROJECT_NAME (https://docs.docker.com/compose/reference/envvars/#compose_project_name). I documented how to use this in #19 .

  4. There are a lot of commented and uncommented lines regarding ElasticSearch and Kibana mounted security files. Perhaps this can be simplified by using a single ENV var, e.g. $ELASTICSEARCH_SECURITY_DIR, which we can also put in .env-example and point to ../security/es_certificates/opensearch/.

  5. In our deployments we set all host ports in the .env, outside of the docker-compose file. For example, - ${KIBANA_HOST_PORT}:5601. This makes it easy to switch between local deployments (fine if the port is open) and server deployments (we set the port to 127.0.0.1:5601 and let the reverse proxy on the host machine regulate traffic). What are your thoughts on moving this configuration to .env?

  6. When ports: is set, expose: no longer has any informative meaning, and can be removed (https://stackoverflow.com/a/40801773/4141535). Or do you include them for a different reason?

For a new user doing a new deployment, it would be nice to require the fewest possible actions to start the Docker containers. Perhaps creating a .env file from the .env-example and executing docker-compose up is enough. We can configure the .env-example to point to all the other example files, which the user can later replace with their deployment-specific configuration files.

I'd rather discuss this with you before creating a PR, since your workflows probably depend on the current way of working.

By the way, congratulations on releasing v1.0.0 :)

ElasticSearch reader component

We need to implement functionality to read only the newest documents from ElasticSearch since the last ingestion.

A NiFi database reader can persist the value of the last record's maximum-value columns (such as a primary key), and hence keep track of new records available to be ingested. However, no such option seems to be implemented when reading documents from ElasticSearch.

Integration tests for supported workflows

We need to provide integration tests for supported workflows - these include for now:

  • 1. documents ingestion: DB -> ES
  • 2. documents ingestion with text extraction from BLOBs: DB -> Tika -> ES
  • 3. documents ingestion with NLP annotations extraction: DB -> NLP -> ES
  • 4. combined 2 and 3: DB -> Tika -> NLP -> ES
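As a sketch of what such a test could assert for workflow 1 (DB -> ES): after the NiFi flow has run, every source row should appear as a document in the target index. The key field name and the fetch results below are stand-ins for real DB/ES client calls.

```python
# Sketch of an integration-test assertion for the DB -> ES workflow:
# every source row, keyed by a document id, must reach ElasticSearch.
# The "document_id" field name is an illustrative assumption.

def assert_ingestion_complete(db_rows, es_docs, key="document_id"):
    """Check that every DB row has a matching document in ES."""
    db_ids = {row[key] for row in db_rows}
    es_ids = {doc["_source"][key] for doc in es_docs}
    missing = db_ids - es_ids
    assert not missing, f"{len(missing)} documents missing from ES: {sorted(missing)[:5]}"
```

The Tika and NLP variants (workflows 2-4) could extend this by additionally asserting on the presence of extracted text or annotation fields in each ES document.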

Sample DB docker not populated with data on startup

I am deploying CogStack on a Windows 10 machine for testing.

The sample DB, which is supposed to be populated with the following data, is empty, leading to the workflows not being run:
patients - structured patient information,
encounters - structured encounters information,
observations - structured observations information,


Any thoughts?

nlp response groovy script error

Using the script parse-anns-from-nlp-response-bulk.groovy in the NiFi annotation workflow gives an error, as it cannot validate ann_id using assert ann_id. I had to convert ann_id to a string to make it work:

    def ann_id = outAnn[annotation_id_field as String]
    assert ann_id.toString()

Any plans to move from OpenDistro to OpenSearch?

It seems the work on OpenDistro has now quite definitively moved into OpenSearch; see e.g. https://opendistro.github.io/for-elasticsearch/blog/2021/06/forward-to-opensearch/

Do you happen to have any plans to move to OpenSearch? As far as I can tell, this should not cause very fundamental problems. Kibana has been renamed OpenSearch Dashboards and still ships as a separate Docker image.

In the short term, the problem for me is that OpenDistro does not provide Docker images for Apple M1 chips, so I can no longer work with it locally. I might try to make the change myself, but I was wondering if you had any thoughts on the issue.

Thanks!
-Vincent

Support for Different File Data Sources

Currently, the preferred data source is a relational DB with documents included as BLOBs, but there are other possible sources, some of which are less urgently needed at present. I've ranked them by how likely I expect them to be encountered:

  • 1. BLOBs in database
  • 2. Pointers to file paths on a filesystem (or object store)
  • 3. Files on Filesystem with metadata e.g. patientID in Filename or File contents
  • 4. Object Store (e.g. S3 or other) with metadata e.g. patientID in object metadata labels
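For source 3, extracting metadata from filenames could look like the sketch below. The patientID_date_doctype.ext naming convention is an illustrative assumption; real conventions vary per site.

```python
# Sketch for source 3: files on a filesystem carrying metadata in the
# filename. The patientID_date_doctype.ext convention is an illustrative
# assumption, not a CogStack standard.
import re

FILENAME_PATTERN = re.compile(
    r"^(?P<patient_id>\d+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<doc_type>[A-Za-z]+)\.(?P<ext>\w+)$"
)

def parse_filename_metadata(filename):
    """Extract patient id, date and document type from a conventional filename.

    Returns a dict of the captured fields, or None if the name does not
    match the expected convention."""
    m = FILENAME_PATTERN.match(filename)
    if m is None:
        return None
    return m.groupdict()
```

Sources 2 and 4 would need similar small adapters, resolving a file path or reading object-store metadata labels before handing the payload to the rest of the pipeline.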
