Open Data Service (ODS)

The Open Data Service (ODS) is an application that can collect data from multiple sources simultaneously, process that data, and then offer an improved (or "cleaned") version to its clients. We aim to establish the ODS as the go-to place for using Open Data!

Quick Start

To run the ODS locally, execute docker-compose up in the project root directory. The UI will then be accessible at http://localhost:9000.


Configure the ODS

In order to fetch and transform data from an external source, the ODS needs to be configured. The configuration consists of two steps:

  • A Data Source needs to be configured, i.e. its URI, protocol, and data format have to be specified, as well as a trigger.
  • For each Data Source, one or more Pipelines can be configured to further process the data and possibly trigger a notification.

This configuration can be done programmatically via the API or browser-based via the UI.

Using the API

There is a collection of examples for entire configurations in our example request collection.
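
For illustration, creating a datasource via the adapter API could look roughly like the following sketch. The base URL, endpoint path, and field values are assumptions that loosely follow the Datasource model (protocol, format, metadata, trigger); please consult the example request collection for the authoritative format.

```js
// Sketch only – not the authoritative API format (see the example request collection).
// Base URL and endpoint path are assumptions; requires a runtime with the Fetch API
// (browser or Node 18+).
const ADAPTER_SERVICE = 'http://localhost:9000/api/adapter' // hypothetical base URL

const datasourceConfig = {
  protocol: { type: 'HTTP', parameters: { location: 'https://example.org/water-levels.xml' } },
  format: { type: 'XML', parameters: {} },
  metadata: {
    displayName: 'River water levels',
    author: 'icke',
    license: 'none',
    description: 'Water level data for German rivers'
  },
  // date format and interval unit (milliseconds) are assumptions
  trigger: { periodic: true, firstExecution: '2020-05-13T14:13:39Z', interval: 60000 }
}

fetch(`${ADAPTER_SERVICE}/datasources`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(datasourceConfig)
})
  .then(res => res.json())
  .then(created => console.log(`Created datasource with id ${created.id}`))
  .catch(console.error)
```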

Using the UI

The easiest way to use the ODS is via the UI. If you started the ODS with docker-compose, you can access the UI at http://localhost:9000/.

To demonstrate the ODS we will create a new pipeline to fetch water level data for German rivers and have a look at the collected data.

First, go to the Datasources page and click on Create new Datasource. The configuration workflow for creating a new Datasource is divided into the following four steps.


Step 1: Name the datasource.


Step 2: Configure an adapter to crawl the data. You can use the prefilled example settings.


Step 3: Describe additional meta-data for the data source.


Step 4: Configure how often the data should be fetched. If Periodic execution is disabled, the data will be fetched only once. With the two sliders, you can choose the interval duration. The first execution of the pipeline takes place at the Time of First Execution plus the interval; for example, with a Time of First Execution of 10:00 and an interval of 1 minute, the first fetch happens at 10:01. Please choose 1 minute so that you don't have to wait too long for the first data to arrive.


The configuration of the data source is now finished. In the overview, you see the recently created data source. Note the id to the left of the datasource name; we will need it in the next step.

To obtain the data fetched by this data source, you need to create a pipeline operating on the data source we just created. Go to the Pipelines page and click on Create new Pipeline. The creation process consists of three steps.

Step 1: Choose a name for the pipeline and enter the id of the datasource we just created.

Step 2: In this step, you can manipulate the raw data to fit your needs by writing JavaScript code. The data object represents the incoming raw data. In this example, the attribute test is added to the data object before returning it.
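
Pasted into the transformation editor, the prefilled example boils down to something like the following snippet (the attribute name test comes from the screenshot; the value is only illustrative):

```js
// Transformation code as entered in the UI: "data" is the incoming raw data object.
// A new attribute "test" is added before the object is returned (value is illustrative).
data.test = 'abc'
return data
```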

Step 3: Describe additional meta-data for the pipeline.

After clicking on the save button, you should see the recently created pipeline.

By clicking on the Data button inside the table, you can see the data collected by the pipeline.


In this storage view, you see all data sets of the related pipeline. At the top of this list, a static link shows the URL for fetching the data with a REST client. Each data entry in the list can be expanded to see the fetched data and additional meta-data.
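
That static link can be queried with any HTTP client. A minimal sketch using the Fetch API is shown below; the URL path is hypothetical, so use the one displayed in the storage view.

```js
// Sketch: fetching all stored data sets of a pipeline via the storage API.
// Replace the URL with the static link shown in the storage view; "1" stands for the pipeline id.
fetch('http://localhost:9000/api/storage/1')
  .then(res => res.json())
  .then(dataSets => console.log(dataSets))
  .catch(console.error)
```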

Project Structure

We use the microservice architectural style in this project. The microservices are located in the sub-directories and communicate with each other over the network at runtime. Each microservice has its own defined interface that has to be used by other services; direct access to another microservice's database is strictly prohibited. In production, each microservice can be replicated in order to scale the system (except, at the moment, the scheduler).

Microservice Architecture

Microservice | Description
Web-Client / UI | Easy and seamless configuration of Sources and Pipelines
Scheduler | Orchestrates the execution of Pipelines
Adapter-Service | Fetches data from Sources and imports it into the system
Pipeline-Service | Definition of pipelines, execution of data pipelines
Notification-Service | Execution of notifications
Storage-Service | Stores the data of Pipelines and offers an API for querying
Reverse-Proxy | Communication of the UI with the backend microservices, independent of the deployment environment

Further information about a specific microservice can be found in the respective README file. Examples showing the API of each microservice are in the example request directory.

Instructions on how to analyse the microservice architecture with a service dependency graph in Neo4j can be found here.

Details on the Docker image versions we use and the reasoning behind them can be found here.

Deployment

Docker images of the microservices that make up the ODS are deployed via our continuous deployment pipeline.

An online live instance is planned and will soon be available.

A detailed explanation of the available deployment mechanisms can be found in our deployment section.

Contributing

Contributions are welcome, and we thank you for your interest in contributing to the development of the ODS. There are several ways of contributing:

  • by implementing new features
  • by fixing known bugs
  • by filing bug reports
  • by improving the documentation
  • by discussing use cases that are not covered yet

You can check our issue board for open issues to work on, or create new issues with a feature request, bug report, etc. Before we can merge your contribution, you need to accept our Contributor License Agreement (CLA), which is integrated into the pull request process.

Please provide your contribution in the form of a pull request. We will then check your pull request as soon as possible and give you feedback if necessary. Please make sure that commits related to an issue (e.g. closing an issue) contain the issue number in the commit message.

Contact us

If you have any questions or would like to contact us, you can easily reach us via our Gitter channel. Issues can be reported via GitHub.

License

Copyright 2019-2020 Friedrich-Alexander Universität Erlangen-Nürnberg

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

SPDX-License-Identifier: AGPL-3.0-only

ods's People

Contributors

9dt, acasadoquijada, andreas-bauer, f3l1x98, felix-oq, georg-schwarz, hmartinez69, jenswaechtler, jsone-studios, ke45xumo, kexplx, knusperkrone, lechodecho, lunedis, mathiaszinnen, nxmyoz, sonallux


ods's Issues

Adapter/Scheduler RabbitMQ

The scheduler event polling at the adapter service needs to be replaced by RabbitMQ communication.
This entails:

  • Integrating RabbitMQ into the scheduler
  • Triggering data imports via the datasource id instead of sending the whole configuration (see the sketch below)
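
A minimal sketch of what the publishing side could look like, assuming the amqplib Node.js client; the exchange and routing-key names are hypothetical, not the actual ODS configuration:

```js
// Sketch only: the scheduler publishes a trigger event that carries just the datasource id
// instead of the whole configuration. Exchange and routing-key names are hypothetical.
const amqp = require('amqplib')

async function publishImportTrigger (datasourceId) {
  const connection = await amqp.connect('amqp://localhost:5672')
  const channel = await connection.createChannel()

  const exchange = 'ods_global'                        // hypothetical exchange name
  const routingKey = 'datasource.execution.triggered'  // hypothetical routing key

  await channel.assertExchange(exchange, 'topic', { durable: true })
  channel.publish(exchange, routingKey, Buffer.from(JSON.stringify({ datasourceId })))

  await channel.close()
  await connection.close()
}

publishImportTrigger(42).catch(console.error)
```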

Adapter Datasource Deserializing Test Sometimes Fails

Sometimes the datasource deserialization unit test fails for no apparent reason (at other times it runs through successfully).


expected: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}> but was: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}>
java.lang.AssertionError: expected: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}> but was: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at org.jvalue.ods.adapterservice.datasource.model.DatasourceTest.testDeserialization(DatasourceTest.java:29)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
	at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
	at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:182)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:164)
	at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:412)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
	at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
	at java.base/java.lang.Thread.run(Thread.java:830)


Fix picture embedding in README.md

Pictures embedded in the root level README.md currently have relative file-system paths, e.g. doc/configuration-example/01_overview.jpg.

This worked when the repository was on GitLab but does not work on GitHub anymore.

The relative paths have to be replaced with absolute URLs under which the pictures are accessible on GitHub, e.g. for the example above: https://github.com/jvalue/open-data-service/blob/master/doc/configuration-example/01_overview.jpg.

Core/Transformation: Move remaining pipelineConfig to transformation service

Currently, configurations for data transformation and notifications are stored and handled in the core service. This should instead be handled by the transformation service. The exact implementation depends heavily on whether we already have RabbitMQ or not; I suggest implementing RabbitMQ first so we do not have to implement the event handling twice.

Combine multiple data sources

We should enable the combination of data from multiple sources. This can be done either in the adapter service or in the transformation service. I suggest doing it in the latter since data combination is modeled better by a transformation than by an adaptation.

It is probably best to refactor the data flow in the ODS. Currently, a pipeline consists of one adapter, followed by multiple transformations, storage, and optional notifications; in pseudo-EBNF: A T* S N*. What we want are multiple stages of one adapter each, each followed by one transformation, and at the end optional storage and optional notifications, i.e.: {A T}+ [S] N*.

Since the transformations are Turing-complete, modeling multiple transformations after one adapter is redundant and could be simplified to just one. If we find out that it is more convenient to split a data transformation into multiple parts, we can still offer that in the UI and simply concatenate the transformations before passing them to the scheduler.

The AMSE projects revealed that a common use case for the ODS is getting URLs from one source that are then used to fetch data from another. To enable this, we need a dynamic adapter configuration. The easiest implementation would be to add the parameters for subsequent adapters to the data field so they can be used there.
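
As an illustration of the dynamic adapter configuration described above, a pipeline transformation could attach the parameters for the next adapter stage to the data field. This is only a sketch; the nextAdapter and detailsUrl attributes are made up and do not exist in the ODS today:

```js
// Transformation sketch: expose a URL found in the incoming data as the location
// parameter of a hypothetical subsequent adapter stage ("nextAdapter" is made up).
data.nextAdapter = {
  protocol: { type: 'HTTP', parameters: { location: data.detailsUrl } }, // URL taken from the first source
  format: { type: 'JSON', parameters: {} }
}
return data
```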

Adapter: Get imports and imported data of datasource via API

An import does not contain the data itself, but only an id, a timestamp, and a link (URL) to the data.

API suggestion (an example of querying these endpoints is sketched below):

  • /datasources/{id}/imports -> map of all imports (id, timestamp, link to data url)
  • /datasources/{id}/imports/{id} -> import (id, timestamp, link to data url)
  • /datasources/{id}/imports/latest -> latest data import
  • /datasources/{id}/imports/{id}/data -> the real data

Implementation:

  • managing imports and import data should be handled in a sub-package of datasources (not adapters). This probably requires a fair bit of refactoring.
  • /dataImport should return the imported data directly; no caching of this kind of data import
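
Once such endpoints exist, a client could use them roughly as sketched below. These endpoints are only a proposal from this issue; the base URL is hypothetical and the field names (id, timestamp, link) follow the description above.

```js
// Sketch of a client for the proposed import endpoints (not implemented yet).
// Requires a runtime with the Fetch API; the base URL is hypothetical.
const ADAPTER_SERVICE = 'http://localhost:9000/api/adapter'

async function fetchLatestImportData (datasourceId) {
  // metadata of the latest import: id, timestamp, link to the data
  const latest = await fetch(`${ADAPTER_SERVICE}/datasources/${datasourceId}/imports/latest`)
    .then(res => res.json())

  // the real data behind that import
  return fetch(`${ADAPTER_SERVICE}/datasources/${datasourceId}/imports/${latest.id}/data`)
    .then(res => res.json())
}

fetchLatestImportData(1).then(console.log).catch(console.error)
```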

Adapter: Improve Exception Logging

Currently, exceptions are just dumped to the console. Instead, we should log a meaningful message including context information.

UI: Paginate pipeline data

Currently, all the data a pipeline has produced in its lifetime is loaded as soon as the user clicks on the "data" button. For longer-running pipelines, this is very slow. We need some kind of pagination to limit the amount of data loaded at once.
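
One possible direction, purely illustrative since the storage API does not offer such parameters yet, would be hypothetical offset/limit query parameters that the UI consumes page by page:

```js
// Illustration only: paginated loading of pipeline data in the UI, assuming
// hypothetical "offset" and "limit" query parameters on the storage endpoint.
const STORAGE_SERVICE = 'http://localhost:9000/api/storage' // hypothetical base URL

async function fetchDataPage (pipelineId, page, pageSize = 50) {
  const offset = page * pageSize
  const res = await fetch(`${STORAGE_SERVICE}/${pipelineId}?offset=${offset}&limit=${pageSize}`)
  return res.json() // one page of data sets instead of the whole history
}
```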

Reload UI pages

In #56 we already made the first step towards making the UI reloadable on every page.

Top-level reloading works now on e.g. /datasources, but not on /datasources/new.
So I guess the issue is somewhere else? Any guesses?

Originally posted by @georg-schwarz in #56 (comment)

Document/Specify GitHub workflow guidelines

After migrating from GitLab to GitHub we need to explicitly define the work processes of the ODS development, i.e. usage of issue board, project board, pull requests, etc.
A first draft of the guidelines has to be added which can then be discussed and modified.

Storage: Data references

Instead of passing the actual data to the transformation service, the scheduler should pass a reference to the data. The transformation service should then fetch the data, transform it, and pass back a reference to the transformed data as its result. We need to decide whether the transformation service needs its own persistence to save its results or whether we want to use some kind of shared solution.

UI: Use imported data in transformation stepper

Instead of displaying unrelated sample data, pipeline transformation input data should be preview data from the corresponding adapter (if there is one). The newly added manual adapter trigger could be used for that.

Cleanup package.json

Over time, a lot of clutter has accumulated in the package.json(s) of the project. Remove all unnecessary entries and update the repository URL to point to GitHub.

System-Test: Service logs are only shown for successful tests

With our current CI configuration, the logs of services other than the test container in integration and system tests are only shown when the tests are successful. However, we need them precisely when tests fail, to simplify debugging.
This is because after a failing system/integration test the CI job is aborted (--exit-code-from flag).
We should think of a way to show the logs for failing tests.

Integration of Event-Driven Architecture PR

I just went through all the changes and came up with the following consecutive work packages in order to keep the PRs as small as possible.

WP A: Events for Notification Trigger

  • Notification Service reacts to the trigger event (already implemented in #120)
  • Scheduler sends the event via RabbitMQ to the Notification Service instead of using the trigger endpoint

WP B: Decouple Storage via Events

  • Integrate storage_mq from #102
  • Scheduler sends event via RabbitMQ to trigger storage_mq

WP C: Trigger Storage and Notification by Transformation

  • Move trigger functionality from scheduler to transformation to trigger notification and storage

WP D: Remove Core

  • Move pipeline config logic from core to transformation
  • Add trigger endpoint to transformation service
  • Remove pipeline config polling from scheduler
  • Use trigger endpoint for pipelines in scheduler instead of sending whole pipeline configurations for execution
    Note: stateless execution interface stays untouched!
    Note: Integration tests can be copied from core to transformation; they should still run through after the URL change

WP E: Trigger Pipelines via Trigger Event

  • Publish event after successful adapter execution in adapter service
  • Listen to adapter events in the transformation service to trigger pipelines (see the sketch after this list)
  • Remove trigger functionality from scheduler

Up next:

  • rename transformation service to pipeline service
  • event communication between adapter and scheduler
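
For WP E, the consuming side in the transformation/pipeline service could look roughly like the following sketch, again assuming the amqplib client; exchange, queue, and routing-key names are hypothetical:

```js
// Sketch only: the transformation/pipeline service consumes adapter success events
// and triggers the pipelines configured for that datasource. Names are hypothetical.
const amqp = require('amqplib')

async function listenForAdapterEvents () {
  const connection = await amqp.connect('amqp://localhost:5672')
  const channel = await connection.createChannel()

  const exchange = 'ods_global'                       // hypothetical
  const queue = 'pipeline.adapter-events'             // hypothetical
  const routingKey = 'datasource.execution.success'   // hypothetical

  await channel.assertExchange(exchange, 'topic', { durable: true })
  await channel.assertQueue(queue, { durable: true })
  await channel.bindQueue(queue, exchange, routingKey)

  await channel.consume(queue, msg => {
    if (msg === null) return
    const event = JSON.parse(msg.content.toString())
    console.log(`Adapter import succeeded, triggering pipelines for datasource ${event.datasourceId}`)
    // ...look up and execute the pipelines configured for this datasource...
    channel.ack(msg)
  })
}

listenForAdapterEvents().catch(console.error)
```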

Remove top-level package-lock.json

A leftover /package-lock.json still sits in the root of the repository. This file is no longer necessary and should be removed.

Update README to new UI

The pictures in the how-to show an outdated UI. We need to update the pictures and the description to match the current workflow for pipeline creation (Datasource and Pipeline separated).

Refactor UI: move datasources to their own directory

In PR #28 we separated datasources and pipelines in the UI. Since we kept the changes to a minimum, we still need to extract the datasource-related files into a separate directory. This means a lot of import-path fixing and file renaming.

Integration tests and RabbitMQ

We need to decide how to perform end-to-end tests using RabbitMQ instead of HTTP calls; suggestions for frameworks etc. are welcome.

Transformation: Pass data references

Instead of passing the actual data to the transformation service, the scheduler should pass a reference to the data. The transformation service should then fetch the data, transform it, and pass back a reference to the transformed data as its result. We need to decide whether the transformation service needs its own persistence to save its results or whether we want to use some kind of shared solution.
