Open Data Service (ODS)

The Open Data Service (ODS) is an application that can collect data from multiple sources simultaneously, process that data, and then offer an improved (or "cleaned") version to its clients. We aim to establish the ODS as the go-to place for using Open Data!

Quick Start

To run the ODS locally, execute docker-compose up in the project root directory. The UI will then be accessible at http://localhost:9000.


Configure the ODS

In order to fetch and transform data from an external source, the ODS needs to be configured. The configuration consists of two steps:

  • A Data Source needs to be configured, i.e. its URI, protocol, and data format have to be specified, as well as a trigger.
  • For each Data Source, one or more Pipelines can be configured to further process the data and possibly trigger a notification.

This configuration can be done programmatically via the API or browser-based via the UI.

Using the API

There is a collection of examples for entire configurations in our example request collection.
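
For illustration, creating a datasource via the adapter API could look roughly like the following sketch. The base URL, endpoint path, and field values are assumptions that loosely follow the Datasource model (protocol, format, metadata, trigger); please consult the example request collection for the authoritative format.

```js
// Sketch only – not the authoritative API format (see the example request collection).
// Base URL and endpoint path are assumptions; requires a runtime with the Fetch API
// (browser or Node 18+).
const ADAPTER_SERVICE = 'http://localhost:9000/api/adapter' // hypothetical base URL

const datasourceConfig = {
  protocol: { type: 'HTTP', parameters: { location: 'https://example.org/water-levels.xml' } },
  format: { type: 'XML', parameters: {} },
  metadata: {
    displayName: 'River water levels',
    author: 'icke',
    license: 'none',
    description: 'Water level data for German rivers'
  },
  // date format and interval unit (milliseconds) are assumptions
  trigger: { periodic: true, firstExecution: '2020-05-13T14:13:39Z', interval: 60000 }
}

fetch(`${ADAPTER_SERVICE}/datasources`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(datasourceConfig)
})
  .then(res => res.json())
  .then(created => console.log(`Created datasource with id ${created.id}`))
  .catch(console.error)
```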

Using the UI

The easiest way to use the ODS is via the UI. If you started the ODS with docker-compose, you can access the UI at http://localhost:9000/.

To demonstrate the ODS we will create a new pipeline to fetch water level data for German rivers and have a look at the collected data.

First, go to the Datasources page and click on Create new Datasource. The configuration workflow for creating a new Datasource is divided into the following four steps.


Step 1: Name the datasource.


Step 2: Configure an adapter to crawl the data. You can use the prefilled example settings.


Step 3: Describe additional meta-data for the data source.


Step 4: Configure how often the data should be fetched. If Periodic execution is disabled, the data will be fetched only once. With the two sliders, you can choose the interval duration. The first execution of the pipeline takes place at the Time of First Execution plus the interval; for example, with a Time of First Execution of 10:00 and an interval of 1 minute, the first fetch happens at 10:01. Please choose 1 minute so that you don't have to wait too long for the first data to arrive.


The configuration of the data source is now finished. In the overview, you see the recently created data source. Note the id to the left of the datasource name; we will need it in the next step.

To obtain the data fetched by this data source, you need to create a pipeline operating on the data source we just created. Go to the Pipelines page and click on Create new Pipeline. The creation process consists of three steps.

Step 1: Choose a name for the pipeline and enter the id of the datasource we just created.

Step 2: In this step, you can manipulate the raw data to fit your needs by writing JavaScript code. The data object represents the incoming raw data. In this example, the attribute test is added to the data object before returning it.
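
Pasted into the transformation editor, the prefilled example boils down to something like the following snippet (the attribute name test comes from the screenshot; the value is only illustrative):

```js
// Transformation code as entered in the UI: "data" is the incoming raw data object.
// A new attribute "test" is added before the object is returned (value is illustrative).
data.test = 'abc'
return data
```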

Step 3: Describe additional meta-data for the pipeline.

After clicking on the save button, you should see the recently created pipeline.

By clicking on the Data button inside the table, you can see the data collected by the pipeline.


In this storage view, you see all data sets of the related pipeline. At the top of this list, a static link shows the URL for fetching the data with a REST client. Each data entry in the list can be expanded to see the fetched data and additional meta-data.
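
That static link can be queried with any HTTP client. A minimal sketch using the Fetch API is shown below; the URL path is hypothetical, so use the one displayed in the storage view.

```js
// Sketch: fetching all stored data sets of a pipeline via the storage API.
// Replace the URL with the static link shown in the storage view; "1" stands for the pipeline id.
fetch('http://localhost:9000/api/storage/1')
  .then(res => res.json())
  .then(dataSets => console.log(dataSets))
  .catch(console.error)
```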

Project Structure

We use the microservice architectural style in this project. The microservices are located in the sub-directories and communicate with each other over the network at runtime. Each microservice has its own defined interface that has to be used by other services; direct access to another microservice's database is strictly prohibited. In production, each microservice can be replicated in order to scale the system (except, at the moment, the scheduler).

Microservice Architecture

Microservice | Description
Web-Client / UI | Easy and seamless configuration of Sources and Pipelines
Scheduler | Orchestrates the execution of Pipelines
Adapter-Service | Fetches data from Sources and imports it into the system
Pipeline-Service | Definition of pipelines, execution of data pipelines
Notification-Service | Execution of notifications
Storage-Service | Stores the data of Pipelines and offers an API for querying
Reverse-Proxy | Communication of the UI with the backend microservices, independent of the deployment environment

Further information about a specific microservice can be found in the respective README file. Examples showing the API of each microservice are in the example request directory.

Instructions on how to analyse the microservice architecture with a service dependency graph in Neo4j can be found here.

Details on the Docker image versions we use and the reasoning behind them can be found here.

Deployment

Docker images of the microservices that make up the ODS are deployed via our continuous deployment pipeline.

An online live instance is planned and will soon be available.

A detailed explanation of the available deployment mechanisms can be found in our deployment section.

Contributing

Contributions are welcome, and we thank you for your interest in contributing to the development of the ODS. There are several ways of contributing:

  • by implementing new features
  • by fixing known bugs
  • by filing bug reports
  • by improving the documentation
  • by discussing use cases that are not covered yet

You can check our issue board for open issues to work on, or create new issues with a feature request, bug report, etc. Before we can merge your contribution, you need to accept our Contributor License Agreement (CLA), which is integrated into the pull request process.

Please provide your contribution in the form of a pull request. We will then check your pull request as soon as possible and give you feedback if necessary. Please make sure that commits related to an issue (e.g. closing an issue) contain the issue number in the commit message.

Contact us

If you have any questions or would like to contact us, you can easily reach us via our Gitter channel. Issues can be reported via GitHub.

License

Copyright 2019-2020 Friedrich-Alexander Universität Erlangen-Nürnberg

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

SPDX-License-Identifier: AGPL-3.0-only

ods's People

Contributors

9dt, acasadoquijada, andreas-bauer, f3l1x98, felix-oq, georg-schwarz, hmartinez69, jenswaechtler, jsone-studios, ke45xumo, kexplx, knusperkrone, lechodecho, lunedis, mathiaszinnen, nxmyoz, sonallux


ods's Issues

Adapter/Scheduler RabbitMQ

The scheduler event polling at the adapter service needs to be replaced by RabbitMQ communication.
This entails:

  • Integrating RabbitMQ into the scheduler
  • Triggering data imports via the datasource id instead of sending the whole configuration (see the sketch below)
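
A minimal sketch of what the publishing side could look like, assuming the amqplib Node.js client; the exchange and routing-key names are hypothetical, not the actual ODS configuration:

```js
// Sketch only: the scheduler publishes a trigger event that carries just the datasource id
// instead of the whole configuration. Exchange and routing-key names are hypothetical.
const amqp = require('amqplib')

async function publishImportTrigger (datasourceId) {
  const connection = await amqp.connect('amqp://localhost:5672')
  const channel = await connection.createChannel()

  const exchange = 'ods_global'                        // hypothetical exchange name
  const routingKey = 'datasource.execution.triggered'  // hypothetical routing key

  await channel.assertExchange(exchange, 'topic', { durable: true })
  channel.publish(exchange, routingKey, Buffer.from(JSON.stringify({ datasourceId })))

  await channel.close()
  await connection.close()
}

publishImportTrigger(42).catch(console.error)
```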

Adapter Datasource Deserializing Test Sometimes Fails

Sometimes the datasource deserialization unit test fails for no apparent reason (at other times it runs through successfully).


expected: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}> but was: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}>
java.lang.AssertionError: expected: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}> but was: org.jvalue.ods.adapterservice.datasource.model.Datasource<Datasource{id=123, protocol=AdapterProtocolConfig {type='HTTP', parameters='{location=http://www.the-inder.net}'}, format=AdapterFormatConfig {type='XML', parameters='{}'}, metadata=PipelineMetadata{displayName='TestName', author='icke', license='none', creationTimestamp=13 May 2020 14:13:39 GMT, description='Describing...'}, trigger=PipelineTriggerConfig{periodic=true, firstExecution=Fri Dec 01 03:30:00 CET 1905, interval=50000}}>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at org.jvalue.ods.adapterservice.datasource.model.DatasourceTest.testDeserialization(DatasourceTest.java:29)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
	at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
	at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:182)
	at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:164)
	at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:412)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
	at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
	at java.base/java.lang.Thread.run(Thread.java:830)


Fix picture embedding in README.md

Pictures embedded in the root level README.md currently have relative file-system paths, e.g. doc/configuration-example/01_overview.jpg.

This worked when the repository was on GitLab but does not work on GitHub anymore.

The relative paths have to be replaced with absolute URLs under which the pictures are accessible on GitHub, e.g. for the example above: https://github.com/jvalue/open-data-service/blob/master/doc/configuration-example/01_overview.jpg.

Core/Transformation: Move remaining pipelineConfig to transformation service

Currently, configurations for data transformation and notifications are stored and handled in the core service. This should instead be handled by the transformation service. The exact implementation depends heavily on whether we already have RabbitMQ or not; I suggest implementing RabbitMQ first so we do not have to implement the event handling twice.

Combine multiple data sources

We should enable the combination of data from multiple sources. This can be done either in the adapter service or in the transformation service. I suggest doing it in the latter since data combination is modeled better by a transformation than by an adaptation.

It is probably best to refactor the data flow in the ODS. Currently, a pipeline consists of one adapter, followed by multiple transformations, storage, and optional notifications; in pseudo-EBNF: A T* S N*. What we want are multiple stages of one adapter each, each followed by one transformation, and at the end optional storage and optional notifications, i.e.: {A T}+ [S] N*.

Since the transformations are Turing-complete, modeling multiple transformations after one adapter is redundant and could be simplified to just one. If we find out that it is more convenient to split a data transformation into multiple parts, we can still offer that in the UI and simply concatenate the transformations before passing them to the scheduler.

The AMSE projects revealed that a common use case for the ODS is getting URLs from one source that are then used to fetch data from another. To enable this, we need a dynamic adapter configuration. The easiest implementation would be to add the parameters for subsequent adapters to the data field so they can be used there.
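
As an illustration of the dynamic adapter configuration described above, a pipeline transformation could attach the parameters for the next adapter stage to the data field. This is only a sketch; the nextAdapter and detailsUrl attributes are made up and do not exist in the ODS today:

```js
// Transformation sketch: expose a URL found in the incoming data as the location
// parameter of a hypothetical subsequent adapter stage ("nextAdapter" is made up).
data.nextAdapter = {
  protocol: { type: 'HTTP', parameters: { location: data.detailsUrl } }, // URL taken from the first source
  format: { type: 'JSON', parameters: {} }
}
return data
```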

Adapter: Get imports and imported data of datasource via API

An import does not contain the data itself, but only an id, a timestamp, and a link (URL) to the data.

API suggestion (an example of querying these endpoints is sketched below):

  • /datasources/{id}/imports -> map of all imports (id, timestamp, link to data url)
  • /datasources/{id}/imports/{id} -> import (id, timestamp, link to data url)
  • /datasources/{id}/imports/latest -> latest data import
  • /datasources/{id}/imports/{id}/data -> the real data

Implementation:

  • managing imports and import data should be handled in a sub-package of datasources (not adapters). This probably requires a fair bit of refactoring.
  • /dataImport should return the imported data directly; no caching of this kind of data import
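
Once such endpoints exist, a client could use them roughly as sketched below. These endpoints are only a proposal from this issue; the base URL is hypothetical and the field names (id, timestamp, link) follow the description above.

```js
// Sketch of a client for the proposed import endpoints (not implemented yet).
// Requires a runtime with the Fetch API; the base URL is hypothetical.
const ADAPTER_SERVICE = 'http://localhost:9000/api/adapter'

async function fetchLatestImportData (datasourceId) {
  // metadata of the latest import: id, timestamp, link to the data
  const latest = await fetch(`${ADAPTER_SERVICE}/datasources/${datasourceId}/imports/latest`)
    .then(res => res.json())

  // the real data behind that import
  return fetch(`${ADAPTER_SERVICE}/datasources/${datasourceId}/imports/${latest.id}/data`)
    .then(res => res.json())
}

fetchLatestImportData(1).then(console.log).catch(console.error)
```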

Adapter: Improve Exception Logging

Currently, exceptions are just dumped to the console. Instead, we should log a meaningful message including context information.

UI: Paginate pipeline data

Currently, all the data a pipeline has produced in its lifetime is loaded as soon as the user clicks on the "data" button. For longer-running pipelines, this is very slow. We need some kind of pagination to limit the amount of data loaded at once.
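
One possible direction, purely illustrative since the storage API does not offer such parameters yet, would be hypothetical offset/limit query parameters that the UI consumes page by page:

```js
// Illustration only: paginated loading of pipeline data in the UI, assuming
// hypothetical "offset" and "limit" query parameters on the storage endpoint.
const STORAGE_SERVICE = 'http://localhost:9000/api/storage' // hypothetical base URL

async function fetchDataPage (pipelineId, page, pageSize = 50) {
  const offset = page * pageSize
  const res = await fetch(`${STORAGE_SERVICE}/${pipelineId}?offset=${offset}&limit=${pageSize}`)
  return res.json() // one page of data sets instead of the whole history
}
```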

Reload UI pages

In #56 we already made the first step towards making the UI reloadable on every page.

Top-level reloading works now on e.g. /datasources, but not on /datasources/new.
So I guess the issue is somewhere else? Any guesses?

Originally posted by @georg-schwarz in #56 (comment)

Document/Specify GitHub workflow guidelines

After migrating from GitLab to GitHub we need to explicitly define the work processes of the ODS development, i.e. usage of issue board, project board, pull requests, etc.
A first draft of the guidelines has to be added which can then be discussed and modified.

Storage: Data references

Instead of passing the actual data to the transformation service, the scheduler should pass a reference to the data. The transformation service should then fetch the data, transform it, and pass back a reference to the transformed data as its result. We need to decide whether the transformation service needs its own persistence to save its results or whether we want to use some kind of shared solution.

UI: Use imported data in transformation stepper

Instead of displaying unrelated sample data, pipeline transformation input data should be preview data from the corresponding adapter (if there is one). The newly added manual adapter trigger could be used for that.

Cleanup package.json

Over time, a lot of clutter has accumulated in the package.json(s) of the project. Remove all unnecessary entries and update the repository URL to point to GitHub.

System-Test: Service logs are only shown for successful tests

With our current CI configuration, the logs of services other than the test container in integration and system tests are only shown when the tests are successful. However, we need them precisely when tests fail, to simplify debugging.
This is because after a failing system/integration test the CI job is aborted (--exit-code-from flag).
We should think of a way to show the logs for failing tests.

Integration of Event-Driven Architecture PR

I just went through all the changes and came up with the following consecutive work packages in order to keep the PRs as small as possible.

WP A: Events for Notification Trigger

  • Notification Service reacts to the trigger event (already implemented in #120)
  • Scheduler sends the event via RabbitMQ to the Notification Service instead of using the trigger endpoint

WP B: Decouple Storage via Events

  • Integrate storage_mq from #102
  • Scheduler sends event via RabbitMQ to trigger storage_mq

WP C: Trigger Storage and Notification by Transformation

  • Move trigger functionality from scheduler to transformation to trigger notification and storage

WP D: Remove Core

  • Move pipeline config logic from core to transformation
  • Add trigger endpoint to transformation service
  • Remove pipeline config polling from scheduler
  • Use trigger endpoint for pipelines in scheduler instead of sending whole pipeline configurations for execution
    Note: stateless execution interface stays untouched!
    Note: Integration tests can be copied from core to transformation; they should still run through after the URL change

WP E: Trigger Pipelines via Trigger Event

  • Publish event after successful adapter execution in adapter service
  • Listen to adapter events in the transformation service to trigger pipelines (see the sketch after this list)
  • Remove trigger functionality from scheduler

Up next:

  • rename transformation service to pipeline service
  • event communication between adapter and scheduler
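
For WP E, the consuming side in the transformation/pipeline service could look roughly like the following sketch, again assuming the amqplib client; exchange, queue, and routing-key names are hypothetical:

```js
// Sketch only: the transformation/pipeline service consumes adapter success events
// and triggers the pipelines configured for that datasource. Names are hypothetical.
const amqp = require('amqplib')

async function listenForAdapterEvents () {
  const connection = await amqp.connect('amqp://localhost:5672')
  const channel = await connection.createChannel()

  const exchange = 'ods_global'                       // hypothetical
  const queue = 'pipeline.adapter-events'             // hypothetical
  const routingKey = 'datasource.execution.success'   // hypothetical

  await channel.assertExchange(exchange, 'topic', { durable: true })
  await channel.assertQueue(queue, { durable: true })
  await channel.bindQueue(queue, exchange, routingKey)

  await channel.consume(queue, msg => {
    if (msg === null) return
    const event = JSON.parse(msg.content.toString())
    console.log(`Adapter import succeeded, triggering pipelines for datasource ${event.datasourceId}`)
    // ...look up and execute the pipelines configured for this datasource...
    channel.ack(msg)
  })
}

listenForAdapterEvents().catch(console.error)
```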

Remove top-level package-lock.json

A leftover /package-lock.json still sits in the root of the repository. This file is no longer necessary and should be removed.

Update README to new UI

The pictures in the how-to show an outdated UI. We need to update the pictures and the description to match the current workflow for pipeline creation (Datasource and Pipeline separated).

Refactor UI: move datasources to their own directory

In PR #28 we separated datasources and pipelines in the UI. Since we kept the changes to a minimum, we still need to extract the datasource-related files into a separate directory. This means a lot of import-path fixing and file renaming.

Integration tests and RabbitMQ

We need to decide how to perform end-to-end tests using RabbitMQ instead of HTTP calls; suggestions for frameworks etc. are welcome.

Transformation: Pass data references

Instead of passing the actual data to the transformation service, the scheduler should pass a reference to the data. The transformation service should then fetch the data, transform it, and pass back a reference to the transformed data as its result. We need to decide whether the transformation service needs its own persistence to save its results or whether we want to use some kind of shared solution.
