LinkedPipes ETL

LinkedPipes ETL is an RDF based, lightweight ETL tool.

Requirements

For building locally

Installation and startup

You can run LP-ETL in Docker, or build it from the source.

Docker

To start LP-ETL you can use:

git clone https://github.com/linkedpipes/etl.git
cd etl
docker-compose up

Note that on Windows you might get a build error caused by an issue with BuildKit. See the temporary workaround.

You may need to run the docker-compose command with sudo or as a member of the docker group.

Configuration

Each component (executor, executor-monitor, storage, frontend) has a separate Dockerfile.

Environment variables:

  • LP_ETL_BUILD_BRANCH - The Dockerfiles are designed to build from the GitHub repository. This variable sets the branch; the default is master.
  • LP_ETL_BUILD_JAVA_TEST - Set to an empty value to run the Java tests. This slows down the build.
  • LP_ETL_DOMAIN - The URL of the instance. It is used instead of domain.uri from the configuration.
  • LP_ETL_FTP - The URL of the FTP server. It is used instead of executor-monitor.ftp.uri from the configuration.

For Docker Compose, there are additional environment variables:

  • LP_ETL_PORT - The port mapping for the frontend; this is where you connect to your instance. It does NOT have to match the port in LP_ETL_DOMAIN, for example when the instance runs behind a reverse proxy.

For example, to run LP-ETL from the develop branch on http://localhost:9080 you can use the following command:

curl https://raw.githubusercontent.com/linkedpipes/etl/develop/docker-compose-github.yml | LP_ETL_PORT=9080 LP_ETL_DOMAIN=http://localhost:9080 LP_ETL_BUILD_BRANCH=develop docker-compose -f - up
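The reverse-proxy case mentioned for LP_ETL_PORT can be sketched as follows. This is a minimal illustration, not the project's recommended setup: the public address https://etl.example.com is a hypothetical placeholder, and it assumes a proxy you run yourself forwards that address to local port 9080.

```shell
# Local port that the frontend is mapped to (what the reverse proxy forwards to).
export LP_ETL_PORT=9080

# Public URL of the instance as seen by users.
# Hypothetical value; it differs from the local port on purpose.
export LP_ETL_DOMAIN="https://etl.example.com"

# Branch to build from (master is the documented default).
export LP_ETL_BUILD_BRANCH=master

# Print the settings before starting so a mismatch is easy to spot.
echo "branch=${LP_ETL_BUILD_BRANCH} domain=${LP_ETL_DOMAIN} port=${LP_ETL_PORT}"

# Then, from a local clone of the repository:
# docker-compose up
```

Exporting the variables first keeps the docker-compose command itself short and makes the values easy to review.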

docker-compose utilizes several volumes that can be used to access/provide data. See docker-compose.yml comments for examples and configuration. You may want to create your own docker-compose.yml for custom configuration.

From source on Linux

Installation

$ git clone https://github.com/linkedpipes/etl.git
$ cd etl
$ mvn install

Configuration

Edit the configuration file deploy/configuration.properties as needed, mainly to change the paths to the working, storage, log and library directories.

Startup

$ cd deploy
$ ./executor.sh >> executor.log &
$ ./executor-monitor.sh >> executor-monitor.log &
$ ./storage.sh >> storage.log &
$ ./frontend.sh >> frontend.log &
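The four commands above can also be wrapped in a small helper. This is only a sketch: the function name start_components is hypothetical, and unlike the commands above it also redirects stderr into each log file.

```shell
# Start all four LP-ETL components in the background, in the same order as above.
# $1: path to the deploy directory (defaults to ./deploy).
start_components() {
  deploy_dir="${1:-deploy}"
  for component in executor executor-monitor storage frontend; do
    # The subshell keeps the caller's working directory unchanged.
    # Output (stdout and stderr) is appended to <component>.log in the deploy directory.
    ( cd "$deploy_dir" && exec "./${component}.sh" >> "${component}.log" 2>&1 ) &
  done
}
```

Call it as `start_components deploy` from the repository root. The components keep running in the background, so use `jobs` or the log files to check on them.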

Running LP-ETL as a systemd service

See example service files in the deploy/systemd folder.

From source on Windows

Note that it is also possible to use Bash on Ubuntu on Windows or Cygwin and proceed as with Linux.

Installation

git clone https://github.com/linkedpipes/etl.git
cd etl
mvn install

Configuration

Edit the configuration file deploy/configuration.properties as needed, mainly to change the paths to the working, storage, log and library directories.

Startup

In the deploy folder, run:

  • executor.bat
  • executor-monitor.bat
  • storage.bat
  • frontend.bat

Data import

You can copy pipelines, templates and mapping data from one instance to another directly only if both instances run on the same domain. As this is usually not the case, you need to use a special script to update the resources.

Assume that you have a copy of a data directory ./data-source with the knowledge, pipelines and templates sub-directories. You can obtain the directory from any running instance, and you can even merge the content of several such directories. Next, you would like to import the data into a new instance. The new instance has a data directory ./data-target, and its domain, set in the configuration as domain.uri, is https://example.com.

In that case you can use a Python script from the script directory. The script is called change_domain.py and requires rdflib to be installed. Once rdflib is installed, you can run the script using the following command:

python change_domain.py --input ./data-source --domain https://example.com --output ./data-target

After the script finishes, you can start the target instance of LinkedPipes ETL and all the data should be available there.
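Before running the script, it can help to confirm that the source directory has the layout described above. The following is a minimal sketch; the helper name check_data_dir is hypothetical and not part of LP-ETL.

```shell
# Return 0 if the directory in $1 contains the knowledge, pipelines and
# templates sub-directories; otherwise report the first missing one and return 1.
check_data_dir() {
  for sub in knowledge pipelines templates; do
    if [ ! -d "$1/$sub" ]; then
      echo "missing sub-directory: $1/$sub" >&2
      return 1
    fi
  done
}

# Possible usage, assuming pip is the package manager used for rdflib:
# pip install rdflib
# check_data_dir ./data-source && \
#   python change_domain.py --input ./data-source --domain https://example.com --output ./data-target
```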

Plugins - Components

The components live in the jars directory. If you need to create your own component, you can copy an existing component and change it.

Update notes

Update note 5: 2019-09-03 breaking changes in the configuration file. Remove /api/v1 from the executor-monitor.webserver.uri, so it looks like: executor-monitor.webserver.uri = http://localhost:8081. You can also remove executor.execution.uriPrefix as the value is derived from domain.uri.

Update note 4: 2019-07-03 we changed the way frontend is run. If you do not use our script to run it, you need to update yours.

Update note 3: When upgrading from develop prior to 2017-02-14, you need to delete {deploy}/jars and {deploy}/osgi.

Update note 2: When upgrading from master prior to 2016-11-04, you need to move your pipelines folder from e.g., /data/lp/etl/pipelines to /data/lp/etl/storage/pipelines, update the configuration.properties file and possibly the update/restart scripts, as there is a new component, storage.

Update note: When upgrading from master prior to 2016-04-07, you need to delete your old execution data (e.g., in /data/lp/etl/working/data)
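For update note 5, the /api/v1 suffix can be removed by hand or with a short sed command. The sketch below is one possible way to do it; the function name fix_monitor_uri is hypothetical, and it assumes the property is written on a single line as key = value.

```shell
# Strip a trailing /api/v1 from executor-monitor.webserver.uri in the
# configuration file given as $1. The original file is kept as <file>.bak.
fix_monitor_uri() {
  sed -i.bak -E \
    's#^(executor-monitor\.webserver\.uri[[:space:]]*=.*)/api/v1/?[[:space:]]*$#\1#' \
    "$1"
}
```

For example, `fix_monitor_uri deploy/configuration.properties` turns a value such as http://localhost:8081/api/v1 into http://localhost:8081 and leaves all other lines untouched.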

Contributors

cowclaw, jakubklimek, jindrichmynarz, jlleitschuh, nvdk, skodapetr
