chicago / open-data-etl-utility-kit

Use Pentaho's open source data integration tool (Kettle) to create Extract-Transform-Load (ETL) processes to update a Socrata open data portal. Documentation is available at http://open-data-etl-utility-kit.readthedocs.io/en/stable

License: Other

Languages: Shell 67.85%, Batchfile 32.15%
Topics: open-data, socrata, government, chicago, pentaho, kettle, etl

open-data-etl-utility-kit's Introduction

ETL Utilities for an Open Data Program

This toolkit provides several utilities and a framework to help governments deploy automated ETLs using the open-source Pentaho Data Integration (Kettle) software.

Namely, this toolkit:

  • Loads data from a database and uploads it to a Socrata data portal
  • Integrates with an SMTP server to send e-mail alerts on the outcome of ETL scripts to administrators
  • Handles deployment issues when using multiple operating systems during development
  • Provides utilities that let administrators quickly analyze ETL log files for diagnostics

The ETL framework is organized so that each function lives in a single file used by all ETLs. This makes maintenance, upgrades, and modification easier across hundreds of ETLs.

Features

  • Open source at the core - this framework can be deployed using Kettle, an open-source ETL tool. With an annual support subscription, Pentaho also provides telephone support and training if desired.
  • Compatible with multiple data sources - this ETL framework can be used with a variety of data sources, including a range of databases (MySQL, PostgreSQL, Oracle, SQL Server, and a variety of NoSQL stores), APIs, text files, etc.
  • Compatible workflow for multiple operating systems - ETLs can be developed and deployed across multiple operating systems; for instance, developed on Windows and deployed on Linux.
  • Helpful utilities - includes several scripts to help users quickly analyze log files.

Requirements

The recommended configuration requires the following software:

  • Kettle (Pentaho Data Integration) - Note: This framework has only been tested with Kettle 4.4.0 and lower.
  • Java 1.6 or higher
  • DataSync (for use with Socrata) - Note: This framework is designed for the version of DataSync in the DataSync directory and will not necessarily work with earlier or later versions.
  • Mac OS X, Linux, or Unix (only required for full automation with the included scripts)
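
A quick way to sanity-check these prerequisites from a shell (standard commands; the DataSync path follows this repository's layout):

    # Java prints its version to stderr; confirm it reports 1.6 or higher.
    java -version

    # Confirm the bundled DataSync jar is present where the framework expects it.
    ls DataSync/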

Kettle Compatibility

This framework has only been tested with Kettle 4.3.0 and Kettle 4.4.0. It may be fully compatible with Kettle 5.x, but this has not been tested. If you would like to contribute, please see the issues page.

Errors / Bugs

Experiencing issues with the included files? Report them on our issue tracker.

open-data-etl-utility-kit's People

Contributors

fgregg, jefw, levyj, palewire, shua123, timwis, tomschenkjr


open-data-etl-utility-kit's Issues

Add unit testing for ETL framework

Admittedly, it's hard to define a solid unit-testing framework, but we should have some tests for regression testing. This should be done in Windows (AppVeyor) and Linux (Travis CI) builds.

  • Compare outputs of ETL utilities (e.g., Log/A_DatasetLogs, Log/A_ETLRuntimes) to ensure they produce the "correct" answers
  • Ensure four-by-four checks work with ETL utilities
  • ...?

Using 'result' as field name can result in email notification error

In the image below, the "Get System Data" step creates a field called "result", which contains the result of the previous job entry in the job (KJB file) that called the ETL_Completion_E-Mail.ktr transformation containing these steps. The "Set Subject and Body" step makes use of this result value.

If there is a field called "result" in the upstream transformation (KTR file) for the dataset update itself, this appears to create a conflict and causes ETL_Completion_E-Mail.ktr to fail.

[Image: screenshot of the "Get System Data" and "Set Subject and Body" steps in ETL_Completion_E-Mail.ktr]

ignoreServerLatLong

Need to eliminate ignoreServerLatLong from the control-file template. The option is no longer compatible with DataSync (at least as of version 1.5.4).
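
A minimal sketch of the cleanup (the template path is hypothetical; adjust it to wherever the control-file template actually lives in this repo):

    # Delete any line mentioning the deprecated option; keep a .bak copy.
    # Note: if the entry left a dangling comma behind, the resulting JSON
    # may need a manual touch-up afterward.
    sed -i.bak '/ignoreServerLatLong/d' ETL/Utilities/control_template.json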

Info request

Hi all,
I cloned your repo for instructional purposes and to run some tests. I don't understand where the environment variables are set. For example, for the variable P_Sender_Address, I can't find a config file that sets it and the others, and it isn't explained in the documentation. Perhaps I'm missing something!
Thanks in advance for your help.
Franco
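
For what it's worth, Kettle conventionally reads user-defined variables from kettle.properties in the .kettle directory (~/.kettle by default, or $KETTLE_HOME/.kettle). Whether this kit expects P_Sender_Address there is an assumption, but a minimal sketch would be:

    # Append the variable to the user's Kettle properties file; Kettle loads
    # this file at startup and exposes each entry as ${P_Sender_Address}, etc.
    # The address below is a placeholder.
    cat >> ~/.kettle/kettle.properties <<'EOF'
    P_Sender_Address=etl-alerts@example.org
    EOF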

Warning produced running job

Right now, a successful completion of a Kettle ETL for the portal will produce the following warning:

WARNING: -h,--fileToPublishHasHeaderRow is being ignored because -sc,--pathToFTPControlFile or -cf,--pathToControlFile was supplied.

Publishing ETLs will succeed, but will only recognize the control-file settings.

We just need to fix the ETL/Utilities/DataSyncHTTP.ktr step, which assembles the DataSync command.
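
For reference, a hedged sketch of the command DataSyncHTTP.ktr should assemble. The flag names come from the warning above and from DataSync 1.x documentation as best I recall it; the jar path, file paths, and dataset ID are placeholders. The point is that when -cf is supplied, -h should be omitted entirely:

    # -c: global DataSync config (domain, credentials)
    # -f: file to publish; -i: target dataset's four-by-four ID (placeholder)
    # -cf: control file, which now governs the header-row setting, so the
    #      -h/--fileToPublishHasHeaderRow flag is dropped.
    java -jar DataSync/datasync.jar \
      -c DataSync/config.json \
      -f /path/to/output.csv \
      -i abcd-1234 \
      -cf /path/to/control.json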

Update metadata when updating data

With some of our currently manual updates, we have been updating the dataset description or title with date information. I have been thinking about ways this could be integrated into the ETL workflow.

One possibility: DataSync has a metadata job option. This could be integrated into the workflow to run before or after the DataSync replace job. The downside is that DataSync's metadata job needs to make a working copy, which is unnecessary overhead.

Create documentation

We will use ReadTheDocs to publish documentation for the workflow.

Considering the following outline, please let me know your thoughts.

  • Introduction (index.rst)
  • Installation and configuration
  • Creating and configuring an ETL
  • Configuring for automation
  • Utilities for administering ETLs

I've opened a new branch (iss1) for work on the documentation.

Add CLA

Need to add CLA for any contributors.

GIS Importer version

Create an alternative/optional workflow that uses Socrata's gis_importer jar instead of DataSync in order to update Mondara datasets with zipped shapefiles.

Support automation for Windows

Right now, a substantial portion of the workflow automation is based on Bash scripts (.sh).

The following files need to be transitioned to a Windows-supportable format (i.e., batch scripts); see the sketch below:
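
As a starting point, a minimal batch sketch of a Windows counterpart (paths are hypothetical; Kitchen.bat ships with Kettle and mirrors kitchen.sh):

    :: Invoke a Kettle job the way the Bash scripts invoke kitchen.sh.
    :: Adjust the install and job paths to your environment.
    call C:\open-data-etl-utility-kit\data-integration\Kitchen.bat ^
      /file:C:\open-data-etl-utility-kit\ETL\MyDataset.kjb /level:Basic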

.gitignore additions

Since the documentation recommends putting Kettle folders within the kit's folder structure, I would suggest adding data-integration*/ to .gitignore.

Maybe also add DataSync/config.json so that password and connection settings are not accidentally committed to a public repo; I'm not sure. There are also other settings that end up saved in generic files, like the e-mail settings in open-data-etl-kit/ETL/Utilities/ETL_Completion_E-Mail.ktr. It might be helpful to pull all such settings out and put them into kettle.properties or some other environment-variable-type file. A sketch of the .gitignore additions follows below:
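
The entries here are this issue's proposals, not current repo contents:

    # Append the proposed entries to .gitignore so local Kettle installs and
    # credential-bearing config stay out of version control.
    cat >> .gitignore <<'EOF'
    data-integration*/
    DataSync/config.json
    EOF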

In general, I'm just trying to make it clearer which files and folders can safely be in a public repo versus which ones probably shouldn't be (config settings, dataset-specific files, etc.).

Include DataSync in repo?

We may want to include a working JAR in the DataSync/ directory from the DataSync release pages. Right now, a placeholder puts the onus of installing DataSync on the user.

The objective is to include the highest version of DataSync that is compatible with the commands that assemble an ETL job for the portal. We're starting to experience issues with newer versions of DataSync that are not compatible with the flags or control files included in this repo.

This would not mean the repo is incompatible with higher versions, but it would provide some mechanism for verifying compatibility.

kettle_etl_workflow.png has some incorrect references

I made a few errors on this one:

  • The use of ~/ is inconsistent with the rest of the documentation. It needs to be changed to open-data-etl-utility-kit.
  • The log analysis tools (e.g., A_DatasetLogs.sh) are shown under open-data-etl-utility-kit/ETL/Utilities instead of the ../Tools directory.
  • The connection between the diagram and the rest of the text is unclear to readers. The documentation needs to be made clearer for each step.

Automated way to set good column widths for new datasets

New datasets often have column widths that are too narrow or too wide. It is possible to get and change the widths using the views API (e.g., https://data.cityofchicago.org/api/views/ydr8-5enu). This raises the possibility of creating a tool to set widths for a new dataset.

Setting the widths probably is not that hard. The more challenging part of this idea would be figuring out what the widths should be. This is a function of both the column name and the longest value, as well as some subjective preferences about whether either needs to be fully displayed or whether more, but truncated, columns are a better use of screen width. However, any solution need not be perfect. If it does a good enough job to cut down the manual adjustments needed, that is still of value.

Our best (really crude) formula so far is:

Width = (4.7 * Number of Characters in Column Name) + 127
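
As an illustration, a minimal sketch that applies this crude formula to a live dataset's columns (assumes jq is installed; the endpoint is the one cited above):

    # Fetch the view's column metadata and print a proposed width per column,
    # using width = 4.7 * (characters in column name) + 127, rounded down.
    curl -s "https://data.cityofchicago.org/api/views/ydr8-5enu" \
      | jq -r '.columns[] | "\(.name)\t\(4.7 * (.name | length) + 127 | floor)"'

Actually applying the widths would presumably mean writing the adjusted column metadata back through the same views API.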

We are happy to share more information with anyone who cares to take on this project.
