chicago / open-data-etl-utility-kit

Use Pentaho's open source data integration tool (Kettle) to create Extract-Transform-Load (ETL) processes to update a Socrata open data portal. Documentation is available at http://open-data-etl-utility-kit.readthedocs.io/en/stable

License: Other

Languages: Shell 67.85%, Batchfile 32.15%
Topics: open-data, socrata, government, chicago, pentaho, kettle, etl

open-data-etl-utility-kit's Introduction

ETL Utilities for an Open Data Program

This toolkit provides several utilities and a framework to help governments deploy automated ETLs using the open-source Pentaho Data Integration (Kettle) software.

Namely, this toolkit:

  • Loads data from a database and uploads it to a Socrata data portal
  • Integrates with an SMTP server to send e-mail alerts on the outcome of ETL scripts to administrators
  • Handles deployment issues when using multiple operating systems during development
  • Provides utilities that let administrators quickly analyze ETL log files for diagnostics

The ETL framework is organized so that each function lives in a single file used by all ETLs. This makes maintenance, upgrades, and modification easier across hundreds of ETLs.

Features

  • Open source at the core - this framework can be deployed using Kettle, an open-source ETL tool. With an annual support subscription, Pentaho also provides telephone support and training if desired.
  • Compatible with multiple data sources - this ETL framework can be used with a variety of data sources, including a range of databases (MySQL, PostgreSQL, Oracle, SQL Server, and a variety of NoSQL stores), APIs, text files, etc.
  • Compatible workflow for multiple operating systems - ETLs can be developed and deployed across multiple operating systems; for instance, developed on Windows and deployed on Linux.
  • Helpful utilities - includes several scripts to help users quickly analyze log files.

Requirements

The recommended configuration requires the following software:

  • Kettle (Pentaho Data Integration) - Note: This framework has only been tested with Kettle 4.4.0 and lower.
  • Java 1.6 or higher
  • DataSync (for use with Socrata) - Note: This framework is designed for the version of DataSync in the DataSync directory and will not necessarily work with earlier or later versions.
  • Mac OS X, Linux, or Unix (only required for full automation with the included scripts)
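
A quick way to sanity-check these prerequisites from a shell (standard commands; the DataSync path follows this repository's layout):

    # Java prints its version to stderr; confirm it reports 1.6 or higher.
    java -version

    # Confirm the bundled DataSync jar is present where the framework expects it.
    ls DataSync/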

Kettle Compatibility

This framework has only been tested with Kettle 4.3.0 and Kettle 4.4.0. It may be fully compatible with Kettle 5.x, but this has not been tested. If you would like to contribute, please see the issues page.

Errors / Bugs

Experiencing issues with the included files? Report them on our issue tracker.

open-data-etl-utility-kit's People

Contributors

fgregg, jefw, levyj, palewire, shua123, timwis, tomschenkjr


open-data-etl-utility-kit's Issues

Add unit testing for ETL framework

Admittedly, it's hard to define a solid unit-testing framework, but we should have some tests for regression testing. This should be done in Windows (AppVeyor) and Linux (Travis CI) builds.

  • Compare outputs of ETL utilities (e.g., Log/A_DatasetLogs, Log/A_ETLRuntimes) to ensure they produce the "correct" answers
  • Ensure four-by-four checks work with ETL utilities
  • ...?

Using 'result' as field name can result in email notification error

In the image below, the "Get System Data" step creates a field called "result", which contains the result of the previous job entry in the job (KJB file) that called the ETL_Completion_E-Mail.ktr transformation containing these steps. The "Set Subject and Body" step makes use of this result value.

If there is a field called "result" in the upstream transformation (KTR file) for the dataset update itself, this appears to create a conflict and causes ETL_Completion_E-Mail.ktr to fail.

[Image: screenshot of the "Get System Data" and "Set Subject and Body" steps in ETL_Completion_E-Mail.ktr]

ignoreServerLatLong

Need to eliminate ignoreServerLatLong from the control-file template. The option is no longer compatible with DataSync (at least as of version 1.5.4).
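
A minimal sketch of the cleanup (the template path is hypothetical; adjust it to wherever the control-file template actually lives in this repo):

    # Delete any line mentioning the deprecated option; keep a .bak copy.
    # Note: if the entry left a dangling comma behind, the resulting JSON
    # may need a manual touch-up afterward.
    sed -i.bak '/ignoreServerLatLong/d' ETL/Utilities/control_template.json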

Info request

Hi all,
I cloned your repo for instructional purposes and to run some tests. I don't understand where the environment variables are set. For example, for the variable P_Sender_Address, I can't find a config file that sets it and the others, and it isn't explained in the documentation. Perhaps I'm missing something!
Thanks in advance for your help.
Franco
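
For what it's worth, Kettle conventionally reads user-defined variables from kettle.properties in the .kettle directory (~/.kettle by default, or $KETTLE_HOME/.kettle). Whether this kit expects P_Sender_Address there is an assumption, but a minimal sketch would be:

    # Append the variable to the user's Kettle properties file; Kettle loads
    # this file at startup and exposes each entry as ${P_Sender_Address}, etc.
    # The address below is a placeholder.
    cat >> ~/.kettle/kettle.properties <<'EOF'
    P_Sender_Address=etl-alerts@example.org
    EOF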

Warning produced running job

Right now, a successful completion of a Kettle ETL for the portal will produce the following warning:

WARNING: -h,--fileToPublishHasHeaderRow is being ignored because -sc,--pathToFTPControlFile or -cf,--pathToControlFile was supplied.

Publishing ETLs will succeed, but will only recognize the control-file settings.

We just need to fix the ETL/Utilities/DataSyncHTTP.ktr step, which assembles the DataSync command.
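
For reference, a hedged sketch of the command DataSyncHTTP.ktr should assemble. The flag names come from the warning above and from DataSync 1.x documentation as best I recall it; the jar path, file paths, and dataset ID are placeholders. The point is that when -cf is supplied, -h should be omitted entirely:

    # -c: global DataSync config (domain, credentials)
    # -f: file to publish; -i: target dataset's four-by-four ID (placeholder)
    # -cf: control file, which now governs the header-row setting, so the
    #      -h/--fileToPublishHasHeaderRow flag is dropped.
    java -jar DataSync/datasync.jar \
      -c DataSync/config.json \
      -f /path/to/output.csv \
      -i abcd-1234 \
      -cf /path/to/control.json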

Update metadata when updating data

With some of our currently manual updates, we have been updating the dataset description or title with date information. I have been thinking about ways this could be integrated into the ETL workflow.

One possibility: DataSync has a metadata job option. This could be integrated into the workflow to run before or after the DataSync replace job. The downside is that DataSync's metadata job needs to make a working copy, which is unnecessary overhead.

Create documentation

We will use ReadTheDocs to publish documentation for the workflow.

Considering the following outline, please let me know your thoughts.

  • Introduction (index.rst)
  • Installation and configuration
  • Creating and configuring an ETL
  • Configuring for automation
  • Utilities for administering ETLs

I've opened a new branch (iss1) for work on the documentation.

Add CLA

Need to add CLA for any contributors.

GIS Importer version

Create an alternative/optional workflow that uses Socrata's gis_importer jar instead of DataSync in order to update Mondara datasets with zipped shapefiles.

Support automation for Windows

Right now, a substantial portion of the workflow automation is based on Bash scripts (.sh).

The following files need to be transitioned to a Windows-supportable format (i.e., batch scripts); see the sketch below:
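
As a starting point, a minimal batch sketch of a Windows counterpart (paths are hypothetical; Kitchen.bat ships with Kettle and mirrors kitchen.sh):

    :: Invoke a Kettle job the way the Bash scripts invoke kitchen.sh.
    :: Adjust the install and job paths to your environment.
    call C:\open-data-etl-utility-kit\data-integration\Kitchen.bat ^
      /file:C:\open-data-etl-utility-kit\ETL\MyDataset.kjb /level:Basic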

.gitignore additions

Since the documentation recommends putting Kettle folders within the kit's folder structure, I would suggest adding data-integration*/ to .gitignore.

Maybe also add DataSync/config.json so that password and connection settings are not accidentally committed to a public repo; I'm not sure. There are also other settings that end up saved in generic files, like the e-mail settings in open-data-etl-kit/ETL/Utilities/ETL_Completion_E-Mail.ktr. It might be helpful to pull all such settings out and put them into kettle.properties or some other environment-variable-type file. A sketch of the .gitignore additions follows below:
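
The entries here are this issue's proposals, not current repo contents:

    # Append the proposed entries to .gitignore so local Kettle installs and
    # credential-bearing config stay out of version control.
    cat >> .gitignore <<'EOF'
    data-integration*/
    DataSync/config.json
    EOF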

In general, I'm just trying to make it clearer which files and folders can safely be in a public repo versus which ones probably shouldn't be (config settings, dataset-specific files, etc.).

Include DataSync in repo?

We may want to include a working JAR in the DataSync/ directory from the DataSync release pages. Right now, a placeholder puts the onus of installing DataSync on the user.

The objective is to include the highest version of DataSync that is compatible with the commands that assemble an ETL job for the portal. We're starting to experience issues with newer versions of DataSync that are not compatible with the flags or control files included in this repo.

This would not mean the repo is incompatible with higher versions, but it would provide some mechanism for verifying compatibility.

kettle_etl_workflow.png has some incorrect references

I made a few errors on this one:

  • The use of ~/ is inconsistent with the rest of the documentation. It needs to be changed to open-data-etl-utility-kit.
  • The log analysis tools (e.g., A_DatasetLogs.sh) are shown under open-data-etl-utility-kit/ETL/Utilities instead of the ../Tools directory.
  • The connection between the diagram and the rest of the text is unclear to readers. The documentation needs to be made clearer for each step.

Automated way to set good column widths for new datasets

New datasets often have column widths that are too narrow or too wide. It is possible to get and change the widths using the views API (e.g., https://data.cityofchicago.org/api/views/ydr8-5enu). This raises the possibility of creating a tool to set widths for a new dataset.

Setting the widths probably is not that hard. The more challenging part of this idea would be figuring out what the widths should be. This is a function of both the column name and the longest value, as well as some subjective preferences about whether either needs to be fully displayed or whether more, but truncated, columns are a better use of screen width. However, any solution need not be perfect. If it does a good enough job to cut down the manual adjustments needed, that is still of value.

Our best (really crude) formula so far is:

Width = (4.7 * Number of Characters in Column Name) + 127
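
As an illustration, a minimal sketch that applies this crude formula to a live dataset's columns (assumes jq is installed; the endpoint is the one cited above):

    # Fetch the view's column metadata and print a proposed width per column,
    # using width = 4.7 * (characters in column name) + 127, rounded down.
    curl -s "https://data.cityofchicago.org/api/views/ydr8-5enu" \
      | jq -r '.columns[] | "\(.name)\t\(4.7 * (.name | length) + 127 | floor)"'

Actually applying the widths would presumably mean writing the adjusted column metadata back through the same views API.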

We are happy to share more information with anyone who cares to take on this project.
