
nifi-atlas's Introduction


nifi-atlas

A bridge to Apache Atlas for provenance metadata created from data transformations completed by Apache NiFi.

Getting Started

  1. Populate your local Maven repository with all the dependent JAR files for NiFi ver. 1.5.0-SNAPSHOT:

git clone https://github.com/apache/nifi.git

cd nifi

mvn install -DskipTests=true

Be patient; it will take about 15 minutes to run.

  2. Build the new nifi-atlas bundle:

cd nifi-atlas/nifi-atlas-bundle

mvn install

  3. Build the dual-site nifi-cluster. Copy over the newly built .nar file from nifi-atlas-bundle:

cp ./nifi-atlas-bundle/nifi-atlas-nar/target/nifi-atlas-nar-1.5.0-SNAPSHOT.nar ./nifi-cluster-docker/nifi-node/.

The remainder of the setup can be followed in the nifi-cluster-docker README.

Prerequisites

• Java 8
• Apache Atlas 0.8+
• Apache NiFi 1.5+
• Apache Kafka 0.10+

How to Contribute

If you would like to contribute to this project, please see our CONTRIBUTING guidelines.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

License

This project is licensed under the Apache License, Version 2.0; see the LICENSE.md file for details.

nifi-atlas's People

Contributors

gjlawran, repo-mountie[bot]


nifi-atlas's Issues

Add project lifecycle badge

No Project Lifecycle Badge found in your readme!

Hello! I scanned your readme and could not find a project lifecycle badge. A project lifecycle badge gives contributors to your project, as well as other stakeholders (platform services, executives), insight into the lifecycle of your repository.

What is a Project Lifecycle Badge?

It is a simple image that neatly describes your project's stage in its lifecycle. More information can be found in the project lifecycle badges documentation.

What do I need to do?

I suggest you make a PR into your README.md and add a project lifecycle badge near the top, where it is easy for your users to pick it up :). Once it is merged, feel free to close this issue. I will not open up a new one :)

It's Been a While Since This Repository has Been Updated

This issue is a kind reminder that your repository has been inactive for 181 days. Some repositories are maintained in accordance with business requirements that infrequently change, and thus appear inactive; others are inactive because they are unmaintained.

To help differentiate products that are unmaintained from products that do not require frequent maintenance, repomountie will open an issue whenever a repository has not been updated in 180 days.

  • If this product is being actively maintained, please close this issue.
  • If this repository isn't being actively maintained anymore, please archive this repository. Also, for bonus points, please add a dormant or retired life cycle badge.

Thank you for your help ensuring effective governance of our open-source ecosystem!

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means it's critical that we work to make our content as discoverable as possible. Through discoverability, we promote code reuse across a large, decentralized organization like the Government of British Columbia, and we allow ministries to find the repos they own.

What to do

Below is a table of abbreviations, a.k.a. short codes, for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short code of the ministry or organization that "owns" this repo as a topic.

add a topic

That's it, you're done!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding keywords specific to a subject you're interested in. To learn more about searching through repos, check out GitHub's doc on searching.
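
For scripted use, the same query can be issued against GitHub's repository-search REST API. This is a hedged sketch, not part of the repo: `citz` is just the Citizens' Services example from above, and the `curl` step is left commented out because it needs network access.

```shell
# Build the same "org:bcgov topic:citz" query for GitHub's search API.
ORG="bcgov"
TOPIC="citz"
QUERY="org:${ORG}+topic:${TOPIC}"
echo "https://api.github.com/search/repositories?q=${QUERY}"
# To actually run the search (network required):
#   curl -s "https://api.github.com/search/repositories?q=${QUERY}"
```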

Pro Tip 🤓

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used (javascript or R); something like opendata or data for data-only repos; or any other keywords that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

| Short Code | Organization Name |
| --- | --- |
| AEST | Advanced Education, Skills & Training |
| AGRI | Agriculture |
| ALC | Agriculture Land Commission |
| AG | Attorney General |
| MCF | Children & Family Development |
| CITZ | Citizens' Services |
| DBC | Destination BC |
| EMBC | Emergency Management BC |
| EAO | Environmental Assessment Office |
| EDUC | Education |
| EMPR | Energy, Mines & Petroleum Resources |
| ENV | Environment & Climate Change Strategy |
| FIN | Finance |
| FLNR | Forests, Lands, Natural Resource Operations & Rural Development |
| HLTH | Health |
| FLNR | Indigenous Relations & Reconciliation |
| JEDC | Jobs, Economic Development & Competitiveness |
| LBR | Labour Policy & Legislation |
| LDB | BC Liquor Distribution Branch |
| MMHA | Mental Health & Addictions |
| MAH | Municipal Affairs & Housing |
| BCPC | Pension Corporation |
| PSA | Public Safety & Solicitor General & Emergency B.C. |
| SDPR | Social Development & Poverty Reduction |
| TCA | Tourism, Arts & Culture |
| TRAN | Transportation & Infrastructure |

NOTE See an error or omission? Please create an issue here to get it remedied.

Nifi Atlas Bridge Development Opportunity #1

Value: $10,000.00 | Closes: 23:00 PST, Tuesday, September 19, 2017 | Location: Victoria | In-person work NOT required

Opportunity Description

We are looking for help building an Apache NiFi-Atlas bridge. We want it to process NiFi provenance data and log it in Apache Atlas as lineage metadata, to support our planned use of the HortonWorks Data Governance framework.

More specifically, we want to be able to record in Atlas the "logical" lineage of a file from input all the way to being saved to HDFS or a database even though it was manipulated several times. Also, we would like to be able to process the provenance data as the processes proceed and not have to wait until the flow is finished. This would allow Atlas to show the current status of a file before the whole flow is finished.

We are aware of the code at https://github.com/vakshorton/NifiAtlasBridge and https://github.com/vakshorton/NifiAtlasLineageReporter; however, it does not seem to do exactly what we want. The main problem with the above code is that it generates Atlas metadata as input and output from too many flow ingress and egress events. Many processors change or clone data, which means the FlowFile's id changes, which in turn breaks the metadata lineage in Atlas.

A typical simplified use case would be to get a ZIP file from the file system; move it to a different directory based on filename and date; perform various actions on it (unzip as CSV, split, join, update attributes, run custom processors, manually edit it, save to a different directory, infer an Avro schema, convert CSV to Avro, save the file, convert the file to ORC, put it in HDFS, generate Apache Hive DDL, create a Hive table). The file names and directory names will not be hard-coded in the flow but will instead be parameter driven. In this scenario, we would like to be able to trace that Hive table's lineage all the way back to the ZIP file input.
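
The id-change problem described above can be illustrated with a small, self-contained Java sketch. This is a toy model under stated assumptions: the `Event` class and its fields are simplified stand-ins invented for this example, not NiFi's real `ProvenanceEventRecord` API. The idea is that mapping every derived FlowFile id back to its original ancestor keeps one logical lineage chain even when processors clone or fork the data and mint new ids.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model of the lineage-stitching problem: CLONE/FORK-style events give a
 * FlowFile a new id, so naive reporting breaks the chain in Atlas. Mapping
 * every derived id back to its original ancestor keeps one logical lineage.
 */
public class LineageStitcher {

    /** Simplified stand-in for a provenance event (NOT the NiFi API). */
    static final class Event {
        final String type;       // e.g. RECEIVE, FORK, CLONE
        final String flowFileId; // id of the FlowFile this event produced
        final String parentId;   // id it was derived from, or null for originals

        Event(String type, String flowFileId, String parentId) {
            this.type = type;
            this.flowFileId = flowFileId;
            this.parentId = parentId;
        }
    }

    /** Map each FlowFile id to the root id it logically descends from. */
    static Map<String, String> rootIds(List<Event> events) {
        Map<String, String> root = new HashMap<>();
        for (Event e : events) {
            if (e.parentId == null) {
                root.putIfAbsent(e.flowFileId, e.flowFileId); // an original input
            } else {
                // A new id was minted (clone/fork/conversion): point it at the
                // parent's root so the logical chain stays unbroken.
                root.put(e.flowFileId, root.getOrDefault(e.parentId, e.parentId));
            }
        }
        return root;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("RECEIVE", "ff-1", null),    // ZIP file enters the flow
            new Event("FORK",    "ff-2", "ff-1"),  // unzip mints a new id
            new Event("CLONE",   "ff-3", "ff-2")); // CSV -> Avro mints another
        // The final file (ff-3) traces back to the original ZIP (ff-1):
        System.out.println(rootIds(events).get("ff-3")); // prints "ff-1"
    }
}
```

A real implementation would do this bookkeeping inside a NiFi reporting component over actual provenance events before emitting entities to Atlas, but the core id-stitching logic is the same.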

Acceptance Criteria

To be paid the fixed price for this opportunity, you need to meet all of the following criteria:

  • A merge request from your GitHub account to this repo (https://github.com/bcgov/nifi-atlas) that works with the following software versions:
    • Java 8
    • Apache Atlas 0.8 
    • Apache NiFi 1.3+
  • The solution must work in a clustered environment with provenance data coming from several nodes. We would like it to work in a site-to-site environment (cluster A does some processing, then hands it over to cluster B to do further processing - we would like to be able to have the lineage follow all the way through).
  • It is acceptable for the code to work with a local file system, even though we will implement it in the cloud using e.g. Azure, AWS etc.
  • The code must allow us to add more processors and custom processors quite easily.
  • The code needs to include a build script (Maven or Gradle) to compile the Java code and produce a NAR file.
  • The code needs to include some basic unit testing, as well as an example of use in a Dataflow template to demonstrate end-to-end functionality.
  • This code does not need to be production ready, but must be a good starting point for us to use.
  • The code needs to be commented where possible, especially in the areas of provenance processing.

How to Apply

Go to the Opportunity Page, click the Apply button, and submit your proposal by 16:00 PST on Tuesday, September 19, 2017.

We plan to assign this opportunity by Tuesday, September 26, 2017, with work to start the same day.

If your proposal is accepted and you are assigned to the opportunity, you will be notified by email and asked to confirm your agreement to the Code With Us terms and contract.

Proposal Evaluation Criteria

We will score proposals by the following criteria:

  • Expressed knowledge of the NiFi provenance system, employing best practices for processing provenance data efficiently, as expressed in pseudocode / structures (30 points),
  • Experience contributing Java code to any public code repositories with more than 5 contributors (10 points),
  • Experience contributing Java code to either of the following projects: https://github.com/apache/nifi, https://github.com/apache/incubator-atlas (10 points),
  • Ability to deliver a complete solution on or before October 13, 2017 (10 points).

Nifi Atlas Bridge Development Opportunity

Background:

Is Java one of your languages of choice? Are you familiar with Apache NiFi (a data transformation and routing tool) and/or Apache Atlas? If so, check out this opportunity to work with the Ministry of Jobs, Training and Technology.
Tags: Java, Apache NiFi, Apache Atlas, Metadata, HortonWorks
Amount: $10,000.00 CAD

Description:

We are looking for help building an Apache NiFi-Atlas bridge. We want it to process NiFi provenance data and log it in Apache Atlas as lineage metadata, to support our planned use of the HortonWorks Data Governance framework.

More specifically, we want to be able to record in Atlas the "logical" lineage of a file from input all the way to being saved to HDFS or a database even though it was manipulated several times. Also, we would like to be able to process the provenance data as the processes proceed and not have to wait until the flow is finished. This would allow Atlas to show the current status of a file before the whole flow is finished.

We are aware of the code at https://github.com/vakshorton/NifiAtlasBridge and https://github.com/vakshorton/NifiAtlasLineageReporter; however, it does not seem to do exactly what we want. The main problem with the above code is that it generates Atlas metadata as input and output from too many flow ingress and egress events. Many processors change or clone data, which means the FlowFile's id changes, which in turn breaks the metadata lineage in Atlas.

A typical simplified use case would be to get a ZIP file from the file system; move it to a different directory based on filename and date; perform various actions on it (unzip as CSV, split, join, update attributes, run custom processors, manually edit it, save to a different directory, infer an Avro schema, convert CSV to Avro, save the file, convert the file to ORC, put it in HDFS, generate Apache Hive DDL, create a Hive table). The file names and directory names will not be hard-coded in the flow but will instead be parameter driven. In this scenario, we would like to be able to trace that Hive table's lineage all the way back to the ZIP file input.

Acceptance criteria:

• A merge request from your GitHub account to this repo (https://github.com/bcgov/nifi-atlas) that works with the following software versions:
  • Java 8
  • Apache Atlas 0.8
  • Apache NiFi 1.3+

• The solution must work in a clustered environment with provenance data coming from several nodes. We would like it to work in a site-to-site environment (cluster A does some processing, then hands it over to cluster B to do further processing - we would like to be able to have the lineage follow all the way through).

• It is acceptable for the code to work with a local file system, even though we will implement it in the cloud using e.g. Azure, AWS etc.

• The code must allow us to add more processors and custom processors quite easily.

• The code needs to include a build script (Maven or Gradle) to compile the Java code and produce a NAR file.

• The code needs to include some basic unit testing, as well as an example of use in a Dataflow template to demonstrate end-to-end functionality.

• This code does not need to be production ready, but must be a good starting point for us to use.

• The code needs to be commented where possible, especially in the areas of provenance processing.

How to apply:

To apply, please visit this opportunity on the BCDevExchange. Click the apply button and submit your proposal by 4:00 PM Pacific Standard Time (PST) on September 19, 2017.

With your proposal, you must attach a copy of the Code-with-Us Terms, with the required information asked for in the "Acceptance" section of the Terms inserted into the document.

If we are satisfied with the proposals we receive, we will assign this opportunity by September 26, 2017 with work proposed to start immediately.

Proposals will be evaluated based on the following criteria:

➢ Expressed knowledge of the NiFi provenance system, employing best practices for processing provenance data efficiently, as expressed in pseudocode / structures (30 points),
➢ Experience contributing Java code to any public code repositories with more than 5 contributors (10 points),
➢ Experience contributing Java code to either of the following projects: https://github.com/apache/nifi, https://github.com/apache/incubator-atlas (10 points),
➢ Ability to deliver a complete solution on or before November 7, 2017 (10 points).
