marquezproject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata

License: Apache License 2.0

Java 75.32% Dockerfile 0.05% Shell 1.31% JavaScript 0.64% HTML 1.01% TypeScript 17.70% CSS 0.05% Python 3.36% Mustache 0.19% PLpgSQL 0.37%

data-lineage data-discovery data-governance data-provenance metadata-service data-dictionary marquez metadata data-ecosystem-metadata data-ops

marquez's People

Contributors

julienledem alulc sshah-wework harels manisha-shetty olenavyshnevska violet-xiaoweihuang pbrahmbhatt3 davidli119 ashulmanwework codeboyyong hjpatel16 ravikamaraj alevine-wework ronthalanki fanrukun ajaymuppuri priyankalakhe jierfei1007 soumyasmruti nkijak shanyongzhou mars-lan databill86 animeshinvinci cybercypher w4nderlust trucnguyenlam syedatifakhtar elsonthomas sanjoy-bose mbrukman aimeepeng karimshn mysky528 akhil-ghatiki sreev priyankasrs jaweda dmariassy guidiniz doughamil keberox tharata mmusnjak bruce-sz-cn rheehot sathya-reddy-m henneberger stormphi younghai hyunjay phixme tjuljc nizardeen tianbuaa hhy5277 francoisjehl walter-hernandez colpal dwtcourses ecerulm anandhu-sk anandhusk zxf1864 northwesternmutual pratikmallya mobuchowski kulykdmytro tullytim eijidepaz bamitesh shakirzyanovarsen rossturk milkcoffeezhu szhorizon srravula1 martowu froberts71 ravwojdyla bose-sanjoy henrryvargas anshulpathak dystudio peetee kachontep twenty-zhang buom kriti-sc guptam yang040840219 edrinb hanbei yingyingqiqi kedar-cz oleksandrdvornik nicosuave liuhdev laopeng2021 gwasky

marquez's Issues

Clarify setup instructions for shadowJar / runShadow

Explore testcontainers + PG for database tests in future

Placeholder to continue conversation in #15 re: exploring testcontainers

DB schema: Convert type code columns into enums

Add docker compose

To simplify getting started / up and running with marquez, we'll want to add support for docker-compose

Note: depends on #44

Marquez 0.1.0

Key	Description
+	Public
-	Private

NamespaceService

Method	Throws	Description
+ `Namespace create(String name)`	NamespaceException	Creates a namespace.

JobService

Method	Description
+ `Job create(namespace, Job)`
- `JobVersion createVersion(namespace, Job)`
+ `Job[] getAll(namespace)`
+ `JobVersion[] getAllVersions(namespace, jobName)`
+ `JobVersion getVersionLatest(namespace, jobName)`

DatasetService

Method	Description
+ `JobRun createJobRunOutputs(namespace, jobName, runId, Dataset[])`
- `DatasetVersion createVersion(namespace, Dataset)`
+ `Dataset[] getAll(namespace)`

Define OpenAPI spec

Let's define an API specification using OpenAPI v3.0, a widely adopted format for describing HTTP APIs.

Fix HTTP 500 for GET on /owners

API versioning / evolution?

What are everyone's thoughts regarding how to version the API? Versioning (via URI or headers) or evolution? Or something else?

To date, we have encouraged quick iteration on features to allow for early API feedback. This ensured we understood how Marquez would collect metadata on running jobs as well as handle versioning of datasets (conversations that are still ongoing). Recently, our data model has seen many additions (namespaces!). As a result, we need to address our current app structure. Mainly, this means decoupling the DAO Layer from the Resource Layer and introducing a Service Layer that encapsulates all DAO interactions, so that each layer can evolve independently (#66).

Note: App restructuring is required before opening up the project for contributions.

App Design

The multi-layer app design will consist of:

Resource Layer
Service Layer
DAO Layer

See Organizing Marquez Code doc for more details.
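As a rough illustration of the three layers listed above, here is a minimal sketch in plain Java. All class and method names are hypothetical (they are not taken from the Marquez code base), the DAO is an in-memory set rather than a real database, and the resource returns a bare status code instead of a JAX-RS response.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical names throughout; this only illustrates the direction of the dependencies.
class LayeredAppSketch {

  // DAO layer: persistence only (an in-memory set stands in for the database).
  static final class NamespaceDao {
    private final Set<String> names = new LinkedHashSet<>();

    boolean exists(String name) {
      return names.contains(name);
    }

    void insert(String name) {
      names.add(name);
    }
  }

  // Service layer: business rules; the only caller of the DAO.
  static final class NamespaceService {
    private final NamespaceDao dao;

    NamespaceService(NamespaceDao dao) {
      this.dao = dao;
    }

    String create(String name) {
      if (name == null || name.trim().isEmpty()) {
        // Stand-in for a dedicated service exception type.
        throw new IllegalArgumentException("namespace name must not be blank");
      }
      if (!dao.exists(name)) {
        dao.insert(name);
      }
      return name;
    }
  }

  // Resource layer: maps HTTP-shaped input and output, delegating the rest.
  static final class NamespaceResource {
    private final NamespaceService service;

    NamespaceResource(NamespaceService service) {
      this.service = service;
    }

    // Returns a status code as a stand-in for a JAX-RS Response.
    int put(String name) {
      try {
        service.create(name);
        return 200;
      } catch (IllegalArgumentException e) {
        return 400;
      }
    }
  }

  public static void main(String[] args) {
    NamespaceResource resource =
        new NamespaceResource(new NamespaceService(new NamespaceDao()));
    System.out.println(resource.put("wedata"));
    System.out.println(resource.put(" "));
  }
}
```

The point of the arrangement is that the resource never touches the DAO directly, so persistence can change without affecting the HTTP surface.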

App Structure

We'll also have the following project structure:

marquez/
├── api
│   ├── exceptions
│   ├── health
│   ├── mappers
│   ├── models
│   └── validation
├── common
├── db
│   ├── mappers
│   └── models
└── service
    ├── exceptions
    ├── mappers
    └── models

For more details, see Marquez: App pkg structure

Add datasets.md

Defines a dataset

Reorganize Dropwizard service

The current organization of our Dropwizard service has no clear home for business logic, leaving us to push too much logic down into the DAOs or Resources (without any guiding principles on when or where). We should consider reorganizing the project. We should also use patterns that reinforce good hygiene, such as separate Representation objects for requests and responses, which can help avoid unexpected bugs.

The Dropwizard docs suggest separating the business logic into a separate package from the reference objects and the resources. The Dropwizard docs also make passing mention of request / response entities, although they do not emphasize the importance of this for good service design. To that end, we probably want to introduce:

Controllers for request / response handling
Service classes
Request / Response entities

Thoughts?
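To make the request / response split concrete, here is a small sketch under stated assumptions: the class names, the fields, and the "job name is required" rule are all invented for illustration, and a real Dropwizard service would add Jackson annotations and validation on top.

```java
import java.time.Instant;

// Hypothetical entities; not the actual Marquez representations.
class RepresentationSketch {

  // Request entity: only the fields a client is allowed to send.
  static final class CreateJobRequest {
    final String name;
    final String description;

    CreateJobRequest(String name, String description) {
      this.name = name;
      this.description = description;
    }
  }

  // Response entity: adds server-assigned fields the client must not set.
  static final class JobResponse {
    final String name;
    final String description;
    final Instant createdAt;

    JobResponse(String name, String description, Instant createdAt) {
      this.name = name;
      this.description = description;
      this.createdAt = createdAt;
    }
  }

  // Service class: business logic lives here, not in the resource or the DAO.
  static final class JobService {
    JobResponse create(CreateJobRequest request, Instant now) {
      if (request.name == null || request.name.isEmpty()) {
        throw new IllegalArgumentException("job name is required");
      }
      return new JobResponse(request.name, request.description, now);
    }
  }

  public static void main(String[] args) {
    JobResponse response =
        new JobService().create(new CreateJobRequest("my_first_job", "Best job ever!"), Instant.now());
    System.out.println(response.name + " created at " + response.createdAt);
  }
}
```

Because `createdAt` exists only on the response type, a client cannot set it by accident, which is the kind of bug the separate entities are meant to prevent.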

Migrate all API endpoint arguments to camelCase

This will make transitioning to code generation with Lombok seamless.

API Errors

Define all API error responses using RFC 7807 (Problem Details for HTTP APIs).
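For reference, a problem details body carries the members `type`, `title`, `status`, `detail`, and `instance`, and is served as `application/problem+json`. The sketch below builds such a body by hand to show the shape; the helper names and the example values are hypothetical, and a real service would more likely serialize a POJO with Jackson.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the member names and media type come from RFC 7807.
class ProblemDetailsSketch {
  static final String MEDIA_TYPE = "application/problem+json";

  static String toJson(String type, String title, int status, String detail, String instance) {
    Map<String, Object> members = new LinkedHashMap<>();
    members.put("type", type);
    members.put("title", title);
    members.put("status", status);
    members.put("detail", detail);
    members.put("instance", instance);

    StringBuilder json = new StringBuilder("{");
    for (Map.Entry<String, Object> member : members.entrySet()) {
      if (json.length() > 1) {
        json.append(',');
      }
      json.append(quote(member.getKey())).append(':');
      Object value = member.getValue();
      json.append(value instanceof Number ? value.toString() : quote(String.valueOf(value)));
    }
    return json.append('}').toString();
  }

  // Minimal escaping (backslash and double quote only); enough for this sketch.
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) {
    System.out.println(MEDIA_TYPE);
    System.out.println(toJson(
        "https://example.com/problems/namespace-not-found", // made-up problem type URI
        "Namespace not found",
        404,
        "Namespace 'wedata' does not exist.",
        "/namespaces/wedata"));
  }
}
```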

Investigate unchecked / unsafe operations in JobRunState.java

Seeing this during compilation:

...src/main/java/marquez/api/JobRunState.java uses unchecked or unsafe operations.
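Compiling with the `-Xlint:unchecked` javac flag reports the exact lines involved. The contents of JobRunState.java are not shown here, so the following is only a guess at a common cause (a raw collection type) with made-up method names; the first method compiles with the same note, the second does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example; not the actual JobRunState code.
class UncheckedSketch {

  // Raw List: javac cannot check the element type, so add() is an unchecked call.
  static List rawStates() {
    List states = new ArrayList();
    states.add("NEW");
    return states;
  }

  // Parameterized List: same behavior, fully type-checked.
  static List<String> typedStates() {
    List<String> states = new ArrayList<>();
    states.add("NEW");
    return states;
  }

  public static void main(String[] args) {
    System.out.println(rawStates());
    System.out.println(typedStates());
  }
}
```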

Vulnerability assessment tool scans

I recommend that we introduce vulnerability scans early in this project so that we can keep our security posture healthy. We can achieve this by performing security scans (with a tool) on PRs and rejecting those that introduce new vulnerabilities. At a minimum, we should be scanning our dependencies since it's rather easy to do and incurs little overhead. Snyk is a solid choice for this purpose.

In the future, we'll want to scan custom code as well (the code we actually wrote), but those types of scans are usually more cumbersome to perform and orchestrate. I don't think this needs to be tackled right away; however, it should be considered carefully.

Create migration with triggers to update all updated_at fields

Use UUID regex in URI validation

Instead of validating UUID separately, look into: https://www.mkyong.com/webservices/jax-rs/jax-rs-path-uri-matching-example/
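As a sketch of the idea, the pattern below matches only paths whose last segment has the canonical UUID shape. The class name and the `/jobs/runs/` prefix are assumptions for illustration. In JAX-RS, the same expression could in principle be placed in a path template of the form `{id: regex}`, so that non-UUID paths do not match the resource method at all.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a JAX-RS resource would embed UUID_REGEX in its @Path template instead.
class UuidPathSketch {
  // Canonical 8-4-4-4-12 hexadecimal form of a UUID.
  static final String UUID_REGEX =
      "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}";

  private static final Pattern RUN_PATH = Pattern.compile("/jobs/runs/(" + UUID_REGEX + ")");

  // True only when the whole path matches, mirroring how a path template must match in full.
  static boolean matches(String path) {
    return RUN_PATH.matcher(path).matches();
  }

  public static void main(String[] args) {
    System.out.println(matches("/jobs/runs/cfc4b5e6-c630-48d4-ad19-f2bd16c93a9d"));
    System.out.println(matches("/jobs/runs/not-a-uuid"));
  }
}
```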

Review and improve errors

Restore Job serialization test

TestJobSerialization.java contents were lost during a merge. Needs to be restored from git history.

Fix checkstyle warnings

Let's address all style warnings reported for code that doesn't follow the Google Java Style Guide.

$ ./gradlew checkstyleMain checkstyleTest
> Task :checkstyleMain
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:110: Abbreviation in name 'ownerDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:113: Abbreviation in name 'jobDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:116: Abbreviation in name 'jobRunDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:119: Abbreviation in name 'datasetDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:122: Abbreviation in name 'jobRunDefinitionDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/MarquezApp.java:123: Abbreviation in name 'jobVersionDAO' must contain no more than '2' consecutive capital letters. [AbbreviationAsWordInName]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/api/CreateNamespaceResponse.java:24: 'if' construct must use '{}'s. [NeedBraces]
[ant:checkstyle] [WARN] projectmarquez/marquez/src/main/java/marquez/api/CreateNamespaceResponse.java:25: 'if' construct must use '{}'s. [NeedBraces]
.
.
.

Issues

Fix lines longer than 100 chars #283
Fix abbreviation in name #284
Disable Javadoc check #285

List job runs

Let's support listing job runs. We'll want to update the OpenAPI spec accordingly (#100, #104).

GET /namespaces/:namespace/jobs

{
  "name": "my_first_job",
  "description": "Best job ever!",
  ...
  "runs": [
    "/jobs/runs/cfc4b5e6-c630-48d4-ad19-f2bd16c93a9d",
    "/jobs/runs/d33ef190-73bd-4a65-ab59-1bbd65364d0b",
    "/jobs/runs/5ced1097-8d59-46d8-933e-c9a688be8b8c",
    ...
  ]
}

We may want to list job runs when fetching a job by name. The endpoint above is not yet finalized and is something we can iterate on (just wanted to capture the functionality).
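One way the `runs` array in the example could be produced is by mapping run IDs to relative links. This is only a sketch: the method name is hypothetical, and the `/jobs/runs/` prefix is copied from the example response, which is itself not final.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

// Hypothetical helper for building the "runs" links shown in the example response.
class JobRunLinksSketch {

  static List<String> runLinks(List<UUID> runIds) {
    List<String> links = new ArrayList<>();
    for (UUID runId : runIds) {
      links.add("/jobs/runs/" + runId);
    }
    return links;
  }

  public static void main(String[] args) {
    List<UUID> runIds = Arrays.asList(
        UUID.fromString("cfc4b5e6-c630-48d4-ad19-f2bd16c93a9d"),
        UUID.fromString("d33ef190-73bd-4a65-ab59-1bbd65364d0b"));
    System.out.println(runLinks(runIds));
  }
}
```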

Publish docs for client (javadocs, and python api docs) to github pages

Will be nice to eventually have our docs published here (automatically?) https://readthedocs.org/

Rename Job Run Definition to Job Definition

This will require changes across the application, so should be in a separate PR.

Add jobs.md

Defines a job in docs/

Design document is not publicly accessible

I arrived here after viewing this (excellent) presentation. I'm very keen to understand Marquez in more detail as it appears to align with many of my metadata goals. It'd be great to have some visibility on the design/roadmap of the project. I believe that the Google document linked to in the README might contain useful information, but I do not currently have permission to view it. On following the link I see: You need permission.

Add namespaces.md

Defines a namespace

Add roadmap.md

We should define a clear timeline of milestones for Marquez and break them down into phases with expected release dates.

Add code of conduct

We feel it's important to have a welcoming community, let's follow https://www.contributor-covenant.org

Explore using DatabasePreparer when creating a database for integration tests

Do we need our own PreparedDbRule? We're dynamically generating the config.yml, which I'm not a fan of (and something I think we can hopefully avoid). Looking at the src for PreparedDbRule, on instantiation, it's creating the DB connection based on the details contained in DatabasePreparer. Worth exploring that as an option.

Rename all guid columns to uuid

Deleting an owner should implicitly end their ownerships

Currently, deleting an Owner only soft deletes the owner record, but will leave their ownerships unchanged. The Owner deletion should automatically end all of their ownerships.
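The intended behavior might look roughly like the sketch below. The class, field, and method names are hypothetical, and the in-memory collections stand in for the database tables; in a real implementation the soft delete and the ownership updates would presumably run in a single transaction.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model; not the actual Marquez schema or service.
class OwnerDeletionSketch {

  static final class Ownership {
    final String owner;
    final String job;
    Instant endedAt; // null while the ownership is active

    Ownership(String owner, String job) {
      this.owner = owner;
      this.job = job;
    }
  }

  static final class OwnerService {
    private final Map<String, Instant> deletedAt = new HashMap<>(); // soft-delete markers
    private final List<Ownership> ownerships = new ArrayList<>();

    void assign(String owner, String job) {
      ownerships.add(new Ownership(owner, job));
    }

    // Soft deletes the owner and ends every ownership that is still active.
    void delete(String owner, Instant now) {
      deletedAt.put(owner, now);
      for (Ownership ownership : ownerships) {
        if (ownership.owner.equals(owner) && ownership.endedAt == null) {
          ownership.endedAt = now;
        }
      }
    }

    boolean isDeleted(String owner) {
      return deletedAt.containsKey(owner);
    }

    int activeOwnerships(String owner) {
      int count = 0;
      for (Ownership ownership : ownerships) {
        if (ownership.owner.equals(owner) && ownership.endedAt == null) {
          count++;
        }
      }
      return count;
    }
  }

  public static void main(String[] args) {
    OwnerService service = new OwnerService();
    service.assign("alice", "job_a");
    service.assign("alice", "job_b");
    service.delete("alice", Instant.now());
    System.out.println(service.activeOwnerships("alice"));
  }
}
```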

Add Namespace Integration Tests with data store/retrieval when Namespace Service is implemented

Add CHANGELOG.md

Let's keep a changelog!

Use JDBI embedded postgres?

See: https://github.com/jdbi/jdbi/blob/a831d3314db43859c9894aa987d3ee4827edc459/testing/src/test/java/org/jdbi/v3/testing/JdbiRuleTest.java

Add CONTRIBUTING.md

Let's write up some clear guidelines for how to contribute to the project.

Verify and document default Postgres timezone behavior

Create a standard for dealing with timezones in the data, specifically around whether it's required/recommended to include them in time data.

This includes how the DB schema will address timestamp info, how the Jackson mapper treats timestamps that don't include TZ info, and how Marquez treats data in the Timestamp object.

By default, Postgres's plain TIMESTAMP type is TIMESTAMP WITHOUT TIME ZONE, so no TZ info is recorded. This can cause problems down the line, so I think for now it's advisable to declare columns as TIMESTAMP WITH TIME ZONE instead of just TIMESTAMP.

Reference:
https://www.postgresql.org/docs/9.6/static/datatype-datetime.html
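The ambiguity is easy to reproduce on the Java side. In the sketch below (class and method names are hypothetical), the same zone-less wall-clock value resolves to two different instants depending on which offset the reader assumes, whereas a value carrying an explicit offset is unambiguous.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration of zone-less versus zone-aware timestamps.
class TimestampSketch {

  // A LocalDateTime has no zone, so the caller must supply one to get a point in time.
  static Instant interpret(LocalDateTime local, ZoneId zone) {
    return local.atZone(zone).toInstant();
  }

  public static void main(String[] args) {
    LocalDateTime naive = LocalDateTime.of(2018, 7, 1, 12, 0);

    Instant asUtc = interpret(naive, ZoneOffset.UTC);
    Instant asMinusSeven = interpret(naive, ZoneOffset.ofHours(-7));

    // The two readings of the same wall-clock value are seven hours apart.
    System.out.println(Duration.between(asUtc, asMinusSeven));

    // With an explicit offset there is nothing left to guess.
    Instant explicit = OffsetDateTime.parse("2018-07-01T12:00:00-07:00").toInstant();
    System.out.println(explicit.equals(asMinusSeven));
  }
}
```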

Branch protection on master?

FYI I just enabled branch protection on master to prevent merging PRs which have not been reviewed by Marquez owners. If there's any reason we don't want this, let's discuss!

Add job version endpoint to the API

Revisit CI flow

In our current flow, we run the 'test' task both as part of the build step, and then also as part of the dedicated test step. We have two choices:

Skip the testing step in the build invocation by specifying -x test.
Remove the individual test step altogether.

numeric IDs or UUIDs?

Does anyone have a strong opinion about numeric IDs vs. using UUIDs from the outset? Would the numeric IDs lock us into a single database master and affect how Marquez can evolve to support a distributed datastore in the future?
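To frame the trade-off, here is a small sketch with hypothetical names: numeric IDs come from one shared counter (in Postgres, typically a sequence owned by a single primary), while random version 4 UUIDs can be generated by any node without coordination.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical comparison of the two ID strategies.
class IdStrategySketch {
  // Stand-in for a database sequence: every writer must go through this one counter.
  private static final AtomicLong SEQUENCE = new AtomicLong();

  static long nextNumericId() {
    return SEQUENCE.incrementAndGet();
  }

  // Random UUIDs need no shared state, so any node can mint one independently.
  static UUID nextUuid() {
    return UUID.randomUUID();
  }

  public static void main(String[] args) {
    System.out.println(nextNumericId());
    System.out.println(nextUuid());
  }
}
```

The cost of UUIDs is larger keys (128 bits) and IDs that do not sort by creation order, which may matter for indexing.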

Add gradle task to apply java formatting

Now that we're catching code formatting issues during the build, it will be nice to have a gradle task which can auto apply the java formatting. Coming from Go, I loved having go fmt do this for me, so having something similar for this project will be nice.

Standardize test file names

Some test files are named Test*.java and others are *Test.java. Will be nice to standardize on one. Seems like *Test.java is more conventional?

Feature based package structure vs layer based

For better cohesiveness and manageability, I think we should move to a feature based package structure over the current layer based. I've created a branch as an example.

https://github.com/lulciuca/marquez/tree/feature-based-package-structure

Here's a nice article on the subject: http://www.javapractices.com/topic/TopicAction.do?Id=205