
ground's Introduction

Ground


Ground is an open-source data context service under development in UC Berkeley's RISE Lab. Ground serves as a central model, API, and repository for capturing the broad context in which data is used. Our goal is to address practical problems for the Big Data community in the short term and to open up opportunities for long-term research and innovation.

For the vision behind Ground, please see our CIDR '17 publication.

Getting Started

You can download the latest version of Ground from our releases page. The most recent version of Ground is v0.1.2.

Once you have downloaded the latest version of Ground, you will need a Ground database and the corresponding tables before Ground can run. To create the tables, run python postgres_setup.py <user> <dbname> from the db/ directory of the release. To drop the tables, run python postgres_setup.py <user> <dbname> drop. Alternatively, you can use db/postgres.sql to set up the tables directly.
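
For example, assuming a Postgres role and database both named ground (substitute your own values for <user> and <dbname>):

    # run from the db/ directory of the release
    python postgres_setup.py ground ground        # create the Ground tables
    python postgres_setup.py ground ground drop   # drop them again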

You can start the Ground server by running ./bin/ground-postgres. This starts a local Ground server running on port 9000.

For more information, see Hit the Ground Running, our getting started guide.

License

Ground is licensed under the Apache v2 License.

Contributing

Please see the guidelines in CONTRIBUTING.md.


ground's Issues

Pom.xml removed; build instructions deprecated

The pom.xml file was removed in commit 5959981, breaking the Maven build instructions in the README and on ground-context.org.

I see sbt artifacts in the project. If the switch to sbt was intentional, the documentation needs to be updated; otherwise, pom.xml should be restored.

routes File selection error

I ran this project and got this result:

"Action not found
For request 'GET /'
These routes have been tried, in this order:"

What is the problem here?

Questions on the GroundClient packages

According to the Ground wiki, we can use the client library to generate the data context.

GroundClient client = new GroundClient(host, port);

int managerId = client.createNode("manager");
int engineerId = client.createNode("engineer");

Is there a GroundClient package? Also, when I try to use the POST API to generate data, it always replies with a bad request. Where should I get started?

Thanks.

Code cleanup

General cleanup, deduplication of code, addressing of TODOs, etc. Also, write proper documentation for everything.

Ground deployment Vagrant instance or Dockerfile

In order to remove some of the friction in setting up and playing with a Ground Alpha, we should have a Vagrant instance or Dockerfile(s) that set up HDFS, Kafka, Gobblin, and Ground and link them together. Users won't have to install any software to get started with Ground.

Caching bug?

The Hotels.com team identified a bug, but details are needed.

Gobblin Metadata crawler integration

Set up a pipeline that extracts HDFS metadata from Gobblin, writes it into a Kafka topic, and reads from that topic to ingest the metadata into Ground. This notifies us of new files that are created in HDFS. Eventually, we want to take this metadata and hand these events off to a featurization or parsing Aboveground service that extracts additional metadata from the files detected by Gobblin.

The second half of the pipeline that reads from Gobblin and writes into Ground will also be used by the Git integration pipeline.
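
A minimal sketch of that second half, assuming a hypothetical topic name ('gobblin-metadata') and an illustrative Ground endpoint; the actual topic and API path would come from the Gobblin extractor and Ground's API:

    # read Gobblin-produced HDFS metadata events from Kafka and push them into Ground
    import json
    import requests
    from kafka import KafkaConsumer

    consumer = KafkaConsumer('gobblin-metadata',                # hypothetical topic name
                             bootstrap_servers=['kafka:9092'])  # Kafka hostname as in the Docker setup

    for message in consumer:
        event = json.loads(message.value.decode('utf-8'))
        # forward each new-file event to Ground's HTTP API (the path here is illustrative)
        requests.post('http://ground:9000/nodes', json=event)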

Tag lookup indexing and APIs

Add an API that allows retrieval of all entities that are tagged with a particular key.

As a part of the database setup scripts, automatically index Tags based on the key in order to allow quick lookups for these queries.
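
As a sketch, the setup script could add an index along these lines (using psycopg2; the table and column names below are assumptions, and the real schema lives in db/postgres.sql):

    # add a lookup index on tag keys during database setup (table/column names are assumptions)
    import psycopg2

    conn = psycopg2.connect(dbname='ground', user='ground')  # example connection values
    with conn, conn.cursor() as cur:
        cur.execute('CREATE INDEX IF NOT EXISTS tag_key_idx ON tag (key)')
    conn.close()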

Linting and imports

Looks like we import java.util.Optional a lot but don't use it.

This suggests that we don't have very good linters running on the code. It would be good to put something in place that runs automatically.

Getting Started with Ground v0.1 Docs

Put together a tutorial that briefly explains what Ground does and explains the integrations we've built. Get users started with the Docker / Vagrant instance set up by #13.

Demonstrate the usefulness of Ground by having users load canonical Hive example data, run simple Hive queries, and rewind time using Ground (i.e., the functionality provided in #7). We can then show "time travel" queries using Ground's older versions of Hive metadata. This requires the Hive integration from #7, the HDFS integration from #8, and the git integration from #9.

In addition, this demo should come with a simple demonstration of lineage (maybe a graph we show them that they can recreate with existing metadata) as well as a simple Aboveground service that does something like duplicate file detection.

Lastly, we refer them to the wrapper layers in #11 and #12 to show "canonical" examples of building wrapper libraries. We need documentation on best practices for building your own wrapper, so users can ingest metadata from their own environments, as well as tips for writing an Aboveground service that consumes metadata from Ground.

post experiments from CIDR paper?

It'd be very helpful to get a sense of an application that uses Ground. Would it be possible to post the code for the impact analysis experiment? It doesn't look like it has complicated external dependencies, and it should be easy to interpret.

config.ini for github plugin wrong in docker image

The getting started wiki page makes it pretty easy to get started; however, there's a slight hiccup with the Docker images, specifically the GitHub plugin image (the one that runs 'python parsegitlog.py').

The config.ini entries for the Kafka and Ground services both point to localhost, but in the Docker setup they're linked under the hostnames 'kafka' and 'ground'. Once I exec'ed into the image and fixed the config file, it worked.

Without the fix, you'll get an error like this:

docker logs ea212d2e6ad9
Traceback (most recent call last):
  File "parsegitlog.py", line 121, in <module>
    bootstrap_servers=[config['Kafka']['url'] + ":" + config['Kafka']['port']])
  File "/usr/local/lib/python3.5/site-packages/kafka/consumer/group.py", line 284, in __init__
    self._client = KafkaClient(metrics=self._metrics, **self.config)
  File "/usr/local/lib/python3.5/site-packages/kafka/client_async.py", line 202, in __init__
    self.config['api_version'] = self.check_version(timeout=check_timeout)
  File "/usr/local/lib/python3.5/site-packages/kafka/client_async.py", line 791, in check_version
    raise Errors.NoBrokersAvailable()
kafka.errors.NoBrokersAvailable: NoBrokersAvailable
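
For reference, the corrected config.ini inside the plugin container would look roughly like this. The [Kafka] section and its url/port keys are visible in the traceback above; the [Ground] section name and the port values are assumptions (9000 matches the Ground server port from the README):

    # hostnames point at the linked containers instead of localhost
    [Kafka]
    url = kafka
    port = 9092

    [Ground]
    url = ground
    port = 9000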

Hive Metastore Integration

Build a Ground implementation of Hive's RawStore interface so that Ground can act as a drop-in replacement for the Hive Metastore.

For the MVP, we won't be providing versioning semantics for Hive queries, but we do need to figure out how we're going to change the version of metadata that Ground returns.

API Documentation

Document all the external-facing APIs and use something like Swagger to automatically generate API docs for us.

File system metadata wrapper library

Build a wrapper library for file system metadata. The wrapper consists of a set of Ground Structures that contain information about file system entities, as well as a set of import and export Python(?) scripts that move this metadata into and out of Ground.

This should cover things including (but not limited to):

  • Files
  • Directories
  • fs.stat information

Ideally, we'd like the HDFS integration in #8 to use this wrapper and augment it with HDFS-specific information as necessary.
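
A minimal sketch of what an export script for this wrapper might look like, assuming an illustrative Ground endpoint and payload layout (both are placeholders, not the actual wrapper design):

    # collect basic file system metadata and ship it to Ground as tags (illustrative only)
    import os
    import requests

    def collect_metadata(root):
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                info = os.stat(path)
                yield {'path': path,
                       'size': info.st_size,        # fs.stat information
                       'modified': info.st_mtime,
                       'mode': oct(info.st_mode)}

    for entry in collect_metadata('/data'):
        # the endpoint and payload shape are placeholders for the eventual wrapper design
        requests.post('http://localhost:9000/nodes', json=entry)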

Relational wrapper library

Build a wrapper library for relational metadata. The wrapper consists of a set of Ground Structures that contain information about relational entities, as well as a set of import and export Python(?) scripts that move this metadata into and out of Ground.

This should cover things including (but not limited to):

  • Databases
  • Tables
  • Columns
  • Key constraints

Ideally, we'd like the Hive Metastore integration in #7 to use this wrapper and augment it with Hive-specific information as necessary.
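
Similarly, a rough sketch of an import script for relational metadata, assuming a Postgres source and illustrative output handling (the real wrapper would write this into Ground rather than print it):

    # pull table and column metadata from a Postgres source via information_schema
    import psycopg2

    conn = psycopg2.connect(dbname='example', user='example')  # example source database
    with conn, conn.cursor() as cur:
        cur.execute("""SELECT table_name, column_name, data_type
                       FROM information_schema.columns
                       WHERE table_schema = 'public'""")
        for table, column, dtype in cur.fetchall():
            # each row would become tags on a corresponding Ground Structure
            print(table, column, dtype)
    conn.close()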

Git versioning integration

This integration has three parts; a rough sketch of the first part (the webhook listener) appears after the list.

  1. We need to build an API in Ground that listens for Github webhooks for certain repositories that are registered with Ground.
  2. Send these events to an Aboveground server.
  3. Have the Aboveground server clone the repository, analyze the git history (there should be a good way to calculate deltas instead of analyzing the whole history each time), and update Ground with the new versions that were detected.

This depends on the pipeline that reads from Kafka and writes into Ground specified in #8.
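
A minimal sketch of the webhook listener, assuming Flask and a hypothetical Kafka topic for handing events off to the Aboveground server (route path, topic name, and the registration check are all placeholders):

    # listen for GitHub push webhooks and hand the events off for Aboveground processing
    import json
    from flask import Flask, request
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

    @app.route('/webhooks/github', methods=['POST'])  # path is illustrative
    def github_webhook():
        payload = request.get_json()
        # only forward events for repositories registered with Ground (this check is a placeholder)
        if payload and 'repository' in payload:
            producer.send('git-events', json.dumps(payload).encode('utf-8'))  # hypothetical topic
        return '', 204

    if __name__ == '__main__':
        app.run(port=8080)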

Wrong link in README.md

Wrong link to CIDR17.pdf is confusing new contributors

"docs/CIDR17.pdf" should be changed to "resources/docs/CIDR17.pdf".

Community communications

Stumbled upon this from the WhereHows page; interesting project, and I like the fundamental approach.
Are there any community communication channels? I have questions to determine whether the project would be useful for our organization (for development and evaluation).

Website sprucing

  • Add Pointers to Docs, esp. Getting Started
  • Remove pointers to writing repo.
  • Put CIDR submission in repo.

Solr as backend support

Please let me know if there is a proposal to support Solr as a backend store besides Elasticsearch. I would like to contribute in that area.

JSON serialization for maps is broken

Right now, when creating JSON requests, empty maps (parameters on RichVersions and tags on everything) must be specified explicitly because they are otherwise set to null. They should be auto-populated to empty maps when null is passed in.
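
For example, until this is fixed a client has to include the empty maps itself; the endpoint and the 'name' field below are illustrative, while the 'tags' and 'parameters' fields are the ones described above:

    # work around the null-map issue by always sending empty maps explicitly
    import requests

    payload = {
        'name': 'example-node',  # illustrative field
        'tags': {},              # must be present even when empty
        'parameters': {},        # must be present even when empty
    }
    requests.post('http://localhost:9000/nodes', json=payload)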
