
ground's Introduction

Ground


Ground is an open-source data context service under development in UC Berkeley's RISE Lab. Ground serves as a central model, API, and repository for capturing the broad context in which data is used. Our goal is to address practical problems for the Big Data community in the short term and to open up opportunities for long-term research and innovation.

For the vision behind Ground, please see our CIDR '17 publication.

Getting Started

You can download the latest version of Ground from our releases page. The most recent version of Ground is v0.1.2.

Once you have downloaded the latest version of Ground, you will need a Ground database and the corresponding tables before Ground can run. To create the tables, run python postgres_setup.py <user> <dbname> from the db/ directory of the release. To drop the tables, run python postgres_setup.py <user> <dbname> drop. Alternatively, you can use db/postgres.sql to set up the tables directly.
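
For example, assuming a Postgres role and database both named ground (substitute your own values for <user> and <dbname>):

    # run from the db/ directory of the release
    python postgres_setup.py ground ground        # create the Ground tables
    python postgres_setup.py ground ground drop   # drop them again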

You can start the Ground server by running ./bin/ground-postgres. This starts a local Ground server running on port 9000.

For more information, see Hit the Ground Running, our getting started guide.

License

Ground is licensed under the Apache v2 License.

Contributing

Please see the guidelines in CONTRIBUTING.md.


ground's Issues

Pom.xml removed; build instructions deprecated

The pom.xml file was removed in commit 5959981, breaking the Maven build instructions in the README and on ground-context.org.

I see sbt artifacts in the project. If the switch to sbt was intentional, the documentation needs to be updated; otherwise, pom.xml should be restored.

routes File selection error

I ran this project and got this result:

"Action not found
For request 'GET /'
These routes have been tried, in this order:"

What is the problem here?

Questions on the GroundClient packages

According to the Ground wiki, we can use the client library to generate the data context.

GroundClient client = new GroundClient(host, port);

int managerId = client.createNode("manager");
int engineerId = client.createNode("engineer");

Is there a GroundClient package? Also, when I try to use the POST API to generate data, it always replies with a bad request. Where should I get started?

Thanks.

Code cleanup

General cleanup, deduplication of code, addressing of TODOs, etc. Also, write proper documentation for everything.

Ground deployment Vagrant instance or Dockerfile

In order to remove some of the friction in setting up and playing with a Ground Alpha, we should have a Vagrant instance or Dockerfile(s) that set up HDFS, Kafka, Gobblin, and Ground and link them together. Users won't have to install any software to get started with Ground.

Caching bug?

The Hotels.com team identified a bug, but details are needed.

Gobblin Metadata crawler integration

Set up a pipeline that extracts HDFS metadata from Gobblin, writes it into a Kafka topic, and reads from that topic to ingest the metadata into Ground. This notifies us of new files that are created in HDFS. Eventually, we want to take this metadata and hand these events off to a featurization or parsing Aboveground service that extracts additional metadata from the files detected by Gobblin.

The second half of the pipeline that reads from Gobblin and writes into Ground will also be used by the Git integration pipeline.
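
A minimal sketch of that second half, assuming a hypothetical topic name ('gobblin-metadata') and an illustrative Ground endpoint; the actual topic and API path would come from the Gobblin extractor and Ground's API:

    # read Gobblin-produced HDFS metadata events from Kafka and push them into Ground
    import json
    import requests
    from kafka import KafkaConsumer

    consumer = KafkaConsumer('gobblin-metadata',                # hypothetical topic name
                             bootstrap_servers=['kafka:9092'])  # Kafka hostname as in the Docker setup

    for message in consumer:
        event = json.loads(message.value.decode('utf-8'))
        # forward each new-file event to Ground's HTTP API (the path here is illustrative)
        requests.post('http://ground:9000/nodes', json=event)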

Tag lookup indexing and APIs

Add an API that allows retrieval of all entities that are tagged with a particular key.

As a part of the database setup scripts, automatically index Tags based on the key in order to allow quick lookups for these queries.
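
As a sketch, the setup script could add an index along these lines (using psycopg2; the table and column names below are assumptions, and the real schema lives in db/postgres.sql):

    # add a lookup index on tag keys during database setup (table/column names are assumptions)
    import psycopg2

    conn = psycopg2.connect(dbname='ground', user='ground')  # example connection values
    with conn, conn.cursor() as cur:
        cur.execute('CREATE INDEX IF NOT EXISTS tag_key_idx ON tag (key)')
    conn.close()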

Linting and imports

Looks like we import java.util.Optional a lot but don't use it.

This suggests that we don't have very good linters running on the code. It would be good to put something in place that runs automatically.

Getting Started with Ground v0.1 Docs

Put together a tutorial that briefly explains what Ground does and explains the integrations we've built. Get users started with the Docker / Vagrant instance set up by #13.

Demonstrate the usefulness of Ground by having users load canonical Hive example data, run simple Hive queries, and rewind time using Ground (i.e., the functionality provided in #7). We can then show "time travel" queries using Ground's older versions of Hive metadata. This requires the Hive integration from #7, the HDFS integration from #8, and the git integration from #9.

In addition, this demo should come with a simple demonstration of lineage (maybe a graph we show them that they can recreate with existing metadata) as well as a simple Aboveground service that does something like duplicate file detection.

Lastly, we refer them to the wrapper layers in #11 and #12 to show "canonical" examples of building wrapper libraries. We need documentation on best practices for building your own wrapper, so users can ingest metadata from their own environments, as well as tips for writing an Aboveground service that consumes metadata from Ground.

post experiments from CIDR paper?

It'd be very helpful to get a sense of an application that uses Ground. Would it be possible to post the code for the impact analysis experiment? It doesn't look like it has complicated external dependencies, and it should be easy to interpret.

config.ini for github plugin wrong in docker image

The getting started wiki page makes it pretty easy to get started; however, there's a slight hiccup with the Docker images, specifically the GitHub plugin image (the one that runs 'python parsegitlog.py').

The config.ini entries for the Kafka and Ground services both point to localhost, but in the Docker setup they're linked under the hostnames 'kafka' and 'ground'. Once I exec'ed into the image and fixed the config file, it worked.

Without the fix, you'll get an error like this:

docker logs ea212d2e6ad9
Traceback (most recent call last):
  File "parsegitlog.py", line 121, in <module>
    bootstrap_servers=[config['Kafka']['url'] + ":" + config['Kafka']['port']])
  File "/usr/local/lib/python3.5/site-packages/kafka/consumer/group.py", line 284, in __init__
    self._client = KafkaClient(metrics=self._metrics, **self.config)
  File "/usr/local/lib/python3.5/site-packages/kafka/client_async.py", line 202, in __init__
    self.config['api_version'] = self.check_version(timeout=check_timeout)
  File "/usr/local/lib/python3.5/site-packages/kafka/client_async.py", line 791, in check_version
    raise Errors.NoBrokersAvailable()
kafka.errors.NoBrokersAvailable: NoBrokersAvailable
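
For reference, the corrected config.ini inside the plugin container would look roughly like this. The [Kafka] section and its url/port keys are visible in the traceback above; the [Ground] section name and the port values are assumptions (9000 matches the Ground server port from the README):

    # hostnames point at the linked containers instead of localhost
    [Kafka]
    url = kafka
    port = 9092

    [Ground]
    url = ground
    port = 9000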

Hive Metastore Integration

Build a Ground implementation of Hive's RawStore interface so that Ground can act as a drop-in replacement for the Hive Metastore.

For the MVP, we won't be providing versioning semantics for Hive queries, but we do need to figure out how we're going to change the version of metadata that Ground returns.

API Documentation

Document all the external-facing APIs and use something like Swagger to automatically generate API docs for us.

File system metadata wrapper library

Build a wrapper library for file system metadata. The wrapper consists of a set of Ground Structures that contain information about file system entities, as well as a set of import and export Python(?) scripts that move this metadata into and out of Ground.

This should cover things including (but not limited to):

  • Files
  • Directories
  • fs.stat information

Ideally, we'd like the HDFS integration in #8 to use this wrapper and augment it with HDFS-specific information as necessary.
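
A minimal sketch of what an export script for this wrapper might look like, assuming an illustrative Ground endpoint and payload layout (both are placeholders, not the actual wrapper design):

    # collect basic file system metadata and ship it to Ground as tags (illustrative only)
    import os
    import requests

    def collect_metadata(root):
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                info = os.stat(path)
                yield {'path': path,
                       'size': info.st_size,        # fs.stat information
                       'modified': info.st_mtime,
                       'mode': oct(info.st_mode)}

    for entry in collect_metadata('/data'):
        # the endpoint and payload shape are placeholders for the eventual wrapper design
        requests.post('http://localhost:9000/nodes', json=entry)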

Relational wrapper library

Build a wrapper library for relational metadata. The wrapper consists of a set of Ground Structures that contain information about relational entities, as well as a set of import and export Python(?) scripts that move this metadata into and out of Ground.

This should cover things including (but not limited to):

  • Databases
  • Tables
  • Columns
  • Key constraints

Ideally, we'd like the Hive Metastore integration in #7 to use this wrapper and augment it with Hive-specific information as necessary.
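
Similarly, a rough sketch of an import script for relational metadata, assuming a Postgres source and illustrative output handling (the real wrapper would write this into Ground rather than print it):

    # pull table and column metadata from a Postgres source via information_schema
    import psycopg2

    conn = psycopg2.connect(dbname='example', user='example')  # example source database
    with conn, conn.cursor() as cur:
        cur.execute("""SELECT table_name, column_name, data_type
                       FROM information_schema.columns
                       WHERE table_schema = 'public'""")
        for table, column, dtype in cur.fetchall():
            # each row would become tags on a corresponding Ground Structure
            print(table, column, dtype)
    conn.close()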

Git versioning integration

This integration has three parts; a rough sketch of the first part (the webhook listener) appears after the list.

  1. We need to build an API in Ground that listens for Github webhooks for certain repositories that are registered with Ground.
  2. Send these events to an Aboveground server.
  3. Have the Aboveground server clone the repository, analyze the git history (there should be a good way to calculate deltas instead of analyzing the whole history each time), and update Ground with the new versions that were detected.

This depends on the pipeline that reads from Kafka and writes into Ground specified in #8.
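
A minimal sketch of the webhook listener, assuming Flask and a hypothetical Kafka topic for handing events off to the Aboveground server (route path, topic name, and the registration check are all placeholders):

    # listen for GitHub push webhooks and hand the events off for Aboveground processing
    import json
    from flask import Flask, request
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

    @app.route('/webhooks/github', methods=['POST'])  # path is illustrative
    def github_webhook():
        payload = request.get_json()
        # only forward events for repositories registered with Ground (this check is a placeholder)
        if payload and 'repository' in payload:
            producer.send('git-events', json.dumps(payload).encode('utf-8'))  # hypothetical topic
        return '', 204

    if __name__ == '__main__':
        app.run(port=8080)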

Wrong link in README.md

Wrong link to CIDR17.pdf is confusing new contributors

"docs/CIDR17.pdf" should be changed to "resources/docs/CIDR17.pdf".

Community communications

Stumbled upon this from the WhereHows page; interesting project, and I like the fundamental approach.
Are there any community communication channels? I have questions to determine whether the project would be useful for our organization (for development and evaluation).

Website sprucing

  • Add Pointers to Docs, esp. Getting Started
  • Remove pointers to writing repo.
  • Put CIDR submission in repo.

Solr as backend support

Please let me know if there is a proposal to support Solr as a backend store besides Elasticsearch. I would like to contribute in that area.

JSON serialization for maps is broken

Right now, when creating JSON requests, empty maps (parameters on RichVersions and tags on everything) must be specified explicitly because they are otherwise set to null. They should be auto-populated to empty maps when null is passed in.
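
For example, until this is fixed a client has to include the empty maps itself; the endpoint and the 'name' field below are illustrative, while the 'tags' and 'parameters' fields are the ones described above:

    # work around the null-map issue by always sending empty maps explicitly
    import requests

    payload = {
        'name': 'example-node',  # illustrative field
        'tags': {},              # must be present even when empty
        'parameters': {},        # must be present even when empty
    }
    requests.post('http://localhost:9000/nodes', json=payload)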
