
datahuborg / datahub

An experimental hosted platform (GitHub-like) for organizing, managing, sharing, collaborating, and making sense of data.

Home Page: https://datahub.csail.mit.edu

License: MIT License

Python 29.21% HTML 20.26% Shell 0.77% JavaScript 34.73% CSS 9.36% Makefile 0.34% C++ 0.08% Go 0.07% Java 0.21% Objective-C 4.51% Thrift 0.13% Batchfile 0.31%

datahub's Introduction


Note: This project is under development. It is not yet ready for production use.

DataHub

DataHub is an experimental hosted platform (GitHub-like) for organizing, managing, sharing, collaborating on, and making sense of data. It provides an efficient platform and easy-to-use tools and interfaces for:

  • Publishing of your own data (hosting, sharing, collaboration)
  • Using others’ data (querying, linking)
  • Making sense of data (analysis, visualization)


Quickstart

Vagrant is the recommended method for developing with DataHub. It provides a VM matching the DataHub production server, regardless of your host system.

  1. Install VirtualBox https://www.virtualbox.org/.

  2. Install Vagrant https://www.vagrantup.com/downloads.html.

  3. Clone DataHub:

    $ git clone https://github.com/datahuborg/datahub.git

  4. Add this line to your hosts file (/etc/hosts on most systems):

    192.168.50.4    datahub-local.mit.edu

  5. From your clone, start the VM:

    $ vagrant up

This last step might take several minutes depending on your connection and computer.

When vagrant up finishes, you can find your environment running at http://datahub-local.mit.edu.

Vagrant keeps your working copy and the VM in sync, so edits you make to DataHub's code will be reflected on datahub-local.mit.edu. Changes to static files like CSS, JS, and documentation must be collected before the server will notice them. For more information, see the docs at https://datahub.csail.mit.edu/static/docs/html/index.html.

datahub's People

Contributors

anantb, b-carter, dnsserver, famien, hariharsubramanyam, jharia, justinanderson, kxzhang, rogertangos, sirrice, ygina


datahub's Issues

Allow cross-user table joins

I'm copying a few of the frequently requested features/fixes from Jira to GitHub, so people know that we're aware of the issues and are working on them.

Go Thrift code generation throws errors

This seems to be a missing (non-vital) formatting package. Unfortunately, I'm unfamiliar with both Go and Thrift.

https://golang.org/cmd/gofmt/

Here's the error:

$ cd src/examples/go
$ ./setup.sh

sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/datahub.go' failed.
sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/ttypes.go' failed.
sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/constants.go' failed.
sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/account/account_service-remote/account_service-remote.go' failed.
sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/account/accountservice.go' failed.
sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/account/ttypes.go' failed.
sh: gofmt: command not found
WARNING - Running 'gofmt -w /Users/arcarter/code/datahub/src/examples/go/gen-go/src///datahub/account/constants.go' failed.
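The warnings above indicate that gofmt is not on the PATH. As a minimal sketch (not part of the DataHub setup scripts), a pre-flight check like the following could report missing tools before code generation starts. The helper name missing_tools and the assumption that the script needs both gofmt and thrift are illustrative only.

```python
import shutil


def missing_tools(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed tool list; adjust to whatever the setup script really calls.
    missing = missing_tools(["gofmt", "thrift"])
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found")
```

Because the warnings are non-fatal, a check like this would only make the cause easier to spot; it would not change the generated code.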

Test Issue

I'm linking GitHub and Jira issues. This is a test to see if Jira picks up new GitHub issues. Apologies for the notifications.

Layout should be a compiled header

The current layout.html is a raw template, not a compiled one. In the future we would compile it so that the static files are separated out (they can be served through either a CDN or a fast web server like nginx) and the template handling sits behind a reverse proxy. I'll put up a task for it, but it's not super important.

Anant

In DataHub's layout.html we're importing 30+ header files. Many of these are for specific applications (e.g. the terminal) and are hosted locally instead of on a CDN. Is this really necessary?

My impression is that it'd be faster and cleaner to put these in the application headers, and to load them from CDNs where possible.

Albert Carter

Client Throws "DBException(message:User matching query does not exist. Lookup parameters were {'username': None})"

The following piece of code throws the above error:

this.transport = new THttpClient("http://datahub.csail.mit.edu/service");
this.protocol = new  TBinaryProtocol(transport);
this.client = new DataHub.Client(protocol);

this.con_params = new ConnectionParams();
this.con_params.setUser("anantb");
this.con_params.setPassword("anant");
this.conn = this.client.open_connection(con_params);

ResultSet updatelogExists =  this.client.execute_sql(this.conn, "select * from anantb.test.demo", null);

Gives the following error:

DBException(message:User matching query does not exist. Lookup parameters were {'username': None})...

Would appreciate a fix for this! Thanks!

Views cannot be deleted

{"error": "\"viewtest\" is not a table\nHINT: Use DROP VIEW to remove a view.\n"}

table_delete in browser.views will need to determine if the table_name passed is a view, and it will have to call related methods in manager.py and pg.py
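A minimal sketch of the decision described above, assuming the caller already knows whether the name refers to a view (in PostgreSQL this could, for example, be looked up in information_schema.views). The function names quote_ident and build_drop_statement are hypothetical and are not taken from manager.py or pg.py.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_drop_statement(repo, name, is_view):
    """Build DROP VIEW for views and DROP TABLE for base tables."""
    kind = "VIEW" if is_view else "TABLE"
    return "DROP {} {}.{}".format(kind, quote_ident(repo), quote_ident(name))


if __name__ == "__main__":
    print(build_drop_statement("test", "viewtest", is_view=True))
```

Keeping the view/table choice in one place would let table_delete stay unaware of the SQL differences.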

Apps should be listed in app center

The "Apps Center" link on the top nav bar currently points to root. There should either be a list of installed apps, or the link should be removed.

u'prefixed strings' being returned by thrift

reported by @karger:

In JavaScript I'm using the Thrift API to send DataHub a SQL query that includes the "array_agg" operator:

"select count(prequest.hilulim.uid) as num, prequest.hilulim.title as title ,array_agg(prequest.names.name) as names from prequest.hilulim join ...."

The aggregated column is being returned by the Thrift API as a string encoding of an array of unicode strings, which is really weird. That is, the value in the cell is the string "[u'Fiana Sara Eber', u'David Karger']". Why am I getting this Python-language syntax exposed in theoretically language-agnostic Thrift? Why isn't it coming back as an array of strings? Why the unicode encoding? Will it always be a unicode encoding? Do I need to parse it myself? Unfortunately, since it's a Python-encoded string, I can't hand it to JSON.parse().
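The cell value quoted above has the shape that Python 2's str() produces for a list of unicode strings. One possible remedy, sketched here under the assumption that cells are converted to text on the server before being sent, is to JSON-encode list values so that any client can parse them. The function name encode_cell is hypothetical and does not appear in the DataHub code.

```python
import json


def encode_cell(value):
    """Encode a result cell as text that non-Python clients can parse."""
    if isinstance(value, (list, tuple)):
        # JSON arrays can be handed straight to JSON.parse() in a browser.
        return json.dumps(list(value))
    return str(value)


if __name__ == "__main__":
    print(encode_cell(["Fiana Sara Eber", "David Karger"]))
```

With this encoding the cell arrives as a JSON array of strings rather than as a Python literal.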

Google Gadgets Like Apps

Users might be able to add javascript applications that affect tables through the DataHub API, for example Hands On Table
@karger

Console cannot list views

Go to the console, then ls reponame

Base tables will show, but views won't.

This happens because the list_tables method now only lists base tables. Unfortunately, I'm having trouble getting Thrift (0.9.2) to generate up-to-date javascript code which will allow a javascript client.ls_views function.

disconnect stack in console

It should be possible to disconnect from the current repo in the terminal. In postgres, this should be with the disconnect command and/or ctrl+d, but I'm not sure how other databases manage it.

new account email addresses are case sensitive

Account email addresses are case sensitive. It's possible to create an account with [email protected], and then another with [email protected], and then give them separate usernames and passwords.

In Objective-C:

[account_client create_account:username email:@"[email protected]" password:password repo_name:@"getfit" app_id:appID app_token:appToken];

and then create another account:

[account_client create_account:username email:@"[email protected]" password:password repo_name:@"getfit" app_id:appID app_token:appToken];
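A minimal sketch of one way to avoid such duplicates: compare addresses in a normalized, lowercased form before creating an account. The helper names and the example.com addresses are illustrative assumptions. Lowercasing the whole address treats the local part as case insensitive, which matches common practice but is a policy choice rather than something the source specifies.

```python
def normalize_email(email):
    """Trim surrounding whitespace and lowercase the address for comparison."""
    return email.strip().lower()


def is_duplicate(email, existing_emails):
    """Return True if `email` matches an existing address, ignoring case."""
    wanted = normalize_email(email)
    return any(normalize_email(e) == wanted for e in existing_emails)


if __name__ == "__main__":
    print(is_duplicate("User@Example.com", ["user@example.com"]))
```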

ImportError: No module named datahub

/home/ubuntu/datahub/src/apps/dbwipes/views.py:6: DeprecationWarning: the md5 module is deprecated; use hashlib instead
import md5

Internal Server Error: /
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/django/core/handlers/base.py", line 103, in get_response
    resolver_match = resolver.resolve(request.path_info)
  File "/usr/local/lib/python2.7/dist-packages/django/core/urlresolvers.py", line 319, in resolve
    for pattern in self.url_patterns:
  File "/usr/local/lib/python2.7/dist-packages/django/core/urlresolvers.py", line 347, in url_patterns
    patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module)
  File "/usr/local/lib/python2.7/dist-packages/django/core/urlresolvers.py", line 342, in urlconf_module
    self._urlconf_module = import_module(self.urlconf_name)
  File "/usr/local/lib/python2.7/dist-packages/django/utils/importlib.py", line 35, in import_module
    __import__(name)
  File "/home/ubuntu/datahub/src/browser/urls.py", line 171, in <module>
    url(r'^apps/dbwipes/', include('dbwipes.urls')), # dbwipes app
  File "/usr/local/lib/python2.7/dist-packages/django/conf/urls/__init__.py", line 25, in include
    urlconf_module = import_module(urlconf_module)
  File "/usr/local/lib/python2.7/dist-packages/django/utils/importlib.py", line 35, in import_module
    __import__(name)
  File "/home/ubuntu/datahub/src/apps/dbwipes/urls.py", line 2, in <module>
    import views
  File "/home/ubuntu/datahub/src/apps/dbwipes/views.py", line 16, in <module>
    from service.handler import DataHubHandler
  File "/home/ubuntu/datahub/src/service/handler.py", line 7, in <module>
    from datahub import DataHub
ImportError: No module named datahub
[27/Apr/2015 03:19:12] "GET / HTTP/1.1" 500 122244

Tables should be accessible by links WITHOUT username

I would like to be able to link people to tables/repositories without using their usernames:

for example, instead of
https://datahub.csail.mit.edu/browse/USERNAME/REPONAME/table/TABLENAME

use this and have username inferred by their login status.
https://datahub.csail.mit.edu/browse/REPONAME/table/TABLENAME

This is a thing that kept bugging me during getfit.

LIMIT statements do not work in repo interface

DataHub appends automatic LIMIT and OFFSET clauses to all SQL statements in the repository interface (http://datahub.csail.mit.edu/browse/USERNAME/REPONAME/) to support pagination. As a result, a statement like select * from getfit.deviceinfo LIMIT 1; which should work, instead returns an error:

{"error": "syntax error at or near \"LIMIT\"\nLINE 1: select * from getfit.deviceinfo LIMIT 1 LIMIT 50 OFFSET 0\n ^\n"}
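The error above comes from appending a second LIMIT to a statement that already has one. A common workaround, sketched here as an assumption rather than as DataHub's actual fix, is to wrap the user's statement in a subquery and paginate the outer query. The function name paginate and the alias paginated are hypothetical, and the approach only applies to SELECT statements.

```python
def paginate(sql, limit=50, offset=0):
    """Wrap a SELECT statement in a subquery and paginate the outer query."""
    # Drop surrounding whitespace and any trailing semicolons.
    inner = sql.strip().rstrip(";").strip()
    return "SELECT * FROM ({}) AS paginated LIMIT {} OFFSET {}".format(
        inner, int(limit), int(offset)
    )


if __name__ == "__main__":
    print(paginate("select * from getfit.deviceinfo LIMIT 1;"))
```

The user's own LIMIT stays inside the subquery, so the outer pagination clauses no longer collide with it.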

DML "cards" don't escape characters

When a user creates a card using DML, they are asked to give it a name. Currently, names that contain blank spaces cannot be mapped to urls, and break.
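A minimal sketch of percent-encoding a card name before placing it in a URL, using Python's standard urllib.parse.quote. The /card/ path segment and the function name card_url are assumptions for illustration and are not DataHub's actual URL scheme.

```python
from urllib.parse import quote, unquote


def card_url(repo_base, card_name):
    """Build a URL path for a card, percent-encoding the card name."""
    # safe="" also encodes "/" so a name cannot add extra path segments.
    return "{}/card/{}".format(repo_base.rstrip("/"), quote(card_name, safe=""))


if __name__ == "__main__":
    path = card_url("/browse/USERNAME/REPONAME", "weekly totals")
    print(path)
    print(unquote(path.rsplit("/", 1)[1]))
```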

Console is difficult to copy and paste into

When using the console, you sometimes have to click more than once and try to paste more than once before any text appears.

I'm using OSX and chrome.

This is possibly one of the simplest and most frustrating parts of DataHub, because no one wants to be editing SQL in a terminal.

Manage a collection of queries

Maybe save create view statements as cards, or save all queries as temporary cards? It's too hard to retrieve queries that have previously been run successfully.

Dependencies?

Seriously, add a requirements.txt or setup.py or just a list to the README.

Client/Connection protocol is verbose

It takes a large number of lines to create a client. Maybe this could be done with one line and assume a default http client. If the user wanted a TCP client, they could specify.

responses should be gzipped

Doing big queries over HTTP is incredibly time consuming. Gzipping is pretty straightforward, and would significantly speed up load time.
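As a rough sketch of the idea, the following compresses a response body with Python's standard gzip module when the client advertises gzip support. The function name gzip_body and its header handling are illustrative assumptions. Since DataHub runs on Django, the framework's built-in GZipMiddleware may be a simpler route if it fits the deployment.

```python
import gzip
import json


def gzip_body(body, accept_encoding=""):
    """Return (payload, headers), compressing only if the client accepts gzip."""
    raw = body.encode("utf-8")
    if "gzip" in accept_encoding.lower():
        return gzip.compress(raw), {"Content-Encoding": "gzip"}
    return raw, {}


if __name__ == "__main__":
    # A repetitive JSON result set, similar in shape to a large query response.
    body = json.dumps([{"id": i, "name": "row"} for i in range(1000)])
    payload, headers = gzip_body(body, "gzip, deflate")
    print(len(body), len(payload), headers)
```

Query results tend to be repetitive text, which is the kind of payload where gzip usually gives a large reduction.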

Ctrl+c to interrupt console processes

Manipulating large datasets in the terminal causes it to hang. The only way to interrupt is to force quit the tab. It should be possible to keep listening for ctrl+c and interrupt the process if the user desires.

Add DataQ back into DataHub

Currently, the DataQ app is not accessible via the DataHub user interface. Add a button to table-browse-template.html to launch the app.
