mimir's Introduction

The Mimir Data-Ish Exploration Tool

One of the biggest costs in analytics is data wrangling: getting your messy, mis-labeled, disorganized data into a format that you can actually ask questions about. Unfortunately, most tools for data wrangling force you to do all of this work upfront — before you actually know what you even want to do with the data.

Mimir is about getting you to your analysis as fast as possible. It lets you harness the raw power of SQL, but also provides a ton of powerful language extensions:

  • Stop messing with data import and relational schema design. The versatile LOAD command allows you to quickly transform documents into relational tables without the muss and fuss of upfront schema design or defining complex transformation operators.
  • Stop writing messy scripts to visualize your data. The PLOT command lets you take SQL queries and see them directly – notebook style, PDF/PNG, or JavaScript, take your pick.
  • Stop writing complex ETL pipelines for simple data. Lenses do the same work, but don't require nearly as much configuration (and we're doing more every day to make lenses easier to use).

Unlike most other SQL-based systems, Mimir lets you make decisions during and after data exploration. All of Mimir's functionality is based on three ideas: (1) Mimir provides sensible best-guess defaults, (2) Mimir warns you when one of its guesses is going to affect what it's telling you, and (3) Mimir lets you easily inspect what it's doing to your data with the ANALYZE query command.

Better still, you don't need any new infrastructure. Mimir attaches to ordinary relational databases through JDBC (we currently support SQLite, with SparkSQL and Oracle support in progress). If you don't care which database you use, Mimir just puts everything in a super portable SQLite database by default.

Quick-Start

Install with Homebrew

$> brew tap UBOdin/ubodin
$> brew install mimir
$> mimir --help

Manually download the JAR

Download the latest version of Mimir:

This is a self-contained jar. Run it with

$> java -jar Mimir.jar

Run with Docker

Install Docker and run the docker image:

$> docker run -i -t docker.mimirdb.info/mimir-core
...

Link with SBT (or Maven)

Add the following to your build.sbt

resolvers += "MimirDB" at "http://maven.mimirdb.info/"
libraryDependencies += "info.mimirdb" %% "mimir" % "0.2-SNAPSHOT"

User Guides

Mimir adds some useful language features to SQL. See the MimirSQL Docs for more details, as well as the Lens and Adaptive Schema Docs for more information about Mimir's data cleaning components.

Compiling Mimir

To compile from source, check out Mimir and use one of the following to compile and run Mimir.

$> git clone https://github.com/UBOdin/mimir.git
...
$> cd mimir
$> sbt run

OR

$> sbt assembly
...
$> ./bin/mimir

OR Install Docker and use the docker image:

$> docker run -i -t docker.mimirdb.info/mimir-core
...

Hacking on Mimir

Credit

Development of Mimir has been sponsored by

mimir's People

Contributors

anandsan1, bizentass, codingsage, hube5462, kyunghoj, legacy25, michaelkulbacki, mikebrachmann, mrb24, nickcellino, okennedy, shivang94, snehakrishnamurthy, sophieyoung717, willspoth

mimir's Issues

OperatorParser breaks with nested VGTerms

The TYPE_INFERENCE lens takes the form:

PROJECT[PID <= {{ TR1CAST_0[ROWID, {{ TR1INFER_0[] }}] }}, RATING <= {{ TR1CAST_1[ROWID, {{ TR1INFER_1[] }}] }}, REVIEW_CT <= {{ TR1CAST_2[ROWID, {{ TR1INFER_2[] }}] }}]( RATINGS1(...) )

Passing a VGTerm as an argument to another VGTerm seems to confuse the operator parser. The lens works on its own, but when you try to compose it with another lens, the lens.load() step fails. To reproduce the error, create a type_inference lens, create a missing_value lens on top of it, and then do any of the following: view a tooltip, create another lens on top of it, or run SELECT * FROM MIMIR_LENSES.

Type inference lens

Selects the type of each attribute based on the majority of values in the record. Allows for the possibility of errors in the type selection.
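
As a rough illustration of majority-based type selection, the sketch below lets each value vote for a type and reports the winner together with its share of the votes. It is not Mimir's implementation: the detector list, the function names, and the sample values are all hypothetical, and the vote share is only one possible way to expose the "possibility of errors".

```python
import re
from collections import Counter

# Hypothetical type detectors, checked from most to least specific.
_DETECTORS = [
    ("int", re.compile(r"^[+-]?\d+$")),
    ("float", re.compile(r"^[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$")),
]

def guess_type(value):
    """Return the most specific type name that matches one raw string."""
    text = value.strip()
    for name, pattern in _DETECTORS:
        if pattern.match(text):
            return name
    return "string"

def infer_column_type(values):
    """Pick the type with the most votes and the share of votes it won."""
    votes = Counter(guess_type(v) for v in values if v.strip() != "")
    if not votes:
        return "string", 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / sum(votes.values())

# Made-up sample column: three floats, one int, one non-numeric value.
ratings = ["4.5", "4.0", "6.4", "n/a", "3"]
print(infer_column_type(ratings))
```

A share below 1.0 signals that some values disagree with the chosen type, which is where a lens would need to flag uncertainty.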

Regression: The percolator should handle row-ids correctly

[info] x handle row-ids correctly
[error]    'PROJECT[A <= R_A, C <= R_C, N <= {{ test_0[__LHS_ROWID, R_A] }}, S_C <= S_C, S_D <= S_D](
[error]      SELECT[ (R_C=S_C) ](
[error]        JOIN(
[error]          PROJECT[__LHS_ROWID <= ROWID, __LHS_ROWID <= ROWID, __LHS_ROWID <= ROWID](
[error]            R(ROWID:int // ROWID:rowid, ROWID:rowid)
[error]          ),
[error]          PROJECT[S_C <= S_C, S_D <= S_D](
[error]            S(S_C:int, S_D:decimal)
[error]          )
[error]        )
[error]      )
[error]    )'
[error]     is not equal to 
[error]    'PROJECT[A <= R_A, C <= R_C, N <= {{ test_0[__LHS_ROWID, R_A] }}, S_C <= S_C, S_D <= S_D](
[error]      SELECT[ (R_C=S_C) ](
[error]        JOIN(
[error]          PROJECT[R_A <= R_A, R_B <= R_B, R_C <= R_C, __LHS_ROWID <= ROWID](
[error]            R(ROWID:int)
[error]          ),
[error]          PROJECT[S_C <= S_C, S_D <= S_D](
[error]            S(S_C:int, S_D:decimal)
[error]          )
[error]        )
[error]      )
[error]    )' (CompilerSpec.scala:240)
[error] Expected: ...OJECTA ..._0[__LHS_ROWID, R_A] }}, ...
[error] ...ELECT[ (R_C=S_C) ](
[error] ...JOIN(
[error] ...OJECT[]R[_A] <= R[_A], [R]_[B <= R]_[B, R_C] <= R[_C], __L...
[error] ...D:int[])
[error] ...   ),
[error] ...OJECT[S_C <= S_C, S_D <= S_D](
[error] ...imal)
[error] ...    )
[error]     )
[error]   )
[error] )
[error] Actual:   ...OJECTA ..._0[__LHS_ROWID, R_A] }}, ...
[error] ...ELECT[ (R_C=S_C) ](
[error] ...JOIN(
[error] ...OJECT[__LHS_]R[OWID] <= R[OWID], []_[_LHS]_[ROWID] <= R[OWID], __L...
[error] ...D:int[ // ROWID:rowid, ROWID:rowid])
[error] ...   ),
[error] ...OJECT[S_C <= S_C, S_D <= S_D](
[error] ...imal)
[error] ...    )
[error]     )
[error]   )
[error] )

File import through Web interface

A CSV file import feature would be helpful, and could form the basis for some later features for log parsing.

The semantics I'd be looking to see are something along the lines of:

SELECT * INTO new_table FROM uploaded_csv_file
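
A minimal sketch of those semantics, using Python's standard csv and sqlite3 modules rather than Mimir itself. The function names, the sample data, and the choice to store every column as TEXT are assumptions made for illustration. Identifiers are quoted so that header names containing characters such as '.' do not break the generated SQL.

```python
import csv
import io
import sqlite3

def quote_ident(name):
    """Quote an SQL identifier so characters such as '.' or spaces are safe."""
    return '"' + name.replace('"', '""') + '"'

def import_csv(conn, table, csv_text):
    """Create `table` from CSV text; the first row is the header, columns are TEXT."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(quote_ident(c) + " TEXT" for c in header)
    conn.execute("CREATE TABLE " + quote_ident(table) + " (" + columns + ")")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO " + quote_ident(table) + " VALUES (" + placeholders + ")", data
    )
    conn.commit()
    return len(data)

conn = sqlite3.connect(":memory:")
# Hypothetical upload; the column names are invented for this example.
sample = "Processor,Clock.Speed\nA,2.4\nB,3.1\n"
import_csv(conn, "new_table", sample)
print(conn.execute("SELECT * FROM new_table").fetchall())
```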

Case sensitivity in lens names

Lens type definitions should not be case sensitive. Right now, these two statements behave differently:

create lens x as select * from ratings2 with SCHEMA_MATCHING (PID string, RATING float, REVIEW_COUNT float);
create lens x as select * from ratings2 with schema_matching (PID string, RATING float, REVIEW_COUNT float);
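
One way to get the requested behavior is to normalize the lens type name before looking it up. The sketch below shows the idea only; the registry and function name are hypothetical and do not reflect how Mimir stores its lens types.

```python
# Hypothetical registry of lens types, keyed by upper-case name.
LENS_TYPES = {
    "SCHEMA_MATCHING": "schema matching lens",
    "MISSING_VALUE": "missing value lens",
    "TYPE_INFERENCE": "type inference lens",
}

def lookup_lens_type(name):
    """Resolve a lens type name without regard to case."""
    key = name.strip().upper()
    if key not in LENS_TYPES:
        raise KeyError("unknown lens type: " + name)
    return LENS_TYPES[key]

print(lookup_lens_type("schema_matching") == lookup_lens_type("SCHEMA_MATCHING"))
```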

Create database directory

The root dir has a bunch of .db files piling up. These should get organized into a Databases directory.

CREATE LENS fails after SELECT

Running any SELECT query and following it with a CREATE LENS:

SELECT * FROM sane_r;
CREATE LENS insane_r AS SELECT * FROM r WITH missing_value('C')

results in the following exception:

java.sql.SQLException: [SQLITE_BUSY]  The database file is locked (database is locked)
    at org.sqlite.core.DB.newSQLException(DB.java:890)
    at org.sqlite.core.DB.newSQLException(DB.java:901)
    at org.sqlite.core.DB.execute(DB.java:807)
    at org.sqlite.jdbc3.JDBC3PreparedStatement.execute(JDBC3PreparedStatement.java:50)
    at mimir.sql.JDBCBackend.update(JDBCBackend.scala:56)
    at mimir.Database.update(Database.scala:96)
    at mimir.lenses.LensManager.save(LensManager.scala:71)
    at mimir.lenses.LensManager.create(LensManager.scala:67)

Add row-level explanation box

The explain box should have a Confidence (probability of the row's presence) and a list of var terms in the __MIMIR_CONDITION column.

Possible bug in SqlToRA and RAToSql conversions

I was playing around with some tables for CSV import + Type Inference when I noticed that, with more than a few columns, the order of a table's columns gets messed up. For example:

[Image: screenshot from 2015-07-22 20 45 11]

Name is displayed under Married, Married under Joining, and Joining under Name.

This is because in line 219 of SqlToRA, the toMap is returning a HashMap, which is not preserving the order of columns. Consequently, in RAToSql, the mappings of the SelectItems are wrong.

[Image: screenshot from 2015-07-22 20 50 30]

[Image: screenshot from 2015-07-22 20 56 04]

ret has incorrectly ordered mappings above.

Should we correct this?

Regression: Could not create an instance of SqlLoaderSpec

[error] Could not create an instance of mimir.ctables.SqlLoaderSpec
[error]   caused by java.lang.Exception: Can't find a constructor for class mimir.ctables.SqlLoaderSpec
[error]   org.specs2.reflect.Classes$class.tryToCreateObjectEither(Classes.scala:96)
[error]   org.specs2.reflect.Classes$.tryToCreateObjectEither(Classes.scala:207)
[error]   org.specs2.specification.SpecificationStructure$$anonfun$createSpecificationEither$2.apply(BaseSpecification.scala:119)
[error]   org.specs2.specification.SpecificationStructure$$anonfun$createSpecificationEither$2.apply(BaseSpecification.scala:119)
[error]   scala.Option.getOrElse(Option.scala:120)
[error]   org.specs2.specification.SpecificationStructure$.createSpecificationEither(BaseSpecification.scala:119)
[error]   org.specs2.runner.SbtRunner.org$specs2$runner$SbtRunner$$specificationRun(SbtRunner.scala:73)
[error]   org.specs2.runner.SbtRunner$$anonfun$newTask$1$$anon$5.execute(SbtRunner.scala:59)
[error]   sbt.ForkMain$Run$2.call(ForkMain.java:294)
[error]   sbt.ForkMain$Run$2.call(ForkMain.java:284)
[error]   java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error]   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error]   java.lang.Thread.run(Thread.java:744)

Fuzzing Lens

A simplified form of the archival lens that simply adds a user-specified gaussian to any or all of its input columns.
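
A minimal sketch of the idea, assuming zero-mean noise with a user-chosen standard deviation per column. The function name, the dictionary-based row format, and the sample data are hypothetical.

```python
import random

def fuzz_rows(rows, sigmas, seed=None):
    """Return copies of `rows` with zero-mean Gaussian noise added per column.

    `sigmas` maps a column name to the standard deviation of its noise;
    columns that are not listed are passed through unchanged.
    """
    rng = random.Random(seed)
    fuzzed = []
    for row in rows:
        noisy = dict(row)
        for column, sigma in sigmas.items():
            noisy[column] = row[column] + rng.gauss(0.0, sigma)
        fuzzed.append(noisy)
    return fuzzed

# Made-up input rows; only the "rating" column is fuzzed.
rows = [{"pid": 1, "rating": 4.5}, {"pid": 2, "rating": 3.0}]
print(fuzz_rows(rows, {"rating": 0.1}, seed=42))
```

Passing a seed makes the noise repeatable, which is convenient for testing; omit it to get fresh noise on every call.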

Database.getVGTerms should retrieve row-specific VG Terms

Consider the following expression:

CASE WHEN X IS NULL THEN {{foo}} ELSE X END

ResultIterator.isDeterministic(...) returns false for this expression only when X is in fact null. getVGTerms should follow suit. In fact, this may be better implemented as a method on ResultIterator rather than on Database.

The simple way to implement this would be to use Eval.inline() to assign all of the Column() values and then emit the VGTerms remaining in the reduced expression.
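
The inline-then-collect approach can be sketched on a toy expression tree. Everything below is a stand-in written for illustration: the classes, inline(), and vg_terms() are hypothetical and are not Mimir's Eval.inline() or its expression types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class Const:
    value: object

@dataclass(frozen=True)
class VGTerm:
    name: str

@dataclass(frozen=True)
class IfNull:
    """CASE WHEN test IS NULL THEN if_null ELSE otherwise END"""
    test: object
    if_null: object
    otherwise: object

def inline(expr, row):
    """Substitute the row's column values and fold any CASE that becomes decidable."""
    if isinstance(expr, Column):
        return Const(row[expr.name])
    if isinstance(expr, IfNull):
        test = inline(expr.test, row)
        if isinstance(test, Const):
            branch = expr.if_null if test.value is None else expr.otherwise
            return inline(branch, row)
        return IfNull(test, inline(expr.if_null, row), inline(expr.otherwise, row))
    return expr

def vg_terms(expr):
    """Collect the names of the VGTerms that remain in an expression."""
    if isinstance(expr, VGTerm):
        return {expr.name}
    if isinstance(expr, IfNull):
        return vg_terms(expr.test) | vg_terms(expr.if_null) | vg_terms(expr.otherwise)
    return set()

expr = IfNull(Column("X"), VGTerm("foo"), Column("X"))
print(vg_terms(inline(expr, {"X": None})))  # row where X is null
print(vg_terms(inline(expr, {"X": 7})))     # row where X is 7
```

For the row where X is null the reduced expression is the VGTerm itself, so foo is reported; for the row where X is 7 the CASE folds to a constant and no VGTerms remain.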

Query flow diagram

A GITFlow-style diagram of the query currently being displayed in the web view.

Trouble uploading CSV files

[error] - play.core.server.netty.PlayDefaultUpstreamHandler - Cannot invoke the action
java.sql.SQLException: near ".": syntax error
    at org.sqlite.core.NativeDB.throwex(NativeDB.java:397) ~[sqlite-jdbc-3.8.7.jar:na]
    at org.sqlite.core.NativeDB._exec(Native Method) ~[sqlite-jdbc-3.8.7.jar:na]
    at org.sqlite.jdbc3.JDBC3Statement.executeUpdate(JDBC3Statement.java:116) ~[sqlite-jdbc-3.8.7.jar:na]
    at mimir.sql.JDBCBackend.update(JDBCBackend.scala:48) ~[classes/:na]
    at mimir.Database.update(Database.scala:94) ~[classes/:na]
    at mimir.Database.handleLoadTable(Database.scala:291) ~[classes/:na]
    at mimir.WebAPI.configure(WebAPI.scala:50) ~[classes/:na]
    at controllers.Application$$anonfun$loadTable$1$$anonfun$apply$1.apply(Application.scala:123) ~[classes/:na]
    at controllers.Application$$anonfun$loadTable$1$$anonfun$apply$1.apply(Application.scala:117) ~[classes/:na]

This issue occurs when uploading the file https://github.com/UBOdin/mimir/blob/master/test/data/CPUSpeed.csv

[Image: screen shot 2015-07-27 at 6 04 02 pm]

Bug in join over union

select * from (SELECT * FROM Matched UNION SELECT * FROM typedratings1) ratings, product where product.pid = ratings.pid

Regression: The percolator should handle full-nondeterministic join conflicts

[info] x handle full-nondeterministic join conflicts
[error]    'PROJECT[A1 <= R_A, B1 <= R_B, N <= {{ test_0[] }}, A2 <= R_A, B2 <= R_B, M <= {{ test_1[] }}](
[error]      SELECT[ (R_A=R_A) ](
[error]        JOIN(
[error]          PROJECT[__LHS_ROWID <= ROWID](
[error]            R(ROWID:int)
[error]          ),
[error]          PROJECT[__RHS_ROWID <= ROWID](
[error]            R(ROWID:int)
[error]          )
[error]        )
[error]      )
[error]    )'
[error]     is not equal to 
[error]    'PROJECT[A1 <= __LHS_R_A, B1 <= __LHS_R_B, N <= {{ test_0[] }}, A2 <= __RHS_R_A, B2 <= __RHS_R_B, M <= {{ test_1[] }}](
[error]      SELECT[ (__LHS_R_A=__RHS_R_A) ](
[error]        JOIN(
[error]          PROJECT[__LHS_R_A <= R_A, __LHS_R_B <= R_B, __LHS_R_C <= R_C, __LHS_ROWID <= ROWID](
[error]            R(ROWID:int)
[error]          ),
[error]          PROJECT[__RHS_R_A <= R_A, __RHS_R_B <= R_B, __RHS_R_C <= R_C, __RHS_ROWID <= ROWID](
[error]            R(ROWID:int)
[error]          )
[error]        )
[error]      )
[error]    )' (CompilerSpec.scala:211)
[error] Expected: ...OJECTA1...= [__LHS_]R_...= [__LHS_]R_..._0[] }...= [__]R[HS]_[R_]A,...= [__]R[HS]_[R_]B,..._1[] }}
[error] ...ELECT ([__LHS_]R_A=[__]R[HS]_[R_]A) 
[error] ...JOIN(
[error] ...OJECT__..._R[_A <= R_A, __LHS_R_B <= R_B, __LHS_R_C <= R_C, __LHS_R]OWID ...
[error] ...:int)
[error] ...   ),
[error] ...OJECT__..._R[_A <= R_A, __RHS_R_B <= R_B, __RHS_R_C <= R_C, __RHS_R]OWID ...
[error] ...:int)
[error] ...    )
[error]     )
[error]   )
[error] )
[error] Actual:   ...OJECTA1...= []R_...= []R_..._0[] }...= []R[]_[]A,...= []R[]_[]B,..._1[] }}
[error] ...ELECT ([]R_A=[]R[]_[]A) 
[error] ...JOIN(
[error] ...OJECT__..._R[]OWID ...
[error] ...:int)
[error] ...   ),
[error] ...OJECT__..._R[]OWID ...
[error] ...:int)
[error] ...    )
[error]     )
[error]   )
[error] )
[info] 

Lens builder menu

Add a menu to simplify building lenses

  • build lens 'AS' the current query
  • build lens 'WITH' based on some user-inputs to a dialogue box

Sampling

Sample(Expr) that produces a sample from one possible world of evaluating the expression.
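
A sketch of what sampling from one possible world could look like, assuming each uncertain variable comes with a way to draw a value and an expression is simply a function of the drawn world. The names sample_world, sample, and models, and the two example variables, are hypothetical.

```python
import random

def sample_world(models, rng):
    """Draw one value for every variable; together these form one possible world."""
    return {name: draw(rng) for name, draw in models.items()}

def sample(expr, models, seed=None):
    """Evaluate `expr` (a function of a world) in one randomly drawn world."""
    world = sample_world(models, random.Random(seed))
    return expr(world)

# Hypothetical variables: each maps to a function that draws one value.
models = {
    "rating": lambda rng: rng.choice([3.5, 4.0, 4.5]),
    "review_ct": lambda rng: rng.randint(0, 100),
}

# Both uses of "rating" see the same draw, because they refer to one world.
value = sample(lambda w: w["rating"] - w["rating"] + w["review_ct"], models, seed=7)
print(value)
```

Drawing every variable once per world, rather than once per reference, is what keeps repeated occurrences of the same variable consistent within a sample.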
