mestway / scythe

Synthesizing SQL queries from input / output examples

Java 82.74% Shell 0.05% Python 14.54% Roff 2.67%


scythe's Issues

allow running queries on uploaded csv

Right now that option is grayed out.

Machine learning: build a dataset using StackOverflow questions, and use StackOverflow answers as a source of hints for `constraint`

This is a follow-up to #3, and an even crazier idea.

Some StackOverflow posts contain input / output examples for SQL queries, for example:
https://stackoverflow.com/questions/4672523/sql-server-2000-how-do-i-rotate-the-results-of-a-join-in-the-final-results-of

Table A

ID | Name
---+-----
 1 | John
 2 | Jane
 3 | Bob
 4 | Doug

Table B

ID | NameID | Information
---+--------+------------
 1 |    1   | Apples
 2 |    1   | Apples
 3 |    2   | Pears
 4 |    2   | Grapes
 5 |    3   | Kiwi

Desired Result

ID | Name | InformationA | InformationB
---+------+--------------+-------------
 1 | John | Apples       | Apples
 2 | Jane | Pears        | Grapes
 3 | Bob  | Kiwi         | NULL
 4 | Doug | NULL         | NULL

Since StackOverflow provides open data for data mining / machine learning researchers, it wouldn't be too hard to extract all embedded tables from StackOverflow questions tagged [sql]. The hard part is deciding which table is an input, which table is an output, and which input is related to which output.

If we are able to create such a dataset, then we can try to train a joint model: the machine learning part extracts features from highly ranked StackOverflow answers, and the Scythe part uses the extracted features as a source of heuristics to guide the synthesis process.

The dream is that the joint model takes only the input, the output and a problem description in natural language, queries StackOverflow using the natural language sentences, and tries to generate an answer based on the results. It would be even nicer if it could be integrated into the StackOverflow web UI, although this would require collaboration with the StackOverflow team.

The reality is that even the data mining step is difficult and expensive, since a lot of manual data cleaning and labeling is involved. I am not sure whether it is possible to bootstrap from a small dataset.

(Note: a LaTeX dataset was created from math.stackexchange.com posts for training Image-to-Markup, but that dataset is much simpler than the SQL case.)
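As a rough illustration of the extraction step, here is a minimal Python sketch that turns one pipe-delimited text table, like the ones quoted above, into a list of rows. The function name `parse_text_table` and the choice to map the literal text `NULL` to `None` are assumptions made for this example; real posts use many other table layouts, and this does nothing to decide which table is an input and which is an output.

```python
def parse_text_table(block):
    """Parse a 'col | col' text table with a '---+---' separator line.

    Returns (header, rows), where rows is a list of dicts keyed by column name.
    Assumption: the literal cell text NULL stands for a missing value.
    """
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        # Skip separator lines made only of '-' and '+'.
        if set(ln.strip()) <= set("-+"):
            continue
        cells = [cell.strip() for cell in ln.split("|")]
        # Ignore lines that do not match the header width.
        if len(cells) != len(header):
            continue
        values = [None if cell == "NULL" else cell for cell in cells]
        rows.append(dict(zip(header, values)))
    return header, rows


if __name__ == "__main__":
    table_a = """
    ID | Name
    ---+-----
     1 | John
     2 | Jane
    """
    print(parse_text_table(table_a))
```

All values are kept as strings; inferring column types would be a separate step.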

Evaluate on larger testing set

Hey, this is an awesome project, great job!

Have you ever considered evaluating Scythe on a larger dataset?

One idea is to clone the git repository of SQLite, which contains 1000+ test files with thousands of SQL queries: https://github.com/mackyle/sqlite/tree/master/test

If we run the test suite, we can generate thousands of input/output pairs and have thousands of expected SQL statements as ground truth.

A typical test is like below:

do_test join-1.3.1 {
  execsql2 {
    SELECT * FROM t2 NATURAL JOIN t1;
  }
} {b 2 c 3 d 4 a 1 b 3 c 4 d 5 a 2}

Here t1 and t2 were created in earlier lines of the same source file.
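To give a feel for how such tests could be harvested, here is a minimal Python sketch that pulls the test name, the SQL text and the expected result out of `do_test` blocks shaped like the one above. The helper names `parse_do_tests` and `rows_from_execsql2` are made up for this example. It assumes the simple layout shown here (one `execsql` or `execsql2` call, no nested braces), and it assumes that the `execsql2` result is a flat list of alternating column names and values, with a new row starting whenever the first column name reappears. Many real test files are more complicated than this.

```python
import re

# Matches: do_test <name> { execsql|execsql2 { <sql> } } { <expected> }
DO_TEST = re.compile(
    r"do_test\s+(?P<name>\S+)\s*\{\s*"
    r"(?P<runner>execsql2?)\s*\{(?P<sql>.*?)\}\s*"
    r"\}\s*\{(?P<expected>.*?)\}",
    re.DOTALL,
)


def parse_do_tests(source):
    """Return one dict per simple do_test block found in the source text."""
    cases = []
    for m in DO_TEST.finditer(source):
        cases.append({
            "name": m.group("name"),
            "runner": m.group("runner"),
            # Collapse whitespace so the SQL fits on one line.
            "sql": " ".join(m.group("sql").split()),
            # Whitespace-separated tokens; values containing spaces are not handled.
            "expected": m.group("expected").split(),
        })
    return cases


def rows_from_execsql2(tokens):
    """Group alternating name/value tokens into rows (assumed layout)."""
    pairs = list(zip(tokens[0::2], tokens[1::2]))
    rows = []
    for name, value in pairs:
        # Start a new row at the first pair and whenever the first column repeats.
        if not rows or name == pairs[0][0]:
            rows.append({})
        rows[-1][name] = value
    return rows
```

The parsed SQL would serve as the ground truth, and the grouped rows as the output example; the input tables would still have to be recovered from the statements that create t1 and t2.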

Once we run Scythe on those thousands of input/output, we can even compare the Scythe result with the ground truth using Cosette, another great project from your team :)

One potential difficulty with the SQLite test suite is that it has no constraint information, and I have no idea what to do about that. As a toy task, it might be interesting to use keywords from the ground truth as constraints, to see how Scythe works with "leaked information". You are the experts in this area, so you must have better ideas :)
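The "leaked information" toy task could be sketched roughly as follows: scan the ground-truth SQL for aggregate function names and literal constants, and hand those to the synthesizer as hints. The function name `leak_hints` and the dictionary keys are hypothetical and are not taken from Scythe's documentation; the actual constraint format would need to be checked against the project's example files.

```python
import re

# Aggregates to look for; this list is an assumption, not Scythe's supported set.
AGGREGATES = ("count", "sum", "avg", "min", "max")

STRING_LITERAL = r"'((?:[^']|'')*)'"


def leak_hints(sql):
    """Collect aggregate names and literal constants from a ground-truth query."""
    lowered = sql.lower()
    aggs = [f for f in AGGREGATES if re.search(r"\b%s\s*\(" % f, lowered)]
    strings = re.findall(STRING_LITERAL, sql)
    # Remove string literals first so digits inside them are not counted twice.
    without_strings = re.sub(STRING_LITERAL, " ", sql)
    # Standalone numbers only; digits inside identifiers such as t1 are skipped.
    numbers = re.findall(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])", without_strings)
    # Hypothetical key names, chosen only for this sketch.
    return {"aggregation_functions": aggs, "constants": strings + numbers}
```

For the join test above this would yield no hints at all, which is itself a useful case: it shows how Scythe behaves when the constraint is empty.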

Good luck!

Jar file is created with no main manifest attribute.

Thanks a ton for this awesome Scythe :-)
I see that the jar files are created with no main manifest attribute. Could you please fix this?

However, I was able to run it using
"java -cp SimpleSynthesizer.jar main.Main path/to/the/example/file SymbolicEnumerator 2"

Thanks
-Swamy
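For anyone who wants to confirm the symptom before rebuilding, here is a minimal Python sketch that checks whether a jar declares a `Main-Class` in its manifest. It relies only on the fact that a jar is a zip archive with its manifest at `META-INF/MANIFEST.MF`. The function name `main_class_of` is made up for this example, and long manifest values wrapped onto continuation lines are not handled.

```python
import zipfile


def main_class_of(jar):
    """Return the Main-Class declared in the jar's manifest, or None.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as zf:
        try:
            manifest = zf.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None  # the jar has no manifest at all
    for line in manifest.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "main-class":
            return value.strip()
    return None
```

If this returns `None` for the built jar, `java -jar` has no entry point to use, and the class has to be named explicitly with `-cp` as in the workaround above.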
