scythe's People

Contributors

mestway

scythe's Issues

Machine learning: build a dataset from StackOverflow questions, and use StackOverflow answers as a source of hints for `constraint`

This is a follow-up to #3, and an even crazier idea.

Some StackOverflow posts contain input / output examples for SQL queries, for example:
https://stackoverflow.com/questions/4672523/sql-server-2000-how-do-i-rotate-the-results-of-a-join-in-the-final-results-of

Table A

ID | Name
---+-----
 1 | John
 2 | Jane
 3 | Bob
 4 | Doug

Table B

ID | NameID | Information
---+--------+------------
 1 |    1   | Apples
 2 |    1   | Apples
 3 |    2   | Pears
 4 |    2   | Grapes
 5 |    3   | Kiwi

Desired Result

ID | Name | InformationA | InformationB
---+------+--------------+-------------
 1 | John | Apples       | Apples
 2 | Jane | Pears        | Grapes
 3 | Bob  | Kiwi         | NULL
 4 | Doug | NULL         | NULL

Since StackOverflow provides open data for data mining / machine learning researchers, it wouldn't be too hard to extract all embedded tables from questions tagged [sql]. The hard part is deciding which tables are inputs, which are outputs, and which inputs belong to which output.
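As a rough illustration of the "easy" half, here is a minimal Python sketch that pulls pipe-delimited text tables like the ones above out of a post body. The function names are made up for this example, and the sketch assumes tables use `|` between cells and a line of `-`, `+`, `|` characters under the header; real posts use many other layouts. Cell values such as `NULL` are kept as plain strings.

```python
def _is_separator(line):
    """True for rule lines such as '---+-----' (only '-', '+', '|' characters)."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) <= set("-+|")


def parse_ascii_tables(text):
    """Return a list of (header, rows) tuples for each text table found.

    Hypothetical helper: a table is any run of consecutive lines that
    contain '|' or are separator rules.
    """
    tables, block = [], []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last block
        if "|" in line or _is_separator(line):
            block.append(line)
            continue
        rows = [
            [cell.strip() for cell in row.split("|")]
            for row in block
            if not _is_separator(row)
        ]
        if len(rows) >= 2:  # need a header plus at least one data row
            tables.append((rows[0], rows[1:]))
        block = []
    return tables
```

This only recovers the tables themselves. Captions such as "Table A" or "Desired Result" are discarded, and labeling each table as input or output is still the open problem described above.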

If we are able to create such a dataset, we can try to train a joint model: the machine learning part extracts features from highly ranked StackOverflow answers, and the Scythe part uses those features as a source of heuristics to guide the synthesis process.

The dream is that the joint model takes only an input, an output, and a problem description in natural language, queries StackOverflow using the natural language sentences, and tries to generate an answer based on the results. It would be even nicer if it could be integrated into the StackOverflow web UI, though this would require collaboration with the StackOverflow team.

The reality is that even the data mining step is very difficult and expensive, since a lot of manual data cleaning and labeling is involved. I'm not sure whether it is possible to bootstrap from a small dataset.

(Note: a LaTeX dataset was created from math.stackexchange.com posts for training Image-to-Markup, but that dataset is much simpler than the SQL case.)

Evaluate on larger testing set

Hey, this is an awesome project, great job!

Have you ever considered evaluating Scythe on a larger dataset?

One idea is to clone the git repository of SQLite, which contains 1000+ test files with thousands of SQL queries: https://github.com/mackyle/sqlite/tree/master/test

If we run the test suite, we can generate thousands of input/output pairs and have thousands of expected SQL statements as ground truth.
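To make the idea concrete, here is a small Python sketch that produces one input/output pair with the standard `sqlite3` module instead of the Tcl test harness. The function name `make_example`, the dictionary layout, and the table contents in the usage note are assumptions for illustration only; they are not taken from the SQLite test suite or from Scythe.

```python
import sqlite3


def make_example(setup_sql, query, table_names):
    """Run `query` on a fresh in-memory database and record inputs and output.

    Hypothetical helper: `setup_sql` creates and fills the tables,
    `table_names` lists the tables to snapshot as inputs.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        inputs = {}
        for name in table_names:
            cur = conn.execute(f'SELECT * FROM "{name}"')
            columns = [d[0] for d in cur.description]
            inputs[name] = (columns, cur.fetchall())
        cur = conn.execute(query)
        output = ([d[0] for d in cur.description], cur.fetchall())
    finally:
        conn.close()
    # The query itself is kept as the ground truth for later comparison.
    return {"inputs": inputs, "output": output, "ground_truth": query}
```

For example, with assumed tables `t1(a, b, c)` and `t2(b, c, d)`:

```python
setup = """
CREATE TABLE t1(a, b, c);
INSERT INTO t1 VALUES (1, 2, 3), (2, 3, 4);
CREATE TABLE t2(b, c, d);
INSERT INTO t2 VALUES (2, 3, 4), (3, 4, 5);
"""
example = make_example(setup, "SELECT * FROM t2 NATURAL JOIN t1", ["t1", "t2"])
```

Row order is not guaranteed without an `ORDER BY`, so outputs are best compared as multisets.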

A typical test looks like this:

do_test join-1.3.1 {
  execsql2 {
    SELECT * FROM t2 NATURAL JOIN t1;
  }
} {b 2 c 3 d 4 a 1 b 3 c 4 d 5 a 2}

Here t1 and t2 were created in earlier lines of the same source file.

Once we run Scythe on those thousands of input/output pairs, we can even compare Scythe's results with the ground truth using Cosette, another great project from your team :)
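Tests of this shape can be scraped with a simple brace matcher. The Python sketch below is only a starting point, under the assumption that each test is written as `do_test NAME { execsql... { SQL } } { EXPECTED }`. It does not understand Tcl quoting, variable substitution, or braces inside SQL string literals, and the function names are invented for this example.

```python
import re


def read_braced(src, start):
    """Return (content, end_index) of the balanced {...} group opening at src[start]."""
    if src[start] != "{":
        raise ValueError("expected '{'")
    depth = 0
    for i in range(start, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return src[start + 1:i], i + 1
    raise ValueError("unbalanced braces")


def extract_tests(src):
    """Collect name, SQL text and expected tokens for simple do_test blocks."""
    tests = []
    for m in re.finditer(r"^do_test\s+(\S+)\s*(?=\{)", src, flags=re.M):
        body, end = read_braced(src, m.end())
        sql_start = re.search(r"execsql2?\s*(?=\{)", body)
        gap = re.match(r"\s*(?=\{)", src[end:])
        if not sql_start or not gap:
            continue  # not the simple shape this sketch handles
        sql, _ = read_braced(body, sql_start.end())
        expected, _ = read_braced(src, end + gap.end())
        tests.append({
            "name": m.group(1),
            "sql": " ".join(sql.split()),   # collapse whitespace
            "expected": expected.split(),   # naive tokenization, no Tcl quoting
        })
    return tests
```

Note that `execsql2` interleaves column names with values in the expected result, as in the example above, so the tokens would still need to be regrouped into rows.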

One potential difficulty with the SQLite test suite is that there is no constraint information, and I have no idea what to do about that. As a toy task, it might be interesting to use keywords from the ground truth as the constraint, to see how Scythe works with "leaked information". You are the expert in this area, so you must have better ideas :)
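The "leaked information" toy task could be sketched like this in Python: pull string and numeric literals and aggregate function names out of the ground-truth query. The key names `constants` and `aggregation_functions` are an assumption about what a constraint might look like, and the aggregate list is a hand-picked set; neither is confirmed in this thread.

```python
import re

# Hand-picked set of aggregate names; an assumption for this sketch.
AGGREGATES = {"count", "sum", "avg", "min", "max"}

STRING_LITERAL = r"'((?:[^']|'')*)'"  # SQL single-quoted string, '' as escape


def leaked_constraint(sql):
    """Build a constraint-like dict from literals and aggregates in `sql`."""
    called = {name.lower() for name in re.findall(r"\b([A-Za-z_]+)\s*\(", sql)}
    strings = re.findall(STRING_LITERAL, sql)
    # Blank out string literals so digits inside them are not counted twice.
    without_strings = re.sub(STRING_LITERAL, " ", sql)
    # Standalone numbers only: skips digits inside identifiers such as t1.
    numbers = re.findall(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])", without_strings)
    return {
        "constants": strings + numbers,
        "aggregation_functions": sorted(called & AGGREGATES),
    }
```

Constants are returned as strings exactly as they appear in the query; converting them to typed values is left out of the sketch.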

Good luck!

Jar file is created with no main manifest attribute.

Thanks a ton for this awesome Scythe :-)
I see that the jar files are created with no main manifest attribute. Could you please fix this?

However, I was able to run it using
"java -cp SimpleSynthesizer.jar main.Main path/to/the/example/file SymbolicEnumerator 2"

Thanks
-Swamy
