jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.

Home Page: http://jkk.name/text2sql-data/

License: Other

Languages: Python 98.56%, CSS 0.11%, JavaScript 0.51%, Perl 0.60%, Shell 0.18%, HTML 0.05%
Topics: dataset, sql, nlp, natural-language-processing, evaluation, neural-network, dynet, database, natural-language-interface

text2sql-data's Introduction

text2sql-data

This repository contains data and code for building and evaluating systems that map sentences to SQL, developed as part of our ACL 2018 paper, Improving Text-to-SQL Evaluation Methodology.

For a range of domains, we provide:

  • Sentences with annotated variables
  • SQL queries
  • A database schema
  • A database

These are improved forms of prior datasets and a new dataset we developed. We have separate files describing the datasets, systems, and tools.

Version  Description
4        Data fixes
3        Data fixes and addition of data from Spider and WikiSQL
2        Data with fixes for variables incorrectly defined in questions
1        Data used in the ACL 2018 paper

Citing this work

If you use this data in your work, please cite our ACL paper and the appropriate original sources, and list the version number of the data. For example, in your paper you could write (using the BibTeX below):

In this work, we use version 4 of the modified SQL datasets from \citet{data-advising}, based on \citet{data-academic,data-atis-original,data-geography-original,data-atis-geography-scholar,data-imdb-yelp,data-restaurants-logic,data-restaurants-original,data-restaurants,data-spider,data-wikisql}.

If you are only using one dataset, here are example citation commands:

Data         Cite
Academic     \citet{data-advising,data-academic}
Advising     \citet{data-advising}
ATIS         \citet{data-advising,data-atis-original,data-atis-geography-scholar}
Geography    \citet{data-advising,data-geography-original,data-atis-geography-scholar}
Restaurants  \citet{data-advising,data-restaurants-logic,data-restaurants-original,data-restaurants}
Scholar      \citet{data-advising,data-atis-geography-scholar}
Spider       \citet{data-advising,data-spider}
IMDB         \citet{data-advising,data-imdb-yelp}
Yelp         \citet{data-advising,data-imdb-yelp}
WikiSQL      \citet{data-advising,data-wikisql}

@InProceedings{data-advising,
  dataset   = {Advising},
  author    = {Catherine Finegan-Dollak and Jonathan K. Kummerfeld and Li Zhang and Karthik Ramanathan and Sesh Sadasivam and Rui Zhang and Dragomir Radev},
  title     = {Improving Text-to-SQL Evaluation Methodology},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  location  = {Melbourne, Victoria, Australia},
  pages     = {351--360},
  url       = {http://aclweb.org/anthology/P18-1033},
}

@InProceedings{data-imdb-yelp,
  dataset   = {IMDB and Yelp},
  author    = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
  title     = {SQLizer: Query Synthesis from Natural Language},
  booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
  month     = {October},
  year      = {2017},
  pages     = {63:1--63:26},
  url       = {http://doi.org/10.1145/3133887},
}

@article{data-academic,
  dataset   = {Academic},
  author    = {Fei Li and H. V. Jagadish},
  title     = {Constructing an Interactive Natural Language Interface for Relational Databases},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {8},
  number    = {1},
  month     = {September},
  year      = {2014},
  pages     = {73--84},
  url       = {http://dx.doi.org/10.14778/2735461.2735468},
} 

@InProceedings{data-atis-geography-scholar,
  dataset   = {Scholar, and Updated ATIS and Geography},
  author    = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
  title     = {Learning a Neural Semantic Parser from User Feedback},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2017},
  pages     = {963--973},
  location  = {Vancouver, Canada},
  url       = {http://www.aclweb.org/anthology/P17-1089},
}

@inproceedings{data-atis-original,
  dataset   = {ATIS, original},
  author    = {Deborah A. Dahl and Madeleine Bates and Michael Brown and William Fisher and Kate Hunicke-Smith and David Pallett and Christine Pao and Alexander Rudnicky and Elizabeth Shriberg},
  title     = {{Expanding the scope of the ATIS task: The ATIS-3 corpus}},
  booktitle = {Proceedings of the Workshop on Human Language Technology},
  year      = {1994},
  pages     = {43--48},
  url       = {http://dl.acm.org/citation.cfm?id=1075823},
}

@inproceedings{data-geography-original,
  dataset   = {Geography, original},
  author    = {John M. Zelle and Raymond J. Mooney},
  title     = {Learning to Parse Database Queries Using Inductive Logic Programming},
  booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
  year      = {1996},
  pages     = {1050--1055},
  location  = {Portland, Oregon},
  url       = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}

@inproceedings{data-restaurants-logic,
  author    = {Lappoon R. Tang and Raymond J. Mooney},
  title     = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
  booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
  year      = {2000},
  pages     = {133--141},
  location  = {Hong Kong, China},
  url       = {http://www.aclweb.org/anthology/W00-1317},
}

@inproceedings{data-restaurants-original,
  author    = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
  title     = {Towards a Theory of Natural Language Interfaces to Databases},
  booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
  year      = {2003},
  location  = {Miami, Florida, USA},
  pages     = {149--157},
  url       = {http://doi.acm.org/10.1145/604045.604070},
}

@inproceedings{data-restaurants,
  author    = {Alessandra Giordani and Alessandro Moschitti},
  title     = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
  booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
  year      = {2012},
  location  = {Montpellier, France},
  pages     = {59--76},
  url       = {https://doi.org/10.1007/978-3-642-45260-4_5},
}

@InProceedings{data-spider,
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  year      = {2018},
  location  = {Brussels, Belgium},
  pages     = {3911--3921},
  url       = {http://aclweb.org/anthology/D18-1425},
}

@article{data-wikisql,
  author    = {Victor Zhong and Caiming Xiong and Richard Socher},
  title     = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
  year      = {2017},
  journal   = {CoRR},
  volume    = {abs/1709.00103},
}

Contributions

We put substantial effort into fixing bugs in the datasets, but none of them are perfect. If you find a bug, please submit a pull request with a fix. We will merge fixes into a development branch and only infrequently merge those changes into the master branch, at which point this page will be updated to note a new release. This approach is intended to balance the need for clear comparisons between systems with ongoing improvement of the data.

For some ideas of issues to address, see our list of known issues.

Acknowledgments

This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of IBM.

text2sql-data's People

Contributors

cfdollak, jkkummerfeld, prasad83, todpole3, whwang299


text2sql-data's Issues

SPARQL for Question Answering

The approach of making dual predictions on the choice of template and the slots to be filled within that template offered valuable insights into handling structured query language translations.

I have been attempting to extend its application to SPARQL datasets within my own project focused on converting text to SPARQL queries for knowledge representation. However, I encountered difficulties in adapting the methodology for SPARQL.

Are there any adaptations or modifications you would recommend for effectively generating SPARQL templates from training sets?
This is the dataset I'm working on:
SPARQL.json

Downloading the database

Hi @jkkummerfeld,

Thanks for open-sourcing the code and datasets.

The Google Drive links for the Academic (MAS), IMDB, Yelp, and Scholar databases are not publicly accessible.
Could you please share them with my email ([email protected])?

Thanks :)

Hyperparameters for small datasets

Hi Jonathan - just wondering how the hyperparameters for the slot-filling baseline were chosen? The code doesn't do any early stopping based on a development split - were these based on some intuition about these datasets' sizes relative to the datasets which do have train/dev/test splits?

Thanks!

Possible Bug in the Template-Baseline Evaluation?

Hi!

Thanks for providing the datasets and the baseline.
I think I have encountered a bug in the evaluation of the template baseline, which might be inflating the figures for this baseline:

The run_eval() function checks if the tags and the template match exactly. If so, it marks the prediction as correct. However, there are some examples where not all the fields in the template can be directly copied over as 'tags' from the question. Here are some examples:

['what', 'are', 'the', 'nonstop', 'flights', 'between', 'SAN', 'JOSE', 'and', 'HOUSTON', 'TEXAS']
['O', 'O', 'O', 'O', 'O', 'O', 'city_name0', 'city_name0', 'O', 'city_name1', 'state_name0']
SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM FLIGHT AS FLIGHTalias0 WHERE ( ( FLIGHTalias0.FROM_AIRPORT IN ( SELECT AIRPORT_SERVICEalias0.AIRPORT_CODE FROM AIRPORT_SERVICE AS AIRPORT_SERVICEalias0 WHERE AIRPORT_SERVICEalias0.CITY_CODE IN ( SELECT CITYalias0.CITY_CODE FROM CITY AS CITYalias0 WHERE CITYalias0.CITY_NAME = " city_name0 " ) ) 
AND FLIGHTalias0.TO_AIRPORT IN ( SELECT AIRPORT_SERVICEalias1.AIRPORT_CODE FROM AIRPORT_SERVICE AS AIRPORT_SERVICEalias1 WHERE AIRPORT_SERVICEalias1.CITY_CODE IN ( SELECT CITYalias1.CITY_CODE FROM CITY AS CITYalias1 
WHERE ( CITYalias1.CITY_NAME = " city_name1 " AND CITYalias1.STATE_CODE IN ( SELECT STATEalias0.STATE_CODE FROM STATE AS STATEalias0 WHERE STATEalias0.STATE_NAME = " state_name0 " ) ) ) ) ) AND FLIGHTalias0.STOPS = stops0 ) ;

(In this example, the stops0 parameter in the SQL query is never filled.)


['los', 'angeles', 'to', 'pittsburgh', 'afternoon', 'tuesday']
['city_name1', 'city_name1', 'O', 'city_name0', 'O', 'day_name0']
SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM AIRPORT AS AIRPORTalias0 , AIRPORT AS AIRPORTalias1 , AIRPORT_SERVICE AS AIRPORT_SERVICEalias0 , AIRPORT_SERVICE AS AIRPORT_SERVICEalias1 , CITY AS CITYalias0 , CITY AS CITYalias1 , DAYS AS DAYSalias0 , FLIGHT AS FLIGHTalias0 
WHERE AIRPORTalias0.AIRPORT_CODE = AIRPORT_SERVICEalias0.AIRPORT_CODE AND AIRPORTalias1.AIRPORT_CODE = AIRPORT_SERVICEalias1.AIRPORT_CODE AND CITYalias0.CITY_CODE = AIRPORT_SERVICEalias0.CITY_CODE AND CITYalias0.CITY_NAME = " city_name0 " AND CITYalias1.CITY_CODE = AIRPORT_SERVICEalias1.CITY_CODE AND CITYalias1.CITY_NAME = " city_name1 " AND DAYSalias0.DAY_NAME = " day_name0 " 
AND FLIGHTalias0.DEPARTURE_TIME BETWEEN departure_time0 AND departure_time1 AND FLIGHTalias0.FLIGHT_DAYS = DAYSalias0.DAYS_CODE AND FLIGHTalias0.FROM_AIRPORT = AIRPORTalias1.AIRPORT_CODE AND FLIGHTalias0.TO_AIRPORT = AIRPORTalias0.AIRPORT_CODE ; 

(In this example, the departure_time0 and departure_time1 parameters in the template are never filled.)

Is this a bug, or am I missing something?

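For reference, a minimal sketch of the kind of exact-match check described above (my own reconstruction, not the repository's actual code); note that nothing verifies that every slot in the template is filled by a tag:

def is_correct(pred_template, gold_template, pred_tags, gold_tags):
    # A prediction counts as correct when the predicted template and the
    # predicted tag sequence both exactly match the gold ones, even if the
    # template contains parameters (e.g. stops0) that no tag can fill.
    return pred_template == gold_template and pred_tags == gold_tags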

Installing DyNet

I have a problem installing DyNet: following the documentation, I get a "Repository not found" error when trying to install Eigen.
Is there another approach to install DyNet directly?

Beam Search for Attention Copying seq2seq

When setting the model parameter inference.beam_search.beam_width to a value greater than 1, I get the following error during inference:

ValueError: Tried to convert 'input' to a tensor and failed. Error: None values not supported.

Any help would be appreciated. Thanks! :)

Full trace:

Traceback (most recent call last):
  File "bin/infer.py", line 129, in <module>
    tf.app.run()
  File "/home/xxxxxx/miniconda3/envs/tensorflow1.3cpu/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "bin/infer.py", line 106, in main
    batch_size=FLAGS.batch_size)
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/inference/inference.py", line 57, in create_inference_graph
    return model(features=features, labels=labels, params=None)
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/models/model_base.py", line 146, in __call__
    return self._build(features, labels, params)
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/models/seq2seq_model.py", line 298, in _build
    decoder_output, _, = self.decode(encoder_output, features, labels)
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/graph_utils.py", line 38, in func_wrapper
    return templated_func(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/tensorflow1.3cpu/lib/python2.7/site-packages/tensorflow/python/ops/template.py", line 278, in __call__
    result = self._call_func(args, kwargs, check_for_new_variables=False)
  File "/home/xxxxxx/miniconda3/envs/tensorflow1.3cpu/lib/python2.7/site-packages/tensorflow/python/ops/template.py", line 217, in _call_func
    result = self._func(*args, **kwargs)
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/models/basic_seq2seq.py", line 113, in decode
    decoder = self._create_decoder(encoder_output, features, labels)
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/models/attention_copying_seq2seq.py", line 64, in _create_decoder
    multiples=[self.params["inference.beam_search.beam_width"]])
  File "/home/xxxxxx/miniconda3/envs/tensorflow1.3cpu/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3847, in tile
    name=name)
  File "/home/xxxxxx/miniconda3/envs/tensorflow1.3cpu/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 508, in apply_op
    (input_name, err))
ValueError: Tried to convert 'input' to a tensor and failed. Error: None values not supported.

originally defined at:
  File "/home/xxxxxx/text2sql/text2sql-data/systems/sequence-to-sequence/seq2seq/graph_utils.py", line 37, in func_wrapper
    templated_func = tf.make_template(name_, func)

Test leakage in datasets which use cross-validation and `query-split=True`

I believe there is test set leakage in the case that the dataset requires the use of cross-validation (the Academic, IMDB, Restaurants, and Yelp datasets) and the args.query_split flag is set to True.

This bug affects the baseline script via this function, copied verbatim from https://github.com/jkkummerfeld/text2sql-data/blob/master/systems/baseline-template/text2sql-template-baseline.py#L114 with comments added to demonstrate the problem:

def get_tagged_data_for_query(data):
    # In the case that we are using a dataset which relies on cross-validation,
    # `dataset` will be a numeric value.
    dataset = data['query-split']
    for sent_info in data['sentences']:
        if not args.query_split:
            dataset = sent_info['question-split']

        # args.split specifies which fold we should use for cross-validation.
        if args.split is not None:
            # Here is the problematic section - on the first iteration this
            # check does the correct thing, because both `args.split` and
            # `dataset` are numeric values. However, `dataset` is overwritten
            # with "train" or "test" below, so on every later iteration the
            # comparison fails and all remaining sentences in the query split
            # are incorrectly assigned to "train".
            if str(args.split) == str(dataset):
                dataset = "test"
            else:
                dataset = "train"

        for sql in data['sql']:
            sql_vars = {}
            for sql_var in data['variables']:
                sql_vars[sql_var['name']] = sql_var['example']
            text = sent_info['text']
            text_vars = sent_info['variables']

            yield (dataset, insert_variables(sql, sql_vars, text, text_vars))

            if not args.use_all_sql:
                break

Note that this bug doesn't affect the results for the question split, because it is guarded against: the dataset variable is reset from sent_info['question-split'] on every iteration.
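A minimal sketch of one possible fix (untested, using the same names as the baseline script): recompute the split from the raw JSON value for every sentence, so the "train"/"test" string assigned on a previous iteration is never compared against args.split.

def get_tagged_data_for_query(data):
    for sent_info in data['sentences']:
        # Recompute the split from the raw JSON value on every iteration,
        # so an earlier "train"/"test" assignment cannot leak through.
        dataset = data['query-split']
        if not args.query_split:
            dataset = sent_info['question-split']
        if args.split is not None:
            dataset = "test" if str(args.split) == str(dataset) else "train"

        for sql in data['sql']:
            sql_vars = {}
            for sql_var in data['variables']:
                sql_vars[sql_var['name']] = sql_var['example']
            yield (dataset, insert_variables(sql, sql_vars,
                                             sent_info['text'],
                                             sent_info['variables']))
            if not args.use_all_sql:
                break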

As an example, the following JSON blob (with --query-split=True and --split=1) will get split across the test set (first sentence) and training set (second two sentences), when it should be entirely in the test set. This is a particularly tricky problem, because some of the sentences which share the same SQL query are actually identical, meaning that the test split essentially contains duplicates from the training split.

    {
        "comments": [],
        "old-name": "",
        "query-split": "1",
        "sentences": [
            {
                "question-split": "0",
                "text": "how many name0 are there in city_name0 ?",
                "variables": {
                    "city_name0": "san francisco",
                    "name0": "buttercup kitchen"
                }
            },
            {
                "question-split": "6",
                "text": "how many name0 are there in city_name0 ?",
                "variables": {
                    "city_name0": "san francisco",
                    "name0": "buttercup kitchen"
                }
            },
            {
                "question-split": "6",
                "text": "how many name0 are there in city_name0 ?",
                "variables": {
                    "city_name0": "san francisco",
                    "name0": "buttercup kitchen"
                }
            }
        ],
        "sql": [
            "SELECT COUNT( * ) FROM LOCATION AS LOCATIONalias0 , RESTAURANT AS RESTAURANTalias0 WHERE LOCATIONalias0.CITY_NAME = \"city_name0\" AND RESTAURANTalias0.ID = LOCATIONalias0.RESTAURANT_ID AND RESTAURANTalias0.NAME = \"name0\" ;"
        ],
        "variables": [
            {
                "example": "san francisco",
                "location": "unk",
                "name": "city_name0",
                "type": "city_name"
            },
            {
                "example": "buttercup kitchen",
                "location": "unk",
                "name": "name0",
                "type": "name"
            }
        ]
    },

Fortunately, I think this bug actually reinforces the point of the paper - the validation scores will only fall for the datasets where this happens, which, as you note in section 5.4 of your paper, are the database-community datasets:

"For the three datasets developed by the databases community, the effect of question-query split is far less pronounced."

Possibly, this bug is an underlying cause of that observation, given that these 3 datasets are all ones which use cross-validation.

This is a great resource btw - currently we are working on adding it to allennlp to do some type-constrained decoding based on a SQL grammar. If you are interested, we'd be happy to accept PRs to get your baseline into allennlp, where it would be tested and reviewed etc.

Data in Spider format

Hi,
Can you please share any tips for constructing the IMDB, Scholar, and Yelp datasets in Spider format (including tables.json, train.json, and dev.json)?
Thank you

Anonymised Variables should have consistent naming corresponding to their column

It's a little annoying that the anonymised variable names sometimes, but not always, correspond to the table/column name they come from. E.g. in some datasets, like academic, the variable name is derived from the column name:

        "sql": [
            "SELECT JOURNALalias0.HOMEPAGE FROM JOURNAL AS JOURNALalias0 WHERE JOURNALalias0.NAME = \"journal_name0\" ;"
        ],
        "variables": [
            {
                "example": "PVLDB",
                "location": "both",
                "name": "journal_name0",
                "type": "journal_name"
            }
        ]

whereas in geography, variables are named var0, var1, etc., so you cannot directly infer the column they come from, from either the name or the type key.

        "sql": [
            "SELECT CITYalias0.CITY_NAME FROM CITY AS CITYalias0 WHERE CITYalias0.POPULATION = ( SELECT MAX( CITYalias1.POPULATION ) FROM CITY AS CITYalias1 WHERE CITYalias1.STATE_NAME = \"var0\" ) AND CITYalias0.STATE_NAME = \"var0\" ;"
        ],
        "variables": [
            {
                "example": "arizona",
                "location": "both",
                "name": "var0",
                "type": "state"
            }
        ]
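For what it's worth, a minimal sketch (a hypothetical helper, not part of the repository) of one way to derive consistent names from each variable's type key:

from collections import defaultdict

def consistent_names(variables):
    # Map each existing variable name (e.g. "var0") to a name derived from
    # its "type" key (e.g. "state0"), so the source column can be guessed
    # from the name alone.
    counts = defaultdict(int)
    mapping = {}
    for var in variables:
        var_type = var['type']
        mapping[var['name']] = '{}{}'.format(var_type, counts[var_type])
        counts[var_type] += 1
    return mapping

Applied to the geography example above, this would map var0 to state0, which could then be substituted into both the SQL string and the variable entries.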

Spacing error(?) in geography.json

Hi @jkkummerfeld

I think "(SELECT" in the code below should be changed to "( SELECT" for consistency.

"SELECT DISTINCT RIVERalias0.RIVER_NAME FROM RIVER AS RIVERalias0 WHERE RIVERalias0.LENGTH = (SELECT MAX( RIVERalias1.LENGTH ) FROM RIVER AS RIVERalias1 WHERE RIVERalias1.RIVER_NAME NOT IN ( SELECT RIVERalias2.RIVER_NAME FROM RIVER AS RIVERalias2 WHERE RIVERalias2.TRAVERSE = \"state_name0\" ) ) AND RIVERalias0.RIVER_NAME NOT IN ( SELECT RIVERalias3.RIVER_NAME FROM RIVER AS RIVERalias3 WHERE RIVERalias3.TRAVERSE = \"state_name0\" ) ;"

Thanks for the great work.

Wonseok

How should "sql-only" variables in the Advising dataset be handled?

In the advising dataset, there are a few examples which raise some questions.

  1. Some of the questions paired with a single SQL query do not have equivalent semantics (no problem if this is just part and parcel of building a dataset; I can imagine there is some noise).
    An example of this is the first two questions below, where you would expect one to return a list of classes which don't have lab sessions, whereas you would expect the other to return a boolean value. Is this a broad distinction that is not drawn in the standardisation of the datasets, or is this just an error?

  2. This query has a dangling "AND" statement, which I don't think is valid SQL?

  3. What is the significance of "sql-only" variables? Are these variables for which there is only one possible value, or that there is a default value?

  4. There exist some variables in the query which appear to be default values, but which differ in value and are not extracted as sql variables (perhaps this is just an issue with the automated extraction and there's nothing you can do about it). For instance, if you search for "2016" and "2017" in the dataset, they are used interchangeably for queries containing "next year/next semester". Do you have a recommendation for how to treat these?

{
   "query-split":"dev",
   "sentences":[
      {
         "question-split":"train",
         "text":"What classes do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Are there any classes that do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"As far as labs go , are there any classes without them ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"Do any classes not have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"For classes , which ones do n't have labs ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"If I do n't want to have a lab session , which classes should I take ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"Is there a list of classes that do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"List classes without lab sessions .",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"Of these classes , which do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"What classes do not have sessions in the lab ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"test",
         "text":"What classes have no lab sessions ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"What courses are not in the lab ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which classes do n't have any labs ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which classes do not have lab components ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which classes do not require a lab session ?",
         "variables":{
            "department0":""
         }
      },
      {
         "question-split":"train",
         "text":"Which of the classes do n't have lab sessions ?",
         "variables":{
            "department0":""
         }
      }
   ],
   "sql":[
      "SELECT DISTINCT COURSEalias0.NAME , COURSEalias0.NUMBER FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = \"department0\" AND COURSEalias0.HAS_LAB = \"N\" AND ;"
   ],
   "variables":[
      {
         "example":"EECS",
         "location":"sql-only",
         "name":"department0",
         "type":"department"
      }
   ]
}

AllenNLP Text2SQL

Hi,
Can I use the ATIS dataset from here to train an AllenNLP Text2SQL model?

Thank you!

Any canonicalization for quotes?

It looks like the canonicaliser.py file does not contain a procedure for normalizing quotes around values in SQL, such as "EMNLP". For example, some quotes are removable, and sometimes single quotes and double quotes are interchangeable (I'm not entirely sure that this is true).

Hence I wonder whether the dataset has any canonicalization regarding this.
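For illustration, a naive normalisation pass (a sketch only; it ignores escaped quotes and quote characters inside values) might look like:

import re

def normalise_value_quotes(sql):
    # Rewrite single-quoted SQL string literals as double-quoted ones,
    # matching the quoting style used in the rest of the data.
    return re.sub(r"'([^']*)'", r'"\1"', sql)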

American vs UK spelling - Yelp Dataset

The actual yelp database uses American spelling of NEIGHBORHOOD, but the schema and data use NEIGHBOURHOOD.

These should probably be canonicalised (or is that canonicalized?!).
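One possible canonicalisation (a sketch; it picks the American spelling to match the actual database):

def canonicalise_spelling(text):
    # Use the American spelling everywhere, matching the Yelp database.
    return text.replace('NEIGHBOURHOOD', 'NEIGHBORHOOD')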

Run sequence-to-sequence baseline models

I was trying to run the sequence-to-sequence baseline models and noticed that the model implementations in text2sql-data/systems/sequence-to-sequence/seq2seq/models are missing from the repository, so the experiment script ./experiment.sh breaks because it is looking for the class InputAttentionCopyingSeq2Seq.

Would you please complete the repo with these files? Thanks.

Wrong output of Geography using json_to_flat.py

Hi, I used json_to_flat.py to produce data for the query split of Geography, but it seems some data are missing from its SQL queries. In the last line of the test data, there is only the human query; the SQL program is missing.

imdb has a single example which is not valid SQL

https://github.com/jkkummerfeld/text2sql-data/blob/master/data/imdb.json#L2442
I think:

SELECT
  MAX(*) # I'm not sure if this is valid?
FROM
  (
    SELECT
      COUNT(DISTINCT (MOVIEalias0.TITLE))
    FROM
      ACTOR AS ACTORalias0,
      CAST AS CASTalias0,
      MOVIE AS MOVIEalias0
    WHERE
      ACTORalias0.NAME = "actor_name0" # Note no AND here
      CASTalias0.AID = ACTORalias0.AID
      AND MOVIEalias0.MID = CASTalias0.MSID 
      GROUP BY MOVIEalias0.RELEASE_YEAR ) AS DERIVED_TABLEalias0 ;

Should be:

SELECT
 # I'm not sure about this part, maybe the original is just correct
 # if you don't specify an alias for the COUNT column
  MAX(DERIVED_TABLEalias0.COUNT_alias0)
FROM
  (
    SELECT
      COUNT(DISTINCT (MOVIEalias0.TITLE)) AS COUNT_alias0
    FROM
      ACTOR AS ACTORalias0,
      CAST AS CASTalias0,
      MOVIE AS MOVIEalias0
    WHERE
      ACTORalias0.NAME = "actor_name0" AND # added AND
      CASTalias0.AID = ACTORalias0.AID AND
      MOVIEalias0.MID = CASTalias0.MSID
    GROUP BY MOVIEalias0.RELEASE_YEAR ) 
AS DERIVED_TABLEalias0 ;

Correct way of handling the data split

I would like to make sure I'm using correct dataset handling to obtain the results in Table 3 of the paper. Would you please help clarify the following questions?

Based on my understanding of the paper, Table 3 reports test set results over all datasets.

For Advising, Geo, Scholar, and ATIS the datasets were split into train/dev/test. When testing the model on the test set, should we train the models using only the train set, or the train + dev sets combined?

For Academic, IMDB, Restaurants, and Yelp there are only cross-validation splits in the data release. Hence, are the numbers for these datasets in Table 3 cross-validation numbers? If not, how is the test set defined for these datasets?

Thanks for your attention.

Naming error (?) in restaurants.json

Hi @jkkummerfeld

There seems to be a small naming inconsistency between restaurants.json and other files.

In restaurants.json, the RESTAURANT table (when it is represented as RESTAURANTalias0 or RESTAURANTalias1) has ID

"SELECT COUNT( * ) FROM LOCATION AS LOCATIONalias0 , RESTAURANT AS RESTAURANTalias0 WHERE LOCATIONalias0.CITY_NAME = \"city_name0\" AND RESTAURANTalias0.ID = LOCATIONalias0.RESTAURANT_ID AND RESTAURANTalias0.NAME = \"name0\" ;"

On the other hand, restaurants-field.txt has RESTAURANT_ID instead of ID.

RESTAURANT RESTAURANT_ID

Same goes for restaurants-schema.csv

Thanks!

Wonseok

SQL Patterns of WikiSQL

Hi,

I am wondering if you have the SQL patterns identified for each dataset available somewhere. I am particularly interested in the patterns of WikiSQL. Thanks,

--Ahmed

This is more of a question than issue.

Hi, thank you for releasing the dataset.

What do 'query-split' and 'question-split' mean in the dataset?
For example, in the last entry of the restaurant dataset, the questions

where can i find a name0 in city_name0 ?
where is name0 in city_name0 ?
where is a name0 in city_name0 ?

are duplicated 3 times, but their question-split values are different.
Any pointer would be appreciated. (I can't find it in the paper.)

WikiSQL

Hello,

First, thanks a lot for taking time to standardize all these datasets and publishing the analysis! It's super helpful.

I noticed that in the paper you mention WikiSQL as one of the community datasets and include it in the redundancy analysis, but you don't include it in the comparison of results, and this repository also doesn't have a standardized version of WikiSQL. Are there any plans to include it in the next version? Thanks.

Error for wikisql.json and spider.json while generating test,train and dev split

Hi,

I am getting the UnicodeEncodeError below when running the json_to_flat.py file to generate splits for both the wikisql and spider datasets:

json_to_flat.py", line 50, in convert_instance
print(text, "|||", sql, file=output_file)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uff1f' in position 123: character maps to

Please help.

Best Regards
Anshu
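One possible workaround (an assumption on my part, not verified against the script): open the output files with an explicit UTF-8 encoding, so Windows' default cp1252 codec is never used.

def open_output(path):
    # An explicit UTF-8 encoding avoids UnicodeEncodeError from the
    # cp1252 default on Windows.
    return open(path, 'w', encoding='utf-8')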

location.restaurant_id should be foreign key in "Restaurants"?

I am looking at the restaurants database. It contains tables Restaurants and Location. As far as I can understand from eyeballing the db content, Location.restaurant_id is referring to Restaurants.id, i.e. table Location contains addresses of the restaurants in table Restaurants. But in the database creation code Location.restaurant_id is made a primary key, not a foreign key:

https://github.com/jkkummerfeld/text2sql-data/blob/master/tools/create_restaurant_database.py#L32

What would be the right thing to do here?
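One possible change (a sketch, assuming SQLite and that the relationship is as described above; the column names are illustrative) would be to declare the column as a foreign key when the table is created:

import sqlite3

conn = sqlite3.connect('restaurants.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS LOCATION (
        RESTAURANT_ID INTEGER,
        HOUSE_NUMBER INTEGER,
        STREET_NAME TEXT,
        CITY_NAME TEXT,
        FOREIGN KEY (RESTAURANT_ID) REFERENCES RESTAURANT(ID)
    )
''')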

Document which datasets require the --split flag

https://github.com/jkkummerfeld/text2sql-data/blob/master/systems/baseline-template/text2sql-template-baseline.py#L29

This flag is required to process some of the datasets - would it be possible to document which ones these are, how the splits were created and how the format differs from the other data?

P.S. This is an exceptional resource, thanks a lot! I'm sure it will lead to big progress in semantic parsing. I enjoyed the paper a lot, so thanks.

Add hyperparameters used in paper to seq2seq configs

Currently, the sequence-to-sequence/experimental_configs directory contains a sample configuration file to enable users to ensure that they are able to run the code. In the future, I would like to add config files containing exactly the hyperparameters we used in the paper to make replication easier.

Geography dataset does not alias derived tables consistently

The Geography dataset contains examples where a derived table is aliased without using the "AS" keyword. There are 25 examples of this in the dataset which you can find by CTRL-F ") DERIVED" in the json file. Although this is still valid SQL, it would be nice if it was reformatted, because it is inconsistent with the column aliasing throughout the rest of the datasets.

Here is a line which demonstrates the issue: https://github.com/jkkummerfeld/text2sql-data/blob/master/data/geography.json#L3099
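A small normalisation along these lines (a sketch; it assumes the alias always directly follows the closing parenthesis) could be:

def normalise_derived_alias(sql):
    # Insert the "AS" keyword between a derived table's closing parenthesis
    # and its alias, matching the aliasing style used elsewhere.
    return sql.replace(') DERIVED', ') AS DERIVED')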

Interaction history in ATIS dataset

Many of the utterances in ATIS depend heavily on the history of the interaction, and one of the core challenges is figuring out how to reason about this context. For example, the utterance "which ones arrive at 7pm" requires the model to resolve references to previous utterances (what does "ones" refer to?).

{
        "comments": [],
        "old-name": "",
        "query-split": "train",
        "sentences": [
            {
                "text": "list all the flights that arrive at airport_code0 from various cities",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "what flights from any city land at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "show me the flights into airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "show me the flights arriving at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "list all the flights that arrive at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "list all the arriving flights at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "what flights land at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "show me the flights to airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "list all the landings at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "show me the flights into airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "list all the landings at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "list all flights arriving at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "show me the flights arriving at airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "list all the flights that fly into airport_code0",
                "question-split": "dev",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "list all the flights that arrive at airport_code0 airport",
                "question-split": "dev",
                "variables": {
                    "airport_code0": "MKE"
                }
            },
            {
                "text": "show me all flights arriving at airport_code0 from other airports",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "show me the flights from all airports to airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "show me the flights arriving at airport_code0 from all other airports",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "show me flights from all airports to airport_code0",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "show me the flights arriving at airport_code0 from other airports",
                "question-split": "train",
                "variables": {
                    "airport_code0": "DAL"
                }
            },
            {
                "text": "show me the flights to airport_code0 from all other airports",
                "question-split": "dev",
                "variables": {
                    "airport_code0": "DAL"
                }
            }
        ],
        "sql": [
            "SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM AIRPORT AS AIRPORTalias0 , AIRPORT_SERVICE AS AIRPORT_SERVICEalias0 , CITY AS CITYalias0 , FLIGHT AS FLIGHTalias0 WHERE AIRPORTalias0.AIRPORT_CODE = \"airport_code0\" AND CITYalias0.CITY_CODE = AIRPORT_SERVICEalias0.CITY_CODE AND FLIGHTalias0.FROM_AIRPORT = AIRPORT_SERVICEalias0.AIRPORT_CODE AND FLIGHTalias0.TO_AIRPORT = AIRPORTalias0.AIRPORT_CODE ;",
            "SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM AIRPORT AS AIRPORTalias0 , FLIGHT AS FLIGHTalias0 WHERE AIRPORTalias0.AIRPORT_CODE = \"airport_code0\" AND FLIGHTalias0.TO_AIRPORT = AIRPORTalias0.AIRPORT_CODE ;",
            "SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM AIRPORT AS AIRPORTalias0 , AIRPORT AS AIRPORTalias1 , FLIGHT AS FLIGHTalias0 WHERE AIRPORTalias1.AIRPORT_CODE = \"airport_code0\" AND FLIGHTalias0.FROM_AIRPORT = AIRPORTalias0.AIRPORT_CODE AND FLIGHTalias0.TO_AIRPORT = AIRPORTalias1.AIRPORT_CODE ;"
        ],
        "variables": [
            {
                "example": "MKE",
                "location": "unk",
                "name": "airport_code0",
                "type": "airport_code"
            }
        ]
    }

Looking at the dataset, it seems like 1) the utterances do not have these kinds of references, e.g. "ones" does not appear in the dataset, and 2) there isn't any information on which interaction the utterances come from, since they are grouped by SQL query. Were there any modifications to the original utterances, and is there any way to reconstitute the structure of the original interactions from the data? Thanks!
