x-lance / text2sql-lgesql

[ACL 2021] This project contains the source code and pre-trained models for the ACL 2021 long paper "LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations".

Home Page: https://arxiv.org/abs/2106.01093

License: Apache License 2.0

Languages: Python 97.93%, Shell 2.07%

Topics: database, heterogeneous-graph-neural-network, natural-language-interface, semantic-parsing, structured-prediction, text-to-sql

text2sql-lgesql's Issues

Can the code be modified for distributed training?

The program crashes when I use the argument "--load_optimizer"

I wanted to continue training the model with the saved optimizer, but it crashed. The traceback is shown as follows:
Traceback (most recent call last):
  File "lgesql/text2sql.py", line 105, in <module>
    optimizer.step()
  File "lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "lgesql/utils/optimization.py", line 220, in step
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
RuntimeError: The size of tensor a (768) must match the size of tensor b (2) at non-singleton dimension 0

Have you encountered this problem? How can I fix it?

Question about the edge features

Hi, I want to confirm how the edge features and non-local edge features are initialized from the parameter matrix. Is it similar to RAT-SQL?

Why not use cross-entropy loss?

Why not use the cross-entropy loss function, instead of only maximizing the probability of the positive category?
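
A note that may help frame this question: if the class probabilities come from a softmax (an assumption here, not something the issue confirms about LGESQL), then maximizing the log-probability of the positive category gives the same value as minimizing cross-entropy against a one-hot target. The sketch below checks this on made-up logits; the function names are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot_target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot_target) if t > 0)


def negative_log_likelihood(probs, gold_index):
    """Loss obtained by only maximizing the gold class probability."""
    return -math.log(probs[gold_index])


logits = [2.0, 0.5, -1.0]  # made-up scores for three candidate classes
gold = 0                   # index of the positive category
probs = softmax(logits)

ce = cross_entropy(probs, [1.0, 0.0, 0.0])
nll = negative_log_likelihood(probs, gold)
print(ce, nll)  # the two losses coincide
```

Under that assumption the two formulations differ in name only, because the softmax normalization already pushes down the probability of every other class when the gold probability goes up.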

What is your training time per epoch?

I reproduced the data preprocessing and then trained the model with the electra-large-discriminator PLM and the msde local_and_nonlocal strategy.
I found that it takes around 50 min per epoch on a Tesla V100 32G with the same hyper-parameters as in the paper.
I also modified the code to use DDP with 4 GPUs, but the time only dropped to 40 min per epoch.
Is your training time similar?
I want to run some experiments with LGESQL as the base model, but the time cost is ..... [SAD]
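
As a rough way to read these numbers, Amdahl's law can be used to estimate how much of the epoch actually scaled with the number of GPUs. This is only a back-of-the-envelope model: it assumes the epoch splits cleanly into a part that speeds up linearly with the GPU count and a part that does not (for example, CPU-side batching), and it ignores communication overhead.

```python
def parallel_fraction(t_single, t_multi, n_workers):
    """Estimate the parallelizable fraction p of the work from two timings.

    Amdahl's law: 1 / speedup = (1 - p) + p / n
    Solving for p: p = (1 - 1 / speedup) / (1 - 1 / n)
    """
    speedup = t_single / t_multi
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_workers)


# Timings reported in the issue: 50 min on 1 GPU, 40 min on 4 GPUs.
p = parallel_fraction(50.0, 40.0, 4)
print(f"estimated parallel fraction: {p:.2f}")
```

With the reported timings the estimate is about 27%, which under this simple model would mean that most of each epoch is spent in work that the extra GPUs do not accelerate.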

Support training on the CSpider dataset

I want to train LGESQL on the Chinese dataset CSpider. How should I do that?

Question about the code of schema linking

Hi there! Thanks for sharing the code of your interesting paper!

In line 253:

q_tab_mat = np.array([['question-table-nomatch'] * t_num for _ in range(q_num)], dtype=dtype)

This builds 'q_tab_mat' with size (q_num, t_num), where t_num is the length of 'table_token'.

But in line 263:

q_tab_mat[range(i, j), idx] = 'question-table-exactmatch'

Here 'idx' refers to the index of 'table_name' rather than of 'table_token'.

Could I ask why we should build q_tab_mat with the size of table_token but assign its value with the index of table_name?
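
One reading under which the two quoted lines are consistent, stated as an assumption rather than something confirmed from the repository: if the table-token variable is a list holding one token list per table, then its length equals the number of tables, and a table-name index is a valid column index. The sketch below mirrors the quoted logic with plain Python lists instead of NumPy; the schema, the question, and the variable names table_names and table_toks are hypothetical.

```python
question_toks = ["show", "all", "flight", "numbers"]

# Hypothetical schema: each table appears once as a name and once as a token list,
# so both lists have one entry per table.
table_names = ["airline", "flight"]
table_toks = [["airline"], ["flight"]]

q_num, t_num = len(question_toks), len(table_toks)

# Same shape as in the quoted line: (q_num, t_num), filled with the "no match" label.
q_tab_mat = [["question-table-nomatch"] * t_num for _ in range(q_num)]

for idx, name in enumerate(table_names):
    name_toks = table_toks[idx]
    n = len(name_toks)
    # Slide over the question and look for the table's tokens as an exact span.
    for i in range(q_num - n + 1):
        j = i + n
        if question_toks[i:j] == name_toks:
            # Plain-list equivalent of q_tab_mat[range(i, j), idx] = ...
            for row in range(i, j):
                q_tab_mat[row][idx] = "question-table-exactmatch"

for row in q_tab_mat:
    print(row)
```

If the assumption does not hold in the actual preprocessing code (for example, if the token list is flattened across all tables), the column index would no longer line up and the question stands as asked.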

Error when downloading dependencies

When downloading the LGESQL dependencies with the command python -c "from embeddings import GloveEmbedding; emb = GloveEmbedding('common_crawl_48', d_emb=300)", I get the following error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\anaconda3\lib\site-packages\embeddings\glove.py", line 43, in __init__
    self.db = self.initialize_db(self.path(path.join('glove', '{}:{}.db'.format(name, d_emb))))
  File "D:\anaconda3\lib\site-packages\embeddings\embedding.py", line 22, in path
    root = environ.get('EMBEDDINGS_ROOT', path.join(environ['HOME'], '.embeddings'))
  File "D:\anaconda3\lib\os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'
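
A possible workaround, sketched from the traceback alone: the failing line reads environ['HOME'] while building the default argument of environ.get, so the lookup happens even when EMBEDDINGS_ROOT is set, and Windows does not define HOME by default. Defining HOME before importing the package should avoid the KeyError. The helper name ensure_home_env is hypothetical.

```python
import os


def ensure_home_env(environ=os.environ):
    """Define HOME if it is missing, using the user's home directory.

    On Windows, os.path.expanduser("~") is typically resolved from USERPROFILE.
    """
    if "HOME" not in environ:
        environ["HOME"] = os.path.expanduser("~")
    return environ["HOME"]


home = ensure_home_env()
print("HOME =", home)

# With HOME defined, the command from the issue can then be run in the same process:
# from embeddings import GloveEmbedding
# emb = GloveEmbedding('common_crawl_48', d_emb=300)
```

Setting the variable in the shell before running the one-liner would be an equivalent alternative.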

About Data Preprocessing

This is amazing work, so I want to apply it to another dataset. I see that you use the "skip-large" hyperparameter. However, although my data has just 3 databases, each database has a large number of columns. Can I still use this model? When I preprocessed my own dataset using process_graphs.py, it took a very long time. Will my large databases also lead to slow training?

code available?

Hello, I am interested in your work.
May I ask when your code will be available?
Best regards!

Google Colab LGESQL

Have the authors or anyone in the community implemented this model in Google Colab?

If so, a copy of the .ipynb would be much appreciated.

About the generation of "ON" clause

Hi there! Hope you are doing well. I ran your code and found that there isn't any "ON" clause in the generated SQL. I wonder if this is normal and if LGESQL can generate SQL with "ON" clause.

unable to preprocess data

I ran the command for preprocessing the train data in your run_preprocessing.sh:
python3 -u preprocess/process_dataset.py --dataset_path 'data/train.json' --raw_table_path 'data/tables.json' --table_path 'data/tables.bin' --output_path 'data/train.bin' --skip_large

but got the error FileNotFoundError: [Errno 2] No such file or directory: 'data/tables.bin\r'.

Then I found this missing file in the processed dataset you provided. But isn't tables.bin the output of the preprocessing phase?
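
The trailing \r in the reported path suggests that the shell script was saved with Windows (CRLF) line endings, so the carriage return became part of the file name. Assuming that is the cause, converting the script to LF line endings before running it may resolve the error. The sketch below does the conversion in Python and demonstrates it on a temporary file; the helper name crlf_to_lf and the demo file contents are illustrative.

```python
import tempfile
from pathlib import Path


def crlf_to_lf(path):
    """Rewrite a file in place with LF line endings.

    Returns the number of CRLF sequences that were found.
    """
    p = Path(path)
    data = p.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        p.write_bytes(fixed)
    return data.count(b"\r\n")


# Demo on a throwaway file that imitates a script saved with CRLF endings.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo.sh"
    script.write_bytes(b"table_path='data/tables.bin'\r\necho done\r\n")
    converted = crlf_to_lf(script)
    cleaned = script.read_bytes()

print(converted, cleaned)
```

To apply it to the real script, call crlf_to_lf with the path of the preprocessing script in your checkout.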

vocab.txt for GloVe

There appear to be inconsistencies between the vocabulary contained in vocab.txt, which is used for the GloVe embeddings, and the actual vocabulary of the Spider dataset. In terms of unique words, vocab.txt does not match the number of unique words seen in the questions only, in the queries only, or in questions and queries combined.

Furthermore, when I order the contents of each by frequency, vocab.txt does not appear to match the frequencies of the Spider dataset either.

Since my investigations have yielded no conclusions, what is the meaning of vocab.txt? Where do the words and frequencies actually come from, if not the Spider dataset?

Note, I downloaded vocab.txt from the download link on the Code page: https://drive.google.com/file/d/1L8sWlp7J9LWjw9MP2bHGsf0wC4xLAyxO/view

How can I run inference on the model with only a question?

Hi, I want to know if it is possible to use the model to get the SQL statement directly after giving it a question in natural language, using only "Database + Question" as in the picture.
Thank you for your help.

How to reproduce the experimental results

Hi, I hope you are doing well.
Your work is amazing!
I ran your code three times but only got a maximum of 70.5% ACC on dev.
Then I downloaded your model from https://drive.google.com/u/0/uc?id=1ALf5ycxMViHrT5WGuFO3g9eT7R2S1rgy&export=download
I ran it with the run/run_evaluation.sh command.
But I can only get 71.5% ACC on dev. The output SQL looks different from that in the results_electra.txt file.
I was wondering how to reach 75% ACC on dev.

Question about EXACT MATCH ACC

Hello, your work has made amazing progress on the Spider dataset: LGESQL + ELECTRA achieves 75.1 on the dev set. My question is whether this result is beam acc or sql acc.

How to add value generation?

I am interested in your work and wanted to add the values to the generated SQL code. Unfortunately, despite several attempts I could not do it. I suspect it must be possible since the model seems to "understand" what the values are in the initial sentence. Is this possible and if so how?
Thank you!

Code Release Date

Hi @rhythmcao,

Your paper looks quite promising! Do you have a date in mind for when you are going to release the code?

Thanks,
Ursin

Data batching time takes up most of the training time, how to improve it?

I find that data batching takes up most of the training time. Have you tried using the DataLoader class to speed up batching? Or is DataLoader unsuitable for this project because it may run out of memory when using multiple workers?
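
One lightweight alternative to multi-worker loading, offered as a generic sketch rather than anything taken from the LGESQL code, is to build the next batches in a background thread while the current one is being consumed. This is not the PyTorch DataLoader; it uses only the standard library, keeps a single copy of the dataset in memory, and only helps when the batching code releases the GIL (for example during I/O or inside native tensor operations). The names prefetch and make_batches are hypothetical.

```python
import queue
import threading


def prefetch(batch_iter, max_prefetch=2):
    """Yield items from batch_iter while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=max_prefetch)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for batch in batch_iter:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            # Always signal the end so the consumer cannot wait forever.
            buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def make_batches(n):
    """Stand-in for an expensive batching routine."""
    for i in range(n):
        yield [i, i + 1]


batches = list(prefetch(make_batches(3)))
print(batches)
```

In this sketch an exception in the worker ends the stream early instead of being re-raised in the consumer, so a real training loop would need to propagate such errors explicitly.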
