Comments (13)
Thanks for your work. But when I ran the program, I got an error.
When I execute
python run.py train experiments/spider-bert-run.jsonnet
I get a Segmentation fault (core dumped).
I found that it happens when the code reaches rat-sql/ratsql/utils/random_state.py,
line 12: self.torch_cpu_state = torch.get_rng_state()
I used gdb to debug it; the log is shown below:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fa00000561d in ?? ()
(gdb) where
#0  0x00007fa00000561d in ?? ()
#1  0x00007fa069add1d9 in c10::detail::LogAPIUsageFakeReturn(std::string const&) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#2  0x00007fa069ace17d in c10::TensorImpl::TensorImpl(c10::Storage&&, c10::TensorTypeSet, caffe2::TypeMeta const&, c10::optional<c10::Device>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#3  0x00007fa069acec2e in c10::TensorImpl::TensorImpl(c10::Storage&&, c10::TensorTypeSet) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#4  0x00007fa06bae0ff7 in at::Tensor at::detail::make_tensor<c10::TensorImpl, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::TensorTypeId>(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&, c10::TensorTypeId&&) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#5  0x00007fa06bad2cf8 in at::native::empty_cpu(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#6  0x00007fa06bcb76fb in at::CPUType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#7  0x00007fa0b1b37592 in torch::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#8  0x00007fa0b1b973b6 in THPGenerator_getState(THPGenerator*, _object*) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#9  0x0000561d32553ea1 in _PyMethodDef_RawFastCallKeywords ()
#10 0x0000561d3255414f in _PyMethodDescr_FastCallKeywords ()
#11 0x0000561d325affa9 in _PyEval_EvalFrameDefault ()
#12 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#13 0x0000561d325afe6e in _PyEval_EvalFrameDefault ()
#14 0x0000561d324f206b in _PyFunction_FastCallDict ()
#15 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#16 0x0000561d3254baaa in slot_tp_init ()
#17 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#18 0x0000561d325aff56 in _PyEval_EvalFrameDefault ()
#19 0x0000561d324f1059 in _PyEval_EvalCodeWithName ()
#20 0x0000561d324f2134 in _PyFunction_FastCallDict ()
#21 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#22 0x0000561d3254baaa in slot_tp_init ()
#23 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#24 0x0000561d325b06b2 in _PyEval_EvalFrameDefault ()
#25 0x0000561d324f206b in _PyFunction_FastCallDict ()
#26 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#27 0x0000561d3254baaa in slot_tp_init ()
#28 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#29 0x0000561d325aff56 in _PyEval_EvalFrameDefault ()
#30 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#31 0x0000561d325afe6e in _PyEval_EvalFrameDefault ()
#32 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#33 0x0000561d325aba66 in _PyEval_EvalFrameDefault ()
#34 0x0000561d324f1059 in _PyEval_EvalCodeWithName ()
#35 0x0000561d324f1f24 in PyEval_EvalCodeEx ()
#36 0x0000561d324f1f4c in PyEval_EvalCode ()
#37 0x0000561d3260aa14 in run_mod ()
#38 0x0000561d32613f11 in PyRun_FileExFlags ()
#39 0x0000561d32614104 in PyRun_SimpleFileExFlags ()
#40 0x0000561d32615bbd in pymain_main.constprop ()
#41 0x0000561d32615e30 in _Py_UnixMain ()
#42 0x00007fa1065d9b97 in __libc_start_main (main=0x561d324d1d20 <main>, argc=4, argv=0x7ffeccd7fc68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffeccd7fc58) at ../csu/libc-start.c:310
#43 0x0000561d325bb052 in _start () at ../sysdeps/x86_64/elf/start.S:103
Thanks.
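A gdb backtrace like the one above shows the native frames but not the Python call chain that led to them. As a complementary step, Python's standard faulthandler module can dump the Python-level traceback when the process receives SIGSEGV. This is only a sketch, assuming you can edit the top of run.py or wrap it in your own script; the helper name enable_crash_tracebacks is made up for illustration.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask CPython to print the Python stack of all threads when a fatal
    signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL) is received.

    `stream` must be a real file object with a file descriptor;
    it defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # ... then import and call the training entry point as usual.
```

The same effect is available without editing any code by starting the interpreter with python -X faulthandler run.py train experiments/spider-bert-run.jsonnet.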
When you ran python run.py preprocess experiment_config_file, did it fail? If it succeeded, could you send me the processed data? Thank you very much. My email is [email protected]
from rat-sql.
self.torch_cpu_state = torch.get_rng_state()
I executed it (self.torch_cpu_state = torch.get_rng_state()) separately on my computer, and it didn't report an error.
Yes, same for me. But when the full program reaches this line, it crashes.
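Since the call works on its own but crashes inside the full program, one way to narrow it down is to run candidate snippets in a fresh interpreter and look at the exit code, so a crash in the child does not take down your shell session or notebook. A rough sketch using only the standard library; run_isolated is a hypothetical helper name, and the snippet passed to it is only an example.

```python
import subprocess
import sys


def run_isolated(snippet):
    """Run a Python snippet in a fresh interpreter and return its exit code.

    On POSIX, a negative code -N means the child was killed by signal N
    (for example, -11 corresponds to SIGSEGV on Linux).
    """
    result = subprocess.run([sys.executable, "-c", snippet])
    return result.returncode


if __name__ == "__main__":
    # Example snippet: the call from random_state.py, on its own.
    code = run_isolated("import torch; torch.get_rng_state()")
    print("exit code:", code)
```

Adding the imports that run.py performs before this call to the snippet, one at a time, may show which combination triggers the crash.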
Can you send me the processed data? Thank you.
Wait a minute. I will send you.
@corezhen The original issue looks like a problem with CUDA configuration on the machine (I noticed that you're running in conda). You've closed the issue since – have you solved it? Just making sure.
Yes. I solved it by using torch==1.1.0. In your requirement.txt the version is 1.3.0, which didn't work for me.
Thank you very much.
When I used CUDA 10.1 and torch 1.3, it didn't work.
Now I use CUDA 10.1 and torch 1.1, and there are no errors in training or evaluation.
Also, how many GPUs did you use in your experiments? I don't have a 16GB GPU, so I use BERT-base on an NVIDIA 1080 Ti. But I found that 10 steps take 5 minutes, so running 90000 steps would take about 31 days.
Could you tell me the number of GPUs and the training time?
Thanks.
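For reference, the 31-day figure follows from a straight extrapolation of the measured rate. A small sketch of that arithmetic (the function name is made up, and it assumes the step time stays constant over the whole run):

```python
def estimate_training_days(total_steps, measured_steps, measured_minutes):
    """Extrapolate total wall-clock days from a short timing sample."""
    total_minutes = total_steps / measured_steps * measured_minutes
    return total_minutes / (60 * 24)


# 10 steps measured at 5 minutes, 90000 steps planned.
print(estimate_training_days(90_000, 10, 5))  # 31.25
```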
Thanks. I believe PyTorch packages the relevant CUDA libraries as part of its wheel, so an incompatibility could've been introduced by your GPU drivers. Note that 1.3 by default runs on CUDA 10, and 1.1 runs on CUDA 9.2. (I also can't be sure that everything in the codebase is compatible with the 1.1 API, but maybe it will run fine.)
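To see which CUDA version an installed PyTorch wheel was built for, and whether it can actually see a GPU, something like the following can be run in the same conda environment. It is only a sketch: describe_torch_env is a made-up helper, and it returns a stub when torch is not importable.

```python
def describe_torch_env():
    """Collect the PyTorch build and GPU facts relevant to a CUDA mismatch."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    info = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_0"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Comparing built_for_cuda with the CUDA version the installed GPU driver supports (as reported by nvidia-smi) is one way to spot the kind of mismatch described above.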
We use only one GPU, usually a V100 or P100. Note that BERT-base in our experiments did not bring any improvement over GloVe; only BERT-large did. (We did not even write about it in the paper.)
Are you sure PyTorch is running on a GPU device? These numbers seem low.
For comparison, our typical run with an effective batch size of 21 (bs=3 and num_batch_accumulated=7 in the config) gets to ~60% accuracy in 25K steps, which takes 1-2 days. 90K steps took about a week.
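The effective batch size here is the per-step batch size multiplied by the number of accumulated batches (3 × 7 = 21). A toy sketch in plain Python of why accumulation behaves like one larger batch, assuming each micro-batch loss is a mean and its gradient is scaled by 1/num_batch_accumulated. Whether the rat-sql trainer scales gradients exactly this way is not confirmed here, and the one-parameter model and data below are made up.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) over the batch with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, bs, num_batch_accumulated):
    """Sum per-micro-batch gradients, each scaled by 1/num_batch_accumulated."""
    total = 0.0
    for i in range(num_batch_accumulated):
        micro_batch = data[i * bs:(i + 1) * bs]
        total += mse_grad(w, micro_batch) / num_batch_accumulated
    return total


bs, num_batch_accumulated = 3, 7
# Synthetic (x, y) pairs, exactly bs * num_batch_accumulated = 21 of them.
data = [(float(i), 2.0 * i + 1.0) for i in range(bs * num_batch_accumulated)]

full = mse_grad(0.5, data)  # one batch of 21 examples
accumulated = accumulated_grad(0.5, data, bs, num_batch_accumulated)
print(len(data), abs(full - accumulated) < 1e-9)
```

The two gradients agree up to floating-point rounding, which is why memory use depends on bs while the optimization behaviour depends on the product.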
Yes. It is running on GPU.
I think the reason is that the original effective batch size is 64 (bs=8 and num_batch_accumulated=8) in the file experiments/spider-bert-run.jsonnet.
With a smaller bs=2, 10 steps take about 1 minute, but it runs out of memory on a 1080 Ti (11177MiB).
I think I should use multiple GPUs for the experiment.
I just fixed the default batch size; it was an oversight on my part when I did the release. The default should be bs=6, num_batch_accumulated=4. Feel free to tinker with it further; anything in the effective batch size range of 16..24 should perform similarly.
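As a quick sanity check on the settings mentioned in this thread, the effective batch size is just the product of the two config values. A small sketch (the function and dictionary names are made up; the numbers are the ones quoted in the comments here):

```python
def effective_batch_size(bs, num_batch_accumulated):
    """Number of examples contributing to each optimizer update."""
    return bs * num_batch_accumulated


# (bs, num_batch_accumulated) pairs quoted in this thread.
configs = {
    "typical run": (3, 7),
    "original release default": (8, 8),
    "fixed default": (6, 4),
}

for name, (bs, acc) in configs.items():
    size = effective_batch_size(bs, acc)
    in_range = 16 <= size <= 24
    print(f"{name}: effective batch size {size}, within 16..24: {in_range}")
```

On a GPU with less memory, lowering bs and raising num_batch_accumulated so that the product stays in that range keeps the effective batch size comparable.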
You can try adapting the code to multi-GPU, but that would (a) require non-trivial refactoring and (b) might not help as much as you think. The bottleneck is the decoder. It's a tree-based AST decoder, un-batched and sequential. It spends most of its time on the CPU.
Thank you very much for your advice. I'll think about it seriously.