
Comments (13)

489597448 commented on June 18, 2024

Thanks for your work, but when I ran the program I hit an error.
When I execute

python run.py train experiments/spider-bert-run.jsonnet

it dies with Segmentation fault (core dumped).
I found that the crash happens in rat-sql/ratsql/utils/random_state.py,
line 12: self.torch_cpu_state = torch.get_rng_state()
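
For reference, the failing call can be exercised in isolation with a couple of standard calls (a minimal sketch of what that line does, not the actual RandomState class):

import torch

# Snapshot the CPU RNG state, as ratsql/utils/random_state.py line 12 does.
state = torch.get_rng_state()   # returns a torch.ByteTensor
torch.set_rng_state(state)      # the symmetric restore call
print(state.shape)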

I used gdb to debug; the backtrace is shown below:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fa00000561d in ?? ()
(gdb) where
#0  0x00007fa00000561d in ?? ()
#1  0x00007fa069add1d9 in c10::detail::LogAPIUsageFakeReturn(std::string const&) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#2  0x00007fa069ace17d in c10::TensorImpl::TensorImpl(c10::Storage&&, c10::TensorTypeSet, caffe2::TypeMeta const&, c10::optional<c10::Device>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#3  0x00007fa069acec2e in c10::TensorImpl::TensorImpl(c10::Storage&&, c10::TensorTypeSet) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#4  0x00007fa06bae0ff7 in at::Tensor at::detail::make_tensor<c10::TensorImpl, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::TensorTypeId>(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&, c10::TensorTypeId&&) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#5  0x00007fa06bad2cf8 in at::native::empty_cpu(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#6  0x00007fa06bcb76fb in at::CPUType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) ()
   from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#7  0x00007fa0b1b37592 in torch::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#8  0x00007fa0b1b973b6 in THPGenerator_getState(THPGenerator*, _object*) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#9  0x0000561d32553ea1 in _PyMethodDef_RawFastCallKeywords ()
#10 0x0000561d3255414f in _PyMethodDescr_FastCallKeywords ()
#11 0x0000561d325affa9 in _PyEval_EvalFrameDefault ()
#12 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#13 0x0000561d325afe6e in _PyEval_EvalFrameDefault ()
#14 0x0000561d324f206b in _PyFunction_FastCallDict ()
#15 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#16 0x0000561d3254baaa in slot_tp_init ()
#17 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#18 0x0000561d325aff56 in _PyEval_EvalFrameDefault ()
#19 0x0000561d324f1059 in _PyEval_EvalCodeWithName ()
#20 0x0000561d324f2134 in _PyFunction_FastCallDict ()
#21 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#22 0x0000561d3254baaa in slot_tp_init ()
#23 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#24 0x0000561d325b06b2 in _PyEval_EvalFrameDefault ()
#25 0x0000561d324f206b in _PyFunction_FastCallDict ()
#26 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#27 0x0000561d3254baaa in slot_tp_init ()
#28 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#29 0x0000561d325aff56 in _PyEval_EvalFrameDefault ()
#30 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#31 0x0000561d325afe6e in _PyEval_EvalFrameDefault ()
#32 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#33 0x0000561d325aba66 in _PyEval_EvalFrameDefault ()
#34 0x0000561d324f1059 in _PyEval_EvalCodeWithName ()
#35 0x0000561d324f1f24 in PyEval_EvalCodeEx ()
#36 0x0000561d324f1f4c in PyEval_EvalCode ()
#37 0x0000561d3260aa14 in run_mod ()
#38 0x0000561d32613f11 in PyRun_FileExFlags ()
#39 0x0000561d32614104 in PyRun_SimpleFileExFlags ()
#40 0x0000561d32615bbd in pymain_main.constprop ()
#41 0x0000561d32615e30 in _Py_UnixMain ()
#42 0x00007fa1065d9b97 in __libc_start_main (main=0x561d324d1d20 <main>, argc=4, argv=0x7ffeccd7fc68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffeccd7fc58) at ../csu/libc-start.c:310
#43 0x0000561d325bb052 in _start () at ../sysdeps/x86_64/elf/start.S:103
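
For a Python-level traceback at the moment of the crash, Python's standard-library faulthandler module can also be enabled (a generic diagnostic, not specific to rat-sql):

import faulthandler
faulthandler.enable()  # dump the Python stack when a fatal signal such as SIGSEGV arrives

or, without touching the code:

python -X faulthandler run.py train experiments/spider-bert-run.jsonnet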

Thanks.

When you execute python run.py preprocess experiment_config_file, does it fail too? If it succeeds, could you send me the processed data? Thank you very much. My email is [email protected]


489597448 commented on June 18, 2024

self.torch_cpu_state = torch.get_rng_state()

When I execute this line by itself on my computer, it doesn't report an error.


corezhen commented on June 18, 2024

Yes, same here. Executing that line by itself works for me too, but the program crashes when it reaches it during training.


489597448 commented on June 18, 2024

Can you send me the processed data? Thank you.

corezhen commented on June 18, 2024

Wait a minute. I will send it to you.


alexpolozov commented on June 18, 2024

@corezhen The original issue looks like a problem with CUDA configuration on the machine (I noticed that you're running in conda). You've closed the issue since – have you solved it? Just making sure.


corezhen commented on June 18, 2024

Yes, I solved it by using torch==1.1.0. In your requirements.txt the pinned version is 1.3.0, and that one didn't work for me.

alexpolozov commented on June 18, 2024

Thanks. I believe PyTorch packages the relevant CUDA libraries as part of its wheel, so an incompatibility could've been introduced by your GPU drivers. Note that 1.3 by default runs on CUDA 10, and 1.1 runs on CUDA 9.2. (I also can't be sure that everything in the codebase is compatible with the 1.1 API, but maybe it will run fine.)
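
A quick way to see what a given wheel was built against (standard torch attributes, shown generically):

import torch

print(torch.__version__)               # e.g. 1.3.0
print(torch.version.cuda)              # CUDA version the wheel was compiled with
print(torch.cuda.is_available())       # whether a usable GPU is visible
print(torch.backends.cudnn.version())  # bundled cuDNN version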

corezhen commented on June 18, 2024

Thank you very much.
With CUDA 10.1 and torch 1.3 it didn't work. Now, with CUDA 10.1 and torch 1.1, training and eval both run without errors.

Also, I want to know how many GPUs you used in your experiments. I don't have a 16 GB GPU, so I run BERT-base on an Nvidia 1080 Ti. But I found that 10 steps take 5 minutes, so 90,000 steps would take about 31 days.
Could you tell me the number of GPUs and the time cost?
Thanks.
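
(The extrapolation above, spelled out as a throwaway sketch:)

minutes_per_10_steps = 5
total_steps = 90_000
days = total_steps / 10 * minutes_per_10_steps / (60 * 24)
print(days)  # 31.25 days at the observed pace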

alexpolozov commented on June 18, 2024

We use only one GPU, usually a V100 or P100. Note that BERT-base in our experiments did not bring any improvement over GloVe; only BERT-large did. (We did not even write about it in the paper.)

Are you sure PyTorch is running on a GPU device? These numbers seem low.
For comparison, our typical run with effective batch size of 21 (bs=3 and num_batch_accumulated=7 in the config) gets to ~60% accuracy in 25K steps, which takes 1-2 days. 90K steps took about a week.
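
For readers unfamiliar with the setting: num_batch_accumulated is plain gradient accumulation. A self-contained sketch with toy stand-ins (the real model and data come from the experiment config):

import torch
from torch import nn

model = nn.Linear(8, 1)                                    # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [torch.randn(3, 8) for _ in range(21)]           # micro-batches of bs=3
num_batch_accumulated = 7                                  # effective batch size = 3 * 7 = 21

optimizer.zero_grad()
for step, x in enumerate(batches):
    loss = model(x).pow(2).mean() / num_batch_accumulated  # scale so the sum averages
    loss.backward()                                        # gradients accumulate in .grad
    if (step + 1) % num_batch_accumulated == 0:
        optimizer.step()                                   # one update per 7 micro-batches
        optimizer.zero_grad()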


corezhen commented on June 18, 2024

Yes, it is running on the GPU.
I think the reason is that the original effective batch size is 64 (bs=8 and num_batch_accumulated=8) in experiments/spider-bert-run.jsonnet.
With a smaller bs=2 it takes about 1 minute to run 10 steps, but it still runs out of memory on a 1080 Ti (11177 MiB).
I think I should use multiple GPUs for the experiment.
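
(To see how close a run gets to the 11 GB limit, torch's built-in counters can be polled — a generic snippet, not from the rat-sql code:)

import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    x = torch.randn(4096, 4096, device=dev)                        # stand-in workload
    mib = 1024 ** 2
    print(torch.cuda.memory_allocated(dev) / mib, "MiB allocated")
    print(torch.cuda.max_memory_allocated(dev) / mib, "MiB peak")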


alexpolozov commented on June 18, 2024

I just fixed the default batch size, it was an oversight on my part when I did the release. The default should be bs=6, num_batch_accumulated=4. Feel free to tinker with it further, anything in the effective batch size range of 16..24 should perform similarly.

You can try adapting the code to multi-GPU, but that would be (a) non-trivial refactoring, (b) might not help as much as you think. The bottleneck is the decoder. It's a tree-based AST decoder, un-batched and sequential. It spends most of its time on the CPU.


corezhen commented on June 18, 2024

Thank you very much for your advice. I'll think about it seriously.

