
Comments (13)

489597448 commented on June 18, 2024

Thanks for your work, but when I ran the program I hit an error.
When I execute

python run.py train experiments/spider-bert-run.jsonnet

it dies with Segmentation fault (core dumped).
I found that the crash happens in rat-sql/ratsql/utils/random_state.py,
line 12: self.torch_cpu_state = torch.get_rng_state()
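
For reference, the failing call can be exercised in isolation with a couple of standard calls (a minimal sketch of what that line does, not the actual RandomState class):

import torch

# Snapshot the CPU RNG state, as ratsql/utils/random_state.py line 12 does.
state = torch.get_rng_state()   # returns a torch.ByteTensor
torch.set_rng_state(state)      # the symmetric restore call
print(state.shape)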

I used gdb to debug; the backtrace is shown below:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fa00000561d in ?? ()
(gdb) where
#0  0x00007fa00000561d in ?? ()
#1  0x00007fa069add1d9 in c10::detail::LogAPIUsageFakeReturn(std::string const&) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#2  0x00007fa069ace17d in c10::TensorImpl::TensorImpl(c10::Storage&&, c10::TensorTypeSet, caffe2::TypeMeta const&, c10::optional<c10::Device>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#3  0x00007fa069acec2e in c10::TensorImpl::TensorImpl(c10::Storage&&, c10::TensorTypeSet) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libc10.so
#4  0x00007fa06bae0ff7 in at::Tensor at::detail::make_tensor<c10::TensorImpl, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::TensorTypeId>(c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&, c10::TensorTypeId&&) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#5  0x00007fa06bad2cf8 in at::native::empty_cpu(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#6  0x00007fa06bcb76fb in at::CPUType::(anonymous namespace)::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) ()
   from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch.so
#7  0x00007fa0b1b37592 in torch::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#8  0x00007fa0b1b973b6 in THPGenerator_getState(THPGenerator*, _object*) () from /app/miniconda3_docker/envs/ratsql/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#9  0x0000561d32553ea1 in _PyMethodDef_RawFastCallKeywords ()
#10 0x0000561d3255414f in _PyMethodDescr_FastCallKeywords ()
#11 0x0000561d325affa9 in _PyEval_EvalFrameDefault ()
#12 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#13 0x0000561d325afe6e in _PyEval_EvalFrameDefault ()
#14 0x0000561d324f206b in _PyFunction_FastCallDict ()
#15 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#16 0x0000561d3254baaa in slot_tp_init ()
#17 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#18 0x0000561d325aff56 in _PyEval_EvalFrameDefault ()
#19 0x0000561d324f1059 in _PyEval_EvalCodeWithName ()
#20 0x0000561d324f2134 in _PyFunction_FastCallDict ()
#21 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#22 0x0000561d3254baaa in slot_tp_init ()
#23 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#24 0x0000561d325b06b2 in _PyEval_EvalFrameDefault ()
#25 0x0000561d324f206b in _PyFunction_FastCallDict ()
#26 0x0000561d32508a03 in _PyObject_Call_Prepend ()
#27 0x0000561d3254baaa in slot_tp_init ()
#28 0x0000561d32554298 in _PyObject_FastCallKeywords ()
#29 0x0000561d325aff56 in _PyEval_EvalFrameDefault ()
#30 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#31 0x0000561d325afe6e in _PyEval_EvalFrameDefault ()
#32 0x0000561d3255307b in _PyFunction_FastCallKeywords ()
#33 0x0000561d325aba66 in _PyEval_EvalFrameDefault ()
#34 0x0000561d324f1059 in _PyEval_EvalCodeWithName ()
#35 0x0000561d324f1f24 in PyEval_EvalCodeEx ()
#36 0x0000561d324f1f4c in PyEval_EvalCode ()
#37 0x0000561d3260aa14 in run_mod ()
#38 0x0000561d32613f11 in PyRun_FileExFlags ()
#39 0x0000561d32614104 in PyRun_SimpleFileExFlags ()
#40 0x0000561d32615bbd in pymain_main.constprop ()
#41 0x0000561d32615e30 in _Py_UnixMain ()
#42 0x00007fa1065d9b97 in __libc_start_main (main=0x561d324d1d20 <main>, argc=4, argv=0x7ffeccd7fc68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffeccd7fc58) at ../csu/libc-start.c:310
#43 0x0000561d325bb052 in _start () at ../sysdeps/x86_64/elf/start.S:103
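
For a Python-level traceback at the moment of the crash, Python's standard-library faulthandler module can also be enabled (a generic diagnostic, not specific to rat-sql):

import faulthandler
faulthandler.enable()  # dump the Python stack when a fatal signal such as SIGSEGV arrives

or, without touching the code:

python -X faulthandler run.py train experiments/spider-bert-run.jsonnet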

Thanks.

When you execute python run.py preprocess experiment_config_file, does it fail too? If it succeeds, could you send me the processed data? Thank you very much. My email is [email protected]


489597448 commented on June 18, 2024

self.torch_cpu_state = torch.get_rng_state()

When I execute this line by itself on my computer, it doesn't report an error.


corezhen commented on June 18, 2024

Yes, same here. Executing that line by itself works for me too, but the program crashes when it reaches it during training.


489597448 commented on June 18, 2024

Can you send me the processed data? Thank you.

corezhen commented on June 18, 2024

Wait a minute. I will send it to you.


alexpolozov commented on June 18, 2024

@corezhen The original issue looks like a problem with CUDA configuration on the machine (I noticed that you're running in conda). You've closed the issue since – have you solved it? Just making sure.


corezhen commented on June 18, 2024

Yes, I solved it by using torch==1.1.0. In your requirements.txt the pinned version is 1.3.0, and that one didn't work for me.

alexpolozov commented on June 18, 2024

Thanks. I believe PyTorch packages the relevant CUDA libraries as part of its wheel, so an incompatibility could've been introduced by your GPU drivers. Note that 1.3 by default runs on CUDA 10, and 1.1 runs on CUDA 9.2. (I also can't be sure that everything in the codebase is compatible with the 1.1 API, but maybe it will run fine.)
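
A quick way to see what a given wheel was built against (standard torch attributes, shown generically):

import torch

print(torch.__version__)               # e.g. 1.3.0
print(torch.version.cuda)              # CUDA version the wheel was compiled with
print(torch.cuda.is_available())       # whether a usable GPU is visible
print(torch.backends.cudnn.version())  # bundled cuDNN version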

corezhen commented on June 18, 2024

Thank you very much.
With CUDA 10.1 and torch 1.3 it didn't work. Now, with CUDA 10.1 and torch 1.1, training and eval both run without errors.

Also, I want to know how many GPUs you used in your experiments. I don't have a 16 GB GPU, so I run BERT-base on an Nvidia 1080 Ti. But I found that 10 steps take 5 minutes, so 90,000 steps would take about 31 days.
Could you tell me the number of GPUs and the time cost?
Thanks.
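
(The extrapolation above, spelled out as a throwaway sketch:)

minutes_per_10_steps = 5
total_steps = 90_000
days = total_steps / 10 * minutes_per_10_steps / (60 * 24)
print(days)  # 31.25 days at the observed pace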

alexpolozov commented on June 18, 2024

We use only one GPU, usually a V100 or P100. Note that BERT-base in our experiments did not bring any improvement over GloVe; only BERT-large did. (We did not even write about it in the paper.)

Are you sure PyTorch is running on a GPU device? These numbers seem low.
For comparison, our typical run with effective batch size of 21 (bs=3 and num_batch_accumulated=7 in the config) gets to ~60% accuracy in 25K steps, which takes 1-2 days. 90K steps took about a week.
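
For readers unfamiliar with the setting: num_batch_accumulated is plain gradient accumulation. A self-contained sketch with toy stand-ins (the real model and data come from the experiment config):

import torch
from torch import nn

model = nn.Linear(8, 1)                                    # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [torch.randn(3, 8) for _ in range(21)]           # micro-batches of bs=3
num_batch_accumulated = 7                                  # effective batch size = 3 * 7 = 21

optimizer.zero_grad()
for step, x in enumerate(batches):
    loss = model(x).pow(2).mean() / num_batch_accumulated  # scale so the sum averages
    loss.backward()                                        # gradients accumulate in .grad
    if (step + 1) % num_batch_accumulated == 0:
        optimizer.step()                                   # one update per 7 micro-batches
        optimizer.zero_grad()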


corezhen commented on June 18, 2024

Yes, it is running on the GPU.
I think the reason is that the original effective batch size is 64 (bs=8 and num_batch_accumulated=8) in experiments/spider-bert-run.jsonnet.
With a smaller bs=2 it takes about 1 minute to run 10 steps, but it still runs out of memory on a 1080 Ti (11177 MiB).
I think I should use multiple GPUs for the experiment.
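
(To see how close a run gets to the 11 GB limit, torch's built-in counters can be polled — a generic snippet, not from the rat-sql code:)

import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    x = torch.randn(4096, 4096, device=dev)                        # stand-in workload
    mib = 1024 ** 2
    print(torch.cuda.memory_allocated(dev) / mib, "MiB allocated")
    print(torch.cuda.max_memory_allocated(dev) / mib, "MiB peak")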


alexpolozov commented on June 18, 2024

I just fixed the default batch size, it was an oversight on my part when I did the release. The default should be bs=6, num_batch_accumulated=4. Feel free to tinker with it further, anything in the effective batch size range of 16..24 should perform similarly.

You can try adapting the code to multi-GPU, but that would be (a) non-trivial refactoring, (b) might not help as much as you think. The bottleneck is the decoder. It's a tree-based AST decoder, un-batched and sequential. It spends most of its time on the CPU.


corezhen commented on June 18, 2024

Thank you very much for your advice. I'll think about it seriously.

