Comments (6)
Hi,
Which kind of model do you use for training? Note that the data layout used by the write_to_hdf function in demos/mdlstm/IAM/create_IAM_dataset.py is only suitable for a 2D LSTM network, while the 1D networks use a different layout.
Do the demos work for you? If so, you should try to stick as closely as possible to the way the demo creates the data.
Also note that in the example the data is put under the "inputs" key and not under the "data" key (although I'm not sure whether this matters).
Please also have a look at https://github.com/rwth-i6/returnn/blob/master/demos/mdlstm/artificial/create_test_h5.py, a very simple script that shows how to properly create a data file for a 2D LSTM network.
If this still does not work for you, please let us know.
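The layout described above (each 2D image flattened and all sequences concatenated along the first axis, with the 2D sizes stored separately) can be sketched in NumPy; the function and variable names here are illustrative, not RETURNN's API:

```python
import numpy as np

def pack_images(images):
    """Flatten each 2D image to (height * width, 1) and concatenate
    the results along the first (time) axis, recording each image's
    2D size so the reader can reconstruct the individual sequences."""
    flat_chunks = []
    sizes = []        # (height, width) per image
    seq_lengths = []  # flattened length per image
    for img in images:
        h, w = img.shape
        flat_chunks.append(img.reshape(h * w, 1).astype("float32"))
        sizes.append((h, w))
        seq_lengths.append(h * w)
    inputs = np.concatenate(flat_chunks, axis=0)
    return inputs, sizes, seq_lengths

# Two images of different sizes:
a = np.zeros((4, 6))
b = np.ones((3, 5))
inputs, sizes, seq_lengths = pack_images([a, b])
# inputs.shape == (39, 1); sizes == [(4, 6), (3, 5)]
```

The per-image sizes and lengths are what later allow the single concatenated array to be cut back into individual sequences.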
from returnn.
Thank you for your reply!
I'm using a model trained with the demo included in the code (demos/mdlstm/IAM/go.sh). The data I want to send are just images from the IAM dataset, so it's 2D.
Yeah. The demos work. I'm creating the data pretty much in the same way it's being done in the code.
What I'm trying to do is set up a demo on a web service, where you can load a trained model, send it data to recognize, get the result, and show it on a web page. The issue I'm having is that when I send just one image, I get a result, decode it, and it's fine; but when I send more than one, I get one long sequence, and when I decode it, it's gibberish. I want to send a request with more than one image at the same time, if possible.
Working on an AWS instance (2 GB GPU), if I send three images to the daemon, it crashes:
python2.7: mod.cu:3443: int _GLOBAL__N__38_tmpxft_00005d36_00000000_9_mod_cpp1_ii_5bcebdd5::__struct_compiled_op_b7d1f699ec8aa72531b9afc40db7fbc6::run(): Assertion `V49' failed.
Fatal Python error: Aborted
Current thread 0x00007f8738f30740 (most recent call first):
File "/home/mmedina/ReturRNN/dlenv/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 859 in __call__
File "/home/mmedina/ReturRNN/returnn/Device.py", line 759 in compute_run
File "/home/mmedina/ReturRNN/returnn/Device.py", line 1034 in process_inner
File "/home/mmedina/ReturRNN/returnn/Device.py", line 887 in process
File "/home/mmedina/ReturRNN/returnn/TaskSystem.py", line 1195 in _asyncCall
File "/home/mmedina/ReturRNN/returnn/TaskSystem.py", line 470 in funcCall
File "/home/mmedina/ReturRNN/returnn/TaskSystem.py", line 957 in checkExec
File "/home/mmedina/ReturRNN/returnn/TaskSystem.py", line 1304 in <module>
Dev gpu0 proc died: recv_bytes EOFError:
device crashed on batch 0
If I do the same on a local server with 3 Titan X GPUs (36 GB RAM), it does not crash, but I only get one long sequence as a result, and when I decode it I just get nonsense. Judging by this, my guess is that the library somehow thinks that the data contained in "data" is just one image and processes it as one single sequence.
This is the code I use to create the data and the JSON I pass to the daemon:
import numpy as np
from scipy.misc import imread

def normalize_image(imgfile, pad_x=15, pad_y=15):
    # Read the image, invert it, pad it, flatten it to
    # (height * width, 1), and scale the values to [0, 1].
    img = imread(imgfile)
    img = 255 - img
    img = np.pad(img, ((pad_y, pad_y), (pad_x, pad_x)), 'constant')
    padded_shape = img.shape
    img = img.reshape(img.size, 1)
    img = img.astype("float32") / 255.0
    return img, padded_shape

def build_json_from_file(imgfile):
    # imgfile is a text file listing one image path per line.
    data_structure = {}
    data_structure['classes'] = [79, 1]
    data_structure['data'] = []
    data_structure['sizes'] = []
    imgs = []
    padded_shapes = []
    with open(imgfile, 'r') as f:
        for image in f.read().splitlines():
            img, padded_shape = normalize_image(image, 15, 15)
            imgs.append(img)
            padded_shapes.append(padded_shape)
    imgs = np.concatenate(imgs, axis=0)
    print np.array(imgs).shape
    padded_shapes = np.concatenate(padded_shapes, axis=0)
    data_structure['sizes'] += padded_shapes.tolist()
    # NOTE: after the concatenation above, this loop iterates over
    # individual pixel rows, so all images end up in one flat
    # "data" list.
    for img in imgs:
        data_structure['data'].append(img.tolist())
    return data_structure
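As an alternative to concatenating everything into one structure, the images could be kept separate, one payload per image. This is only a sketch: the exact JSON schema the daemon expects is an assumption here, and the field names simply mirror the snippet above:

```python
import numpy as np

def build_payloads(flat_images, padded_shapes):
    """Build one request payload per image instead of merging all
    images into a single flat sequence. `flat_images` are arrays of
    shape (height * width, 1); field names follow the snippet above
    and are not a documented daemon API."""
    payloads = []
    for img, shape in zip(flat_images, padded_shapes):
        payloads.append({
            'classes': [79, 1],
            'data': img.tolist(),
            'sizes': list(shape),
        })
    return payloads

# Two flattened images of different sizes:
imgs = [np.zeros((12, 1), dtype="float32"),
        np.zeros((20, 1), dtype="float32")]
shapes = [(3, 4), (4, 5)]
payloads = build_payloads(imgs, shapes)
# len(payloads) == 2; each payload carries exactly one image
```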
If sending multiple images to the daemon in one request is not possible, I was thinking of receiving the images in a generic request, creating an .h5 file from them, firing up ./rnn.py with a custom configuration file that includes the path of the created .h5 file, and then, when finished, fetching the results somehow, building a response and sending it back to the caller, but I think that's too much.
Please let me know if I'm not clear enough. I've been stuck on this issue for a couple of days now and I may be missing or omitting something.
I really appreciate your help.
Thanks!
Hi,
first of all, the error "Assertion `V49' failed." indicates that the GPU ran out of memory (sorry for the unspecific error message there, we should improve this).
And yes, it should be possible to forward multiple images at the same time.
You said that the demo for training works. Does it also work when you use the demo data for forwarding?
So far I haven't been able to see where the problem comes from. Can you please send me the config and a small h5 data file you are using to p.voigtlaender [at] gmail.com ?
Edit: Please note that when forwarding to hdf, the result is stored as one long sequence, which has to be split using the seqLengths. So "it does not crash, but I only get a long sequence as result" is expected; however, the result should not be nonsense but the concatenation of the contents of both images in this case.
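The splitting described above can be sketched with NumPy. Here `seq_lengths` stands for the stored seqLengths; the variable names are illustrative:

```python
import numpy as np

def split_by_seq_lengths(output, seq_lengths):
    """Cut one long concatenated output back into per-sequence
    pieces by splitting at the cumulative sequence boundaries."""
    boundaries = np.cumsum(seq_lengths)[:-1]
    return np.split(output, boundaries, axis=0)

out = np.arange(10).reshape(10, 1)          # concatenated result
pieces = split_by_seq_lengths(out, [6, 4])  # sequences of length 6 and 4
# pieces[0].shape == (6, 1), pieces[1].shape == (4, 1)
```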
Btw, the daemon you are using is an experimental and undocumented feature. Maybe first try to get everything working with a "normal" forwarding to hdf5, so we can isolate the problem.
Thanks again for your reply. Really appreciate that you're taking time to do this.
I have not tried forwarding. I'll send you the config file. Also, I'm not using any h5 file so far. I read the image from disk and convert it into the format the JSON expects (based on how you prepare the data for writing to an h5 file in demos/mdlstm/IAM/create_IAM_dataset.py). I'll send you the full program I'm using so you can have a better look at what I'm trying to achieve.
About the daemon: I understand. I found it while studying the code and thought it was the easiest option to get something working.
Hi Manuel,
So you are currently using the daemon functionality within RETURNN as defined in Engine.py? This is a very experimental feature which so far was only used in a toy chat bot experiment.
There might be bugs, but in general you should be able to use it. Each call to classify only accepts a single sequence for now, but you can simply make multiple classify requests asynchronously, remember the hashes it returns, and ask for them in any order to get the results (or a message that they are not done yet). There is no need to wait for previous results before making new requests.
If performance becomes more of a concern, I can look into extending the server to support batches of sequences.
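The request pattern described above (fire off several classify calls, remember the returned hashes, poll in any order) can be sketched with an in-memory stand-in for the daemon. The real endpoints and response format are not documented in this thread, so everything below is illustrative:

```python
import hashlib

class FakeDaemon:
    """Stand-in for the classify daemon: accepts one sequence per
    call and returns a hash that can later be polled for the result."""
    def __init__(self):
        self.results = {}

    def classify(self, seq):
        h = hashlib.sha1(repr(seq).encode()).hexdigest()
        # In the real daemon the result becomes available later;
        # here we "finish" it immediately for the sake of the sketch.
        self.results[h] = "recognized:%s" % seq
        return h

    def poll(self, h):
        return self.results.get(h)  # None means "not done yet"

daemon = FakeDaemon()
# Submit several sequences without waiting for earlier results:
hashes = [daemon.classify(seq) for seq in ["img0", "img1", "img2"]]
# Poll in any order, e.g. reversed:
results = [daemon.poll(h) for h in reversed(hashes)]
# results == ['recognized:img2', 'recognized:img1', 'recognized:img0']
```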