Comments (4)
The behavior of `get_seq_length` is correct. `load_seqs` is supposed to load some seqs and make them available, so that `get_data` etc. will work for those seqs. If there is no other call to `load_seqs`, those seqs must be kept available - that is the expected behavior of `load_seqs`. `get_seq_length` can get called for some future seq which was not yet loaded (but usually only the next one); that is why it internally calls `load_seqs` in that case, but in such a way that it will not remove the other seqs from memory.
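As a minimal usage sketch of that contract (the function name and the `"data"` key are just illustrative; `dataset` is assumed to be an already-constructed RETURNN dataset):

```python
def iterate_window(dataset, start, end):
    """Illustrates the load_seqs/get_data/get_seq_length contract described
    above. `dataset` is assumed to be an initialized RETURNN Dataset with
    init_seq_order already called; [start, end) must be valid seq indices."""
    dataset.load_seqs(start, end)  # make seqs [start, end) available
    for seq_idx in range(start, end):
        features = dataset.get_data(seq_idx, "data")  # works for loaded seqs
        print(seq_idx, features.shape)
    # Asking about a future seq is allowed: get_seq_length internally calls
    # load_seqs for it, without evicting the seqs in [start, end).
    return dataset.get_seq_length(end)  # NumbersDict, e.g. result["data"]
```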
`init_seq_order` with the argument `"sorted"` (or related) will normally use `get_seq_length`, but that will not work as you describe. That should not be the case for `CachedDataset2`, though - its implementation does not support `"sorted"` or any other option. Sorting will not work for `CachedDataset2` because this dataset is implemented in such a way that it has no real control over the sorting logic. So usually you are supposed to implement `init_seq_order` yourself if you want to have some control. See the derived versions of `CachedDataset2.init_seq_order` for some examples.
I just looked at `RawWavDataset`, which you might relate to. To support `"sorted"` or related sorting options, you should load the length of each sequence in advance (you can just take the file length instead; that will preserve the right order). Maybe do that in a lazy way, so that the `get_seq_len` function passed to `get_seq_order_for_epoch` can load it at the first call. Or otherwise, just pass `get_seq_len=None`, and that will also be fine - in that case, `"sorted"` or related sorting options will not be supported.
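For concreteness, a minimal sketch of such an `init_seq_order` in a `CachedDataset2` subclass, with a lazy `get_seq_len` based on file sizes. The class name, the `files` argument, and the `_get_seq_len` helper are made up for illustration, the actual loading logic (`_collect_single_seq`) is omitted, and the import path may differ between RETURNN versions:

```python
import os
from returnn.datasets.cached2 import CachedDataset2  # path may vary by version


class MyRawFileDataset(CachedDataset2):  # hypothetical example subclass
    def __init__(self, files, **kwargs):
        super().__init__(**kwargs)
        self._files = files    # assumption: one sequence per file
        self._file_sizes = {}  # lazy cache: corpus seq idx -> file size
        self._seq_order = None

    def _get_seq_len(self, corpus_seq_idx):
        # Lazy: only loaded at the first call. The file size is used as a
        # proxy for the sequence length; it preserves the right order.
        if corpus_seq_idx not in self._file_sizes:
            self._file_sizes[corpus_seq_idx] = os.path.getsize(self._files[corpus_seq_idx])
        return self._file_sizes[corpus_seq_idx]

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        # Passing get_seq_len enables "sorted" and related orderings.
        # Pass get_seq_len=None instead if you do not need them.
        self._seq_order = self.get_seq_order_for_epoch(
            epoch=epoch, num_seqs=len(self._files), get_seq_len=self._get_seq_len)
        self._num_seqs = len(self._seq_order)
        return True
```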
Thanks. So for the completion of the issue, my solution is the following: If I understand correctly, the `"sorted"` option is not really supported by `CachedDataset2`. But it seems to me that, due to the initialization of the dev dataset, the `seq_ordering` option of the cross validation set in training is set to `"sorted"` by default? The config option `"batching"` only affects the train set but not the dev set. So simply adding the option `"seq_ordering": "default"` to my specification of the dev dataset seems to be a sufficient workaround for my problem.
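For reference, a hypothetical config excerpt showing that workaround (the dataset class and the file lists are placeholders, not real names):

```python
# Hypothetical RETURNN config excerpt. "batching" sets the seq ordering
# for the train dataset only; the dev dataset needs it set explicitly.
batching = "sorted"

train_files = ["train1.wav", "train2.wav"]  # placeholder file lists
dev_files = ["dev1.wav"]

train = {"class": "MyRawFileDataset", "files": train_files}
dev = {
    "class": "MyRawFileDataset",
    "files": dev_files,
    # Workaround: override the dev-set default ordering, which would
    # otherwise be a sorted variant that this dataset does not support.
    "seq_ordering": "default",
}
```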
`CachedDataset2` itself will ignore the `seq_ordering` option. In all cases, it's up to your implementation to also ignore it (e.g. `ExternSprintDataset` will just ignore it, like some others do as well), throw an error, or do something sensible. Using `get_seq_order_for_epoch` with the default `get_seq_length` will not work, as you described.
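The "throw an error" variant could look like this, sketched as another hypothetical subclass along the lines of the one above (`self.seq_ordering` is set by the base `Dataset` from the config):

```python
import os
from returnn.datasets.cached2 import CachedDataset2  # path may vary by version


class MyStrictRawFileDataset(CachedDataset2):  # hypothetical example subclass
    def __init__(self, files, **kwargs):
        super().__init__(**kwargs)
        self._files = files  # assumption: one sequence per file
        self._seq_order = None

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        # Reject unsupported orderings up front, instead of failing later
        # inside the sorted branch of get_seq_order_for_epoch.
        assert self.seq_ordering in ("default", "reverse", "random"), (
            "seq_ordering %r not supported by this dataset" % self.seq_ordering)
        # None of these orderings need sequence lengths, so None is fine here.
        self._seq_order = self.get_seq_order_for_epoch(
            epoch=epoch, num_seqs=len(self._files), get_seq_len=None)
        self._num_seqs = len(self._seq_order)
        return True
```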
Yeah, sure. I'm now also setting `get_seq_len=None` in the call of `get_seq_order_for_epoch(...)`, as you suggested. But doing this alone will still make me end up in the sorted branch of `Dataset.get_seq_order_for_epoch(...)` and thus cause an assertion error. By setting `"seq_ordering": "default"` in the config of the cross validation set, I only avoid entering that if-branch. But you are right, I might better handle that in the implementation of my dataset. Thanks.