
dataload's People

Contributors

anibali, ikostrikov2, manojelement, nicholas-leonard, ywelement

dataload's Issues

MultiSequence Issue

I'm trying to use MultiSequence as part of a natural language processing project. I just reinstalled torch, cutorch, nn, cunn, rnn, tds, and dataload, so I have the most recent versions of all of them (tds might be unnecessary, but I figure it can't hurt, right?).

During calls to subiter(), I was getting erratic crashes at Line 88,

input:narrow(1,start,size):copy(sequence:sub(tracker.idx, stop))

due to argument 3 of narrow() being out of range. Whenever this happened, size was 0. That occurs when a sequence has length 1, which sets stop to 0 at line 85, and tracker.idx is 1.
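
A minimal sketch of the arithmetic described above, in plain Lua. The formula for stop is an assumption (the last input index is capped at length - 1 so that a shifted target still exists); it is not copied from the library, and the function name window and the parameter want are hypothetical.

```lua
-- Hypothetical reconstruction of the window arithmetic; not the library's code.
-- len:  length of the current sequence
-- idx:  current read position in that sequence (tracker.idx in the report)
-- want: number of time steps still free in the batch
local function window(len, idx, want)
   -- assumed cap: inputs stop at len - 1 so that a next-token target exists
   local stop = math.min(idx + want - 1, len - 1)
   local size = stop - idx + 1
   return stop, size
end

-- A length-1 sequence read from idx 1 yields stop = 0 and size = 0,
-- the combination that makes narrow() reject its third argument.
local stop, size = window(1, 1, 5)
print(stop, size)

-- A length-2 sequence yields stop = 1 and size = 1, which is valid.
print(window(2, 1, 5))
```

Under this assumption, any sequence that reaches this code with length 1 triggers the crash, regardless of how it came to have that length.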

Interestingly, this happens even though my data set contains no such sequences, and I don't know how we get here. I changed my code to print the sequence tensors as I add them to the list I pass to the MultiSequence constructor, and not one of them is shorter than length 2, even on a set that causes this error. Nevertheless, printing self.sequences[tracker.seqid] just before the crash shows a length-1 tensor. This is the point at which I convinced myself that I do not, in fact, grok the code well enough to find the real source of the bug. In any case, the sequence in question always seems to consist entirely of the EOS token by the time the code gets to this point, so the cause may have something to do with sequences being truncated from the left.

With the "fix" below, the model does seem to be learning, but it could just be doing that in spite of a formatting error. Let me know if you need any code, data, etc.

For what it's worth, I "fixed" it by wrapping lines 85 through 96 with

if sequence:size(1) > 1 then
  -- existing code from lines 85 through 96
else
  start = start + 1
  tracker.seqid = nil
end

But that's mostly because I don't know exactly what I'm doing in the guts of this thing and it was easier to just kludge out the error case.

Thanks for your time

Serialization of AsyncIterator

I recently came across this problem:
I have a SequenceLoader wrapped around my dataset of multiple speech utterances. One utterance has about 500000 samples. If I wrap an AsyncIterator around the SequenceLoader, I get a reallocation error, which comes from line 19 in AsyncIterator:
local datasetstr = torch.serialize(dataset, "ascii")

If I change the serialization on both sides to "binary", it works flawlessly. My question is: why is "ascii" the preferred format, and could I open a pull request with that small change to use binary dumps instead?

EDIT: The pull request is #5.
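
For reference, a minimal sketch of the binary round trip described above, assuming Torch7 is installed. It uses a plain FloatTensor as a stand-in for the dataset object (an assumption; the real code serializes the whole loader), and it relies only on the mode argument of torch.serialize and torch.deserialize.

```lua
require 'torch'

-- Stand-in for one long utterance; the real object would be the dataset itself.
local utterance = torch.FloatTensor(500000):uniform()

-- Both sides must agree on the mode: serialize and deserialize with "binary".
local blob = torch.serialize(utterance, "binary")
local restored = torch.deserialize(blob, "binary")

-- The binary format stores the raw values, so the round trip is exact.
assert(restored:size(1) == utterance:size(1))
assert((restored - utterance):abs():max() == 0)
```

The same two calls with "ascii" write every number as text, which is why the serialized string grows much larger for tensors of this size.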

SequenceLoader issue

SequenceLoader seems to only work when the size of the input is 1. I seem to get errors with the following line:
self.data = sequence:sub(1,seqlen2*batchsize):view(batchsize, seqlen2):t():contiguous()

In particular, view does not work because of the missing data size. Is there a quick fix?
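
One possible workaround, sketched under the assumption that the failing case is a 2D sequence with one row per time step and inputsize features per row. The extra dimension is passed to view explicitly and t() is replaced by transpose(1, 2). The variable inputsize and the toy sizes are illustrative and not taken from the library.

```lua
require 'torch'

-- Assumption: `sequence` is 2D (time steps x features); the 1D case already works.
local batchsize, seqlen2, inputsize = 2, 3, 4
local sequence = torch.range(1, 7 * inputsize):view(7, inputsize)

-- Keep the first seqlen2 * batchsize rows and drop the remainder.
local trimmed = sequence:sub(1, seqlen2 * batchsize)
-- Split the rows into batchsize contiguous chunks, keeping the feature dimension.
local chunks = trimmed:view(batchsize, seqlen2, inputsize)
-- Put time first: seqlen2 x batchsize x inputsize.
local data = chunks:transpose(1, 2):contiguous()

print(data:size())
```

With this layout, data[t][b] is row (b - 1) * seqlen2 + t of the original sequence, which mirrors what the 1D code produces for scalar inputs.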

Improper index building in MultiImageSequence

MultiImageSequence:buildIndex() matches input files with a hard-coded pattern:

local seqdir, idx = filepath:match("/([^/]+)/input([%d]*)[.][^/]+$")

This does not respect self.inputpattern. Accepting PRs?
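
A rough sketch of how the match could be derived from a configurable template instead, in plain Lua. It assumes inputpattern is a printf-style string such as "input%d.jpg", which is a guess about its format, and buildFilePattern is a hypothetical helper rather than part of the library.

```lua
-- Hypothetical helper: turn a printf-style template such as "input%d.jpg"
-- into a Lua pattern that captures the sequence directory and the frame index.
local function buildFilePattern(template)
   -- Escape Lua pattern magic characters, leaving the '%' of "%d" untouched.
   local escaped = template:gsub("[%^%$%(%)%.%[%]%*%+%-%?]", "%%%0")
   -- Replace the "%d" placeholder with a capture of one or more digits.
   local body = escaped:gsub("%%d", "(%%d+)")
   return "/([^/]+)/" .. body .. "$"
end

local pattern = buildFilePattern("input%d.jpg")
local seqdir, idx = ("/data/seq01/input12.jpg"):match(pattern)
print(seqdir, idx)
```

Unlike the hard-coded pattern, this sketch requires at least one digit and the exact extension given in the template, so it is stricter than the current behavior.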

SequenceLoader subiter method does not reset self._start

I don't know whether this is a bug or intended behavior, but the SequenceLoader subiter method currently does not reset the self._start field of the DataLoader class when it is called more than once.

For example, after inserting prints into DataLoader and supplying iteration loops for each subiter instance:

local textloader = dl.SequenceLoader(text, batchsize)
local train_it1 = textloader:subiter(opt.seqlen, textloader.data:size()[1])

while true do
  local i = train_it1()
  if not i then break end
  --print(i)
end

local train_it2 = textloader:subiter(opt.seqlen, textloader.data:size()[1])

while true do
  local i = train_it2()
  if not i then break end
  --print(i)
end

Output:

training...
train_it1
DataLoader:subiter(batchsize, epochsize, ...) is called!
5	13353
self._start:1

train_it2
DataLoader:subiter(batchsize, epochsize, ...) is called!
5	13353
self._start:2

Notice that self._start begins at 2, not 1, the second time subiter is called. I can't think of a use case where this behavior is either needed or expected. I can of course wrap the data in a new SequenceLoader every time I need an iterator, but I think it would be more transparent if subiter reset the state.
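
To illustrate the kind of state sharing involved, here is a toy loader in plain Lua. It is not dataload's implementation: ToyLoader, its fields, and the reset argument are all hypothetical, and it only shows how a cursor stored on the loader carries over between iterators unless it is rewound.

```lua
-- Toy stand-in for a loader whose iterators share one cursor (self._start).
local ToyLoader = {}
ToyLoader.__index = ToyLoader

function ToyLoader.new(n)
   return setmetatable({n = n, _start = 1}, ToyLoader)
end

-- Returns an iterator over index ranges; reset = true rewinds the shared cursor.
function ToyLoader:subiter(batchsize, reset)
   if reset then self._start = 1 end
   return function()
      if self._start > self.n then return nil end
      local first = self._start
      local last = math.min(first + batchsize - 1, self.n)
      self._start = last + 1
      return first, last
   end
end

local loader = ToyLoader.new(10)
for first, last in loader:subiter(5) do print(first, last) end

-- Without a reset, the second iterator inherits the exhausted cursor.
print(loader:subiter(5)())
-- With a reset, it starts from the beginning again.
print(loader:subiter(5, true)())
```

Whether the reset belongs inside subiter or in a separate method is a design choice for the maintainers; the sketch only shows that an explicit rewind removes the dependence on earlier calls.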

GBW data format preparation

Hello,
Is there a ready-made script to convert the raw GBW text to the .th7 format required by the gbwloader?

I would like to experiment with some other datasets.

Thanks.
