mila-iqia / fuel

A data pipeline framework for machine learning

License: MIT License
I'm thinking it would be nice to display a progress bar when doing a super-long conversion, but I don't like the idea of littering the converter code itself with calls to progressbar.
One possibility is using the 'extra' dictionary on LogRecords and installing a custom handler in fuel-convert. That way it's just passed as usable metadata in regular logging calls and fuel-convert (or any other client) can use it as it likes. This seems fairly clean. Any thoughts?
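A minimal sketch of the idea in Python; the 'progress'/'total' keys and the handler are a hypothetical convention, not an existing Fuel API:

import logging

class ProgressBarHandler(logging.Handler):
    # Hypothetical handler that fuel-convert could install to render a
    # progress bar; other clients just see ordinary log records.
    def emit(self, record):
        if hasattr(record, 'progress'):
            print('{}/{} examples converted'.format(record.progress,
                                                    record.total))

logger = logging.getLogger('fuel.converters')
logger.addHandler(ProgressBarHandler())
logger.setLevel(logging.INFO)

# Inside a converter this stays a plain logging call, with progress
# passed as metadata through the 'extra' dictionary:
logger.info('converting', extra={'progress': 12000, 'total': 60000})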
Hi,
I'd like to run an experiment using a subset of the binarized MNIST 28 dataset.
I just want to run some quick experiments using, say, the first 12,000 images.
How do I do that?
Thanks,
Aj
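Not an official answer, but one way this could work, assuming Fuel's current API: SequentialScheme accepts an explicit list of indices, so the stream never requests anything past the first 12,000 examples.

from fuel.datasets import BinarizedMNIST
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream

dataset = BinarizedMNIST(('train',))
# Only request indices 0..11999; the batch size of 100 is arbitrary.
stream = DataStream(dataset,
                    iteration_scheme=SequentialScheme(range(12000), 100))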
This discussion started at the mailing list.

ref_sources is a class attribute in H5PYDataset, which causes it not to be serialized. However, if you use a DataStream in your code, a handle to the HDF5 file is saved as an instance attribute, which causes it to be serialized. However, most objects in h5py can't be serialized (more info here), which causes loading a pickled DataStream to fail. I am not sure what could be done, as saving the DataStream state is useful when the dataset is not an H5PYDataset. The handle to the HDF5 file does indeed hold state while the file is loaded (which driver is being used, buffering settings, etc.), but this information is not useful if reloading an HDF5 file from scratch.
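One possible direction, as a sketch only (the _file_handle attribute is hypothetical): drop the handle when pickling and reopen the file lazily afterwards, accepting that the driver/buffering state is lost.

class PicklableHandleMixin(object):
    # Sketch: h5py file handles can't be pickled, so exclude them from
    # the pickled state and reopen the file on first use after unpickling.
    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('_file_handle', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._file_handle = None  # reopened lazily by the dataset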
Following discussion on #35:

- HDF5DatasetFile class with load_in_memory attribute.
- fuel-convert utility that uses the low-level utilities to load data and spit them out in HDF5.
- Convert existing datasets to HDF5DatasetFile.

We should be able to create a split that is a list of disjoint indices, as well as specify the subset of an existing split as a list of disjoint indices.
I'm trying to test Blocks RNNs, and I think the Markov chain example is a little bit hard to read. I can understand the RNN bricks, though. I'd like to know the easiest/right way to translate Pylearn2 datasets to Fuel datasets. More specifically, I have a pylearn2.VectorSpaceDataset based on pylearn2.SequenceDataSpace that I wish to use with Blocks. Also, the dataset iterator reads from an HDF5 file and swaps the axes to make time the first dimension. How can I just wrap this dataset with a Fuel class and throw it to blocks.DataStream?

Do you have any guidelines for doing so?
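Not authoritative, but here is a minimal sketch of what such a wrapper could look like against Fuel's Dataset interface (the file layout and the 'features' source name are assumptions):

import h5py
import numpy
from fuel.datasets import Dataset

class SequenceHDF5Dataset(Dataset):
    provides_sources = ('features',)

    def __init__(self, path, **kwargs):
        self.path = path
        super(SequenceHDF5Dataset, self).__init__(**kwargs)

    def open(self):
        return h5py.File(self.path, 'r')

    def close(self, state):
        state.close()

    def get_data(self, state=None, request=None):
        data = state['features'][request]
        # swap axes so that time comes first:
        # (batch, time, features) -> (time, batch, features)
        return (numpy.swapaxes(data, 0, 1),)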
It is often possible to determine from the iteration scheme and the dataset how many batches will be returned (e.g. ceil(num_examples / batch_size)). This information could be useful (e.g. for the progress bar in Blocks), but the question is which component would implement this logic (because data streams and iteration schemes are agnostic to each other, and it would be a shame to break that). It might also become quite complicated quite quickly (e.g. an iteration scheme that produces more requests than the data stream provides; many data stream wrappers don't change the number of examples but simply pass it through; etc.).

Related to this: right now there is no documented interface for datasets. The question would be whether we want to make e.g. num_examples compulsory (allowing NaN and Inf as values).
Further to my comment at #2, things that should probably be factored out:

- All datasets that load files have similar logic: they concatenate the FUEL_DATA_PATH with the name of the folder in which the dataset is stored, and with the name of the file that we want to read. This should probably be factored out into a get_path() method which can be part of a mix-in DataFiles class or something. This can also raise a custom error with information on how to set the data path in the configuration. (See the sketch after this list.)
- This one might be a bit trickier, but it should probably be factored out into a mix-in class as well, so that every dataset that is loaded into memory can simply use this mix-in class to support the start and stop arguments.
- Maybe this could be combined with the current InMemoryDataset class, or maybe we could factor it out into a separate class, but the logic of raising an error when the state is not None and simply returning the indices of the data could also be factored out, because it will be needed for any dataset whose get_data logic is basically just indexing.
- The datasets should probably just return the original images, and we should have a flattening stream that does the flattening batch-wise.
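A sketch of the get_path() idea, assuming a single-directory data path in Fuel's config; the DataFilesMixin name and the folder attribute are illustrative:

import os
from fuel import config

class DataFilesMixin(object):
    folder = None  # set by each dataset, e.g. 'mnist'

    def get_path(self, filename):
        path = os.path.join(config.data_path, self.folder, filename)
        if not os.path.isfile(path):
            # custom error telling the user how to configure the path
            raise IOError("{} not found; set the Fuel data path in your "
                          "configuration file or the FUEL_DATA_PATH "
                          "environment variable.".format(path))
        return path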
Just wondering: It seems your requirements for configuration-file parsing are not very high. Maybe it would be sufficient to use https://docs.python.org/2/library/configparser.html ?
Pros: it's part of the std. lib for Python 2 and 3
Cons: a) it's Windows-INI style syntax; b) arguably less powerful than YAML.
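For illustration, reading an INI-style ~/.fuelrc with the standard library (the section and option names here are hypothetical):

import os
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2

parser = ConfigParser()
parser.read(os.path.expanduser('~/.fuelrc'))
data_path = parser.get('fuel', 'data_path')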
From mila-iqia/blocks#267

Something that @arasmus ran into: when combining multiple data streams, the names of the sources might clash. I guess a general solution for this would be to create a RenameStream which just renames the sources and passes on the data directly. That way you can put that into the yet-to-be-written data stream chainer.
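A bare-bones sketch of that idea (deliberately not tied to the real wrapper base class; name_map maps old source names to new ones):

class RenameStream(object):
    def __init__(self, data_stream, name_map):
        self.data_stream = data_stream
        # expose renamed sources, pass everything else through untouched
        self.sources = tuple(name_map.get(source, source)
                             for source in data_stream.sources)

    def get_epoch_iterator(self, **kwargs):
        return self.data_stream.get_epoch_iterator(**kwargs)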
It is currently impossible to iterate over MNIST examples, as opposed to batches of examples. Even SequentialScheme(num_examples, 1) does not fill this gap, because the requests it provides are singleton lists. I would solve that by adding a new scheme, let's say ExamplesScheme, whose requests would be integers, not lists.
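Roughly, as a sketch against Fuel's IterationScheme interface:

from fuel.schemes import IterationScheme

class ExamplesScheme(IterationScheme):
    def __init__(self, num_examples):
        self.num_examples = num_examples

    def get_request_iterator(self):
        # integer requests, one per example, instead of singleton lists
        return iter(range(self.num_examples))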
From mila-iqia/blocks#329

It would be nice to have two new wrappers: one that sorts examples in a batch according to a given key function, and one that unpacks batches to compose a stream of examples (like itertools.chain does). These two wrappers could be used one after another to make the data stream more uniform, which can yield a very significant speed-up in some cases, e.g. when the examples are sequences and it is desirable to have sentences of similar lengths in a batch.

Altogether it should look like this:

data_stream = DataStream(dataset,
                         iteration_scheme=ShuffledScheme(num_examples, 2000))
data_stream = Sort(data_stream, key=_get_input_length)
# this one has long segments of sorted examples: uniform batches can be formed
data_stream = Unpack(data_stream)
# a data stream with uniform batches!
data_stream = BatchDataStream(data_stream, iteration_scheme=ConstantScheme(10))
When the external interface of Fuel changes, e.g. by renaming/removing methods/classes that could be imported, it could break libraries that depend on Fuel (Blocks right now, eventually Pylearn2). In this case, we will need to coordinate changes to these libraries and Fuel, e.g. by modifying .travis.yml to run the tests from the new Blocks branch.

When setting floatX=float32 and device=gpu in ~/.theanorc, I couldn't run the Blocks MNIST tutorial. I thought that Fuel could check theano.config.floatX and serve downcasted data when necessary.
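One lightweight way this could be done today, as a sketch using Fuel's Mapping transformer to cast floating-point sources to theano.config.floatX:

import numpy
import theano
from fuel.transformers import Mapping

def downcast_floats(data):
    # cast any floating-point array to floatX, leave the rest alone
    return tuple(array.astype(theano.config.floatX)
                 if numpy.issubdtype(numpy.asarray(array).dtype,
                                     numpy.floating)
                 else array
                 for array in data)

# wrap an existing stream:
data_stream = Mapping(data_stream, downcast_floats)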
Right now lazy properties are part of InMemoryDataset; however, their principle is useful in other cases as well, see e.g. #9 (comment). We need to rename lazy properties to something more descriptive, move them out of InMemoryDataset, and redo the documentation to highlight their more general applicability.
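For reference, the principle is small enough to sketch generically (all names illustrative): the attribute is computed by load() on first access instead of in __init__.

def lazy_property(name):
    def getter(self):
        if not hasattr(self, '_' + name):
            self.load()  # fills in self._<name>, among others
        return getattr(self, '_' + name)
    return property(getter)

class SomeDataset(object):
    num_examples = lazy_property('num_examples')

    def load(self):
        self._num_examples = 50000  # stands in for real loading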
I followed the tutorial here to create a custom dataset. But when loading data using H5PYDataset, some of the data seems to be lost when there is a split array. Here is an example: https://gist.github.com/bobogei81123/6a2d93caa6f809526e4f

It seems like setting load_in_memory=True could solve the problem. (Also, ShuffledScheme won't work if load_in_memory=False.)
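For reference, the workaround in code (file and split names are placeholders from the tutorial):

from fuel.datasets import H5PYDataset

train_set = H5PYDataset('dataset.hdf5', which_sets=('train',),
                        load_in_memory=True)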
See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:
Launching a combo dataset server/experiment script is going to be a pretty common use case. It'd be nice to include some sort of helper to make this as painless as possible.
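As a sketch of what such a helper might look like, assuming Fuel's server module (start_server and ServerDataStream exist; the launch() helper itself is the proposal):

from multiprocessing import Process

from fuel.server import start_server
from fuel.streams import ServerDataStream

def launch(data_stream, port=5557):
    # serve the stream from a background process ...
    server = Process(target=start_server, args=(data_stream, port))
    server.daemon = True
    server.start()
    # ... and hand the caller a client-side stream connected to it
    return ServerDataStream(data_stream.sources,
                            produces_examples=False, port=port)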
Could you possibly write some documentation on how to create a new datastream, and which methods to overload? I'm trying but having some trouble knowing what to implement.
Title explains it.
See #44. mila-iqia/picklable-itertools@825eb0f solved this (I think), but we might as well add a test to make sure this doesn't happen again, just in case.
This could be eventual CCW fodder.
Fuel currently depends on both PyTables and h5py, using them for different datasets. Ideally I think we should at least make PyTables an optional dependency.
See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:
Following offline discussion with @lamblin and @vdumoulin, we agreed that the most important kind of type checking to perform in the data processing pipeline is probably the semantics of the axes of the data. Data streams should provide information about what each axis of the input and output represents, e.g. (channel, height, width) or (features) (maybe a labels role?), or, before going into Blocks, (time, batch, features) or (batch, features).
The behaviour of data streams regarding these labels should be configurable, so they can either ignore, warn or raise errors if the data input is not what they expected.
Some things that need to be thought about:
This is closely related to mila-iqia/blocks#30
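The configurable ignore/warn/raise behaviour mentioned above could look something like this sketch (expected axis labels given per source; all names here are illustrative):

import warnings

def check_axis_labels(stream, expected, on_mismatch='raise'):
    labels = stream.axis_labels or {}
    for source, axes in expected.items():
        actual = labels.get(source)
        if actual != axes:
            message = ('source {}: expected axes {}, got {}'
                       .format(source, axes, actual))
            if on_mismatch == 'raise':
                raise ValueError(message)
            if on_mismatch == 'warn':
                warnings.warn(message)
            # 'ignore' does nothing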
Currently I don't see any clean way to do traditional k-fold or leave-one-out cross-validation, or any more flexible division of the training data into a validation set. There is only the (start, end) pair available, and it is not flexible enough for this purpose (it is not possible to define two ranges for training and leave the in-between samples for validation). This is a very typical approach in machine learning in general, but with deep learning the datasets are so big that people can't afford to do this. Still, I'd like Fuel to support it. What do you think?

I would personally be happy with just adding a shuffle switch that would randomize the order of samples before division. But of course it would need to be internally randomized only once, when opened, to avoid train and validation overlapping.
So this is what I traditionally do:
train = MNIST("train", end=50000) # first 50k samples
valid = MNIST("train", start=50000) # the last 10k samples
test = MNIST("test")
One work-around would be to create only one dataset from "train" and use Transformers somehow, but this data division is such an important part of the process that I think Fuel should support it directly, especially because it is pretty easy to implement and support.
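To make the request concrete, here is a sketch of k-fold splits as explicit index lists, randomized exactly once (fixed seed) so train and validation cannot overlap; Fuel's schemes already accept lists of indices:

import numpy
from fuel.schemes import ShuffledScheme

def k_fold_indices(num_examples, k, fold, seed=1):
    rng = numpy.random.RandomState(seed)  # same permutation for every fold
    folds = numpy.array_split(rng.permutation(num_examples), k)
    valid = folds[fold]
    train = numpy.concatenate(
        [f for i, f in enumerate(folds) if i != fold])
    return list(train), list(valid)

train_indices, valid_indices = k_fold_indices(60000, k=5, fold=0)
train_scheme = ShuffledScheme(train_indices, batch_size=128)
valid_scheme = ShuffledScheme(valid_indices, batch_size=128)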
You can find info about the Iris dataset here.
See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:
See #44.
One of the goals for factoring Fuel out of Blocks was to be able to re-use it as a new dataset back-end in Pylearn2.
Here is a proposal and a starting plan for that. Ideas, comments and changes are very welcome.
See ticket #13. Important parts would be:

- A label attribute for each dimension (default is the empty string). We do not need "dimension scales" for that (they are for associating numeric information to the whole slice along that dimension). Done in #78 for H5PYDataset.
- Falling back to default names like "axis_0" when no label is set.
- Given a Pylearn2 data_specs and a Fuel Dataset, it should be possible to create a Pylearn2 iterator that is usable by TrainingAlgorithms.
- A FuelWrapperDataset in Pylearn2, constructed from a Fuel pipeline (a rough sketch follows at the end of this plan), with an option so that its iterator method can either:
When the proof of concept of using Fuel datasets from Pylearn2 works, we can start working on porting existing Pylearn2 datasets to Fuel:

- Add downloading code in fuel/downloaders/ and integrate with bin/fuel-download.
- Add conversion code in fuel/converters/ and integrate with bin/fuel-convert.
- Port preprocessing as Transformer classes.
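A very rough sketch of the FuelWrapperDataset idea above (the iterator contract is simplified; Pylearn2's real data_specs handling is omitted):

class FuelWrapperDataset(object):
    def __init__(self, data_stream):
        self.data_stream = data_stream

    def iterator(self, **kwargs):
        # TrainingAlgorithms mostly need an iterable of batches, so the
        # simplest option is to return a Fuel epoch iterator directly
        return self.data_stream.get_epoch_iterator()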
We should introduce versioning into the HDF5-based format we save, so that we can detect what set of conventions was being followed at that point in time, and read and interpret files that were not written with the current version of the code.
My personal feeling is that format version need not be tied to (or rely upon) Fuel package versioning. This is the approach adopted by NumPy for NPY, and it has worked well.
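Concretely, this could be as small as an HDF5 attribute (the attribute name here is a hypothetical convention):

import h5py

FORMAT_VERSION = 1  # bumped whenever the conventions change

# stamp the file on write
with h5py.File('dataset.hdf5', 'a') as f:
    f.attrs['fuel_format_version'] = FORMAT_VERSION

# check it on read; files predating the convention default to 0
with h5py.File('dataset.hdf5', 'r') as f:
    version = f.attrs.get('fuel_format_version', 0)
    if version > FORMAT_VERSION:
        raise ValueError('file written with a newer format version')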
DataStreamWrapper made sense when we were designing this, because the point was that data streams and data stream wrappers share the same interface. (Initially we actually tried making them one and the same class, but that ended up not happening, and now they just share a base class.)

The problem with DataStreamWrapper is that it's a bit of an obscure name, and I think it actually confuses people sometimes, because they're not sure what the difference between DataStream and DataStreamWrapper is. I actually like Pylearn2's Transformer better. Another alternative is Preprocessor, but that sounds too much like just doing whitening, etc., and not routine transformations like sorting, merging, etc.

So the terminology then becomes: instantiate a dataset, and create a data stream that reads from this dataset (potentially using a particular iteration scheme). Afterwards you can apply a series of transformations to the data stream (some transformers can support the iteration_scheme argument, some don't, but the get_epoch_iterator interface is the same).
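In code, the proposed terminology would read something like this sketch (Flatten stands in for any transformer; treat the details as illustrative):

from fuel.datasets import MNIST
from fuel.schemes import ShuffledScheme
from fuel.streams import DataStream
from fuel.transformers import Flatten

dataset = MNIST(('train',))
data_stream = DataStream(dataset, iteration_scheme=ShuffledScheme(
    dataset.num_examples, batch_size=128))
# apply a transformer to the data stream
data_stream = Flatten(data_stream, which_sources=('features',))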
Thoughts? Here I go randomly tagging people for opinions again: @rizar @lamblin @vdumoulin @laurent-dinh @dwf @pbrakel @jbornschein
This would allow people to ask questions without opening issues.
See discussion at mila-iqia/blocks#336 (comment)
See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:
Copy from mila-iqia/blocks#105
We could use https://github.com/jaberg/skdata to download popular datasets and even to load them into memory.

Integrating with skdata might not be necessary anymore now that we have a separate data framework. But we should still come up with a solution one way or the other.
I was just wondering whether we should make this a kind of policy: It's okay (and expected) for transformers to only act on examples, not on batches. There are basically two arguments that I can think of:
It makes code a lot simpler. This is n-grams for batches (and it's actually still not complete):
features, targets = [], []
for _, sentence in enumerate(self.cache[0]):
    features.append(list(
        sliding_window(self.ngram_order,
                       sentence[:-1]))[:request - len(features)])
    targets.append(
        sentence[self.ngram_order:][:request - len(targets)])
    self.cache[0][0] = self.cache[0][0][request:]
    if not self.cache[0][0]:
        self.cache[0].pop(0)
    if not self.cache[0]:
        self._cache()
    if len(features) == request:
        break
return tuple(numpy.asarray(data) for data in (features, targets))
and this is it for examples:
while not self.index < len(self.sentence) - self.ngram_order:
    self.sentence, = next(self.child_epoch_iterator)
    self.index = 0
ngram = self.sentence[self.index:self.index + self.ngram_order]
target = self.sentence[self.index + self.ngram_order]
self.index += 1
return (ngram, target)
If NGramStream had to deal both with batches and with examples, the code would be very long for such a simple operation. This goes for many, many cases. Hence, I'd prefer transformers to work on examples, and expect the user to add a BatchStream at the end.
Speed. Performing operations on batches can often be faster.
My take on it is that we can aim at one of two things:
We can try to make Fuel as efficient as possible. That means quite a bit of code to make sure that we handle large batches efficiently, and it might limit our ability to easily add new transformers (because they need all this logic coded up).
Alternatively, we can just say that our primary goal is the easy creation of processing pipelines. We will care more about prototyping e.g. testing dozens of different combinations of transformers to see which one works best, and making it very easy to add new ones. This means that the pipelines might not be as fast as they could be, but I think (hope) not slow enough to be prohibitive. Once you have found your optimal pre-processing pipeline and really need the speed, it should be easy to code up a single, specialized transformer that does everything you want more efficiently on batches/in Cython/using GPU/etc.
Add to Readme.rst that the libraries certifi and urllib3 are prerequisites, and add instructions on how to install them. @vdumoulin
Isn't that the point of do_not_pickle_attributes after all?
Currently Indexable requires data to be passed as a constructor argument. I do not find this nice, because some questions about the dataset can be answered even before actually loading the data. See https://groups.google.com/forum/#!topic/blocks-users/Bh9HtqmXjJk, for instance.
(Sorry if creating a new issue is not the best way to communicate ... is there a preferred method?)
I have been working with Pylearn2 for some time now and I am experiencing frustration with its limitations regarding HDF5 datasets. I'm very interested in using Fuel for managing my datasets, and I was excited to see in the documentation that there is a future direction to integrate Fuel and Pylearn2. However, I'm a little impatient to start taking advantage of Fuel :).
Do you have any recommendations or prototypes towards pylearn2 integration? Seems like communication through the pylearn2 iterator is key. I started to hole-fill the fuel H5PYDataset so that it implemented a pylearn2-capable iterator, but it's turning out to be a big job and I'm not 100% sure that it's the right direction. At this point I'd like to hear your thoughts before I start cowboy coding a fork. I looked in the branches/forks and didn't immediately find anything relevant. Thanks!
You can find info about the Adult dataset here.
See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:
A bit like Bokeh, it would be great to have the option of launching a Fuel server which can do preprocessing in a separate thread.
I imagine a scenario like this: you'd ask the Fuel server to maintain a queue of 10 preprocessed batches of examples from some dataset (e.g. ZCA-whitened CIFAR10 images) according to some iteration scheme, and your client application (e.g. Blocks, Pylearn2) could "consume" this data while Fuel replenishes the queue in parallel.
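Sketched with Fuel's server module, where the hwm (high-water mark) argument caps the queue of pending batches at 10; preprocessed_stream stands in for the ZCA-whitened stream:

# server process: preprocess and keep up to 10 batches queued
from fuel.server import start_server
start_server(preprocessed_stream, port=5557, hwm=10)

# client process (e.g. Blocks, Pylearn2): consume while the server refills
from fuel.streams import ServerDataStream
stream = ServerDataStream(('features',), produces_examples=False, port=5557)
for (batch,) in stream.get_epoch_iterator():
    pass  # train on batch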
It seems that there is an issue with the MultiProcessing transformer when an HDF5 file is used. Here is a stack trace:
HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 182, in H5Dread
    can't read data
  File "H5Dio.c", line 550, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1866, in H5D__chunk_read
    chunked read failed
  File "H5Dscatgath.c", line 542, in H5D__scatgath_read
    datatype conversion failed
  File "H5T.c", line 4809, in H5T_convert
    data type conversion failed
  File "H5Tconv.c", line 3216, in H5T__conv_vlen
    can't read VL data
  File "H5Tvlen.c", line 891, in H5T_vlen_disk_read
    Unable to read VL information
  File "H5HG.c", line 622, in H5HG_read
    unable to protect global heap
  File "H5HG.c", line 262, in H5HG_protect
    unable to protect global heap
  File "H5AC.c", line 1329, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3570, in H5C_protect
    can't load entry
  File "H5C.c", line 7950, in H5C_load_entry
    unable to load entry
  File "H5HGcache.c", line 141, in H5HG_load
    bad global heap collection signature

End of HDF5 error back trace
Is it assumed that sources are fully "parallel", i.e. same number of examples in each?
Two issues come to mind:
My first attempt to use it ended up with the message below. I could not immediately find the reason in the code, will look later.
cPickle.PicklingError: Pickling failed.

Blocks relies on the ability to pickle the entire main loop, which includes all bricks, data streams, extensions, etc. All of these must be serializable using Python's pickling library. This means certain things such as nested functions, lambda expressions, generators, etc. should be avoided. Please see the documentation for more detail.

Original exception:
PicklingError: Can't pickle <type 'iterator'>: attribute lookup __builtin__.iterator failed