Comments (8)
Thanks for this suggestion!
We try to define DatasetBuilders in a way that accurately represents the source data instead of making opinionated decisions on how a user may want to use it. This allows TFDS to be modular and useful across a wide variety of use cases. The tf.data API is simple, flexible, and powerful enough to let users manipulate the data, and simple casts and data reshapes are quite easy to do.
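To make the claim concrete, here is a minimal sketch (not a TFDS API; the dataset below is synthetic stand-in data) of doing a cast and a reshape with tf.data:

```python
import tensorflow as tf

# Stand-in for a TFDS dataset: a dict of uint8 images and int64 labels.
raw = tf.data.Dataset.from_tensor_slices({
    'image': tf.zeros([4, 28, 28, 1], dtype=tf.uint8),
    'label': tf.range(4, dtype=tf.int64),
})

def preprocess(ex):
    image = tf.cast(ex['image'], tf.float32) / 255.0  # uint8 -> float32 in [0, 1]
    image = tf.reshape(image, [-1])                   # flatten for a dense model
    return image, ex['label']

ds = raw.map(preprocess).batch(2)
images, labels = next(iter(ds))
print(images.shape, images.dtype)  # (2, 784) float32
```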
Could you elaborate a bit more on your concerns about using the tf.data API to massage the data?
And a pointer to those FastAI examples would be great! We can take a look at them and see how we can provide an equally nice experience.
Thanks again for these suggestions and helping make TFDS a better library!
Thanks for considering this.
TL;DR I don't think it would hurt anyone to optionally provide TF-ecosystem-standard representations, and it would certainly help a lot of people immensely.
Somewhat basic question: Is this package targeted toward end users, or to developers?
If end users are a major audience, needing to transform the data to make it useful is a high barrier to entry. Experienced TF folks can do it in a few minutes, but a beginner? They have to actually understand Dataset, tensor types and shapes, the required arguments of the downstream functions (like convolution), know the pattern for solving this particular puzzle, where to look for the necessary transform ops, and so on.
Compare this to the fast.ai example (https://docs.fast.ai/):
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = create_cnn(data, models.resnet18, metrics=accuracy)
learn.fit(1)
As a Hello World example, this is a lot better. You don't need to understand the details of ImageDataBunch, only that it provides data to some high-level model-fitting function create_cnn, which itself mostly wraps a pretrained network, models.resnet18. You don't need to understand the inputs to models.resnet18 or how they align with the outputs of ImageDataBunch.
TFDS can actually be cooler than this, since it can support the one-liner formulation, but also has more thought-out lower-level APIs for when you need something custom.
From a design POV I could not agree more that TFDS should not be in the business of innovating tensor representations for raw data. That is why it's a great thing tf.hub exists: it IS in that business and provides a well-reasoned, conservative standard for the subsets of the problem that make sense. Having a canonical way to receive image data, for example, makes sense for the TF platform, allowing components like the data source and the model input layer to interoperate automatically.
I agree that getting the data as float is a really standard use case (required by Hub, Keras, ...). The problem is that there isn't really a standard way of normalizing input images.
MNIST may be normalized between 0 and 1, but people use a lot of different normalizations for ImageNet (https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py#L21). So it's not clear what the API should look like for tfds.
The current way of normalizing an image is something like:
def normalize_image(ex):
  ex['image'] = tf.to_float(ex['image']) / 255.
  return ex

ds = ds.map(normalize_image)
I don't know if we can reduce the boilerplate code in a way which would be flexible enough.
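For readers on newer versions: tf.to_float was removed in TF 2.x, where tf.cast is the equivalent. A runnable version of the same pattern on a small synthetic dataset:

```python
import tensorflow as tf

# tf.to_float was removed in TF 2.x; tf.cast is the equivalent.
def normalize_image(ex):
    ex['image'] = tf.cast(ex['image'], tf.float32) / 255.0
    return ex

# Synthetic stand-in data: two all-white uint8 images.
ds = tf.data.Dataset.from_tensor_slices(
    {'image': tf.fill([2, 28, 28, 1], tf.constant(255, tf.uint8))}
)
ds = ds.map(normalize_image)
example = next(iter(ds))
print(example['image'].dtype)  # float32, values in [0, 1]
```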
That is a pretty good example to consider.
For background, I helped work on this example of integrating data into a modelling system: https://reference.wolfram.com/language/ref/CountryData.html
The analogous issue with CountryData is map projections: should we make users project the data themselves, because there is no standard projection? Fortunately, the set of commonly used projections is pretty small. So we just allowed accessing any of them as if they were a property of the data, in addition of course to providing access to the raw data.
Overall I think this design philosophy of providing the most common data transformations has worked out pretty well. Even when there is not a single "canonical" algorithm, there is almost always a small set of very common transformations that can be provided to accelerate the downstream users. With CountryData we could easily see that a large fraction of users will want to project the geographic data; in the case of images, a large fraction of users will want to perform 1 of a handful of normalizations.
We could consider providing dataset statistics (mean, variance, min/max) and/or scaled versions of the data according to those statistics.
In the case of supplying dataset statistics, it would be reasonable to attach the results of TF Data Validation to the DatasetInfo object.
In the case of supplying scaled versions of the features, a simple API would be to provide them as (lazily computed) additional features.
If the user wants to normalize according to some other scheme, it's fine to let them use the Dataset API to do it themselves. A totally general solution is not necessary to support the most common cases, and I'd argue that for data APIs this is the most reasonable design pattern, because data is quirky and irregular by nature.
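As a sketch of the "statistics plus scaled features" idea (this is not a TFDS API, just a two-pass approach on synthetic data): compute the dataset-wide mean and variance once, then standardize each example with them.

```python
import tensorflow as tf

# Synthetic stand-in images with pixel values in [0, 256).
images = tf.random.stateless_uniform(
    [8, 28, 28, 1], seed=[0, 0], maxval=256, dtype=tf.float32)
ds = tf.data.Dataset.from_tensor_slices({'image': images})

# Pass 1: accumulate mean and variance over the whole dataset.
all_pixels = tf.concat([tf.reshape(ex['image'], [-1]) for ex in ds], axis=0)
mean = tf.reduce_mean(all_pixels)
std = tf.sqrt(tf.math.reduce_variance(all_pixels))

# Pass 2: standardize each example with the precomputed statistics.
normalized = ds.map(lambda ex: {'image': (ex['image'] - mean) / std})
```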
Anyway I understand this might not be the highest priority for the project at this early stage but I do hope this can be considered, or at least that the burden on the users to do this work themselves is tracked and revisited as this gets popular.
What would you think of an API which would apply transformations to each individual feature? Would this satisfy your use case?
ds = tfds.load(
    'imagenet',
    split='train',
    transform={
        'image': [
            tfds.transform.Normalize(),
            tfds.transform.RandomCrop((128, 128)),
        ],
    },
)
This would allow people to apply arbitrary augmentation to their pipeline.
This could also be used for text to apply custom "in graph" decoding.
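A hypothetical sketch of how such a per-feature transform dict could be applied under the hood with ds.map (tfds.transform.Normalize / RandomCrop in the proposal above are proposed, not existing, APIs; plain functions stand in for them here):

```python
import tensorflow as tf

# Hypothetical helper: apply a dict of per-feature transform lists via ds.map.
def apply_transforms(ds, transform):
    def _map_fn(ex):
        out = dict(ex)
        for key, fns in transform.items():
            for fn in fns:
                out[key] = fn(out[key])
        return out
    return ds.map(_map_fn)

# Synthetic stand-in data: two uint8 "ImageNet-sized" images.
ds = tf.data.Dataset.from_tensor_slices(
    {'image': tf.zeros([2, 224, 224, 3], dtype=tf.uint8)})
ds = apply_transforms(ds, {
    'image': [
        lambda img: tf.cast(img, tf.float32) / 255.0,          # normalize
        lambda img: tf.image.random_crop(img, (128, 128, 3)),  # random crop
    ],
})
example = next(iter(ds))
```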
However, it's not clear yet when we'll have time to implement this, but I'll have a look in the following weeks.
I like this idea a lot. Simple semantics, obvious utility, seems easy to build on in the future in terms of where the transforms are sourced.
If, in addition to this, it was possible to provide simple dataset statistics, then it would be a complete solution. It would be easy to take the statistic and plop it into a transform applied on the right feature.
from datasets.
We started a public discussion on what design to have for this transform API: #662
Don't hesitate to share your thoughts on this
Designing a good API for this which would satisfy most use cases while being simple and performant ended up more complicated than first thought. It's not clear how to balance the different trade-offs (see discussion in #662).
I think this is still one of the main pain points of TFDS / tf.data, but closing this issue in favor of #662.