
Comments (8)

rsepassi commented on May 21, 2024

Thanks for this suggestion!

We try to define DatasetBuilders in a way that accurately represents the source data instead of making opinionated decisions on how a user may want to use it. This allows TFDS to be modular and useful across a wide variety of use cases. The tf.data API is simple, flexible, and powerful enough to allow users to manipulate the data and it seems that simple casts and data reshapes are quite easy to do.
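For instance, a minimal sketch of such a cast and reshape with tf.data (the dataset name and the flattening here are just an illustration):

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train')
# Cast uint8 images to float32 and flatten them in a single map call.
ds = ds.map(lambda ex: {
    'image': tf.reshape(tf.cast(ex['image'], tf.float32), [-1]),
    'label': ex['label'],
})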

Could you elaborate a bit more on your concerns about using the tf.data API to massage the data?

And a pointer to those FastAI examples would be great! We can take a look at them and see how we can provide an equally nice experience.

Thanks again for these suggestions and helping make TFDS a better library!


kovasb commented on May 21, 2024

Thanks for considering this.

TL;DR I don't think it would hurt anyone to optionally provide TF-ecosystem-standard representations, and it would certainly help a lot of people immensely.

Somewhat basic question: Is this package targeted toward end users, or to developers?

If end users are a major audience, needing to transform the data before it is useful is a high barrier to entry. Experienced TF folks can do it in a few minutes, but a beginner has to understand Dataset, tensor types and shapes, the required arguments of the downstream functions (like convolution), the pattern for solving this particular puzzle, where to look for the necessary transform ops, and so on.

Compare this to the fast.ai example (https://docs.fast.ai/):
from fastai.vision import *              # fastai v1 imports: untar_data, URLs, ImageDataBunch, ...

path = untar_data(URLs.MNIST_SAMPLE)     # download and unpack the sample dataset
data = ImageDataBunch.from_folder(path)  # build train/valid data from the folder layout
learn = create_cnn(data, models.resnet18, metrics=accuracy)  # pretrained resnet18
learn.fit(1)                             # train for one epoch

As a Hello World example, this is a lot better. You don't need to understand the details of ImageDataBunch, only that it provides data to some high-level model-fitting function, create_cnn, which itself mostly just wraps a pretrained network, models.resnet18. You don't need to understand the inputs to models.resnet18 or how they align with the outputs of ImageDataBunch.

TFDS can actually be cooler than this, since it can support the one-liner formulation while also offering more thought-out lower-level APIs for when you need something custom.

From a design POV I could not agree more that TFDS should not be in the business of innovating tensor representations for raw data. That is why it's a great thing tf.hub exists: it IS in that business and provides a well-reasoned, conservative standard for the subsets of the problem that make sense. Having a canonical way to receive image data, for example, makes sense for the TF platform, allowing components like the data source and the model input layer to interoperate automatically.
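For concreteness, a minimal sketch of consuming that canonical image representation from TF Hub (float32 RGB scaled to [0, 1]); the module URL is just one example:

import tensorflow as tf
import tensorflow_hub as hub

# TF Hub image modules expect float32 RGB batches scaled to [0, 1],
# at the height/width the module declares (224x224 for this one).
layer = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4")
images = tf.random.uniform([8, 224, 224, 3])  # stand-in for real [0, 1] images
features = layer(images)                      # -> [8, 2048] feature vectors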


Conchylicultor commented on May 21, 2024

I agree that getting the data as float is a really standard use case (required by Hub, Keras, ...). The problem is that there isn't really a standard way of normalizing input images.

MNIST may be normalized to [0, 1], but people use a lot of different normalizations for ImageNet (https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py#L21). So it's not clear what the API should look like for tfds.

The current way of normalizing an image is something like:

import tensorflow as tf

def normalize_image(ex):
  # Cast uint8 pixels to float32 and scale to [0, 1].
  # (tf.to_float is deprecated; tf.cast is the TF2 equivalent.)
  ex['image'] = tf.cast(ex['image'], tf.float32) / 255.
  return ex

ds = ds.map(normalize_image)

I don't know if we can reduce the boilerplate code in a way that would be flexible enough.
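For contrast, a minimal sketch of one of the alternative ImageNet normalizations mentioned above (per-channel mean/std standardization; the constants are the usual ImageNet statistics used in keras_applications' 'torch' mode, and the helper name is ours):

import tensorflow as tf

# Per-channel ImageNet statistics ('torch' mode in keras_applications).
IMAGENET_MEAN = tf.constant([0.485, 0.456, 0.406])
IMAGENET_STD = tf.constant([0.229, 0.224, 0.225])

def standardize_image(ex):
  # Scale RGB pixels to [0, 1], then standardize each channel.
  image = tf.cast(ex['image'], tf.float32) / 255.
  ex['image'] = (image - IMAGENET_MEAN) / IMAGENET_STD
  return ex

ds = ds.map(standardize_image)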


kovasb commented on May 21, 2024

That is a pretty good example to consider.

For background, I helped work on this example of integrating data into a modelling system: https://reference.wolfram.com/language/ref/CountryData.html

The analogous issue with CountryData is map projections: should we make users project the data themselves, because there is no standard projection? Fortunately, the set of commonly used projections is pretty small. So we just allowed accessing any of them as if they were a property of the data, in addition of course to providing access to the raw data.

Overall I think this design philosophy of providing the most common data transformations has worked out pretty well. Even when there is not a single "canonical" algorithm, there is almost always a small set of very common transformations that can be provided to accelerate the downstream users. With CountryData we could easily see that a large fraction of users will want to project the geographic data; in the case of images, a large fraction of users will want to perform 1 of a handful of normalizations.

We could consider providing dataset statistics (mean, variance, min/max) and/or versions of the data scaled according to those statistics.

In the case of supplying dataset statistics, it would be reasonable to attach the results of TF Data Validation to the DatasetInfo object.

In the case of supplying scaled versions of the features, a simple API would be to provide them as (lazily computed) additional features.

If the user wants to normalize according to some other scheme, it's fine to let them use the Dataset API to do it themselves. A totally general solution is not necessary to support the most common cases, and I'd argue that for data APIs this is the most reasonable design pattern, because data is quirky and irregular by nature.
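As a rough illustration of the statistics idea, a minimal sketch of computing mean and variance with plain tf.data today; exposing these on DatasetInfo, as suggested above, would be the hypothetical part:

import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train')
pixels = ds.map(lambda ex: tf.cast(ex['image'], tf.float32))

# Accumulate the element count, sum, and sum of squares over all pixels.
count = pixels.reduce(tf.constant(0.), lambda c, x: c + tf.cast(tf.size(x), tf.float32))
total = pixels.reduce(tf.constant(0.), lambda t, x: t + tf.reduce_sum(x))
total_sq = pixels.reduce(tf.constant(0.), lambda t, x: t + tf.reduce_sum(tf.square(x)))

mean = total / count
variance = total_sq / count - tf.square(mean)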

Anyway, I understand this might not be the highest priority for the project at this early stage, but I do hope this can be considered, or at least that the burden on users to do this work themselves is tracked and revisited as TFDS gets popular.


Conchylicultor commented on May 21, 2024

What would you think of an API which would apply transformations to each individual feature? Would this satisfy your use case?

ds = tfds.load(
    'imagenet',
    split='train',
    transform={
        'image': [
            tfds.transform.Normalize(),
            tfds.transform.RandomCrop((128, 128)),
        ],
    }
)

This would allow people to apply arbitrary augmentations in their pipeline.
This could also be used for text, to apply custom "in graph" decoding.

However, it's not clear yet when we'll have time to implement this, but I'll have a look in the following weeks.


kovasb commented on May 21, 2024

I like this idea a lot. Simple semantics, obvious utility, seems easy to build on in the future in terms of where the transforms are sourced.

If, in addition to this, it was possible to provide simple dataset statistics, then it would be a complete solution. It would be easy to take the statistic and plop it into a transform applied to the right feature.
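For instance, a purely hypothetical sketch combining the two (the transform= argument and tfds.transform.Normalize come from the proposal above; the mean/std parameters and their values are our own illustration):

# Hypothetical only: none of this is a real tfds API yet.
ds = tfds.load(
    'imagenet',
    split='train',
    transform={
        'image': [tfds.transform.Normalize(mean=0.45, std=0.22)],  # illustrative stats
    },
)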


Conchylicultor commented on May 21, 2024

We started a public discussion on the design of this transform API: #662

Don't hesitate to share your thoughts on this.


Conchylicultor commented on May 21, 2024

Designing a good API for this, one which would satisfy most use cases while staying simple and performant, ended up being more complicated than first thought. It's not clear how to balance the different trade-offs (see the discussion in #662).

I think this is still one of the main pain points of TFDS / tf.data, but I'm closing this issue in favor of #662.

