tensorflow / datasets
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
Home Page: https://www.tensorflow.org/datasets
License: Apache License 2.0
tf.contrib.data.LMDBDataset will not make it to TF2. It may be moving to a new repo/package. If/when that happens, we should switch to it.
Part of #31 (TF 2.0 support)
Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
Is there a simple example showing how to import MNIST and train a simple neural network to make inferences on the data? It would help to see how to use this library end to end. Right now, I know how to import a dataset but not how to actually train a model with it.
I'm a lot more used to working with numpy datasets that I can feed directly into a TensorFlow feed dict.
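A minimal end-to-end sketch of what's being asked for, assuming the tfds.load(..., as_supervised=True) API and an illustrative Keras classifier (the model and hyperparameters here are assumptions, not from any official example):

import tensorflow as tf
import tensorflow_datasets as tfds

# Load MNIST as (image, label) pairs.
ds_train = tfds.load("mnist", split="train", as_supervised=True)

def normalize(image, label):
    # uint8 in [0, 255] -> float32 in [0, 1]
    return tf.cast(image, tf.float32) / 255.0, label

ds_train = ds_train.map(normalize).shuffle(10000).batch(128)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(ds_train, epochs=5)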
The example given in the README is not implemented yet:
AttributeError: module 'tensorflow_datasets' has no attribute 'load'
Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
This is a good example of a use case of AI for Good. I can try to help create the TFRecord files if needed (I am learning it).
I'm receiving the following error after running pip install tensorflow-datasets:
Could not find a version that satisfies the requirement tensorflow-datasets (from versions: )
No matching distribution found for tensorflow-datasets
Currently, the default download directory for dataset caching appears to be ~/tensorflow_datasets. However, since it's not a folder that is meant to be accessed through a file manager, I'd suggest making it hidden by default, e.g. ~/.tensorflow_datasets.
pytz is in setup.py but doesn't seem to be used anywhere. Remove it?
This is a tracking bug for extra-large dataset generation using Apache Beam (i.e. for datasets that cannot feasibly be generated within a day on a single machine). Follow it to be notified of updates on this support.
Is your feature request related to a problem? Please describe.
I just tried the mnist dataset in a Colab. It took a few minutes to download and convert the data to TFRecords before I could try anything at all.
The Keras built-in MNIST dataset loads in a few seconds.
This is concerning because MNIST is pretty small. If I tried to use a bigger dataset, it seems like I might be waiting for an hour.
Describe the solution you'd like
When not otherwise prohibited by dataset licensing, it would be great if the TFDS team could convert the datasets to their TFRecord format ahead of time and host the converted data as a public dataset in the cloud. Then, when users try to use the dataset, there is no extensive pause.
Describe alternatives you've considered
Using the dataset APIs of other frameworks.
You have a file called Text.md and one called text.md in the same directory, datasets/docs/api_docs/python/tfds/features/.
This is causing an issue on filesystems that ignore capitalization in filenames. (My filesystem overwrote one of the files with the other, and now git shows unstaged changes/modifications.)
Could you resolve the conflict between the two files?
It would be great to have all TFDS datasets indexed in Dataset Search.
We need a library function that goes from a DatasetBuilder to the needed schema.org markup. See the Google Dataset type docs and the schema.org docs. The markup should include usage instructions similar to the Kaggle example above (i.e. show off using the TFDS APIs for that dataset).
Then we need a script to generate an HTML page for each dataset and write them to a directory.
If Google indexes GitHub, then we'd be done. If not, we can copy those files over to the TF site, which is definitely indexed.
If anybody has experience with schema.org or is interested in having TFDS have wider exposure, this would be a great issue to pick up.
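As a starting point, here is a hedged sketch of such a library function, mapping a DatasetBuilder's metadata onto schema.org Dataset JSON-LD; the chosen fields and the catalog URL pattern are assumptions, not a settled design:

import json
import tensorflow_datasets as tfds

def schema_org_markup(builder):
    # Sketch: map a DatasetBuilder's metadata to schema.org Dataset JSON-LD.
    info = builder.info  # tfds.core.DatasetInfo
    return json.dumps({
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": info.name,
        "description": info.description,
        "citation": info.citation,
        # Assumed URL pattern for the per-dataset page discussed above.
        "url": "https://www.tensorflow.org/datasets/catalog/" + info.name,
    }, indent=2)

print(schema_org_markup(tfds.builder("mnist")))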
Hi,
Thanks for this package; it looks really good.
I have a question regarding the optimal workflow and usage of this package for the scenario described below.
Let's say there is a very large database (20+ GB) of images, and access is password protected. In the TensorFlow ecosystem it is recommended to use TFRecords to consolidate the labels and data and speed up training, so this data needs to be converted to TFRecords.
Since it is 20+ GB and password protected, I would tend to think that automatic download would not be recommended, and hence this library provides a way to specify the folder.
However, since the original dataset needs to be converted to TFRecords first: would you suggest that the conversion (to TFRecords) be done as part of this library (via subclasses, as sketched below), or should it be a separate step (equivalent to the manual downloading of images)?
Regards
Kapil
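A sketch of the first option, assuming the dataset is implemented as a DatasetBuilder whose split generators read from a manually downloaded folder; the dataset name and path here are hypothetical:

import tensorflow_datasets as tfds

# Hypothetical dataset; the images were fetched with credentials beforehand.
builder = tfds.builder("my_private_images")

# Point TFDS at the pre-downloaded files; download_and_prepare() then does
# the one-time conversion to TFRecords inside the library.
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        manual_dir="/data/my_private_images"))

ds = builder.as_dataset(split="train")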
Using colab.research.google.com, GCS access doesn't work (tfds.load("mnist") hangs). It seems due to tf.io.gfile and not TFDS. TFDS is trying to access the dataset_info.json file from GCS.
TensorFlow or Colab should fix this.
For now, a possible alternative is to use requests to access the GCS files through the HTTP API: http://storage.googleapis.com/tfds-data/dataset_info/mnist/1.0.0/dataset_info.json
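For reference, that workaround might look like the following; the fields read from the JSON are assumptions about the contents of dataset_info.json:

import requests

# Fetch the TFDS metadata file over plain HTTP instead of via tf.io.gfile.
URL = ("http://storage.googleapis.com/tfds-data/"
       "dataset_info/mnist/1.0.0/dataset_info.json")

resp = requests.get(URL)
resp.raise_for_status()
info = resp.json()
print(info.get("name"), info.get("version"))  # assumed fields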
For large datasets being processed on many workers, it is useful to be able to read separate shards of the dataset on each worker. Is this possible with the tfds API?
I experimented a bit with the Split.subsplit functionality, but it looks like it works by reading every dataset element and masking out the selected elements (see here). This means that every worker ends up reading the whole dataset, which can be costly. In particular, this makes it impossible to use tf.data.Dataset.cache.
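One partial workaround (a sketch, not a maintainer answer) is element-level sharding with tf.data.Dataset.shard; note it shares the same caveat, since every worker still iterates the full record stream:

import tensorflow_datasets as tfds

NUM_WORKERS = 4
WORKER_INDEX = 0  # set differently on each worker

ds = tfds.load("mnist", split="train")
# Worker i keeps every NUM_WORKERS-th element. This still reads (and
# discards) the other elements, so it does not remove the I/O cost
# described above; true per-worker reads need file-level sharding.
ds = ds.shard(NUM_WORKERS, WORKER_INDEX)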
Short description
With TF 2.0, there's a new Defun implementation that breaks sequence_feature_test.py and open_images_test.py. Currently these tests are disabled.
Environment information
tensorflow-datasets/tfds-nightly version: HEAD
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tf-nightly-2.0-preview
Reproduction instructions
See Travis failure: https://travis-ci.org/tensorflow/datasets/jobs/491913842
Link to logs
E tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to stack elements of an empty list with non-fully-defined element_shape: <unknown>
E [[{{node sequence_decode/TensorArrayV2Stack/TensorListStack}}]] [Op:IteratorGetNextSync]
Is your feature request related to a problem? Please describe.
Dataset features do not immediately compose with tf.hub modules. For example, I want to fine-tune a model for the cats_vs_dogs dataset by reusing a tfhub image feature module. cats_vs_dogs provides the image as uint8 with an undefined image size, but tfhub image modules expect float32 at specific sizes.
This can be solved with Dataset.map etc., but it is not obvious for beginners and is otherwise just friction for everyone that dilutes the value of this project. The expectation is that the Datasets of this project are ready to plug into downstream computations, not that more massaging/transformation needs to happen.
Describe the solution you'd like
I want to trivially compose tfds datasets with tfhub modules, without having to manually check and align details of tensor shapes and tensor types, or figure out where in the pipeline to insert a conversion function.
One solution could be to provide explicit features that are targeted for compatibility with tf.hub. Another solution could be to have the Builder parameterize/generate a set of FeatureColumns in correspondence to the FeaturesDict.
Describe alternatives you've considered
Dataset.map: ugly and against the spirit of this project, which seems to be "make it easy to plug existing datasets into TF" (a sketch of this workaround follows below).
Additional context
FastAI has very slick examples for fine-tuning a model on this dataset; a TF solution could be competitive or better, but the pieces need to fit together.
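For concreteness, a minimal sketch of the Dataset.map workaround mentioned above, assuming a typical tfhub image module that expects float32 inputs in [0, 1] at a fixed 224x224 size (both assumptions; the dataset itself does not specify them):

import tensorflow as tf
import tensorflow_datasets as tfds

IMAGE_SIZE = (224, 224)  # assumed module input size

def to_hub_format(example):
    # uint8, variable size -> float32 in [0, 1] at a fixed size.
    image = tf.image.convert_image_dtype(example["image"], tf.float32)
    image = tf.image.resize(image, IMAGE_SIZE)
    return image, example["label"]

ds = tfds.load("cats_vs_dogs", split="train").map(to_hub_format).batch(32)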
Is your feature request related to a problem? Please describe.
Using tf-datasets with private datasets stored on Google Cloud Storage returns an error code 403, since the current download method using requests does not handle authentication. Also, paths specified in the form gs://... are not currently supported.
Describe the solution you'd like
Using tf.GFile whenever possible (in place of the current requests-based approach) would solve both of these issues.
Additional context
A possible implementation could add a check (or something like that) during the call of _sync_download() and, if the given URI matches gs:// (or any supported URL), use tf.GFile to download it; a sketch follows below.
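A minimal sketch of that check, using tf.io.gfile.copy as the modern spelling of the tf.GFile approach; the helper name and how it hooks into _sync_download() are hypothetical:

import tensorflow as tf

def _maybe_copy_with_gfile(uri, dest_path):
    # Hypothetical helper: route gs:// URIs through tf.io.gfile, which
    # handles GCS authentication; returning False lets the caller fall
    # back to the existing requests-based download for everything else.
    if uri.startswith("gs://"):
        tf.io.gfile.copy(uri, dest_path, overwrite=True)
        return True
    return False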
Short description
Spurious logging from GCS access when using tfds.load, coming from internal metadata file access.
Environment information
tensorflow-datasets/tfds-nightly version: tfds-nightly
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tf-nightly
Reproduction instructions
import tensorflow as tf
tf.io.gfile.exists("gs://tfds-data")
Link to logs
2019-02-03 02:03:50.095060: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.92628 seconds (attempt 10 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-02-03 02:03:52.022467: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Additional context
Clearly a problem with TensorFlow, but it would be nice not to have these logs dumped (10 retries). They go away on subsequent accesses to GCS; it seems to happen only on first access. And nothing crashes or breaks, you just wait for the 10 retries to finish and move on. Annoying.
One alternative for now may be to use requests to access the TFDS bucket through the GCS HTTP API (similar to #36).
Is your feature request related to a problem? Please describe.
Ideally, the datasets API would be available across languages, like Keras or TensorFlow. Many TF learners come to TensorFlow from JavaScript and would benefit from access to well-known datasets.
Describe the solution you'd like
npm add @tensorflow/tensorflow-datasets
import * as tfds from '@tensorflow/tensorflow-datasets';
const ds = tfds.load('mnist');
Additional context
js.tensorflow.org
Related URL that uses the data: https://github.com/abisee/cnn-dailymail
Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
Hi all, thanks for your interest in tensorflow/datasets. We're actively working on the project and hope to release soon with a starter set of datasets. Please follow this issue to be notified when we have an initial version on PyPI.
We have an alpha nightly release that you can try out: pip install tfds-nightly. Please leave feedback through Issues if you try it out.
If you're interested in contributing a dataset implementation, please feel free to start looking through the new dataset documentation and familiarize yourself with the codebase. MNIST might be a good starting point.
ModuleNotFoundError: No module named 'cPickle'
Jupyter does not seem to be respecting the virtualenv. Update the test script to make the notebook respect the virtualenv.
Results here seem relevant: https://www.google.com/search?q=jupyter+running+in+virtualenv
Name of dataset: MultiNLI
URL of dataset: https://www.nyu.edu/projects/bowman/multinli/
License of dataset: see details in the data description paper: https://www.nyu.edu/projects/bowman/multinli/paper.pdf
Short description of dataset and use case(s):
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.
Folks who would also like to see this dataset in tensorflow/datasets, please +1/thumbs-up so the developers can know which requests to prioritize.
It seems that importing tensorflow_datasets always enables eager execution, which I don't want. Is there a way to disable it? Thank you very much!
If I execute the following snippet, eager mode is enabled by importing tensorflow_datasets:
In [1]: import tensorflow as tf
In [2]: import tensorflow_datasets as tfds
In [3]: tf.executing_eagerly()
Out[3]: True
However, if I execute the following, an error is raised:
In [1]: import tensorflow as tf
In [2]: tf.executing_eagerly()
Out[2]: False
In [3]: import tensorflow_datasets as tfds
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-46a8a2031c9c> in <module>
----> 1 import tensorflow_datasets as tfds
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\__init__.py in <module>
49 # Imports for registration
50 # pylint: disable=g-import-not-at-top
---> 51 from tensorflow_datasets import audio
52 from tensorflow_datasets import image
53 from tensorflow_datasets import text
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\audio\__init__.py in <module>
16 """Audio datasets."""
17
---> 18 from tensorflow_datasets.audio.librispeech import Librispeech
19 from tensorflow_datasets.audio.librispeech import LibrispeechConfig
20 from tensorflow_datasets.audio.nsynth import Nsynth
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\audio\librispeech.py in <module>
26
27 from tensorflow_datasets.core import api_utils
---> 28 import tensorflow_datasets.public_api as tfds
29
30 _CITATION = """\
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\public_api.py in <module>
61
62
---> 63 testing = _import_testing()
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\public_api.py in _import_testing()
55 def _import_testing():
56 try:
---> 57 from tensorflow_datasets import testing # pylint: disable=redefined-outer-name
58 return testing
59 except:
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\testing\__init__.py in <module>
16 """Testing utilities."""
17
---> 18 from tensorflow_datasets.testing.dataset_builder_testing import DatasetBuilderTestCase
19 from tensorflow_datasets.testing.test_case import TestCase
20 from tensorflow_datasets.testing.test_utils import DummyDatasetSharedGenerator
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow_datasets\testing\dataset_builder_testing.py in <module>
37 from tensorflow_datasets.testing import test_utils
38
---> 39 tf.compat.v1.enable_eager_execution()
40
41 # `os` module Functions for which tf.io.gfile equivalent should be preferred.
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in enable_eager_execution(config, device_policy, execution_mode)
5421 device_policy=device_policy,
5422 execution_mode=execution_mode,
-> 5423 server_def=None)
5424
5425
C:\Users\Liyu\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in enable_eager_execution_internal(config, device_policy, execution_mode, server_def)
5489 else:
5490 raise ValueError(
-> 5491 "tf.enable_eager_execution must be called at program startup.")
5492
5493 # Monkey patch to get rid of an unnecessary conditional since the context is
ValueError: tf.enable_eager_execution must be called at program startup.
Any plans to support TensorFlow 2? Since Datasets are not yet released and TF2 is hopefully going to be released in the next few months, it would make total sense.
What do you think?
Currently I've got issues with the contrib part:
# Flatten
--> 127 flat_ds = tf.contrib.framework.nest.flatten(nested_ds)
128 flat_np = []