Comments (3)
There are two workarounds, though:
- Download the datasets from the web and load them locally
- Use the metadata directly (a temporary solution, since the metadata can change)
import datasets
from datasets.data_files import DataFilesDict, DataFilesList

# Build the resolved file list by hand so nothing has to be fetched from the server.
data_files_list = DataFilesList(
    [
        "hf://datasets/allenai/c4@1588ec454efa1a09f29cd18ddd04fe05fc8653a2/en/c4-train.00000-of-01024.json.gz"
    ],
    # Origin metadata for the file: (repo id, commit sha).
    [("allenai/c4", "1588ec454efa1a09f29cd18ddd04fe05fc8653a2")],
)
data_files = DataFilesDict({"train": data_files_list})

c4_dataset = datasets.load_dataset(
    path="allenai/c4",
    data_files=data_files,
    split="train",
    cache_dir="/datesets/cache",
    download_mode="reuse_cache_if_exists",
    token=False,
)
The second workaround also shows where the bug is. I suggest that the hashing functions always use only the original data_files parameter, and not the DataFilesDict they get after connecting to the server.
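To illustrate the suggestion, here is a toy sketch of why hashing the resolved files can break cache lookups. It is not the library's actual code: config_hash and resolve are hypothetical stand-ins, and the assumption that nothing gets resolved offline is only for illustration.

```python
import hashlib
import json


def config_hash(data_files):
    """Hypothetical stand-in for the library's config hashing."""
    payload = json.dumps(data_files, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def resolve(data_files, revision):
    """Hypothetical resolver: pins each path to a commit sha, as happens online."""
    return {
        split: [f"hf://datasets/allenai/c4@{revision}/{p}" for p in paths]
        for split, paths in data_files.items()
    }


# What the user passes in; identical on every run.
original = {"train": ["en/c4-train.00000-of-01024.json.gz"]}

# Online, the files are resolved against the server before hashing.
online = resolve(original, "1588ec454efa1a09f29cd18ddd04fe05fc8653a2")
# Offline (assumption for this sketch), there is nothing to resolve against.
offline = original

hash_resolved_online = config_hash(online)
hash_resolved_offline = config_hash(offline)
hash_original = config_hash(original)

print(hash_resolved_online == hash_resolved_offline)  # resolved inputs differ
print(hash_original == config_hash(original))         # original input is stable
```

Hashing only the original parameter gives the same value whether or not the server is reachable, which is the property the cache lookup needs.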
from datasets.
Hi! You need to set the HF_DATASETS_OFFLINE env variable to 1 to load cached datasets offline, as explained in the docs.
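A minimal sketch of setting that variable from Python rather than the shell. As far as I know the library reads it when datasets is imported, so it is set before the import here; the helper name enable_offline_mode is just for illustration.

```python
import os


def enable_offline_mode(env=os.environ):
    """Set HF_DATASETS_OFFLINE=1 in the given environment mapping."""
    env["HF_DATASETS_OFFLINE"] = "1"
    return env


# Set the variable first, then import the library.
enable_offline_mode()
# import datasets  # uncomment where the library is installed
```

The shell equivalent is to export HF_DATASETS_OFFLINE=1 before starting Python.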
Just tested. It doesn't work, because of the exact problem I described above: the hash of the dataset config is different.
The only difference in the error is the reason given for failing to connect to Hugging Face (now it's 'offline mode is enabled').