jonasgeiping / cramming
Cramming the training of a (BERT-type) language model into limited compute.
License: MIT License
I am asking this for benchmarking purposes. The config files state that training runs for 600_000 micro-batch steps and is terminated after 1 day if it has not reached that count. How many training steps are actually completed in a day on an RTX A4000?
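For benchmarking, the number of steps reached within the budget can be estimated from a measured per-step time. This is a minimal sketch, assuming the 600_000-step cap and one-day budget quoted above; the function name and the example timing are hypothetical and are not taken from the repository:

```python
def steps_within_budget(seconds_per_step: float, budget_hours: float = 24.0,
                        max_steps: int = 600_000) -> int:
    """Micro-batch steps completed before the step cap or the time budget ends."""
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    budget_seconds = budget_hours * 3600
    # Whichever limit is hit first ends training.
    return min(max_steps, int(budget_seconds // seconds_per_step))


# Hypothetical timing: 0.25 s per micro-batch step (measure this on your own GPU).
print(steps_within_budget(0.25))
```

Timing a few hundred steps on the target GPU and plugging in the average gives a rough answer without running the full day.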
Hi,
I am trying to replicate the final recipe by running
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
as explained in the README file, and I am getting the following error:
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'
The error message suggests setting the environment variables TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1, which I did; the resulting output is shown below. Please help me figure out how to solve this ldconfig issue. I could not find a solution on the web.
[2023-12-19 17:44:59,958] [0/0] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function inductor
Error executing job with overrides: ['name=amp_b8192_cb_o4_final', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade']
Traceback (most recent call last):
File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/pretrain.py", line 199, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/utils.py", line 54, in main_launcher
metrics = main_fn(cfg, setup)
File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/pretrain.py", line 55, in main_training_process
loss = model_engine.step(device_batch)
File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/backend/torch_default.py", line 124, in step
loss = self.forward(**batch)["loss"]
File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/backend/torch_default.py", line 140, in forward
return self.model(*inputs, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2069, in run
super().run()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 719, in run
and self.step()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 683, in step
getattr(self, inst.opname)(inst)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in RETURN_VALUE
self.output.compile_subgraph(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/__init__.py", line 1568, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 961, in compile_fx
return compile_fx(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
return aot_autograd(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2917, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
return inner_compile(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/debug.py", line 228, in inner
return fn(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
return old_func(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
return self.compile_to_module().call
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/graph.py", line 941, in compile_to_module
mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_imodoran/k6/ck6fiae7msa7cgviyukidcm4bynb5bjdai7xz5hbv7tswlzqpxba.py", line 1127, in <module>
async_compile.wait(globals())
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1418, in wait
scope[key] = result.result()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1277, in result
self.future.result()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'
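The traceback ends with the inductor backend failing because the ldconfig executable cannot be found. One possible cause (an assumption, not confirmed in this thread) is that the directory holding ldconfig, often /sbin or /usr/sbin, is missing from PATH on the compute node. A small check that could be run in the same environment before training; ensure_on_path is a hypothetical helper, not part of cramming or PyTorch:

```python
import os
import shutil


def ensure_on_path(tool: str = "ldconfig", extra_dirs=("/sbin", "/usr/sbin")) -> bool:
    """Return True if `tool` resolves, appending one of extra_dirs to PATH if needed."""
    if shutil.which(tool):
        return True
    for directory in extra_dirs:
        if os.path.isfile(os.path.join(directory, tool)):
            current = os.environ.get("PATH", "")
            # Append rather than prepend so existing tools keep their priority.
            os.environ["PATH"] = f"{current}{os.pathsep}{directory}" if current else directory
            return shutil.which(tool) is not None
    return False


if __name__ == "__main__":
    print("ldconfig found:", ensure_on_path())
```

If the tool is missing entirely, disabling compilation with impl.compile_torch=False (the flag that appears in the README's eval command) might sidestep the inductor path, although that is untested for this case.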
Hey,
I'm trying to install cramming and run your code, but I got the following error:
File /opt/conda/lib/python3.10/site-packages/datasets/distributed.py:3
1 from typing import TypeVar
----> 3 from .arrow_dataset import Dataset, _split_by_node_map_style_dataset
4 from .iterable_dataset import IterableDataset, _split_by_node_iterable_dataset
7 DatasetType = TypeVar("DatasetType", Dataset, IterableDataset)
ImportError: cannot import name '_split_by_node_map_style_dataset' from 'datasets.arrow_dataset' (/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py)
Can you publish the pip freeze output of your environment and the Python version you are using? I suspect an incompatibility is the reason.
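When comparing environments for an import error like this, it can help to report only the packages most likely to matter. A minimal sketch using the standard library; the package list is an assumption about which versions are relevant here:

```python
import sys
from importlib import metadata


def report_versions(packages=("datasets", "transformers", "tokenizers", "torch")):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python", sys.version.split()[0])
for name, version in report_versions().items():
    print(name, version or "not installed")
```

Posting this output alongside the traceback makes it easier to spot a mismatch between the installed datasets release and the one the repository was developed against.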
Hi
I am doing some benchmarking as part of my experiments: I want to train a BERT model with a sequence length of 512 and dtype float32. I pretrained the model with this configuration and ran the evaluation on GLUE_sane, but the numbers are very poor.
May I know what went wrong?
The verification command fails on macOS Ventura on a MacBook Pro M1 Pro:
python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2
The error:
Error executing job with overrides: ['name=test', 'arch=bert-base', 'train=bert-base', 'data=sanity-check-2', 'dryrun=True', 'impl.microbatch_size=2']
Traceback (most recent call last):
File "/Users/louislac/Documents/Developer/Python/cramming/pretrain.py", line 153, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/Users/louislac/Documents/Developer/Python/cramming/cramming/utils.py", line 57, in main_launcher
setup = system_startup(cfg)
File "/Users/louislac/Documents/Developer/Python/cramming/cramming/utils.py", line 81, in system_startup
torch.multiprocessing.set_sharing_strategy(cfg.impl.sharing_strategy)
File "/Users/louislac/Documents/Developer/Python/cramming/.env/lib/python3.10/site-packages/torch/multiprocessing/__init__.py", line 58, in set_sharing_strategy
assert new_strategy in _all_sharing_strategies
AssertionError
Upon investigation, it looks like impl.sharing_strategy is "file_descriptor" (the default value), but _all_sharing_strategies only includes "file_system" on macOS and Windows. Changing this value to file_system solves the issue, though I do not know the implications:
python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2 impl.sharing_strategy=file_system
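A platform-independent alternative to the command-line override would be to fall back automatically when the configured strategy is unavailable. The sketch below is an illustration only: pick_sharing_strategy is a hypothetical helper, and the commented usage assumes the strategy set is queried through torch.multiprocessing.get_all_sharing_strategies():

```python
def pick_sharing_strategy(requested: str, available) -> str:
    """Return `requested` if the platform supports it, else a supported fallback."""
    available = set(available)
    if not available:
        raise ValueError("no sharing strategies available")
    if requested in available:
        return requested
    # "file_system" is the strategy reported as available on macOS above.
    return "file_system" if "file_system" in available else sorted(available)[0]


# Usage with PyTorch (not executed here):
#   import torch.multiprocessing as mp
#   strategy = pick_sharing_strategy(cfg.impl.sharing_strategy, mp.get_all_sharing_strategies())
#   mp.set_sharing_strategy(strategy)
print(pick_sharing_strategy("file_descriptor", {"file_system"}))
```

This keeps the default unchanged on Linux while avoiding the AssertionError elsewhere.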
I have been playing with this on my local hardware, which is somewhat smaller even than your paper's reference machines (the GPU is a GTX 1080 with 8 GB). One thing that has become apparent is that investigating how model size scales (number of heads, depth, etc.) is difficult, because substantially different hyperparameters are required for effective model calibration as the size is varied. The paper by Yang et al., "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (https://arxiv.org/abs/2203.03466), addresses exactly this issue and proposes modifications to how hyperparameters and initializations are specified, so that good hyperparameter choices are much more invariant across model size. I suggest that incorporating their parameterization would be a very useful change. It would allow more rapid initial exploration with very small crammed models, followed by much easier scaling up to test things in the larger model context.
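To make the suggestion concrete, here is a rough sketch of one ingredient of that parameterization, not the full scheme from the paper: with an Adam-style optimizer, the learning rate of hidden (matrix-like) weights is scaled inversely with width relative to a small base model on which it was tuned. The function name and the numbers are hypothetical:

```python
def scaled_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Scale a learning rate tuned at `base_width` to a model of size `width`.

    Simplified rule for hidden weights only; embeddings, biases, output
    layers and initializations are treated differently in the paper.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Hypothetical: learning rate tuned at width 256, transferred to width 1024.
print(scaled_hidden_lr(1e-3, base_width=256, width=1024))
```

The point of the transfer is that the sweep is done once on the small model and reused, instead of being repeated at every size.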
Hi,
Thank you for this amazing repository. I am trying to replicate your model by running the default command in README
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
and
python eval.py eval=GLUE_sane name=amp_b8192_cb_o4_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True impl.compile_torch=False
The only change I made to the above commands is adding 'budget=24' to the training command.
I trained the model for 24 hours on one A100 40GB GPU, but the average GLUE score is only 0.73; based on your paper, I assumed it should be somewhere between 0.792 (A4000) and 0.804 (A6000).
The repository was installed in a fresh conda environment. I made only three changes to the code: the changes mentioned in #38 and #44, and the wandb configs.
Attached below is the wandb log of the pre-training loss; the loss ends at 2.973 and the curve does not look right.
Could you guide me on what might be the problem? I am happy to provide any further information you need.
Thanks so much for the help!
Hmm, this may seem a bit excessive, but I'm a bit confused and don't know how to preprocess the data and train a RoBERTa model. Could you write a basic step-by-step tutorial?
I'm also looking to implement a custom tokenizer for training. Do you have any suggestions?
Thanks a lot.
Hello!
I'm trying to use a model that was pre-trained using cramming as a huggingface model (using AutoModel.from_pretrained(PATH_TO_MODEL)).
The transformers library needs a model.bin file instead of the model.pth format that the save_final_model() function currently creates.
Is there a suggested way to convert the files easily, or to use the checkpoints as a 'huggingface' model?
Thanks!
After pip install -e . I tried
python pretrain.py name=test arch=hf-bert-base train=bert-base dryrun=True
and the console shows the following error:
zsh: illegal hardware instruction python pretrain.py name=test arch=hf-bert-base train=bert-base dryrun=True
Any idea? Thanks.
Hi, thank you for this wonderful work.
I ran into some trouble reproducing the head-only results. I can reproduce your results with end-to-end tuning, but when I freeze the BERT (encoder) parameters and tune only the classification head, the result is not as good as with your checkpoint.
The SST-2 accuracy of your checkpoint at https://huggingface.co/JonasGeiping/crammed-bert is 0.922 (end-to-end) and 0.918 (head only) in my reproduction. The bert-base-uncased (from HuggingFace) accuracy is 0.931 (end-to-end) and 0.930 (head only).
I downloaded the c4-subset-processed from your dropbox link and I replicated your work by running:
python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed
The end-to-end accuracy on SST-2 is 0.922, but the head-only accuracy is only 0.784. I'm wondering why I got this problem.
I freeze the encoder parameters by:
for param in model.encoder.parameters():
    param.requires_grad = False
I also want to know how the checkpoint at https://huggingface.co/JonasGeiping/crammed-bert was trained? Was it trained by running the above command?
Thanks again for your time!
Error executing job with overrides: []
Traceback (most recent call last):
File "/tmp/pycharm_project_41/cramming-main/pretrain.py", line 153, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/tmp/pycharm_project_41/cramming-main/cramming/utils.py", line 64, in main_launcher
main_fn(cfg, setup)
File "/tmp/pycharm_project_41/cramming-main/pretrain.py", line 45, in main_training_process
for step, batch in iterable_data:
File "/tmp/pycharm_project_41/cramming-main/cramming/backend/utils.py", line 263, in next
batch = next(self.dataset_iterator)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py", line 42, in call
return self.torch_call(features)
File "/tmp/pycharm_project_41/cramming-main/cramming/backend/utils.py", line 221, in torch_call
storage = elem._storage()._new_shared(len(examples) * 8 * elem.shape[0], device=elem.device) # 8 for byte->long
TypeError: _new_shared() got an unexpected keyword argument 'device'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Process finished with exit code 1
Thanks for the fix, however when I run the pretraining script with the updated command the following error was raised:
Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
metrics = main_fn(cfg, setup)
File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
return _load_from_hub(cfg_data, data_path)
File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Have you encountered similar issues?
Thank you
Originally posted by @shiwenqin in #43 (comment)
Hello, I've been using this repository on a cloud cluster of A100 GPUs. Unfortunately, my credits have run out, and I'm planning to buy a PC to continue running experiments. The RTX 3060 has 12 GB of VRAM, which is 1 GB more than the 2080 that was used in the paper. Do you think it would be possible to pre-train a BERT model with the RTX 3060? It would be great if you could advise me on this before I go ahead and buy the PC.
Thank you very much!
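On the memory side, a smaller card mostly changes how many micro-batches are accumulated per optimizer step rather than the effective batch size. A back-of-the-envelope sketch, assuming the batch given by train.batch_size is assembled from micro-batches of size impl.microbatch_size; the helper name and the micro-batch sizes are hypothetical:

```python
import math


def accumulation_steps(batch_size: int, microbatch_size: int) -> int:
    """Number of micro-batches accumulated to reach one full batch."""
    if batch_size <= 0 or microbatch_size <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(batch_size / microbatch_size)


# Hypothetical micro-batch sizes that might fit on cards with different VRAM.
for microbatch in (128, 96, 64):
    print(microbatch, accumulation_steps(4096, microbatch))
```

Whether a given micro-batch size actually fits in 12 GB has to be checked empirically, for example with a dryrun=True launch as in the verification command.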
tokenizer, cfg_arch, model_file = cramming.utils.find_pretrained_checkpoint(cfg)
File "/home/tahabinhuraib/cramming/cramming/utils.py", line 177, in find_pretrained_checkpoint
all_checkpoints = [f for f in os.listdir(local_checkpoint_folder)]
FileNotFoundError: [Errno 2] No such file or directory: '/home/tahabinhuraib/cramming/outputs/bert-finetuning/checkpoints'
Great work and lovely repo. However, I am failing to push to HF using the provided load_local_model.py script.
I have a private dataset and ran the pre-training script successfully via:
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data={my_dataset}
It trained and saved fine.
But when I then try pushing to the hub, for instance with:
python load_local_model.py name=amp_b8192_cb_o4_mimic_final wandb=none impl.push_to_huggingface_hub=True arch=crammed-bert train=bert-o4 dryrun=False +eval=GLUE_sane
I get a whole lot of missing keys when trying to load the state dicts:
RuntimeError: Error(s) in loading state_dict for OptimizedModule:
Missing key(s) in state_dict: "_orig_mod.encoder.embedding.word_embedding.weight", "_orig_mod.encoder.embedding.pos_embedding.scale_factor", "_orig_mod.encoder.embedding.norm.weight", "_orig_mod.encoder.embedding.norm.bias", "_orig_mod.encoder.layers.0.norm1.weight",....
and so on.
Is there anything obvious I am missing when trying to re-load the model?
Another question: is there a straightforward way to convert the current model files to a format compatible with the HF transformers library, but locally rather than via the hub?
Any help would be much appreciated. Package info below. Python 3.10.
Package Version
------------------------ ------------
aiohttp 3.8.5
aiosignal 1.3.1
antlr4-python3-runtime 4.9.3
asttokens 2.4.0
async-timeout 4.0.3
attrs 23.1.0
backcall 0.2.0
certifi 2023.7.22
charset-normalizer 3.2.0
cmake 3.27.4.1
comm 0.1.4
cramming 0.1.0
datasets 2.14.5
debugpy 1.8.0
decorator 5.1.1
dill 0.3.7
einops 0.6.1
evaluate 0.4.0
exceptiongroup 1.1.3
executing 1.2.0
filelock 3.12.4
frozenlist 1.4.0
fsspec 2023.6.0
huggingface-hub 0.16.4
hydra-core 1.3.2
idna 3.4
ipykernel 6.25.2
ipython 8.15.0
jedi 0.19.0
Jinja2 3.1.2
joblib 1.3.2
jupyter_client 8.3.1
jupyter_core 5.3.1
lit 16.0.6
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
nest-asyncio 1.5.7
networkx 3.1
numpy 1.25.2
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
omegaconf 2.3.0
packaging 23.1
pandas 2.1.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pip 22.3.1
platformdirs 3.10.0
prompt-toolkit 3.0.39
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 13.0.0
Pygments 2.16.1
pynvml 11.5.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
regex 2023.8.8
requests 2.31.0
responses 0.18.0
safetensors 0.3.3
scikit-learn 1.3.0
scipy 1.11.2
setuptools 65.5.0
six 1.16.0
stack-data 0.6.2
sympy 1.12
threadpoolctl 3.2.0
tokenizers 0.13.3
torch 2.0.1
tornado 6.3.3
tqdm 4.66.1
traitlets 5.10.0
transformers 4.33.2
triton 2.0.0
typing_extensions 4.7.1
tzdata 2023.3
urllib3 2.0.4
wcwidth 0.2.6
wheel 0.41.2
xxhash 3.3.0
yarl 1.9.2
zstandard 0.21.0
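Regarding the missing keys reported above: the "_orig_mod." prefix is what torch.compile adds when it wraps a module in OptimizedModule. One plausible reading (an assumption, not verified against this checkpoint) is that the checkpoint holds plain keys while the model being loaded into is compiled, or the reverse. A sketch of remapping the keys in either direction; both helpers are hypothetical and operate on an ordinary dict:

```python
PREFIX = "_orig_mod."  # prefix torch.compile adds to wrapped-module parameter names


def strip_compile_prefix(state_dict: dict) -> dict:
    """Remove the compile prefix so keys match an uncompiled model."""
    return {
        (key[len(PREFIX):] if key.startswith(PREFIX) else key): value
        for key, value in state_dict.items()
    }


def add_compile_prefix(state_dict: dict) -> dict:
    """Add the compile prefix so keys match a compiled (OptimizedModule) model."""
    return {
        (key if key.startswith(PREFIX) else PREFIX + key): value
        for key, value in state_dict.items()
    }


example = {"encoder.embedding.word_embedding.weight": 0, "encoder.layers.0.norm1.weight": 1}
print(sorted(add_compile_prefix(example)))
```

Alternatively, loading into an uncompiled model (for example by passing impl.compile_torch=False, as in the README's eval command) may avoid the mismatch altogether.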
Hi Jonas,
Thanks for sharing the great work! I have a small question about the paper.
Both your paper and Izsak et al. refer to RoBERTa for something called "sparse token prediction", which I couldn't find in the RoBERTa paper. From your code, it appears that "sparse token prediction" just means that the loss is calculated only from the positions that are masked. It seems that this should be the default setting for training an MLM (and appears to be the case in BERT's code). The situation where you turn off sparse prediction doesn't quite make sense: why would one want to predict the unmasked tokens? Am I missing something obvious here?
Thanks for any help!
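One possible reading of the distinction, offered as an assumption rather than the authors' answer: in both settings the loss comes only from masked positions, but without sparse prediction the vocabulary-sized output head is still evaluated at every position and the unmasked logits are simply ignored, whereas sparse prediction selects the masked positions before the head. A toy pure-Python sketch of that selection step; the names and values are illustrative only:

```python
IGNORE_INDEX = -100  # label value commonly used to mark positions excluded from the loss


def select_masked(hidden_states, labels):
    """Keep only (hidden_state, label) pairs at masked positions."""
    return [(h, y) for h, y in zip(hidden_states, labels) if y != IGNORE_INDEX]


# Four token positions, two of which were masked (labels 17 and 42).
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
labels = [IGNORE_INDEX, 17, IGNORE_INDEX, 42]

kept = select_masked(hidden_states, labels)
print(len(kept), "of", len(labels), "positions reach the prediction head")
```

Under this reading the training objective is identical in both settings; only the amount of compute spent in the output head differs.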
Hey there, and thank you for this wonderful work!
I'm trying to grab the preprocessed dataset files from Dropbox, but it is sort of a pain to download them remotely because Dropbox puts restrictions on the links :\
Would it be possible for you to mirror it on Google Drive (so gdown would work) or on S3 (via Requester Pays)?
Hello,
How much storage space should I reserve to run the following recipe?
python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c4 train=bert-o3 train.batch_size=4096 data=c4-subset-processed
While evaluating UltraFastBERT (a downstream project that uses this repository, with mostly identical code, under the training folder of https://github.com/pbelcak/UltraFastBERT), I encountered the following error when running
python eval.py eval=GLUE name=UltraFastBERT-1x11-long eval.checkpoint=hf://pbelcak/UltraFastBERT-1x11-long impl.microbatch_size=4
:
loaded with 164,460,531 parameters.
Some weights of ScriptableLMForSequenceClassification were not initialized from the model checkpoint at pbelcak/UltraFastBERT-1x11-long and are newly initialized: ['pooler.dense.weight', 'head.weight', 'head.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Error executing job with overrides: ['eval=GLUE', 'name=UltraFastBERT-1x11-long', 'eval.checkpoint=hf://pbelcak/UltraFastBERT-1x11-long', 'impl.microbatch_size=4']
Traceback (most recent call last):
File "/root/autodl-tmp/UltraFastBERT/training/eval.py", line 147, in launch
cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
File "/root/autodl-tmp/UltraFastBERT/training/cramming/utils.py", line 54, in main_launcher
metrics = main_fn(cfg, setup)
File "/root/autodl-tmp/UltraFastBERT/training/eval.py", line 37, in main_downstream_process
model_engine.load_checkpoint(cfg_arch, model_file)
File "/root/autodl-tmp/UltraFastBERT/training/cramming/backend/torch_default.py", line 237, in load_checkpoint
self.optimizer, self.scheduler = _load_optimizer(self.model, self.cfg_train, self.cfg_impl)
TypeError: _load_optimizer() missing 1 required positional argument: 'initial_time'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
And indeed, line 237 of the file calls _load_optimizer with just 3 arguments instead of 4:
cramming/cramming/backend/torch_default.py
Line 237 in f6ba4cb
Maybe add self.initial_time as the fourth argument?
Hi
I am trying to train a crammed BERT on the bookcorpus dataset and evaluate it on GLUE, but during evaluation I got a CUDA error and am not sure what went wrong.
Here is the training step:
return dsl.ContainerOp(
    name='Train Model',
    image='tiruai/cramming-bert-training:v0.1',
    command="python",
    arguments=[
        "/app/pretrain.py",
        "name=bookcorpus_wiki_training",
        "data=bookcorpus-wikipedia",
        "arch=bert-c5",
        "train=bert-o3",
        "train.batch_size=4096"
    ],
    # file_outputs={
    #     'model': '/mnt/model.pt',
    # },
    pvolumes={"/mnt": vol_existing}
).set_image_pull_policy('Always').set_gpu_limit(1).set_image_pull_policy('Always').set_cpu_limit("100").set_memory_limit("100Gi")
evaluation step code
def eval_op():
    return dsl.ContainerOp(
        name='Evaluation GLUE Model',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=[
            "/app/eval.py",
            "name=bookcorpus_wiki_training",
            "eval.checkpoint=latest",
            "impl.microbatch_size=16",
            "impl.shuffle_in_dataloader=True",
        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy('Always').set_gpu_limit(1).set_image_pull_policy('Always').set_cpu_limit("100").set_memory_limit("100Gi")
error message
[2023-05-24 08:31:36,608] [CLS] it is born of another of those fated yet fortuitous connections in didion's disorienting world, this one between two people ( elena mcmahon and treat morrison ) who were equally remote. [SEP] ellena mcmahon and treat morrison have a lucky connection despite both being remote. [SEP]
[2023-05-24 08:31:36,608] ... is tokenized into ...
[2023-05-24 08:31:36,609] [CLS]_it_is_born_of_another_of_those_fated_yet_fort_##uit_##ous_connections_in_did_##ion_'_s_di_##sor_##ient_##ing_world_,_this_one_between_two_people_(_elena_mcmahon_and_treat_morrison_)_who_were_equally_remote_._[SEP]_ellen_##a_mcmahon_and_treat_morrison_have_a_lucky_connection_despite_both_being_remote_._[SEP]
[2023-05-24 08:31:36,610] Correct Answer: entailment
[2023-05-24 08:31:36,610] Random sentence from validset of size 9,815: ...
[2023-05-24 08:31:36,611] [CLS] in the small marina you can eat while surrounded by expensive boats. [SEP] in the marina is where you can eat while being around expensive boats. [SEP]
[2023-05-24 08:31:36,611] Correct Answer: entailment
[2023-05-24 08:31:36,618] Finetuning task mnli with 3 classes for 245430 steps.
[2023-05-24 08:31:40,062] Model with architecture ScriptableMaskedLM loaded with 118,654,467 parameters.
[2023-05-24 08:31:41,135] State dict difference is ScriptableLMForSequenceClassification:
Missing key(s) in state_dict: "pooler.dense.weight", "pooler.dense.bias", "head.weight", "head.bias".
Unexpected key(s) in state_dict: "prediction_head.weight", "decoder.weight". ... Ok?
Running tokenizer on dataset: 100%|█████████▉| 391168/392702 [00:27<00:00, 15652.23 examples/s]
Running tokenizer on dataset: 100%|██████████| 9815/9815 [00:00<00:00, 14685.07 examples/s]
Running tokenizer on dataset: 100%|██████████| 9832/9832 [00:00<00:00, 15771.16 examples/s]
Running tokenizer on dataset: 100%|██████████| 9796/9796 [00:00<00:00, 12395.41 examples/s]
Running tokenizer on dataset: 100%|██████████| 9847/9847 [00:00<00:00, 16111.78 examples/s]
Downloading builder script: 100%|██████████| 5.75k/5.75k [00:00<00:00, 2.66MB/s]
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
Error executing job with overrides: ['name=bookcorpus_wiki_training', 'eval.checkpoint=latest', 'impl.microbatch_size=16', 'impl.shuffle_in_dataloader=True']
Traceback (most recent call last):
File "/app/eval.py", line 114, in launch
cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
File "/app/cramming/utils.py", line 64, in main_launcher
main_fn(cfg, setup)
File "/app/eval.py", line 48, in main_downstream_process
loss = model_engine.step(device_batch)
File "/app/cramming/backend/torch_default.py", line 112, in step
self.backward(loss)
File "/app/cramming/backend/torch_default.py", line 132, in backward
return self.scaler.scale(loss / self.accumulation_steps_expected).backward()
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 450, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ff18cdf470c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7ff18cdb7620 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(char const*, char const*, int, bool) + 0x33e (0x7ff18ce7e68e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xe86e5c (0x7ff18dd25e5c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x507c0a (0x7ff1cd415c0a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3b861 (0x7ff18cdd6861 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x186 (0x7ff18cdd00b6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x7ff18cdd01dd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x786958 (0x7ff1cd694958 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7ff1cd694ce5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5ce863]
frame #11: /usr/bin/python() [0x5d176c]
frame #12: /usr/bin/python() [0x5d1908]
frame #13: /usr/bin/python() [0x5a978d]
frame #14: /usr/bin/python() [0x5eb5b1]
frame #15: /usr/bin/python() [0x4effff]
frame #16: /usr/bin/python() [0x5fccc7]
frame #17: PyGC_Collect + 0x4c (0x6739ac in /usr/bin/python)
frame #18: Py_FinalizeEx + 0x7a (0x680b4a in /usr/bin/python)
frame #19: Py_Exit + 0xc (0x67f76c in /usr/bin/python)
frame #20: /usr/bin/python() [0x67f79b]
frame #21: PyErr_PrintEx + 0x16 (0x67f9c6 in /usr/bin/python)
frame #22: PyRun_SimpleFileExFlags + 0x1c5 (0x67fc25 in /usr/bin/python)
frame #23: Py_RunMain + 0x212 (0x6b8082 in /usr/bin/python)
frame #24: Py_BytesMain + 0x2d (0x6b840d in /usr/bin/python)
frame #25: __libc_start_main + 0xf3 (0x7ff220a23083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #26: _start + 0x2e (0x5faa2e in /usr/bin/python)
Error: signal: aborted (core dumped)
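The device-side assertion `t >= 0 && t < n_classes` in the log above means that at least one target label fell outside the range the loss expected. Because CUDA reports such failures asynchronously, it can help to check the labels on the host before any GPU work starts. The following is a minimal sketch of such a check in plain Python; the function name and the example labels are hypothetical and not part of the cramming code base.

```python
from collections import Counter


def find_out_of_range_labels(labels, n_classes):
    """Count every label value that lies outside the range [0, n_classes)."""
    return Counter(t for t in labels if not 0 <= t < n_classes)


# Hypothetical MNLI-style targets with three classes: 0, 1 and 2.
example_labels = [0, 2, 1, 2, 0, 1]

# A classifier head with only 2 outputs cannot represent the label 2.
bad_for_two = find_out_of_range_labels(example_labels, n_classes=2)
# A classifier head with 3 outputs accepts every label.
bad_for_three = find_out_of_range_labels(example_labels, n_classes=3)

print(dict(bad_for_two))
print(dict(bad_for_three))
```

If the first dictionary is non-empty, the number of outputs of the classification head is smaller than the number of classes in the dataset, which is worth ruling out before looking at the CUDA stack trace.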
Hello!
I am wondering what the correct data preprocessing command is for the final recipe. Could you add this information to the README?
Also, is there a straightforward way to restrict memory requirements during preprocessing? It seems to use 60GB+ of RAM when reading data via gzip (using one of the preprocessing commands from scripts/preprocessing.sh).
error-log.txt
Hi, I am also trying to replicate the preprocessed c4 dataset.
The default config has deduplicate_entries: True, but the dedup tool cannot be found: cramming/dedup/release/dedup_dataset: not found.
Where can I get the dedup tool? And, if possible, can the preprocessed c4 dataset be downloaded somewhere?
After cloning and installation, this command:
python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2
produces "In 'cfg_pretrain': Could not find 'arch/bert-base'". If I replace the arch argument with train/hf-bert-tiny, I get:
"FileNotFoundError: Directory /root/cramming/outputs/data/sanity-check-2_BPEx32768_aa4b98dc480e637aa82f59461e1b1729 not found"
If I try the final recipe: python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
I get "RuntimeError: Unexpected optimization option max_autotune_gemm".
Hello,
First of all, thank you for the effort you put into this work. I have pretrained a crammed BERT model on custom data and would like to know whether it can be used for a QA task. I tried to register it as a modified architecture of ScriptableLMForTokenClassification, but I could not. Do you have any suggestions for fine-tuning on a QA task, especially when using it as an HF model?
I followed the instructions to replicate the Last1.13release, using the README.md of the corresponding version, i.e.
python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=bookcorpus-wikipedia
python eval.py eval=GLUE_sane name=amp_b4096_c5_o3_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True
The pretraining worked fine, except for a loss explosion with the default lr_scheduler budget-triangle2 in bert-o3.yaml, so I switched to budget-one-cycle, following the scheduler comparison in the paper, since the two behave similarly in terms of pretraining loss decay.
The pretraining finally reached a loss of 1.8282 on an RTX2080Ti in a single day, equivalent to the result reported in the paper. During evaluation, however, problems appeared for downstream tasks that are not 2-class classification, such as the 3-class MNLI and the single-output STSB.
For MNLI, the errors were
RuntimeError: CUDA error: device-side assert triggered
or
IndexError: Target 2 is out of bounds
when putting the model on the CPU to look for further information.
For STSB, the loss evaluation failed with
Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 2]))
I checked the code carefully and found that the problem comes from one line in 'class ScriptableLMForSequenceClassification(PreTrainedModel)':
config.arch['num_labels'] = config.num_labels
The model is initialized in the downstream task function (https://github.com/JonasGeiping/cramming/blob/4a5e3008a5ec05ed68f9d096e4875f8dddadcf81/cramming/architectures/scriptable_bert.py#L24C1-L35C17):
def construct_scriptable_bert(cfg_arch, vocab_size, downstream_classes=None):
    """See the config file for details on what is possible."""
    cfg_arch.embedding.vocab_size = vocab_size
    cfg_arch.num_labels = downstream_classes
    config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))
    if downstream_classes is None:
        model = ScriptableLMForPreTraining(config)
    else:
        model = ScriptableLMForSequenceClassification(config)
    return model

class crammedBertConfig(PretrainedConfig):
    model_type = "crammedBERT"

    def __init__(self, cfg_arch_container: dict = {}, **kwargs):
        self.arch = cfg_arch_container
        super().__init__(**kwargs)
All of the modifications here work, and I realized that the arguments passed to ScriptableLMForSequenceClassification are stored in the arch attribute of the crammedBertConfig class, which inherits from the transformers library's base class PretrainedConfig.
class ScriptableLMForSequenceClassification(PreTrainedModel):
    """Classification head and pooler."""

    config_class = crammedBertConfig

    def __init__(self, config):
        super().__init__(config)
        config.arch['num_labels'] = config.num_labels
        self.cfg = OmegaConf.create(config.arch)  # this could be nicer ...
        self.encoder = ScriptableLM(config)
        self.pooler = PoolingComponent(self.cfg.classification_head, self.cfg.hidden_size)
        self.head = torch.nn.Linear(self.cfg.classification_head.head_dim, self.cfg.num_labels)
However, the line config.arch['num_labels'] = config.num_labels overwrites the number of output classes with 2, since PretrainedConfig sets its num_labels attribute to 2 by default.
I commented out this line and it seems to work fine.
As this released version is fairly old compared to the newest Torch 2.1, I think it is pointless to open a PR, so I am leaving an issue here in case someone encounters the same problem :)
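The overwrite described in this report can be reproduced without transformers or torch. The sketch below uses two hypothetical stand-in classes (BaseConfigStandIn and CrammedConfigStandIn are invented for illustration and only mimic the relevant behaviour: a base config whose num_labels defaults to 2, and a subclass that keeps the architecture dictionary in .arch). It is not the repository's code, only a small model of the mechanism.

```python
class BaseConfigStandIn:
    """Hypothetical stand-in for a base config whose num_labels defaults to 2."""

    def __init__(self, **kwargs):
        self.num_labels = kwargs.pop("num_labels", 2)


class CrammedConfigStandIn(BaseConfigStandIn):
    """Mimics the shape of crammedBertConfig: the architecture dict lives in .arch."""

    def __init__(self, cfg_arch_container=None, **kwargs):
        self.arch = dict(cfg_arch_container or {})
        super().__init__(**kwargs)


def head_size_with_overwrite(config):
    # Reproduces the reported line: the default (2) replaces the task value.
    config.arch["num_labels"] = config.num_labels
    return config.arch["num_labels"]


def head_size_without_overwrite(config):
    # The workaround from the report: keep the value set by the constructor.
    return config.arch["num_labels"]


mnli_arch = {"num_labels": 3}  # value a downstream constructor would set for MNLI
overwritten = head_size_with_overwrite(CrammedConfigStandIn(mnli_arch))
preserved = head_size_without_overwrite(CrammedConfigStandIn(mnli_arch))
print(overwritten, preserved)
```

With the overwrite in place the head ends up with 2 outputs even though the task has 3 classes, which matches the "Target 2 is out of bounds" error quoted above. Passing num_labels as a keyword to the config would be another way to keep the two values in sync, but that is an assumption and not something the report tested.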
Thanks for your great work! I want to compare BERT with GPT at the same model size, so I wonder whether there are any configs for training a GPT-like model. Is it enough to just remove the mask token from the input and change the attention mask and prediction target accordingly?
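For reference, the two changes named in the question (a causal attention mask and next-token prediction targets) can be sketched in a few lines of plain Python. This is only an illustration of the general technique with hypothetical token ids; it is not a configuration from the cramming repository, and it does not settle whether these two changes alone make for a fair comparison.

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return token_ids[:-1], token_ids[1:]


tokens = [101, 7, 8, 9, 102]  # hypothetical token ids, no mask token inserted
inputs, targets = next_token_pairs(tokens)
mask = causal_mask(len(inputs))

for row in mask:
    print(row)
print(inputs, targets)
```

In contrast to masked language modeling, every position contributes to the loss here, and no position can see the token it is asked to predict.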
Error executing job with overrides: ['name=test', 'arch=hf-bert-base', 'train=bert-base', 'data=sanity-check-2', 'dryrun=True', 'impl.microbatch_size=2']
Traceback (most recent call last):
File "/root/cramming/cramming/data/pretraining_preparation.py", line 47, in load_pretraining_corpus
tokenized_dataset = datasets.load_from_disk(data_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/load.py", line 1898, in load_from_disk
raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory /root/cramming/outputs/data/sanity-check-2_BPEx32768_324a8001208359684a2025ba5bd5f119 not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/cramming/pretrain.py", line 155, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/root/cramming/cramming/utils.py", line 54, in main_launcher
metrics = main_fn(cfg, setup)
^^^^^^^^^^^^^^^^^^^
File "/root/cramming/pretrain.py", line 21, in main_training_process
dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/cramming/cramming/data/pretraining_preparation.py", line 63, in load_pretraining_corpus
preprocessed_dataset, new_tokenizer = preprocess_dataset(
^^^^^^^^^^^^^^^^^^^
File "/root/cramming/cramming/data/pretraining_preparation.py", line 169, in preprocess_dataset
tokenized_dataset = _huggingface_preprocessing(raw_data, tokenizer, cfg_data, num_threads=num_threads) # Tokenize, group, sort...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/cramming/cramming/data/pretraining_preparation.py", line 238, in _huggingface_preprocessing
tokenized_dataset = raw_dataset.map(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 580, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 545, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 3170, in map
with Pool(len(kwargs_per_job)) as pool:
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/pool.py", line 191, in __init__
self._setup_queues()
File "/usr/local/lib/python3.11/dist-packages/multiprocess/pool.py", line 346, in _setup_queues
self._inqueue = self._ctx.SimpleQueue()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/context.py", line 113, in SimpleQueue
return SimpleQueue(ctx=self.get_context())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/queues.py", line 344, in __init__
self._rlock = ctx.Lock()
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/context.py", line 68, in Lock
return Lock(ctx=self.get_context())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/synchronize.py", line 168, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/usr/local/lib/python3.11/dist-packages/multiprocess/synchronize.py", line 86, in __init__
register(self._semlock.name, "semaphore")
File "/usr/local/lib/python3.11/dist-packages/multiprocess/resource_tracker.py", line 158, in register
self._send('REGISTER', name, rtype)
File "/usr/local/lib/python3.11/dist-packages/multiprocess/resource_tracker.py", line 165, in _send
self.ensure_running()
File "/usr/local/lib/python3.11/dist-packages/multiprocess/resource_tracker.py", line 132, in ensure_running
pid = util.spawnv_passfds(exe, args, fds_to_pass)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/util.py", line 452, in spawnv_passfds
return _posixsubprocess.fork_exec(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fork_exec() takes exactly 23 arguments (21 given)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Hi! I've cloned the latest version of cramming and tried to verify the installation with the command:
python pretrain.py name=test arch=hf-bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2
Doing so results in an error related to torch.dynamo, see this pastebin. On the other hand, if I append impl.compile_torch=False, everything runs smoothly.
I believe the same error occurs with the 'replicate the final recipe' code as well: it gives a similar torch.dynamo error if one doesn't use impl.compile_torch=False.
I tested this with torch=2.0.1 and python 3.9 and 3.10. (Note that python 3.11 doesn't work since torch.compile doesn't work with python 3.11.)
I'd like to fine-tune this model for a token classification task. As suggested in #35, instantiating from AutoModelForTokenClassification should work. However, I see an error.
import cramming
from transformers import AutoTokenizer, AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("JonasGeiping/crammed-bert", num_labels=3)
>>> ---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[46], line 1
----> 1 model = AutoModelForTokenClassification.from_pretrained("JonasGeiping/crammed-bert", num_labels=3)
File ~\.conda\envs\product_scanner\lib\site-packages\transformers\models\auto\auto_factory.py:566, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
564 elif type(config) in cls._model_mapping.keys():
565 model_class = _get_model_class(config, cls._model_mapping)
--> 566 return model_class.from_pretrained(
567 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
568 )
569 raise ValueError(
570 f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
571 f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
572 )
File ~\.conda\envs\product_scanner\lib\site-packages\transformers\modeling_utils.py:3462, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3456 config = cls._autoset_attn_implementation(
3457 config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map
3458 )
3460 with ContextManagers(init_contexts):
3461 # Let's make sure we don't run the init function of buffer modules
-> 3462 model = cls(config, *model_args, **model_kwargs)
3464 # make sure we use the model's config since the __init__ call might have copied it
3465 config = model.config
File ~\.conda\envs\product_scanner\lib\site-packages\cramming\architectures\crammed_bert.py:396, in ScriptableLMForTokenClassification.__init__(self, config)
393 self.cfg = OmegaConf.create(config.arch)
395 self.encoder = ScriptableLM(config)
--> 396 self.head = torch.nn.Linear(self.cfg.classification_head.head_dim, self.num_labels)
398 self.problem_type = None
399 self._init_weights()
File ~\.conda\envs\product_scanner\lib\site-packages\torch\nn\modules\module.py:1614, in Module.__getattr__(self, name)
1612 if name in modules:
1613 return modules[name]
-> 1614 raise AttributeError("'{}' object has no attribute '{}'".format(
1615 type(self).__name__, name))
AttributeError: 'ScriptableLMForTokenClassification' object has no attribute 'num_labels'
Versions:
transformers==4.36.2
torch==2.0.1
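The traceback above ends in a constructor that reads self.num_labels before anything has assigned it. The pattern can be shown with plain Python classes; the class names below are hypothetical stand-ins, and copying the value from the config is an assumed fix that this thread does not confirm.

```python
class ConfigStandIn:
    """Hypothetical config carrying the label count, as passed via num_labels=3."""

    def __init__(self, num_labels):
        self.num_labels = num_labels


class HeadReadsFromSelf:
    def __init__(self, config):
        # Mirrors the failing pattern: self.num_labels was never assigned.
        self.head_out = self.num_labels


class HeadReadsFromConfig:
    def __init__(self, config):
        # Assumed fix: copy the value from the config before using it.
        self.num_labels = config.num_labels
        self.head_out = self.num_labels


config = ConfigStandIn(num_labels=3)

try:
    HeadReadsFromSelf(config)
    failed = False
except AttributeError:
    failed = True

fixed = HeadReadsFromConfig(config)
print(failed, fixed.head_out)
```

The first class fails in the same way as the reported constructor, while the second one sizes its output from the config value.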
Hi,
I am running cramming BERT training on a single A100 GPU (80GB) through Kubeflow Pipelines with the settings below:
return dsl.ContainerOp(
name='Download data and Tokenize',
image='tiruai/cramming-bert-training:v0.1',
command="python",
arguments=["/app/pretrain.py",
"name=bookcorpus_wiki",
"data=bookcorpus-wikipedia",
"dryrun=True",
"impl.forbid_dataset_preprocessing=False",
"data.max_seq_in_tokenized_dataset=85e6"
],
# file_outputs={
# "tokenized_data": "/mnt/output",
# },
pvolumes={"/mnt": vol_existing}
).set_image_pull_policy('Always').set_gpu_limit(1).set_cpu_limit("100").set_memory_limit("100Gi")
It throws the error below, and I am not sure what the issue could be:
Downloading: 100%|██████████| 20.3G/20.3G [06:53<00:00, 49.0MB/s]
Running tokenizer on every text in dataset (num_proc=100): 0%| | 0/11083870 [00:00<?, ? examples/s]Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/multiprocess/forkserver.py", line 280, in main
code = _serve_one(child_r, fds,
File "/usr/local/lib/python3.8/dist-packages/multiprocess/forkserver.py", line 319, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/usr/local/lib/python3.8/dist-packages/multiprocess/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 272, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 419, in load
obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 574, in _create_function
func = FunctionType(fcode, fglobals or dict(), fname, fdefaults, fclosure)
TypeError: function() argument 'globals' must be dict, not builtin_function_or_method
Error executing job with overrides: ['name=bookcorpus_wiki', 'data=bookcorpus-wikipedia', 'dryrun=True', 'impl.forbid_dataset_preprocessing=False', 'data.max_seq_in_tokenized_dataset=85e6']
Traceback (most recent call last):
File "/app/cramming/data/pretraining_preparation.py", line 45, in load_pretraining_corpus
tokenized_dataset = datasets.load_from_disk(data_path)
File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1886, in load_from_disk
raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory /mnt/data/bookcorpus-wikitext_WordPiecex32768_e956802d0d91e79bb272ce39a4b92970 not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/pretrain.py", line 153, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/app/cramming/utils.py", line 64, in main_launcher
main_fn(cfg, setup)
File "/app/pretrain.py", line 21, in main_training_process
dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
File "/app/cramming/data/pretraining_preparation.py", line 63, in load_pretraining_corpus
preprocessed_dataset, new_tokenizer = preprocess_dataset(
File "/app/cramming/data/pretraining_preparation.py", line 175, in preprocess_dataset
tokenized_dataset = _huggingface_preprocessing(raw_data, tokenizer, cfg_data, num_threads=num_threads)
File "/app/cramming/data/pretraining_preparation.py", line 239, in _huggingface_preprocessing
tokenized_dataset = raw_dataset.map(
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 578, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 543, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3166, in map
for rank, done, content in iflatmap_unordered(
File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 1365, in iflatmap_unordered
with manager_cls() as manager:
File "/usr/local/lib/python3.8/dist-packages/multiprocess/context.py", line 57, in Manager
m.start()
File "/usr/local/lib/python3.8/dist-packages/multiprocess/managers.py", line 583, in start
self._address = reader.recv()
File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 253, in recv
buf = self._recv_bytes()
File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 417, in _recv_bytes
buf = self._recv(4)
File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 386, in _recv
raise EOFError
EOFError
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error: exit status 1