jonasgeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.

License: MIT License

Python 73.51% Shell 26.49%

english-language language-model machine-learning

cramming's Issues

Reproduce the result when freezing parameters

Hi, thank you for this wonderful work.

I ran into some trouble reproducing the head-only results. I can reproduce your results for end-to-end tuning, but when I freeze the BERT (encoder) parameters and tune only the classification head, the result is not as good as with your checkpoint.

The SST-2 accuracy of your checkpoint at https://huggingface.co/JonasGeiping/crammed-bert is 0.922 (end-to-end) and 0.918 (head only) in my reproduction. The bert-base-uncased (from HuggingFace) accuracy is 0.931 (end-to-end) and 0.930 (head only).

I downloaded c4-subset-processed from your Dropbox link and replicated your work by running:

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

The end-to-end accuracy on SST-2 is 0.922, but the head-only accuracy is only 0.784. I'm wondering why this happens.

I freeze the encoder parameters by:

for param in model.encoder.parameters():
    param.requires_grad = False

I would also like to know how the checkpoint at https://huggingface.co/JonasGeiping/crammed-bert was trained. Was it trained by running the above command?

Thanks again for your time!
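
When comparing head-only runs, it can help to confirm which parameter groups actually remain trainable after the freeze. The following is a minimal sketch, not part of the repository: `summarize_trainable` and `FakeParam` are hypothetical names, and the parameter names in the toy list are illustrative. The helper only assumes objects with `requires_grad` and `numel()`, so with PyTorch it could presumably be called on `model.named_parameters()`.

```python
from collections import defaultdict


def summarize_trainable(named_parameters):
    """Count trainable and frozen parameters per top-level module name."""
    summary = defaultdict(lambda: {"trainable": 0, "frozen": 0})
    for name, param in named_parameters:
        top_level = name.split(".", 1)[0]
        key = "trainable" if param.requires_grad else "frozen"
        summary[top_level][key] += param.numel()
    return dict(summary)


class FakeParam:
    """Stand-in for a framework parameter so the sketch runs without PyTorch."""

    def __init__(self, size, requires_grad=True):
        self._size = size
        self.requires_grad = requires_grad

    def numel(self):
        return self._size


# Toy parameter list: a frozen encoder plus a trainable pooler and head.
params = [
    ("encoder.layers.0.weight", FakeParam(100, requires_grad=False)),
    ("pooler.dense.weight", FakeParam(20)),
    ("head.weight", FakeParam(6)),
]
report = summarize_trainable(params)
for module, counts in report.items():
    print(module, counts)
```

If two setups differ in which groups outside the encoder are trainable, the head-only numbers are not directly comparable.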

TypeError: _load_optimizer() missing 1 required positional argument: 'initial_time'

While evaluating UltraFastBERT (a downstream project that uses this repository; see the training folder of https://github.com/pbelcak/UltraFastBERT, where most of the code is identical), I encountered the following error when running python eval.py eval=GLUE name=UltraFastBERT-1x11-long eval.checkpoint=hf://pbelcak/UltraFastBERT-1x11-long impl.microbatch_size=4:

 loaded with 164,460,531 parameters.
Some weights of ScriptableLMForSequenceClassification were not initialized from the model checkpoint at pbelcak/UltraFastBERT-1x11-long and are newly initialized: ['pooler.dense.weight', 'head.weight', 'head.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Error executing job with overrides: ['eval=GLUE', 'name=UltraFastBERT-1x11-long', 'eval.checkpoint=hf://pbelcak/UltraFastBERT-1x11-long', 'impl.microbatch_size=4']
Traceback (most recent call last):
  File "/root/autodl-tmp/UltraFastBERT/training/eval.py", line 147, in launch
    cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
  File "/root/autodl-tmp/UltraFastBERT/training/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/root/autodl-tmp/UltraFastBERT/training/eval.py", line 37, in main_downstream_process
    model_engine.load_checkpoint(cfg_arch, model_file)
  File "/root/autodl-tmp/UltraFastBERT/training/cramming/backend/torch_default.py", line 237, in load_checkpoint
    self.optimizer, self.scheduler = _load_optimizer(self.model, self.cfg_train, self.cfg_impl)
TypeError: _load_optimizer() missing 1 required positional argument: 'initial_time'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

And indeed, line 237 of the file calls _load_optimizer with just 3 arguments instead of 4:

cramming/cramming/backend/torch_default.py

Line 237 in f6ba4cb

 self.optimizer, self.scheduler = _load_optimizer(self.model, self.cfg_train, self.cfg_impl) 

Maybe add self.initial_time as the fourth argument?
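
The mismatch can be reproduced in isolation. This is a sketch under assumptions: `_load_optimizer` here is a hypothetical stub that only mirrors the four-argument signature implied by the error message, and `Engine` is a toy class, not the repository's backend. Whether the real engine exposes `self.initial_time` at that point is an assumption taken from the suggestion above.

```python
import time


def _load_optimizer(model, cfg_train, cfg_impl, initial_time):
    """Hypothetical stub with the signature implied by the TypeError."""
    return ("optimizer", "scheduler")


class Engine:
    """Toy engine that only demonstrates the call-site mismatch."""

    def __init__(self):
        self.model, self.cfg_train, self.cfg_impl = object(), {}, {}
        self.initial_time = time.time()

    def load_checkpoint_broken(self):
        # Three arguments, as on the reported line 237.
        return _load_optimizer(self.model, self.cfg_train, self.cfg_impl)

    def load_checkpoint_fixed(self):
        # Passes the start time as the fourth argument.
        return _load_optimizer(self.model, self.cfg_train, self.cfg_impl, self.initial_time)


engine = Engine()
try:
    engine.load_checkpoint_broken()
    broken_error = None
except TypeError as exc:
    broken_error = str(exc)

fixed_result = engine.load_checkpoint_fixed()
print(broken_error)
```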

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` while running evaluation

I am trying to train cramming BERT on the bookcorpus dataset and evaluate it on GLUE, but during evaluation I got a CUDA error. I am not sure what went wrong.

Here is the training step:

 return dsl.ContainerOp(
        name='Train Model',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=[
            "/app/pretrain.py",
            "name=bookcorpus_wiki_training",
            "data=bookcorpus-wikipedia",
            "arch=bert-c5",
            "train=bert-o3",
            "train.batch_size=4096"

        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy(
        'Always').set_gpu_limit(1).set_image_pull_policy('Always').set_cpu_limit("100").set_memory_limit("100Gi")

Evaluation step code:

def eval_op():
    return dsl.ContainerOp(
        name='Evaluation GLUE Model',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=[
            "/app/eval.py",
            "name=bookcorpus_wiki_training",
            "eval.checkpoint=latest",
            "impl.microbatch_size=16",
            "impl.shuffle_in_dataloader=True",

        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy(
        'Always').set_gpu_limit(1).set_image_pull_policy('Always').set_cpu_limit("100").set_memory_limit("100Gi")

Error message:

[2023-05-24 08:31:36,608] [CLS] it is born of another of those fated yet fortuitous connections in didion's disorienting world, this one between two people ( elena mcmahon and treat morrison ) who were equally remote. [SEP] ellena mcmahon and treat morrison have a lucky connection despite both being remote. [SEP]
[2023-05-24 08:31:36,608] ... is tokenized into ...
[2023-05-24 08:31:36,609] [CLS]_it_is_born_of_another_of_those_fated_yet_fort_##uit_##ous_connections_in_did_##ion_'_s_di_##sor_##ient_##ing_world_,_this_one_between_two_people_(_elena_mcmahon_and_treat_morrison_)_who_were_equally_remote_._[SEP]_ellen_##a_mcmahon_and_treat_morrison_have_a_lucky_connection_despite_both_being_remote_._[SEP]
[2023-05-24 08:31:36,610] Correct Answer: entailment
[2023-05-24 08:31:36,610] Random sentence from validset of size 9,815: ...
[2023-05-24 08:31:36,611] [CLS] in the small marina you can eat while surrounded by expensive boats. [SEP] in the marina is where you can eat while being around expensive boats. [SEP]
[2023-05-24 08:31:36,611] Correct Answer: entailment
[2023-05-24 08:31:36,618] Finetuning task mnli with 3 classes for 245430 steps.
[2023-05-24 08:31:40,062] Model with architecture ScriptableMaskedLM loaded with 118,654,467 parameters.
[2023-05-24 08:31:41,135] State dict difference is  ScriptableLMForSequenceClassification:
	Missing key(s) in state_dict: "pooler.dense.weight", "pooler.dense.bias", "head.weight", "head.bias". 
	Unexpected key(s) in state_dict: "prediction_head.weight", "decoder.weight". ... Ok?
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
Error executing job with overrides: ['name=bookcorpus_wiki_training', 'eval.checkpoint=latest', 'impl.microbatch_size=16', 'impl.shuffle_in_dataloader=True']
Traceback (most recent call last):
  File "/app/eval.py", line 114, in launch
    cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
  File "/app/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/app/eval.py", line 48, in main_downstream_process
    loss = model_engine.step(device_batch)
  File "/app/cramming/backend/torch_default.py", line 112, in step
    self.backward(loss)
  File "/app/cramming/backend/torch_default.py", line 132, in backward
    return self.scaler.scale(loss / self.accumulation_steps_expected).backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 450, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ff18cdf470c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7ff18cdb7620 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(char const*, char const*, int, bool) + 0x33e (0x7ff18ce7e68e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xe86e5c (0x7ff18dd25e5c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x507c0a (0x7ff1cd415c0a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3b861 (0x7ff18cdd6861 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x186 (0x7ff18cdd00b6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x7ff18cdd01dd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x786958 (0x7ff1cd694958 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7ff1cd694ce5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5ce863]
frame #11: /usr/bin/python() [0x5d176c]
frame #12: /usr/bin/python() [0x5d1908]
frame #13: /usr/bin/python() [0x5a978d]
frame #14: /usr/bin/python() [0x5eb5b1]
frame #15: /usr/bin/python() [0x4effff]
frame #16: /usr/bin/python() [0x5fccc7]
frame #17: PyGC_Collect + 0x4c (0x6739ac in /usr/bin/python)
frame #18: Py_FinalizeEx + 0x7a (0x680b4a in /usr/bin/python)
frame #19: Py_Exit + 0xc (0x67f76c in /usr/bin/python)
frame #20: /usr/bin/python() [0x67f79b]
frame #21: PyErr_PrintEx + 0x16 (0x67f9c6 in /usr/bin/python)
frame #22: PyRun_SimpleFileExFlags + 0x1c5 (0x67fc25 in /usr/bin/python)
frame #23: Py_RunMain + 0x212 (0x6b8082 in /usr/bin/python)
frame #24: Py_BytesMain + 0x2d (0x6b840d in /usr/bin/python)
frame #25: __libc_start_main + 0xf3 (0x7ff220a23083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #26: _start + 0x2e (0x5faa2e in /usr/bin/python)
Error: signal: aborted (core dumped)
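
The `Assertion t >= 0 && t < n_classes failed` lines in the log come from the NLL loss kernel and mean that at least one target label lies outside the range [0, n_classes). The later cuBLAS message may be a follow-on effect, since the log itself warns that CUDA errors can be reported asynchronously. A minimal sketch for checking labels before training follows; `find_out_of_range_labels` is a hypothetical helper, and the toy labels are made up. One possibility worth ruling out is a split whose unlabeled examples are encoded as -1.

```python
def find_out_of_range_labels(labels, n_classes):
    """Return (index, label) pairs whose label is outside [0, n_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not (0 <= y < n_classes)]


# Toy labels for a 3-class task; -1 and 3 are invalid targets.
example_labels = [0, 2, -1, 1, 3]
bad = find_out_of_range_labels(example_labels, n_classes=3)
print(bad)  # [(2, -1), (4, 3)]
```

Running the job with CUDA_LAUNCH_BLOCKING=1, as the log suggests, should make the stack trace point at the failing loss call.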

Tutorial for pretrain RoBERTa with custom data

This may seem like a lot to ask, but I'm a bit confused and don't know how to preprocess the data and train a RoBERTa model. Could you write a basic step-by-step tutorial?
I'm also looking to implement a custom tokenizer for training. Do you have any suggestions?
Thanks a lot.

How to load local data

Hi, Jonas. I would like to ask how to load local data. Specifically, I first downloaded the data here, and then I hoped to run the following experiments:

python pretrain.py \
     name=amp_b8192_cb_o4_final arch=crammed-bert \
     train=bert-o4 data=pile-readymade

However, the downloaded data cannot be loaded. I also tried modifying the YAML config, but every attempt failed.

can't import cramming

Hey,
I'm trying to install cramming and run your code, but I get the following error:

File /opt/conda/lib/python3.10/site-packages/datasets/distributed.py:3
1 from typing import TypeVar
----> 3 from .arrow_dataset import Dataset, _split_by_node_map_style_dataset
4 from .iterable_dataset import IterableDataset, _split_by_node_iterable_dataset
7 DatasetType = TypeVar("DatasetType", Dataset, IterableDataset)

ImportError: cannot import name '_split_by_node_map_style_dataset' from 'datasets.arrow_dataset' (/opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py)

Could you publish the pip freeze output of your environment and the Python version you are using? I suspect an incompatibility is the reason.
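
For a quick comparison that is shorter than a full pip freeze, the versions of the relevant packages can be collected with the standard library. This is a minimal sketch; `collect_versions` is a hypothetical helper, and the package list is an assumption about which dependencies matter here.

```python
import sys
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = collect_versions(["datasets", "torch", "transformers"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```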

Verification command fails on macOS

The verification command fails on macOS Ventura on a MacBook Pro M1 Pro:

python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2

The error:

Error executing job with overrides: ['name=test', 'arch=bert-base', 'train=bert-base', 'data=sanity-check-2', 'dryrun=True', 'impl.microbatch_size=2']
Traceback (most recent call last):
  File "/Users/louislac/Documents/Developer/Python/cramming/pretrain.py", line 153, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/Users/louislac/Documents/Developer/Python/cramming/cramming/utils.py", line 57, in main_launcher
    setup = system_startup(cfg)
  File "/Users/louislac/Documents/Developer/Python/cramming/cramming/utils.py", line 81, in system_startup
    torch.multiprocessing.set_sharing_strategy(cfg.impl.sharing_strategy)
  File "/Users/louislac/Documents/Developer/Python/cramming/.env/lib/python3.10/site-packages/torch/multiprocessing/__init__.py", line 58, in set_sharing_strategy
    assert new_strategy in _all_sharing_strategies
AssertionError

Upon investigation, it looks like impl.sharing_strategy is "file_descriptor" (the default value), but _all_sharing_strategies only includes "file_system" on macOS and Windows. Changing the value to file_system solves the issue, though I do not know the implications:

python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2 impl.sharing_strategy=file_system
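
A platform-independent way to handle this would be to fall back to a supported strategy instead of asserting. The following is a sketch, not the repository's code: `pick_sharing_strategy` is a hypothetical helper that takes the set of available strategies as an argument so that it runs without PyTorch.

```python
def pick_sharing_strategy(preferred, available, fallback="file_system"):
    """Return the preferred strategy if supported, else the fallback."""
    if preferred in available:
        return preferred
    if fallback in available:
        return fallback
    raise ValueError(f"No usable sharing strategy among {sorted(available)}")


# On macOS and Windows only "file_system" is reported as available.
print(pick_sharing_strategy("file_descriptor", {"file_system"}))

# With PyTorch installed, the available set could be queried like this:
#   import torch.multiprocessing as mp
#   strategy = pick_sharing_strategy("file_descriptor", mp.get_all_sharing_strategies())
#   mp.set_sharing_strategy(strategy)
```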

Errors with both the verify installation command as well as the final recipe

After cloning and installing, this command:

python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2

produces "In 'cfg_pretrain': Could not find 'arch/bert-base'". If I replace the arch argument with train/hf-bert-tiny, I get:

"FileNotFoundError: Directory /root/cramming/outputs/data/sanity-check-2_BPEx32768_aa4b98dc480e637aa82f59461e1b1729 not found"

If I try the final recipe: python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade

I get "RuntimeError: Unexpected optimization option max_autotune_gemm"

Suggestion : support Maximal Update Parameterization

I have been playing with this on my local hardware, which is somewhat smaller even than your paper's reference machines (the GPU is a GTX 1080 with 8 GB). One thing that has become apparent is that it is difficult to investigate scaling of the model size (number of heads, depth, etc.), because substantially different hyperparameters are required for effective model calibration as the size is varied. The paper by Yang et al., "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (https://arxiv.org/abs/2203.03466), addresses exactly this issue and proposes modifications to how hyperparameters and initializations are specified, so that good hyperparameter choices become much more invariant across model size. I suggest that incorporating their parameterization would be a very useful change. One thing it would allow is more rapid initial exploration with very small crammed models, followed by much easier scaling up to test things in the larger model context.

Preprocessed files on S3/Google Drive

Hey there, and thank you for this wonderful work!

I'm trying to grab the preprocessed dataset files from Dropbox, but it is sort of a pain to download them remotely because Dropbox puts restrictions on the links :\

Would it be possible for you to mirror it on Google Drive (so gdown would work) or on S3 (via Requester Pays)?

Issue with torch.compile / dynamo

Hi! I've cloned the latest version of cramming and tried to verify the installation with the command:
python pretrain.py name=test arch=hf-bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2

Doing so results in an error related to torch.dynamo, see this pastebin. On the other hand, if I append impl.compile_torch=False then everything runs smoothly.

I believe the same error occurs with the 'replicate the final recipe' code as well - it gives a similar torch.dynamo error if one doesn't use impl.compile_torch=False.

I tested this with torch 2.0.1 and Python 3.9 and 3.10. (Note that Python 3.11 doesn't work, since torch.compile doesn't work with Python 3.11.)

I run the test command,got this error,how to fix it?looks like no dataset

Error executing job with overrides: ['name=test', 'arch=hf-bert-base', 'train=bert-base', 'data=sanity-check-2', 'dryrun=True', 'impl.microbatch_size=2']
Traceback (most recent call last):
File "/root/cramming/cramming/data/pretraining_preparation.py", line 47, in load_pretraining_corpus
tokenized_dataset = datasets.load_from_disk(data_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/load.py", line 1898, in load_from_disk
raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory /root/cramming/outputs/data/sanity-check-2_BPEx32768_324a8001208359684a2025ba5bd5f119 not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/cramming/pretrain.py", line 155, in launch
cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
File "/root/cramming/cramming/utils.py", line 54, in main_launcher
metrics = main_fn(cfg, setup)
^^^^^^^^^^^^^^^^^^^
File "/root/cramming/pretrain.py", line 21, in main_training_process
dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/cramming/cramming/data/pretraining_preparation.py", line 63, in load_pretraining_corpus
preprocessed_dataset, new_tokenizer = preprocess_dataset(
^^^^^^^^^^^^^^^^^^^
File "/root/cramming/cramming/data/pretraining_preparation.py", line 169, in preprocess_dataset
tokenized_dataset = _huggingface_preprocessing(raw_data, tokenizer, cfg_data, num_threads=num_threads) # Tokenize, group, sort...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/cramming/cramming/data/pretraining_preparation.py", line 238, in _huggingface_preprocessing
tokenized_dataset = raw_dataset.map(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 580, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 545, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 3170, in map
with Pool(len(kwargs_per_job)) as pool:
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/pool.py", line 191, in __init__
self._setup_queues()
File "/usr/local/lib/python3.11/dist-packages/multiprocess/pool.py", line 346, in _setup_queues
self._inqueue = self._ctx.SimpleQueue()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/context.py", line 113, in SimpleQueue
return SimpleQueue(ctx=self.get_context())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/queues.py", line 344, in __init__
self._rlock = ctx.Lock()
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/context.py", line 68, in Lock
return Lock(ctx=self.get_context())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/synchronize.py", line 168, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/usr/local/lib/python3.11/dist-packages/multiprocess/synchronize.py", line 86, in __init__
register(self._semlock.name, "semaphore")
File "/usr/local/lib/python3.11/dist-packages/multiprocess/resource_tracker.py", line 158, in register
self._send('REGISTER', name, rtype)
File "/usr/local/lib/python3.11/dist-packages/multiprocess/resource_tracker.py", line 165, in _send
self.ensure_running()
File "/usr/local/lib/python3.11/dist-packages/multiprocess/resource_tracker.py", line 132, in ensure_running
pid = util.spawnv_passfds(exe, args, fds_to_pass)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/multiprocess/util.py", line 452, in spawnv_passfds
return _posixsubprocess.fork_exec(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fork_exec() takes exactly 23 arguments (21 given)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

data preprocessing got failed during tokenization on single GPU

I am running cramming BERT training on a single A100 80GB GPU through Kubeflow Pipelines with the settings below:

 return dsl.ContainerOp(
        name='Download data and Tokenize',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=["/app/pretrain.py",
                   "name=bookcorpus_wiki",
                   "data=bookcorpus-wikipedia",
                   "dryrun=True",
                   "impl.forbid_dataset_preprocessing=False",
                   "data.max_seq_in_tokenized_dataset=85e6"
                   ],
        # file_outputs={
        #     "tokenized_data": "/mnt/output",
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy(
        'Always').set_gpu_limit(1).set_image_pull_policy('Always').set_cpu_limit("100").set_memory_limit("100Gi")

It throws the error below, and I am not sure what the issue could be:


Downloading: 100%|█████████▉| 20.2G/20.3G [06:52<00:00, 48.9MB/s]
Downloading: 100%|█████████▉| 20.2G/20.3G [06:53<00:00, 50.1MB/s]
Downloading: 100%|█████████▉| 20.2G/20.3G [06:53<00:00, 50.3MB/s]
Downloading: 100%|█████████▉| 20.3G/20.3G [06:53<00:00, 51.1MB/s]
Downloading: 100%|█████████▉| 20.3G/20.3G [06:53<00:00, 50.7MB/s]
Downloading: 100%|█████████▉| 20.3G/20.3G [06:53<00:00, 46.8MB/s]
Downloading: 100%|█████████▉| 20.3G/20.3G [06:53<00:00, 48.3MB/s]
Downloading: 100%|█████████▉| 20.3G/20.3G [06:53<00:00, 49.0MB/s]
Downloading: 100%|██████████| 20.3G/20.3G [06:53<00:00, 49.0MB/s]

Running tokenizer on every text in dataset (num_proc=100):   0%|          | 0/11083870 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/forkserver.py", line 280, in main
    code = _serve_one(child_r, fds,
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/forkserver.py", line 319, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 272, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 419, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 574, in _create_function
    func = FunctionType(fcode, fglobals or dict(), fname, fdefaults, fclosure)
TypeError: function() argument 'globals' must be dict, not builtin_function_or_method

                                                                                                              
Error executing job with overrides: ['name=bookcorpus_wiki', 'data=bookcorpus-wikipedia', 'dryrun=True', 'impl.forbid_dataset_preprocessing=False', 'data.max_seq_in_tokenized_dataset=85e6']
Traceback (most recent call last):
  File "/app/cramming/data/pretraining_preparation.py", line 45, in load_pretraining_corpus
    tokenized_dataset = datasets.load_from_disk(data_path)
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1886, in load_from_disk
    raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory /mnt/data/bookcorpus-wikitext_WordPiecex32768_e956802d0d91e79bb272ce39a4b92970 not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/pretrain.py", line 153, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/app/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/app/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/app/cramming/data/pretraining_preparation.py", line 63, in load_pretraining_corpus
    preprocessed_dataset, new_tokenizer = preprocess_dataset(
  File "/app/cramming/data/pretraining_preparation.py", line 175, in preprocess_dataset
    tokenized_dataset = _huggingface_preprocessing(raw_data, tokenizer, cfg_data, num_threads=num_threads)
  File "/app/cramming/data/pretraining_preparation.py", line 239, in _huggingface_preprocessing
    tokenized_dataset = raw_dataset.map(
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 578, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 543, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3166, in map
    for rank, done, content in iflatmap_unordered(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 1365, in iflatmap_unordered
    with manager_cls() as manager:
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/context.py", line 57, in Manager
    m.start()
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/managers.py", line 583, in start
    self._address = reader.recv()
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 253, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 417, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 386, in _recv
    raise EOFError
EOFError
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error: exit status 1

GLUE evaluation numbers are very poor when increasing the sequence length to 512 and using float32

I am doing some benchmarking as part of my experiments, and I want to train a BERT model with a sequence length of 512 and dtype float32. I pretrained the model with the above configuration and ran the evaluation on GLUE_sane, but the numbers are very poor.

May I know what went wrong?

loading checkpoints for using as a huggingface model

Hello!

I'm trying to use a model that was pretrained with cramming as a Hugging Face model (using AutoModel.from_pretrained(PATH_TO_MODEL)).
The transformers library needs a model.bin file instead of the model.pth format that the save_final_model() function currently creates.

Is there a suggested way to convert the files easily, or to use the checkpoints as a Hugging Face model?
Thanks!

From PR 43

Thanks for the fix. However, when I ran the pretraining script with the updated command, the following error was raised:

Resolving data files: 100%|███████████████████| 88/88 [00:02<00:00, 43.91it/s]
Error executing job with overrides: ['name=cram_24h', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade', 'budget=24']
Traceback (most recent call last):
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 196, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/localdisk/home/Work/Repositories/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/localdisk/home/Work/Repositories/cramming/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 40, in load_pretraining_corpus
    return _load_from_hub(cfg_data, data_path)
  File "/localdisk/home/Work/Repositories/cramming/cramming/data/pretraining_preparation.py", line 461, in _load_from_hub
    tokenized_dataset = datasets.load_dataset(cfg_data.hf_location, split="train", streaming=cfg_data.streaming, cache_dir=data_path)["train"]
  File "/home/.local/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 60, in __getitem__
    raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")
NotImplementedError: Subclasses of Dataset should implement __getitem__.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Have you encountered similar issues?

Thank you

Originally posted by @shiwenqin in #43 (comment)

Finetuning for SQuAD task

Hello,

First of all, thank you for your work on this project. I have pretrained a crammed BERT model on custom data, and I want to know whether it is possible to use it for a QA task. I tried to register it as a modified architecture of ScriptableLMForTokenClassification, but I could not. Do you have any suggestions for finetuning on a QA task, especially when using it as an HF model?

torch._dynamo error on step 2: calling compiler function 'inductor'

Hi,

I am trying to replicate the final recipe by running python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade as explained in the README file, and I am getting the following error: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'. The error message suggests setting the environment variables TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1, which I did; the resulting output is shown below. Please help me figure out how to solve this ldconfig issue. I could not find a solution on the web.

[2023-12-19 17:44:59,958] [0/0] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function inductor
Error executing job with overrides: ['name=amp_b8192_cb_o4_final', 'arch=crammed-bert', 'train=bert-o4', 'data=pile-readymade']
Traceback (most recent call last):
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/pretrain.py", line 199, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/utils.py", line 54, in main_launcher
    metrics = main_fn(cfg, setup)
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/pretrain.py", line 55, in main_training_process
    loss = model_engine.step(device_batch)
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/backend/torch_default.py", line 124, in step
    loss = self.forward(**batch)["loss"]
  File "/nfs/scistore19/alistgrp/imodoran/workplace/M-FAC_extensions/cramming/cramming/backend/torch_default.py", line 140, in forward
    return self.model(*inputs, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2069, in run
    super().run()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 719, in run
    and self.step()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 683, in step
    getattr(self, inst.opname)(inst)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/__init__.py", line 1568, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 961, in compile_fx
    return compile_fx(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
    return aot_autograd(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 2917, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
    return inner_compile(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/debug.py", line 228, in inner
    return fn(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
    return old_func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
    compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
    return self.compile_to_module().call
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/graph.py", line 941, in compile_to_module
    mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_imodoran/k6/ck6fiae7msa7cgviyukidcm4bynb5bjdai7xz5hbv7tswlzqpxba.py", line 1127, in <module>
    async_compile.wait(globals())
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1418, in wait
    scope[key] = result.result()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1277, in result
    self.future.result()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/nfs/scistore19/alistgrp/imodoran/miniconda3/envs/env_cramming/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'
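
The final FileNotFoundError says that some step of the compile pipeline tried to run `ldconfig` and could not find it on `PATH`. On many Linux systems the binary lives in `/sbin` or `/usr/sbin`, which are not always on the `PATH` of a cluster job or non-login shell. Assuming that is the cause here, a minimal sketch for checking this before launching pretrain.py could look as follows. The helper names `find_ldconfig` and `ensure_on_path` are illustrative and are not part of cramming or PyTorch.

```python
import os
import shutil


def find_ldconfig(extra_dirs=("/sbin", "/usr/sbin")):
    """Return a path to ldconfig, searching PATH first and then extra_dirs.

    Returns None when no executable named ldconfig is found.
    """
    found = shutil.which("ldconfig")
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, "ldconfig")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def ensure_on_path(directory, env=os.environ):
    """Append directory to env['PATH'] unless it is already listed there."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
        env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


if __name__ == "__main__":
    location = find_ldconfig()
    if location is None:
        print("ldconfig was not found; it may need to be installed.")
    else:
        # Child processes started from this interpreter inherit the updated PATH.
        ensure_on_path(os.path.dirname(location))
        print(f"ldconfig found at {location}")
```

If the binary exists but is only missing from `PATH`, exporting the directory in the job script before calling python has the same effect.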

Training Step Count

I am asking this for benchmarking purposes. In the config files, it is stated that training lasts 600_000 micro-batch steps and is terminated after 1 day if it has not reached that count. How many training steps are actually taken on an RTX-A4000 in a day?

Flash Attention

Hi, great project!

Are there any plans to implement or support Flash Attention 1, 2, or 3, or SDPA?

Cheers.

preprocessed c4 dataset?

Hi, I am also trying to replicate the preprocessed c4 dataset.
The default config has deduplicate_entries: True, but the "dedup tool" cannot be found: cramming/dedup/release/dedup_dataset: not found.

I am wondering where to get the dedup tool and, if possible, whether the preprocessed c4 dataset can be downloaded somewhere.

try it on Mac M1 but failed

After running pip install -e . I tried:

python pretrain.py name=test arch=hf-bert-base train=bert-base  dryrun=True

The console shows the following error:

zsh: illegal hardware instruction  python pretrain.py name=test arch=hf-bert-base train=bert-base  dryrun=True

Any ideas? Thanks.
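
One common cause of "illegal hardware instruction" on Apple Silicon is an x86_64 Python build running under Rosetta and loading binaries that use CPU instructions the translation layer does not support. Whether that applies here is an assumption. A minimal sketch for checking which architecture the interpreter reports (the function names `describe_interpreter` and `looks_like_rosetta` are illustrative):

```python
import platform
import sys


def describe_interpreter():
    """Collect basic facts about the running Python build."""
    return {
        "machine": platform.machine(),  # 'arm64' for a native Apple Silicon build
        "system": platform.system(),    # 'Darwin' on macOS
        "python": sys.version.split()[0],
    }


def looks_like_rosetta(info):
    """True when macOS reports an x86_64 interpreter.

    On an M1 machine this means Python is running under Rosetta translation.
    """
    return info["system"] == "Darwin" and info["machine"] == "x86_64"


if __name__ == "__main__":
    info = describe_interpreter()
    print(info)
    if looks_like_rosetta(info):
        print("x86_64 interpreter on macOS: consider installing a native arm64 Python.")
```

If the check reports `arm64`, the cause is likely something else and this sketch does not help further.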

Cola dataset evaluation

Hello, when I use the evaluation script I get this for the cola dataset but all looks well for the other datasets.
Also, when I look at the logs, I can see the Matthews correlation.

Storage space requirement

Hello,

How much storage space should I reserve to run the following recipe?

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c4 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

Unable to replicate the results using the default command

Hi,

Thank you for this amazing repository. I am trying to replicate your model by running the default command in the README:

python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4  data=pile-readymade

and

python eval.py eval=GLUE_sane name=amp_b8192_cb_o4_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True impl.compile_torch=False

The only change I made to the above commands is adding 'budget=24' to the training command.

I trained the model for 24 hours on one A100 40G GPU, but the average GLUE score is only 0.73. Based on your paper, I assume it should be somewhere between 0.792 (A4000) and 0.804 (A6000).
The repository was installed in a fresh conda environment. I made only three changes to the code: the changes mentioned in #38 and #44, and the wandb configs.

Below is the attached wandb log for the pretraining loss. The loss ends at 2.973, and the curve does not look right.

Could you guide me on what might be the problem? I am happy to provide any further information you need.

Thanks so much for the help!

TypeError: _new_shared() got an unexpected keyword argument 'device'

Error executing job with overrides: []
Traceback (most recent call last):
  File "/tmp/pycharm_project_41/cramming-main/pretrain.py", line 153, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/tmp/pycharm_project_41/cramming-main/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/tmp/pycharm_project_41/cramming-main/pretrain.py", line 45, in main_training_process
    for step, batch in iterable_data:
  File "/tmp/pycharm_project_41/cramming-main/cramming/backend/utils.py", line 263, in __next__
    batch = next(self.dataset_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py", line 42, in __call__
    return self.torch_call(features)
  File "/tmp/pycharm_project_41/cramming-main/cramming/backend/utils.py", line 221, in torch_call
    storage = elem._storage()._new_shared(len(examples) * 8 * elem.shape[0], device=elem.device)  # 8 for byte->long
TypeError: _new_shared() got an unexpected keyword argument 'device'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1

Uploading trained model to HF/saving in HF format locally

Great work and lovely repo. However, I am failing to push to HF using the provided load_local_model.py script.

I have a private dataset and ran the pretraining script successfully via:

python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4  data={my_dataset}

It trained fine and saved fine.

But when I then try pushing to the hub, for instance with:

python load_local_model.py name=amp_b8192_cb_o4_mimic_final wandb=none impl.push_to_huggingface_hub=True arch=crammed-bert train=bert-o4 dryrun=False +eval=GLUE_sane

I get a whole lot of missing keys when trying to load the state dicts:

RuntimeError: Error(s) in loading state_dict for OptimizedModule:
Missing key(s) in state_dict: "_orig_mod.encoder.embedding.word_embedding.weight", "_orig_mod.encoder.embedding.pos_embedding.scale_factor", "_orig_mod.encoder.embedding.norm.weight", "_orig_mod.encoder.embedding.norm.bias", "_orig_mod.encoder.layers.0.norm1.weight",....

and so on.

Is there anything obvious I am missing when trying to re-load the model?
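
The `_orig_mod.` prefix in the missing keys is what `torch.compile` adds when it wraps a module in an OptimizedModule, so the message suggests that the saved checkpoint has unprefixed keys while the model being loaded is compiled. One thing that may be worth trying is loading with compilation disabled (`impl.compile_torch=False`); another is renaming the keys before calling `load_state_dict`. A minimal sketch of the renaming on a plain dictionary, where the helper names `strip_prefix` and `add_prefix` are illustrative and not part of cramming:

```python
PREFIX = "_orig_mod."  # prefix that torch.compile adds to parameter names


def strip_prefix(state_dict, prefix=PREFIX):
    """Return a copy whose keys no longer start with prefix."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def add_prefix(state_dict, prefix=PREFIX):
    """Return a copy in which every key starts with prefix exactly once."""
    return {
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in state_dict.items()
    }


if __name__ == "__main__":
    # Placeholder values stand in for tensors loaded from the checkpoint.
    saved = {"encoder.embedding.norm.weight": 1.0, "encoder.embedding.norm.bias": 0.0}
    print(sorted(add_prefix(saved)))
```

Which direction is needed depends on whether the checkpoint or the model in memory carries the prefix; the error above points to the model.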

Another question: is there a straightforward way to convert the current model files into a format compatible with the HF transformers library, locally rather than via the hub?

Any help would be much appreciated. Package info below. Python 3.10.


Package                  Version
------------------------ ------------
aiohttp                  3.8.5
aiosignal                1.3.1
antlr4-python3-runtime   4.9.3
asttokens                2.4.0
async-timeout            4.0.3
attrs                    23.1.0
backcall                 0.2.0
certifi                  2023.7.22
charset-normalizer       3.2.0
cmake                    3.27.4.1
comm                     0.1.4
cramming                 0.1.0
datasets                 2.14.5
debugpy                  1.8.0
decorator                5.1.1
dill                     0.3.7
einops                   0.6.1
evaluate                 0.4.0
exceptiongroup           1.1.3
executing                1.2.0
filelock                 3.12.4
frozenlist               1.4.0
fsspec                   2023.6.0
huggingface-hub          0.16.4
hydra-core               1.3.2
idna                     3.4
ipykernel                6.25.2
ipython                  8.15.0
jedi                     0.19.0
Jinja2                   3.1.2
joblib                   1.3.2
jupyter_client           8.3.1
jupyter_core             5.3.1
lit                      16.0.6
MarkupSafe               2.1.3
matplotlib-inline        0.1.6
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.15
nest-asyncio             1.5.7
networkx                 3.1
numpy                    1.25.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
omegaconf                2.3.0
packaging                23.1
pandas                   2.1.0
parso                    0.8.3
pexpect                  4.8.0
pickleshare              0.7.5
pip                      22.3.1
platformdirs             3.10.0
prompt-toolkit           3.0.39
psutil                   5.9.5
ptyprocess               0.7.0
pure-eval                0.2.2
pyarrow                  13.0.0
Pygments                 2.16.1
pynvml                   11.5.0
python-dateutil          2.8.2
pytz                     2023.3.post1
PyYAML                   6.0.1
pyzmq                    25.1.1
regex                    2023.8.8
requests                 2.31.0
responses                0.18.0
safetensors              0.3.3
scikit-learn             1.3.0
scipy                    1.11.2
setuptools               65.5.0
six                      1.16.0
stack-data               0.6.2
sympy                    1.12
threadpoolctl            3.2.0
tokenizers               0.13.3
torch                    2.0.1
tornado                  6.3.3
tqdm                     4.66.1
traitlets                5.10.0
transformers             4.33.2
triton                   2.0.0
typing_extensions        4.7.1
tzdata                   2023.3
urllib3                  2.0.4
wcwidth                  0.2.6
wheel                    0.41.2
xxhash                   3.3.0
yarl                     1.9.2
zstandard                0.21.0

Configs for GPT?

Thanks for your great work! I want to compare BERT with GPT at the same model size, so I wonder if there are any configs for training a GPT-like model. Is it enough to just remove the mask token from the input and change the attention mask and prediction target accordingly?
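
To make the question concrete, the two changes I have in mind can be sketched in plain Python, independent of this repository. The helper names causal_mask and next_token_targets are hypothetical, and -100 is assumed as the ignore index (the PyTorch cross-entropy default); none of this is an existing cramming config or code path.

```python
IGNORE_INDEX = -100  # assumed ignore index, matching PyTorch's CrossEntropyLoss default


def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def next_token_targets(input_ids):
    """Shift the inputs left by one; the last position has nothing to predict."""
    return input_ids[1:] + [IGNORE_INDEX]


tokens = [5, 17, 42, 8]
mask = causal_mask(len(tokens))
targets = next_token_targets(tokens)
```

Unlike the MLM setup, every position except the last contributes a target here, so no mask token is needed in the input.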

Evaluation failed on MNLI and STSB Datasets for Last1.13release

I followed the instructions to replicate the Last1.13release using the corresponding version's README.md, i.e.

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=bookcorpus-wikipedia

python eval.py eval=GLUE_sane name=amp_b4096_c5_o3_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True

Pretraining worked fine, except that the loss exploded with the default lr_scheduler budget-triangle2 in bert-o3.yaml, so I switched to budget-one-cycle, which according to the scheduler comparison in the paper shows similar pretraining loss decay.
Pretraining finally reached a loss of 1.8282 on an RTX 2080 Ti in a single day, equivalent to the result reported in the paper. Evaluation, however, failed on the downstream tasks that do not have 2 classes, such as MNLI (3 classes) and STSB (a single output).
For MNLI, the errors look like
RuntimeError: CUDA error: device-side assert triggered
or
IndexError: Target 2 is out of bounds (seen after moving the model to the CPU to get a more informative error).
For STSB, the loss evaluation fails with errors like
Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 2]))

I checked the code carefully and traced the problem to one line in the class ScriptableLMForSequenceClassification(PreTrainedModel):

config.arch['num_labels'] = config.num_labels

(cramming/cramming/architectures/scriptable_bert.py, line 229 in 4a5e300)

The config used there is initialized in the downstream task constructor (https://github.com/JonasGeiping/cramming/blob/4a5e3008a5ec05ed68f9d096e4875f8dddadcf81/cramming/architectures/scriptable_bert.py#L24C1-L35C17):

def construct_scriptable_bert(cfg_arch, vocab_size, downstream_classes=None):
   """See the config file for details on what is possible."""
   cfg_arch.embedding.vocab_size = vocab_size
   cfg_arch.num_labels = downstream_classes

   config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))
   if downstream_classes is None:
       model = ScriptableLMForPreTraining(config)
   else:
       model = ScriptableLMForSequenceClassification(config)

   return model

class crammedBertConfig(PretrainedConfig):
   model_type = "crammedBERT"

   def __init__(self, cfg_arch_container: dict = {}, **kwargs):
       self.arch = cfg_arch_container
       super().__init__(**kwargs)

Everything up to this point works as intended: the arguments passed to ScriptableLMForSequenceClassification are stored in the arch attribute of crammedBertConfig, which inherits from the transformers base class PretrainedConfig.

class ScriptableLMForSequenceClassification(PreTrainedModel):
    """Classification head and pooler."""

    config_class = crammedBertConfig

    def __init__(self, config):
        super().__init__(config)
        config.arch['num_labels'] = config.num_labels
        self.cfg = OmegaConf.create(config.arch)  # this could be nicer ...
        self.encoder = ScriptableLM(config)

        self.pooler = PoolingComponent(self.cfg.classification_head, self.cfg.hidden_size)
        self.head = torch.nn.Linear(self.cfg.classification_head.head_dim, self.cfg.num_labels)

However, the line config.arch['num_labels'] = config.num_labels overwrites the number of classes with 2, because PretrainedConfig sets its num_labels attribute to 2 by default.

I commented out this line and evaluation seems to work fine.
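
The effect can be reproduced in isolation with a minimal sketch. The classes below are hypothetical stand-ins for PretrainedConfig and crammedBertConfig, not the real ones; the only behavior carried over is the default num_labels of 2.

```python
class BaseConfigStandIn:
    """Hypothetical stand-in for PretrainedConfig: num_labels defaults to 2."""

    def __init__(self, num_labels=2, **kwargs):
        self.num_labels = num_labels


class ConfigStandIn(BaseConfigStandIn):
    """Hypothetical stand-in for crammedBertConfig."""

    def __init__(self, cfg_arch_container=None, **kwargs):
        self.arch = dict(cfg_arch_container or {})
        super().__init__(**kwargs)


def head_size(config, overwrite):
    """Return the number of output units the classification head would get."""
    arch = dict(config.arch)
    if overwrite:
        # The line in question: replaces the task-specific value with the default.
        arch["num_labels"] = config.num_labels
    return arch["num_labels"]


mnli_config = ConfigStandIn({"num_labels": 3})  # MNLI has 3 classes
buggy = head_size(mnli_config, overwrite=True)
fixed = head_size(mnli_config, overwrite=False)
```

With the overwrite in place the head ends up with 2 outputs regardless of the task, which matches both the out-of-bounds target on MNLI and the [16, 2] input size on STSB.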

Since this release is fairly old compared to the newest Torch 2.1, I don't think it is worth opening a PR, so I am leaving an issue here in case someone runs into the same problem :)

Can't evaluate

  tokenizer, cfg_arch, model_file = cramming.utils.find_pretrained_checkpoint(cfg)
File "/home/tahabinhuraib/cramming/cramming/utils.py", line 177, in find_pretrained_checkpoint
  all_checkpoints = [f for f in os.listdir(local_checkpoint_folder)]
FileNotFoundError: [Errno 2] No such file or directory: '/home/tahabinhuraib/cramming/outputs/bert-finetuning/checkpoints'

Preprocessing for final recipe

Hello!

I am wondering what the correct data preprocessing command is for the final recipe. Could you add this information to the README?

Also, is there a straightforward way to restrict memory requirements during preprocessing? It seems to use 60GB+ of RAM when reading data via gzip (using one of the preprocessing commands from scripts/preprocessing.sh).
error-log.txt
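
To illustrate the kind of bounded-memory reading I mean, here is a generic sketch that streams a gzip file line by line and groups the lines into small batches. It is not the preprocessing code of this repository, and the function names iter_gzip_lines and iter_batches are hypothetical.

```python
import gzip
import os
import tempfile


def iter_gzip_lines(path):
    """Yield decoded lines one at a time instead of reading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def iter_batches(lines, batch_size):
    """Group an iterator of lines into lists of at most batch_size items."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Small self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("a\nb\nc\nd\ne\n")
    batches = list(iter_batches(iter_gzip_lines(demo_path), batch_size=2))
```

Only one batch is held in memory at a time here, so peak usage would depend on batch_size rather than on the file size.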

Pretraining on a single RTX 3060

Hello, I've been using this repository on a cloud cluster of A100 GPUs. Unfortunately, my credits have run out, and I'm planning to buy a PC to continue running experiments. The RTX 3060 has 12 GB of VRAM, which is 1 GB more than the 2080 that was used in the paper. Do you think it would be possible to pre-train a BERT model with the RTX 3060? It would be great if you could advise me on this before I go ahead and buy the PC.
Thank you very much!

Question about sparse token prediction

Hi Jonas,

Thanks for sharing the great work! I have a small question about the paper.

Both your paper and Izsak et al. refer to RoBERTa for something called "sparse token prediction", which I couldn't find in the RoBERTa paper. From your code, it appears that "sparse token prediction" just means that the loss is calculated only from the masked positions. It seems that this should be the default setting for training an MLM (and appears to be the case in BERT's code). Turning this sparse prediction off doesn't quite make sense: why would one want to predict the unmasked tokens? Am I missing something obvious here?
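
One possible reading, sketched here as my own assumption rather than a statement about either paper: the loss is the same in both cases, and the difference is whether the output projection is computed for every position or only for the masked ones. The names below are hypothetical and -100 is assumed as the label for unmasked positions.

```python
IGNORE_INDEX = -100  # assumed label for unmasked positions


def project(hidden, weight):
    """Toy output head: one dot product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def dense_logits(hidden_states, weight):
    """Project every position; unmasked ones are only ignored later in the loss."""
    return [project(h, weight) for h in hidden_states]


def sparse_logits(hidden_states, labels, weight):
    """Select masked positions first, then project only those."""
    keep = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
    return keep, [project(hidden_states[i], weight) for i in keep]


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
labels = [IGNORE_INDEX, 7, IGNORE_INDEX, 3]  # two of four positions are masked
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vocabulary of size 3

dense = dense_logits(hidden_states, weight)
kept, sparse = sparse_logits(hidden_states, labels, weight)
```

Under that reading, the sparse variant produces the same logits at the masked positions while skipping the vocabulary-sized projection everywhere else.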

Thanks for any help!

Finetuning for token classification

I'd like to fine-tune this model for a token classification task. As suggested in #35, instantiating from AutoModelForTokenClassification should work. However, I get an error.

import cramming
from transformers import AutoTokenizer, AutoModelForTokenClassification

model  = AutoModelForTokenClassification.from_pretrained("JonasGeiping/crammed-bert", num_labels=3)

>>> ---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[46], line 1
----> 1 model  = AutoModelForTokenClassification.from_pretrained("JonasGeiping/crammed-bert", num_labels=3)

File ~\.conda\envs\product_scanner\lib\site-packages\transformers\models\auto\auto_factory.py:566, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    564 elif type(config) in cls._model_mapping.keys():
    565     model_class = _get_model_class(config, cls._model_mapping)
--> 566     return model_class.from_pretrained(
    567         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    568     )
    569 raise ValueError(
    570     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    571     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    572 )

File ~\.conda\envs\product_scanner\lib\site-packages\transformers\modeling_utils.py:3462, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3456 config = cls._autoset_attn_implementation(
   3457     config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map
   3458 )
   3460 with ContextManagers(init_contexts):
   3461     # Let's make sure we don't run the init function of buffer modules
-> 3462     model = cls(config, *model_args, **model_kwargs)
   3464 # make sure we use the model's config since the __init__ call might have copied it
   3465 config = model.config

File ~\.conda\envs\product_scanner\lib\site-packages\cramming\architectures\crammed_bert.py:396, in ScriptableLMForTokenClassification.__init__(self, config)
    393 self.cfg = OmegaConf.create(config.arch)
    395 self.encoder = ScriptableLM(config)
--> 396 self.head = torch.nn.Linear(self.cfg.classification_head.head_dim, self.num_labels)
    398 self.problem_type = None
    399 self._init_weights()

File ~\.conda\envs\product_scanner\lib\site-packages\torch\nn\modules\module.py:1614, in Module.__getattr__(self, name)
   1612     if name in modules:
   1613         return modules[name]
-> 1614 raise AttributeError("'{}' object has no attribute '{}'".format(
   1615     type(self).__name__, name))

AttributeError: 'ScriptableLMForTokenClassification' object has no attribute 'num_labels'

Versions:

transformers==4.36.2
torch==2.0.1
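
The traceback suggests that self.num_labels is read in __init__ before it has been assigned. A minimal sketch of that pattern and one possible workaround follows; the classes are hypothetical stand-ins, not the real ScriptableLMForTokenClassification, and reading the value from config.num_labels is an assumption about what the intended source is.

```python
class ConfigStandIn:
    """Hypothetical stand-in for the model config."""

    def __init__(self, num_labels):
        self.num_labels = num_labels


class HeadWithoutAttribute:
    """Mirrors the failing pattern: reads self.num_labels before it is set."""

    def __init__(self, config):
        self.head_out = self.num_labels  # raises AttributeError


class HeadWithAttribute:
    """Sets num_labels from the config before using it."""

    def __init__(self, config):
        self.num_labels = config.num_labels
        self.head_out = self.num_labels


config = ConfigStandIn(num_labels=3)
try:
    HeadWithoutAttribute(config)
    failed = False
except AttributeError:
    failed = True
fixed_head = HeadWithAttribute(config)
```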