Comments (2)
After several attempts, this seems to work. I've listed the steps below for others to reference.
First, create a new file, pile-readymade-local.yaml, with the following content:
# Draw a preprocessed dataset directly from my HF profile.
# This dataset is already tokenized, you "have" to load the correct tokenizer (which happens automatically with data.load_pretraining_corpus)
name: the_pile_WordPiecex32768
name_proc: the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020
sources:
  hub:
    provider: local
streaming: True
vocab_size: 32768 # cannot be changed!
seq_length: 128 # cannot be changed!
Then, in cramming/cramming/data/pretraining_preparation.py, change line 35 of load_pretraining_corpus to:
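try:
    # Use the explicit directory name from the config when it is set.
    processed_dataset_dir = cfg_data.name_proc
except Exception:
    # Otherwise fall back to the default "<name>_<checksum>" naming.
    processed_dataset_dir = f"{cfg_data.name}_{checksum}"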
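The same fallback can be written without a try/except. This is only a minimal sketch: `resolve_processed_dir` is a hypothetical helper name, and `SimpleNamespace` stands in for the project's config object, so I have not checked how it behaves with the real config.

```python
from types import SimpleNamespace


def resolve_processed_dir(cfg_data, checksum):
    """Return cfg_data.name_proc when it is set, else "<name>_<checksum>"."""
    name_proc = getattr(cfg_data, "name_proc", None)
    if name_proc is not None:
        return name_proc
    return f"{cfg_data.name}_{checksum}"


# SimpleNamespace is a stand-in for the real data config (an assumption).
with_override = SimpleNamespace(
    name="the_pile_WordPiecex32768",
    name_proc="the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020",
)
without_override = SimpleNamespace(name="the_pile_WordPiecex32768")

# "abc123" is a made-up checksum, used only to show the fallback naming.
print(resolve_processed_dir(with_override, "abc123"))
print(resolve_processed_dir(without_override, "abc123"))
```

With `name_proc` present the directory name comes straight from the config; without it the checksum-based name is used, which matches the original behaviour.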
Next, replace the original line 47, tokenized_dataset = datasets.load_from_disk(data_path), with:
if cfg_data is not None:
    tokenized_dataset = datasets.load_dataset(data_path)["train"].with_format("torch")
else:
    tokenized_dataset = datasets.load_from_disk(data_path)
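Note that the check above only reaches the else branch when cfg_data is None. If both loading paths should stay usable, one option is to branch on the provider value from the config instead. The sketch below is an assumption on my part, not what the patch does: `load_tokenized` is a hypothetical helper, and the loaders are passed in as arguments (with stand-ins here) so it runs without the datasets package.

```python
def load_tokenized(data_path, provider, load_dataset, load_from_disk):
    """Pick a loader based on the provider string from the data config."""
    if provider == "local":
        # A local copy of the Hub repository: load it and take the train split.
        return load_dataset(data_path)["train"]
    # Anything else: assume a directory written by Dataset.save_to_disk.
    return load_from_disk(data_path)


# Stand-in loaders so the sketch runs without the datasets package installed.
def fake_load_dataset(path):
    return {"train": f"train split of {path}"}


def fake_load_from_disk(path):
    return f"saved dataset at {path}"


# "/data/pile" is a placeholder path for illustration only.
local_result = load_tokenized("/data/pile", "local", fake_load_dataset, fake_load_from_disk)
disk_result = load_tokenized("/data/pile", "disk", fake_load_dataset, fake_load_from_disk)
print(local_result)
print(disk_result)
```

In the real function you would pass datasets.load_dataset and datasets.load_from_disk, and call .with_format("torch") on the result as in the change above.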
Finally, use the following command to train:
python pretrain.py \
name=amp_b8192_cb_o4_final arch=crammed-bert \
train=bert-o4 data=pile-readymade-local
Ok, I'm glad you got it working!
This was never a use case I had before, given that I have the originals. I'll close this issue for now, but people will be able to find it through the search.