Comments (2)

Doraemonzzz commented on June 16, 2024

After several attempts, I seem to have gotten it working. I've listed the steps below for others to reference.
First, create a new file, pile-readymade-local.yaml, with the following content:

# Draw a preprocessed dataset directly from my HF profile.
# This dataset is already tokenized, you "have" to load the correct tokenizer (which happens automatically with data.load_pretraining_corpus)
name: the_pile_WordPiecex32768
name_proc: the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020
sources:
  hub:
    provider: local
streaming: True

vocab_size: 32768 # cannot be changed!
seq_length: 128 # cannot be changed!

Then, modify line 35 of load_pretraining_corpus in cramming/cramming/data/pretraining_preparation.py to:

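    # Prefer the explicit directory name from the config, if one is given.
    try:
        processed_dataset_dir = cfg_data.name_proc
    except Exception:
        # No name_proc in the config: fall back to the original name + checksum scheme.
        processed_dataset_dir = f"{cfg_data.name}_{checksum}"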
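
The fallback above can also be written without a try/except. The following is only a minimal sketch: `resolve_processed_dir` is a hypothetical helper that is not part of cramming, and `SimpleNamespace` stands in for the real config object, which may behave differently on missing keys.

```python
from types import SimpleNamespace


def resolve_processed_dir(cfg_data, checksum):
    """Hypothetical helper: pick the directory name of the processed dataset.

    Uses cfg_data.name_proc when it is set, otherwise falls back to
    "<name>_<checksum>", mirroring the try/except in the post.
    """
    name_proc = getattr(cfg_data, "name_proc", None)
    if name_proc:
        return name_proc
    return f"{cfg_data.name}_{checksum}"


# Stand-in configs for illustration only; "abc123" is a made-up checksum.
cfg_with = SimpleNamespace(
    name="the_pile_WordPiecex32768",
    name_proc="the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020",
)
cfg_without = SimpleNamespace(name="the_pile_WordPiecex32768")

print(resolve_processed_dir(cfg_with, "abc123"))
print(resolve_processed_dir(cfg_without, "abc123"))
```

With `name_proc` present, the checksum is ignored, which is what lets the readymade dataset directory be found regardless of local config differences.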

Change the original line 47 tokenized_dataset = datasets.load_from_disk(data_path) to:

if cfg_data is not None:
    tokenized_dataset = datasets.load_dataset(data_path)["train"].with_format("torch")
else:
    tokenized_dataset = datasets.load_from_disk(data_path)
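
Since cfg_data is presumably always set at this point, the condition above effectively always takes the load_dataset branch. One possible alternative, sketched here under assumptions, is to choose the loader from what is actually on disk. `is_saved_to_disk` and `load_tokenized` are hypothetical names, and the check assumes that `save_to_disk` output contains a `state.json` (single dataset) or `dataset_dict.json` (dataset dict) marker file.

```python
from pathlib import Path


def is_saved_to_disk(data_path):
    """Return True if data_path looks like output of save_to_disk.

    Assumption: such directories contain state.json (Dataset) or
    dataset_dict.json (DatasetDict).
    """
    root = Path(data_path)
    return (root / "state.json").is_file() or (root / "dataset_dict.json").is_file()


def load_tokenized(data_path):
    """Hypothetical loader that picks the matching datasets call."""
    import datasets  # imported lazily so the check above works without the library

    if is_saved_to_disk(data_path):
        return datasets.load_from_disk(data_path)
    # Otherwise treat the directory as a downloaded Hub dataset, as in the post.
    return datasets.load_dataset(data_path)["train"].with_format("torch")
```

This keeps the original load_from_disk path working for locally preprocessed data while still handling a downloaded copy of the readymade dataset.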

Finally, use the following command to train:

python pretrain.py \
    name=amp_b8192_cb_o4_final arch=crammed-bert \
    train=bert-o4 data=pile-readymade-local

from cramming.

JonasGeiping commented on June 16, 2024

Ok, I'm glad you got it working!

This was never a use case I had before, given that I have the originals. I'll close this issue for now, but people will still be able to find it through search.
