If you want to train your own language model on a Wikipedia in your chosen language,
run `prepare_wiki.sh`. The script will ask for a language and will then
download, extract, and prepare the latest version of Wikipedia for the chosen language.

Example command: `bash prepare_wiki.sh`

This will create a `data` folder in this directory with `wiki_dumps`, `wiki_extr`, and
`wiki` subfolders. In each subfolder, it will furthermore create a folder `LANG`
where `LANG` is the language of the Wikipedia. The prepared files are stored in
`wiki/LANG` as `train.csv` and `val.csv` to match the format used for the text
classification datasets. By default, `train.csv` contains around 100 million tokens
and `val.csv` is 10% the size of `train.csv`.
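
To sanity-check the prepared files, you can peek at them with pandas. Below is a minimal sketch, assuming `LANG=de` and that the CSVs have no header row and store a label column followed by the text; adjust the path and column layout to your setup:

```python
import pandas as pd

# Peek at the first few rows of the prepared Wikipedia data.
# The path and the no-header label/text column layout are assumptions.
df = pd.read_csv('data/wiki/de/train.csv', header=None, names=['label', 'text'], nrows=5)
print(df)
```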

Run `create_toks.py` to tokenize the input texts.

Example command: `python create_toks.py data/imdb`

Usage:

    create_toks.py DIR_PATH [CHUNKSIZE] [N_LBLS] [LANG]
    create_toks.py --dir-path DIR_PATH [--chunksize CHUNKSIZE] [--n-lbls N_LBLS] [--lang LANG]

- `DIR_PATH`: the directory where your data is located
- `CHUNKSIZE`: the size of the chunks when reading the files with pandas; use smaller sizes with less RAM
- `N_LBLS`: the number of label columns in the CSV files
- `LANG`: the language of your corpus

`train.csv` and `val.csv` files should be in `DIR_PATH`. The script will then save the
training and validation tokens and labels as arrays to binary files in NumPy format in a `tmp`
subfolder of the above path in the following files:
`tok_trn.npy`, `tok_val.npy`, `lbl_trn.npy`, and `lbl_val.npy`.
In addition, a joined corpus containing whitespace-separated tokens is produced in `tmp/joined.txt`.
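
If you want to inspect the tokenization, the `.npy` files can be loaded directly with NumPy. A minimal sketch, assuming the arrays store one list of string tokens (or one label) per document, which is why `allow_pickle=True` is needed:

```python
import numpy as np

# Load the tokens and labels written by create_toks.py for the IMDb example.
tok_trn = np.load('data/imdb/tmp/tok_trn.npy', allow_pickle=True)
lbl_trn = np.load('data/imdb/tmp/lbl_trn.npy', allow_pickle=True)

print(len(tok_trn), 'training documents')
print(tok_trn[0][:20])  # first 20 tokens of the first document
print(lbl_trn[:5])      # first 5 labels
```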

Run `tok2id.py` to map the tokens in the `tok_trn.npy` and `tok_val.npy` files to ids.

Example command: `python tok2id.py data/imdb`

Usage:

    tok2id.py PREFIX [MAX_VOCAB] [MIN_FREQ]
    tok2id.py --prefix PREFIX [--max-vocab MAX_VOCAB] [--min-freq MIN_FREQ]

- `PREFIX`: the file path prefix in `data/nlp_clas/{prefix}`
- `MAX_VOCAB`: the maximum vocabulary size
- `MIN_FREQ`: the minimum frequency of words that should be kept
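
Conceptually, the script keeps at most `MAX_VOCAB` tokens that occur at least `MIN_FREQ` times and maps everything else to an unknown token. The sketch below illustrates that idea; the special token names and the fallback id of 0 are assumptions for illustration, not a description of the script's internals:

```python
import collections
import numpy as np

def build_vocab(tokenized_docs, max_vocab=60000, min_freq=2):
    """Illustrative token-to-id mapping in the spirit of tok2id.py."""
    freq = collections.Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [tok for tok, cnt in freq.most_common(max_vocab) if cnt >= min_freq]
    itos.insert(0, '_pad_')  # assumed special tokens
    itos.insert(0, '_unk_')
    # Unknown tokens fall back to id 0 (the assumed _unk_ token).
    stoi = collections.defaultdict(int, {tok: i for i, tok in enumerate(itos)})
    return itos, stoi

tok_trn = np.load('data/imdb/tmp/tok_trn.npy', allow_pickle=True)
itos, stoi = build_vocab(tok_trn)
trn_ids = [[stoi[tok] for tok in doc] for doc in tok_trn]
print(len(itos), 'tokens in the vocabulary')
```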

Before fine-tuning the language model, you can run `pretrain_lm.py` to create a
pre-trained language model using WikiText-103 (or whatever base corpus you prefer).

Example command: `python pretrain_lm.py data/wiki/de/ 0 --lr 1e-3 --cl 12`

Usage:

    pretrain_lm.py DIR_PATH CUDA_ID [CL] [BS] [BACKWARDS] [LR] [SAMPLED] [PRETRAIN_ID]
    pretrain_lm.py --dir-path DIR_PATH --cuda-id CUDA_ID [--cl CL] [--bs BS] [--backwards BACKWARDS] [--lr LR] [--sampled SAMPLED] [--pretrain-id PRETRAIN_ID]

- `DIR_PATH`: the directory that contains the Wikipedia files
- `CUDA_ID`: the id of the GPU that should be used
- `CL`: the number of epochs to train
- `BS`: the batch size
- `BACKWARDS`: whether a backwards LM should be trained
- `LR`: the learning rate
- `SAMPLED`: whether a sampled softmax should be used (default: `True`)
- `PRETRAIN_ID`: the id used for saving the trained LM

You might have to adapt the learning rate and the number of epochs to maximize performance.
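
If you also need a backwards language model, the `BACKWARDS` argument shown in the usage above should cover it; the flag syntax below is an assumption based on that usage string.

Example command: `python pretrain_lm.py data/wiki/de/ 0 --lr 1e-3 --cl 12 --backwards True`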

Alternatively, you can download the pre-trained models here. Beforehand,
create a directory `wt103`. In `wt103`, create a `models` and a `tmp`
folder. Save the model files in the `models` folder and `itos_wt103.pkl`,
the id-to-token vocabulary mapping, to the `tmp` folder.
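
A quick way to create that layout (the `data/wt103` location mirrors the example command in the next step; adjust it if you keep the model elsewhere):

```python
from pathlib import Path

# Create the folder layout expected for the downloaded WikiText-103 model:
# data/wt103/models for the model weights, data/wt103/tmp for itos_wt103.pkl.
for sub in ('models', 'tmp'):
    Path('data/wt103', sub).mkdir(parents=True, exist_ok=True)
```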

Then run `finetune_lm.py` to fine-tune a language model pretrained on WikiText-103 on the target task data.

Example command: `python finetune_lm.py data/imdb data/wt103 1 25 --lm-id pretrain_wt103`

Usage:

    finetune_lm.py DIR_PATH PRETRAIN_PATH [CUDA_ID] [CL] [PRETRAIN_ID] [LM_ID] [BS] [DROPMULT] [BACKWARDS] [LR] [PRELOAD] [BPE] [STARTAT] [USE_CLR] [USE_REGULAR_SCHEDULE] [USE_DISCRIMINATIVE] [NOTRAIN] [JOINED] [TRAIN_FILE_ID] [EARLY_STOPPING]
    finetune_lm.py --dir-path DIR_PATH --pretrain-path PRETRAIN_PATH [--cuda-id CUDA_ID] [--cl CL] [--pretrain-id PRETRAIN_ID] [--lm-id LM_ID] [--bs BS] [--dropmult DROPMULT] [--backwards BACKWARDS] [--lr LR] [--preload PRELOAD] [--bpe BPE] [--startat STARTAT] [--use-clr USE_CLR] [--use-regular-schedule USE_REGULAR_SCHEDULE] [--use-discriminative USE_DISCRIMINATIVE] [--notrain NOTRAIN] [--joined JOINED] [--train-file-id TRAIN_FILE_ID] [--early-stopping EARLY_STOPPING]

- `DIR_PATH`: the directory where the `tmp` and `models` folders are located
- `PRETRAIN_PATH`: the path where the pretrained model is saved; if using the downloaded model, this is `wt103`
- `CUDA_ID`: the id of the GPU used for training the model
- `CL`: the number of epochs to train the model
- `PRETRAIN_ID`: the id of the pretrained model; set to `wt103` by default
- `LM_ID`: the id used for saving the fine-tuned language model
- `BS`: the batch size used for training the model
- `DROPMULT`: the factor used to multiply the dropout parameters
- `BACKWARDS`: whether a backwards LM is trained
- `LR`: the learning rate
- `PRELOAD`: whether we load a pretrained LM (`True` by default)
- `BPE`: whether we use byte-pair encoding (BPE)
- `STARTAT`: can be used to continue fine-tuning a model; if `>0`, loads an already fine-tuned LM; can also be used to indicate the layer at which to start the gradual unfreezing (`1` is the last hidden layer, etc.); in the final model, we only used this for training the classifier
- `USE_CLR`: whether to use slanted triangular learning rates (STLR) (`True` by default)
- `USE_REGULAR_SCHEDULE`: whether to use a regular schedule (instead of STLR)
- `USE_DISCRIMINATIVE`: whether to use discriminative fine-tuning (`True` by default)
- `NOTRAIN`: whether to skip fine-tuning
- `JOINED`: whether to fine-tune the LM on the concatenation of training and validation data
- `TRAIN_FILE_ID`: can be used to indicate different training files (e.g. to test training sizes)
- `EARLY_STOPPING`: whether to use early stopping

The language model is fine-tuned using warm-up reverse annealing and triangular learning rates. For IMDb,
we set `--cl`, the number of epochs, to 50 and used a learning rate `--lr` of 4e-3.
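
Plugging those IMDb settings into the example invocation above gives roughly the following command (the flag syntax is assumed from the usage string):

Example command: `python finetune_lm.py data/imdb data/wt103 1 50 --lm-id pretrain_wt103 --lr 4e-3`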

Run `train_clas.py` to train the classifier on top of the fine-tuned language model with gradual unfreezing,
discriminative fine-tuning, and slanted triangular learning rates.

Example command: `python train_clas.py data/imdb 0 --lm-id pretrain_wt103 --clas-id pretrain_wt103 --cl 50`

Usage:

    train_clas.py DIR_PATH CUDA_ID [LM_ID] [CLAS_ID] [BS] [CL] [BACKWARDS] [STARTAT] [UNFREEZE] [LR] [DROPMULT] [BPE] [USE_CLR] [USE_REGULAR_SCHEDULE] [USE_DISCRIMINATIVE] [LAST] [CHAIN_THAW] [FROM_SCRATCH] [TRAIN_FILE_ID]
    train_clas.py --dir-path DIR_PATH --cuda-id CUDA_ID [--lm-id LM_ID] [--clas-id CLAS_ID] [--bs BS] [--cl CL] [--backwards BACKWARDS] [--startat STARTAT] [--unfreeze UNFREEZE] [--lr LR] [--dropmult DROPMULT] [--bpe BPE] [--use-clr USE_CLR] [--use-regular-schedule USE_REGULAR_SCHEDULE] [--use-discriminative USE_DISCRIMINATIVE] [--last LAST] [--chain-thaw CHAIN_THAW] [--from-scratch FROM_SCRATCH] [--train-file-id TRAIN_FILE_ID]

- `DIR_PATH`: the directory where the `tmp` and `models` folders are located
- `CUDA_ID`: the id of the GPU used for training the model
- `LM_ID`: the id of the fine-tuned language model that should be loaded
- `CLAS_ID`: the id used for saving the classifier
- `BS`: the batch size used for training the model
- `CL`: the number of epochs to train the model with all layers unfrozen
- `BACKWARDS`: whether a backwards LM is trained
- `STARTAT`: whether to use gradual unfreezing (`0`) or load the pretrained model (`1`)
- `UNFREEZE`: whether to unfreeze the whole network (after optional gradual unfreezing) or train only the final classifier layer (default is `True`)
- `LR`: the learning rate
- `DROPMULT`: the factor used to multiply the dropout parameters
- `BPE`: whether we use byte-pair encoding (BPE)
- `USE_CLR`: whether to use slanted triangular learning rates (STLR) (`True` by default)
- `USE_REGULAR_SCHEDULE`: whether to use a regular schedule (instead of STLR)
- `USE_DISCRIMINATIVE`: whether to use discriminative fine-tuning (`True` by default)
- `LAST`: whether to fine-tune only the last layer of the model
- `CHAIN_THAW`: whether to use chain-thaw
- `FROM_SCRATCH`: whether to train the model from scratch (without loading a pretrained model)
- `TRAIN_FILE_ID`: can be used to indicate different training files (e.g. to test training sizes)

For fine-tuning the classifier on IMDb, we set `--cl`, the number of epochs, to 50.

Run `eval_clas.py` to get the classifier accuracy and confusion matrix.
This requires the files produced during the training process: `itos.pkl` and the classifier (named `clas_1.h5` by default), as well as the `.npy`
files containing the evaluation samples and labels.

Example command: `python eval_clas.py data/imdb 0 --lm-id pretrain_wt103 --clas-id pretrain_wt103`

Usage:

    eval_clas.py DIR_PATH CUDA_ID [LM_ID] [CLAS_ID] [BS] [BACKWARDS] [BPE]
    eval_clas.py --dir-path DIR_PATH --cuda-id CUDA_ID [--lm-id LM_ID] [--clas-id CLAS_ID] [--bs BS] [--backwards BACKWARDS] [--bpe BPE]

- `DIR_PATH`: the directory where the `tmp` and `models` folders are located
- `CUDA_ID`: the id of the GPU used for evaluating the model
- `LM_ID`: the id of the fine-tuned language model that should be loaded
- `CLAS_ID`: the id of the saved classifier that should be evaluated
- `BS`: the batch size used for evaluating the model
- `BACKWARDS`: whether a backwards LM is used
- `BPE`: whether we use byte-pair encoding (BPE)
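
If you prefer to compute the metrics yourself, for example from predictions you export separately, the scikit-learn sketch below works; `preds_val.npy` is a hypothetical file holding your model's predicted class ids, aligned with the validation set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Ground-truth labels written by create_toks.py; the predictions file is a
# hypothetical export of your own (preds_val.npy is not produced by these scripts).
y_true = np.load('data/imdb/tmp/lbl_val.npy', allow_pickle=True).astype(int).flatten()
y_pred = np.load('preds_val.npy')

print('accuracy:', accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```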

Run `predict_with_classifier.py` to get predictions for free-text input.
This requires two files produced during the training process: `itos.pkl` and the classifier (named `clas_1.h5` by default).

Example command: `python predict_with_classifier.py trained_models/itos.pkl trained_models/classifier_model.h5`

It is suggested that you customize this script to your needs.
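
As a starting point for such customization, the sketch below shows how free text can be mapped to ids via `itos.pkl` before being handed to the classifier. The whitespace tokenizer and the unknown-token fallback id of 0 are simplifying assumptions; the training scripts use a proper tokenizer.

```python
import collections
import pickle

# Load the id-to-token mapping saved during training and invert it.
with open('trained_models/itos.pkl', 'rb') as f:
    itos = pickle.load(f)
stoi = collections.defaultdict(int, {tok: i for i, tok in enumerate(itos)})

def text_to_ids(text):
    """Very rough stand-in for the real tokenizer: lower-case, whitespace split."""
    return [stoi[tok] for tok in text.lower().split()]

print(text_to_ids('this movie was surprisingly good'))
```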