epfllm / megatron-llm
distributed trainer for LLMs
License: Other
I have found that the code for get_checkpoint_name(s) and DistributedOptimizer differs from the upstream version, which has fixed many bugs. Would you mind rebasing?
The validation metrics are currently not sent to wandb correctly. In the internal wandb logs I found entries like:
2023-08-03 17:50:55,807 WARNING HandlerThread:28565 [handler.py:handle_request_partial_history():553] Step 50 < 51. Dropping entry: {'lm loss validation': 1.3021695613861084, '_timestamp': 1691085055.8074224}
I suspect this happens because flush_all() is called in training_log(), which calls wandb.log(..., commit=True). When the same step (iteration) number is later used in evaluate_and_print_results(), the generated log entries seem to be ignored by wandb.
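For anyone debugging this, a minimal sketch (my own illustration, not code from this repo) of the wandb behavior that appears to be at play: once a step is committed, later log calls targeting that step are dropped, and commit=False is one way to keep a step open until all metrics for that iteration have arrived.

import wandb

run = wandb.init(project="demo")

# Committing step 50 advances wandb's internal step counter past it.
wandb.log({"lm loss": 1.31}, step=50, commit=True)

# A later call reusing step 50 is dropped with the
# "Step 50 < 51. Dropping entry" warning quoted above.
wandb.log({"lm loss validation": 1.30}, step=50)

# One workaround: keep the step open with commit=False until the
# validation metrics for the same iteration have also been logged.
wandb.log({"lm loss": 1.31}, step=51, commit=False)
wandb.log({"lm loss validation": 1.29}, step=51, commit=True)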
The training speed gradually slowed down while I was fine-tuning Llama-2 7B: the "elapsed time per iteration" metric increased from 2000 ms to 7000 ms, and GPU utilization gradually dropped from 80% to 20% within the first 100 training iterations. There may be a subtle memory leak, but I cannot figure it out. Could you give me some advice?
Hi,
I found that the following code in arguments.py was deleted in your implementation. May I ask why?
if args.swiglu:
    # reduce the dimension for MLP since the projection happens on
    # two linear layers. this keeps the number of parameters in
    # the same ballpark as the counterpart with 4*h size
    # we keep it a multiple of 64, which means the actual tensor size
    # will be a multiple of 64 / tp_size
    args.ffn_hidden_size = int((4 * args.hidden_size * 2 / 3) / 64) * 64
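As a quick sanity check of what this formula produces (my own arithmetic, not from the thread), compare it with the rounding rule in Meta's reference LLaMA code, which rounds up to a multiple of 256 rather than down to a multiple of 64:

hidden_size = 4096  # Llama-2 7B

# Deleted Megatron rule: round 8h/3 down to a multiple of 64.
megatron_ffn = int((4 * hidden_size * 2 / 3) / 64) * 64
print(megatron_ffn)  # 10880

# LLaMA reference rule: round 8h/3 up to a multiple of 256.
multiple_of = 256
llama_ffn = multiple_of * ((int(8 * hidden_size / 3) + multiple_of - 1) // multiple_of)
print(llama_ffn)  # 11008, the published Llama-2 7B ffn_hidden_size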
We should add a small section to docs/guide/getting_started.md describing how to use the update_to_hub script.
repo commit fa9e08f34954076a7d8b53d48c4cfac9cfd550e7
env
# python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.14.0a0+410ce96'
I converted Llama-2 7B to Megatron weights and split them with tp=2, pp=4; the global world size is 8 (single-node test).
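For reference, the parallel layout these settings imply (simple arithmetic, not taken from the report):

tp, pp, world_size = 2, 4, 8
dp = world_size // (tp * pp)
assert tp * pp * dp == world_size  # 2 * 4 * 1 = 8, i.e. one data-parallel replica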
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0x6c (0x7f13158ba70c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) + 0xfa (0x7f131587d620 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(char const*, char const*, int, bool) + 0x33e (0x7f131594468e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x154fd (0x7f13159154fd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x3f0f6 (0x7f131593f0f6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x507c0a (0x7f1355ed1c0a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3b861 (0x7f131589c861 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x186 (0x7f13158960b6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0xd (0x7f13158961dd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #9: + 0x4604cd3 (0x7f134f0aecd3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x46051e3 (0x7f134f0af1e3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f1355daaeb8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #12: + 0xf3c722 (0x7f134b9e6722 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::python::PythonEngine::execute(std::vector > const&, std::vector > const&, bool, bool, bool, std::vector > const&) + 0x6e (0x7f1356127ece in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #14: THPEngine_run_backward(_object*, _object*, _object*) + 0x2f9 (0x7f1356126849 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #15: PyCFunction_Call + 0x59 (0x5f5b39 in /usr/bin/python)
frame #16: _PyObject_MakeTpCall + 0x296 (0x5f6706 in /usr/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x62cd (0x57165d in /usr/bin/python)
frame #18: _PyFunction_Vectorcall + 0x1b6 (0x5f5ee6 in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x726 (0x56bab6 in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x1b6 (0x5f5ee6 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x726 (0x56bab6 in /usr/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x569d8a in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x393 (0x5f60c3 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1902 (0x56cc92 in /usr/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x26a (0x569d8a in /usr/bin/python)
frame #26: _PyFunction_Vectorcall + 0x393 (0x5f60c3 in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x726 (0x56bab6 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x1b6 (0x5f5ee6 in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x726 (0x56bab6 in /usr/bin/python)
frame #30: _PyEval_EvalCodeWithName + 0x26a (0x569d8a in /usr/bin/python)
frame #31: _PyFunction_Vectorcall + 0x393 (0x5f60c3 in /usr/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x726 (0x56bab6 in /usr/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x26a (0x569d8a in /usr/bin/python)
frame #34: PyEval_EvalCode + 0x27 (0x68e267 in /usr/bin/python)
frame #35: /usr/bin/python() [0x67d9b1]
frame #36: /usr/bin/python() [0x67da2f]
frame #37: /usr/bin/python() [0x67dad1]
frame #38: PyRun_SimpleFileExFlags + 0x197 (0x67fbf7 in /usr/bin/python)
frame #39: Py_RunMain + 0x212 (0x6b8082 in /usr/bin/python)
frame #40: Py_BytesMain + 0x2d (0x6b840d in /usr/bin/python)
frame #41: __libc_start_main + 0xf3 (0x7f13a9204083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x2e (0x5faa2e in /usr/bin/python)
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/mpt/Megatron-LLM/Megatron-LLM/tools/preprocess_data.py", line 71, in encode
text = data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mpt/Megatron-LLM/Megatron-LLM/tools/preprocess_data.py", line 201, in
main()
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
padded vocab (size: 32005) with 123 dummy tokens (new size: 32128)
File "/mpt/Megatron-LLM/Megatron-LLM/tools/preprocess_data.py", line 179, in main
for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 423, in
return (item for chunk in result for item in chunk)
File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
KeyError: 'text'
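The traceback shows encode() doing text = data[key] with key == 'text', so every input JSONL line needs that field. A minimal sketch of writing a valid input file (hypothetical contents):

import json

docs = ["First document.", "Second document."]
with open("corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps({"text": doc}) + "\n")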
Hello, when I set --make_vocab_size_divisible_by 1 in the finetune script, the effective --make_vocab_size_divisible_by value is still 128. Why?
this is my finetune script:
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 100 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 --ffn_hidden_size 11008
--hidden_dropout 0.0 --position_embedding_type rotary --no_bias_gelu_fusion
--no_bias_dropout_fusion --use_checkpoint_args
--attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5
--layernorm_epsilon 1e-6
--weight_decay 0.1 --sequence_parallel --recompute_granularity selective
--log_timers_to_tensorboard
--rope_scaling_factor 1.0"
torchrun $DISTRIBUTED_ARGS finetune.py \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--load /Megatron-LLM-sharded-weights \
--save /Megatron-LLM-sharded-weights \
--tensorboard_dir /Megatron-LLM-sharded-weights/tensorboard/ \
--data_path /Megatron-LLM/corpus_indexed/china_text_document \
--split 100,0,0 \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--make_vocab_size_divisible_by 1 \
--bf16 \
--global_batch_size 1000 \
--micro_batch_size 2 \
--use_checkpoint_args \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS
Great job!
Qwen is an open-source model widely used by the community. Does this project support training it?
Hello, when I run fine-tuning for Llama-2 7B I get this error:
Traceback (most recent call last):
File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 261, in
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 108, in pretrain
model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 371, in _setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 603, in load_checkpoint
check_checkpoint_args(checkpoint_args)
File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 57, in check_checkpoint_args
_compare('padded_vocab_size')
File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 49, in _compare
assert checkpoint_value == args_value, error_message
AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256).
================================================================================
this is my script:
export CUDA_DEVICE_MAX_CONNECTIONS=1
LOG_ARGS="--log_interval 1 --save_interval 10 --eval_interval 10"
TRAIN_ARGS="--train_iters 10 --lr_decay_style cosine --lr_warmup_iters 5 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 --ffn_hidden_size 11008
--hidden_dropout 0.0 --position_embedding_type rotary --no_bias_gelu_fusion
--no_bias_dropout_fusion --use_checkpoint_args
--attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5
--layernorm_epsilon 1e-6
--weight_decay 0.1 --sequence_parallel --recompute_activations --recompute_granularity selective
--log_timers_to_tensorboard
--rope_scaling_factor 1.0"
#--vocab_file=/home/dengkaibiao/Llama-2-7b-hf/tokenizer.model
export CUDA_VISIBLE_DEVICES=1,2
torchrun $DISTRIBUTED_ARGS finetune.py \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--load /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
--save /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
--tensorboard_dir /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2/tensorboard/ \
--data_path /home/dengkaibiao/Megatron-LLM/corpus_indexed/china_text_document \
--split 100,0,0 \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--vocab_file=/home/dengkaibiao/Llama-2-7b-hf/tokenizer.model \
--make_vocab_size_divisible_by 1 \
--bf16 \
--global_batch_size 128 \
--micro_batch_size 1 \
--use_flash_attn \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS
Calling examples/finetune.sh falcon --size 7 --tp 1 --pp 2 --gpus 8 --global-batch 8
I get the error finetune.py: error: unrecognized arguments: --use_multiquery_attn
(the argument is added in finetune.sh#L65). This argument does not seem to be defined in arguments.py. Is it safe to simply leave it out?
The current weight conversion script doesn't generate a corresponding HuggingFace tokenizer configuration. Ideally the tokenizer configuration (special_tokens_map.json, tokenizer.json, tokenizer.model, tokenizer_config.json) should be generated as part of the megatron2hf conversion script.
As a temporary solution I created a create_hf_tokenizer_config.py script that generates a HF tokenizer configuration with token ids matching the Megatron-LLM tokenizers, with support for additional custom tokens.
Additionally I noticed the following points:
- Unlike the _SentencePieceTokenizer, the _FalconTokenizer doesn't add special tokens like <CLS>, <SEP>, <EOD>, <MASK>, and it also uses the standard EOS token (<|endoftext|>) as the EOD token.
- In the _SentencePieceTokenizer, the use of custom tokens is tied to adding the special tokens (<CLS>, <SEP>, <EOD>, <MASK> are added when new_tokens == True) even though they might not be used (EOD should always be mapped to EOS (</s>) since it is used by get_ltor_masks_and_position_ids() when reset_position_ids or reset_attention_mask are True).
Traceback (most recent call last):
File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 261, in
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 139, in pretrain
iteration = _train(args,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 685, in _train
train_step(forward_step_func,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 412, in train_step
losses_reduced = forward_backward_func(
File "/home/dengkaibiao/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/home/dengkaibiao/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 227, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
return self.module(*inputs, **kwargs)
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/module.py", line 186, in forward
outputs = self.module(*inputs, **kwargs)
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
lm_output = self.language_model(
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
encoder_output = self.encoder(
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
hidden_states = layer(
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 757, in forward
attention_output, attention_bias = self.self_attention(layernorm_output,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 510, in forward
context_layer = self._checkpointed_attention_forward(
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 397, in checkpointed_attention_forward
hidden_states = megatron.core.tensor_parallel.checkpoint(
File "/home/dengkaibiao/Megatron-LLM/megatron/core/tensor_parallel/random.py", line 251, in checkpoint
return CheckpointFunction.apply(function,
File "/home/dengkaibiao/Megatron-LLM/megatron/core/tensor_parallel/random.py", line 194, in forward
outputs = run_function(*args)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 393, in custom_forward
output = self.core_attention(query_layer, key_layer,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 231, in forward
attention_probs = self.scale_mask_softmax(attention_scores,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 148, in forward
return self.forward_fused_softmax(input, mask)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 183, in forward_fused_softmax
probs = ScaledUpperTriangMaskedSoftmax.apply(input, scale)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 22, in forward
softmax_results = scaled_upper_triang_masked_softmax_cuda.forward(
RuntimeError: seq_len <= 2048 INTERNAL ASSERT FAILED at "/home/llm-deploy/apex/csrc/megatron/scaled_upper_triang_masked_softmax_cuda.cu":38, please report a bug to PyTorch.
Could you provide a script to convert HF LLaMA or Llama-2 weights to Megatron format? Thanks.
Could you send me the complete set of parameters for training Llama-2 with finetune.py?
Hi! Your work is excellent. I want to use it to optimize inference of LLaMA2-70B. I have 4 servers, each with 2 A100 (80G) GPUs, and I want to use pipeline (model) parallelism together with tensor parallelism to make inference faster. Can you help me? :)
I have 16 RTX 4090s across two machines, 8 cards each. How can I use this project?
python weights2megatron.py llama --size=30 --out=test --cache-dir=pretrain_model/LLaMA/30B/ (Meta release version)
torch.Size([4992, 6656]) 128 52
Converting weights: 0%| | 0/60 [00:00<?, ?it/s]
Traceback (most recent call last):
File "weights2megatron.py", line 257, in <module>
main(args.model, args.size, args.out, args.cache_dir, args.megatron_path)
File "weights2megatron.py", line 165, in main
megatron_weights = llama_to_megatron(hf_weights, size, llama_source,
File "weights2megatron.py", line 135, in llama_to_megatron
transformer[f"{prefix}.attention.query_key_value.weight"] = rearrange_qkv(
File "weights2megatron.py", line 95, in rearrange_qkv
assert len(wq) == n_heads
AssertionError
it seems that we need to transpose the weights passed to rearrange_qkv, like the following:
weights[f"{prefix}.attention.wq.weight"].T if version == 1 else weights[f"{prefix}.attention.wq.weight"],
weights[f"{prefix}.attention.wk.weight"].T if version == 1 else weights[f"{prefix}.attention.wk.weight"],
weights[f"{prefix}.attention.wv.weight"].T if version == 1 else weights[f"{prefix}.attention.wv.weight"]
Can we support LLaMA 1? Currently it only supports llama2:
assert version == 2, "Only llama v2 available using huggingface"
The current rotary-embedding code in this repo seems to ignore position_ids and instead always assumes the positions match the sequence indices, i.e. position_ids are not passed on to the encoder.forward() function:
Megatron-LLM/megatron/model/language_model.py
Lines 512 to 515 in 9006118
The actual call to apply_rotary_emb() always uses the pre-computed self.freqs_cis without any position_id-dependent lookups, see transformer.py#L499.
This is incompatible with the args.reset_position_ids parameter used for get_ltor_masks_and_position_ids() in finetune.py#L76, which, if set to True, generates position ids for batch packing (potentially restarting from 0 multiple times within one sequence).
Compare to the HF transformers Llama implementation, which passes the position ids on to the attention layers (src/transformers/models/llama/modeling_llama.py#L412-L420) and actually uses them in apply_rotary_pos_emb().
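For contrast, a minimal sketch (my own illustration following the HF approach, not this repo's code) of how indexing the rotary tables with explicit position_ids supports packed sequences whose positions restart at 0:

import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, cos_table, sin_table, position_ids):
    # cos_table/sin_table: precomputed [max_seq_len, head_dim] tables.
    # Indexing with position_ids ([batch, seq]) instead of arange(seq)
    # lets a packed batch restart its rotation phase mid-sequence.
    cos = cos_table[position_ids].unsqueeze(1)  # [batch, 1, seq, head_dim]
    sin = sin_table[position_ids].unsqueeze(1)
    return q * cos + rotate_half(q) * sin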
In the original Llama repository, a BOS token is prepended during inference, as seen in this code snippet.
Given this, should we also prepend a BOS token for each document during the 2nd stage of pretraining to ensure alignment with the original model's practices?
From prior models such as GPT-2 and BLOOM, a <|endoftext|> token is typically used to delineate separate documents, e.g. a common approach is doc1 <eos> doc2 <eos> .... While I'm uncertain about Llama-2's exact handling of this, maybe something like <bos> doc1 <eos> <bos> doc2 <eos> ...?
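To make the proposed format concrete, a tiny sketch (hypothetical per-document token ids; <s>=1 and </s>=2 match the "Special tokens" log further down this page):

bos, eos = 1, 2  # Llama's SentencePiece <s> and </s> ids
docs = [[10, 11], [12, 13, 14]]  # hypothetical tokenized documents
packed = []
for doc in docs:
    packed += [bos] + doc + [eos]
print(packed)  # [1, 10, 11, 2, 1, 12, 13, 14, 2]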
Can we use a Hugging Face dataset instead of the Megatron-style indexed dataset for training?
Can anyone help me with this? I'm having a bit of trouble, and I didn't find a similar problem in the existing issues.
Using hf_to_megatron.py generates a weights file with TP=1, PP=1. How can I use it in a TP=2, PP=2 scenario? I noticed that the embedding parameters get split in two with TP=2.
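For what it's worth, the resharding tool invoked elsewhere on this page (tools/checkpoint_util.py) appears to handle exactly this conversion; a hedged example with placeholder paths:

python tools/checkpoint_util.py --model_type llama2 --load_dir /path/to/tp1-pp1-weights --save_dir /path/to/tp2-pp2-weights --target_tensor_parallel_size 2 --target_pipeline_parallel_size 2 --bf16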
First of all, thank you for creating this project! It looks very exciting and interesting due to its close Hugging Face integration.
I was very curious and wanted to give it a try, following the Getting Started guide in the documentation, but I ran into an error during the "Model Sharding" step, resulting in a Bus error (core dumped).
I am running on a single node with 8x A100 80GB and 1TB of memory. I followed the exact same steps in the guide and used the container.
Below is the full error stack in case it's helpful. It includes quite a lot of odd C++ errors/warnings at the beginning. I installed the package with
cd Megatron-LLM
pip install -r requirements.txt
cd megatron/data/
make
cd ../../
in the container.
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:95: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:103: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::Half; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:127: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::BFloat16; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:138: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
[3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF fused_weight_gradient_dense.o.d -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cpp -o fused_weight_gradient_dense.o
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -gencode arch=compute_80,code=sm_80 -std=c++17 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cu -o fused_weight_gradient_dense.cuda.o
[3/3] c++ fused_weight_gradient_dense.o fused_weight_gradient_dense.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_dense_cuda.so
Loading extension module fused_dense_cuda...
Building model ...
/epfllm/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/epfllm/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
loading release checkpoint from ./model
checkpoint version 3.0
successfully loaded checkpoint from ./model at iteration 0
using world size: 4, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
setting global batch size to 1
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 32
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 11008
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 1
glu_activation .................................. swiglu
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lima_dropout .................................... False
load ............................................ None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 4096
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
metrics ......................................... []
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
new_tokens ...................................... True
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 32
num_attention_heads_kv .......................... 32
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
optimizer ....................................... adam
override_opt_param_scheduler .................... False
parallel_attn ................................... False
parallel_layernorm .............................. False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.rotary
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
rope_scaling_factor ............................. 1.0
rope_theta ...................................... 10000.0
sample_rate ..................................... 1.0
save ............................................ ./model_sharded
save_interval ................................... 1
scalar_loss_mask ................................ 0.0
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_iters ...................................... []
split ........................................... 969, 30, 1
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
tie_embed_logits ................................ False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. SentencePieceTokenizer
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
use_bias ........................................ False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_post_ln ..................................... False
use_ring_exchange_p2p ........................... False
use_rms_norm .................................... True
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_extra_ids_list ............................ None
vocab_file ...................................... None
wandb_api_key ................................... None
wandb_entity .................................... meditron
wandb_id ........................................ None
wandb_logger .................................... False
wandb_project ................................... None
wandb_resume .................................... False
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending embeddings
sending lm_head
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
sending transformer layer 0
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_dense_cuda...
Bus error (core dumped)
What is the minimum number of 80 GB GPUs required to train Falcon-40B or Llama2-70B, and what are working/best TP/PP configurations?
So far it seems a single 8x A100 80 GB node is not enough even for Falcon-40B. I tried with TP 8: after model loading, ~50 GB of GPU memory is allocated, and it goes OOM when processing the first batch.
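A hedged back-of-the-envelope (the standard mixed-precision Adam accounting, not measured on this repo) suggests why one node is tight for 40B parameters:

params = 40e9
# bf16 weights + fp32 master weights + Adam m/v + fp32 grads (rule of thumb)
bytes_per_param = 2 + 4 + 4 + 4 + 4
print(round(params * bytes_per_param / 2**30))  # ~670 GiB before activations,
# versus 640 GB total on one 8x A100 80 GB node, unless optimizer state is sharded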
The weights2megatron directory contains the script convert_llama2hf.py, which looks very similar to (an old version of) convert_llama_weights_to_hf.py. The only difference seems to be a custom parameter for max_shard_size ...
The current version of convert_llama_weights_to_hf.py in HF transformers has support for Llama2, e.g. the 70B variant and MQA (n_kv_heads != n_heads).
I reported this issue previously in issue #22.
After conducting an extensive investigation, I finally found that this issue only occurs when setting micro_batch_size=1, so I decided to open a new issue to emphasize this point.
I believe you can reproduce this issue: I pulled your latest code and ran it with micro_batch_size=1, and the problem still persisted. It returned to normal after setting micro_batch_size to 2. (Both experiments' settings were tp=2 and pp=4 on 8x A100 40G.)
In weights2megatron/README.md it says:
"We are experiencing unusually high loss in falcon even when loading from a validated checkpoint. We are working to fix this..."
Is this still an open issue?
I am trying to convert Baichuan-2 Megatron weights to HF. While reading the code, I cannot understand this part:
def permute(x):
    if revert:
        return x.view(head_dim//2, 2, dim).transpose(0, 1).reshape(head_dim, dim)
    return x.view(2, head_dim//2, dim).transpose(0, 1).reshape(head_dim, dim)
Why head_dim//2?
I would really appreciate it if someone could explain this.
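Not an authoritative answer, but a tiny demo (hypothetical head_dim=8, dim=1) shows that the two branches are inverse permutations converting between an interleaved rotary layout and one where the two halves of the head dimension are stored contiguously, the usual Meta-vs-HF rotary layout difference:

import torch

head_dim, dim = 8, 1
rows = torch.arange(head_dim).view(head_dim, dim)

# non-revert branch: interleave the two halves of the head dimension
fwd = rows.view(2, head_dim // 2, dim).transpose(0, 1).reshape(head_dim, dim)
print(fwd.flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]

# revert branch: the inverse, splitting back into two contiguous halves
rev = rows.view(head_dim // 2, 2, dim).transpose(0, 1).reshape(head_dim, dim)
print(rev.flatten().tolist())  # [0, 2, 4, 6, 1, 3, 5, 7]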
Hey there, I read the docs and found the LLaMA fine-tuning scripts. I was wondering if there is a way to pretrain LLaMA and Mistral models from scratch?
Please let me know if it's possible.
Thanks
The current conversion from Megatron weights to the HuggingFace transformers format only supports traditional multi-head attention, which works for Llama/Llama2 below the 70B scale.
We will incorporate GQA (MQA) support for scaling up to Llama2 70B.
Hi,
I am using your code to train a LLaMA 13B model, but I noticed that saving a checkpoint takes quite long: it took 5 minutes for the first save and 15 minutes for the second. For now, I have set torch.distributed.init_process_group(timeout=timedelta(minutes=120)). What could be the reason for this?
RoPE scaling / interpolation as described here should be easy to add:
https://together.ai/blog/llama-2-7b-32k
(and later possibly landmark attention https://github.com/epfml/landmark-attention , can make a separate issue later)
While merging a sharded llama2 7b tp2-pp2 checkpoint, the exception AttributeError: 'TransformerLanguageModel' object has no attribute 'lm_head' is thrown here.
Traceback (most recent call last):
File "/root/koepf/epfl-megatron/tools/checkpoint_util.py", line 152, in <module>
main()
File "/root/koepf/epfl-megatron/tools/checkpoint_util.py", line 145, in main
loader.load_checkpoint(queue, args)
File "/root/koepf/epfl-megatron/tools/checkpoint_loader_megatron.py", line 319, in load_checkpoint
_load_checkpoint(queue, args)
File "/root/koepf/epfl-megatron/tools/checkpoint_loader_megatron.py", line 221, in _load_checkpoint
queue_put("lm_head", {"lm_head": torch.cat([models[tp_rank].language_model.lm_head.data
File "/root/koepf/epfl-megatron/tools/checkpoint_loader_megatron.py", line 221, in <listcomp>
queue_put("lm_head", {"lm_head": torch.cat([models[tp_rank].language_model.lm_head.data
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1630, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'TransformerLanguageModel' object has no attribute 'lm_head'
python tools/checkpoint_util.py --target_tensor_parallel_size 1 --target_pipeline_parallel_size 1 --load_dir /root/koepf/megatron-data/checkpoints/llama2-7b-tp2-pp2-trained/ --save_dir /root/koepf/megatron-data/llama2-7b-out --model_type llama2 --bf16
Nice-to-have feature ideas:
In weights2megatron.py, args.make_vocab_size_divisible_by = 1 is set for llama models, and this setting seems to remain effective through the conversion to hf (here are the args of a 70b). This differs from falcon models, which actually use a padded vocabulary with dummy tokens; the padding then also applies to the export to huggingface, e.g. see OpenAssistant/falcon-40b-megacode2-oasst/blob/main/config.json#L24.
Normally the size is padded to a value divisible by 128 to improve model efficiency. Does this padding not have a beneficial effect for llama2? Was there a good reason not to use the Megatron default value of 128?
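For illustration, a sketch of the Megatron-style padding rule (my paraphrase of the usual _vocab_size_with_padding logic, not copied from this repo) that matches the "padded vocab (size: 32005) ... (new size: 32128)" message printed earlier on this page:

def pad_vocab_size(orig_size, make_divisible_by=128, tp_size=1):
    # Pad until the size is divisible by make_divisible_by * tp_size.
    multiple = make_divisible_by * tp_size
    return ((orig_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(32005))  # 32128, i.e. 123 dummy tokens
print(pad_vocab_size(32005, make_divisible_by=1))  # 32005, no padding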
Hi @AleHD
When I run this script:
python weights2megatron/weights2megatron.py llama --size=7 \
--out=/path/to/megatron/weights/ --cache-dir=/path/to/llama-7b/
It says:
assert version == 2, "Only llama v2 available using huggingface"
Does this script not support Llama v1?
Hello, thanks for this nice library!
I was wondering if it’s possible to load from an intermediate checkpoint (our servers crashed during continuous pre-training)?
We’re running into issues where some command line arguments (e.g. TP and PP) are not loaded correctly from the checkpoint (i.e., the iter_xxxxxx folders).
We're running these arguments:
LOG_ARGS="--log_interval 1 --save_interval 500 --eval_interval 100"
TRAIN_ARGS="--train_iters 50000 --lr_decay_style cosine --lr_warmup_iters 5000 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=INFO
torchrun $DISTRIBUTED_ARGS finetune.py \
--tensor_model_parallel_size 4 \
--pipeline_model_parallel_size 1 \
--load /mpt/Megatron-LLM/sharded_4/weights/ \
--save /mpt/Megatron-LLM/sharded_4/weights/ \
--tensorboard_dir /mpt/Megatron-LLM/checkpoints/weights/tensorboard/ \
--data_path /mpt/Megatron-LLM/tokenized/_text_document \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--vocab_file /mpt/Megatron-LLM/megatron/weights/tokenizer.model \
--bf16 \
--use_flash_attn \
--no_bias_gelu_fusion \
--micro_batch_size 1 \
--global_batch_size 512 \
--seq_length 2048 \
--sequence_parallel \
--recompute_granularity selective \
--use_checkpoint_args \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS
where the number in latest_checkpointed_iteration.txt points to the last checkpoint.
Here is a snippet of the log:
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
(the "Setting ... from checkpoint" block and the special-tokens line are repeated for each rank; the verbatim repetitions are omitted here)
> setting tensorboard ...
time to initialize megatron (seconds): 11.061
[after megatron is initialized] datetime: 2024-01-31 19:21:24
Here tensor_model_parallel_size ends up as 1, which cannot be correct and causes a GPU OOM.
Is there a way to fix this?
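For context, --use_checkpoint_args makes values stored in the checkpoint win over values passed on the command line, which matches the "Setting ... from checkpoint" lines in the log. A minimal sketch of that mechanism (function and variable names are illustrative, not the repo's exact code):

def apply_checkpoint_args(cli_args, checkpoint_args, keys):
    # For every recognised key, the value saved with the checkpoint
    # replaces whatever was given on the command line.
    for key in keys:
        if hasattr(checkpoint_args, key):
            value = getattr(checkpoint_args, key)
            print(f"Setting {key} to {value} from checkpoint")
            setattr(cli_args, key, value)
    return cli_args

So if the checkpoint metadata records tensor_model_parallel_size = 1, the --tensor_model_parallel_size 4 from the command line is silently overwritten, which would explain the OOM; dropping --use_checkpoint_args (or re-sharding the checkpoint to tp=4) should avoid the override.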
Are you planning to add support for Mistral?
Hello, this is my finetune script (when I set --seq_length=4096 it fails with the error below, but with --seq_length=2048 it runs), on 8 x 80GB A800 GPUs:
export CUDA_DEVICE_MAX_CONNECTIONS=1
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 100 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 --ffn_hidden_size 11008
--hidden_dropout 0.0 --position_embedding_type rotary --no_bias_gelu_fusion
--no_bias_dropout_fusion --use_checkpoint_args
--attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5
--layernorm_epsilon 1e-6
--weight_decay 0.1 --sequence_parallel --recompute_granularity selective
--log_timers_to_tensorboard
--rope_scaling_factor 1.0"
torchrun $DISTRIBUTED_ARGS finetune.py \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--load /Megatron-LLM-sharded-weights \
--save /Megatron-LLM-sharded-weights \
--tensorboard_dir /Megatron-LLM-sharded-weights/tensorboard/ \
--data_path /Megatron-LLM/corpus_indexed/china_text_document \
--split 100,0,0 \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--vocab_file=/megatron-llama-2-7b-checkpoint_TP2_PP1_DP4/tokenizer.model \
--make_vocab_size_divisible_by 1 \
--bf16 \
--global_batch_size 1000 \
--micro_batch_size 2 \
--use_checkpoint_args \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS
======================================================================
Error:
Traceback (most recent call last):
File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 261, in
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 139, in pretrain
iteration = _train(args,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 685, in _train
train_step(forward_step_func,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 412, in train_step
losses_reduced = forward_backward_func(
File "/home/dengkaibiao/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/home/dengkaibiao/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 227, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
return self.module(*inputs, **kwargs)
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/module.py", line 186, in forward
outputs = self.module(*inputs, **kwargs)
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
lm_output = self.language_model(
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
encoder_output = self.encoder(
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
hidden_states = layer(
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 757, in forward
attention_output, attention_bias = self.self_attention(layernorm_output,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 510, in forward
context_layer = self._checkpointed_attention_forward(
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 397, in checkpointed_attention_forward
hidden_states = megatron.core.tensor_parallel.checkpoint(
File "/home/dengkaibiao/Megatron-LLM/megatron/core/tensor_parallel/random.py", line 251, in checkpoint
return CheckpointFunction.apply(function,
File "/home/dengkaibiao/Megatron-LLM/megatron/core/tensor_parallel/random.py", line 194, in forward
outputs = run_function(*args)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 393, in custom_forward
output = self.core_attention(query_layer, key_layer,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 231, in forward
attention_probs = self.scale_mask_softmax(attention_scores,
File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 148, in forward
return self.forward_fused_softmax(input, mask)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 183, in forward_fused_softmax
probs = ScaledUpperTriangMaskedSoftmax.apply(input, scale)
File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 22, in forward
softmax_results = scaled_upper_triang_masked_softmax_cuda.forward(
RuntimeError: seq_len <= 2048 INTERNAL ASSERT FAILED at "/home/llm-deploy/apex/csrc/megatron/scaled_upper_triang_masked_softmax_cuda.cu":38, please report a bug to PyTorch.
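For context, the assert comes from apex's fused upper-triangular masked-softmax CUDA kernel, which only supports sequence lengths up to 2048 - that is why --seq_length 2048 runs while 4096 fails. Longer sequences need the unfused path (or the fusion disabled). A minimal sketch of what the unfused causal softmax computes (illustrative, not the repo's exact code):

import torch

def causal_masked_softmax(scores, scale):
    # scores: (..., seq_len, seq_len) attention scores.
    seq_len = scores.size(-1)
    # Build the upper-triangular (future-position) mask that the fused
    # kernel applies implicitly, then do a plain scaled masked softmax.
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device),
        diagonal=1)
    return torch.softmax(
        (scores * scale).masked_fill(mask, float("-inf")), dim=-1)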
Did you maybe also write the counterpart of weights2megatron.py to convert megatron checkpoints back to huggingface format? Or is there a standard tool for this available somewhere else? Would it be useful to create such a script?
I see that the script finetune.py could be used to pretrain LLAMA2. I am looking to do continued pretraining of Llama2 and was wondering what the best way is to load the llama2 weights in the pretrain.py script so that the pretraining does not start from scratch.
This is my finetune script:
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 100 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 --ffn_hidden_size 11008
--hidden_dropout 0.0 --position_embedding_type rotary --no_bias_gelu_fusion
--no_bias_dropout_fusion --use_checkpoint_args
--attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5
--layernorm_epsilon 1e-6
--weight_decay 0.1 --sequence_parallel --recompute_granularity selective
--log_timers_to_tensorboard
--rope_scaling_factor 1.0"
torchrun $DISTRIBUTED_ARGS finetune.py \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--load /Megatron-LLM-sharded-weights \
--save /Megatron-LLM-sharded-weights \
--tensorboard_dir /Megatron-LLM-sharded-weights/tensorboard/ \
--data_path /Megatron-LLM/corpus_indexed/china_text_document \
--split 100,0,0 \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--make_vocab_size_divisible_by 1 \
--bf16 \
--global_batch_size 1000 \
--micro_batch_size 2 \
--use_checkpoint_args \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS
During fine-tuning of Falcon-40b I got NaN losses after ~3300 steps.
I suspect that the code in the optimizer to prevent NaN updates does not work correctly, since the training loop continued to print number of skipped iterations: 0 | number of nan iterations: 0 for the following iterations. I had expected that the number of nan iterations or skipped iterations counters would increment in this situation. In the case I observed, the training unfortunately did not gracefully handle the NaN update.
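For illustration, the skip behavior I expected looks roughly like this (a sketch of my expectation, not the repo's actual optimizer code):

import math

def maybe_step(optimizer, grad_norm, counters):
    # When the gradient norm is NaN/inf, drop the update and count the
    # iteration as skipped instead of applying a corrupted step.
    if math.isnan(grad_norm) or math.isinf(grad_norm):
        counters["nan_iterations"] += 1
        counters["skipped_iterations"] += 1
        optimizer.zero_grad()
        return False
    optimizer.step()
    return True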
Output when NaNs started to appear (metrics of previous iterations looked normal):
iteration 3383/ 8000 | consumed samples: 216512 | elapsed time per iteration (ms): 7613.1 | learning rate: 6.680E-06 | global batch size: 64 | lm loss: 7.334521E-01 | loss scale: 1.0 | grad norm: 0.752 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3384/ 8000 | consumed samples: 216576 | elapsed time per iteration (ms): 7845.5 | learning rate: 6.678E-06 | global batch size: 64 | lm loss: 7.072597E-01 | loss scale: 1.0 | grad norm: 0.741 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3385/ 8000 | consumed samples: 216640 | elapsed time per iteration (ms): 7624.3 | learning rate: 6.676E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3386/ 8000 | consumed samples: 216704 | elapsed time per iteration (ms): 6596.0 | learning rate: 6.674E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3387/ 8000 | consumed samples: 216768 | elapsed time per iteration (ms): 7286.4 | learning rate: 6.673E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3388/ 8000 | consumed samples: 216832 | elapsed time per iteration (ms): 6701.7 | learning rate: 6.671E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3389/ 8000 | consumed samples: 216896 | elapsed time per iteration (ms): 8103.5 | learning rate: 6.669E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3390/ 8000 | consumed samples: 216960 | elapsed time per iteration (ms): 7057.0 | learning rate: 6.668E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3391/ 8000 | consumed samples: 217024 | elapsed time per iteration (ms): 7748.5 | learning rate: 6.666E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3392/ 8000 | consumed samples: 217088 | elapsed time per iteration (ms): 8260.2 | learning rate: 6.664E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3393/ 8000 | consumed samples: 217152 | elapsed time per iteration (ms): 7590.5 | learning rate: 6.662E-06 | global batch size: 64 | loss scale: 1.0 | grad norm: nan | number of skipped iterations: 0 | number of nan iterations: 0 |
Testing Llama2 7B with different settings for --tensor_model_parallel_size and --pipeline_model_parallel_size, I noticed that the elapsed time per iteration increases linearly for both TP=2, PP=1 and TP=1, PP=2, while it stays almost flat for TP=2, PP=2.
(Please ignore the 'lima' in the run names; both runs used the same lima dropout parameters, and I also tried without lima dropout, which made no difference - the dropout method is unrelated to the increase in iteration time.)
This is the log of a TP=2, PP=1 run (notice how "elapsed time per iteration (ms)" increases):
iteration 5/ 2000 | consumed samples: 320 | elapsed time per iteration (ms): 11075.6 | learning rate: 5.000E-07 | global batch size: 64 | lm loss: 1.793634E+00 | loss scale: 1.0 | grad norm: 6.716 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 10/ 2000 | consumed samples: 640 | elapsed time per iteration (ms): 14642.7 | learning rate: 1.000E-06 | global batch size: 64 | lm loss: 2.205030E+00 | loss scale: 1.0 | grad norm: 12.834 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 15/ 2000 | consumed samples: 960 | elapsed time per iteration (ms): 22576.9 | learning rate: 1.500E-06 | global batch size: 64 | lm loss: 1.436895E+00 | loss scale: 1.0 | grad norm: 5.681 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 20/ 2000 | consumed samples: 1280 | elapsed time per iteration (ms): 30626.8 | learning rate: 2.000E-06 | global batch size: 64 | lm loss: 1.524862E+00 | loss scale: 1.0 | grad norm: 5.253 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 25/ 2000 | consumed samples: 1600 | elapsed time per iteration (ms): 38109.3 | learning rate: 2.500E-06 | global batch size: 64 | lm loss: 1.534147E+00 | loss scale: 1.0 | grad norm: 6.936 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 30/ 2000 | consumed samples: 1920 | elapsed time per iteration (ms): 46237.4 | learning rate: 3.000E-06 | global batch size: 64 | lm loss: 1.294282E+00 | loss scale: 1.0 | grad norm: 3.555 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 35/ 2000 | consumed samples: 2240 | elapsed time per iteration (ms): 53443.8 | learning rate: 3.500E-06 | global batch size: 64 | lm loss: 1.346382E+00 | loss scale: 1.0 | grad norm: 3.461 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 40/ 2000 | consumed samples: 2560 | elapsed time per iteration (ms): 60785.3 | learning rate: 4.000E-06 | global batch size: 64 | lm loss: 1.323138E+00 | loss scale: 1.0 | grad norm: 2.622 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 45/ 2000 | consumed samples: 2880 | elapsed time per iteration (ms): 66910.9 | learning rate: 4.500E-06 | global batch size: 64 | lm loss: 1.446500E+00 | loss scale: 1.0 | grad norm: 4.450 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 50/ 2000 | consumed samples: 3200 | elapsed time per iteration (ms): 78089.7 | learning rate: 5.000E-06 | global batch size: 64 | lm loss: 1.180950E+00 | loss scale: 1.0 | grad norm: 2.712 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 55/ 2000 | consumed samples: 3520 | elapsed time per iteration (ms): 88459.2 | learning rate: 5.500E-06 | global batch size: 64 | lm loss: 1.315969E+00 | loss scale: 1.0 | grad norm: 2.561 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 60/ 2000 | consumed samples: 3840 | elapsed time per iteration (ms): 93434.8 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 1.255166E+00 | loss scale: 1.0 | grad norm: 2.143 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 65/ 2000 | consumed samples: 4160 | elapsed time per iteration (ms): 100035.0 | learning rate: 6.500E-06 | global batch size: 64 | lm loss: 1.182611E+00 | loss scale: 1.0 | grad norm: 2.422 | number of skipped iterations: 0 | number of nan iterations: 0 |