
megablocks's Introduction

🤖 MegaBlocks

MegaBlocks is a light-weight library for mixture-of-experts (MoE) training. The core of the system is efficient "dropless-MoE" (dMoE, paper) and standard MoE layers.

MegaBlocks is integrated with Megatron-LM, where we support data, expert and pipeline parallel training of MoEs. Stay tuned for tighter integration with Databricks libraries and tools!

🚀 Performance

[figure: MegaBlocks performance comparison]

MegaBlocks dMoEs outperform MoEs trained with Tutel by up to 40% over Tutel's best-performing capacity_factor configuration. MegaBlocks dMoEs use a reformulation of MoEs in terms of block-sparse operations, which allows us to avoid token dropping without sacrificing hardware efficiency. In addition to being faster, MegaBlocks simplifies MoE training by removing the capacity_factor hyperparameter altogether. Compared to dense Transformers trained with Megatron-LM, MegaBlocks dMoEs can accelerate training by as much as 2.4x. Check out our paper for more details!
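To make "dropless" concrete, here is a conceptual, loop-based sketch of the idea; the real library replaces this per-expert loop with block-sparse (and, optionally, grouped) GEMM kernels, and the function and tensor names here are purely illustrative:

import torch

def dropless_moe_forward(x, expert_weights, expert_indices, w_experts):
    # Conceptual sketch only: every token is processed by the expert it was
    # routed to (top-1 routing here for simplicity), so there is no
    # capacity_factor and no tokens are dropped or padded.
    #   x:              [num_tokens, d_model]
    #   expert_weights: [num_tokens]   router weight of the chosen expert
    #   expert_indices: [num_tokens]   chosen expert id per token
    #   w_experts:      [num_experts, d_model, d_out]
    out = torch.zeros(x.shape[0], w_experts.shape[-1], dtype=x.dtype, device=x.device)
    for e in range(w_experts.shape[0]):
        mask = expert_indices == e
        if mask.any():
            out[mask] = (x[mask] @ w_experts[e]) * expert_weights[mask].unsqueeze(-1)
    return out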

๐Ÿ—๏ธ Installation

NOTE: This assumes you have numpy and torch installed.

Training models with Megatron-LM: We recommend using NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container. The Dockerfile builds on this image with additional dependencies. To build the image, run docker build . -t megablocks-dev and then bash docker.sh to launch the container. Once inside the container, install MegaBlocks with pip install .. See Usage for instructions on training MoEs with MegaBlocks + Megatron-LM.

Using MegaBlocks in other packages: To install the MegaBlocks package for use in other frameworks, run pip install megablocks. For example, Mixtral-8x7B can be run with vLLM + MegaBlocks with this installation method.

Extras: MegaBlocks has optional dependencies that enable additional features.

Installing megablocks[gg] enables dMoE computation with grouped GEMM. This feature is enabled by setting the mlp_impl argument to grouped. This is currently our recommended path for Hopper-generation GPUs.

MegaBlocks can be installed with all dependencies via the megablocks[all] package.
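As a hedged illustration only (not the library's documented configuration surface), constructing a dMoE layer with the grouped path might look like the following; the Arguments fields are illustrative values, not a recommended configuration:

import torch
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

# Sketch: select the grouped GEMM expert computation (requires megablocks[gg]).
args = Arguments(
    hidden_size=4096,
    ffn_hidden_size=14336,
    moe_num_experts=8,
    moe_top_k=2,
    mlp_impl="grouped",   # "sparse" selects the block-sparse kernels instead.
    bias=False,
    return_bias=False,
    bf16=True,
    device=torch.device("cuda"),
)
layer = dMoE(args)

x = torch.randn(1, 512, 4096, dtype=torch.bfloat16, device="cuda")
out = layer(x)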

🚂 Usage

We provide scripts for pre-training Transformer MoE and dMoE language models under the top-level directory. The quickest way to get started is to use one of the experiment launch scripts. These scripts require a dataset in Megatron-LM's format, which can be created by following their instructions.

✍️ Citation

@article{megablocks,
  title={{MegaBlocks: Efficient Sparse Training with Mixture-of-Experts}},
  author={Trevor Gale and Deepak Narayanan and Cliff Young and Matei Zaharia},
  journal={Proceedings of Machine Learning and Systems},
  volume={5},
  year={2023}
}

megablocks's People

Contributors

152334h, b-chu, bcui19, dakinggg, dblalock, deepakn94, eltociear, eracah, fastconvnets, j316chuck, michael-go, mvpatel2000, sashadoubov, sedrick-keh-tri, simon-mo, snarayan21, tgale96, vchiley


megablocks's Issues

Gradient scale size for expert gradient

Hi @tgale96, a naive question about the gradient scale for expert weights: in the current implementation, the MoE weight gradients are scaled by 1/expert_parallel_world_size (src). My questions are:
(1) How should this behavior be understood intuitively?
(2) If expert parallelism (sharding on the num_expert dim) and tensor parallelism (sharding on the ffn_hidden_size dim) are both enabled, then each expert gradient is actually scaled by 1/(ep_size*dp_size), which seems a bit strange compared to the traditional gradient scaling for DDP.
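For readers unfamiliar with the pattern under discussion, a minimal sketch of this kind of gradient-scaling wrapper is below (illustrative only; the class and helper names are not MegaBlocks APIs):

import torch

class ScaleGradient(torch.autograd.Function):
    # Identity in the forward pass; scales the gradient in the backward pass.

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None

def scale_expert_grad(w, expert_parallel_world_size):
    # Hypothetical helper: scale expert-weight gradients by 1 / ep_world_size.
    return ScaleGradient.apply(w, 1.0 / expert_parallel_world_size)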

Routing

Is the router implementing the noisy top-k routing suggested by the "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" paper?

In the router code you seem to apply the noise to the input of the router and not to the router scores as in the paper above:

def forward(self, x):
    if self.training and self.args.moe_jitter_eps is not None:
        x = x * self.jitter(x)

    scores = self.layer(x.view(-1, x.shape[-1])).softmax(dim=-1)
    expert_weights, expert_indices = self._top_k(scores)
    if self.args.moe_normalize_expert_weights:
        expert_weights = expert_weights / torch.norm(
            expert_weights,
            p=self.args.moe_normalize_expert_weights,
            dim=-1,
            keepdim=True)

    expert_indices = (
        _uniform_expert_assignment(expert_indices, self.args.moe_num_experts)
        if self.args.uniform_expert_assignment else expert_indices
    )
    return scores, expert_weights, expert_indices

In the aforementioned paper, the noisy top-k gating works like this:
[figure: the noisy top-k gating equations from the paper, where noise scaled by Softplus(x · W_noise) is added to the router logits x · W_g before the top-k and softmax]

Is this something equivalent? I am not trying to argue that it is wrong; I was just trying to figure out whether the two are the same.
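For comparison, a minimal sketch of the paper's noisy top-k gating, where the noise is applied to the router logits rather than to the router input (the function and weight names are illustrative, not MegaBlocks code):

import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k, training=True):
    # x: [num_tokens, d_model]; w_gate, w_noise: [d_model, num_experts].
    clean_logits = x @ w_gate
    if training:
        # Per-token, per-expert noise scaled by a learned softplus term.
        noise_stddev = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev
    else:
        logits = clean_logits
    # Softmax over only the top-k logits, equivalent to setting the other
    # logits to -inf before the softmax as in the paper.
    top_k_logits, top_k_indices = logits.topk(k, dim=-1)
    top_k_weights = top_k_logits.softmax(dim=-1)
    return top_k_weights, top_k_indices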

How to integrate to transformers-based mixtral

Hi, this is awesome work. I'm wondering whether there is a minimal way to integrate MegaBlocks into the transformers codebase for the Mixtral architecture.

Would simply replacing the MixtralSparseMoeBlock with dmoe.dMoE, with the proper configuration, work? (See the sketch at the end of this question.)

# from transformers 

class MixtralDecoderLayer(nn.Module):
    def __init__(self, config: MixtralConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = MIXTRAL_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        self.block_sparse_moe = MixtralSparseMoeBlock(config)
        ....

Thanks!
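For reference, a minimal sketch of the kind of swap being asked about; the Arguments fields follow megablocks.layers.arguments as used elsewhere on this page, the MixtralConfig field names come from transformers, and the mapping is illustrative rather than a confirmed recipe:

import torch
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

def build_dmoe_from_mixtral_config(config):
    # Hypothetical mapping from a transformers MixtralConfig to megablocks Arguments.
    args = Arguments(
        hidden_size=config.hidden_size,
        ffn_hidden_size=config.intermediate_size,
        moe_num_experts=config.num_local_experts,
        moe_top_k=config.num_experts_per_tok,
        mlp_type="glu",                          # Mixtral uses a SwiGLU-style expert MLP.
        activation_fn=torch.nn.functional.silu,
        bias=False,
        return_bias=False,
        bf16=True,
        device=torch.device("cuda"),
    )
    return dMoE(args)

# e.g. inside MixtralDecoderLayer.__init__:
#   self.block_sparse_moe = build_dmoe_from_mixtral_config(config)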

Seeking a good multi-node training config

Thanks for the excellent work. Following the comment in #59, I am trying to train dmoe_760m on 16 GPUs (2 nodes) by changing the distributed arguments for a two-node setup, but it is very slow in terms of elapsed time per iteration (ms). Can you suggest an optimal training configuration for multi-node training? A full-fledged multi-node training script would be very helpful.

@tgale96

#!/bin/bash
export PYTHONPATH="/dataset/g_ckpt/gaoyuanz/megablocks-public/third_party/Granite-Megatron-LM:${PYTHONPATH}"

export NCCL_SOCKET_IFNAME="ib,bond"
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_IB_PCI_RELAXED_ORDERING=1
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_NTHREADS=2
export NCCL_NSOCKS_PERTHREAD=4
export CUDA_DEVICE_MAX_CONNECTIONS=1
export TOKENIZERS_PARALLELISM=false

MASTER_ADDR=$(echo ${LSB_MCPU_HOSTS} | tr ' ' '\n' | head -n 1)
MASTER_PORT=5${LSB_JOBID: -5:-1}
NNODES=$(echo ${LSB_MCPU_HOSTS} | tr ' ' '\n' | sed 'n; d' | wc -w)
GPUS_PER_NODE=$(echo $CUDA_VISIBLE_DEVICES | tr ',' '\n' | wc -w)
NODE_RANK=$(($(echo ${LSB_MCPU_HOSTS} | tr ' ' '\n' | sed 'n; d' | grep -n -m1 $HOSTNAME | cut -d':' -f1)-1))

EXPERIMENT_NAME="g-moe-1x4"

EXP_DIR="g-dmoe"

# scaling law: 16B tokens @ 760M = 32k steps.
#
# 512 * 1k * 400k = 200b tokens.
# 512 * 1k * 200k = 100b tokens.
# 512 * 1k * 100k = 50b tokens (default).
# 512 * 1k * 20k = 10b tokens.
TRAINING_STEPS=20000
if [ -n "${2}" ]; then
    TRAINING_STEPS=$2;
fi

##
### Pre-training for GPT2 762M parameter.
##

# MoE hyperparameters.
MOE_ARGUMENTS="\
--moe-num-experts=64 \
--moe-loss-weight=0.1 \
--moe-top-k=1"

# Distributed hyperparameters.
DISTRIBUTED_ARGUMENTS="\
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT"

# Model hyperparameters.
MODEL_ARGUMENTS="\
--num-layers 24 \
--hidden-size 1536 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--activation-function gelu \
--attention-head-type multihead \
--normalization-function layernorm"

# Training hyperparameters.
TRAINING_ARGUMENTS="\
--micro-batch-size 4 \
--global-batch-size 2048 \
--train-iters ${TRAINING_STEPS} \
--lr-decay-iters ${TRAINING_STEPS} \
--lr 0.0004 \
--min-lr 0.00004 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--init-method-std 0.01"

PILE_DATASET="\
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk1 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk2 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk3 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk4 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk5 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk6 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk7 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk8 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk9 \
1.0 \
/dataset/bluepile/slim_pajama_gptneox_megatron/train/chunk10"

# NOTE: We don't train for enough tokens for the
# split to matter.
DATA_ARGUMENTS="\
--data-path ${PILE_DATASET} \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-path /dataset/g_ckpt/cobol_exp/Granite-Megatron-LM/tokenizers/gpt-neox-20b \
--make-vocab-size-divisible-by 1024 \
--split 969,30,1"

COMPUTE_ARGUMENTS="\
--fp16 \
--DDP-impl local \
--moe-expert-model-parallelism \
--no-async-tensor-model-parallel-allreduce \
--use-flash-attn"

CHECKPOINT_ARGUMENTS="\
--save-interval 2000 \
--save ./${EXP_DIR}"

EVALUATION_ARGUMENTS="\
--eval-iters 100 \
--log-interval 1 \
--eval-interval 1000"

python -m torch.distributed.launch ${DISTRIBUTED_ARGUMENTS} \
       pretrain_gpt.py \
       ${MOE_ARGUMENTS} \
       ${MODEL_ARGUMENTS} \
       ${TRAINING_ARGUMENTS} \
       ${DATA_ARGUMENTS} \
       ${COMPUTE_ARGUMENTS} \
       ${CHECKPOINT_ARGUMENTS} \
       --fix-infiniband \
       ${EVALUATION_ARGUMENTS} |& tee ./${EXP_DIR}/train-20k.log

Does megablocks support true expert parallelism?

Hello dear authors,

Thank you for contributing such a great project. It has been very helpful for my research on MoE training.

There is something I would like to double-check with you. Although there are two arguments, moe_expert_model_parallelism and moe_weight_parallelism, during dMoE initialization, their meanings seem somewhat confusing. According to my understanding:

moe_expert_model_parallelism distributes experts across different GPUs. However, if the number of GPUs is larger than the number of experts, an additional #GPU/#Expert-way tensor parallelism is applied.
moe_weight_parallelism is actually ZeRO-style data parallelism, where model parameters need to be synchronized during each training forward and backward pass.

Can moe_weight_parallelism and moe_expert_model_parallelism be used together?
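A tiny sketch of the sharding arithmetic described in the question above, just to make the two regimes explicit (illustrative only, not how the library is organized internally):

def expert_sharding(num_experts: int, ep_world_size: int) -> dict:
    # Fewer (or equal) ranks than experts: each rank owns whole experts.
    if ep_world_size <= num_experts:
        return {"experts_per_rank": num_experts // ep_world_size,
                "ffn_shards_per_expert": 1}
    # More ranks than experts: each expert's FFN weights are additionally
    # split across ep_world_size // num_experts ranks (tensor parallelism).
    return {"experts_per_rank": 1,
            "ffn_shards_per_expert": ep_world_size // num_experts}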

1-expert worse than dense model

I'm finding that training a 1-expert dMoE (brown) gives worse training loss than an otherwise equivalent dense model (green). Is there some reason why this difference is expected, or can I expect them to be the same? Thanks!

[screenshot taken 2024-05-08: training-loss curves for the 1-expert dMoE vs. the dense model]

Has anyone encountered this CUDA error?

File "/home/workspace/megablocks/megatron/training.py", line 455, in train_step
losses_reduced = forward_backward_func(
File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 331, in forward_backward_no_pipelining
backward_step(grad_scaler, input_tensor, output_tensor,
File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 257, in backward_step
custom_backward(output_tensor[0], output_tensor_grad[0])
File "/home/workspace/megablocks/megatron/core/pipeline_parallel/schedules.py", line 154, in custom_backward
Variable._execution_engine.run_backward(
File "/anaconda3/envs/moe/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/anaconda3/envs/moe/lib/python3.10/site-packages/stk/backend/autocast.py", line 36, in decorate_bwd
return bwd(*args, **kwargs)
File "/anaconda3/envs/moe/lib/python3.10/site-packages/megablocks/ops/padded_scatter.py", line 40, in backward
dgrad = kernels.padded_gather(
File "/anaconda3/envs/moe/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather
output_rows = padded_bins[-1].cpu().item()
RuntimeError: CUDA error: an illegal memory access was encountered

I tried to run dMoE on 8x8 A100 GPUs and this error occurred frequently.

Inference code

Hi. Is there any inference code available for dMoE models? I couldn't find any here. Thanks.

Error from pip about missing torch module

I am trying to install megablocks for use with vLLM. I have a venv set up for vLLM and have it installed and working fine with non-Mixtral models.

I am using python 3.11.6 installed with homebrew.

When I try to install megablocks I see the following output:

pip install megablocks
...
ModuleNotFoundError: No module named 'torch'
...
 Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

When I look at the packages installed I see that torch==2.1.2 is in fact installed:

λ pip freeze | grep torch
torch==2.1.2

What am I doing wrong?

Bad throughput with GLU

I'm training models with the specs below but am seeing a major throughput drop when switching to GLU. Do you know why, or have ideas about what I could investigate? Thanks a lot! cc @mvpatel2000 @tgale96

active params: 1,011,613,696 (for glu: 1,280,049,152)
total params: 4,769,710,080 (for glu: 6,917,193,728)
8 H100s, 1 node
FSDP SHARD_GRAD_OP
mlp_impl=grouped
n_experts=8
k=1
micro_bs=1
global_bs=512
no megablocks expert/weight parallelism

With mlp_type=mlp & activation_fn=gelu I get 17000 tokens per second per device.

With mlp_type=glu & activation_fn=silu I get 1000 tokens per second per device.

A small drop is expected since GLU adds slightly more parameters, but probably not one this large? Switching away from grouped or trying the memory-optimized MLP did not help. 🤔

[integrating megablocks with open_lm] Question about megablocks + FSDP

Hello! I'm trying to integrate megablocks with our open source LLM training library (open_lm), which uses native torch FSDP.

For some reason I am consistently seeing worse performance than my dense baselines at the same compute budget. I'm a bit stumped as to why, and I was wondering if you could provide any pointers on things to watch out for wrt integrations.

To integrate your library:

Am I missing anything else?

One hypothesis I have is that something is going wrong when I use Megablocks with FSDP.

Here is our FSDP wrapper:

https://github.com/mlfoundations/open_lm/blob/main/open_lm/main.py#L454-L462

Is there anything I need to change in the FSDP arguments to make sure FSDP doesn't interfere with the all-to-alls? Currently it wraps the Transformer block module, of which the MoE is a part.

Thanks for your help!

SFT Script and Hyperparameters used for DBRX-Instruct

Hi, I saw you mentioned that you used your fork of Megatron-LM for training. Could you please provide the scripts and hyperparameters used for the SFT of DBRX? It would mean the world to the OSS community!

At OpenChat, we'd like to fine-tune your model on our data and open-source it.

Add a fine-tune script for JetMoE

@tgale96

The JetMoE technical report mentions how they used MegaBlocks with Megatron to train the model.

The author then shared this fork of the megablocks repo used during training.

Could you please let us know how we can proceed with a fine-tuning script?

Current installation instructions don't quite work

docker build . -t megablocks-dev followed by bash docker.sh gives a docker container with the wrong version of stanford-stk. We also still need to run python setup.py install from /mount/megablocks to install megablocks_ops.

Script for Full Fine-Tuning of Mixtral

Hi, I see that there is a script for training Mixtral, but not one for fine-tuning. Could you please provide one? The whole community is having a lot of trouble getting correct full fine-tuning to work, including both our team at OpenChat and the teams at Nous Research, Axolotl, and more. This would be incredibly helpful.

different load_balancing_loss with different pipeline_parallel_size

I load the same model trained with Megatron + MegaBlocks, and I found that the load_balancing_loss is slightly different. When I increase the pipeline_parallel_size, the load_balancing_loss also increases. Is this just a precision issue, or is there a potential bug?

For example, when I train a 500M GPT model with 64 experts, the load-balancing loss (lbl) for each pp_size is listed in the table below.

pp_size  lbl
1        1.005E-01
2        1.007E-01
4        1.013E-01

AMP + BF16 failing

Hi there,

Great work with dMoE! I'm trying to test dMoE with regular DDP + PyTorch AMP (bf16) and I get the following error:

    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/miniconda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 248, in _unscale_grads_
    torch._amp_foreach_non_finite_check_and_unscale_(

I'm just wrapping your existing dmoe.dMoE(args) logic.

Is this something that is currently unsupported? If I force the entire network to BF16 then everything works fine.
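As a hedged aside, bf16 autocast does not require torch.cuda.amp.GradScaler (the scaler exists to counteract fp16 gradient underflow), so one thing to try is dropping the scaler entirely. A minimal, generic sketch, with a plain Linear standing in for the dMoE-based network:

import torch
from torch import nn

model = nn.Linear(1024, 1024, device="cuda")          # stand-in for a dMoE-based network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()             # dummy loss for illustration
loss.backward()                                       # no GradScaler.scale(loss) needed
opt.step()
opt.zero_grad()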

Docker issues with PyPI installation

When I run pip install megablocks, I seem to be getting this error:
RuntimeError: ('Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx', "This image doesn't seem to support current EC2 instance type, please check release notes for supported EC2 instance type")

However, it works fine when I build from source (i.e. pip install git+https://github.com/stanford-futuredata/megablocks.git@main)

RuntimeError: Triton Error [CUDA]: invalid argument

I followed these steps:

  • run docker build . -t megablocks-dev
  • then bash docker.sh to launch the container.

When I ran moe_46m_8gpu.sh to test, it reported the following error:

RuntimeError: Triton Error [CUDA]: invalid argument

My environment:
[screenshot: environment details]

How can I solve this problem?

Implement Mixture of Depth and Experts (MoDE)

Given that MegaBlocks is highly optimized for sparse MoE models like Mixtral, I am requesting support for a variant recently termed MoDE by Google DeepMind. Benefits include much faster training and inference due to increased sparsity.

Paper: https://arxiv.org/abs/2404.02258

I found two implementations:

save loading_balancing_loss properly

Hi, in the forward function of ParallelMLP, should we save the load_balancing_loss directly, or a tuple of tokens_per_expert and scores? In other words, should line 428, save_load_balancing_loss((tokens_per_expert, scores)), be replaced by save_load_balancing_loss(self.load_balancing_loss(tokens_per_expert, scores))?

Efficiency of torch mlp

I've seen a torch MLP branch of megablocks without sparse matrix multiplication here. I'm curious whether it is as efficient as the sparse version.

Also, are there performance comparisons between grouped GEMM and sparse GEMM?

ParallelDroplessMLP initialises self.mlp twice

What the title says. In layers/dmoe.py:

class ParallelDroplessMLP(moe.ParallelMLP):

    def __init__(self, args : Arguments):
        super(ParallelDroplessMLP, self).__init__(args) # <-- first init!
        self.hidden_size = args.hidden_size
        self.ffn_hidden_size = mpu.features_per_rank(args)
        self.blocking = 128
        self.mlp = dmlp_registry.get(args) # <-- second init!

As a subclass of moe.ParallelMLP, ParallelDroplessMLP first initialises self.mlp in super().__init__() (at layers/moe.py):

class ParallelMLP(torch.nn.Module):

    def __init__(self, args : Arguments):
        # ... omitted ...

        # Expert MLP.
        self.mlp = mlp.MLP(args)

This causes extra initialisation time and memory usage, as the weights created in this first init are immediately overwritten by the new weights created via self.mlp = dmlp_registry.get(args).

Apologies in advance if this double-init process is actually crucially important to the mechanics of the library; I personally did not observe anything breaking after commenting out the first initialisation.

OSError: Stale file handle with dMoE

I am getting the error below on the first step of multi-node training with dMoE. Meanwhile, multi-node MoE training and single-node dMoE training work fine. Any ideas what the problem might be? Thanks!

File "/env/lib/conda/llmoe/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/stk/backend/autocast.py", line 27, in decorate_fwd
    return fwd(*_cast(args, dtype), **_cast(kwargs, dtype))
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/megablocks/ops/scatter.py", line 28, in forward
    return kernels.scatter(
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 208, in scatter
    return padded_scatter(x, indices, bin_ids, weights, bins, bins, top_k)
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 190, in padded_scatter
    _padded_copy[(indices.shape[0],)](
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 143, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 143, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 122, in _bench
    return do_bench(kernel_call, warmup=self.warmup, rep=self.rep, quantiles=(0.5, 0.2, 0.8))
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/testing.py", line 102, in do_bench
    fn()
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 110, in kernel_call
    self.fn.run(
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/runtime/jit.py", line 532, in run
    self.cache[device][key] = compile(
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/compiler/compiler.py", line 503, in compile
    metadata_group = fn_cache_manager.get_group(metadata_filename) or {}
File "/env/lib/conda/llmoe/lib/python3.10/site-packages/triton/runtime/cache.py", line 90, in get_group
    grp_data = json.load(f)
File "/env/lib/conda/llmoe/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
OSError: [Errno 116] Stale file handle

Question on offsets in figures 5

[figure: Figure 5 from the paper]

Hi!
I appreciate the MegaBlocks team for this amazing work.
I was reading the paper and got stuck on what the (row, column) offsets in Figure 5 mean.
Could you elaborate, please?

Thanks!

multi-node problem

I have been using this MoE project and have achieved excellent results when running it on a single machine with 8 GPUs. However, I have noticed a significant drop in performance when using the same parameters on a multi-node setup (64 GPUs, 16 experts, tensor-parallel 4, same global batch size). Additionally, I have encountered NCCL errors during the process.

I was wondering if you could provide some insights or optimizations for improving the performance of the MoE model in multi-node scenarios. Thanks!

Unsharding scripts for megablocks models

The base Megatron-LM repo provides unsharding scripts that can be used on models after training.
I didn't find any such scripts in this repo.
Would it be possible to provide the same?

support amd/rocm

When I run pip install megablocks on an AMD/ROCm system, I get this:

      clang: error: unsupported option '--ptxas-options=-v'
      clang: error: unsupported option '--generate-code=arch=compute_90,code=sm_90'

_LOAD_BALANCING_LOSS returns empty list sometimes

Hello, I am using EleutherAI's gpt-neox implementation with MegaBlocks, but I get two errors related to _LOAD_BALANCING_LOSS.

  1. tokens_per_expert gives me this error at this line: ValueError: Expected 14 token_per_experts but found 7. Here's the stack trace:
  File "/home/etnguyen/test/savanna/train.py", line 10, in <module>                                                                                                                             [53/1963]
    pretrain(global_config=global_config)                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 228, in pretrain                                                                                                                          
    iteration = train(                                                                                                                                                                                   
                ^^^^^^                                                                                                                                                                                   
  File "/home/etnguyen/test/savanna/savanna/training.py", line 1004, in train                                                                                                                            
    loss_dict, skipped_iter = train_step(                                                                                                                                                                
                              ^^^^^^^^^^^                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 919, in train_step                                                                                                                        
    loss = forward_step(                                                                                                                                                                                 
           ^^^^^^^^^^^^^                                                                                                                                                                                 
  File "/home/etnguyen/test/savanna/savanna/training.py", line 515, in forward_step                                                                                                                      
    moe_loss = mb_moe_loss_func(global_config, loss_mask, outputs)[0]                                                                                                                                    
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                       
  File "/home/etnguyen/test/savanna/savanna/training.py", line 464, in mb_moe_loss_func                                                                                                                  
    lbl = moe.batched_load_balancing_loss(megablocks_args)                                                                                                                                               
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
  File "/home/etnguyen/.local/lib/python3.11/site-packages/megablocks/layers/moe.py", line 43, in batched_load_balancing_loss                                                                            
    raise ValueError(                                                                                                                                                                                    
ValueError: Expected 14 token_per_experts but found 7.                                                                                                                                                   
num_layers = 14                                                                                                                                                                                          
pipeline_model_parallel_size = 1                                                                                                                                                                         
num_layers_per_virtual_pipeline_stage = None  

I get this error when expert_interval=2, i.e. when the default value is used, so the number of MoE layers is half the number of transformer layers (14 layers, 7 MegaBlocks layers). The error goes away when I set expert_interval=1 so that there are 14 MegaBlocks layers for 14 layers. But I don't know the root cause of this discrepancy, especially if I want to change expert_interval and place MegaBlocks layers only every other layer.

  2. The second issue: suppose I use expert_interval=1 to get around the issue above, so every layer uses a MegaBlocks layer. The next error I get is that get_load_balancing_loss occasionally returns an empty list, which then errors out, meaning _LOAD_BALANCING_LOSS is an empty list. Critically, this happens partway through training, about 30 seconds in, so for some batches it is fine and returns the expected losses.

Does this sound familiar to anybody? I'd very much appreciate any insights, thank you!

selective router precision

To my understanding -- and please correct me if I am wrong about this -- there is no mechanism to selectively compute routing logits in fp32, as is suggested in e.g. Switch Transformers. Basis:

  1. The only mention of fp32/float computation I see anywhere is for moe_lbl_in_fp32.
  2. The router is initialized with the same dtype as the MLP weights (as configured by Arguments).
  3. There does not seem to be any explicit casting or autocast deactivation in router.py, nor any attempt to do so in dMoE.
  4. Given that the router is implemented as a torch.nn.Linear, and the input to the router is pre-cast to autocast's precision, I can only presume that the computation is done in half precision under normal AMP training.

Is this correct? If so, have you observed any instabilities in practice during training? Perhaps it is just not necessary...
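For concreteness, one way such selective fp32 routing could be done under AMP, sketched against a plain nn.Linear router (illustrative; route_in_fp32 is not a MegaBlocks function):

import torch
import torch.nn.functional as F

def route_in_fp32(router_linear: torch.nn.Linear, x: torch.Tensor) -> torch.Tensor:
    # Disable autocast and upcast both the input and the router weight so the
    # logits and softmax are computed in full precision, as suggested in the
    # Switch Transformers paper. Gradients still flow to the original weight
    # through the cast.
    with torch.autocast(device_type=x.device.type, enabled=False):
        logits = F.linear(x.view(-1, x.shape[-1]).float(), router_linear.weight.float())
        return logits.softmax(dim=-1)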

Cloning input `x` in `megablocks.layers.glu.SparseGLU` leads to different SDD outputs

I am debugging a data-parallel forward mismatch when using megablocks (DP and non-DP give different forward results). While trying to reproduce the difference minimally, I found that in SparseGLU.forward(), if you save x and w1 (by monkey-patching) right before the line

x1 = stk.ops.sdd(x, w1.t(), topo)

and then re-run that line on the saved tensors, the output changes if you simply .clone() x first: x1_clone = stk.ops.sdd(x.clone(), w1.t(), topo) gives a wildly different output.
Below is a minimal reproduction:

import io
import os
import sys
import types

import stk
import torch
from megablocks.layers.activation_fn import act_fn
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE
from megablocks.layers.mlp import resolve_dtensor
from torch import distributed as dist


def nonempty(t: torch.Tensor, show_features: int = 8) -> torch.Tensor:
    """Treat the last dim as features, gather all non-empty features to a 2D tensor.

    Args:
        t: Tensor of shape (..., feature_count).
        show_features: The number of features to show. If None, show all.

    Returns:
        2D tensor of shape (nonempty_count, show_features).
    """
    if not isinstance(t, torch.Tensor):
        t = t.data
    t = t.reshape(-1, t.shape[-1])  # Reshape to (-1, d)
    t = t[(t != 0).any(dim=1)]  # Remove all rows that are all 0
    return t[..., :show_features]


def are_matrices_equal(a: stk.Matrix, b: stk.Matrix) -> bool:
    """Check if two matrices are equal."""
    return (
        (a.row_indices == b.row_indices).all()
        and (a.data == b.data).all()
        and (a.column_indices == b.column_indices).all()
        and (a.offsets == b.offsets).all()
        and (a.column_indices_t == b.column_indices_t).all()
        and (a.offsets_t == b.offsets_t).all()
        and (a.block_offsets_t == b.block_offsets_t).all()
    )


def glu_forward(self, x, topo):
    self.act_dict = {}
    if self.args.memory_optimized_mlp:
        raise NotImplementedError(
            "Memory optimized implementation not yet supported with GLU with sparse kernels."
        )

    w1, v1, w2 = (
        self.scale_grad(self.w1),
        self.scale_grad(self.v1),
        self.scale_grad(self.w2),
    )
    w1, v1, w2 = resolve_dtensor(w1), resolve_dtensor(v1), resolve_dtensor(w2)

    # Compute the GLU.
    self.act_dict["x"] = x
    self.act_dict["w1_resolved"] = w1
    x1 = stk.ops.sdd(x, w1.t(), topo)
    x2 = stk.ops.sdd(x, v1.t(), topo)

    activation_fn_out = act_fn(x1, self.args.activation_fn)
    x1 = stk.ops.mul(activation_fn_out, x2)

    output = stk.ops.dsd(x1, w2)
    return output


def try_sdd() -> tuple[bool, str]:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12362"
    if not dist.is_initialized():
        dist.init_process_group(backend="gloo", rank=0, world_size=1)

    dim = 8

    megablocks_args = Arguments(
        hidden_size=dim,
        ffn_hidden_size=128,
        bias=False,
        return_bias=False,
        activation_fn=torch.nn.functional.silu,
        moe_num_experts=2,
        moe_top_k=1,
        moe_loss_weight=0.05,
        moe_normalize_expert_weights=1.0,
        moe_jitter_eps=0.0,
        mlp_type="glu",
        mlp_impl="sparse",
        moe_expert_model_parallelism=False,
        expert_parallel_group=None,
        fp16=False,
        bf16=True,
        device=torch.device("cuda"),
    )
    dmoe_ = dMoE(megablocks_args)
    dmoe_.experts.mlp.forward = types.MethodType(glu_forward, dmoe_.experts.mlp)

    input_ = torch.randn([1, 2, dim], dtype=torch.bfloat16, device=torch.device("cuda"))
    dmoe_(input_)

    topo = stk.Matrix(
        (256, 256),
        torch.empty((2, 128, 128), device="meta", dtype=torch.bfloat16),
        torch.tensor([0, 1], device="cuda:0", dtype=torch.int16),
        torch.tensor([0, 1], device="cuda:0", dtype=torch.int16),
        torch.tensor([0, 1, 2], device="cuda:0", dtype=torch.int32),
        torch.tensor([0, 1], device="cuda:0", dtype=torch.int16),
        torch.tensor([0, 1, 2], device="cuda:0", dtype=torch.int32),
        torch.tensor([0, 1], device="cuda:0", dtype=torch.int32),
    )

    x = dmoe_.experts.mlp.act_dict["x"]
    w = dmoe_.experts.mlp.act_dict["w1_resolved"]
    x1 = stk.ops.sdd(x, w.t(), topo)

    x_clone = x.clone()
    x1_clone = stk.ops.sdd(x_clone, w.t(), topo)
    equal = are_matrices_equal(x1_clone, x1)

    with io.StringIO() as output:
        sys.stdout = output

        print("Input X is the same:", (x_clone == x).all())
        print("SDD output is the same:", equal)
        print()

        print("Breakdown of SDD output:")
        print("Shape:", x1_clone.shape == x1.shape)
        print("Data:", x1_clone.data.allclose(x1.data))
        print("Row indices:", x1_clone.row_indices == x1.row_indices)
        print("Column indices:", x1_clone.column_indices == x1.column_indices)
        print("Offsets:", x1_clone.offsets == x1.offsets)
        print("Block offsets:", x1_clone.block_offsets_t == x1.block_offsets_t)
        print()

        print("Breakdown of SDD output data:")
        print("Nonempty elements:", nonempty(x1).shape, nonempty(x1_clone).shape)
        print("Per-row mean:", nonempty(x1).mean(dim=1), nonempty(x1_clone).mean(dim=1))
        print(
            "Cross-equality:",
            (nonempty(x1)[None, :, :] - nonempty(x1_clone)[:, None, :]).sum(-1),
        )

        sys.stdout = sys.__stdout__
        output_str = output.getvalue()

    return equal, output_str


def main() -> None:
    repetition_count = 10
    for trial_id in range(repetition_count):
        equal, output_str = try_sdd()
        if not equal:
            print(f"Trial {trial_id} failed.")
            print(output_str)
            return
    print(f"All {repetition_count} repetitions passed.")


if __name__ == "__main__":
    main()

My relevant environment info:

$ pipdeptree -p megablocks
megablocks==0.5.1
├── stanford-stk [required: >=0.0.6, installed: 0.7.0]
│   └── triton [required: >=2.1.0, installed: 2.2.0]
│       └── filelock [required: Any, installed: 3.13.4]
└── triton [required: >=2.1.0, installed: 2.2.0]
    └── filelock [required: Any, installed: 3.13.4]

$ pipdeptree -p torch
torch==2.1.0+cu121py311stripe
├── filelock [required: Any, installed: 3.13.4]
├── fsspec [required: Any, installed: 2023.10.0]
├── Jinja2 [required: Any, installed: 3.1.3]
│   └── MarkupSafe [required: >=2.0, installed: 2.1.5]
├── networkx [required: Any, installed: 3.3]
├── sympy [required: Any, installed: 1.12]
│   └── mpmath [required: >=0.19, installed: 1.3.0]
└── typing_extensions [required: Any, installed: 4.11.0]

$ python -V
Python 3.11.7

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

Installation fails due to missing mosaicml-turbo

I can't install megablocks because the build fails due to the package mosaicml-turbo not existing.

Processing /root/megablocks
  Preparing metadata (setup.py) ... done
Collecting stanford-stk@ git+https://github.com/stanford-futuredata/stk.git@main (from megablocks==0.5.0)
  Cloning https://github.com/stanford-futuredata/stk.git (to revision main) to /tmp/pip-install-lzaojkce/stanford-stk_9ac01a09d9c244e98964baf2a48d62be
  Running command git clone --filter=blob:none --quiet https://github.com/stanford-futuredata/stk.git /tmp/pip-install-lzaojkce/stanford-stk_9ac01a09d9c244e98964baf2a48d62be
  Resolved https://github.com/stanford-futuredata/stk.git to commit b6deb502a6bf14ab4e0c96d78fa36aa935dbbe36
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting grouped_gemm@ git+https://github.com/tgale96/grouped_gemm@main (from megablocks==0.5.0)
  Cloning https://github.com/tgale96/grouped_gemm (to revision main) to /tmp/pip-install-lzaojkce/grouped-gemm_3eebb3adf7bd445abfb55e69b6a50479
  Running command git clone --filter=blob:none --quiet https://github.com/tgale96/grouped_gemm /tmp/pip-install-lzaojkce/grouped-gemm_3eebb3adf7bd445abfb55e69b6a50479
  Resolved https://github.com/tgale96/grouped_gemm to commit 26b67147c96de3ab757055810f0ca8c6e6945326
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
INFO: pip is looking at multiple versions of megablocks to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement mosaicml-turbo==0.0.4 (from megablocks) (from versions: none)
ERROR: No matching distribution found for mosaicml-turbo==0.0.4

How to pip install the latest megablocks?

Hi guys, how do I install the latest megablocks version?
It seems that the latest 0.3.3 just cannot be found by pip.

pip install megablocks==0.3.3
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement megablocks==0.3.3 (from versions: 0.0.1, 0.0.2)
ERROR: No matching distribution found for megablocks==0.3.3

From https://pypi.org/project/megablocks/, it seems that only 0.0.2 is currently available...

Currently I build megablocks from local source as a workaround:
pip install -e .

[BUG] Optimizer Weights Not Reloaded When Training with bf16 Pretrained Weights

While working with the load_checkpoint function in the file third_party/Megatron-LM/megatron/checkpointing.py, I noticed that the condition on line 585:

if args.fp16 and optimizer is not None:

should be modified to:

if (args.fp16 or args.bf16) and optimizer is not None:

Without this change, when using bf16 and attempting to load pretrained model weights to continue training, the weights in the optimizer will not be reset to the pretrained model weights. As a result, the training process is virtually no different from starting training from scratch.

I am aware that the proper course of action would be to submit a pull request to the Megatron-LM repository linked here. However, I wanted to raise this issue here as well to alert others to the current problem in the codebase.

About the Multi-node Script

Thanks for the excellent work.

I am trying to fine-tune the Mixtral 8x7B model based on this codebase. It would be really convenient if you could share a launch script for multi-node cases. Also, how can I load the 8x7B weights? There doesn't seem to be a weight-conversion script.

Illegal memory access on non-0 cuda devices from `histogram`

When the input tensor is not on device 0, histogram causes an illegal memory access, which prevents indices_and_bins from being computed correctly for a model and inputs that aren't on device zero.

Reproduction:

import torch
import megablocks

idx = 1
device = f'cuda:{idx}'

test_tensor = torch.tensor([ 0 ], dtype=torch.int64, device=device)
result = megablocks.ops.histogram(test_tensor, 1).cpu()

When run with CUDA_LAUNCH_BLOCKING=1, we get:

Traceback (most recent call last):
  File "/home/ubuntu/test_mb.py", line 8, in <module>
    result = megablocks.ops.histogram(test_tensor, 1).cpu()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/.local/lib/python3.10/site-packages/megablocks/ops/histogram.py", line 17, in forward
    return ops.histogram(x, max_val)
RuntimeError: an illegal memory access was encountered

whereas when idx is set to 0 the correct values are computed. I'm quite confused as to how this is possible.

I'm on megablocks 0.5.1, with:

numpy==2.0.0
torch==2.3.1
torchaudio==2.3.1
torchvision==0.18.1
triton==2.3.1

CUDA 12.1, reproduced on two A100s.

Computation distribution with expert parallelism

Hi,
How are computation and weights sharded when using expert parallelism with dMoE? Does each expert-parallel rank compute only num_experts/expert_parallelism specific experts, or does each rank compute 1/expert_parallelism of the work for all the experts?
For example, in the extreme case where expert_parallelism equals num_experts and all tokens are routed to a single expert, does all the computation happen unevenly on one device, or is it somehow sharded across all expert-parallel ranks? If it's the latter, is there somewhere I can find additional details on how this works? If it's the former, would there be any advantage to using dMoE over the base implementation when there is one expert per rank?

Thanks

How to add support for swiglu in Megablocks?

Hi @tgale96 @deepakn94,
During integration of MegaBlocks into Megatron-LM, I found that when --swiglu (src) is enabled, the corresponding ffn_hidden_size is int((4 * args.hidden_size * 2 / 3) / 64) * 64 instead of the usual args.ffn_hidden_size = 4 * args.hidden_size (src). This change breaks the underlying assert in dmoe.py:

assert self.ffn_hidden_size % self.blocking == 0

Could we add support for the swiglu case, where self.ffn_hidden_size % self.blocking != 0 in general? Thanks!
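One hedged workaround (an assumption on my part, not a confirmed MegaBlocks feature) would be to round the SwiGLU hidden size up to the next multiple of the blocking before constructing the layer:

def round_up_to_blocking(ffn_hidden_size: int, blocking: int = 128) -> int:
    # Round up so that ffn_hidden_size % blocking == 0 holds for the dMoE assert.
    return ((ffn_hidden_size + blocking - 1) // blocking) * blocking

# Example: hidden_size = 2048 gives int((4 * 2048 * 2 / 3) / 64) * 64 = 5440,
# which is not a multiple of 128; round_up_to_blocking(5440) == 5504.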

Comparison against top-2 routing?

Hi,

The paper results are very impressive, but I notice the comparison is against top-1 routing. Do you have results against top-2 routing? That would make the comparison more challenging :D.

Sum missing axis arg in kernels.py

I get the following error when I try to run the backward pass with moe_expert_model_parallelism=False. E.g., if I run moe_test.py it fails with this error.

    # Reduce to get the final result and store.
    out = tl.sum(acc).to(wgrad.dtype.element_ty)
                 ^
TypeError("sum() missing 1 required positional argument: 'axis'")

My versions

megablocks                   0.5.1
stanford-stk                 0.7.0
triton                       2.1.0
