mosaicml / diffusion Goto Github PK

View Code? Open in Web Editor NEW

649.0 649.0 63.0 20.09 MB

License: Apache License 2.0

Makefile 0.15% Python 99.47% Shell 0.38%

diffusion's People

Contributors

Stargazers

Watchers

diffusion's Issues

Could this code be used for 3090ti

I am limited GPU resources and the maximum GPU is 3090ti. Could I use your code to finetune stable diffusion under 3090ti?

FID and CLIP score

Hope this finds you well. Nice work.
One quick question: have you compared the FID/CLIP score of mosaic diffusion-2.0-base and the official diffusion-2.0-base?
I only find some human evaluation results shown in https://www.mosaicml.com/blog/training-stable-diffusion-from-scratch-part-2.

Bug Report of image_caption.py

This line of code https://github.com/mosaicml/diffusion/blob/main/diffusion/datasets/image_caption.py#L214 should be changed to

crop_transform = RandomCropSquare(resize_size) if rand_crop else LargestCenterSquare(resize_size)

FID score changes a lot during the model training

Hi, I found that FID score becomes larger and larger during the Stage 2 (512x512) training of SD-2.0-base, but the loss keeps at the same level, roughly about 0.12~0.13. Any thoughts? Thanks.

NaN loss and Watchdog timeout error caused by _accumulate_time_across_ranks

Hi, the loss of my training job becomes NaN at step 161544. The job failed due to wathdog timeout error. Looks like the error is caused by this code. Any thoughts? Thanks.

2023-10-05T03:29:36.616-07:00 | [0]:train 21%\|█████▏ \| 161543/770000 [29:39:25<397:31:10, 2.35s/ba, loss/train/t�[A
-- | --
  | 2023-10-05T03:29:36.616-07:00 | [0]:
  | 2023-10-05T04:00:22.604-07:00 | [0]:train 21%\|█████▏ \| 161544/770000 [29:39:25<310:42:33, 1.84s/ba, loss/train/t�[A[6]:[E ProcessGroupNCCL.cpp:828] [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805505 milliseconds before timing out.
  | 2023-10-05T04:00:22.612-07:00 | [1]:[E ProcessGroupNCCL.cpp:828] [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805498 milliseconds before timing out.
  | 2023-10-05T04:00:22.628-07:00 | [7]:[E ProcessGroupNCCL.cpp:828] [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805581 milliseconds before timing out.
  | 2023-10-05T04:00:22.636-07:00 | [3]:[E ProcessGroupNCCL.cpp:828] [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805568 milliseconds before timing out.
  | 2023-10-05T04:00:22.654-07:00 | [5]:[E ProcessGroupNCCL.cpp:828] [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805524 milliseconds before timing out.
  | 2023-10-05T04:00:22.654-07:00 | [5]:Traceback (most recent call last):
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/mosaic.py", line 28, in <module>
  | 2023-10-05T04:00:22.654-07:00 | [5]: main()
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
  | 2023-10-05T04:00:22.654-07:00 | [5]: return f(*args, **kwargs)
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/mosaic.py", line 24, in main
  | 2023-10-05T04:00:22.654-07:00 | [5]: return train(cfg)
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/models/diffusion/train.py", line 170, in train
  | 2023-10-05T04:00:22.654-07:00 | [5]: return eval_and_then_train()
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/models/diffusion/train.py", line 168, in eval_and_then_train
  | 2023-10-05T04:00:22.654-07:00 | [5]: trainer.fit()
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1766, in fit
  | 2023-10-05T04:00:22.654-07:00 | [5]: self._train_loop()
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1962, in _train_loop
  | 2023-10-05T04:00:22.654-07:00 | [5]: total_num_samples, total_num_tokens, batch_time = self._accumulate_time_across_ranks(
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1864, in _accumulate_time_across_ranks
  | 2023-10-05T04:00:22.654-07:00 | [5]: dist.all_reduce(sample_token_tensor, reduce_operation='SUM')
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/utils/dist.py", line 212, in all_reduce
  | 2023-10-05T04:00:22.654-07:00 | [5]: dist.all_reduce(tensor, op=reduce_op)
  | 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
  | 2023-10-05T04:00:22.654-07:00 | [5]: return func(*args, **kwargs)
  | 2023-10-05T04:00:22.654-07:00Copy[5]:  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce | [5]: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
  | 2023-10-05T04:00:22.654-07:00Copy[5]:    work = default_pg.allreduce([tensor], opts) | [5]: work = default_pg.allreduce([tensor], opts)
  | 2023-10-05T04:00:22.654-07:00 | [5]:RuntimeError: NCCL communicator was aborted on rank 37. Original reason for failure was: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805524 milliseconds before timing out.
  | 2023-10-05T04:00:22.655-07:00 | [2]:[E ProcessGroupNCCL.cpp:828] [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805579 milliseconds before timing out.
  | 2023-10-05T04:00:22.656-07:00 | [4]:[E ProcessGroupNCCL.cpp:828] [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805597 milliseconds before timing out.
  | 2023-10-05T04:00:22.700-07:00 | [0]:[E ProcessGroupNCCL.cpp:828] [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805625 milliseconds before timing out.

Throughput is low

Hi, I am using 16 nodes, each of which has 8 A100s to train the first stage (256x256) of sd-2.0-base.
The total batch size is 2048=128*16. I am not using pre-computed latents, giving the throughput about 5.5k-6K samples/sec.
As mentioned here, the throughput of your model training is 11600 samples/sec by using pre-computed latents. Given the following estimation, the throughput of your model training is about 11600/1.4=8286 samples/sec without using pre-computed latents. This throughput is quite higher than ours, any ideas? Thanks.

If you are computing VAE and CLIP latents while training, expect a 1.4x increase in time and cost.

index.json file missing

Hi,
I have followed the data preparation steps as described in the laion2b-en-interactive.yaml and precompute-latents.yaml files. When computing the gradients, I get an error regarding a missing index.json file in the aesthetic output folder.My aesthetic output folder contains the sharded parquet files and corresponding stats,json files. What is the purpose of the index.json file and how can I generate it?FYI, I am only using a small subset of the dataset ~10% for this experiment. Thanks.

Traceback (most recent call last):
  File "/home/ubuntu/diffusion/precompute_latents.py", line 357, in <module>
    main(parse_args())
  File "/home/ubuntu/diffusion/precompute_latents.py", line 229, in main
    dataloader = build_streaming_laion_dataloader(
  File "/home/ubuntu/diffusion/precompute_latents.py", line 166, in build_streaming_laion_dataloader
    dataset = StreamingLAIONDataset(
  File "/home/ubuntu/diffusion/precompute_latents.py", line 58, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/dataset.py", line 264, in __init__
    stream_shards = stream.get_shards(world)
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/stream.py", line 303, in get_shards
    filename = self._download_file(basename)
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/stream.py", line 199, in _download_file
    download_file(remote, local, self.download_timeout)
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/storage/download.py", line 234, in download_file
    download_from_local(remote, local)
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/storage/download.py", line 204, in download_from_local
    shutil.copy(remote, local_tmp)
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/mosaic/aesthetic/1/index.json'

Grad become NaN

When training on my local machine (3090 24Gb) with batch size 12, grad value become NaN after few steps

But I don't meet this when training on Google Cloud A100 40Gb with bs 20. Why? How I fix that?

When is common canvas going to be released?

Training verbose logs

I'm trying to execute a training process with composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-256.yaml, after changing the configuration to use a custom data loader. Im getting some generic error
AttributeError("'IterableDatasetDict' object has no attribute '_distributed'") from unspecified source. How can I get more details?

How can I improve to solve this problem?

Error executing job with overrides: []
Error locating target 'diffusion.datasets.laion.laion.build_streaming_laion_dataloader', set env var HYDRA_FULL_ERROR=1 to see chained exception.
full_key: dataset.train_dataset

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 16072) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 16072) exited with code 1

can we use coco_captions dataset to train your model？？？

Thanks for sharing this amizing work，the LAION-5B is too big and i want use coco_captions dataset to train the model, should i change the dataloader script coco_captions.py?

Setting for used gpus

Is there any setting in yaml file to control the used GPU number? For example, I only have 2 cards available, but this program always notices 8 cards are used.

Any plan to support "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack"?

Hi, do you have plan to support EMU? According to the paper, this model outperforms SDXL and only requires 2k images for quality-tuning. The images showcased in the paper is very realistic

inference code?

hello, can you provide a script to convert a trained model to diffusers ckpt format so I can inference my trained model using diffusers?

ValueError('cannot mmap an empty file')

FileExistsError: [Errno 17] File exists: '/000000_shard_access_times'

During handling of the above exception, another exception occurred:

InstantiationException: Error in call to target 'diffusion.datasets.laion.laion.build_streaming_laion_dataloader':
ValueError('cannot mmap an empty file')
full_key: dataset.train_dataset
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 34542) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 34542) exited with code 1

I'm wondering why it need to write in the root directory?, it seems set self.local to empty, please help

No initialize_dist in precompute_latent.py

I got this error running mainline precompute_latent.py with composer,

RuntimeError: The world_size(8) > 1, but the distributed package is not available or has not been initialized. Please check y$
u have initialized the distributed runtime and that PyTorch has been built with distributed support. If calling this function
outside Trainer, please ensure that `composer.utils.dist.initialize_dist` has been called first.

My understanding is that I need add dist.intialize_dist(device, timeout) right after get_device()
is this understanding correct? Thanks

Compare latents from training code and latents from precompute_latents.py

Hi, I am doing a sanity check to make sure that the caption and image latents generated by precompute_latents.py are identical to the latents generated from the training code. However, I found this only happens when the same batch_size is used within these two settings. Any thoughts? Note that vae uses group norm, while text_encoder uses layer norm, both do not depend on batch size. In addition, both vae and text_encoder use drop_out=0.

For latents from training code, I save conditioning and latents at this position
For latents from precompute_latents.py, I just use precompute_latents.py

Missing/Unexpected Keys when load a checkpoint

Hi, to do the stage 2 training of 2.0-base, I am using this yaml file SD-2-base-512.yam. However, this yaml file doesn't load the checkpoint from stage 1. I add a new line under trainer to handle this issue.

trainer:
    load_path: sd2.0-base-256/ep0-ba550000-rank0.pt

However, I get the following error:

Found these missing keys in the checkpoint: vae.encoder.mid_block.attentions.0.to_q.weight, vae.encoder.mid_block.attentions.0.to_q.bias, vae.encoder.mid_block.attentions.0.to_k.weight, vae.encoder.mid_block.attentions.0.to_k.bias, vae.encoder.mid_block.attentions.0.to_v.weight, vae.encoder.mid_block.attentions.0.to_v.bias, vae.encoder.mid_block.attentions.0.to_out.0.weight, vae.encoder.mid_block.attentions.0.to_out.0.bias, vae.decoder.mid_block.attentions.0.to_q.weight, vae.decoder.mid_block.attentions.0.to_q.bias, vae.decoder.mid_block.attentions.0.to_k.weight, vae.decoder.mid_block.attentions.0.to_k.bias, vae.decoder.mid_block.attentions.0.to_v.weight, vae.decoder.mid_block.attentions.0.to_v.bias, vae.decoder.mid_block.attentions.0.to_out.0.weight, vae.decoder.mid_block.attentions.0.to_out.0.bias
mosaic/0 [0]:[2023-06-19 08:43:26,877][composer.core.state][WARNING] - Found these unexpected keys in the checkpoint: vae.encoder.mid_block.attentions.0.query.weight, vae.encoder.mid_block.attentions.0.query.bias, vae.encoder.mid_block.attentions.0.key.weight, vae.encoder.mid_block.attentions.0.key.bias, vae.encoder.mid_block.attentions.0.value.weight, vae.encoder.mid_block.attentions.0.value.bias, vae.encoder.mid_block.attentions.0.proj_attn.weight, vae.encoder.mid_block.attentions.0.proj_attn.bias, vae.decoder.mid_block.attentions.0.query.weight, vae.decoder.mid_block.attentions.0.query.bias, vae.decoder.mid_block.attentions.0.key.weight, vae.decoder.mid_block.attentions.0.key.bias, vae.decoder.mid_block.attentions.0.value.weight, vae.decoder.mid_block.attentions.0.value.bias, vae.decoder.mid_block.attentions.0.proj_attn.weight, vae.decoder.mid_block.attentions.0.proj_attn.bias

Question regarding the training dataset

Hope this finds you well. Very amazing work!!!
As mentioned in https://github.com/mosaicml/diffusion#how-many-gpus-do-i-need, "Our time estimates are based on training Stable Diffusion 2.0 base on 1,126,400,000 images at 256x256 resolution and 1,740,800,000 images at 512x512 resolution".
I have three questions:

since the whole data of ChristophSchuhmann/improved_aesthetics_4.5plus is 1.2B, how come that you have 1.74B images with original resolution > 512x512?
I am also downloading ChristophSchuhmann/improved_aesthetics_4.5plus, but it turns out that the failed ratio is about 20%, which means I can only get about 960M images. How could you download 1.12B images at 256x256 resolution? I follow the downloading script provided in this repo.
According to stabilityai/stable-diffusion-2-base, they said "The model is trained from scratch 550k steps at resolution 256x256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5. Then it is further trained for 850k steps at resolution 512x512 on the same dataset on images with resolution >= 512x512". Have you applied punsafe=0.1 in your training? I suppose "punsafe=0.1" means that we only use samples with punsafe<0.1, am I correct?

Thanks very much!!!

Training with modest data

First of all I want to thanks for your amazing project.

I am implementing a diffusion from scratch, but instead using LAION dataset I using my own dataset which contains only 20 milion images. I already finish the training stage 1 with image size 256x256 and do some image generation experiment on it. The image output have very good background but if the prompt define some object, for instance "a dog" then the image output have multiple object (dogs) instead one (You can check some samples down below).

I know the quality of model mostly depend on the quality of the dataset but can you suggest any idea to improve output for my model.

Some information about my case:

I using Blip2 to generate caption for 20 milion images
I'm using the Stable Diffusion v2.1 default config for training config
I'm training with 8*A100 on AWS but my budgets are limited hence I haven't enough time to do lots of experiments

fp32

Hi, thanks for this great work.
I have a question regarding encode_latents_in_fp16. If we set encode_latents_in_fp16=False to use fp32, do we expect lower performance compared to fp16? I have tried both and found that images generated by fp32-based model have lower quality than fp16-based model. Is that expected? Thanks

Best

Bug report: ValueError: invalid literal for int() with base 10: '/tmp/mds-cache/mds-coco-2014-val-fid-clip-17'

Hi, after installing everything by following these commands

git clone https://github.com/mosaicml/diffusion.git
cd diffusion
pip install -e .

I was trying to run fid-clip-evaluation.py but got the following error:

Traceback (most recent call last):
  File "diffusion/scripts/fid-clip-evaluation.py", line 39, in <module>
    coco_val_dataloader = build_streaming_cocoval_dataloader(
  File "diffusion/diffusion/datasets/coco/coco_captions.py", line 110, in build_streaming_cocoval_dataloader
    dataset = StreamingCOCOCaption(
  File "diffusion/diffusion/datasets/coco/coco_captions.py", line 60, in __init__
    super().__init__(
  File "python3.9/site-packages/streaming/base/dataset.py", line 496, in __init__
    self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
  File "python3.9/site-packages/streaming/base/shared/prefix.py", line 189, in get_shm_prefix
    prefix_int = _check_and_find_retrying(streams_local, streams_remote, retry)
  File "python3.9/site-packages/streaming/base/shared/prefix.py", line 162, in _check_and_find_retrying
    raise errs[-1]
  File "python3.9/site-packages/streaming/base/shared/prefix.py", line 158, in _check_and_find_retrying
    return _check_and_find(streams_local, streams_remote)
  File "python3.9/site-packages/streaming/base/shared/prefix.py", line 115, in _check_and_find
    their_locals, _ = _unpack_locals(bytes(shm.buf))
  File "python3.9/site-packages/streaming/base/shared/prefix.py", line 75, in _unpack_locals
    return text[:-1], int(text[-1] or 0)
ValueError: invalid literal for int() with base 10: '/tmp/mds-cache/mds-coco-2014-val-fid-clip-17'
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

This error is fixed after installing old packages:

pip install mosaicml==0.14.1
pip install mosaicml-streaming==0.5.0

How long does it take to run laion_cloudwriter.py and precompute_latents.py for aesthetic-4.5

Hope this finds you well.
I am trying to run laion_cloudwriter.py and precompute_latents.py on the aesthetic-4.5 dataset. However, I found that it takes very long time to finish these two tasks. For example, laion_cloudwriter.py takes about 80 seconds for each parquet file (the row data), while the whole dataset contains about 25K parquet files.
Thanks in advance.

Error when training with local precomputed features

Bug when local training with LocalDataset

Here is my config (without some personal paths), run for mosaicml's diffusion:

algorithms:
  low_precision_groupnorm:
    attribute: unet
    precision: amp_fp16
  low_precision_layernorm:
    attribute: unet
    precision: amp_fp16
model:
  _target_: diffusion.models.models.stable_diffusion_2
  model_name: runwayml/stable-diffusion-v1-5
  pretrained: true
  precomputed_latents: true
  encode_latents_in_fp16: true
  # fsdp: false
  fsdp: true
  val_metrics:
    - _target_: torchmetrics.MeanSquaredError
  val_guidance_scales: []
  loss_bins: []
dataset:
  train_batch_size: 2048 # TODO: explore composer config
  eval_batch_size: 16 # Should be 8 per device
  train_dataset:
    _target_: diffusion.datasets.pixta.pixta.build_custom_dataloader
    data_path: ...
    feature_dim: 32
    num_workers: 8
    pin_memory: false
  eval_dataset:
    _target_: diffusion.datasets.pixta.pixta.build_custom_dataloader
    data_path: ...
    feature_dim: 32
    num_workers: 8
    pin_memory: false
optimizer:
  _target_: torch.optim.AdamW
  lr: 1.0e-5
  weight_decay: 0.01
scheduler:
  _target_: composer.optim.ConstantWithWarmupScheduler
  t_warmup: 10ba
logger:
  wandb:
    _target_: composer.loggers.wandb_logger.WandBLogger
    name: ${name}
    project: ${project}
    group: ${name}
callbacks:
  speed_monitor:
    _target_: composer.callbacks.speed_monitor.SpeedMonitor
    window_size: 10
  lr_monitor:
    _target_: composer.callbacks.lr_monitor.LRMonitor
  memory_monitor:
    _target_: composer.callbacks.memory_monitor.MemoryMonitor
  runtime_estimator:
    _target_: composer.callbacks.runtime_estimator.RuntimeEstimator
  optimizer_monitor:
    _target_: composer.callbacks.OptimizerMonitor
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 10ep
  eval_interval: 2ep
  device_train_microbatch_size: 40
  run_name: ${name}
  seed: ${seed}
  scale_schedule_ratio: ${scale_schedule_ratio}
  save_folder:  outputs/${project}/${name}
  save_interval: 5ep
  save_overwrite: true
  autoresume: false
  fsdp_config:
    sharding_strategy: "SHARD_GRAD_OP"
    state_dict_type: "full"
    mixed_precision: 'PURE'
    activation_checkpointing: true

Here is my dataset and dataloader code:

class CustomDataset(Array, Dataset):
    def __init__(self, 
                 data_path, 
                 feature_dim=64):
        self.feature_dim = feature_dim

        index_file = os.path.join(data_path, 'index.json')
        data = json.load(open(index_file))
        if data['version'] != 2:
            raise ValueError(f'Unsupported streaming data version: {data["version"]}. ' +
                             f'Expected version 2.')
        shards = []
        for info in data['shards']:
            shard = reader_from_json(data_path, None, info)
            shards.append(shard)
        
        self.shards = shards
        samples_per_shard = np.array([shard.samples for shard in shards], np.int64)
        self.length = samples_per_shard.sum()
        self.spanner = Spanner(samples_per_shard)

    def __len__(self):
        return self.length

    @property
    def size(self) -> int:
        """Get the size of the dataset in samples.

        Returns:
            int: Number of samples.
        """
        return self.length
    
    # def __getitem__(self, index):
    def get_item(self, index):
        shard_id, shard_sample_id = self.spanner[index]
        shard = self.shards[shard_id]
        sample = shard[shard_sample_id]
        out = {}
        if 'caption_latents' in sample:
            out['caption_latents'] = torch.from_numpy(
                np.frombuffer(sample['caption_latents'], dtype=np.float16).copy()).reshape(77, 768)

        if 'image_latents' in sample:
            out['image_latents'] = torch.from_numpy(np.frombuffer(sample['image_latents'],
                                                                  dtype=np.float16).copy()).reshape(4, self.feature_dim, self.feature_dim)
        return out
    

def build_custom_dataloader(
    batch_size: int,
    data_path: str,
    image_root: str = None,
    tokenizer_name_or_path: str = 'runwayml/stable-diffusion-v1-5',
    caption_drop_prob: float = 0.0,
    resize_size: int = 512,
    feature_dim: int = 64,
    drop_last: bool = True,
    shuffle: bool = True, #TODO pass shuffle to dataloader
    **dataloader_kwargs,
):
    print('Using precomputed features!!!')
    dataset = CustomDataset(data_path, feature_dim=feature_dim)
    drop_last = False

    if isinstance(dataset, IterableDataset):
        print('Using IterableDataset!!!')
        sampler = None
    else:
        print('Using Sampler!!!')
        sampler = dist.get_sampler(dataset, drop_last=drop_last, shuffle=shuffle)

    dataloader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        sampler=sampler,
        drop_last=drop_last,
        shuffle=shuffle if sampler is None else False,
        **dataloader_kwargs,
    )

    return dataloader

And I got this errors while finish epoch 0 and start epoch 1:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py", line 262, in poll
    return self._poll(timeout)
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
    r = wait([self], timeout)
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3007309) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tungduongquang/workspace/mosaicml/image-generation/run.py", line 26, in <module>
    main()
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/tungduongquang/workspace/mosaicml/image-generation/run.py", line 22, in main
    return train(config)
  File "/home/tungduongquang/workspace/mosaicml/image-generation/diffusion/train.py", line 134, in train
    return eval_and_then_train()
  File "/home/tungduongquang/workspace/mosaicml/image-generation/diffusion/train.py", line 132, in eval_and_then_train
    trainer.fit()
  File "/home/tungduongquang/workspace/mosaicml/composer/composer/trainer/trainer.py", line 1796, in fit
    self._train_loop()
  File "/home/tungduongquang/workspace/mosaicml/composer/composer/trainer/trainer.py", line 1938, in _train_loop
    for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
  File "/home/tungduongquang/workspace/mosaicml/composer/composer/trainer/trainer.py", line 2924, in _iter_dataloader
    batch = next(dataloader_iter)
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3007309) exited unexpectedly
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/datal │
│ oader.py:1120 in _try_get_data                                                                   │
│                                                                                                  │
│   1117 │   │   # Returns a 2-tuple:                                                              │
│   1118 │   │   #   (bool: whether successfully get data, any: data if successful else None)      │
│   1119 │   │   try:                                                                              │
│ ❱ 1120 │   │   │   data = self._data_queue.get(timeout=timeout)                                  │
│   1121 │   │   │   return (True, data)                                                           │
│   1122 │   │   except Exception as e:                                                            │
│   1123 │   │   │   # At timeout and error, we manually check whether any worker has              │
│                                                                                                  │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/queues.py:113 in get │
│                                                                                                  │
│   110 │   │   │   try:                                                                           │
│   111 │   │   │   │   if block:                                                                  │
│   112 │   │   │   │   │   timeout = deadline - time.monotonic()                                  │
│ ❱ 113 │   │   │   │   │   if not self._poll(timeout):                                            │
│   114 │   │   │   │   │   │   raise Empty                                                        │
│   115 │   │   │   │   elif not self._poll():                                                     │
│   116 │   │   │   │   │   raise Empty                                                            │
│                                                                                                  │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py:262 in │
│ poll                                                                                             │
│                                                                                                  │
│   259 │   │   """Whether there is any input available to be read"""                              │
│   260 │   │   self._check_closed()                                                               │
│   261 │   │   self._check_readable()                                                             │
│ ❱ 262 │   │   return self._poll(timeout)                                                         │
│   263 │                                                                                          │
│   264 │   def __enter__(self):                                                                   │
│   265 │   │   return self                                                                        │
│                                                                                                  │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py:429 in │
│ _poll                                                                                            │
│                                                                                                  │
│   426 │   │   return self._recv(size)                                                            │
│   427 │                                                                                          │
│   428 │   def _poll(self, timeout):                                                              │
│ ❱ 429 │   │   r = wait([self], timeout)                                                          │
│   430 │   │   return bool(r)                                                                     │
│   431                                                                                            │
│   432                                                                                            │
│                                                                                                  │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py:936 in │
│ wait                                                                                             │
│                                                                                                  │
│   933 │   │   │   │   deadline = time.monotonic() + timeout                                      │
│   934 │   │   │                                                                                  │
│   935 │   │   │   while True:                                                                    │
│ ❱ 936 │   │   │   │   ready = selector.select(timeout)                                           │
│   937 │   │   │   │   if ready:                                                                  │
│   938 │   │   │   │   │   return [key.fileobj for (key, events) in ready]                        │
│   939 │   │   │   │   else:                                                                      │
│                                                                                                  │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/selectors.py:416 in select           │
│                                                                                                  │
│   413 │   │   │   timeout = math.ceil(timeout * 1e3)                                             │
│   414 │   │   ready = []                                                                         │
│   415 │   │   try:                                                                               │
│ ❱ 416 │   │   │   fd_event_list = self._selector.poll(timeout)                                   │
│   417 │   │   except InterruptedError:                                                           │
│   418 │   │   │   return ready                                                                   │
│   419 │   │   for fd, event in fd_event_list:                                                    │
│                                                                                                  │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/_util │
│ s/signal_handling.py:66 in handler                                                               │
│                                                                                                  │
│   63 │   def handler(signum, frame):                                                             │
│   64 │   │   # This following call uses `waitid` with WNOHANG from C side. Therefore,            │
│   65 │   │   # Python can still get and update the process status successfully.                  │
│ ❱ 66 │   │   _error_if_any_worker_fails()                                                        │
│   67 │   │   if previous_handler is not None:                                                    │
│   68 │   │   │   assert callable(previous_handler)                                               │
│   69 │   │   │   previous_handler(signum, frame)                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: DataLoader worker (pid 3007309) is killed by signal: Aborted.

Plz help!!!

train error during evaluation with 1 GPU and train with multi GPU

Hi thanks for this contribution
as a small exercise I am training SD2 on the pokemon dataset
I precomputed the latents and it starts training on one gpu
However at the evaluation time I get the following error

File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2814, in _eval_loop
    self.state.outputs = self._original_model.eval_forward(self.state.batch)
  File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 255, in eval_forward
    gen_images = self.generate(tokenized_prompts=prompts,
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 464, in generate
    pred = self.unet(latent_model_input,
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 934, in forward
    sample = self.conv_in(sample)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (162 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size`

this is my confguration

name: trial0 # Insert wandb run name
project: pokemon_sd2_256 # Insert wandb project name
seed: 17
eval_first: false
algorithms:
  low_precision_groupnorm:
    attribute: unet
    precision: amp_fp16
  low_precision_layernorm:
    attribute: unet
    precision: amp_fp16
model:
  _target_: diffusion.models.models.stable_diffusion_2
  pretrained: false
  precomputed_latents: true
  encode_latents_in_fp16: true
  fsdp: true
  val_metrics:
    - _target_: torchmetrics.MeanSquaredError
    - _target_: torchmetrics.image.fid.FrechetInceptionDistance
      normalize: true
  val_guidance_scales: [3, 7]
  # val_guidance_scales: []
  loss_bins: []
dataset:
  train_batch_size: 1 # Global training batch size
  eval_batch_size: 1  # Global evaluation batch size
  train_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
      # Path to object store bucket(s)
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_train
      # Path to corresponding local dataset(s)
    mode: 0
    version: 2
    drop_last: False
    shuffle: true
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: true
    pin_memory: true
  eval_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_eval # Path to local dataset cache
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: True
    pin_memory: True
    mode: 0
    version: 2
optimizer:
  _target_: torch.optim.AdamW
  lr: 1.0e-5
  weight_decay: 0.01
scheduler:
  _target_: composer.optim.LinearWithWarmupScheduler
  t_warmup: 1000ba
  alpha_f: 1.0
logger:
  comet-ml:
    _target_: composer.loggers.cometml_logger.CometMLLogger
    name: ${name}
    project_name: ${project}
callbacks:
  speed_monitor:
    _target_: composer.callbacks.speed_monitor.SpeedMonitor
    window_size: 10
  lr_monitor:
    _target_: composer.callbacks.lr_monitor.LRMonitor
  memory_monitor:
    _target_: composer.callbacks.memory_monitor.MemoryMonitor
  runtime_estimator:
    _target_: composer.callbacks.runtime_estimator.RuntimeEstimator
  optimizer_monitor:
    _target_: composer.callbacks.OptimizerMonitor
  image_monitor:
    _target_: diffusion.callbacks.log_diffusion_images.LogDiffusionImages
    prompts: # add any prompts you would like to visualize
    - cute dragon creature
    size: 256 # generated image resolution
    guidance_scale: 3
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 550000ba
  eval_interval: 1000ba
  device_train_microbatch_size: 1
  run_name: ${name}
  seed: ${seed}
  save_folder:  trained_model # Insert path to save folder or bucket
  save_interval: 3000ba
  save_overwrite: true
  autoresume: false
  # fsdp_config:
  #   sharding_strategy: "SHARD_GRAD_OP"

save_overwrite conflicts with autoresume=true

When must set autoresume=true,
if I set save_overwrite = false，it says:

FileExistsError: /home/mnt/diffmodels/ep0-ba1000-rank0.pt may conflict with a future checkpoint of the current run.Please
delete that file, change to a new folder, or set overwrite=True.
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: ð�� View run sd-model at: https://wandb.ai/xxxx/diffusion/runs/hpzkfedp
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230701_152030-hpzkfedp/logs

if I set save_overwrite = true, it says:

InstantiationException: Error in call to target 'composer.trainer.trainer.Trainer':
ValueError('The flag save_overwrite must be False when autoresume is enabled as autoresume always loads the latest existing checkpoint in save_folder. ')
full_key: trainer

some question about learning rate

In SD-256.yaml, the learning rate is set to 1e-4, I wonder whether the learning rate will be scaled up with the increase of GPUs. For example, if I use 64 GPUs train the model, will the learning rate be scaled to 64*1e-4 automatically?

Weights for the trained model

Can you share the weight for the trained model
Also, is there any technical report to understand, the building blocks, that can quantify improvements

How can I solve this problem?

A question regarding the lr scheduler

Hi, is that correct to set milestones to 200ep? Thanks.

milestones:
    - 200ep

https://github.com/mosaicml/diffusion/blob/2eb6228ef205832aaf7d3111b64a38cdfa4c06f6/yamls/hydra-yamls/SD-2-base-256.yaml#LL62C14-L62C14

MissingEnvironmentError: Torch distributed is initialized but environment variable NODE_RANK is not set.

Hi, I am trying to do the stage 2 training by loading the checkpoint from stage 1. It works on a single GPU, but failed by using 8 GPUs with the following error. Any ideas? Thanks.

Traceback (most recent call last) ────────────────────────────────╮
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py:92 in        │
mosaic_512/0 [0]:│ _call_target                                                                                     │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│    89 │   │   │   raise InstantiationException(msg) from e                                       │
mosaic_512/0 [0]:│    90 │   else:                                                                                  │
mosaic_512/0 [0]:│    91 │   │   try:                                                                               │
mosaic_512/0 [0]:│ ❱  92 │   │   │   return _target_(*args, **kwargs)                                               │
mosaic_512/0 [0]:│    93 │   │   except Exception as e:                                                             │
mosaic_512/0 [0]:│    94 │   │   │   msg = f"Error in call to target '{_convert_target_to_string(_target_)}':\n{r   │
mosaic_512/0 [0]:│    95 │   │   │   if full_key:                                                                   │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py:1330 in __init__              │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│   1327 │   │   │   │   if wandb.run is None:                                                     │
mosaic_512/0 [0]:│   1328 │   │   │   │   │   load_object_store.init(self.state, self.logger)                       │
mosaic_512/0 [0]:│   1329 │   │   │   _, _, parsed_load_path = parse_uri(load_path)                                 │
mosaic_512/0 [0]:│ ❱ 1330 │   │   │   self._rng_state = checkpoint.load_checkpoint(                                 │
mosaic_512/0 [0]:│   1331 │   │   │   │   state=self.state,                                                         │
mosaic_512/0 [0]:│   1332 │   │   │   │   logger=self.logger,                                                       │
mosaic_512/0 [0]:│   1333 │   │   │   │   path=parsed_load_path,                                                    │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/checkpoint.py:205 in load_checkpoint       │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│   202 │   │   │   # Get the path to the proper checkpoint folder corresponding to the current    │
mosaic_512/0 [0]:│   203 │   │   │   # If fsdp_sharded_state_dict_enabled then just use that rank's unique tempdi   │
mosaic_512/0 [0]:│   204 │   │   │   node_checkpoint_folder = (tempdir                                              │
mosaic_512/0 [0]:│ ❱ 205 │   │   │   │   │   │   │   │   │     if state.fsdp_sharded_state_dict_enabled else _get   │
mosaic_512/0 [0]:│   206 │   │   │   assert node_checkpoint_folder is not None                                      │
mosaic_512/0 [0]:│   207 │   │   │                                                                                  │
mosaic_512/0 [0]:│   208 │   │   │   composer_states_filepath, extracted_checkpoint_folder, extracted_rank_n = do   │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/checkpoint.py:239 in                       │
mosaic_512/0 [0]:│ _get_local_rank_zero_path                                                                        │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│   236                                                                                            │
mosaic_512/0 [0]:│   237 def _get_local_rank_zero_path(path: Optional[str]) -> str:                                 │
mosaic_512/0 [0]:│   238 │   """Broadcasts the ``path`` from the LOCAL rank zero to all LOCAL ranks."""             │
mosaic_512/0 [0]:│ ❱ 239 │   local_rank_zero = dist.get_local_world_size() * dist.get_node_rank()                   │
mosaic_512/0 [0]:│   240 │   paths = dist.all_gather_object(path)                                                   │
mosaic_512/0 [0]:│   241 │   local_rank_zero_path = paths[local_rank_zero]                                          │
mosaic_512/0 [0]:│   242 │   assert local_rank_zero_path is not None, 'local rank zero provides the path'           │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/dist.py:155 in get_node_rank               │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│   152 │   Returns:                                                                               │
mosaic_512/0 [0]:│   153 │   │   int: The node rank, starting at 0.                                                 │
mosaic_512/0 [0]:│   154 │   """                                                                                    │
mosaic_512/0 [0]:│ ❱ 155 │   return _get_distributed_config_var(env_var='NODE_RANK', default=0, human_name='node    │
mosaic_512/0 [0]:│   156                                                                                            │
mosaic_512/0 [0]:│   157                                                                                            │
mosaic_512/0 [0]:│   158 def barrier() -> None:                                                                     │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/dist.py:101 in _get_distributed_config_var │
mosaic_512/0 [0]:│                                                                                                  │
mosaic_512/0 [0]:│    98 │   │   return int(os.environ[env_var])                                                    │
mosaic_512/0 [0]:│    99 │                                                                                          │
mosaic_512/0 [0]:│   100 │   if dist.is_initialized():                                                              │
mosaic_512/0 [0]:│ ❱ 101 │   │   raise MissingEnvironmentError('Torch distributed is initialized but environment    │
mosaic_512/0 [0]:│   102 │   │   │   │   │   │   │   │   │     f'{env_var} is not set.')                            │
mosaic_512/0 [0]:│   103 │                                                                                          │
mosaic_512/0 [0]:│   104 │   return default                                                                         │
mosaic_512/0 [0]:╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
mosaic_512/0 [0]:MissingEnvironmentError: Torch distributed is initialized but environment variable NODE_RANK is not set.

H100 support issue

When can we expect H100 support? I have tried building environment basing on cuda11.8 and 12.0. There seems to be some issues realted to package discrepency. Any suggestion for now?

FID from the mainline code is different from https://github.com/mosaicml/diffusion/tree/ejyuen-patch-1

Hi, I found that the current mainline code can generate reasonable FID score for pre-trained models, but generate very high FID score for the model that is pre-trained using this codebase.
For example, I have a checkpoint that is pre-trained on the LAION dataset and get the following FID scores by using fid-clip-evaluation.py:
Mainline -> 18.46875
ejyuen-patch-1 -> 14.32812

For anther checkpoint, I get the following result:
Mainline -> 21.46875
ejyuen-patch-1 -> 15.89062

Note that for all those FID calculation, I use the same COCO2014-10K dataset.

An error occurred while running laion_cloudwriter.py

leaked shared_memory

frustrated after training about 1654/ba it corrupted, failed to save the checkpoint, tried two times.
Error as follows:

[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) ran for 302714 milliseconds before timing out.
train 4%|▉ /home/anaconda3/envs/control/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 35121) has still not exited; return exit code 1.

Why are you using COCO2014 to calculate FID and CLIP score

Hi, your code sets COCO2014 by default to calculate FID and CLIP score.
May I know which COCO dataset should be used? COCO 2014 or COCO2017? Thanks

missing index.json when excuting laion_cloudwriter.py

Nice work!

laion_cloudwriter.py conveted format sucessfully from parquet to mds, but failed to generate index.json file which is neccessary for dataloader in precompute_latents.py.

Is this a bug? Thanks for help.

upsample_nearest_nhwc only supports output tensors with less than int_max

Hi, I get the following error when I use fid-clip-evaluation.py with size=512, but it works with size=256. Any suggestions? Thanks.

train error

i try to run train with composer run.py --config_path yamls/hydra-yamls --config_name SD-2-base-256.yaml but got following error

Why do we load clip-vit-large-patch14?

Hi, I was wondering why do I see the following log by using stable_diffusion_2? I didn't see the training code is supposed to load openai--clip-vit-large-patch14, isn't it?

mosaic/0 [0]:[INFO|configuration_utils.py:712] 2023-10-03 22:40:58,340 >> loading configuration file config.json from cache at .cache/huggingface/hub/models--openai--clip-vit-large-patch14/snapshots/32bd64288804d66eefd0ccbe215aa642df71cc41/config.json

TypeError("'NoneType' object is not iterable")

hydra.errors.InstantiationException: Error in call to target 'diffusion.datasets.laion.laion.build_streaming_laion_dataloader':
TypeError("'NoneType' object is not iterable")
full_key: dataset.train_dataset
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 247336) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 247336) exited with code 1

Are you planning releasing the trained model?

I see it referenced in the source code as oci://mosaicml-internal-checkpoints/stable-diffusion-hero-run/4-13-512-ema/ep5-ba850000-rank0.pt. It would be nice to try it.

NaN loss after 6K steps with batch size 2048

Hi, I got NaN loss during the training of SD-2-base-256.yaml

[dynamo] `UnspecializedNNModuleVariable` does not implement object identity

[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert: [DEBUG] break_graph_if_unsupported triggered compile
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: call_method SetVariable() __contains__ [UnspecializedNNModuleVariable(Linear)] {} from user code at:
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/Desktop/sdpa.py", line 911, in <resume in train>
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     m = M(linear, encode=encode)
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/Desktop/sdpa.py", line 893, in __init__
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.linear.requires_grad_(False)
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2439, in requires_grad_
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     for p in self.parameters():
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2192, in parameters
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     for name, param in self.named_parameters(recurse=recurse):
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2223, in named_parameters
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     gen = self._named_members(
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2159, in _named_members
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     modules = self.named_modules(prefix=prefix, remove_duplicate=remove_duplicate) if recurse else [(prefix, self)]
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2369, in named_modules
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     if self not in memo:

Error doing composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml

composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml
[2023-06-13 20:29:52,077][composer.utils.reproducibility][INFO] - Setting seed to 17
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.init() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:composer.cli.launcher:Rank 3 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 3 (PID 40553) exited with code 1
----------Begin global rank 3 STDOUT----------
[2023-06-13 20:29:52,032][composer.utils.reproducibility][INFO] - Setting seed to 17

----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.init() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 40550) exited with code -15

Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters

Hi, I am trying to use autoresume to continue train my failed jobs, but get the following error:

File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters

When I use a single node to train a model, save checkpoint, and set autoresume=True to continue the training by using a single node, it works.
However, when I use 16 nodes to train a model, save checkpoint, and use 1 or 16 nodes to do autoresume, I get the aforementioned error.
I googled it, but only find this Stack Overflow. Same error, but no answer yet.

I am researching SD and it is awesome to dive into training techniques.

Hope to get a reply soon.
Peace.

mosaicml / diffusion Goto Github PK

diffusion's People

Contributors

Stargazers

Watchers

Forkers

diffusion's Issues

Bug when local training with LocalDataset

Recommend Projects

Recommend Topics

Recommend Org