mosaicml / diffusion
License: Apache License 2.0
I have limited GPU resources; the best GPU I have is a 3090 Ti. Could I use your code to fine-tune Stable Diffusion on a 3090 Ti?
Hope this finds you well. Nice work.
One quick question: have you compared the FID/CLIP scores of the mosaic diffusion-2.0-base and the official diffusion-2.0-base?
I could only find some human evaluation results, shown in https://www.mosaicml.com/blog/training-stable-diffusion-from-scratch-part-2.
This line of code https://github.com/mosaicml/diffusion/blob/main/diffusion/datasets/image_caption.py#L214 should be changed to
crop_transform = RandomCropSquare(resize_size) if rand_crop else LargestCenterSquare(resize_size)
Hi, I found that the FID score keeps increasing during Stage 2 (512x512) training of SD-2.0-base, while the loss stays at the same level, roughly 0.12~0.13. Any thoughts? Thanks.
Hi, the loss of my training job becomes NaN at step 161544. The job then failed with a watchdog timeout error. It looks like the error is caused by this code. Any thoughts? Thanks.
2023-10-05T03:29:36.616-07:00 | [0]:train 21%\|█████▏ \| 161543/770000 [29:39:25<397:31:10, 2.35s/ba, loss/train/t
-- | --
| 2023-10-05T03:29:36.616-07:00 | [0]:
| 2023-10-05T04:00:22.604-07:00 | [0]:train 21%\|█████▏ \| 161544/770000 [29:39:25<310:42:33, 1.84s/ba, loss/train/t [6]:[E ProcessGroupNCCL.cpp:828] [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805505 milliseconds before timing out.
| 2023-10-05T04:00:22.612-07:00 | [1]:[E ProcessGroupNCCL.cpp:828] [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805498 milliseconds before timing out.
| 2023-10-05T04:00:22.628-07:00 | [7]:[E ProcessGroupNCCL.cpp:828] [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805581 milliseconds before timing out.
| 2023-10-05T04:00:22.636-07:00 | [3]:[E ProcessGroupNCCL.cpp:828] [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805568 milliseconds before timing out.
| 2023-10-05T04:00:22.654-07:00 | [5]:[E ProcessGroupNCCL.cpp:828] [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805524 milliseconds before timing out.
| 2023-10-05T04:00:22.654-07:00 | [5]:Traceback (most recent call last):
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/mosaic.py", line 28, in <module>
| 2023-10-05T04:00:22.654-07:00 | [5]: main()
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
| 2023-10-05T04:00:22.654-07:00 | [5]: return f(*args, **kwargs)
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/mosaic.py", line 24, in main
| 2023-10-05T04:00:22.654-07:00 | [5]: return train(cfg)
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/models/diffusion/train.py", line 170, in train
| 2023-10-05T04:00:22.654-07:00 | [5]: return eval_and_then_train()
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/workspace/models/diffusion/train.py", line 168, in eval_and_then_train
| 2023-10-05T04:00:22.654-07:00 | [5]: trainer.fit()
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1766, in fit
| 2023-10-05T04:00:22.654-07:00 | [5]: self._train_loop()
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1962, in _train_loop
| 2023-10-05T04:00:22.654-07:00 | [5]: total_num_samples, total_num_tokens, batch_time = self._accumulate_time_across_ranks(
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1864, in _accumulate_time_across_ranks
| 2023-10-05T04:00:22.654-07:00 | [5]: dist.all_reduce(sample_token_tensor, reduce_operation='SUM')
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/composer/utils/dist.py", line 212, in all_reduce
| 2023-10-05T04:00:22.654-07:00 | [5]: dist.all_reduce(tensor, op=reduce_op)
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
| 2023-10-05T04:00:22.654-07:00 | [5]: return func(*args, **kwargs)
| 2023-10-05T04:00:22.654-07:00 | [5]: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
| 2023-10-05T04:00:22.654-07:00 | [5]: work = default_pg.allreduce([tensor], opts)
| 2023-10-05T04:00:22.654-07:00 | [5]:RuntimeError: NCCL communicator was aborted on rank 37. Original reason for failure was: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805524 milliseconds before timing out.
| 2023-10-05T04:00:22.655-07:00 | [2]:[E ProcessGroupNCCL.cpp:828] [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805579 milliseconds before timing out.
| 2023-10-05T04:00:22.656-07:00 | [4]:[E ProcessGroupNCCL.cpp:828] [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805597 milliseconds before timing out.
| 2023-10-05T04:00:22.700-07:00 | [0]:[E ProcessGroupNCCL.cpp:828] [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4447361, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805625 milliseconds before timing out.
Hi, I am using 16 nodes, each with 8 A100s, to train the first stage (256x256) of sd-2.0-base.
The total batch size is 2048=128*16. I am not using pre-computed latents, which gives a throughput of about 5.5k-6k samples/sec.
As mentioned here, the throughput of your model training is 11600 samples/sec when using pre-computed latents. Given the following estimate, your throughput would be about 11600/1.4=8286 samples/sec without pre-computed latents. That is considerably higher than ours. Any ideas? Thanks.
If you are computing VAE and CLIP latents while training, expect a 1.4x increase in time and cost.
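The estimate in the question above can be reproduced with a few lines of arithmetic. This is only a sketch of the numbers quoted in the post: the variable names are made up for illustration, and the 1.4x factor is the README's rough figure, not a measured value.

```python
# Numbers quoted in the post; the names are illustrative only.
reported_precomputed = 11600           # samples/sec with pre-computed latents
slowdown_without_precompute = 1.4      # README's rough factor for on-the-fly VAE/CLIP
observed_low, observed_high = 5500, 6000  # samples/sec measured by the poster

# Expected throughput if the only difference were the on-the-fly encoding.
expected = reported_precomputed / slowdown_without_precompute

gap_low = expected / observed_high     # gap versus the best observed throughput
gap_high = expected / observed_low     # gap versus the worst observed throughput

print(f"expected without pre-computed latents: {expected:.0f} samples/sec")
print(f"expected / observed: {gap_low:.2f}x to {gap_high:.2f}x")
```

Under these numbers the expected figure is roughly 1.4 to 1.5 times the observed one, so the remaining gap is about as large as the pre-computation factor itself.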
Hi,
I have followed the data preparation steps described in the laion2b-en-interactive.yaml and precompute-latents.yaml files. When computing the latents, I get an error about a missing index.json file in the aesthetic output folder. My aesthetic output folder contains the sharded parquet files and the corresponding stats.json files. What is the purpose of the index.json file, and how can I generate it? FYI, I am only using a small subset of the dataset (~10%) for this experiment. Thanks.
Traceback (most recent call last):
File "/home/ubuntu/diffusion/precompute_latents.py", line 357, in <module>
main(parse_args())
File "/home/ubuntu/diffusion/precompute_latents.py", line 229, in main
dataloader = build_streaming_laion_dataloader(
File "/home/ubuntu/diffusion/precompute_latents.py", line 166, in build_streaming_laion_dataloader
dataset = StreamingLAIONDataset(
File "/home/ubuntu/diffusion/precompute_latents.py", line 58, in __init__
super().__init__(
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/dataset.py", line 264, in __init__
stream_shards = stream.get_shards(world)
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/stream.py", line 303, in get_shards
filename = self._download_file(basename)
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/stream.py", line 199, in _download_file
download_file(remote, local, self.download_timeout)
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/storage/download.py", line 234, in download_file
download_from_local(remote, local)
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/site-packages/streaming/base/storage/download.py", line 204, in download_from_local
shutil.copy(remote, local_tmp)
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/ubuntu/anaconda3/envs/mosaic/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/mosaic/aesthetic/1/index.json'
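The traceback shows the streaming dataset looking for an index.json inside a stream directory (here /home/ubuntu/mosaic/aesthetic/1/). A quick way to see which directories lack one before launching the job is sketched below. The helper name dirs_missing_index is hypothetical and the layout of numbered subdirectories is an assumption taken from the path in the error; the helper only reports where the file is absent, it does not create it.

```python
from pathlib import Path
import tempfile


def dirs_missing_index(root):
    """Return the immediate subdirectories of `root` that have no index.json."""
    root = Path(root)
    return sorted(
        d.name for d in root.iterdir()
        if d.is_dir() and not (d / "index.json").is_file()
    )


# Demo on a throwaway directory tree (paths are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "0").mkdir()
    (Path(tmp) / "0" / "index.json").write_text("{}")
    (Path(tmp) / "1").mkdir()  # mimics the failing directory: no index.json
    missing = dirs_missing_index(tmp)
    print(missing)
```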
I'm trying to run a training job with composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-256.yaml after changing the configuration to use a custom data loader. I'm getting a generic error,
AttributeError("'IterableDatasetDict' object has no attribute '_distributed'"), from an unspecified source. How can I get more details?
Error executing job with overrides: []
Error locating target 'diffusion.datasets.laion.laion.build_streaming_laion_dataloader', set env var HYDRA_FULL_ERROR=1 to see chained exception.
full_key: dataset.train_dataset
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 16072) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 16072) exited with code 1
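The log itself names the way to get more detail: set HYDRA_FULL_ERROR=1. One option, sketched here under the assumption that you can edit the top of the launch script, is to set it from Python before Hydra runs. Exporting the variable in the shell before calling composer should work as well.

```python
import os

# Ask Hydra for the full chained stack trace instead of the short summary.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("HYDRA_FULL_ERROR", "1")

print(os.environ["HYDRA_FULL_ERROR"])
```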
Thanks for sharing this amazing work. LAION-5B is too big, and I want to use the coco_captions dataset to train the model. Should I change the dataloader script coco_captions.py?
Is there any setting in the yaml file to control the number of GPUs used? For example, I only have 2 cards available, but the program always reports that 8 cards are in use.
Hi, do you plan to support EMU? According to the paper, this model outperforms SDXL and only requires 2k images for quality-tuning. The images showcased in the paper are very realistic.
Hello, can you provide a script to convert a trained model to the diffusers ckpt format so I can run inference on my trained model using diffusers?
FileExistsError: [Errno 17] File exists: '/000000_shard_access_times'
During handling of the above exception, another exception occurred:
InstantiationException: Error in call to target 'diffusion.datasets.laion.laion.build_streaming_laion_dataloader':
ValueError('cannot mmap an empty file')
full_key: dataset.train_dataset
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 34542) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 34542) exited with code 1
I'm wondering why it needs to write to the root directory. It seems self.local is set to empty. Please help.
I got this error running mainline precompute_latents.py with composer:
RuntimeError: The world_size(8) > 1, but the distributed package is not available or has not been initialized. Please check you have initialized the distributed runtime and that PyTorch has been built with distributed support. If calling this function outside Trainer, please ensure that `composer.utils.dist.initialize_dist` has been called first.
My understanding is that I need to add dist.initialize_dist(device, timeout) right after get_device().
Is this understanding correct? Thanks
Hi, I am doing a sanity check to make sure that the caption and image latents generated by precompute_latents.py are identical to the latents generated by the training code. However, I found that they match only when the same batch_size is used in both settings. Any thoughts? Note that vae uses group norm while text_encoder uses layer norm, and neither depends on batch size. In addition, both vae and text_encoder use drop_out=0. The values compared are conditioning and latents at this position.
Hi, to do the stage 2 training of 2.0-base, I am using the yaml file SD-2-base-512.yaml. However, this yaml file doesn't load the checkpoint from stage 1. I added a new line under trainer to handle this issue.
trainer:
load_path: sd2.0-base-256/ep0-ba550000-rank0.pt
However, I get the following error:
Found these missing keys in the checkpoint: vae.encoder.mid_block.attentions.0.to_q.weight, vae.encoder.mid_block.attentions.0.to_q.bias, vae.encoder.mid_block.attentions.0.to_k.weight, vae.encoder.mid_block.attentions.0.to_k.bias, vae.encoder.mid_block.attentions.0.to_v.weight, vae.encoder.mid_block.attentions.0.to_v.bias, vae.encoder.mid_block.attentions.0.to_out.0.weight, vae.encoder.mid_block.attentions.0.to_out.0.bias, vae.decoder.mid_block.attentions.0.to_q.weight, vae.decoder.mid_block.attentions.0.to_q.bias, vae.decoder.mid_block.attentions.0.to_k.weight, vae.decoder.mid_block.attentions.0.to_k.bias, vae.decoder.mid_block.attentions.0.to_v.weight, vae.decoder.mid_block.attentions.0.to_v.bias, vae.decoder.mid_block.attentions.0.to_out.0.weight, vae.decoder.mid_block.attentions.0.to_out.0.bias
mosaic/0 [0]:[2023-06-19 08:43:26,877][composer.core.state][WARNING] - Found these unexpected keys in the checkpoint: vae.encoder.mid_block.attentions.0.query.weight, vae.encoder.mid_block.attentions.0.query.bias, vae.encoder.mid_block.attentions.0.key.weight, vae.encoder.mid_block.attentions.0.key.bias, vae.encoder.mid_block.attentions.0.value.weight, vae.encoder.mid_block.attentions.0.value.bias, vae.encoder.mid_block.attentions.0.proj_attn.weight, vae.encoder.mid_block.attentions.0.proj_attn.bias, vae.decoder.mid_block.attentions.0.query.weight, vae.decoder.mid_block.attentions.0.query.bias, vae.decoder.mid_block.attentions.0.key.weight, vae.decoder.mid_block.attentions.0.key.bias, vae.decoder.mid_block.attentions.0.value.weight, vae.decoder.mid_block.attentions.0.value.bias, vae.decoder.mid_block.attentions.0.proj_attn.weight, vae.decoder.mid_block.attentions.0.proj_attn.bias
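The two key lists in the log line up one to one (query with to_q, key with to_k, value with to_v, proj_attn with to_out.0), which suggests the checkpoint was saved with different attention-parameter names than the current model expects, possibly from a different diffusers version. Assuming that naming is the only difference, one workaround sketch is to rename the keys in the checkpoint's model state dict before loading. The function name and the plain-dict demo are illustrative; where the model state dict sits inside a Composer checkpoint file is not shown here.

```python
# Old-name to new-name mapping, read directly off the two lists in the log.
RENAMES = {"query": "to_q", "key": "to_k", "value": "to_v", "proj_attn": "to_out.0"}


def remap_vae_attention_keys(state_dict):
    """Return a copy of `state_dict` with old VAE attention names renamed."""
    remapped = {}
    for name, tensor in state_dict.items():
        parts = name.split(".")
        # Only touch keys shaped like vae...attentions.<idx>.<layer>.<weight|bias>
        if (name.startswith("vae.") and len(parts) >= 4
                and parts[-4] == "attentions" and parts[-2] in RENAMES):
            parts[-2] = RENAMES[parts[-2]]
            name = ".".join(parts)
        remapped[name] = tensor
    return remapped


# Demo with placeholder values instead of real tensors.
old = {
    "vae.encoder.mid_block.attentions.0.query.weight": 1,
    "vae.decoder.mid_block.attentions.0.proj_attn.bias": 2,
    "unet.some_other.key.weight": 3,  # hypothetical non-VAE key, left untouched
}
remapped = remap_vae_attention_keys(old)
print(sorted(remapped))
```

Restricting the rename to keys under vae. and attentions keeps unrelated parameters that happen to contain key or value in their names unchanged.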
Hope this finds you well. Very amazing work!!!
As mentioned in https://github.com/mosaicml/diffusion#how-many-gpus-do-i-need, "Our time estimates are based on training Stable Diffusion 2.0 base on 1,126,400,000 images at 256x256 resolution and 1,740,800,000 images at 512x512 resolution".
I have three questions:
Thanks very much!!!
First of all, I want to thank you for your amazing project.
I am training a diffusion model from scratch, but instead of the LAION dataset I am using my own dataset, which contains only 20 million images. I have already finished training stage 1 with image size 256x256 and run some image generation experiments with it. The output images have very good backgrounds, but if the prompt specifies an object, for instance "a dog", the output image contains multiple objects (dogs) instead of one (you can check some samples below).
I know the quality of the model mostly depends on the quality of the dataset, but can you suggest any ideas to improve my model's output?
Some information about my case:
Hi, thanks for this great work.
I have a question regarding encode_latents_in_fp16. If we set encode_latents_in_fp16=False to use fp32, should we expect lower performance compared to fp16? I have tried both and found that images generated by the fp32-based model have lower quality than those from the fp16-based model. Is that expected? Thanks
Best
Hi, after installing everything by following these commands
git clone https://github.com/mosaicml/diffusion.git
cd diffusion
pip install -e .
I was trying to run fid-clip-evaluation.py but got the following error:
Traceback (most recent call last):
File "diffusion/scripts/fid-clip-evaluation.py", line 39, in <module>
coco_val_dataloader = build_streaming_cocoval_dataloader(
File "diffusion/diffusion/datasets/coco/coco_captions.py", line 110, in build_streaming_cocoval_dataloader
dataset = StreamingCOCOCaption(
File "diffusion/diffusion/datasets/coco/coco_captions.py", line 60, in __init__
super().__init__(
File "python3.9/site-packages/streaming/base/dataset.py", line 496, in __init__
self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
File "python3.9/site-packages/streaming/base/shared/prefix.py", line 189, in get_shm_prefix
prefix_int = _check_and_find_retrying(streams_local, streams_remote, retry)
File "python3.9/site-packages/streaming/base/shared/prefix.py", line 162, in _check_and_find_retrying
raise errs[-1]
File "python3.9/site-packages/streaming/base/shared/prefix.py", line 158, in _check_and_find_retrying
return _check_and_find(streams_local, streams_remote)
File "python3.9/site-packages/streaming/base/shared/prefix.py", line 115, in _check_and_find
their_locals, _ = _unpack_locals(bytes(shm.buf))
File "python3.9/site-packages/streaming/base/shared/prefix.py", line 75, in _unpack_locals
return text[:-1], int(text[-1] or 0)
ValueError: invalid literal for int() with base 10: '/tmp/mds-cache/mds-coco-2014-val-fid-clip-17'
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
This error is fixed after installing old packages:
pip install mosaicml==0.14.1
pip install mosaicml-streaming==0.5.0
Hope this finds you well.
I am trying to run laion_cloudwriter.py and precompute_latents.py on the aesthetic-4.5 dataset. However, I found that both tasks take a very long time to finish. For example, laion_cloudwriter.py takes about 80 seconds per parquet file (the row data), while the whole dataset contains about 25K parquet files.
Thanks in advance.
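To put the reported numbers in perspective, here is a small back-of-the-envelope calculation. It assumes every file takes the same 80 seconds and, for the parallel case, that files can be processed independently with perfect scaling; whether the script can actually be run that way is an assumption, not something stated in the post.

```python
seconds_per_file = 80      # reported time per parquet file
num_files = 25_000         # approximate number of parquet files


def total_days(workers=1):
    """Wall-clock days assuming perfectly even, independent work per file."""
    return seconds_per_file * num_files / workers / 86_400


sequential_days = total_days()
print(f"1 worker:   {sequential_days:.1f} days")
print(f"64 workers: {total_days(64) * 24:.1f} hours")
```

Processed one file at a time, the conversion alone would take a little over 23 days, which is why splitting the file list across many workers matters here.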
Here is my config (without some personal paths), run for mosaicml's diffusion:
algorithms:
  low_precision_groupnorm:
    attribute: unet
    precision: amp_fp16
  low_precision_layernorm:
    attribute: unet
    precision: amp_fp16
model:
  _target_: diffusion.models.models.stable_diffusion_2
  model_name: runwayml/stable-diffusion-v1-5
  pretrained: true
  precomputed_latents: true
  encode_latents_in_fp16: true
  # fsdp: false
  fsdp: true
  val_metrics:
    - _target_: torchmetrics.MeanSquaredError
  val_guidance_scales: []
  loss_bins: []
dataset:
  train_batch_size: 2048 # TODO: explore composer config
  eval_batch_size: 16 # Should be 8 per device
  train_dataset:
    _target_: diffusion.datasets.pixta.pixta.build_custom_dataloader
    data_path: ...
    feature_dim: 32
    num_workers: 8
    pin_memory: false
  eval_dataset:
    _target_: diffusion.datasets.pixta.pixta.build_custom_dataloader
    data_path: ...
    feature_dim: 32
    num_workers: 8
    pin_memory: false
optimizer:
  _target_: torch.optim.AdamW
  lr: 1.0e-5
  weight_decay: 0.01
scheduler:
  _target_: composer.optim.ConstantWithWarmupScheduler
  t_warmup: 10ba
logger:
  wandb:
    _target_: composer.loggers.wandb_logger.WandBLogger
    name: ${name}
    project: ${project}
    group: ${name}
callbacks:
  speed_monitor:
    _target_: composer.callbacks.speed_monitor.SpeedMonitor
    window_size: 10
  lr_monitor:
    _target_: composer.callbacks.lr_monitor.LRMonitor
  memory_monitor:
    _target_: composer.callbacks.memory_monitor.MemoryMonitor
  runtime_estimator:
    _target_: composer.callbacks.runtime_estimator.RuntimeEstimator
  optimizer_monitor:
    _target_: composer.callbacks.OptimizerMonitor
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 10ep
  eval_interval: 2ep
  device_train_microbatch_size: 40
  run_name: ${name}
  seed: ${seed}
  scale_schedule_ratio: ${scale_schedule_ratio}
  save_folder: outputs/${project}/${name}
  save_interval: 5ep
  save_overwrite: true
  autoresume: false
  fsdp_config:
    sharding_strategy: "SHARD_GRAD_OP"
    state_dict_type: "full"
    mixed_precision: 'PURE'
    activation_checkpointing: true
Here is my dataset and dataloader code:
class CustomDataset(Array, Dataset):

    def __init__(self,
                 data_path,
                 feature_dim=64):
        self.feature_dim = feature_dim
        index_file = os.path.join(data_path, 'index.json')
        data = json.load(open(index_file))
        if data['version'] != 2:
            raise ValueError(f'Unsupported streaming data version: {data["version"]}. ' +
                             f'Expected version 2.')
        shards = []
        for info in data['shards']:
            shard = reader_from_json(data_path, None, info)
            shards.append(shard)
        self.shards = shards
        samples_per_shard = np.array([shard.samples for shard in shards], np.int64)
        self.length = samples_per_shard.sum()
        self.spanner = Spanner(samples_per_shard)

    def __len__(self):
        return self.length

    @property
    def size(self) -> int:
        """Get the size of the dataset in samples.

        Returns:
            int: Number of samples.
        """
        return self.length

    # def __getitem__(self, index):
    def get_item(self, index):
        shard_id, shard_sample_id = self.spanner[index]
        shard = self.shards[shard_id]
        sample = shard[shard_sample_id]
        out = {}
        if 'caption_latents' in sample:
            out['caption_latents'] = torch.from_numpy(
                np.frombuffer(sample['caption_latents'], dtype=np.float16).copy()).reshape(77, 768)
        if 'image_latents' in sample:
            out['image_latents'] = torch.from_numpy(np.frombuffer(sample['image_latents'],
                dtype=np.float16).copy()).reshape(4, self.feature_dim, self.feature_dim)
        return out


def build_custom_dataloader(
    batch_size: int,
    data_path: str,
    image_root: str = None,
    tokenizer_name_or_path: str = 'runwayml/stable-diffusion-v1-5',
    caption_drop_prob: float = 0.0,
    resize_size: int = 512,
    feature_dim: int = 64,
    drop_last: bool = True,
    shuffle: bool = True,  # TODO pass shuffle to dataloader
    **dataloader_kwargs,
):
    print('Using precomputed features!!!')
    dataset = CustomDataset(data_path, feature_dim=feature_dim)
    drop_last = False
    if isinstance(dataset, IterableDataset):
        print('Using IterableDataset!!!')
        sampler = None
    else:
        print('Using Sampler!!!')
        sampler = dist.get_sampler(dataset, drop_last=drop_last, shuffle=shuffle)
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        sampler=sampler,
        drop_last=drop_last,
        shuffle=shuffle if sampler is None else False,
        **dataloader_kwargs,
    )
    return dataloader
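In the dataset code above, self.spanner[index] turns a global sample index into a shard id plus an index within that shard. As a rough illustration of that lookup (a standard-library stand-in, not the streaming library's Spanner implementation), cumulative shard sizes plus a binary search are enough. The function name locate is made up for this sketch.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(samples_per_shard, index):
    """Map a global sample index to (shard_id, index_within_shard)."""
    total = sum(samples_per_shard)
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for {total} samples")
    # ends[i] is the exclusive global end offset of shard i.
    ends = list(accumulate(samples_per_shard))
    # First shard whose end offset is strictly greater than the index.
    shard_id = bisect_right(ends, index)
    start = ends[shard_id - 1] if shard_id else 0
    return shard_id, index - start


# Three shards holding 3, 2 and 4 samples respectively.
sizes = [3, 2, 4]
lookups = [locate(sizes, i) for i in (0, 2, 3, 8)]
print(lookups)
```

If a worker dies only when a new epoch starts, checking a few boundary indices like these against the real shards (first and last sample of each shard) may help rule out an indexing problem.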
And I got this error after finishing epoch 0, at the start of epoch 1:
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py", line 262, in poll
return self._poll(timeout)
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
r = wait([self], timeout)
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py", line 936, in wait
ready = selector.select(timeout)
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3007309) is killed by signal: Aborted.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/tungduongquang/workspace/mosaicml/image-generation/run.py", line 26, in <module>
main()
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/tungduongquang/workspace/mosaicml/image-generation/run.py", line 22, in main
return train(config)
File "/home/tungduongquang/workspace/mosaicml/image-generation/diffusion/train.py", line 134, in train
return eval_and_then_train()
File "/home/tungduongquang/workspace/mosaicml/image-generation/diffusion/train.py", line 132, in eval_and_then_train
trainer.fit()
File "/home/tungduongquang/workspace/mosaicml/composer/composer/trainer/trainer.py", line 1796, in fit
self._train_loop()
File "/home/tungduongquang/workspace/mosaicml/composer/composer/trainer/trainer.py", line 1938, in _train_loop
for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
File "/home/tungduongquang/workspace/mosaicml/composer/composer/trainer/trainer.py", line 2924, in _iter_dataloader
batch = next(dataloader_iter)
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
success, data = self._try_get_data()
File "/home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3007309) exited unexpectedly
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/datal │
│ oader.py:1120 in _try_get_data │
│ │
│ 1117 │ │ # Returns a 2-tuple: │
│ 1118 │ │ # (bool: whether successfully get data, any: data if successful else None) │
│ 1119 │ │ try: │
│ ❱ 1120 │ │ │ data = self._data_queue.get(timeout=timeout) │
│ 1121 │ │ │ return (True, data) │
│ 1122 │ │ except Exception as e: │
│ 1123 │ │ │ # At timeout and error, we manually check whether any worker has │
│ │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/queues.py:113 in get │
│ │
│ 110 │ │ │ try: │
│ 111 │ │ │ │ if block: │
│ 112 │ │ │ │ │ timeout = deadline - time.monotonic() │
│ ❱ 113 │ │ │ │ │ if not self._poll(timeout): │
│ 114 │ │ │ │ │ │ raise Empty │
│ 115 │ │ │ │ elif not self._poll(): │
│ 116 │ │ │ │ │ raise Empty │
│ │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py:262 in │
│ poll │
│ │
│ 259 │ │ """Whether there is any input available to be read""" │
│ 260 │ │ self._check_closed() │
│ 261 │ │ self._check_readable() │
│ ❱ 262 │ │ return self._poll(timeout) │
│ 263 │ │
│ 264 │ def __enter__(self): │
│ 265 │ │ return self │
│ │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py:429 in │
│ _poll │
│ │
│ 426 │ │ return self._recv(size) │
│ 427 │ │
│ 428 │ def _poll(self, timeout): │
│ ❱ 429 │ │ r = wait([self], timeout) │
│ 430 │ │ return bool(r) │
│ 431 │
│ 432 │
│ │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/multiprocessing/connection.py:936 in │
│ wait │
│ │
│ 933 │ │ │ │ deadline = time.monotonic() + timeout │
│ 934 │ │ │ │
│ 935 │ │ │ while True: │
│ ❱ 936 │ │ │ │ ready = selector.select(timeout) │
│ 937 │ │ │ │ if ready: │
│ 938 │ │ │ │ │ return [key.fileobj for (key, events) in ready] │
│ 939 │ │ │ │ else: │
│ │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/selectors.py:416 in select │
│ │
│ 413 │ │ │ timeout = math.ceil(timeout * 1e3) │
│ 414 │ │ ready = [] │
│ 415 │ │ try: │
│ ❱ 416 │ │ │ fd_event_list = self._selector.poll(timeout) │
│ 417 │ │ except InterruptedError: │
│ 418 │ │ │ return ready │
│ 419 │ │ for fd, event in fd_event_list: │
│ │
│ /home/tungduongquang/miniconda3/envs/mosaicml/lib/python3.9/site-packages/torch/utils/data/_util │
│ s/signal_handling.py:66 in handler │
│ │
│ 63 │ def handler(signum, frame): │
│ 64 │ │ # This following call uses `waitid` with WNOHANG from C side. Therefore, │
│ 65 │ │ # Python can still get and update the process status successfully. │
│ ❱ 66 │ │ _error_if_any_worker_fails() │
│ 67 │ │ if previous_handler is not None: │
│ 68 │ │ │ assert callable(previous_handler) │
│ 69 │ │ │ previous_handler(signum, frame) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: DataLoader worker (pid 3007309) is killed by signal: Aborted.
Please help!
Hi, thanks for this contribution.
As a small exercise, I am training SD2 on the pokemon dataset.
I precomputed the latents, and training starts on one GPU.
However, at evaluation time I get the following error:
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2814, in _eval_loop
self.state.outputs = self._original_model.eval_forward(self.state.batch)
File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 255, in eval_forward
gen_images = self.generate(tokenized_prompts=prompts,
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 464, in generate
pred = self.unet(latent_model_input,
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 934, in forward
sample = self.conv_in(sample)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (162 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
This is my configuration:
name: trial0 # Insert wandb run name
project: pokemon_sd2_256 # Insert wandb project name
seed: 17
eval_first: false
algorithms:
  low_precision_groupnorm:
    attribute: unet
    precision: amp_fp16
  low_precision_layernorm:
    attribute: unet
    precision: amp_fp16
model:
  _target_: diffusion.models.models.stable_diffusion_2
  pretrained: false
  precomputed_latents: true
  encode_latents_in_fp16: true
  fsdp: true
  val_metrics:
    - _target_: torchmetrics.MeanSquaredError
    - _target_: torchmetrics.image.fid.FrechetInceptionDistance
      normalize: true
  val_guidance_scales: [3, 7]
  # val_guidance_scales: []
  loss_bins: []
dataset:
  train_batch_size: 1 # Global training batch size
  eval_batch_size: 1 # Global evaluation batch size
  train_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
    # Path to object store bucket(s)
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_train
    # Path to corresponding local dataset(s)
    mode: 0
    version: 2
    drop_last: False
    shuffle: true
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: true
    pin_memory: true
  eval_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_eval # Path to local dataset cache
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: True
    pin_memory: True
    mode: 0
    version: 2
optimizer:
  _target_: torch.optim.AdamW
  lr: 1.0e-5
  weight_decay: 0.01
scheduler:
  _target_: composer.optim.LinearWithWarmupScheduler
  t_warmup: 1000ba
  alpha_f: 1.0
logger:
  comet-ml:
    _target_: composer.loggers.cometml_logger.CometMLLogger
    name: ${name}
    project_name: ${project}
callbacks:
  speed_monitor:
    _target_: composer.callbacks.speed_monitor.SpeedMonitor
    window_size: 10
  lr_monitor:
    _target_: composer.callbacks.lr_monitor.LRMonitor
  memory_monitor:
    _target_: composer.callbacks.memory_monitor.MemoryMonitor
  runtime_estimator:
    _target_: composer.callbacks.runtime_estimator.RuntimeEstimator
  optimizer_monitor:
    _target_: composer.callbacks.OptimizerMonitor
  image_monitor:
    _target_: diffusion.callbacks.log_diffusion_images.LogDiffusionImages
    prompts: # add any prompts you would like to visualize
      - cute dragon creature
    size: 256 # generated image resolution
    guidance_scale: 3
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 550000ba
  eval_interval: 1000ba
  device_train_microbatch_size: 1
  run_name: ${name}
  seed: ${seed}
  save_folder: trained_model # Insert path to save folder or bucket
  save_interval: 3000ba
  save_overwrite: true
  autoresume: false
  # fsdp_config:
  #   sharding_strategy: "SHARD_GRAD_OP"
When must I set autoresume=true?
If I set save_overwrite=false, it says:
FileExistsError: /home/mnt/diffmodels/ep0-ba1000-rank0.pt may conflict with a future checkpoint of the current run. Please delete that file, change to a new folder, or set overwrite=True.
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: View run sd-model at: https://wandb.ai/xxxx/diffusion/runs/hpzkfedp
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230701_152030-hpzkfedp/logs
If I set save_overwrite=true, it says:
InstantiationException: Error in call to target 'composer.trainer.trainer.Trainer':
ValueError('The flag `save_overwrite` must be False when autoresume is enabled as autoresume always loads the latest existing checkpoint in `save_folder`. ')
full_key: trainer
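The constraint quoted in the ValueError above can be checked before launching a run. The following is a minimal sketch, not Composer's own code: `check_checkpoint_flags` is a hypothetical helper that only mirrors the rule stated in the error message (with autoresume enabled, save_overwrite must be False).

```python
def check_checkpoint_flags(autoresume: bool, save_overwrite: bool) -> None:
    """Raise early if the flag combination quoted in the error message is used.

    Hypothetical helper for illustration; it only restates the rule from the
    ValueError text and is not part of any library.
    """
    if autoresume and save_overwrite:
        raise ValueError(
            "save_overwrite must be False when autoresume is enabled, because "
            "autoresume always loads the latest existing checkpoint in save_folder."
        )


# Accepted: resuming without overwriting, or overwriting without resuming.
check_checkpoint_flags(autoresume=True, save_overwrite=False)
check_checkpoint_flags(autoresume=False, save_overwrite=True)
```

Going by the two messages quoted here, the remaining combination (both flags off) is the one that can hit the FileExistsError when the save folder already holds a checkpoint with the same name.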
In SD-256.yaml, the learning rate is set to 1e-4. Will the learning rate be scaled up as the number of GPUs increases? For example, if I train the model on 64 GPUs, will the learning rate be scaled to 64*1e-4 automatically?
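Whether the trainer rescales the learning rate automatically is not answered in this thread. If one wanted to apply the common linear scaling rule by hand (learning rate proportional to the global batch size), a sketch could look like the following. The function name and the batch sizes are hypothetical; only the base value 1e-4 and the factor of 64 come from the question.

```python
def linearly_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale a learning rate in proportion to the global batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Hypothetical setup: the per-GPU batch stays fixed at 32 while going from
# 1 GPU to 64 GPUs, so the global batch size grows by a factor of 64.
scaled = linearly_scaled_lr(1e-4, base_batch_size=32, new_batch_size=32 * 64)
print(scaled)
```

If the global batch size in the yaml is kept constant while GPUs are added, this rule would leave the learning rate unchanged, since only the per-GPU share of the batch shrinks.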
Can you share the weights of the trained model?
Also, is there a technical report that explains the building blocks and quantifies the improvements?
Hi, is it correct to set milestones to 200ep? Thanks.
milestones:
- 200ep
Hi, I am trying to run the stage 2 training by loading the checkpoint from stage 1. It works on a single GPU, but fails on 8 GPUs with the following error. Any ideas? Thanks.
Traceback (most recent call last) ────────────────────────────────╮
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py:92 in │
mosaic_512/0 [0]:│ _call_target │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ 89 │ │ │ raise InstantiationException(msg) from e │
mosaic_512/0 [0]:│ 90 │ else: │
mosaic_512/0 [0]:│ 91 │ │ try: │
mosaic_512/0 [0]:│ ❱ 92 │ │ │ return _target_(*args, **kwargs) │
mosaic_512/0 [0]:│ 93 │ │ except Exception as e: │
mosaic_512/0 [0]:│ 94 │ │ │ msg = f"Error in call to target '{_convert_target_to_string(_target_)}':\n{r │
mosaic_512/0 [0]:│ 95 │ │ │ if full_key: │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/trainer/trainer.py:1330 in __init__ │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ 1327 │ │ │ │ if wandb.run is None: │
mosaic_512/0 [0]:│ 1328 │ │ │ │ │ load_object_store.init(self.state, self.logger) │
mosaic_512/0 [0]:│ 1329 │ │ │ _, _, parsed_load_path = parse_uri(load_path) │
mosaic_512/0 [0]:│ ❱ 1330 │ │ │ self._rng_state = checkpoint.load_checkpoint( │
mosaic_512/0 [0]:│ 1331 │ │ │ │ state=self.state, │
mosaic_512/0 [0]:│ 1332 │ │ │ │ logger=self.logger, │
mosaic_512/0 [0]:│ 1333 │ │ │ │ path=parsed_load_path, │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/checkpoint.py:205 in load_checkpoint │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ 202 │ │ │ # Get the path to the proper checkpoint folder corresponding to the current │
mosaic_512/0 [0]:│ 203 │ │ │ # If fsdp_sharded_state_dict_enabled then just use that rank's unique tempdi │
mosaic_512/0 [0]:│ 204 │ │ │ node_checkpoint_folder = (tempdir │
mosaic_512/0 [0]:│ ❱ 205 │ │ │ │ │ │ │ │ │ if state.fsdp_sharded_state_dict_enabled else _get │
mosaic_512/0 [0]:│ 206 │ │ │ assert node_checkpoint_folder is not None │
mosaic_512/0 [0]:│ 207 │ │ │ │
mosaic_512/0 [0]:│ 208 │ │ │ composer_states_filepath, extracted_checkpoint_folder, extracted_rank_n = do │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/checkpoint.py:239 in │
mosaic_512/0 [0]:│ _get_local_rank_zero_path │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ 236 │
mosaic_512/0 [0]:│ 237 def _get_local_rank_zero_path(path: Optional[str]) -> str: │
mosaic_512/0 [0]:│ 238 │ """Broadcasts the ``path`` from the LOCAL rank zero to all LOCAL ranks.""" │
mosaic_512/0 [0]:│ ❱ 239 │ local_rank_zero = dist.get_local_world_size() * dist.get_node_rank() │
mosaic_512/0 [0]:│ 240 │ paths = dist.all_gather_object(path) │
mosaic_512/0 [0]:│ 241 │ local_rank_zero_path = paths[local_rank_zero] │
mosaic_512/0 [0]:│ 242 │ assert local_rank_zero_path is not None, 'local rank zero provides the path' │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/dist.py:155 in get_node_rank │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ 152 │ Returns: │
mosaic_512/0 [0]:│ 153 │ │ int: The node rank, starting at 0. │
mosaic_512/0 [0]:│ 154 │ """ │
mosaic_512/0 [0]:│ ❱ 155 │ return _get_distributed_config_var(env_var='NODE_RANK', default=0, human_name='node │
mosaic_512/0 [0]:│ 156 │
mosaic_512/0 [0]:│ 157 │
mosaic_512/0 [0]:│ 158 def barrier() -> None: │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ /opt/conda/lib/python3.9/site-packages/composer/utils/dist.py:101 in _get_distributed_config_var │
mosaic_512/0 [0]:│ │
mosaic_512/0 [0]:│ 98 │ │ return int(os.environ[env_var]) │
mosaic_512/0 [0]:│ 99 │ │
mosaic_512/0 [0]:│ 100 │ if dist.is_initialized(): │
mosaic_512/0 [0]:│ ❱ 101 │ │ raise MissingEnvironmentError('Torch distributed is initialized but environment │
mosaic_512/0 [0]:│ 102 │ │ │ │ │ │ │ │ │ f'{env_var} is not set.') │
mosaic_512/0 [0]:│ 103 │ │
mosaic_512/0 [0]:│ 104 │ return default │
mosaic_512/0 [0]:╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
mosaic_512/0 [0]:MissingEnvironmentError: Torch distributed is initialized but environment variable NODE_RANK is not set.
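A quick way to see which launcher variables are missing before the trainer is built is sketched below. Only NODE_RANK is named in the traceback above; the other variable names in the tuple are assumptions about what a distributed launcher typically sets, and `missing_dist_env_vars` is a hypothetical helper.

```python
import os

# NODE_RANK comes from the traceback; the remaining names are assumptions.
EXPECTED_DIST_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK", "LOCAL_WORLD_SIZE", "NODE_RANK")


def missing_dist_env_vars(environ=None):
    """Return the expected distributed variables that are absent from the environment."""
    env = os.environ if environ is None else environ
    return [name for name in EXPECTED_DIST_VARS if name not in env]


# Example with an explicit mapping in place of the real process environment.
example_env = {"RANK": "0", "WORLD_SIZE": "8", "LOCAL_RANK": "0", "LOCAL_WORLD_SIZE": "8"}
print(missing_dist_env_vars(example_env))
```

Running the check at the top of the training script on every rank shows whether the launcher in use exports the variable at all, or only on some ranks.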
When can we expect H100 support? I have tried building the environment based on CUDA 11.8 and 12.0. There seem to be some issues related to package discrepancies. Any suggestions for now?
Hi, I found that the current mainline code produces reasonable FID scores for pre-trained models, but very high FID scores for models that are pre-trained using this codebase.
For example, I have a checkpoint that is pre-trained on the LAION dataset and get the following FID scores by using fid-clip-evaluation.py:
Mainline -> 18.46875
ejyuen-patch-1 -> 14.32812
For another checkpoint, I get the following results:
Mainline -> 21.46875
ejyuen-patch-1 -> 15.89062
Note that for all of these FID calculations, I use the same COCO2014-10K dataset.
Frustrating: after training for about 1654ba, the run crashed and failed to save the checkpoint. I tried twice.
The error is as follows:
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) ran for 302714 milliseconds before timing out.
train 4%|▉ /home/anaconda3/envs/control/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 35121) has still not exited; return exit code 1.
Hi, your code sets COCO2014 by default to calculate FID and CLIP score.
May I know which COCO dataset should be used? COCO 2014 or COCO2017? Thanks
Nice work!
laion_cloudwriter.py successfully converted the format from parquet to mds, but failed to generate the index.json file, which is necessary for the dataloader in precompute_latents.py.
Is this a bug? Thanks for help.
Hi, I get the following error when I use fid-clip-evaluation.py with size=512, but it works with size=256. Any suggestions? Thanks.
Hi, I was wondering why I see the following log when using stable_diffusion_2? The training code isn't supposed to load openai--clip-vit-large-patch14, is it?
mosaic/0 [0]:[INFO|configuration_utils.py:712] 2023-10-03 22:40:58,340 >> loading configuration file config.json from cache at .cache/huggingface/hub/models--openai--clip-vit-large-patch14/snapshots/32bd64288804d66eefd0ccbe215aa642df71cc41/config.json
hydra.errors.InstantiationException: Error in call to target 'diffusion.datasets.laion.laion.build_streaming_laion_dataloader':
TypeError("'NoneType' object is not iterable")
full_key: dataset.train_dataset
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 247336) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 247336) exited with code 1
I see it referenced in the source code as oci://mosaicml-internal-checkpoints/stable-diffusion-hero-run/4-13-512-ema/ep5-ba850000-rank0.pt. It would be nice to try it.
Hi, I got NaN loss during the training of SD-2-base-256.yaml
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert: [DEBUG] break_graph_if_unsupported triggered compile
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: call_method SetVariable() __contains__ [UnspecializedNNModuleVariable(Linear)] {} from user code at:
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/Desktop/sdpa.py", line 911, in <resume in train>
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] m = M(linear, encode=encode)
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/Desktop/sdpa.py", line 893, in __init__
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] self.linear.requires_grad_(False)
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2439, in requires_grad_
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] for p in self.parameters():
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2192, in parameters
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] for name, param in self.named_parameters(recurse=recurse):
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2223, in named_parameters
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] gen = self._named_members(
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2159, in _named_members
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] modules = self.named_modules(prefix=prefix, remove_duplicate=remove_duplicate) if recurse else [(prefix, self)]
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/home/jonch/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2369, in named_modules
[2023-10-23 14:05:10,051] [1/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] if self not in memo:
composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml
[2023-06-13 20:29:52,077][composer.utils.reproducibility][INFO] - Setting seed to 17
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.__init__() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:composer.cli.launcher:Rank 3 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 3 (PID 40553) exited with code 1
----------Begin global rank 3 STDOUT----------
[2023-06-13 20:29:52,032][composer.utils.reproducibility][INFO] - Setting seed to 17
----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.__init__() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 40550) exited with code -15
Hi, I am trying to use autoresume to continue training my failed jobs, but get the following error:
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
When I use a single node to train a model, save a checkpoint, and set autoresume=True to continue the training on a single node, it works.
However, when I use 16 nodes to train a model, save a checkpoint, and use 1 or 16 nodes to autoresume, I get the aforementioned error.
I googled it, but only found this Stack Overflow question. Same error, but no answer yet.
Hi, for example, if I am training a job using this yaml, how do I continue training if the job fails? Thanks.
Hi, thank all of you guys for releasing such a masterpiece on training SD.
Currently, I am wondering what the benefit is of training SD in 2 phases (256 and 512 in size). Is the first stage like a warm-up or something? What if I start directly with 512x512 training images?
I am researching SD and it is awesome to dive into training techniques.
Hope to get a reply soon.
Peace.