wangt-cn / disco
[CVPR2024] DisCo: Referring Human Dance Generation in Real World
Home Page: https://disco-dance.github.io/
License: Apache License 2.0
Where is the pose encoder in your code?
Hi, Thanks a lot for this great work!
I am trying to run the finetuning code with the provided instructions, but the code references a lot of data that I don't have (TikTok data, etc.). Is all of this data needed for finetuning? Could you perhaps clarify the structure of the finetune data and how to reference it?
Thanks!
Hello, thank you for the great code. However, I have some slight concerns about the image quality,
so I wanted to ask if it's possible to replace the sd-image-variations-diffusers model with another model.
It seems difficult to make an immediate change due to the image_encoder file.
Thank you, and I hope you have a wonderful day.
How can I mask the foreground and background? Is there an easy utility for that?
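A minimal sketch (not a repo utility; the file names are placeholders): given a binary person mask such as those produced by the grounded-sam preprocessing, the foreground and background can be split with NumPy.
# Sketch only: split an image into foreground/background given a binary mask.
import numpy as np
from PIL import Image

img = np.array(Image.open("frame.png").convert("RGB"))
mask = np.array(Image.open("frame.mask.png").convert("L")) > 127   # binary person mask

fg = img * mask[..., None]        # keep the person, zero out the background
bg = img * (~mask)[..., None]     # keep the background, zero out the person

Image.fromarray(fg).save("frame_fg.png")
Image.fromarray(bg).save("frame_bg.png")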
Whichever port I use for multi-GPU inference, I always get an "address already in use" error.
For example, I set "export MASTER_PORT=65530" before running inference on multiple GPUs, and then I get an error as follows:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:65530 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
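One generic workaround (a sketch, independent of the DisCo code) is to let the OS pick a currently free port and export that as MASTER_PORT before launching:
# Sketch: ask the kernel for a free TCP port, e.g.
#   export MASTER_PORT=$(python find_port.py)
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))               # port 0 lets the OS choose any free port
        return s.getsockname()[1]

if __name__ == "__main__":
    print(find_free_port())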
When I run the Google Colab notebook DisCo_Demo.ipynb, the following error occurred.
ModuleNotFoundError Traceback (most recent call last)
[<ipython-input-25-7aa731641df4>](https://localhost:8080/#) in <cell line: 6>()
4
5 from utils.wutils_ldm import *
----> 6 from agent import Agent_LDM, WarmupLinearLR, WarmupLinearConstantLR
7 import torch
8 from config import BasicArgs
3 frames
[/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py](https://localhost:8080/#) in <module>
16
17 import torch
---> 18 from torch._six import inf
19 import torch.distributed as dist
20
ModuleNotFoundError: No module named 'torch._six'
However, 'torch._six' is only available in torch==1.7.0 or earlier. So I tried to downgrade torch, and the following error occurred:
!pip install pip install torch==1.7.0
Requirement already satisfied: pip in /usr/local/lib/python3.10/dist-packages (23.1.2)
Collecting install
Downloading install-1.3.5-py3-none-any.whl (3.2 kB)
ERROR: Could not find a version that satisfies the requirement torch==1.7.0 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0)
ERROR: No matching distribution found for torch==1.7.0
Please update the Google Colab notebook so it works with torch versions 2.0.0 or newer.
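One common workaround (a sketch, not an official fix from this repo) is to register a small torch._six shim before deepspeed is imported, since newer torch releases removed that private module:
# Hedged compatibility shim: newer torch versions dropped torch._six, which
# older deepspeed releases still import. Register a stand-in before importing
# deepspeed so `from torch._six import inf` resolves.
import math
import sys
import types

import torch

if not hasattr(torch, "_six"):
    six_shim = types.ModuleType("torch._six")
    six_shim.inf = math.inf                 # deepspeed's utils.py only needs `inf`
    six_shim.string_classes = (str, bytes)  # some older code paths also import this
    torch._six = six_shim
    sys.modules["torch._six"] = six_shim

import deepspeed  # noqa: E402  (must come after the shim is registered)
Alternatively, upgrading deepspeed to a release that no longer imports torch._six avoids the shim entirely.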
I set up my dataset like the toy_dataset you provided. However, it fails to run; I encountered this problem:
Original Traceback (most recent call last):
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/DisCo/dataset/tiktok_controlnet_t2i_imagevar_combine_specifcimg_web_upsquare.py", line 569, in getitem
raw_data = self.get_img_txt_pair(idx)
File "/data/DisCo/dataset/tiktok_controlnet_t2i_imagevar_combine_specifcimg_web_upsquare.py", line 512, in get_img_txt_pair
anno = list(open(anno_path))
FileNotFoundError: [Errno 2] No such file or directory: './719__242.png'
These are the parameters I used:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 --allow-run-as-root python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /data/DisCo --local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" --train_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
and this is the loss I got after Human Attribute Pre-training:
Metering:{'loss_total': '0.0667'}: 100%|██████████| 55280/55280 [81:48:04<00:00, 5.33s/it]
I noticed that in your paper you mention: "All pre-training experiments are conducted on 4x8 NVIDIA V100 GPUs for 25K steps with image size 256×256 and learning rate 1e−3."
Because I only have one-fourth of the number of GPUs you used, should I reduce the learning rate to one-fourth of 1e-3?
What was your loss after training at this stage? Thank you!
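For reference, a commonly used heuristic (not an official DisCo recommendation) is the linear scaling rule: scale the learning rate with the effective batch size. With a quarter of the GPUs and the same per-GPU batch size, that gives roughly 2.5e-4, or equivalently the original 1e-3 with 4x gradient accumulation:
# Back-of-the-envelope linear LR scaling (a heuristic, not from the paper).
# Assumed numbers: the paper pre-trains on 4x8 V100s; this run uses 8 GPUs.
ref_gpus, my_gpus = 32, 8
local_batch = 64                       # --local_train_batch_size
base_lr = 1e-3                         # paper's pre-training learning rate

scaled_lr = base_lr * (my_gpus * local_batch) / (ref_gpus * local_batch)
print(scaled_lr)                       # 0.00025 -> 2.5e-4

# Alternative: keep --learning_rate 1e-3 and set --gradient_accumulate_steps 4
# so the effective batch size matches the 32-GPU setup.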
I appreciate your great work, @Wangt-CN! How can I deploy the model for inference locally?
Could you please release the inference code? Thanks a lot :)
Hi, thanks for the great work.
My GPU: 2080ti * 10
AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 mpirun -np 8 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir run_test \ --local_train_batch_size 8 --local_eval_batch_size 8 --log_dir exp/tiktok_pretrain \ --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \ --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \ --train_yaml /data/mfyan/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml /data/mfyan/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \ --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \ --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
The first error reported was: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12475 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12475 (errno: 98 - Address already in use).
Then I changed the port number in utils/dist.py to something else and found that the same type of error was still reported, so I changed the port number to random.randint(10000, 20000), and that worked. But then I found all 8 processes running only on GPU 0, resulting in RuntimeError: CUDA error: out of memory.
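A frequent cause of every rank landing on GPU 0 is that the per-rank CUDA device is never selected before the model is built; a minimal sketch of the usual fix (the environment variable names are assumptions about what the launcher exposes, not DisCo-specific code):
import os
import torch

# Pin each process to its own GPU before building the model.
local_rank = int(os.environ.get("LOCAL_RANK",
                                os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)))
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)   # then move the model/tensors to `device`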
Hi, thanks for your great work.
I only changed '--root_dir', '--pretrained_model', and '--pretrained_model_path' according to my local settings. But when I run this cell in human_img_edit_gradio.ipynb:
## prepare the eval
logger.warning("Do eval_visu...")
if getattr(args, 'refer_clip_preprocess', None):
    eval_dataset = BaseDataset(args, args.val_yaml, split='val', preprocesser=model.feature_extractor)
else:
    eval_dataset = BaseDataset(args, args.val_yaml, split='val')
eval_dataloader, eval_info = make_data_loader(
    args, args.local_eval_batch_size,
    eval_dataset)
trainer = Agent_LDM(args=args, model=model)
trainer.eval_demo_pre()
it seems many of the ControlNet weights fail to load from mp_rank_00_model_states.pt. And after the Gradio app launches, the background and pose controls do not work.
Please help me out, thanks.
cf = import_filename(args.cf)
Net, inner_collect_fn = cf.Net, cf.inner_collect_fn
Here cf is config/ref_attn_clip_combine_controlnet/app_demo_image_edit.py, but app_demo_image_edit.py does not define Net or inner_collect_fn.
Great work @Wangt-CN. Currently the repo is using 2D keypoints to control the pose of output. Is it possible to replace pose with canny or depth based control? If yes, would it require only replacing controlnet model or retraining complete model? Thanks.
[2023-07-04 16:13:34 <finetune_sdm_yaml.py:89> main_worker] Building models...
[2023-07-04 16:13:34 <finetune_sdm_yaml.py:89> main_worker] Building models...
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 355, in load_config
config_file = hf_hub_download(
File "/root/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 112, in _inner_fn
validate_repo_id(arg_value)
File "/root/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers'. Use repo_type argument if needed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/DisCo/finetune_sdm_yaml.py", line 209, in
main_worker(parsed_args)
File "/data/DisCo/finetune_sdm_yaml.py", line 90, in main_worker
model = Net(args)
File "/data/DisCo/config/ref_attn_clip_combine_controlnet_attr_pretraining/net.py", line 38, in init
tr_noise_scheduler = DDPMScheduler.from_pretrained(
File "/root/anaconda3/lib/python3.10/site-packages/diffusers/schedulers/scheduling_utils.py", line 139, in from_pretrained
config, kwargs, commit_hash = cls.load_config(
File "/root/anaconda3/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 391, in load_config
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like /home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.
How can I fix this? Thanks.
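One workaround (a sketch, assuming the weights come from the lambdalabs/sd-image-variations-diffusers repo on the Hugging Face Hub) is to download them into the local directory the config expects and point --pretrained_model_path there:
# Sketch: materialize the diffusers checkpoint locally so from_pretrained
# resolves a real folder instead of treating the path as a Hub repo id.
# Adjust local_dir to match your own --root_dir layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lambdalabs/sd-image-variations-diffusers",
    local_dir="./run_test/diffusers/sd-image-variations-diffusers",
)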
Thanks for great work. @Wangt-CN
I tried to reproduce the results using "gen_eval.sh," but I noticed that the FID-VID and FVD do not match the results reported in the paper. Can you help me with this issue? Is it possible that I am using the incorrect checkpoints?
download checkpoints:
pth : TikTok Training Data (FID-FVD: 18.8)
FID-VID:resnet-50-kinetics.pth : "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth"
FVD: i3d_pretrained_400.pt : "https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit"
Hi, thanks for your great work. I checked the TikTok tsv dataset and found that you've already split it into a training set and a validation set. Since it's not easy to match each image back to its original sequence id, could you please clarify which sequences from the original TikTok dataset (from 000 to 340) are used for training and which are used for validation? Thanks!
Thanks for sharing this great work!
I see that you uploaded the training data to Google Cloud in tsv format. It is inconvenient for me to download the data from Google Cloud. Could you please upload a copy of the data to another cloud storage service, such as Google Drive, Aliyun, or Baidu Yun?
Thanks
Your work is very good and I would love to try it, but I found that the URL does not open. Could you make it accessible again?
Hi, @Wangt-CN, first off, great work!!
I want to run inference through code, not Gradio. I tried searching for the function to do that; the closest I found is Agent_LDM, but it takes a reference foreground, background, and skeleton. Is there a function which just takes an image (with a character in it) and a skeleton, and returns the output?
Additionally, is there any function for end-to-end video generation?
Thanks
Just another: Will the code be compatible with PyTorch 2?
What are your criteria for choosing the reference image in 1) pre-training, 2) general fine-tuning, and 3) human-specific fine-tuning, respectively? Are they all the first images of a dataset?
When running the Gradio Demo, this error kept occurring while loading the pre-trained UNet: TypeError: get_down_block() got an unexpected keyword argument 'attn_num_head_channels'
Looking at how the other parameters were named, I tried changing it to 'attention_head_dim'; however, that then produced this error: TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'
Once I expanded the error and viewed it in full, I noticed num_attention_heads was mentioned, yet this name was not present in any of the scripts. Therefore, I tried changing the parameter name to this and the code ran successfully.
Hence, all instances of attn_num_head_channels in the following scripts need to be changed to num_attention_heads:
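A quick way to apply such a rename across a checkout (a rough sketch, not a supported repo tool; commit or back up first) is:
# Rewrite every occurrence of the old diffusers kwarg name under the repo root.
import pathlib

OLD, NEW = "attn_num_head_channels", "num_attention_heads"

for path in pathlib.Path(".").rglob("*.py"):
    text = path.read_text(encoding="utf-8")
    if OLD in text:
        path.write_text(text.replace(OLD, NEW), encoding="utf-8")
        print(f"patched {path}")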
Hello,
I would like to extract the skeleton key-point information from your processed data.
After reading the pose image data from the tsv file, numerical analysis shows that the pixel values at the corresponding positions do not match the RGB values of the colors set in create_custom_dataset_tsvs.py. It looks as if the pose images were stored in the tsv with lossy compression?
Looking forward to your reply, thank you!
I modified the args and ran
python ./annotator/grounded-sam/run.py --dataset_root ./single/ --partition 1
Under the groundsam_vis folder, I got 001.png.mask.jpg and 001.png.mask.png. The original size of the picture is 540 x 960; 001.png.mask.png is also 540 x 960, but it is all black. 001.png.mask.jpg's foreground color is yellow and its background is purple,
but its size is 1299 x 2310.
AttributeError: 'DDPMScheduler' object has no attribute 'remove_noise'
Thanks for your great work!
I am curious about how this model handles video frame consistency; the paper does not seem to address this issue.
I tried the video pose transfer and got the result below (far from what the paper shows; am I missing some steps?):
Can I use high-resolution samples for fine-tuning to get high-resolution results?
Hello, thanks for this great work! When I was trying to run through the code
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
--epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
--learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
--train_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
--unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
--conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
I got the following exception:
Traceback (most recent call last):
File "finetune_sdm_yaml.py", line 209, in <module>
main_worker(parsed_args)
File "finetune_sdm_yaml.py", line 135, in main_worker
trainer.setup_model_for_training()
File "/data1/tao.wu/DisCo/agent.py", line 978, in setup_model_for_training
self.prepare_dist_model()
File "/data1/tao.wu/DisCo/agent.py", line 205, in prepare_dist_model
lr_scheduler=self.scheduler)
File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/__init__.py", line 181, in initialize
config_class=config_class)
File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1196, in _configure_optimizer
raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'torch.optim.adamw.AdamW'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.
I wonder what may cause such exception, could anyone help me out? Thanks a lot!
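The exception message itself suggests two remedies; a sketch of each (standard DeepSpeed options, the surrounding DisCo plumbing is assumed):
# (a) Swap the client optimizer for DeepSpeed's CPU Adam when ZeRO-Offload is on.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, weight_decay=0.01)

# (b) Or keep torch.optim.AdamW and acknowledge the performance warning in the
#     DeepSpeed config passed to deepspeed.initialize:
ds_config_override = {
    "zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}},
    "zero_force_ds_cpu_optimizer": False,
}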
In the grounded-sam section of PREPRO.md, the command for preprocessing images is wrong; I think it should be something like run_local_test.sh.
run_local_test.sh processes a single image, but openpose processes images in directories. What is the expected file structure of the outputs?
Hi, I've tried running the code on multiple GPUs, but it seems that it doesn't utilize all available GPU resources. Could you please provide some guidance on how I can modify the code, or which commands I should use, to enable multi-GPU support? Thank you very much for your help.
No problem now
Hi, thanks for the great work. I'm trying to run the training code, but when I do pre-training with multiple 4090 GPUs, it always gets stuck and no report is shown. Training with multiple 3090s or a single 4090 works fine, so I strongly suspect a deadlock occurs on the 4090s. I narrowed the problem down to deepspeed.initialize in agent.py, but I don't know how to solve it.
Any response will be greatly appreciated.
Hi, thank you for your great work! I have a question about the FVD evaluation. I intend to follow this work, but I have some problems when evaluating FVD. (The other quanti results are consistent with the paper)
When I check the configuration of the videos generated from the gifs (in tool/metrics/utils.py, 'DatasetFVDVideoResize'), I find that the video has a size of [128, 112, 112, 3]; however, the gif has only 16 frames. So when I check the ffmpeg call in tool/metrics/utils.py line 358
out, _ = (ffmpeg.input(path).output('pipe:', format='rawvideo', pix_fmt='rgb24').run(capture_stdout=True, quiet=False))
it outputs something like the following, which means it converts the 16-frame gif into a 128-frame video (and segments it into 8 pieces for the num_seg parameter):
Input #0, gif, from '/root/autodl-tmp/DisCo/run_test/exp/tiktok_ft/outputs//pred_gs1.5_scale-cond1.0-ref1.0_gif/TiktokDance_00337_0010png.gif': Duration: 00:00:05.28, start: 0.000000, bitrate: 866 kb/s Stream #0:0: Video: gif, bgra, 256x256, 3.03 fps, 24.25 tbr, 100 tbn, 100 tbc
Output #0, rawvideo, to 'pipe:': Metadata: encoder : Lavf58.29.100 Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 256x256, q=2-31, 38141 kb/s, 24.25 fps, 24.25 tbn, 24.25 tbc Metadata: encoder : Lavc58.54.100 rawvideo
If I set the fps in gen_eval.sh to 25 (so the video is 16 frames), the FVD-3DRN50 becomes 96.15 (using the More TikTok-Style Training Data checkpoint, FID-FVD: 15.7).
Even if I don't change the fps (keeping it at 3), the FVD-3DRN50 is 20.34, which differs from the paper.
So I have 3 questions on this evaluation:
Thank you!
Hi, thanks for the great work.
I noticed that the LPIPS evaluation does not pass normalize=True, even though the inputs are in the [0,1] range. Adding this changes the result from 0.292 to 0.339. Despite the increase, the result still remains significantly better than the baseline.
DisCo/tool/metrics/ssim_l1_lpips_psnr.py, lines 74 to 97 at commit 8cb9387
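For context, a standalone illustration of the flag in the lpips package (whether the repo's script uses this exact package is an assumption; the point is only how normalize changes the expected input range):
import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")

pred = torch.rand(1, 3, 256, 256)   # predictions in [0, 1]
gt = torch.rand(1, 3, 256, 256)     # ground truth in [0, 1]

# normalize=True tells LPIPS the inputs live in [0, 1] and rescales them to
# [-1, 1] internally; omitting it treats [0, 1] tensors as already centered.
score = loss_fn(pred, gt, normalize=True)
print(score.item())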
Thanks for your great work!
I am curious about the Human Attribute Pre-training stage: did you pre-train the model with the full-body images in SHHQ, or only with cropped upper-body images (e.g., like the TikTok video results shown)?
In gen_eval.sh there is: # Generate GIFs of 16 frames, 3 fps
python tool/video/gen_gifs_for_fvd.py --root_dir
So where are gen_gifs_for_fvd.py and the tool/video folder?
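If the helper script is missing, here is a stand-in sketch (not the repo's gen_gifs_for_fvd.py; the frame folder layout is an assumption) that packs 16 frames into a 3 fps GIF with Pillow:
# Assemble the first 16 predicted frames of a clip into one GIF at 3 fps.
from pathlib import Path
from PIL import Image

frame_paths = sorted(Path("pred_frames").glob("*.png"))[:16]   # 16 frames per clip
frames = [Image.open(p).convert("RGB") for p in frame_paths]

frames[0].save(
    "clip.gif",
    save_all=True,
    append_images=frames[1:],
    duration=int(1000 / 3),   # 3 fps -> ~333 ms per frame
    loop=0,
)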
Hi, I would like to use a different dataset for the second stage of fine-tuning. How should I structure the data in the format you provide? For example, how can I obtain the files train_images.lineidx and train_images.lineidx.8b?
Can you provide a brief tutorial on how to use tsv_file_ops.py and tsv_file.py?
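For what it's worth, the .lineidx convention used by this style of TSV tooling appears to be one byte offset per row (as plain text), with .lineidx.8b storing the same offsets as 8-byte little-endian integers. A hedged sketch, since tsv_file_ops.py may already provide an equivalent helper:
import struct

def build_lineidx(tsv_path: str) -> None:
    # Record the byte offset of the start of every row in the TSV.
    offsets = []
    with open(tsv_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    with open(tsv_path.replace(".tsv", ".lineidx"), "w") as f:
        f.writelines(f"{off}\n" for off in offsets)

    with open(tsv_path.replace(".tsv", ".lineidx.8b"), "wb") as f:
        for off in offsets:
            f.write(struct.pack("<q", off))

build_lineidx("train_images.tsv")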
"./demo_data/pose_img/0049.png","./demo_data/pose_img/0198.png","./demo_data/pose_img/0213.png","./demo_data/pose_img/0264.png","./demo_data/pose_img/0144.png","./demo_data/pose_img/0054.png"
Obviously, these are only part of the poses. How can I get the entire pose dataset?
In addition, there are dance1-5; how can I get those pose datasets? Thanks.
Why is if args.deepspeed: ... commented out in finetune_sdm_yaml.py? Is there a bug with it?
Hi,
Could you please provide some more information about the human specific finetuning model?
I tried running it and have generated checkpoint files however their dictionary keys are wildly different to the provided checkpoint, 'mp_rank_00_model_states.pt':
My checkpoint: dict_keys(['models', 'optimizer', 'epoch', 'global_step', 'scheduler'])
Checkpoint provided: dict_keys(['module', 'buffer_names', 'optimizer', 'param_shapes', 'lr_scheduler', 'sparse_tensor_module_names', 'skipped_steps', 'global_steps', 'global_samples', 'dp_world_size', 'mp_world_size', 'ds_config', 'ds_version'])
Therefore, when I try to generate new images using my checkpoint it fails at the load_checkpoint_for_deepspeed_diff_gpu function with this message:
Traceback (most recent call last):
  File "/home/emily/DisCo/VideoGenerationModel/run.py", line 645, in <module>
    trainer.eval_demo_pre()
  File "/home/emily/DisCo/agent.py", line 422, in eval_demo_pre
    self.prepare_dist_model()
  File "/home/emily/DisCo/agent.py", line 199, in prepare_dist_model
    self.load_checkpoint_for_deepspeed_diff_gpu(self.pretrained_model)  # load pt model with default pytorch
  File "/home/emily/DisCo/agent.py", line 813, in load_checkpoint_for_deepspeed_diff_gpu
    adaptively_load_state_dict(self.model, checkpoint['module'])
KeyError: 'module'
I'm not really sure what to do about this issue as it seems the new checkpoints are supposed to be made this way, please advise. Many thanks :)
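One workaround (a sketch; the exact integration point in agent.py is left open) is to make the loader tolerant of both layouts, since the locally saved checkpoint keeps its weights under 'models' while the released DeepSpeed checkpoint uses 'module':
import torch

def extract_state_dict(ckpt_path: str):
    # Accept either checkpoint layout; key names come from the two dicts above.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    for key in ("module", "models", "model", "state_dict"):
        if key in checkpoint:
            return checkpoint[key]
    return checkpoint   # assume the file is already a bare state_dict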
Hi, thank you very much for your amazing work. I have successfully run the Gradio Demo using the model you provided.
However, I encountered the following error during the fine-tuning training phase using fp16 deepspeed:
[INFO] [stage_1_and_2.py:1651:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
How should I configure deepspeed.py to solve this problem?
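The OVERFLOW messages are DeepSpeed's dynamic loss scaler backing off, which is normal early in fp16 training; if steps keep being skipped, the usual knobs live in the fp16 section of the DeepSpeed config (shown here with standard keys as a sketch, not DisCo's shipped config):
ds_fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 12,  # start at 2**12 instead of the default 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    }
}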
Thank you for the great work, @Wangt-CN.
When Fine-tuning with Disentangled Control in TiktokDance, the paper states that "it is trained on 8 NVIDIA V100 GPUs for 70K steps with an image size of 256 × 256 and a learning rate of 2e−4". I would like to know the value of the local_batch_size in this case.
Thanks a lot.
Thank you very much for such outstanding work. Will the pre-training dataset be open-sourced?
Hi @Wangt-CN, thanks for the great work!
I have noticed that you've collected an additional 250 TikTok-style short videos from the internet. Will you consider uploading them? This would enable us to make a comparison with the released model trained on it.
Thank you for the great work, @Wangt-CN.
Where is run.py for openpose?
And can you release the inference code, so I can run the demo on my machine?
Thanks.
Hi, I'd like to try the demo with another dataset. I followed PREPRO.md and successfully ran the GroundSAM and Openpose scripts.
The question is that Openpose does not return images like demo_data/pose_img/*.png; Openpose outputs the keypoints in JSON format and draws the skeleton directly on the original RGB images. So is there any script to generate images in this style? The pose image is resized to 256x256; if my dataset images are larger, should I crop the pose area and resize to 256 first?
e.g. (the openpose result I got):
00001.jpg.json.txt
Hoping for your response, thanks!
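In the meantime, a minimal sketch of drawing OpenPose JSON keypoints onto a blank canvas and resizing to 256x256 (the JSON layout is OpenPose's --write_json format; DisCo's own color map and limb connections may differ, so this is only an illustration):
import json

import cv2
import numpy as np

with open("00001.jpg.json") as f:
    data = json.load(f)

canvas = np.zeros((960, 540, 3), dtype=np.uint8)      # assumed original H x W
kps = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)

for x, y, conf in kps:
    if conf > 0.1:                                     # skip undetected joints
        cv2.circle(canvas, (int(x), int(y)), 4, (0, 255, 0), thickness=-1)

pose_img = cv2.resize(canvas, (256, 256))
cv2.imwrite("0001_pose.png", pose_img)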
Hi! Thanks for your DisCo paper and the explanation of the TSV file preparation.
In the composite yaml file, you have a 'caption linelist' file which is used.
caption_linelist: train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.caption.linelist.tsv
Could you explain how you create this file?
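My best guess, based on similar TSV dataset tooling and not confirmed against this repo: each row of the caption linelist pairs an image row index with a caption index inside that row, so the loader can enumerate (image, caption) pairs across the composite. A hedged sketch:
# Assumed format: "<image_row_idx>\t<caption_idx>" per line.
num_images = 10        # stand-in for the number of rows in the caption tsv
caps_per_image = 1     # the single_cap composites appear to use one caption each

with open("train_composite.caption.linelist.tsv", "w") as f:
    for img_idx in range(num_images):
        for cap_idx in range(caps_per_image):
            f.write(f"{img_idx}\t{cap_idx}\n")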
Hello author, I would like to ask how to obtain the annotation files referenced by self.anno_path = 'GIT/{:05d}/labels/{:04d}.txt' for the images in the TikTok dataset.
if 'youtube' in anno_pose_path:
    img_key = self.anno_list[idx % self.num_images]
else:
    anno = list(open(anno_path))
    img_key = json.loads(anno[0].strip())['image_key']
"""
example:
{"num_region": 6, "image_key": "TiktokDance_00001_0002.png", "image_split": "00001", "image_read_error": false}
{"box_id": 0, "class_name": "aerosol_can", "norm_bbox": [0.5, 0.5, 1.0, 1.0], "conf": 0.0, "region_caption": "a woman with an orange dress with butterflies on her shirt.", "caption_conf": 0.9404542168542169}
{"box_id": 1, "class_name": "person", "norm_bbox": [0.46692365407943726, 0.4977584183216095, 0.9338473081588745, 0.995516836643219], "conf": 0.912740170955658, "region_caption": "a woman with an orange dress with butterflies on her shirt.", "caption_conf": 0.9404542168542169}
{"box_id": 2, "class_name": "butterfly", "norm_bbox": [0.2368704378604889, 0.5088028907775879, 0.1444256454706192, 0.04199704900383949], "conf": 0.8738771677017212, "region_caption": "a brown butterfly sitting on an orange background.", "caption_conf": 0.9297735554473283}
{"box_id": 3, "class_name": "butterfly", "norm_bbox": [0.6688584089279175, 0.5137135982513428, 0.11311062425374985, 0.05455022677779198], "conf": 0.8287128806114197, "region_caption": "a brown butterfly sitting on an orange wall.", "caption_conf": 0.9264783379302365}
{"box_id": 4, "class_name": "blouse", "norm_bbox": [0.4692786931991577, 0.6465241312980652, 0.9283269643783569, 0.6027728319168091], "conf": 0.6851752400398254, "region_caption": "a woman wearing an orange shirt with butterflies on it.", "caption_conf": 0.9978814544264754}
{"box_id": 5, "class_name": "short_pants", "norm_bbox": [0.44008955359458923, 0.8769687414169312, 0.8799525499343872, 0.2431662678718567], "conf": 0.6741859316825867, "region_caption": "a person wearing an orange shirt and grey sweatpants.", "caption_conf": 0.9731313580907464}
"""
As for metric.json, its contents are:
{"Step 0": {"eval": {"FID": 290.6777284435151, "time": "0:03:31.655149"}}, "Epoch2": {"train": {"loss_total": 0.09805237877084273, "time": "2:00:58.214095"}}, "Step2000": {"eval": {"FID": 39.84386908484396, "time": "0:04:10.178784"}}, "Epoch3": {"train": {"loss_total": 0.08273238215732012, "time": "1:05:12.567647"}}, "Step4000": {"eval": {"FID": 43.90990567108605, "time": "0:04:07.709888"}}, "Epoch4": {"train": {"loss_total": 0.08007360270170316, "time": "0:12:27.958903"}}, "Epoch5": {"train": {"loss_total": 0.07951762963632482, "time": "2:10:26.386287"}}, "Step6000": {"eval": {"FID": 47.38045255087343, "time": "0:04:09.181442"}}, "Epoch6": {"train": {"loss_total": 0.0778031051322654, "time": "1:17:38.705795"}}, "Step8000": {"eval": {"FID": 44.090458160650826, "time": "0:04:09.042723"}}, "Epoch7": {"train": {"loss_total": 0.07623744776395902, "time": "0:24:55.471577"}}, "Epoch8": {"train": {"loss_total": 0.07622077868163159, "time": "2:22:49.398254"}}, "Step10000": {"eval": {"FID": 31.904819727152358, "time": "0:04:09.004639"}}, "Epoch9": {"train": {"loss_total": 0.07504117791188147, "time": "1:30:05.510775"}}, "Step12000": {"eval": {"FID": 27.483082697985367, "time": "0:04:09.374085"}}, "Epoch10": {"train": {"loss_total": 0.07409243798521284, "time": "0:37:20.796301"}}, "Epoch11": {"train": {"loss_total": 0.07395339482515068, "time": "2:35:10.650475"}}, "Step14000": {"eval": {"FID": 31.168737757947156, "time": "0:04:09.885283"}}, "Epoch12": {"train": {"loss_total": 0.07372492651599219, "time": "1:42:29.514715"}}, "Step16000": {"eval": {"FID": 27.21106589500107, "time": "0:04:08.266375"}}, "Epoch13": {"train": {"loss_total": 0.07312383905231748, "time": "0:49:47.800272"}}, "Epoch14": {"train": {"loss_total": 0.07289745142666333, "time": "2:47:37.351242"}}, "Step18000": {"eval": {"FID": 23.106254980103927, "time": "0:04:07.923059"}}, "Epoch15": {"train": {"loss_total": 0.07242165016734459, "time": "1:54:56.220633"}}, "Step20000": {"eval": {"FID": 27.248582831371834, "time": "0:04:08.544414"}}, "Epoch16": {"train": {"loss_total": 0.07194297805632631, "time": "1:02:14.190223"}}, "Step22000": {"eval": {"FID": 24.803106175247194, "time": "0:04:07.721933"}}, "Epoch17": {"train": {"loss_total": 0.07178588963246771, "time": "0:09:32.512436"}}, "Epoch18": {"train": {"loss_total": 0.07133925958989136, "time": "2:07:23.062100"}}, "Step24000": {"eval": {"FID": 24.043788111684535, "time": "0:04:08.888000"}}, "Epoch19": {"train": {"loss_total": 0.07101746586461861, "time": "1:14:44.514627"}}, "Step26000": {"eval": {"FID": 21.995370168790316, "time": "0:04:09.391666"}}, "Epoch20": {"train": {"loss_total": 0.0711026608135349, "time": "0:21:59.766503"}}, "Epoch21": {"train": {"loss_total": 0.07048133608498951, "time": "2:19:49.997902"}}, "Step28000": {"eval": {"FID": 24.72485237611329, "time": "0:04:08.996139"}}, "Epoch22": {"train": {"loss_total": 0.07035766966611788, "time": "1:27:07.072584"}}, "Step30000": {"eval": {"FID": 23.718398035524274, "time": "0:04:08.026119"}}, "Epoch23": {"train": {"loss_total": 0.07033483951472409, "time": "0:34:28.609709"}}, "Epoch24": {"train": {"loss_total": 0.0700808005451255, "time": "2:32:20.204781"}}, "Step32000": {"eval": {"FID": 22.28186004474327, "time": "0:04:09.402551"}}, "Epoch25": {"train": {"loss_total": 0.06941193997643072, "time": "1:39:37.250015"}}, "Step34000": {"eval": {"FID": 22.008737701972393, "time": "0:04:08.751428"}}, "Epoch26": {"train": {"loss_total": 0.06975648029961369, "time": "0:46:56.325861"}}, "Epoch27": {"train": {"loss_total": 
0.06926694908420368, "time": "2:44:55.461831"}}, "Step36000": {"eval": {"FID": 20.646719371436802, "time": "0:04:09.315956"}}, "Epoch28": {"train": {"loss_total": 0.06902978069161715, "time": "1:51:58.898925"}}, "Step38000": {"eval": {"FID": 20.947136795766653, "time": "0:04:08.238444"}}, "Epoch29": {"train": {"loss_total": 0.06883154165577786, "time": "0:59:21.183679"}}, "Step40000": {"eval": {"FID": 21.91535396280665, "time": "0:04:08.135138"}}, "Epoch30": {"train": {"loss_total": 0.0685038694108908, "time": "0:06:39.512788"}}, "Epoch31": {"train": {"loss_total": 0.06833649117448559, "time": "2:04:39.496344"}}, "Step42000": {"eval": {"FID": 21.03997708430046, "time": "0:04:08.954742"}}, "Epoch32": {"train": {"loss_total": 0.06803931710874384, "time": "1:11:49.735317"}}, "Step44000": {"eval": {"FID": 20.869328025712207, "time": "0:04:08.273694"}}, "Epoch33": {"train": {"loss_total": 0.06827863852959126, "time": "0:19:08.113691"}}, "Epoch34": {"train": {"loss_total": 0.06779086924764614, "time": "2:16:59.691355"}}, "Step46000": {"eval": {"FID": 21.228485860542378, "time": "0:04:07.829478"}}, "Epoch35": {"train": {"loss_total": 0.06772437718459348, "time": "1:24:18.001555"}}, "Step48000": {"eval": {"FID": 21.43211396473822, "time": "0:04:08.974074"}}, "Epoch36": {"train": {"loss_total": 0.06746692015109836, "time": "0:31:33.721808"}}}
Does the code have an early-stopping mechanism, or did my run encounter an error?
I ran the code using this command:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 --allow-run-as-root python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /data/DisCo --local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" --train_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0 >> log.txt 2>&1