
phenaki-pytorch's Issues

Different video sizes

While yesterday's updates allow all training videos to be rectangular, I believe there is currently no way for the videos to have different sizes from one another.
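The workaround I have in mind is just resizing every clip to one common resolution before batching. A minimal sketch (the target size and tensor layout are my own assumptions, not something the repo prescribes):

import torch
import torch.nn.functional as F

def resize_video(video, size = (256, 256)):
    # video: (channels, frames, h, w) -> interpolate over the spatial dims only
    frames = video.permute(1, 0, 2, 3)                                            # (frames, channels, h, w)
    frames = F.interpolate(frames, size = size, mode = 'bilinear', align_corners = False)
    return frames.permute(1, 0, 2, 3)                                             # back to (channels, frames, h, w)

batch = torch.stack([
    resize_video(torch.randn(3, 17, 180, 320)),
    resize_video(torch.randn(3, 17, 240, 256))
])                                                                                # (2, 3, 17, 256, 256)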

to run C-ViViT...

import torch
from phenaki_pytorch import CViViT

cvivit = CViViT(
    dim = 512,
    codebook_size = 5000,
    image_size = 256,          # assumed; required by CViViT and matches the (256, 256) video below
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
).cuda()

video = torch.randn(1, 3, 17, 256, 256).cuda() # (batch, channels, frames + 1 leading frame, image height, image width)

loss = cvivit(video)
loss.backward()

You wrote 17 (= frames + 1 leading frame); does that mean the first frame should follow the other frames, like
torch.cat([videos, image]) ?
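Just to make the question concrete, this is the concatenation I have in mind. My own reading is that the single conditioning image is the first frame along the time axis, so it would be prepended rather than appended, but that is exactly what I am asking about:

import torch

image  = torch.randn(1, 3, 1, 256, 256)       # the one leading frame
frames = torch.randn(1, 3, 16, 256, 256)      # the remaining 16 frames

video = torch.cat([image, frames], dim = 2)   # (1, 3, 17, 256, 256) if the leading frame goes first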

Running out of CUDA/GPU memory

I have a GPU with 15 GB of memory, and it runs out of space when I try to train the network on 50 videos at a time. Do you think it would be better to compute the loss video by video, rather than over all the videos at once?
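For context, this is the kind of per-chunk loop with gradient accumulation I have in mind. It is only a sketch, assuming a cvivit module on the GPU as in the example above, and assuming the loss is a per-sample mean so that scaling each partial loss reproduces the full-batch average:

import torch
from torch.optim import Adam

opt = Adam(cvivit.parameters(), lr = 3e-4)
videos = torch.randn(50, 3, 17, 256, 256)       # the full set that does not fit at once

opt.zero_grad()
for chunk in videos.split(2):                   # 2 videos per forward pass
    loss = cvivit(chunk.cuda())
    (loss * chunk.shape[0] / videos.shape[0]).backward()   # scale so the summed gradients match the full-batch mean
opt.step()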

video preprocessing

In the Phenaki paper, they downsample the MiT dataset from 25 fps to 6 fps before video quantization.

Then, I wonder how to downsample the videos during preprocessing, and whether the input video is also downsampled when training the transformer and during video-generation inference.
Even if you don't upload the training and dataloader code for video, I would appreciate some advice from you, since you must have tried to implement it.
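For reference, this is the naive temporal subsampling I have been doing so far. It is only a sketch, assuming the clip is decoded at 25 fps into a (channels, frames, h, w) tensor:

import torch

video_25fps = torch.randn(3, 100, 256, 256)

stride = round(25 / 6)                  # = 4, keeping roughly every 4th frame (~6.25 fps)
video_6fps = video_25fps[:, ::stride]   # (3, 25, 256, 256)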

One more thing: I implemented your C-ViViT code for reconstruction. After I got feasible outputs, the very next checkpoint produced bad results like the ones below, where the left is the ground truth and the right is the output (I set the checkpoint interval to 3000).
[image: ground truth (left) vs. reconstruction (right)]

Could I ask what is going wrong, and whether early stopping is supposed to be required when training the tokenizer?

Thank you.

In phenaki to output video_codebook_ids...

I get video_codebook_ids of shape (batch_size, t × h × w, hidden dimension) out of C-ViViT. But if the video is flattened into a single dimension like this, I don't see how the temporal and spatial dimensions are each taken into account by the transformer and the VQ.

In the paper figure, the tokens are separated per frame, so I cannot understand why the output of C-ViViT has shape (batch_size, t·h·w, hidden dimension).
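Just to illustrate what I mean by the flattened shape: the (batch, t·h·w, dim) sequence can in principle be folded back into its temporal and spatial axes, so the layout is not necessarily lost. The numbers below are assumptions from my own config (9 temporal patches on an 8 × 8 spatial grid):

import torch
from einops import rearrange

tokens = torch.randn(1, 9 * 8 * 8, 512)                                 # (batch, t*h*w, dim)
grid = rearrange(tokens, 'b (t h w) d -> b t h w d', t = 9, h = 8, w = 8)
print(grid.shape)                                                       # torch.Size([1, 9, 8, 8, 512])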

Thank you

training data

For how long, and on how many videos, should I train to get good results?
I tried training it with just two 10-second videos, and the samples it saves are just noise.

how to condition on text embeddings from T5X

'To train MaskGIT, we include a text conditioning in the form of T5X embeddings which are used as input through the use of cross attention with the video tokens.'
The paper's explanation of how to condition on the text embeddings is too brief for me to understand.
Could you explain it in more detail?
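To show what I currently understand, here is a generic cross-attention sketch where the video tokens act as queries and the T5 embeddings act as keys/values, so every video token can read from the whole caption. This is only my illustration of the quoted sentence, not the repo's actual module, and all shapes and dimensions below are assumptions:

import torch
from torch import nn

dim, dim_context, heads, dim_head = 512, 768, 8, 64

to_q   = nn.Linear(dim, heads * dim_head, bias = False)
to_kv  = nn.Linear(dim_context, 2 * heads * dim_head, bias = False)
to_out = nn.Linear(heads * dim_head, dim, bias = False)

video_tokens = torch.randn(1, 576, dim)            # (batch, t*h*w, dim)
text_embeds  = torch.randn(1, 20, dim_context)     # (batch, caption length, T5 dim)

q = to_q(video_tokens)
k, v = to_kv(text_embeds).chunk(2, dim = -1)
q, k, v = (t.reshape(1, -1, heads, dim_head).transpose(1, 2) for t in (q, k, v))

attn = (q @ k.transpose(-2, -1) * dim_head ** -0.5).softmax(dim = -1)   # video tokens attend over text positions
out = to_out((attn @ v).transpose(1, 2).reshape(1, -1, heads * dim_head))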

Einops Error. Shape mismatch

I trained the C-ViViT encoder with images first and got the VAE loss down to ~0.05 and the discriminator loss to ~0.007.

Then I tried to fine-tune it on a GIF dataset, but it returned an einops shape-mismatch error.

This only happens when I use a batch_size other than 1. The model keeps training properly with batch_size 1.

Are we supposed to use only 1 as batch_size? The sample code in the README has 4 as batch_size.

Training the Phenaki - RuntimeError: CUDA error: device-side assert triggered

I have created a notebook and pasted the training code from the README.md file so that I can experiment with training the model, but I encounter the following error when training Phenaki. Is there anything I'm doing wrong?

The same error emerges both on Google Colab and on an AWS EC2 p4d instance with 8 A100 GPUs.

Here's the error I'm getting:

<@beartype(phenaki_pytorch.phenaki_pytorch.Phenaki.forward) at 0x7fd70d61cf70> in forward(__beartype_object_94770277747520, __beartype_getrandbits, __beartype_get_violation, __beartype_conf, __beartype_func, *args, **kwargs)

[/content/phenaki-pytorch/phenaki_pytorch/phenaki_pytorch.py](https://localhost:8080/#) in forward(self, videos, texts, video_codebook_ids, video_frame_mask, text_embeds, cond_drop_prob, only_train_generator, only_train_critic)
    642         if not only_train_critic:
    643             loss = F.cross_entropy(
--> 644                 logits[mask_token_mask],
    645                 video_codebook_ids[mask_token_mask]
    646             )

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
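In case it helps narrow things down, I can reproduce the same assert with a tiny standalone snippet, assuming the cause is a target id outside [0, num_tokens) reaching F.cross_entropy on the GPU:

import torch
import torch.nn.functional as F

logits  = torch.randn(4, 5000, device = 'cuda')             # num_tokens = 5000 classes
targets = torch.tensor([1, 2, 3, 5000], device = 'cuda')    # 5000 is out of range for 5000 classes

F.cross_entropy(logits, targets)   # -> RuntimeError: CUDA error: device-side assert triggered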

Some errors that appear on the training results

When we started trying your sample model, the result was a noisy video; again, each frame is a noisy image. I wonder whether something is wrong with our input or whether there are other errors during training? Thanks.

This is one frame in entire_video

random0

This is our results video

result.mp4

Thanks a lot

There is an error in the EMA

After one iteration of training, there is a mismatch in the buffers of ema_vae.online_model, so I get an error in ema_vae.update().
How can I solve it?

Release pretrained models?

Hello,

Is there a plan to release pretrained models to do text-to-video out of the box with no training required?

If you upload your pretrained model somewhere, I can add a script for downloading and loading the pretrained weights and open a PR.

Thanks

Unconditional Training returns errors

I'm trying to train an unconditional model on image and GIF data I have, in order to generate coherent video from GIFs of manga panels:

# imports assumed from the package root; adjust if your version exposes them elsewhere
from phenaki_pytorch import CViViT, MaskGit, Phenaki, PhenakiTrainer

cvivit = CViViT(
    dim = 512,
    codebook_size = 5000,
    image_size = 256,
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
    unconditional = True # Kept this true, otherwise it asks for text samples (I only have image data)
)

phenaki = Phenaki(
    cvivit = cvivit,
    maskgit = maskgit
).cuda()

trainer = PhenakiTrainer(
    phenaki = phenaki,
    batch_size = 4,
    grad_accum_every = 4,
    train_on_images = True,
    folder = '../dataset/compressed_manga/'
)

trainer.train()

When training, the following error is raised:
sample_images() got an unexpected keyword argument 'num_frames'

I think the num_frames argument is being passed through to the sample_images() method.
Can someone confirm this is a bug? I'll submit a PR with a fix.

Error in Sample(): Expected scalar type float but found double

When running the example code, I keep getting the following error (see below). Do you have any idea how to fix it?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_21777/4126247202.py in <module>
     44 # do the above for many steps, then ...
     45 
---> 46 video = phenaki.sample(text = 'a squirrel examines an acorn', num_frames = 17, cond_scale = 5.) # (1, 3, 17, 256, 256)
     47 
     48 # so in the paper, they do not really achieve 2 minutes of coherent video

~/Phenaki/phenaki-pytorch/phenaki_pytorch2/phenaki_pytorch2.py in inner(model, *args, **kwargs)
     36         was_training = model.training
     37         model.eval()
---> 38         out = fn(model, *args, **kwargs)
     39         model.train(was_training)
     40         return out

/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

~/Phenaki/phenaki-pytorch/phenaki_pytorch2/phenaki_pytorch2.py in sample(self, text, num_frames, prime_frames, cond_scale, starting_temperature, noise_K)
   1115                     scores = 1 - rearrange(scores, '... 1 -> ...')
   1116                     
-> 1117                     scores = torch.where(mask, scores, -1e4)
   1118 
   1119         if has_prime:

RuntimeError: expected scalar type float but found double
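The traceback points at the torch.where call, so this is the workaround I would try. It is only a sketch, assuming the bare Python scalar -1e4 is being promoted to float64 on this torch version, so giving the fill value the same dtype as scores avoids the mismatch:

import torch

mask   = torch.tensor([True, False, True])
scores = torch.rand(3)                                               # float32

scores = torch.where(mask, scores, torch.full_like(scores, -1e4))    # fill value now matches scores' dtype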

What is out of index?

I got this error at loss = F.cross_entropy(logits[mask_token_mask], x[mask_token_mask]).
Then I traced the error to line 181 of phenaki_pytorch.py.
After pos_emb(torch.arange(...)), all the tensors I get show up as plain torch tensor objects, and then the CUDA error occurs:

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

How can I solve it?
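For what it's worth, here is the rough token-count check I did on my config, assuming the out-of-range index means the video produces more positions than the MaskGit max_seq_len of 1024 (the patching arithmetic below is my own estimate, not taken from the repo):

frames, image_size, patch_size, temporal_patch_size = 17, 256, 32, 2

spatial_tokens = (image_size // patch_size) ** 2               # 64 tokens per frame slice
temporal_steps = 1 + (frames - 1) // temporal_patch_size       # leading frame + 8 temporal patches
total_tokens   = spatial_tokens * temporal_steps               # 64 * 9 = 576

print(total_tokens, '<= 1024 ?', total_tokens <= 1024)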

ImportError: T5Tokenizer requires the SentencePiece library but it was not found in your environment.

When I run the given example

import torch
from phenaki_pytorch import CViViT, MaskGit, TokenCritic, PhenakiCritic

cvivit = CViViT(
    dim = 512,
    codebook_size = 5000,
    image_size = (256, 128),
    patch_size = 32,
    temporal_patch_size = 2,
    spatial_depth = 4,
    temporal_depth = 4,
    dim_head = 64,
    heads = 8
)

maskgit = MaskGit(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6,
)

critic = TokenCritic(
    num_tokens = 5000,
    max_seq_len = 1024,
    dim = 512,
    dim_context = 768,
    depth = 6
)

critic_trainer = PhenakiCritic(
    maskgit = maskgit,
    critic = critic,
    cvivit = cvivit
)

texts = [
    'a whale breaching from afar',
    'young girl blowing out candles on her birthday cake',
    'fireworks with blue and green sparkles'
]

videos = torch.randn(3, 3, 3, 256, 128) # (batch, channels, frames, height, width)

loss = critic_trainer(videos = videos, texts = texts)
loss.backward()

I get this error:
ImportError:
T5Tokenizer requires the SentencePiece library but it was not found in your environment.
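(As the message suggests, the SentencePiece dependency is simply missing from the environment; installing the sentencepiece package should let the T5Tokenizer import succeed.)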

Successfully trained the CViViT! Working on the second step

Hello @lucidrains !

Based on your great repo, we have been working on replicating Phenaki for a few months. We started with the first step of training, the CViViT. Things were pretty tough to set up on our cluster, and we made quite a lot of changes to your code. Since there were too many changes, and we wanted to release only the first step of training without the code for the second step, we decided to create a separate repo instead of opening a PR here, which would have been a mess.

Here's the repo: https://github.com/obvious-research/phenaki-cvivit, with the model weights released on Hugging Face: https://huggingface.co/obvious-research/phenaki-cvivit

The model works well; it's been trained on Webvid10M, and we intend to do the same for the second step of training.

As you seemed interested in hearing about progress based on this repo (#28), we thought it was OK to open an issue here just to contact you :) We are a collective of artists working with AI who are opening our own AI research lab, and we are interested in creating image and video models and releasing them open source.

What's the best way to contact or DM you? We are working on the second step of training at the moment. We have access to quite a lot of computing power, so we think we have everything we need to actually do it. As we make good progress towards a full replication, it would be nice to collaborate if needed; maybe we'll have some questions, and your help would be incredible! No pressure though, only if this is interesting to you.

Have a great day, and thanks again for the great repo!

Compatibility issue with A100-80G with version 0.0.67

I'm not able to use the A100-80G GPU because the torch version required by 0.0.67 is not compatible with the card.

I'm not sure what my previous version was, probably the one before 0.0.66.

Is there a plan to upgrade the torch version? Any workaround?

model.buffer is EMPTY list in EMA get_buffers_iter function

While debugging in VS Code, I found that the buffer lists of both online_model and ema_model are empty at iteration 1. Is this right? The model is C-ViViT, not a custom model. I am new to EMA, so I have difficulty understanding this code. Sorry about that!

Problem With Multi-GPU Training

Hello,

I have been training the c-vivit encoder and have encountered an issue when attempting to use multiple GPUs. While the encoder works well with a single GPU, I receive a RuntimeError when attempting to use multiple GPUs. Specifically, the error message is: "The size of tensor a (64) must match the size of tensor b (0) at non-singleton dimension 2." I have noticed that changing the dim_head parameter causes the size of tensor a to change as well. Could you please provide some insights into what might be causing this issue?

Save video?

How do you save or view the video created?
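In case it's useful to others, this is the kind of export I had in mind. It is only a sketch, assuming the sample is a (1, 3, frames, height, width) float tensor in [0, 1] as returned by phenaki.sample, and it needs the av package installed for torchvision's video writer:

import torch
from torchvision.io import write_video

video = torch.rand(1, 3, 17, 256, 256)                              # stand-in for phenaki.sample(...)

frames = (video[0].permute(1, 2, 3, 0) * 255).clamp(0, 255).to(torch.uint8).cpu()
write_video('sample.mp4', frames, fps = 6)                          # expects (frames, h, w, channels) uint8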

discriminator loss goes to infinity

Hi,

I'm trying to train the CViViT on a set of 10,000 images. The VAE loss keeps going down, but the discriminator loss keeps rising towards infinity. It's easy to fool :)

Any idea what the problem is?

cannot reproduce

I really don't think this can be reproduced:

RuntimeError: Error(s) in loading state_dict for CViViT:
Missing key(s) in state_dict: "discr.attn_blocks.3.null_kv", "discr.attn_blocks.3.q_scale", "discr.attn_blocks.3.k_scale", "discr.attn_blocks.3.norm.gamma", "discr.attn_blocks.3.norm.beta", "discr.attn_blocks.3.context_norm.gamma", "discr.attn_blocks.3.context_norm.beta", "discr.attn_blocks.3.to_q.weight", "discr.attn_blocks.3.to_kv.weight", "discr.attn_blocks.3.to_out.weight".
Unexpected key(s) in state_dict: "discr.blocks.6.conv_res.weight", "discr.blocks.6.conv_res.bias", "discr.blocks.6.net.0.weight", "discr.blocks.6.net.0.bias", "discr.blocks.6.net.2.weight", "discr.blocks.6.net.2.bias", "discr.blocks.5.downsample.1.weight", "discr.blocks.5.downsample.1.bias".
size mismatch for discr.to_logits.3.weight: copying a param with shape torch.Size([1, 8192]) from checkpoint, the shape in current model is torch.Size([1, 16384]).
