lumina-t2x's Introduction

Alpha-VLLM Lab Website

This is the website of our academic research group at Shanghai AI Lab.

This website is powered by Jekyll, with some Bootstrap and Bootswatch.

Copyright Alpha-VLLM team. Code released under the MIT License.

lumina-t2x's People

Contributors

artanic30, chrisliu6, eltociear, feifeieiar, frankluox, gaopengpjlab, hafred, kamisatokanade, linziyi96, npjd, pommespeter, poppuppy, rongjiehuang, yuxumin, zhuole1025

lumina-t2x's Issues

ComfyUI support?

Is there a way to use this in ComfyUI? I'm really impressed with the prompt following in the examples a user shared in a Discord channel.

Also, can LoRAs be created for it? Can it be trained further?

ControlNet support?

Can Lumina be trained with conditional input images and generate images from image conditions, like ControlNet?

🤩 [User Study] Report your bad examples!

Hi all,

Thank you for being so interested in Lumina-T2X. If you encounter images of poor quality, please feel free to report them in this issue to help us improve the model.

You can directly add a comment on this issue and use the template below:

prompt: <copy-paste prompt here>
image: <copy-paste image here>

You can also optionally report the hyperparameters you used if they influence the quality.

Generated images are not impressive enough

Why are the images generated by your model not more impressive? They even seem worse than PixArt-Sigma. Is it because the amount of training data is insufficient, or because your model expects prompts to follow your own instruction format?

Question about lognorm

Hi, great work! The paper mentions lognorm, but I couldn't find the implementation. Could you let me know if it's used in the code? If so, please tell me where I can find it. Thank you very much!
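
For context, here is a minimal sketch of what logit-normal ("lognorm") timestep sampling usually looks like in flow-matching training; the function name and defaults are assumptions, not the repository's actual implementation:

import torch

def sample_lognorm_timesteps(batch_size: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    # Hypothetical logit-normal timestep sampler (assumption, not the repo's code).
    # Draws u ~ Normal(mean, std) and maps it through a sigmoid, so t lies in
    # (0, 1) and is concentrated around the middle of the trajectory.
    u = mean + std * torch.randn(batch_size)
    return torch.sigmoid(u)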

Showcase lumina-t2x on Hugging Face Spaces with a free GPU grant (A100s)

Congratulations on your amazing project and the successful live Gradio demo! It would also be great to have the demo available on Hugging Face Spaces. This could help with more community engagement and drive more visibility to the project. We at Hugging Face also provide free GPU grants through the ZeroGPU program, which includes free A100s, and we would be happy to extend a grant to your application.

Here are some useful links to help you get started on Spaces:

Please let us know if you need any further assistance or support in integrating your project with Spaces or any other relevant Hugging Face offerings.

About time shifting factor.

For Lumina-T2I, it seems that time_shifting_factor is only implemented in the ODE integrator, not in the SDE integrator. Does this factor have a big impact? Which is more recommended, SDE or ODE? Thanks!
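
For reference, a minimal sketch of how a time-shifting factor is commonly applied to an ODE sampler's step schedule; the exact formula and names below are assumptions rather than Lumina-T2I's actual code:

import torch

def shift_timesteps(t: torch.Tensor, shift: float = 4.0) -> torch.Tensor:
    # Hypothetical time-shifted schedule (assumption, not the repo's implementation).
    # Remaps uniform timesteps in [0, 1] so more integration steps are spent near
    # one end of the trajectory, which is what a larger time_shifting_factor
    # typically does for high-resolution sampling.
    return shift * t / (1 + (shift - 1) * t)

# Example: reshape a uniform 60-step schedule before ODE integration.
steps = torch.linspace(0, 1, 60)
shifted_steps = shift_timesteps(steps, shift=4.0)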

Cannot generate the same image as the demo

I downloaded the models to my computer and ran them locally, but my generated images do not match the images from your demo, even with the same settings, e.g., seed and sample_steps. Do you use a default negative prompt in your demo?
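
For what it's worth, a minimal sketch of fixing the random state locally before sampling, in case the mismatch comes from unseeded generators rather than a hidden negative prompt (how the demo seeds its runs is unknown to me):

import random
import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Fix the common sources of randomness for a reproducibility check.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Even with identical seeds, results can still differ across GPU models,
    # driver versions, and non-deterministic kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False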

Training details about the t2v model.

Hi, I am currently testing the lumina-t2v model on a single A100 40GB. May I ask which GPU type was used to train the T2V model? I also wonder how many frames were used.

My implementation follows these steps:

  1. Following the paper, I added flatten and unflatten operations along the frame dimension (see the sketch after this list).
  2. To save time, I did the preprocessing (LLaMA text features and VAE latents) separately before starting training. But the VAE is identical to the one used in T2I, so I worry it might not capture enough temporal consistency.
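
A minimal sketch of the flatten/unflatten along the frame dimension mentioned in step 1; the tensor layout and einops patterns are my assumptions about how this could be wired, not the official implementation:

import torch
from einops import rearrange

# Assume a latent video of shape (batch, frames, channels, height, width).
x = torch.randn(4, 8, 4, 32, 32)

# Flatten: fold the frame dimension into the batch so every frame passes
# through the (image-pretrained) spatial blocks independently.
x_flat = rearrange(x, "b f c h w -> (b f) c h w")

# ... spatial transformer blocks would operate on x_flat here ...

# Unflatten: restore the frame dimension so temporal blocks can attend
# across frames of the same video.
x_video = rearrange(x_flat, "(b f) c h w -> b f c h w", b=4, f=8)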

In my testing, the video tensor is limited to b=4, f=8, c=4, h=32, w=32 (after embedding) due to out-of-memory issues, so it is nearly impossible to run even small-scale tests to verify your temporal-spatial merging method.

I am really interested in reading your training details and the comparison between the temporal-spatial dividing and merging strategies. Your insights would be greatly appreciated.

t2v timing

Hi

I've implemented the Lumina-T2V model and am training it on the Panda dataset. The paper mentions that initial training uses 8 GPUs; I assume they are 8x A100 80GB (which is what I'm using). May I know how long training takes (in terms of GPU hours)?

About synthetic dataset of T2I

Hi, thank you for your great work!
I'm curious whether, in the synthetic T2I dataset, both the captions and the images are synthetic, or only one of them.

I can't make it work in Colab

Hi, I'm trying to run Lumina-T2X in Google Colab, but I'm encountering an error when trying to import what I believe is a necessary function.

Here's the error message I'm getting:

ERROR: Could not find a version that satisfies the requirement lumina_next_sft (from versions: none)
ERROR: No matching distribution found for lumina_next_sft

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 11>()
9 import matplotlib.pyplot as plt
10 import gradio as gr
---> 11 from lumina_next_sft import generate_text as lumina_generate_text

ModuleNotFoundError: No module named 'lumina_next_sft'


NOTE: If my import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

I'm assuming "lumina_next_sft" is part of the Lumina-T2X codebase, but I haven't been able to figure out the correct way to import the generate_text function.
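
One possible workaround, purely as a sketch: lumina_next_sft does not appear to be published on PyPI, so pip cannot resolve it; cloning the repository and putting the relevant directory on sys.path might work instead. The directory name below, and whether a generate_text function exists at all, are assumptions:

import subprocess
import sys

# Clone the official repository inside the Colab runtime.
subprocess.run(
    ["git", "clone", "https://github.com/Alpha-VLLM/Lumina-T2X.git"],
    check=True,
)

# Make the sub-package importable; the directory name is a guess based on the
# repository layout, not a documented install path.
sys.path.insert(0, "Lumina-T2X/lumina_next_t2i")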

Could you please provide some guidance on how to resolve this issue or point me to any resources that might help?

Thanks in advance!

error in Next-DiT

When I run Next-DiT following the README, I get the following error when loading the DiT model. How can I solve it?
KeyError: 'NextDiT_2B_GQA_patch2'
[screenshot of the error attached]
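
A small debugging sketch for this kind of KeyError, assuming the model constructors are exposed as attributes of the local models package (the module path and naming scheme are assumptions):

# Hypothetical check: list the NextDiT variants that are actually registered
# in the installed code, then match the name in the config/CLI against them
# (or pull a newer commit if the variant is missing).
import models

available = [name for name in dir(models) if name.startswith("NextDiT")]
print(available)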

Batch Generation

Hello, Lumina-T2I currently only supports the web demo and the CLI. I would like to ask how to achieve batch generation, i.e., generating images for multiple prompts in one run. Looking forward to your reply, thank you.
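
For context, a minimal sketch of one way to batch over prompts by looping around the existing CLI; the script name, flags, and checkpoint path below are assumptions rather than the documented interface:

import subprocess

# Hypothetical prompt list; in practice this could be read from a text file.
prompts = [
    "A black Honda motorcycle parked in front of a garage.",
    "A watercolor painting of a lighthouse at dawn.",
]

for i, prompt in enumerate(prompts):
    # The flags below (--ckpt, --prompt, --output) are guesses about the CLI
    # surface, not the actual documented arguments.
    subprocess.run(
        [
            "python", "-u", "sample.py",
            "--ckpt", "/path/to/checkpoint",
            "--prompt", prompt,
            "--output", f"sample_{i}.png",
        ],
        check=True,
    )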

Great work

This is great work for T2X. Could you update the checkpoint URL for the T2I model? The old URL is invalid.

Support for low-resolution generation

Hi, thanks for the release! Is there any plan to support good-quality low-resolution image generation, e.g., 256x256 or 128x128? Directly changing the resolution in the config file doesn't work:

resolution: "1024x1024" # option: ["1024x1024", "512x2048", "2048x512", "(Extrapolation) 1664x1664", "(Extrapolation) 1024x2048", "(Extrapolation) 2048x1024"]

For example, for the prompt "A black Honda motorcycle parked in front of a garage." (from COCO) with 60 sampling steps, the results at 256x256 are pretty bad:

[256x256 sample attached: 000000179765_1]

whereas at 1024x1024 they look better:
[1024x1024 sample attached: 000000179765]

Inquiry on training data and setup for T2I training

Firstly, I would like to express my gratitude and respect for the remarkable work you’ve done by open-sourcing the T2I model, which is a significant contribution to the community.

I have two questions:

  1. I have gone through the associated paper but was unable to find specific details on the datasets used for training the T2I model. Could you please confirm if this information is available elsewhere or if I may have overlooked it in the paper? Any details you could share would be greatly appreciated.

  2. While I have read about the training section you’ve shared for the T2I model, there seems to be a lack of information regarding the training data setup. I am particularly interested in the data structure and how to properly organize it for training. Additionally, it would be extremely helpful if you could provide an example of a toy dataset, similar to the one shown in Pixart-Sigma, and instructions to verify if the training CLI is functioning as intended.

I understand that providing this detailed information might be demanding, but I believe that such transparency would greatly benefit the wider adoption of the Lumina project within the open-source community.

Thank you for considering my request. I look forward to your response and any guidance you can provide.

Which LLaMA 7B version was used?

Did you use LLaMA 7B together with InternViT-6B during training?
And is there any plan to release a technical report?

Got Unpickling error when sampling lumina_next_t2i

Hi!

According to https://github.com/Alpha-VLLM/Lumina-T2X/tree/main/lumina_next_t2i#inference,

  • I downloaded Lumina-Next-T2I from huggingface
  • and tried python -u sample.py --ckpt {my_ckpt_path}, but got the following error.
FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/home/Lumina-T2X/lumina_next_t2i/models/components.py:9: UserWarning: Cannot import apex RMSNorm, switch to vanilla implementation
  warnings.warn("Cannot import apex RMSNorm, switch to vanilla implementation")
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/Lumina-T2X/lumina_next_t2i/sample.py", line 326, in <module>
    main(args, 0, master_port)
  File "/home/Lumina-T2X/lumina_next_t2i/sample.py", line 95, in main
    train_args = torch.load(os.path.join(args.ckpt, "model_args.pth"))
  File "/home/.conda/envs/lumina_t2x/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/.conda/envs/lumina_t2x/lib/python3.10/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.
  • I checked that the model_args.pth file exists in ckpt_path.
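
For reference, _pickle.UnpicklingError: invalid load key, 'v'. very often means the downloaded file is a Git LFS pointer (a small text file starting with "version https://git-lfs...") rather than the real checkpoint. A quick check, with a placeholder path:

# If the "checkpoint" is actually a Git LFS pointer, its first bytes are ASCII
# text starting with "version", which is exactly what produces the
# UnpicklingError with load key 'v'.
ckpt_path = "/path/to/ckpt/model_args.pth"  # placeholder path

with open(ckpt_path, "rb") as f:
    head = f.read(64)
print(head)
# If this prints b'version https://git-lfs.github.com/spec/v1 ...', re-download
# the file, e.g. with git lfs pull or the huggingface_hub client.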

[Feedback] Share your good examples!

If you've generated some great images, we'd love to see your prompts, hyper-parameters and hear about your experience!

Let's discuss how to generate a perfect image!

Questions about next line / next frame token

Hi. Thanks for sharing great work!

  • In the case of video generation, how are the next-line / next-frame tokens attached to the latent frames?
    [figure from the paper attached]

    • It seems the next-line token is attached at the end of every row (along the height), and the next-frame token at the very end of each latent frame. Is this right? (A concrete sketch of this layout follows after this list.)
    • Since videos have different resolutions and durations, how is this handled within one batch of different videos? Are learnable PAD tokens appended so sequences have the same length inside a batch?
  • Do you have any plan to release the training code (especially for T2V) and the dataset?
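
To make the first question concrete, here is a small sketch of the layout described above (a next-line token after every row, a next-frame token after every frame); the shapes and the use of learnable embeddings are my assumptions about the paper's scheme, not the released code:

import torch
import torch.nn as nn

b, f, h, w, d = 2, 4, 8, 8, 16  # batch, frames, latent height, latent width, dim
tokens = torch.randn(b, f, h, w, d)

# Hypothetical learnable special tokens (assumed, not confirmed by the repo).
nextline = nn.Parameter(torch.randn(d))
nextframe = nn.Parameter(torch.randn(d))

# Append a next-line token at the end of every row of every frame.
nl = nextline.expand(b, f, h, 1, d)
tokens = torch.cat([tokens, nl], dim=3)           # (b, f, h, w + 1, d)

# Flatten rows, then append a next-frame token at the end of each frame.
tokens = tokens.reshape(b, f, h * (w + 1), d)
nf = nextframe.expand(b, f, 1, d)
tokens = torch.cat([tokens, nf], dim=2)           # (b, f, h * (w + 1) + 1, d)

# Finally flatten frames into one token sequence per video.
seq = tokens.reshape(b, f * (h * (w + 1) + 1), d)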

Thank you!

How to try the SD3 sample_sd3.py script?

I saw you made changes yesterday involving ODEs, and I'd love to try them out in a simple Diffusers Colab notebook. Is that possible? Does this change involve taking the transformer and VAE from SD3 and using them with Lumina? Thanks in advance!
