An implementation of the DiffusionOverDiffusion architecture presented in NUWA-XL as a ControlNet-like module on top of the ModelScope text2video model, for extremely long video generation.
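The core idea of DiffusionOverDiffusion is coarse-to-fine generation: a global diffusion pass produces sparse keyframes spanning the whole video, then local diffusion passes recursively fill in the frames between each pair of adjacent keyframes, conditioned on them. A minimal sketch of that scheduling logic (function names here are hypothetical, not this repo's API):

```python
def keyframe_indices(start: int, end: int, k: int) -> list[int]:
    """k evenly spaced frame indices in [start, end], endpoints included."""
    step = (end - start) / (k - 1)
    return [round(start + i * step) for i in range(k)]

def dod_schedule(num_frames: int, k: int) -> list[list[tuple[int, int]]]:
    """Coarse-to-fine segment schedule for a DiffusionOverDiffusion run.

    Level 0 covers the whole clip with k keyframes; each later level runs a
    local pass between every pair of adjacent keyframes until no segment is
    longer than one model window of k frames.
    """
    levels = []
    frontier = [(0, num_frames - 1)]
    while frontier:
        levels.append(frontier)
        nxt = []
        for a, b in frontier:
            if b - a + 1 > k:  # segment still too long, subdivide it
                ks = keyframe_indices(a, b, k)
                nxt.extend((ks[i], ks[i + 1]) for i in range(k - 1))
        frontier = nxt
    return levels
```

For example, `dod_schedule(17, 5)` yields one global segment `(0, 16)` followed by four local segments `(0, 4), (4, 8), (8, 12), (12, 16)`, each small enough for a single 5-frame diffusion pass.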
Add a low-level clip/image captioner such as BLIP-2 for when a description is not available
Use an external LLM such as OpenAI's ChatGPT or plain GPT-3, or even a local model like LLaMA, via an API to generate descriptions for higher-level videos by combining the subclip captions with a global prompt
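The two captioning steps above could be wired together roughly like this: fill in missing subclip captions with a low-level captioner (e.g. BLIP-2 via HuggingFace `transformers`), then merge the captions with the global prompt into a single instruction for the LLM. This is an illustrative sketch under assumed data shapes, not this repo's implementation; the captioner and LLM calls are left as injected callables:

```python
def ensure_captions(clips: list[dict], caption_fn) -> list[str]:
    """Return a caption per clip, calling `caption_fn` (e.g. a BLIP-2
    wrapper taking the clip's frames) only where a caption is missing."""
    return [
        c["caption"] if c.get("caption") else caption_fn(c["frames"])
        for c in clips
    ]

def build_summary_prompt(global_prompt: str, subclip_captions: list[str]) -> str:
    """Combine subclip captions with the global prompt into one request
    for an external LLM (ChatGPT, GPT-3, or a local LLaMA endpoint)."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(subclip_captions))
    return (
        f"Overall video: {global_prompt}\n"
        f"Consecutive subclips:\n{numbered}\n"
        "Write one concise description covering the whole span."
    )
```

The LLM's reply would then serve as the description for the higher-level (coarser) segment that spans those subclips, and the process repeats up the hierarchy.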