
ml-4m's Introduction

4M: Massively Multimodal Masked Modeling

A framework for training any-to-any multimodal foundation models.
Scalable. Open-sourced. Across tens of modalities and tasks.

EPFL - Apple

Website | BibTeX | 🤗 Demo

Official implementation and pre-trained models for:

4M: Massively Multimodal Masked Modeling, NeurIPS 2023 (Spotlight)
David Mizrahi*, Roman Bachmann*, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities, arXiv 2024
Roman Bachmann*, Oğuzhan Fatih Kar*, David Mizrahi*, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir


4M main figure

4M is a framework for training "any-to-any" foundation models, using tokenization and masking to scale to many diverse modalities. Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models. We are releasing code and models for "4M: Massively Multimodal Masked Modeling" (here denoted 4M-7), as well as "4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities" (here denoted 4M-21).

Table of contents

Usage

Installation

  1. Clone this repository and navigate to the root directory:
git clone https://github.com/apple/ml-4m
cd ml-4m
  2. Create a new conda environment, then install the package and its dependencies:
conda create -n fourm python=3.9 -y
conda activate fourm
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Verify that CUDA is available in PyTorch by running the following in a Python shell:
# Run in Python shell
import torch
print(torch.cuda.is_available())  # Should return True

If CUDA is not available, consider re-installing PyTorch following the official installation instructions. Likewise, if you want to install xFormers (optional, for faster tokenizers), follow their README to ensure that the CUDA version is correct.

Getting started

We provide a demo wrapper to quickly get started with using 4M models for RGB-to-all or {caption, bounding boxes}-to-all generation tasks. For example, to generate all modalities from a given RGB input, call:

from fourm.demo_4M_sampler import Demo4MSampler, img_from_url
sampler = Demo4MSampler(fm='EPFL-VILAB/4M-21_XL').cuda()
img = img_from_url('https://storage.googleapis.com/four_m_site/images/demo_rgb.png') # 1x3x224x224 ImageNet-standardized PyTorch Tensor
preds = sampler({'rgb@224': img.cuda()}, seed=None) 
sampler.plot_modalities(preds, save_path=None)

You should expect to see an output like the following:

4M demo sampler output

For caption-to-all generation, you can instead pass a caption as the sampler input, e.g. preds = sampler({'caption': 'A lake house with a boat in front [S_1]'}). For a list of available 4M models, please see the model zoo below, and see README_GENERATION.md for more instructions on generation.
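For example, a complete caption-to-all call might look like the following minimal sketch (the checkpoint and caption are the ones from the examples above):

```python
from fourm.demo_4M_sampler import Demo4MSampler

# Same demo wrapper as above; generation is conditioned on a caption instead of RGB.
sampler = Demo4MSampler(fm='EPFL-VILAB/4M-21_XL').cuda()
preds = sampler({'caption': 'A lake house with a boat in front [S_1]'})
sampler.plot_modalities(preds, save_path=None)
```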

Data

See README_DATA.md for instructions on how to prepare aligned multimodal datasets.

Tokenization

See README_TOKENIZATION.md for instructions on how to train modality-specific tokenizers.

4M Training

See README_TRAINING.md for instructions on how to train 4M models.

Generation

See README_GENERATION.md for instructions on how to use 4M models for inference / generation. We also provide a generation notebook that contains examples of 4M inference, specifically conditional image generation and common vision tasks (i.e., RGB-to-all prediction).

Model Zoo

We provide 4M and tokenizer checkpoints as safetensors, and also offer easy loading via Hugging Face Hub.

4M models

| Model | # Mod. | Datasets | # Params | Config | Weights |
|-------|--------|----------|----------|--------|---------|
| 4M-B | 7 | CC12M | 198M | Config | Checkpoint / HF Hub |
| 4M-B | 7 | COYO700M | 198M | Config | Checkpoint / HF Hub |
| 4M-B | 21 | CC12M+COYO700M+C4 | 198M | Config | Checkpoint / HF Hub |
| 4M-L | 7 | CC12M | 705M | Config | Checkpoint / HF Hub |
| 4M-L | 7 | COYO700M | 705M | Config | Checkpoint / HF Hub |
| 4M-L | 21 | CC12M+COYO700M+C4 | 705M | Config | Checkpoint / HF Hub |
| 4M-XL | 7 | CC12M | 2.8B | Config | Checkpoint / HF Hub |
| 4M-XL | 7 | COYO700M | 2.8B | Config | Checkpoint / HF Hub |
| 4M-XL | 21 | CC12M+COYO700M+C4 | 2.8B | Config | Checkpoint / HF Hub |

To load models from Hugging Face Hub:

from fourm.models.fm import FM

fm7b_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7_B_CC12M')
fm7b_coyo   = FM.from_pretrained('EPFL-VILAB/4M-7_B_COYO700M')
fm21b       = FM.from_pretrained('EPFL-VILAB/4M-21_B')

fm7l_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7_L_CC12M')
fm7l_coyo   = FM.from_pretrained('EPFL-VILAB/4M-7_L_COYO700M')
fm21l       = FM.from_pretrained('EPFL-VILAB/4M-21_L')

fm7xl_cc12m = FM.from_pretrained('EPFL-VILAB/4M-7_XL_CC12M')
fm7xl_coyo  = FM.from_pretrained('EPFL-VILAB/4M-7_XL_COYO700M')
fm21xl      = FM.from_pretrained('EPFL-VILAB/4M-21_XL')

To load the checkpoints manually, first download the safetensors files from the above links and call:

from fourm.utils import load_safetensors
from fourm.models.fm import FM

ckpt, config = load_safetensors('/path/to/checkpoint.safetensors')
fm = FM(config=config)
fm.load_state_dict(ckpt)

4M text-to-image specialist models

These models were initialized with the standard 4M-7 CC12M models, then trained further with a modality mixture heavily biased towards text inputs. They can still perform all other tasks, but achieve better text-to-image generation than the non-finetuned models.

| Model | # Mod. | Datasets | # Params | Config | Weights |
|-------|--------|----------|----------|--------|---------|
| 4M-T2I-B | 7 | CC12M | 198M | Config | Checkpoint / HF Hub |
| 4M-T2I-L | 7 | CC12M | 705M | Config | Checkpoint / HF Hub |
| 4M-T2I-XL | 7 | CC12M | 2.8B | Config | Checkpoint / HF Hub |

To load models from Hugging Face Hub:

from fourm.models.fm import FM

fm7b_t2i_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-T2I_B_CC12M')
fm7l_t2i_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-T2I_L_CC12M')
fm7xl_t2i_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-T2I_XL_CC12M')

Loading manually from checkpoints is performed in the same way as above for the base 4M models.
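As a hedged sketch (assuming the Demo4MSampler wrapper from the Getting started section also accepts these specialist checkpoints via its fm argument, which is not verified here; see README_GENERATION.md for the supported generation setup), text-to-image generation could look like:

```python
from fourm.demo_4M_sampler import Demo4MSampler

# Assumption: the demo wrapper can load the T2I specialist checkpoint by its Hub ID.
sampler = Demo4MSampler(fm='EPFL-VILAB/4M-7-T2I_XL_CC12M').cuda()
preds = sampler({'caption': 'A lake house with a boat in front [S_1]'})
sampler.plot_modalities(preds, save_path=None)
```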

4M super-resolution models

| Model | # Mod. | Datasets | # Params | Config | Weights |
|-------|--------|----------|----------|--------|---------|
| 4M-SR-L | 7 | CC12M | 198M | Config | Checkpoint / HF Hub |

To load models from Hugging Face Hub:

from fourm.models.fm import FM

fm7l_sr_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-SR_L_CC12M')

Loading manually from checkpoints is performed in the same way as above for the base 4M models.

Tokenizers

| Modality | Resolution | Number of tokens | Codebook size | Diffusion decoder | Weights |
|----------|------------|------------------|---------------|-------------------|---------|
| RGB | 224-448 | 196-784 | 16k | ✓ | Checkpoint / HF Hub |
| Depth | 224-448 | 196-784 | 8k | ✓ | Checkpoint / HF Hub |
| Normals | 224-448 | 196-784 | 8k | ✓ | Checkpoint / HF Hub |
| Edges (Canny, SAM) | 224-512 | 196-1024 | 8k | ✓ | Checkpoint / HF Hub |
| COCO semantic segmentation | 224-448 | 196-784 | 4k | ✗ | Checkpoint / HF Hub |
| CLIP-B/16 | 224-448 | 196-784 | 8k | ✗ | Checkpoint / HF Hub |
| DINOv2-B/14 | 224-448 | 256-1024 | 8k | ✗ | Checkpoint / HF Hub |
| DINOv2-B/14 (global) | 224 | 16 | 8k | ✗ | Checkpoint / HF Hub |
| ImageBind-H/14 | 224-448 | 256-1024 | 8k | ✗ | Checkpoint / HF Hub |
| ImageBind-H/14 (global) | 224 | 16 | 8k | ✗ | Checkpoint / HF Hub |
| SAM instances | - | 64 | 1k | ✗ | Checkpoint / HF Hub |
| 3D Human poses | - | 8 | 1k | ✗ | Checkpoint / HF Hub |

To load models from Hugging Face Hub:

from fourm.vq.vqvae import VQVAE, DiVAE

# 4M-7 modalities
tok_rgb = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_rgb_16k_224-448')
tok_depth = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_depth_8k_224-448')
tok_normal = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_normal_8k_224-448')
tok_semseg = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_semseg_4k_224-448')
tok_clip = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_CLIP-B16_8k_224-448')

# 4M-21 modalities
tok_edge = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_edge_8k_224-512')
tok_dinov2 = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_DINOv2-B14_8k_224-448')
tok_dinov2_global = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_DINOv2-B14-global_8k_16_224')
tok_imagebind = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_ImageBind-H14_8k_224-448')
tok_imagebind_global = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_ImageBind-H14-global_8k_16_224')
sam_instance = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_sam-instance_1k_64')
human_poses = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_human-poses_1k_8')

To load the checkpoints manually, first download the safetensors files from the above links and call:

from fourm.utils import load_safetensors
from fourm.vq.vqvae import VQVAE, DiVAE

ckpt, config = load_safetensors('/path/to/checkpoint.safetensors')
tok = VQVAE(config=config) # Or DiVAE for models with a diffusion decoder
tok.load_state_dict(ckpt)
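As a quick sanity check, the sketch below loads a tokenizer from the Hub and inspects it; it only assumes the tokenizers are standard torch.nn.Module subclasses. See README_TOKENIZATION.md and the generation notebook for the actual encoding/decoding interface.

```python
import torch
from fourm.vq.vqvae import DiVAE

# Load the RGB tokenizer and move it to GPU if available (assumes an nn.Module interface).
tok_rgb = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_rgb_16k_224-448')
tok_rgb = tok_rgb.eval().to('cuda' if torch.cuda.is_available() else 'cpu')

num_params = sum(p.numel() for p in tok_rgb.parameters())
print(f'RGB tokenizer parameters: {num_params / 1e6:.1f}M')
```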

License

The code in this repository is released under the Apache 2.0 license as found in the LICENSE file.

The model weights in this repository are released under the Sample Code license as found in the LICENSE_WEIGHTS file.

Citation

If you find this repository helpful, please consider citing our work:

@inproceedings{4m,
    title={{4M}: Massively Multimodal Masked Modeling},
    author={David Mizrahi and Roman Bachmann and O{\u{g}}uzhan Fatih Kar and Teresa Yeo and Mingfei Gao and Afshin Dehghan and Amir Zamir},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
}

@article{4m21,
    title={{4M-21}: An Any-to-Any Vision Model for Tens of Tasks and Modalities},
    author={Roman Bachmann and O{\u{g}}uzhan Fatih Kar and David Mizrahi and Ali Garjani and Mingfei Gao and David Griffiths and Jiaming Hu and Afshin Dehghan and Amir Zamir},
    journal={arXiv 2024},
    year={2024},
}

ml-4m's People

Contributors

roman-bachmann, yahya010, kdu4108, dmizr, amir32002, epoyraz, ofkar

ml-4m's Issues

Transform from v2d format into video_description format and save in `video_description/` directory.

Goal: given v2d format of

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     └── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     ├── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 │     ...
 ...

produce a video_description/ modality data folder of the following format:

root/video_description/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

Each jsonl should look something like

[
    {
        "description": "here's a description",
        "start_frame_index": 0,
        "end_frame_index": 5
    },
    {
        "description": "here's another description",
        "start_frame_index": 5,
        "end_frame_index": 12
    }
]

Note that the txt/jsons in the v2d might not correspond exactly to the representation we want here (e.g., we might need some logic to determine the start/end frame indices from timestamps).
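A hedged, standard-library-only sketch of this conversion is below; every key read from the v2d .json ('segments', 'caption', 'start', 'end') is a placeholder, and the fps value is an assumption, since the actual v2d metadata schema still needs to be checked.

```python
import io
import json
import tarfile

def convert_shard(v2d_tar_path: str, out_tar_path: str, fps: int = 10) -> None:
    """Read per-sample v2d .json metadata and write one .jsonl per video into a new shard."""
    with tarfile.open(v2d_tar_path) as src, tarfile.open(out_tar_path, 'w') as dst:
        for member in src.getmembers():
            if not member.name.endswith('.json'):
                continue
            meta = json.load(src.extractfile(member))
            key = member.name.rsplit('.', 1)[0]  # e.g. '00000'
            segments = [
                {
                    'description': seg['caption'],                 # placeholder key
                    'start_frame_index': int(seg['start'] * fps),  # placeholder key
                    'end_frame_index': int(seg['end'] * fps),      # placeholder key
                }
                for seg in meta.get('segments', [])                # placeholder key
            ]
            payload = '\n'.join(json.dumps(s) for s in segments).encode()
            info = tarfile.TarInfo(name=f'{key}.jsonl')
            info.size = len(payload)
            dst.addfile(info, io.BytesIO(payload))

# e.g. convert_shard('00000.tar', 'root/video_description/shard-00000.tar')
```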

Where are these descriptions coming from? Do we pseudolabel them with another description model? @smontariol?
Child issue of #3.

Transform from v2d format into a metadata format and save in `metadata/` directory.

Goal: given v2d format of

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     └── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     ├── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 │     ...
 ...

produce a metadata/ modality data folder of the following format:

root/metadata/shard-00000.tar
 |     ├── 00000.json # this corresponds to one video.
 |     ├── 00001.json
 |     └── ...

Each json should look something like

{
    "video": {
        "fps": 10,
        "resolution": [512, 512],
        "dataset": "howto100m"
    }
}

(exact format/required keys TBD, since there should probably be more than just video metadata here, e.g., caption quality, 1st person vs. 3rd person, etc.)

Transform from `video_description` modality to `video_global_description` modality.

Given data from:

video_description modality

root/video_description/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

and the video_metadata modality

root/metadata/shard-00000.tar
 |     ├── 00000.json # this corresponds to one video.
 |     ├── 00001.json
 |     └── ...

and maybe video_transcript modality

root/video_transcript/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

generate a global description (one string) for each video. The output format should look like

root/video_global_description/shard-00000.tar
 |     ├── 00000.txt # this corresponds to the global description for one video.
 |     ├── 00001.txt
 |     └── ...

And each xxxxx.txt just contains a string, e.g.

This is a global description of a video!
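As a minimal starting point (just concatenating the per-segment descriptions in temporal order; a proper captioning/summarization model could replace this later), a sketch could be:

```python
import json

def global_description(jsonl_text: str) -> str:
    """Join per-segment descriptions from a video_description .jsonl into one string."""
    segments = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    segments.sort(key=lambda s: s['start_frame_index'])
    return ' '.join(s['description'] for s in segments)
```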

Implement the encoder embeddings to encode which frame it is (a temporal embedding, in addition to the patch position and modality embeddings).

According to https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144, we have to add the temporal/frame encoding to IMAGE-based modality embeddings (but not sequence based ones).

A good starting point: check out this line:

x_emb = repeat(self.pos_emb + self.mod_emb, '() n d -> b n d', b=B)

and do the same, but with an extra temporal embedding.

Things to consider: make sure the temporal frame embedding doesn't interfere with the positional patch embedding.
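A hedged, standalone sketch of the idea (all names and shapes here are assumptions, not the repository's actual implementation): keep the temporal embedding in its own learnable table and broadcast it over patches, so it cannot interfere with the positional patch embedding.

```python
import torch
from einops import repeat

B, T, N, D = 2, 8, 196, 768                  # batch, frames, patches per frame, embed dim
pos_emb = torch.zeros(1, N, D)               # stands in for self.pos_emb
mod_emb = torch.zeros(1, N, D)               # stands in for self.mod_emb
temp_emb = torch.nn.Parameter(torch.zeros(1, T, 1, D))  # new learnable frame embedding

# Same patch-position + modality embedding for every frame, plus a per-frame embedding.
x_emb = repeat(pos_emb + mod_emb, '() n d -> b t n d', b=B, t=T)
x_emb = (x_emb + temp_emb).reshape(B, T * N, D)
print(x_emb.shape)  # torch.Size([2, 1568, 768])
```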

Definition of Done: all image based encoder embeddings are augmented with a temporal embedding.

@vesteinn @garjania

[PARENT ISSUE] Implement the temporal changes in 4M to account for video

Implement the model according to this design: https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.

This includes (at least) several steps, each which will be detailed in its own github issue/PR:

  • Determine the correct format for storing each video modality and implement pseudolabelers/data downloaders/etc. to get the video data stored in parallel directories as usable by 4M and video2dataset. "Definition of done" here means we have the data in the right directories and we can load them in the correct format. (#3)
  • Implement modality_info and modality_transforms to map the new video modalities to their transformations which prepare them from the input filetype to be passable downstream to the model. (WIP PR: #1)
  • Implement the encoder embeddings to encode frame position (in addition to the patch position and modality embeddings). (#4)
  • Implement a masking strategy which masks consistently across temporal frames, i.e., if you mask out patch position 7 for one frame, do it for all frames in that clip (a minimal sketch follows this list). (#5)
  • TODO? @garjania what other steps are required here? anything for decoder embeddings?
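A minimal sketch of the frame-consistent masking mentioned above (an assumption about the intended behaviour, not the repository's masking code): sample one set of patch indices and reuse it for every frame in the clip.

```python
import torch

def frame_consistent_mask(num_frames: int, num_patches: int, num_masked: int) -> torch.Tensor:
    """Return a (num_frames, num_patches) boolean mask that is identical across frames."""
    masked_idx = torch.randperm(num_patches)[:num_masked]
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[masked_idx] = True
    return mask.unsqueeze(0).expand(num_frames, -1)

mask = frame_consistent_mask(num_frames=8, num_patches=196, num_masked=147)
assert bool((mask == mask[0]).all())  # every frame masks the same patch positions
```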

Create `filter_raw` function

Given a path to a raw dataset of v2d tar files, filter out all samples within each tar that are "bad", i.e., that don't have an mp4.
E.g., given the path raw/howto100m containing:

raw/howto100m
|    | - shard-00000.tar
|    | - | - 00000.mp4
|    | - | - 00000.m4a
|    | - | - 00000.json
|    | - | - 00002.mp4
|    | - | - 00002.m4a
|    | - | - 00002.json
|    | - | - 00010.json
|    | - | - 00011.json
|    | - | - 00012.json

we should end up with filtered_raw/howto100m containing:

filtered_raw/howto100m
|    | - shard-00000.tar
|    | - | - 00000.mp4
|    | - | - 00000.m4a
|    | - | - 00000.json
|    | - | - 00002.mp4
|    | - | - 00002.m4a
|    | - | - 00002.json

The video_rgb and other modalities should then all be drawn/pseudolabeled from filtered_raw/howto100m.
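A hedged, standard-library-only sketch of filter_raw (paths and shard naming are taken from the example above):

```python
import os
import tarfile

def filter_raw_shard(src_tar: str, dst_tar: str) -> None:
    """Copy only the samples that have an .mp4 into a new shard."""
    with tarfile.open(src_tar) as src:
        members = [m for m in src.getmembers() if m.isfile()]
        keys_with_mp4 = {m.name.rsplit('.', 1)[0] for m in members if m.name.endswith('.mp4')}
        with tarfile.open(dst_tar, 'w') as dst:
            for m in members:
                if m.name.rsplit('.', 1)[0] in keys_with_mp4:
                    dst.addfile(m, src.extractfile(m))

def filter_raw(src_dir: str, dst_dir: str) -> None:
    os.makedirs(dst_dir, exist_ok=True)
    for shard in sorted(os.listdir(src_dir)):
        if shard.endswith('.tar'):
            filter_raw_shard(os.path.join(src_dir, shard), os.path.join(dst_dir, shard))

# e.g. filter_raw('raw/howto100m', 'filtered_raw/howto100m')
```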

Transform from v2d format into video_transcript format and save in `video_transcript/` directory.

Goal: given v2d format of

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     └── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     ├── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 │     ...
 ...

produce a video_transcript/ modality data folder of the following format:

root/video_transcript/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

Each jsonl should look something like

[
    {
        "transcript": "here's a transcript",
        "start_frame_index": 0,
        "end_frame_index": 5
    },
    {
        "transcript": "here's another transcript",
        "start_frame_index": 10,
        "end_frame_index": 13
    }
]

Note that the txt/jsons in the v2d might not correspond exactly to the representation we want here (e.g., we might need some logic to determine the start/end frame indices from timestamps).

We may also want to run whisper to get improved transcripts.
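A hedged sketch using openai-whisper (the model size, the local path, and the fps used to convert Whisper's second-based timestamps into frame indices are all assumptions):

```python
import whisper

model = whisper.load_model('base')      # assumed model size
result = model.transcribe('00000.mp4')  # hypothetical local sample path
fps = 10                                # should come from the video metadata

segments = [
    {
        'transcript': seg['text'].strip(),
        'start_frame_index': int(seg['start'] * fps),
        'end_frame_index': int(seg['end'] * fps),
    }
    for seg in result['segments']
]
```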

Child issue of #3.

Download data to todi in video2dataset format.

Download HowTo100M and Hd-vila datasets in video2dataset format using the scripts/methods described in https://github.com/swiss-ai/video2dataset/tree/main/swiss_ai.

This will be downloaded in this format:

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     └── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     ├── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 │     ...
 ...

Child issue of #3.

Transform from video_rgb format into video_det format and save in `video_det/` directory.

Goal: given a video_rgb/ modality data folder of the following format:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

produce a video_det/ modality folder of the following format:

root/video_det/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one frame.
 |     ├── 00001.jsonl
 |     └── ...

For now, let's use the YOLO pseudolabeler.
This involves calling the YOLO pseudolabeler on the videos in video_rgb, generating the outputs, and moving them into the right directory paths. Each jsonl corresponds to one video. Each line in the jsonl corresponds to one frame.

Example jsonl:

[
        # FRAME 0 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                {
                    "boxes": [
                        0.4229210317134857,
                        0.00020096010121051222,
                        0.5715101361274719,
                        0.13699540495872498
                    ],
                    "score": 0.9029952883720398,
                    "class_id": 74,
                    "class_name": "clock",
                    "segmentation": [
                        [
                            0.5055187637969095,
                            0.1337890625,
                            ...
                        ]
                    ]
                },
                {
                    "boxes": [
                        ...
                    ],
                    ...
                },
                    ...
            ]
        },
        # FRAME 1 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                ...,
            ],
            ...
        }
]
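A hedged sketch of the pseudolabeling loop using ultralytics YOLO (the checkpoint name and the exact mapping onto the schema above are assumptions; segmentation polygons are omitted here):

```python
import json
from torchvision.io import read_video
from ultralytics import YOLO

model = YOLO('yolov8x-seg.pt')                                # assumed checkpoint
frames, _, _ = read_video('00000.mp4', output_format='TCHW')  # (T, C, H, W) uint8

per_frame = []
for frame in frames:
    res = model(frame.permute(1, 2, 0).numpy())[0]            # HWC numpy image
    instances = [
        {
            'boxes': box.tolist(),                            # normalized xyxy
            'score': float(conf),
            'class_id': int(cls),
            'class_name': res.names[int(cls)],
        }
        for box, conf, cls in zip(res.boxes.xyxyn, res.boxes.conf, res.boxes.cls)
    ]
    per_frame.append({
        'num_instances': len(instances),
        'image_height': res.orig_shape[0],
        'image_width': res.orig_shape[1],
        'instances': instances,
    })

with open('00000.jsonl', 'w') as f:
    json.dump(per_frame, f)  # one entry per frame, matching the example above
```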

Child issue of #3.

Transform from video_rgb format into video_tok_rgb format and save in `video_tok_rgb/` directory.

TODO: Modify and run save_vq_tokens.py to tokenize RGB videos.

save_vq_tokens.py is the file which you run in 4M to pretokenize images, e.g., to go from images of the modality rgb to examples of the modality tok_rgb. It takes a pretrained tokenizer and input dataset directory (among other things) and applies the tokenizer to the images in the input dataset to create the tokens in a new output dataset directory.

We want to get tokens for the RGB videos, going from an input directory of root/video_rgb to the tokenized examples in the output directory root/video_tok_rgb.

The steps to do this are to modify save_vq_tokens.py to have the following capabilities:

  1. It can load video files from the dataset folder in webdataset format (see the structure under video_rgb modality directory proposed in this post #3 (comment)).
  2. It can run the pretrained RGB tokenizer on each frame of each video.
  3. It saves the tokens as .npy files in webdataset format in the directory root/video_tok_rgb.

Proposed input directory format:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

Proposed output directory format:

root/video_tok_rgb/shard-00000.tar
 |     ├── 00000.npy # this corresponds to one video. shape: something like (num_frames, H, C, W)
 |     ├── 00001.npy
 |     └── ...

Definition of Done:

  • we have a script which can, given an input directory (e.g. video_rgb), pretrained tokenizer (e.g., from https://huggingface.co/collections/EPFL-VILAB/4m-tokenizers-66019388bda47e9bcff3f887), and output directory (e.g., video_tok_rgb), generate the tokenized representations of those videos according to the structure above saved to the output directory.
  • This script is run and we actually have tokenized rgb videos in root/video_tok_rgb.

(This is a subtask of #3)
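A hedged sketch of the per-frame tokenization loop (the tokenize call is a hypothetical method name; check DiVAE/VQVAE and the existing save_vq_tokens.py for the real interface, resizing, and ImageNet normalization):

```python
import numpy as np
import torch
from torchvision.io import read_video
from fourm.vq.vqvae import DiVAE

tok_rgb = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_rgb_16k_224-448').eval().cuda()

frames, _, _ = read_video('00000.mp4', output_format='TCHW')  # (T, C, H, W) uint8
frames = frames.float().div(255).cuda()                       # resizing/normalization omitted here

with torch.no_grad():
    tokens = [tok_rgb.tokenize(f.unsqueeze(0)) for f in frames]  # hypothetical method name

np.save('00000.npy', torch.cat(tokens).cpu().numpy())  # one token array per video
```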

[PARENT ISSUE] Data preprocessing and pseudolabeling

We want to get video RGB, video RGB tokens, video bounding boxes, video transcriptions, and video descriptions downloaded in a format that matches what 4M expects. Maybe that's something like

root/video_rgb/shard-00000.tar
root/video_rgb/shard-00001.tar
root/video_rgb/shard-00002.tar

root/video_tok_rgb/shard-00000.tar
root/video_tok_rgb/shard-00001.tar
root/video_tok_rgb/shard-00002.tar

root/video_det/shard-00000.tar
root/video_det/shard-00001.tar
root/video_det/shard-00002.tar

root/video_transcript/shard-00000.tar
root/video_transcript/shard-00001.tar
root/video_transcript/shard-00002.tar

root/video_description/shard-00000.tar
root/video_description/shard-00001.tar
root/video_description/shard-00002.tar

except I'm not sure, because maybe the text should just be JSON Lines or something? This is very much just a suggestion. The first task is to decide what makes the most sense, the second is to then implement it. Keep an eye also on #1, because that PR loads the data and will need to be updated to match the decisions made here (e.g., right now it assumes text is saved as JSONL, which I picked somewhat arbitrarily and is definitely up for change).

To get video_rgb, we just need to download using video2dataset, probably, with some file saving shenanigans to make it fit our naming/path formats/requirements.
To get video_tok_rgb, we need to run (for now) a pretrained tokenizer on the video_rgb files and save it appropriately with right filetype and names/paths/etc.
To get video_det, we need to run the YOLO pseudolabeler on the video_rgb files and save the outputs appropriately (maybe as JSONL?).
To get video_description, we need to run ???something??? on the video_rgb files and save the outputs appropriately (maybe as JSONL?).
To get video_transcript, we need to run whisper on the video_rgb files and save the transcripts appropriately (maybe as JSONL?). (We can also start with the default YouTube captions as an easier option so we don't bring whisper into the mix yet.)

Thank you @yahya for taking the lead on implementing these steps. @garjania if you could provide feedback/suggestions on the right format for saving these things/how this corresponds with video2dataset that'd be super helpful! I think one concrete unknown to pursue in making the decision is to first look at how video2dataset stores files and decide whether we bend more to follow video2dataset or if we use that as an intermediary to extract the captions, etc. and form them into this format for 4M. also @vesteinn if you're familiar with v2d?

Transform from v2d format into video_rgb format and save in `video_rgb/` directory

Goal: given v2d format of

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     └── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     ├── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 │     ...
 ...

produce a video_rgb/ modality data folder of the following format:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

Option 1: This should mostly just involve extracting the mp4/video files from the video2dataset format and moving them into the right directory paths.

Option 2: We can use v2d to also normalize the videos now, e.g., by making them the same number of frames.

We choose Option 2 because, by the time something lands in a modality folder, it should already have gone through the last preprocessing step before pseudolabeling for aligned data.

Child issue of #3.

Add train/val/test split script for raw

Given a folder data/raw/... where ... = the downloaded v2d datasets, we need a script to partition (move) them into subdirectories called train, val, and test.

This is because the 4M directory layout expects the train/val/test split to precede the modality, e.g.

train/video_rgb/class0/000.tar
train/video_det/class0/000.tar
train/video_transcript/class0/000.tar

Then we can run merge_data.sh to go from raw to 4m-data three times (once for each split).

For now we can use a split of 70/10/20.
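A hedged sketch of a shard-level 70/10/20 split using only the standard library (paths and the in-place layout are assumptions):

```python
import os
import random
import shutil

def split_shards(src_dir: str, seed: int = 0) -> None:
    """Move .tar shards in src_dir into train/, val/, and test/ subdirectories (70/10/20)."""
    shards = sorted(f for f in os.listdir(src_dir) if f.endswith('.tar'))
    random.Random(seed).shuffle(shards)
    n = len(shards)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    splits = {
        'train': shards[:n_train],
        'val': shards[n_train:n_train + n_val],
        'test': shards[n_train + n_val:],
    }
    for split, names in splits.items():
        out_dir = os.path.join(src_dir, split)
        os.makedirs(out_dir, exist_ok=True)
        for name in names:
            shutil.move(os.path.join(src_dir, name), os.path.join(out_dir, name))
```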
