rizavelioglu / hateful_memes-hate_detectron

Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge. https://arxiv.org/abs/2012.12975

Home Page: https://rizavelioglu.github.io/publication/2021-04-msc-thesis

License: MIT License

Topics: hateful-memes, multimodal-deep-learning, challenge, hateful-memes-challenge, vision-and-language

hateful_memes-hate_detectron's Introduction

Hateful Memes Challenge-Team HateDetectron Submissions


Check out the paper on arXiv and my thesis, which offers an in-depth analysis of the approach as well as an overview of multimodal research and its foundations.

This repository contains all the code used at the Hateful Memes Challenge by Facebook AI. There are 2 main Jupyter notebooks where all the work is done and documented:

  • The 'reproducing results' notebook --> Open In Colab
  • The 'end-to-end' notebook --> Open In Colab

The first notebook is only for reproducing the results of the Phase 2 submissions by the team HateDetectron. In other words, it just loads the final models and generates predictions for the test set. See the end-to-end notebook for the whole approach in detail: how the models are trained, how the image features are extracted, which datasets are used, etc.
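
Since the final predictions come from majority voting over the 27 best fine-tuned models (see the repository structure below), the last reproduction step is an ensembling of per-model outputs. Here is a minimal sketch of that idea, assuming each model writes a CSV in the challenge submission format (columns id, proba, label) into majority_voting_models/; the notebook's exact file layout may differ:

import glob
import pandas as pd

# One prediction file per fine-tuned model (hypothetical layout; assumes
# every file lists the same meme ids in the same order).
preds = [pd.read_csv(f) for f in sorted(glob.glob("majority_voting_models/*.csv"))]

ensemble = preds[0][["id"]].copy()
# Majority vote on the binary labels; average the probabilities.
ensemble["label"] = (sum(p["label"] for p in preds) > len(preds) / 2).astype(int)
ensemble["proba"] = sum(p["proba"] for p in preds) / len(preds)
ensemble.to_csv("submission.csv", index=False)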


About the Competition

The Hateful Memes Challenge and Data Set is a competition and open source data set designed to measure progress in multimodal vision-and-language classification.

Check out the following sources to learn more about the challenge:

Competition Results:

We placed 3rd out of 3,173 participants in total!

See the official Leaderboard here!


Repository structure

The repository consists of the following folders:
hyperparameter_sweep/ : contains the scripts for the hyperparameter search.
  • get_27_models.py: iterates through the folders created during the hyperparameter search, collects the metrics (ROC-AUC, accuracy) on the 'dev_unseen' set into a pd.DataFrame, sorts the models by the AUROC metric, and moves the best 27 models into a generated folder majority_voting_models/ (sketched just below).
  • remove_unused_file.py: removes unused files, e.g. old checkpoints, to free disk space.
  • sweep.py: defines the hyperparameters and starts the process by calling /sweep.sh.
  • sweep.sh: the mmf CLI command that runs training on the defined dataset with the defined parameters, etc.
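
A rough sketch of the selection logic in get_27_models.py; the run directory layout and metric keys here are hypothetical assumptions, not the script's exact code:

import glob, json, os, shutil
import pandas as pd

# Collect each sweep run's 'dev_unseen' metrics (hypothetical layout).
records = []
for run_dir in glob.glob("sweep_outputs/*"):
    with open(os.path.join(run_dir, "metrics.json")) as f:  # hypothetical metrics file
        metrics = json.load(f)
    records.append({"run": run_dir,
                    "auroc": metrics["dev_unseen/roc_auc"],
                    "accuracy": metrics["dev_unseen/accuracy"]})

# Rank by AUROC and keep the best 27 runs for majority voting.
df = pd.DataFrame(records).sort_values("auroc", ascending=False)
os.makedirs("majority_voting_models", exist_ok=True)
for run_dir in df.head(27)["run"]:
    shutil.move(run_dir, "majority_voting_models/")
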
notebooks/ : where Jupyter notebooks are stored.
  • end2end_process.ipynb: presents the whole approach end-to-end: expanding the data, extracting image features, hyperparameter search, fine-tuning, and majority voting.
  • reproduce_submissions.ipynb: loads our fine-tuned (final) models and generates predictions.
  • label_memotion.ipynb: uses /utils/label_memotion.py to label memes from Memotion and save them in an appropriate format.
  • simple_model.ipynb: implements a simple multimodal model, also known as 'mid-level concat fusion' (sketched below). We train the model and generate a submission for the challenge test set.
  • benchmarks.ipynb: reproduces the benchmark results.
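
For reference, 'mid-level concat fusion' encodes each modality separately, concatenates the two feature vectors, and classifies the result. A minimal PyTorch sketch with illustrative dimensions and encoders, not the notebook's exact architecture:

import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Mid-level concat fusion: encode image and text separately,
    concatenate the features, then classify (hateful vs. not)."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # e.g. pooled CNN features
        self.txt_proj = nn.Linear(txt_dim, hidden)  # e.g. pooled BERT features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # two logits: not-hateful / hateful
        )

    def forward(self, img_feats, txt_feats):
        # Concatenate the projected modalities along the feature axis.
        fused = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=-1)
        return self.classifier(fused)
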
utils/ : where helper scripts are stored, e.g. for labeling the Memotion dataset and merging the two datasets.
  • concat_memotion-hm.py: concatenates the labeled Memotion samples and the Hateful Memes samples and saves them in a new train.jsonl file (sketched after this list).
  • generate_submission.sh: generates predictions for the 'test_unseen' set (the Phase 2 test set).
  • label_memotion.jsonl: the memes from the Memotion dataset that we labeled.
  • label_memotion.py: the script for labeling the Memotion dataset. It iterates over the samples in Memotion, and the labeler labels each sample by pressing 1 or 0 on the keyboard. The labels and the sample metadata are saved at the end as label_memotion.jsonl.
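
Because the annotation files use the JSON-lines format (one JSON object per line), the merge amounts to concatenating the two files. A minimal sketch of that step, assuming plain JSON-lines inputs as described above:

import json

# Read both JSON-lines annotation files: one JSON object per line.
samples = []
for path in ("train.jsonl", "label_memotion.jsonl"):
    with open(path) as f:
        samples.extend(json.loads(line) for line in f if line.strip())

# Write the combined training set back out as a new train.jsonl.
with open("train.jsonl", "w") as out:
    for sample in samples:
        out.write(json.dumps(sample) + "\n")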

Citation:

@article{velioglu2020hateful,
  author    = {Velioglu, Riza and Rose, Jewgeni},
  title     = {Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge},
  doi       = {https://doi.org/jhb3},
  publisher = {arXiv},
  year      = {2020},
}

Please also consider citing my thesis:

@mastersthesis{velioglu2021detecting,
  title   = "Detecting Hate Speech In Multimodal Memes Using Vision-Language Models",
  author  = "Velioglu, Riza",
  school  = "Bielefeld University",
  year    = "2021",
  url     = "http://rizavelioglu.github.io/files/RizaVelioglu-MScThesis.pdf"
}

Contact:


hateful_memes-hate_detectron's Issues

FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/mmf/data/datasets/hateful_memes/defaults/annotations/dev.jsonl'

I am getting an error when trying to fine-tune the pretrained VisualBERT model on the Hateful Memes dataset. I am not sure why it is looking for dev.jsonl when no such file came with the hateful_memes dataset package. Can somebody point me to the config changes needed in MMF to make this work?

Code:
"""
Uncomment it if needed
"""
#os.environ['OC_DISABLE_DOT_ACCESS_WARNING']="1"

os.chdir(home)

Define where image features are

feats_dir = os.path.join(home, "features")

Define where train.jsonl is

train_dir = os.path.join(home, "train_v10.jsonl")

!mmf_run config="projects/visual_bert/configs/hateful_memes/from_coco.yaml" \
    model="visual_bert" \
    dataset=hateful_memes \
    run_type=train_val \
    checkpoint.resume_zoo=visual_bert.pretrained.cc.full \
    training.tensorboard=True \
    training.checkpoint_interval=50 \
    training.evaluation_interval=50 \
    training.max_updates=3000 \
    training.log_interval=100 \
    dataset_config.hateful_memes.max_features=100 \
    dataset_config.hateful_memes.annotations.train[0]=$train_dir \
    dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_unseen.jsonl \
    dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_unseen.jsonl \
    dataset_config.hateful_memes.features.train[0]=$feats_dir \
    dataset_config.hateful_memes.features.val[0]=$feats_dir \
    dataset_config.hateful_memes.features.test[0]=$feats_dir \
    training.lr_ratio=0.3 \
    training.use_warmup=True \
    training.batch_size=32 \
    optimizer.params.lr=5.0e-05 \
    env.save_dir=./sub1 \
    env.tensorboard_logdir=logs/fit/sub1

Error Log:

Namespace(config_override=None, local_rank=None, opts=['config=projects/visual_bert/configs/hateful_memes/from_coco.yaml', 'model=visual_bert', 'dataset=hateful_memes', 'run_type=train_val', 'checkpoint.resume_zoo=visual_bert.pretrained.cc.full', 'training.tensorboard=True', 'training.checkpoint_interval=50', 'training.evaluation_interval=50', 'training.max_updates=3000', 'training.log_interval=100', 'dataset_config.hateful_memes.max_features=100', 'dataset_config.hateful_memes.annotations.train[0]=/content/train_v10.jsonl', 'dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_unseen.jsonl', 'dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_unseen.jsonl', 'dataset_config.hateful_memes.features.train[0]=/content/features', 'dataset_config.hateful_memes.features.val[0]=/content/features', 'dataset_config.hateful_memes.features.test[0]=/content/features', 'training.lr_ratio=0.3', 'training.use_warmup=True', 'training.batch_size=32', 'optimizer.params.lr=5.0e-05', 'env.save_dir=./sub1', 'env.tensorboard_logdir=logs/fit/sub1'])
/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py:252: UserWarning: Keys with dot (model.bert) are deprecated and will have different semantic meaning the next major version of OmegaConf (2.1)
See the compact keys issue for more details: omry/omegaconf#152
You can disable this warning by setting the environment variable OC_DISABLE_DOT_ACCESS_WARNING=1
warnings.warn(message=msg, category=UserWarning)
Overriding option config to projects/visual_bert/configs/hateful_memes/from_coco.yaml
Overriding option model to visual_bert
Overriding option datasets to hateful_memes
Overriding option run_type to train_val
Overriding option checkpoint.resume_zoo to visual_bert.pretrained.cc.full
Overriding option training.tensorboard to True
Overriding option training.checkpoint_interval to 50
Overriding option training.evaluation_interval to 50
Overriding option training.max_updates to 3000
Overriding option training.log_interval to 100
Overriding option dataset_config.hateful_memes.max_features to 100
Overriding option training.lr_ratio to 0.3
Overriding option training.use_warmup to True
Overriding option training.batch_size to 32
Overriding option optimizer.params.lr to 5.0e-05
Overriding option env.save_dir to ./sub1
Overriding option env.tensorboard_logdir to logs/fit/sub1
Using seed 22503996
Logging to: ./sub1/logs/train_2022-04-20T11:00:22.log
Downloading features.tar.gz: 100% 8.44G/8.44G [05:03<00:00, 27.8MB/s]
Downloading extras.tar.gz: 100% 211k/211k [00:00<00:00, 484kB/s]
Traceback (most recent call last):
  File "/usr/local/bin/mmf_run", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.7/dist-packages/mmf_cli/run.py", line 111, in run
    main(configuration, predict=predict)
  File "/usr/local/lib/python3.7/dist-packages/mmf_cli/run.py", line 40, in main
    trainer.load()
  File "/usr/local/lib/python3.7/dist-packages/mmf/trainers/base_trainer.py", line 59, in load
    self.load_datasets()
  File "/usr/local/lib/python3.7/dist-packages/mmf/trainers/base_trainer.py", line 83, in load_datasets
    self.dataset_loader.load_datasets()
  File "/usr/local/lib/python3.7/dist-packages/mmf/common/dataset_loader.py", line 18, in load_datasets
    self.val_dataset.load(self.config)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/multi_dataset_loader.py", line 114, in load
    self.build_datasets(config)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/multi_dataset_loader.py", line 131, in build_datasets
    dataset_instance = build_dataset(dataset, dataset_config, self.dataset_type)
  File "/usr/local/lib/python3.7/dist-packages/mmf/utils/build.py", line 106, in build_dataset
    dataset = builder_instance.load_dataset(config, dataset_type)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/base_dataset_builder.py", line 96, in load_dataset
    dataset = self.load(config, dataset_type, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/builders/hateful_memes/builder.py", line 39, in load
    self.dataset = super().load(config, dataset_type, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/mmf_dataset_builder.py", line 141, in load
    dataset = dataset_class(config, dataset_type, imdb_idx)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/builders/hateful_memes/dataset.py", line 19, in __init__
    super().__init__(dataset_name, config, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/mmf_dataset.py", line 25, in __init__
    self.annotation_db = self._build_annotation_db()
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/mmf_dataset.py", line 39, in _build_annotation_db
    return AnnotationDatabase(self.config, annotation_path)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/databases/annotation_database.py", line 24, in __init__
    self._load_annotation_db(path)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/databases/annotation_database.py", line 32, in _load_annotation_db
    self._load_jsonl(path)
  File "/usr/local/lib/python3.7/dist-packages/mmf/datasets/databases/annotation_database.py", line 39, in _load_jsonl
    with PathManager.open(path, "r") as f:
  File "/usr/local/lib/python3.7/dist-packages/mmf/utils/file_io.py", line 45, in open
    newline=newline,
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/mmf/data/datasets/hateful_memes/defaults/annotations/dev.jsonl'

mmf can't be found?

!mmf_run config="projects/visual_bert/configs/hateful_memes/from_coco.yaml" \
    model="visual_bert" \
    dataset=hateful_memes \
    run_type=train_val \
    checkpoint.max_to_keep=1 \
    checkpoint.resume_zoo=visual_bert.pretrained.cc.small_fifty_pc \
    training.tensorboard=True \
    training.checkpoint_interval=50 \
    training.evaluation_interval=50 \
    training.max_updates=3000 \
    training.log_interval=100 \
    dataset_config.hateful_memes.max_features=120 \
    dataset_config.hateful_memes.annotations.train[0]=$train_dir \
    dataset_config.hateful_memes.annotations.val[0]=$dev_unseen \
    dataset_config.hateful_memes.annotations.test[0]=$test_unseen \
    dataset_config.hateful_memes.features.train[0]=$feats_dir \
    dataset_config.hateful_memes.features.val[0]=$feats_dir \
    dataset_config.hateful_memes.features.test[0]=$feats_dir \
    training.lr_ratio=0.3 \
    training.use_warmup=True \
    training.batch_size=32 \
    optimizer.params.lr=5.0e-05 \
    env.save_dir=./sub1 \
    env.tensorboard_logdir=logs/fit/sub1

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/importlib_metadata/__init__.py", line 564, in from_name
    return next(cls.discover(name=name))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/usr/bin/mmf_run", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/local/lib/python3.7/dist-packages/importlib_metadata/__init__.py", line 988, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/local/lib/python3.7/dist-packages/importlib_metadata/__init__.py", line 566, in from_name
    raise PackageNotFoundError(name)
importlib_metadata.PackageNotFoundError: No package metadata was found for mmf

Who can help me solve this problem?

Error installing MMF

ERROR: Could not build wheels for pycocotools, tokenizers, which is required to install pyproject.toml-based projects

Testing this model on my hate dataset

Hi,
Can you please help me with how I can use my dataset as input to your model?

My dataset consists of a folder where all the images are stored and a CSV file with the columns (image_file_name, text_present_on_the_image).

Thanks.
