computervision_papernotes's Issues

Welcome to share your favorite CV papers here!

Welcome to share your favorite computer vision papers here; I'll read them and write up some notes if I find them interesting, too :)
You can also share your own notes/comments/blog posts along with the paper! 🍻

20 | Can Temporal Information Help with Contrastive Self-Supervised Learning?

Paper
No code available yet.

Authors:
Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille


Overview of the proposed temporal-aware contrastive self-supervised learning framework (TaCo).
TaCo mainly comprises three modules: a temporal augmentation module, a contrastive learning module, and a temporal pretext task module. Each temporal augmentation gets its own projection head and task head. The features extracted by the projection heads from the original video sequence and its augmented sequence are treated as a positive pair, and the remaining pairs are treated as negatives. The contrastive loss is computed as the sum of the losses over all pairs.
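
Below is a minimal sketch of this kind of contrastive objective (an InfoNCE/NT-Xent-style loss, not the authors' released code); `z_orig` and `z_aug` stand for the projection-head outputs of the original and temporally augmented clips.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_orig, z_aug, temperature=0.1):
    """z_orig, z_aug: (N, D) projection-head features for N clips."""
    z = F.normalize(torch.cat([z_orig, z_aug], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                              # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                          # never match with self
    n = z_orig.shape[0]
    # The positive for clip i is its augmented view (and vice versa);
    # every other entry in the row acts as a negative.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```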

17CVPR| (Oral) Focal Loss for Dense Object Detection

[Paper]
[Pytorch Code]

Main idea:

To address the extreme foreground-background class imbalance in one-stage object detectors, the authors down-weight the loss assigned to well-classified examples by proposing Focal Loss, which adds a modulating factor (1 − p_t)^γ to the standard cross-entropy criterion: FL(p_t) = −(1 − p_t)^γ log(p_t).

[Figure: focal loss vs. probability of the ground-truth class p_t, for several values of γ]

How does focal loss work?

As shown in the figure above, the modulating factor lowers the standard cross-entropy loss. Suppose γ = 2: an example classified with p_t = 0.9 has 100× lower loss than with CE, and with p_t ≈ 0.968 it has 1000× lower loss. This in turn increases the relative importance of correcting misclassified examples (whose loss is scaled down by at most 4× for p_t ≤ 0.5 and γ = 2).
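
A minimal PyTorch sketch of the formula above (the α-balanced binary variant; α = 0.25 and γ = 2 are the paper's suggested defaults, and the stand-alone function is a simplification of the detector's per-anchor loss):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: (N,) raw scores; targets: (N,) float labels in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # ce is -log(p_t)
```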

21ICCV| Emerging Properties in Self-Supervised Vision Transformers (DINO)

Paper
Code

Authors:
Mathilde Caron, Hugo Touvron, et al.
Facebook AI Research (FAIR).

Highlights:

  • A newly proposed self-supervised learning method framed as knowledge distillation with no labels. Notably, it avoids collapsed solutions in a different way: with a momentum (EMA) teacher encoder, combined with centering and sharpening of the teacher outputs.
  • It encourages "local-to-global" correspondences by feeding views of different sizes to the student and teacher encoders.
  • SSL ViT features explicitly contain the scene layout and, in particular, object boundaries (see the attention-map figures in the paper).
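
As a concrete illustration of the momentum teacher, here is a minimal sketch (not the official DINO code) of the EMA update; `momentum=0.996` is the paper's base value, which in practice follows a cosine schedule toward 1:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    # Teacher weights are an exponential moving average of student weights;
    # no gradients ever flow into the teacher.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1 - momentum)
```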

17ICCV| (Oral) Representation Learning by Learning to Count

Mehdi Noroozi, Hamed Pirsiavash, Paolo Favaro
University of Bern, University of Maryland, Baltimore County
[paper] && [code]

Main idea:
The authors relate transformations of images to transformations of their representations. Specifically, they use counting as a pretext task, which they formalize as a constraint relating the number of visual primitives counted in the tiles of an image to the number counted in its downsampled version.

The downsampling or scaling transformation exploits the fact that the number of visual primitives should be invariant to scale.

The tiling transformation allows equating the total number of visual primitives in each tile to that in the whole image.
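
A minimal sketch of this constraint as a training loss (a simplification, not the authors' code; `phi` is the counting network and `margin` is a hypothetical hyperparameter standing in for the paper's contrastive margin):

```python
import torch
import torch.nn.functional as F

def counting_loss(phi, x_down, x_tiles, y_down, margin=10.0):
    """phi: counting network; x_down: downsampled x; x_tiles: its 4 tiles;
    y_down: a downsampled *different* image used for the contrastive term."""
    count_down = phi(x_down)                     # counts on the downsampled image
    count_tiles = sum(phi(t) for t in x_tiles)   # counts summed over the tiles
    match = F.mse_loss(count_tiles, count_down)  # counts must agree across scales
    # Without a hinge term, phi(x) = 0 everywhere trivially satisfies the
    # constraint; counts of a different image y should NOT match the tile sum.
    mismatch = F.relu(margin - F.mse_loss(phi(y_down), count_tiles))
    return match + mismatch
```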


18ICLR| Unsupervised Representation Learning by Predicting Image Rotations

Paper && code

The paper proposes a new way of learning image representations from unlabeled data by predicting the image rotations. The problem formulation implicitly encourages the learned representation to be informative about the (foreground) object and its rotation. The idea is simple, but it turns out to be very effective. The authors demonstrate strong performance in multiple transfer learning scenarios, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification.
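
The pretext task is simple enough to sketch in a few lines of PyTorch (a minimal illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

def rotation_batch(images):
    """images: (N, C, H, W) -> (4N, C, H, W) copies rotated by 0/90/180/270."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.shape[0])
    return rotated, labels

def rotation_loss(model, images):
    rotated, labels = rotation_batch(images)
    # The pretext task is a plain 4-way classification of the applied rotation.
    return F.cross_entropy(model(rotated), labels.to(rotated.device))
```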

19CVPR| Revisiting Self-Supervised Visual Representation Learning

paper && code

The authors revisit numerous previously proposed self-supervised models, conduct a thorough large-scale study and, as a result, uncover multiple crucial insights. They challenge a number of common practices in self-supervised visual representation learning and observe that standard recipes for CNN design do not always translate to self-supervised representation learning.

19ICCV| U-CAM: Visual Explanation using Uncertainty based Class Activation Maps

paper
code coming soon

This paper improves visual question answering by using gradient-based certainty attention regions.

The proposed method yields improved uncertainty estimates that are correspondingly more certain or uncertain, correlate consistently with misclassification, and focus quantitatively on better attention regions than other state-of-the-art methods.

The proposed architecture can be easily incorporated in various existing VQA methods.

It could be used as a general means for obtaining improved uncertainty and explanation regions for various vision and language tasks.

20CVPR| A Multigrid Method for Efficiently Training Video Models

Paper & Code

Authors:
Chao-Yuan Wu¹,², Ross Girshick², Kaiming He², Christoph Feichtenhofer², Philipp Krähenbühl¹
¹The University of Texas at Austin, ²Facebook AI Research (FAIR)

Problem to be tackled
High-resolution models perform well but train slowly; low-resolution models train fast but are less accurate.
The trade-off is between the computation allocated to processing more examples per mini-batch and the computation allocated to larger temporal and spatial dimensions.

Core observation: The underlying sampling grid that is used to train video models need not be constant during training.

Highlight

To sidestep this trade-off, the paper proposes using variable mini-batch shapes: spatio-temporal resolutions that vary according to a schedule. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions (see the sketch at the end of this section). With this strategy we get faster training without losing accuracy.
Different shapes: obtained by resampling the training data on multiple sampling grids.
Sampling grid: specified by a temporal span, a spatial span, a temporal stride, and a spatial stride.

Methods

Baseline: a reference video model (C3D, I3D) trained by a baseline mini-batch optimizer (SGD) that operates on mini-batches of shape B×T×H×W (mini-batch size × number of frames × height × width) for some number of epochs (e.g., 100).

This paper: considers temporal and spatial shapes t×h×w formed by resampling the source videos with a new sampling grid that has its own spans and strides.
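
A minimal sketch of the core bookkeeping (a hypothetical helper, not the released code): when a grid shrinks the clip shape, the mini-batch size and learning rate are scaled up so that B×T×H×W stays roughly constant.

```python
def scaled_batch(base_batch, base_lr, base_shape, grid_shape):
    """base_shape / grid_shape: (T, H, W) of the baseline and the current grid."""
    bt, bh, bw = base_shape
    t, h, w = grid_shape
    scale = (bt * bh * bw) / (t * h * w)  # how much cheaper each clip became
    batch = int(base_batch * scale)       # process proportionally more clips
    lr = base_lr * scale                  # linear learning-rate scaling rule
    return batch, lr

# Halving T and halving both H and W makes each clip 8x cheaper:
# scaled_batch(8, 0.1, (16, 224, 224), (8, 112, 112)) -> (64, 0.8)
```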

19CVPR| Self-Supervised Representation Learning by Rotation Feature Decoupling

paper && code

The proposed method learns a split representation that contains both rotation related and unrelated parts.

They train neural networks by jointly predicting image rotations and discriminating individual instances.

In particular, the model decouples the rotation discrimination from instance discrimination, which allows to improve the rotation prediction by mitigating the influence of rotation label noise, as well as discriminate instances without regard to image rotations.

20CVPR| Self-training with Noisy Student improves ImageNet classification

[paper] && [code]
Authors:
Qizhe Xie¹, Minh-Thang Luong¹, Eduard Hovy², Quoc V. Le¹
¹Google Research, Brain Team, ²Carnegie Mellon University

Highlight

This paper presents Noisy Student Training, which extends self-training and distillation with equal-or-larger student models and noise added to the student during learning. The method achieves state-of-the-art results on ImageNet, and experiments on ImageNet-A, ImageNet-C, and ImageNet-P demonstrate its robustness.

Basically, Noisy Student Training has three main steps: (1) train a teacher model on labeled images (ImageNet), (2) use the teacher to generate pseudo labels on unlabeled images, and (3) train a student model on the combination of labeled and pseudo-labeled images. The algorithm is iterated a few times by treating the student as the new teacher, relabeling the unlabeled data, and training a new student.
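
A minimal sketch of this loop (with hypothetical `train_fn`/`predict_fn` helpers; in the actual recipe the noise is injected via RandAugment, dropout, and stochastic depth):

```python
def noisy_student_training(train_fn, predict_fn, labeled, unlabeled, iterations=3):
    teacher = train_fn(labeled, noisy=False)  # (1) clean teacher on labeled data
    for _ in range(iterations):
        # (2) the teacher generates pseudo labels for the unlabeled images
        pseudo = [(x, predict_fn(teacher, x)) for x in unlabeled]
        # (3) an equal-or-larger student trains on both, with noise injected
        student = train_fn(labeled + pseudo, noisy=True)
        teacher = student  # iterate: the student becomes the next teacher
    return teacher
```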

Main Contributions

  • ImageNet SOTA
    • Uses an extra dataset.
      • But an order of magnitude less data (300M images vs. the 3.5B used by the previous SOTA).
      • And the extra data is unlabeled rather than weakly labeled (the teacher model generates the pseudo labels).

19ICLR| Deep Anomaly Detection with Outlier Exposure

paper && code

Briefly, the authors improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach they call Outlier Exposure (OE).

For example, they use CIFAR-10 as the in-distribution set and 80 Million Tiny Images (with CIFAR-10 examples excluded) as the Outlier Exposure dataset.

This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, they find that Outlier Exposure significantly improves detection performance.
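
For a classifier, the OE objective can be sketched as standard cross-entropy on in-distribution data plus a term pushing predictions on outliers toward the uniform distribution (a minimal sketch, not the released code; λ = 0.5 follows the paper's default):

```python
import torch
import torch.nn.functional as F

def oe_loss(logits_in, targets_in, logits_out, lam=0.5):
    ce = F.cross_entropy(logits_in, targets_in)
    # Cross-entropy between the predictions on outliers and the uniform
    # distribution over classes (up to an additive constant).
    uniform_ce = -F.log_softmax(logits_out, dim=1).mean()
    return ce + lam * uniform_ce
```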

17ICML| Unsupervised Learning by Predicting Noise

[paper] && [code]
Main idea:
This paper introduces a generic framework to train deep networks, end-to-end, with no supervision. The authors propose to fix a set of target representations in a low-dimensional space, called Noise As Targets (NAT), and to constrain the deep features to align to them.
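
A minimal sketch of the idea (not the authors' implementation; the assignment update, which the paper handles with batch-wise Hungarian-style reassignments, is elided here):

```python
import torch
import torch.nn.functional as F

n_images, dim = 10000, 128
targets = F.normalize(torch.randn(n_images, dim), dim=1)  # fixed noise targets
assignment = torch.randperm(n_images)                     # image index -> target

def nat_loss(features, indices):
    """features: (B, dim) network outputs; indices: dataset ids of the batch."""
    f = F.normalize(features, dim=1)
    # Align each feature with its currently assigned target (maximize cosine).
    return -(f * targets[assignment[indices]]).sum(dim=1).mean()
```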

19CVPR| Learning Correspondence from the Cycle-consistency of Time

paper
code

Motivation: to learn representations that support reasoning at various levels of visual correspondence from scratch and without human supervision.

Main idea: use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch.

At training time, the proposed model learns a feature map representation to be useful for performing cycle-consistent tracking. At test time, they use the acquired representation to find nearest neighbors across space and time.
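
A heavily simplified sketch of the cycle-consistency signal (hypothetical shapes, not the authors' code): track patches forward through time with soft nearest-neighbor transitions on features, then backward, and penalize patches that do not map back onto themselves.

```python
import torch
import torch.nn.functional as F

def affinity(f_a, f_b, temperature=0.07):
    """f_a, f_b: (N, D) L2-normalized patch features -> (N, N) soft transition."""
    return F.softmax(f_a @ f_b.t() / temperature, dim=1)

def cycle_loss(frame_feats):
    """frame_feats: list of (N, D) normalized patch features for frames t0..tk."""
    trans = torch.eye(frame_feats[0].shape[0])
    for a, b in zip(frame_feats[:-1], frame_feats[1:]):        # track forward
        trans = trans @ affinity(a, b)
    for a, b in zip(frame_feats[:0:-1], frame_feats[-2::-1]):  # track backward
        trans = trans @ affinity(a, b)
    # After the full cycle, each patch should map back onto itself.
    target = torch.arange(trans.shape[0])
    return F.nll_loss(torch.log(trans + 1e-8), target)
```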

17CVPR| Image-to-Image Translation with Conditional Adversarial Networks

[paper] && [code]
Authors:
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory, UC Berkeley

Highlight

This paper treats image-to-image translation problems as ones where the input and output differ in surface appearance but are renderings of the same underlying structure. The results suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs.

They use a "U-Net"-based architecture as the generator, and for the discriminator they use a convolutional "PatchGAN" classifier, which only penalizes structure at the scale of patches: the discriminator tries to classify whether each N×N patch in an image is real or fake.
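
A minimal PatchGAN-style discriminator sketch (the layer sizes follow the commonly cited 70×70 pix2pix recipe, but treat the exact configuration as an assumption rather than the released code):

```python
import torch.nn as nn

def patchgan(in_channels=6):  # conditional GAN: input and output images concatenated
    def block(cin, cout, stride=2, norm=True):
        layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)]
        if norm:
            layers.append(nn.BatchNorm2d(cout))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(in_channels, 64, norm=False),
        *block(64, 128),
        *block(128, 256),
        *block(256, 512, stride=1),
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # one score per patch
    )
```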

The experiments cover a variety of tasks and datasets, including:

  • Semantic labels <-> photo, trained on the Cityscapes dataset.
  • Architectural labels -> photo, trained on the CMP Facades.
  • Map <-> aerial photo, trained on data scraped from Google Maps.
  • BW -> color photos.
  • Edge -> photo.
  • Sketch -> photo.
  • Day -> night.
  • Thermal -> color photos.
  • Photo with missing pixels -> inpainted photo.

19NIPS| Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

http://arxiv.org/abs/1906.12340

This paper shows that combining self-supervised learning with supervised learning can improve a model's robustness and uncertainty estimates. It reads like a survey backed by experiments.

In this paper, they found self-supervision can:

  • improve a model's robustness to adversarial examples, label corruption, and common input corruptions;
  • benefit out-of-distribution (OOD) detection on difficult, near-distribution outliers.

This paper has essentially no theory, but proposes lots of interesting directions for future work.

Self-Supervised Representation Learning

Lilian Weng's blog post

Self-supervised learning opens up a huge opportunity to better utilize unlabelled data while still training in a supervised manner. This post covers many interesting ideas for self-supervised learning tasks on images, videos, and control problems.

19CVPR| AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data

paper && code

Main idea:

As long as the unsupervised features successfully encode the essential information about the visual structures of original and transformed images, the transformation can be well predicted.

Highlight

The authors present a novel paradigm of unsupervised representation learning by Auto-Encoding Transformations (AET), in contrast to the conventional Auto-Encoding Data (AED).

The AET paradigm allows instantiating a large variety of transformations, from parameterized to non-parameterized and GAN-induced ones.

AET sets new state-of-the-art performance, coming much closer to the upper bounds set by fully supervised counterparts on the CIFAR-10, ImageNet, and Places datasets.

AED is based on the idea of reconstructing the input data at the output end: a good feature representation should contain sufficient information to reconstruct the input.

AET focuses on exploring dynamics of feature representations under different transformations, thereby revealing not only static visual structures but also how they would change by applying different transformations.

AET can be seen as a summary and refinement of previous AED-style methods.
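
A minimal AET sketch (not the authors' code): encode the original and transformed images, then regress the transformation parameters from the pair of representations. Here the transformation is assumed, for simplicity, to be a single rotation angle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AET(nn.Module):
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder
        self.decoder = nn.Linear(2 * feat_dim, 1)  # regress the rotation angle

    def forward(self, x, x_transformed):
        # The transformation is "auto-encoded": it must be recoverable from
        # the pair of representations rather than from the raw pixels.
        z = torch.cat([self.encoder(x), self.encoder(x_transformed)], dim=1)
        return self.decoder(z)

def aet_loss(model, x, x_rot, angle):
    return F.mse_loss(model(x, x_rot), angle)  # angle: (B, 1) applied rotations
```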
