CVPR 2021
Object Detection
UP-DETR: Unsupervised Pre-Training for Object Detection
May offer some inspiration for data-scarce scenarios. Not impressive.
Towards Open World Object Detection
the task of open world object detection:
- identify objects that have not been introduced to it as `unknown', without explicit supervision to do so
- incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received
Unsupervised Object Detection with LIDAR Clues
Marked; not very useful.
Multiple Instance Active Learning for Object Detection
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution
backbone design for object detection.
Macro level: Recursive Feature Pyramid, which incorporates extra feedback connections from the FPN into the bottom-up backbone layers.
Micro level: Switchable Atrous Convolution, which convolves features with different atrous rates and gathers the results using switch functions (rough sketch below).
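A minimal PyTorch sketch of the switchable-atrous-convolution idea, simplified for my notes (shared weights for both rates, a global switch instead of a spatial one, no pre/post global context modules; `SimpleSAC` and the rate values are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSAC(nn.Module):
    """Simplified switchable atrous convolution: the same 3x3 weight is
    applied at two dilation rates and the results are blended by a
    learned switch (names and rates here are illustrative)."""
    def __init__(self, channels, large_rate=3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.large_rate = large_rate
        # switch: global-average context -> 1x1 conv -> sigmoid in [0, 1]
        self.switch = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        s = torch.sigmoid(self.switch(F.adaptive_avg_pool2d(x, 1)))  # (N,1,1,1)
        y_small = F.conv2d(x, self.weight, padding=1, dilation=1)
        y_large = F.conv2d(x, self.weight, padding=self.large_rate,
                           dilation=self.large_rate)
        return s * y_small + (1.0 - s) * y_large
```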
Uncertainty-Aware Joint Salient Object Detection and Camouflaged Object Detection
Marked; unfamiliar area.
Joint-DetNAS: Upgrade Your Detector With NAS, Pruning and Dynamic Distillation
Three main components: NAS, pruning, and distillation.
The algorithm has two key steps:
- Student morphism: optimizes the student's architecture and removes redundant parameters.
- Dynamic distillation: finds the optimal matching teacher.
Beyond Max-Margin: Class Margin Equilibrium for Few-Shot Object Detection
mark
I3Net: Implicit Instance-Invariant Network for Adapting One-Stage Object Detectors
mark
Open-Vocabulary Object Detection Using Captions
An open-vocabulary object detection training approach for novel classes.
Sparse R-CNN: End-to-End Object Detection With Learnable Proposals
- Existing works on object detection heavily rely on dense object candidates.
- Sparse R-CNN instead uses a fixed sparse set of learned object proposals (a minimal sketch follows below).
- It completely avoids all effort related to object candidate design and many-to-one label assignment.
- No NMS.
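A rough sketch of the "fixed sparse set of learned proposals" idea, assuming 100 proposals of dimension 256 (both numbers illustrative; the class name is mine, not the paper's code):

```python
import torch
import torch.nn as nn

class LearnableProposals(nn.Module):
    """Minimal sketch of a sparse proposal set: a small number of
    learnable boxes (normalized cx, cy, w, h) paired with learnable
    proposal features, refined end-to-end by the detection head."""
    def __init__(self, num_proposals=100, feat_dim=256):
        super().__init__()
        # boxes start as image-sized priors and are learned during training
        self.proposal_boxes = nn.Parameter(
            torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(num_proposals, 1))
        self.proposal_feats = nn.Parameter(torch.randn(num_proposals, feat_dim))

    def forward(self, batch_size):
        boxes = self.proposal_boxes.unsqueeze(0).expand(batch_size, -1, -1)
        feats = self.proposal_feats.unsqueeze(0).expand(batch_size, -1, -1)
        return boxes, feats
```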
You Only Look One-Level Feature
Read before. Important: YOLOF.
Scaled-YOLOv4: Scaling Cross Stage Partial Network
mark
End-to-End Object Detection With Fully Convolutional Network
Distillation
Distilling Object Detectors via Decoupled Features
Main idea: features derived from different regions should be assigned different importance during distillation.
**Proposed:** decoupled features (DeFeat) for learning a better student detector; a rough sketch follows below.
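A hedged sketch of how region-decoupled feature distillation could look, assuming a binary GT-box mask and foreground/background weights of my own choosing (not the paper's exact loss):

```python
import torch

def decoupled_feature_distill(feat_s, feat_t, fg_mask, w_fg=2.0, w_bg=0.5):
    """Distill foreground and background regions of a feature map with
    separate weights. `fg_mask` is an (N, 1, H, W) binary mask of GT box
    regions; the weights are illustrative, not the paper's values."""
    diff = (feat_s - feat_t) ** 2
    fg = (diff * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
    bg_mask = 1.0 - fg_mask
    bg = (diff * bg_mask).sum() / bg_mask.sum().clamp(min=1.0)
    return w_fg * fg + w_bg * bg
```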
General Instance Distillation for Object Detection
Backbone & Dataset
Transformer Interpretability Beyond Attention Visualization
For further investigation
Re-Labeling ImageNet: From Single to Multi-Labels, From Global to Localized Labels
github page: https://github.com/naver-ai/relabel_imagenet
Involution: Inverting the Inherence of Convolution for Visual Recognition
backbone: RedNet
Gaussian Context Transformer
As the title suggests: a backbone paper.
Capsule Network Is Not More Robust Than Convolutional Network
mark
How Does Topology Influence Gradient Propagation and Model Performance of Deep Networks with DenseNet-Type Skip Connections?
mark
RepVGG: Making VGG-Style ConvNets Great Again
Inference-time body: a plain stack of 3x3 convolutions and ReLU.
Training stage: multi-branch topology; a structural re-parameterization technique decouples the training and inference architectures (sketched below).
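A sketch of the re-parameterization step, assuming bias-free convs with groups=1 and a separate BN per branch (function names are mine; the real repo handles more cases):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv_weight, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv: returns (kernel, bias)."""
    std = (bn.running_var + bn.eps).sqrt()
    t = (bn.weight / std).reshape(-1, 1, 1, 1)
    return conv_weight * t, bn.bias - bn.running_mean * bn.weight / std

def reparam_repvgg_block(conv3x3, bn3, conv1x1, bn1, bn_id=None):
    """Merge the 3x3, 1x1 and identity branches (each with its own BN)
    into one equivalent 3x3 kernel + bias used at inference."""
    k3, b3 = fuse_conv_bn(conv3x3.weight, bn3)
    k1, b1 = fuse_conv_bn(conv1x1.weight, bn1)
    k1 = F.pad(k1, [1, 1, 1, 1])          # place the 1x1 at the kernel centre
    kernel, bias = k3 + k1, b3 + b1
    if bn_id is not None:                  # identity branch exists when C_in == C_out
        c = conv3x3.weight.shape[0]
        ident = torch.zeros_like(k3)
        for i in range(c):
            ident[i, i, 1, 1] = 1.0        # identity expressed as a 3x3 kernel
        ki, bi = fuse_conv_bn(ident, bn_id)
        kernel, bias = kernel + ki, bias + bi
    return kernel, bias
```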
Sewer-ML: A Multi-Label Sewer Defect Classification Dataset and Benchmark
mark
Bottleneck Transformer for Visual Recognition
mark
Approaches and Related
OTA: Optimal Transport Assignment for Object Detection
label assignment
ATSS proposes to set the positive/negative division boundary for each gt based on statistical characteristics.
Recent studies show that the predicted scores of anchors can serve as an effective reference for designing dynamic assignment strategies.
Hand-crafted anchor assignment: Min Area, Max IoU.
Point: assigning ambiguous anchors to any gt or to background may introduce harmful gradients w.r.t. other gts.
DETR: the first work to consider label assignment from a global view (Hungarian algorithm).
To achieve a globally optimal assignment in the one-to-many situation, they formulate label assignment as an **Optimal Transport (OT)** problem.
Solver: Sinkhorn-Knopp iteration ==> Optimal Transport Assignment (a minimal sketch follows).
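A minimal Sinkhorn-Knopp sketch for this kind of assignment, assuming a precomputed cost matrix with a background row; `eps` and the iteration count are illustrative, not the paper's settings:

```python
import torch

def sinkhorn_knopp(cost, supply, demand, eps=0.1, iters=50):
    """`cost` is (num_gt+1, num_anchors), last row = background;
    `supply` is the per-GT label budget, `demand` is ones per anchor.
    Returns the per-anchor assigned GT index (or background)."""
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    u = torch.ones_like(supply)
    v = torch.ones_like(demand)
    for _ in range(iters):
        u = supply / (K @ v).clamp(min=1e-8)       # row scaling
        v = demand / (K.t() @ u).clamp(min=1e-8)   # column scaling
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)     # transport plan
    return plan.argmax(dim=0)
```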
GAIA: A Transfer Learning System of Object Detection That Fits Your Needs
GAIA is capable of providing powerful pre-trained weights and selecting models that conform to downstream demands.
RankDetNet: Delving Into Ranking Constraints for Object Detection
Three ranking constraints: global ranking, class-specific ranking, and IoU-guided ranking losses.
- The global ranking loss encourages foreground samples to rank higher than background.
- The class-specific ranking loss ensures that positive samples rank higher than negative ones for each specific class.
- The IoU-guided ranking loss aims to align each pair of confidence scores with the associated pair of IoU overlap between two positive samples of a specific class.
Advantages: simple to implement; no extra computation at inference.
Improves the RetinaNet baseline by 2.5% AP on the COCO test-dev set. (A rough sketch of the global ranking term follows.)
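What the global ranking constraint could look like, assuming a pairwise hinge over all foreground/background score pairs (the margin value and exact formulation are my assumptions, not the paper's):

```python
import torch

def global_ranking_loss(fg_scores, bg_scores, margin=0.2):
    """Encourage every foreground score to exceed every background
    score by a margin, via a pairwise hinge over all fg/bg pairs."""
    # pairwise differences: shape (num_fg, num_bg)
    diff = bg_scores.unsqueeze(0) - fg_scores.unsqueeze(1) + margin
    return torch.clamp(diff, min=0).mean()
```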
AQD: Towards Accurate Quantized Object Detection
Proposes an Accurate Quantized object Detection solution, termed AQD, that fully gets rid of floating-point computation.
PANDA: Adapting Pretrained Features for Anomaly Detection and Segmentation
mark
IQDet: Instance-Wise Quality Distribution Sampling for Object Detection
- Dense Object Detector
Equalization Loss v2: A New Gradient Balance Approach for Long-Tailed Object Detection
Data-Uncertainty Guided Multi-Phase Learning for Semi-Supervised Object Detection
Adaptive Class Suppression Loss for Long-Tail Object Detection
Humble Teachers Teach Better Students for Semi-Supervised Object Detection
Unbiased Mean Teacher for Cross-Domain Object Detection
Object detection models are often vulnerable to data variance.
Proposes a new Unbiased Mean Teacher (UMT) model for cross-domain object detection.
Points As Queries: Weakly Semi-Supervised Object Detection by Points
The dataset comprises a small set of fully annotated images and a large set of images weakly annotated by points.
Proposes Point DETR: extends DETR by adding a point encoder.
When using 20% fully labeled data from COCO, the detector achieves a promising 33.3 AP.
Informative and Consistent Correspondence Mining for Cross-Domain Weakly Supervised Object Detection
Cross domain weakly supervised object detection
Just marked; not useful for me.
Beyond Bounding-Box: Convex-Hull Feature Adaptation for Oriented and Densely Packed Object Detection
Targets: oriented and densely packed objects.
proposed: Convex-hull feature adaptation (CFA) for configuring convolutional features in accordance with oriented and densely packed object layouts.
Key concept: CFA is rooted in convex-hull feature representation (CFR).
CFR defines a set of dynamically predicted feature points guided by convex IoU (CIoU).
Group Whitening: Balancing Learning Efficiency and Representation Capacity
Group whitening exploits the advantages of the whitening operation while avoiding the disadvantages of normalization within mini-batches (rough sketch below).
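A rough sketch of per-sample group whitening via ZCA over spatial positions (group count, eps, and the exact normalization details are my assumptions):

```python
import torch

def group_whitening(x, num_groups=8, eps=1e-5):
    """Split channels of each sample into groups and ZCA-whiten each
    group's features over the spatial positions."""
    n, c, h, w = x.shape
    d = c // num_groups
    xg = x.reshape(n, num_groups, d, h * w)
    xg = xg - xg.mean(dim=-1, keepdim=True)
    eye = torch.eye(d, device=x.device, dtype=x.dtype)
    cov = xg @ xg.transpose(-1, -2) / (h * w) + eps * eye
    # inverse square root of the covariance via eigendecomposition
    evals, evecs = torch.linalg.eigh(cov)
    inv_sqrt = evecs @ torch.diag_embed(evals.clamp(min=eps).rsqrt()) \
               @ evecs.transpose(-1, -2)
    return (inv_sqrt @ xg).reshape(n, c, h, w)
```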
OPANAS: One-Shot Path Aggregation Network Architecture Search for Object Detection
NAS has already been applied to searching FPN architectures; OPANAS improves both search efficiency and detection accuracy.
Search space: six heterogeneous information paths are introduced to build a novel FPN search space, where each FPN candidate is represented by a densely-connected directed acyclic graph.
Search approach: one-shot search.
Dynamic Head: Unifying Object Detection Heads With Attentions
Read before.
Scale-Aware Automatic Augmentation for Object Detection
Data augmentation.
Defines a new scale-aware search space: image-level and box-level augmentations are designed to maintain scale invariance.
Proposes a new search metric: Pareto Scale Balance.
Class-aware Robust Adversarial Training for Object Detection
RPN Prototype Alignment for Domain Adaptive Object Detector
Cross domain
A2-FPN: Attention Aggregation Based Feature Pyramid Network for Instance Segmentation
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection
Localization Quality Estimation (LQE) is crucial and popular in recent dense object detectors, since it provides accurate ranking scores that benefit NMS and improve detection performance.
The Translucent Patch: A Physical and Universal Attack on Object Detectors
Robust and Accurate Object Detection via Adversarial Learning
Data Augmentation
ICCV 2021
Amusi's repo: https://github.com/amusi/ICCV2021-Papers-with-Code
Backbone
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions
Read; two main contributions:
- PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce computations of large feature maps.
- PVT has the advantages of both CNN and Transformer.
LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference
Read; there is also a v2.
GLiT: Neural Architecture Search for Global and Local Image Transformer
- Introduced a locality module, search space is defined to let the search algo freely trade off between global and local information as well as optimizing the low-level design choice in each module.
- A hierarchical NAS method is proposed to search optimal vision transformer from two levels separately with evolutionary algo to tackle the problem caused by huge search space.
Understanding Robustness of Transformers for Image Classification
Read.
Co-Scale Conv-Attentional Image Transformers
- The co-scale mechanism maintains the integrity of the Transformer encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other.
- devised a conv-attentional mechanism by realizing position embedding formulation in the factorized attention module with an efficient convolution-like implementation.
Aggregation With Feature Detection
Visformer: The Vision-Friendly Transformer
Going Deeper with Image Transformers
CaiT
Multiscale Vision Transformers
For action recognition (video), not for common CV tasks.
Unifying Nonlocal Blocks for Neural Networks
non-local blocks are designed for capturing global spatial information in cv tasks.
Proposes an efficient and robust spectral nonlocal block, which is more robust and flexible at capturing long-range dependencies when inserted into DNNs than existing models.
Tokens-to-Token ViT: Training Vision Transformers From Scratch on ImageNet
Read; seems to be pre-training free as well (trained from scratch on ImageNet).
Incorporating Convolution Designs Into Visual Transformers
LeFF
FcaNet: Frequency Channel Attention Networks
CvT: Introducing Convolutions to Vision Transformers
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
Point
Conformer: Local Features Coupling Global Representations for Visual Recognition
AutoFormer: Searching Transformers for Visual Recognition
Scalable Vision Transformers With Hierarchical Pooling
HVT
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Visual Transformers: Where Do Transformers Really Belong in Vision Models?
Object Detection
PnP-DETR: Towards Efficient Visual Analysis With Transformers
Model compression for DETR.
Knowledge Mining and Transferring for Domain Adaptive Object Detection
Knowledge Transfer Network
Conditional DETR for Fast Training Convergence
6.7x faster convergence for R50 and R101, 10x faster for DC5-R50 and DC5-R101.
Towards Rotation Invariance in Object Detection
Multi-Source Domain Adaptation for Object Detection
?
Rethinking Transformer-Based Set Prediction for Object Detection
- Transformer-based Set Prediction with FCOS
- Transformer-based Set Prediction with RCNN
Dynamic DETR: End-to-End Object Detection With Dynamic Attention
CrossDet: Crossline Representation for Object Detection
- CrossDet uses a set of growing cross lines along horizontal and vertical axes as object representations. An object can be flexibly represented as cross lines in different combinations.
Fast Convergence of DETR With Spatially Modulated Co-Attention
- Attention mechanism: SMCA
GraphFPN: Graph Feature Pyramid Network for Object Detection
TOOD: Task-Aligned One-Stage Object Detection
WB-DETR: Transformer-Based Detector Without Backbone
- without backbone in normal DETR
- LIE-T2T: a local information enhancement tokens-to-token module that enhances the internal information of tokens after unfolding.
Distillation
G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-Guided Feature Imitation
knowledge distillation
Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better
Proposes a novel adversarial robustness distillation method, Robust Soft Label Adversarial Distillation (RSLAD), to train small student models.
Other Tricks
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
Uses two techniques to enhance ViT's ability to encode high-resolution images:
- multi-scale structure
- attention mechanism of Vision Longformer
Bit-Mixer: Mixed-Precision Networks With Runtime Bit-Width Selection
mixed precision computation