CVPR 2021
Object Detection
UP-DETR: Unsupervised Pre-Training for Object Detection
May offer some inspiration for data-scarce scenarios. Not impressive.
Towards Open World Object Detection
the task of open world object detection:
- identify objects that have not been introduced to it as `unknown', without explicit supervision to do so
- incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received
Unsupervised Object Detection with LIDAR Clues
Marked; not very useful.
Multiple Instance Active Learning for Object Detection
DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution
backbone design for object detection.
Macro level: Recursive Feature Pyramid, which incorporates extra feedback connections from the FPN into the bottom-up backbone layers.
Micro level: Switchable Atrous Convolution, which convolves features with different atrous rates and gathers the results using switch functions (rough sketch below).
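A minimal PyTorch sketch of the switchable-atrous-convolution idea, simplified for my notes (shared weights for both rates, a global switch instead of a spatial one, no pre/post global context modules; `SimpleSAC` and the rate values are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSAC(nn.Module):
    """Simplified switchable atrous convolution: the same 3x3 weight is
    applied at two dilation rates and the results are blended by a
    learned switch (names and rates here are illustrative)."""
    def __init__(self, channels, large_rate=3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.large_rate = large_rate
        # switch: global-average context -> 1x1 conv -> sigmoid in [0, 1]
        self.switch = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        s = torch.sigmoid(self.switch(F.adaptive_avg_pool2d(x, 1)))  # (N,1,1,1)
        y_small = F.conv2d(x, self.weight, padding=1, dilation=1)
        y_large = F.conv2d(x, self.weight, padding=self.large_rate,
                           dilation=self.large_rate)
        return s * y_small + (1.0 - s) * y_large
```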
Uncertainty-Aware Joint Salient Object Detection and Camouflaged Object Detection
Marked; unfamiliar area.
Joint-DetNAS: Upgrade Your Detector With NAS, Pruning and Dynamic Distillation
Three main components: NAS, pruning, and distillation.
The algorithm has two key steps:
- Student morphism: optimizes the student's architecture and removes redundant parameters.
- Dynamic distillation: finds the optimal matching teacher.
Beyond Max-Margin: Class Margin Equilibrium for Few-Shot Object Detection
mark
I3Net: Implicit Instance-Invariant Network for Adapting One-Stage Object Detectors
mark
Open-Vocabulary Object Detection Using Captions
An open-vocabulary object detection training approach for novel classes.
Sparse R-CNN: End-to-End Object Detection With Learnable Proposals
- Existing works on object detection heavily rely on dense object candidates.
- Sparse R-CNN instead uses a fixed sparse set of learned object proposals (a minimal sketch follows below).
- It completely avoids all effort related to object candidate design and many-to-one label assignment.
- No NMS.
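A rough sketch of the "fixed sparse set of learned proposals" idea, assuming 100 proposals of dimension 256 (both numbers illustrative; the class name is mine, not the paper's code):

```python
import torch
import torch.nn as nn

class LearnableProposals(nn.Module):
    """Minimal sketch of a sparse proposal set: a small number of
    learnable boxes (normalized cx, cy, w, h) paired with learnable
    proposal features, refined end-to-end by the detection head."""
    def __init__(self, num_proposals=100, feat_dim=256):
        super().__init__()
        # boxes start as image-sized priors and are learned during training
        self.proposal_boxes = nn.Parameter(
            torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(num_proposals, 1))
        self.proposal_feats = nn.Parameter(torch.randn(num_proposals, feat_dim))

    def forward(self, batch_size):
        boxes = self.proposal_boxes.unsqueeze(0).expand(batch_size, -1, -1)
        feats = self.proposal_feats.unsqueeze(0).expand(batch_size, -1, -1)
        return boxes, feats
```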
You Only Look One-Level Feature
Read before. Important: YOLOF.
Scaled-YOLOv4: Scaling Cross Stage Partial Network
mark
End-to-End Object Detection With Fully Convolutional Network
Distillation
Distilling Object Detectors via Decoupled Features
Main idea: features derived from different regions should be assigned different importance during distillation.
**Proposed:** decoupled features (DeFeat) for learning a better student detector; a rough sketch follows below.
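A hedged sketch of how region-decoupled feature distillation could look, assuming a binary GT-box mask and foreground/background weights of my own choosing (not the paper's exact loss):

```python
import torch

def decoupled_feature_distill(feat_s, feat_t, fg_mask, w_fg=2.0, w_bg=0.5):
    """Distill foreground and background regions of a feature map with
    separate weights. `fg_mask` is an (N, 1, H, W) binary mask of GT box
    regions; the weights are illustrative, not the paper's values."""
    diff = (feat_s - feat_t) ** 2
    fg = (diff * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
    bg_mask = 1.0 - fg_mask
    bg = (diff * bg_mask).sum() / bg_mask.sum().clamp(min=1.0)
    return w_fg * fg + w_bg * bg
```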
General Instance Distillation for Object Detection
Backbone & Dataset
Transformer Interpretability Beyond Attention Visualization
For further investigation
Re-Labeling ImageNet: From Single to Multi-Labels, From Global to Localized Labels
github page: https://github.com/naver-ai/relabel_imagenet
Involution: Inverting the Inherence of Convolution for Visual Recognition
backbone: RedNet
Gaussian Context Transformer
As the title suggests: a backbone paper.
Capsule Network Is Not More Robust Than Convolutional Network
mark
How Does Topology Influence Gradient Propagation and Model Performance of Deep Networks with DenseNet-Type Skip Connections?
mark
RepVGG: Making VGG-Style ConvNets Great Again
Inference-time body: a plain stack of 3x3 convolutions and ReLU.
Training stage: multi-branch topology; a structural re-parameterization technique decouples the training and inference architectures (sketched below).
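A sketch of the re-parameterization step, assuming bias-free convs with groups=1 and a separate BN per branch (function names are mine; the real repo handles more cases):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv_weight, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv: returns (kernel, bias)."""
    std = (bn.running_var + bn.eps).sqrt()
    t = (bn.weight / std).reshape(-1, 1, 1, 1)
    return conv_weight * t, bn.bias - bn.running_mean * bn.weight / std

def reparam_repvgg_block(conv3x3, bn3, conv1x1, bn1, bn_id=None):
    """Merge the 3x3, 1x1 and identity branches (each with its own BN)
    into one equivalent 3x3 kernel + bias used at inference."""
    k3, b3 = fuse_conv_bn(conv3x3.weight, bn3)
    k1, b1 = fuse_conv_bn(conv1x1.weight, bn1)
    k1 = F.pad(k1, [1, 1, 1, 1])          # place the 1x1 at the kernel centre
    kernel, bias = k3 + k1, b3 + b1
    if bn_id is not None:                  # identity branch exists when C_in == C_out
        c = conv3x3.weight.shape[0]
        ident = torch.zeros_like(k3)
        for i in range(c):
            ident[i, i, 1, 1] = 1.0        # identity expressed as a 3x3 kernel
        ki, bi = fuse_conv_bn(ident, bn_id)
        kernel, bias = kernel + ki, bias + bi
    return kernel, bias
```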
Sewer-ML: A Multi-Label Sewer Defect Classification Dataset and Benchmark
mark
Bottleneck Transformer for Visual Recognition
mark
Approaches and Related
OTA: Optimal Transport Assignment for Object Detection
label assignment
ATSS proposes to set the positive/negative division boundary for each gt based on statistical characteristics.
Recent studies show that the predicted scores of anchors can serve as an effective reference for designing dynamic assignment strategies.
Hand-crafted anchor assignment: Min Area, Max IoU.
Point: assigning ambiguous anchors to any gt or to background may introduce harmful gradients w.r.t. other gts.
DETR: the first work to consider label assignment from a global view (Hungarian algorithm).
To achieve a globally optimal assignment in the one-to-many situation, they formulate label assignment as an **Optimal Transport (OT)** problem.
Solver: Sinkhorn-Knopp iteration ==> Optimal Transport Assignment (a minimal sketch follows).
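A minimal Sinkhorn-Knopp sketch for this kind of assignment, assuming a precomputed cost matrix with a background row; `eps` and the iteration count are illustrative, not the paper's settings:

```python
import torch

def sinkhorn_knopp(cost, supply, demand, eps=0.1, iters=50):
    """`cost` is (num_gt+1, num_anchors), last row = background;
    `supply` is the per-GT label budget, `demand` is ones per anchor.
    Returns the per-anchor assigned GT index (or background)."""
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    u = torch.ones_like(supply)
    v = torch.ones_like(demand)
    for _ in range(iters):
        u = supply / (K @ v).clamp(min=1e-8)       # row scaling
        v = demand / (K.t() @ u).clamp(min=1e-8)   # column scaling
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)     # transport plan
    return plan.argmax(dim=0)
```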
GAIA: A Transfer Learning System of Object Detection That Fits Your Needs
GAIA is capable of providing powerful pre-trained weights and selecting models that conform to downstream demands.
RankDetNet: Delving Into Ranking Constraints for Object Detection
Three ranking constraints: global ranking, class-specific ranking, and IoU-guided ranking losses.
- The global ranking loss encourages foreground samples to rank higher than background.
- The class-specific ranking loss ensures that positive samples rank higher than negative ones for each specific class.
- The IoU-guided ranking loss aims to align each pair of confidence scores with the associated pair of IoU overlap between two positive samples of a specific class.
Advantages: simple to implement; no extra computation at inference.
Improves the RetinaNet baseline by 2.5% AP on the COCO test-dev set. (A rough sketch of the global ranking term follows.)
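What the global ranking constraint could look like, assuming a pairwise hinge over all foreground/background score pairs (the margin value and exact formulation are my assumptions, not the paper's):

```python
import torch

def global_ranking_loss(fg_scores, bg_scores, margin=0.2):
    """Encourage every foreground score to exceed every background
    score by a margin, via a pairwise hinge over all fg/bg pairs."""
    # pairwise differences: shape (num_fg, num_bg)
    diff = bg_scores.unsqueeze(0) - fg_scores.unsqueeze(1) + margin
    return torch.clamp(diff, min=0).mean()
```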
AQD: Towards Accurate Quantized Object Detection
Proposes an Accurate Quantized object Detection solution, termed AQD, that fully gets rid of floating-point computation.
PANDA: Adapting Pretrained Features for Anomaly Detection and Segmentation
mark
IQDet: Instance-Wise Quality Distribution Sampling for Object Detection
- Dense Object Detector
Equalization Loss v2: A New Gradient Balance Approach for Long-Tailed Object Detection
Data-Uncertainty Guided Multi-Phase Learning for Semi-Supervised Object Detection
Adaptive Class Suppression Loss for Long-Tail Object Detection
Humble Teachers Teach Better Students for Semi-Supervised Object Detection
Unbiased Mean Teacher for Cross-Domain Object Detection
Object detection models are often vulnerable to data variance.
Proposes a new Unbiased Mean Teacher (UMT) model for cross-domain object detection.
Points As Queries: Weakly Semi-Supervised Object Detection by Points
The dataset comprises a small set of fully annotated images and a large set of images weakly annotated by points.
Proposes Point DETR: extends DETR by adding a point encoder.
When using 20% fully labeled data from COCO, the detector achieves a promising 33.3 AP.
Informative and Consistent Correspondence Mining for Cross-Domain Weakly Supervised Object Detection
Cross domain weakly supervised object detection
Just marked; not useful for me.
Beyond Bounding-Box: Convex-Hull Feature Adaptation for Oriented and Densely Packed Object Detection
Targets: oriented and densely packed objects.
proposed: Convex-hull feature adaptation (CFA) for configuring convolutional features in accordance with oriented and densely packed object layouts.
Key concept: CFA is rooted in convex-hull feature representation (CFR).
CFR defines a set of dynamically predicted feature points guided by convex IoU (CIoU).
Group Whitening: Balancing Learning Efficiency and Representation Capacity
Group whitening exploits the advantages of the whitening operation while avoiding the disadvantages of normalization within mini-batches (rough sketch below).
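A rough sketch of per-sample group whitening via ZCA over spatial positions (group count, eps, and the exact normalization details are my assumptions):

```python
import torch

def group_whitening(x, num_groups=8, eps=1e-5):
    """Split channels of each sample into groups and ZCA-whiten each
    group's features over the spatial positions."""
    n, c, h, w = x.shape
    d = c // num_groups
    xg = x.reshape(n, num_groups, d, h * w)
    xg = xg - xg.mean(dim=-1, keepdim=True)
    eye = torch.eye(d, device=x.device, dtype=x.dtype)
    cov = xg @ xg.transpose(-1, -2) / (h * w) + eps * eye
    # inverse square root of the covariance via eigendecomposition
    evals, evecs = torch.linalg.eigh(cov)
    inv_sqrt = evecs @ torch.diag_embed(evals.clamp(min=eps).rsqrt()) \
               @ evecs.transpose(-1, -2)
    return (inv_sqrt @ xg).reshape(n, c, h, w)
```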
OPANAS: One-Shot Path Aggregation Network Architecture Search for Object Detection
NAS has already been applied to searching FPN architectures; OPANAS improves both search efficiency and detection accuracy.
Search space: six heterogeneous information paths are introduced to build a novel FPN search space, where each FPN candidate is represented by a densely-connected directed acyclic graph.
Search approach: one-shot search.
Dynamic Head: Unifying Object Detection Heads With Attentions
Read before.
Scale-Aware Automatic Augmentation for Object Detection
Data augmentation.
Defines a new scale-aware search space: image-level and box-level augmentations are designed to maintain scale invariance.
Proposes a new search metric: Pareto Scale Balance.
Class-aware Robust Adversarial Training for Object Detection
RPN Prototype Alignment for Domain Adaptive Object Detector
Cross domain
A2-FPN: Attention Aggregation Based Feature Pyramid Network for Instance Segmentation
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection
Localization Quality Estimation (LQE) is crucial and popular in recent dense object detectors, since it provides accurate ranking scores that benefit NMS and improve detection performance.
The Translucent Patch: A Physical and Universal Attack on Object Detectors
Robust and Accurate Object Detection via Adversarial Learning
Data Augmentation
ICCV 2021
Amusi's repo: https://github.com/amusi/ICCV2021-Papers-with-Code
Backbone
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions
Read; two main contributions:
- PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce computations of large feature maps.
- PVT has the advantages of both CNN and Transformer.
LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference
Read; there is also a v2.
GLiT: Neural Architecture Search for Global and Local Image Transformer
- Introduced a locality module, search space is defined to let the search algo freely trade off between global and local information as well as optimizing the low-level design choice in each module.
- A hierarchical NAS method is proposed to search optimal vision transformer from two levels separately with evolutionary algo to tackle the problem caused by huge search space.
Understanding Robustness of Transformers for Image Classification
Read.
Co-Scale Conv-Attentional Image Transformers
- The co-scale mechanism maintains the integrity of the Transformer encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other.
- devised a conv-attentional mechanism by realizing position embedding formulation in the factorized attention module with an efficient convolution-like implementation.
Aggregation With Feature Detection
Visformer: The Vision-Friendly Transformer
Going Deeper with Image Transformers
CaiT
Multiscale Vision Transformers
For action recognition (video), not for common CV tasks.
Unifying Nonlocal Blocks for Neural Networks
non-local blocks are designed for capturing global spatial information in cv tasks.
Proposes an efficient and robust spectral nonlocal block, which is more robust and flexible at capturing long-range dependencies when inserted into DNNs than existing models.
Tokens-to-Token ViT: Training Vision Transformers From Scratch on ImageNet
Read; seems to be pre-training free as well (trained from scratch on ImageNet).
Incorporating Convolution Designs Into Visual Transformers
LeFF
FcaNet: Frequency Channel Attention Networks
CvT: Introducing Convolutions to Vision Transformers
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
Point
Conformer: Local Features Coupling Global Representations for Visual Recognition
AutoFormer: Searching Transformers for Visual Recognition
Scalable Vision Transformers With Hierarchical Pooling
HVT
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Visual Transformers: Where Do Transformers Really Belong in Vision Models?
Object Detection
PnP-DETR: Towards Efficient Visual Analysis With Transformers
Model compression for DETR.
Knowledge Mining and Transferring for Domain Adaptive Object Detection
Knowledge Transfer Network
Conditional DETR for Fast Training Convergence
6.7x faster convergence for R50 and R101, 10x faster for DC5-R50 and DC5-R101.
Towards Rotation Invariance in Object Detection
Multi-Source Domain Adaptation for Object Detection
?
Rethinking Transformer-Based Set Prediction for Object Detection
- Transformer-based Set Prediction with FCOS
- Transformer-based Set Prediction with RCNN
Dynamic DETR: End-to-End Object Detection With Dynamic Attention
CrossDet: Crossline Representation for Object Detection
- CrossDet uses a set of growing cross lines along horizontal and vertical axes as object representations. An object can be flexibly represented as cross lines in different combinations.
Fast Convergence of DETR With Spatially Modulated Co-Attention
- Attention mechanism: SMCA
GraphFPN: Graph Feature Pyramid Network for Object Detection
TOOD: Task-Aligned One-Stage Object Detection
WB-DETR: Transformer-Based Detector Without Backbone
- without backbone in normal DETR
- LIE-T2T: a local information enhancement tokens-to-token module that enhances the internal information of tokens after unfolding.
Distillation
G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-Guided Feature Imitation
knowledge distillation
Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better
Proposes a novel adversarial robustness distillation method, Robust Soft Label Adversarial Distillation (RSLAD), to train small student models.
Other Tricks
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
Uses two techniques to enhance ViT's ability to encode high-resolution images:
- multi-scale structure
- attention mechanism of Vision Longformer
Bit-Mixer: Mixed-Precision Networks With Runtime Bit-Width Selection
mixed precision computation