Related papers: TVConv: Efficient Translation Variant Convolution …
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions where features have similar representation. In this way, DRConv outperforms…
Spatial convolutions are extensively used in numerous deep video models. It fundamentally assumes spatio-temporal invariance, i.e., using shared weights for every location in different frames. This work presents Temporally-Adaptive…
Spatial convolutions are widely used in numerous deep video models. It fundamentally assumes spatio-temporal invariance, i.e., using shared weights for every location in different frames. This work presents Temporally-Adaptive Convolutions…
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in…
Convolutional Neural Networks (CNNs) have achieved remarkable success in various computer vision tasks but rely on tremendous computational cost. To solve this problem, existing approaches either compress well-trained large-scale models or…
Convolutional neural networks have witnessed remarkable improvements in computational efficiency in recent years. A key driving force has been the idea of trading-off model expressivity and efficiency through a combination of $1\times 1$…
Deep 3-dimensional (3D) Convolutional Network (ConvNet) has shown promising performance on video recognition tasks because of its powerful spatio-temporal information fusion ability. However, the extremely intensive requirements on memory…
The convolution operation is a powerful tool for feature extraction and plays a prominent role in the field of computer vision. However, when targeting the pixel-wise tasks like image fusion, it would not fully perceive the particularity of…
Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability. \textcolor{blue}{However, its large model size and weak local feature modeling ability hinder its application in real…
Recently, Transformers have emerged as the go-to architecture for both vision and language modeling tasks, but their computational efficiency is limited by the length of the input sequence. To address this, several efficient variants of…
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth,…
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger…
The channel redundancy in feature maps of convolutional neural networks (CNNs) results in the large consumption of memories and computational resources. In this work, we design a novel Slim Convolution (SlimConv) module to boost the…
In this paper, we are committed to establishing an unified and end-to-end multi-modal network via exploring the language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called…
The hybrid architecture of convolution neural networks (CNN) and Transformer has been the most popular method for medical image segmentation. However, the existing networks based on the hybrid architecture suffer from two problems. First,…
Light-weight convolutional neural networks (CNNs) suffer performance degradation as their low computational budgets constrain both the depth (number of convolution layers) and the width (number of channels) of CNNs, resulting in limited…
Vision Transformers (ViT) become widely-adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT,…
Recent advances in vision transformers (ViTs) have demonstrated the advantage of global modeling capabilities, prompting widespread integration of large-kernel convolutions for enlarging the effective receptive field (ERF). However, the…
Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive achievements in detail recovery, they still fall short…
Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant…