Spatio-channel Attention Blocks for Cross-modal Crowd Counting

Youjia Zhang, Soyun Choi, and Sungeun Hong

Inha University

ACCV 2022 Oral

Abstract

Crowd counting research has made significant advancements in real-world applications, but it remains a formidable challenge in cross-modal settings. Most existing methods rely solely on the optical features of RGB images, ignoring the potential of other modalities such as thermal and depth images. The inherent differences between modalities and the diversity of design choices for model architectures make cross-modal crowd counting more challenging. In this paper, we propose Cross-modal Spatio-Channel Attention (CSCA) blocks, which can be easily integrated into any modality-specific architecture. The CSCA blocks first spatially capture global functional correlations across modalities with less overhead through spatial-wise cross-modal attention. Cross-modal features with spatial attention are subsequently refined through adaptive channel-wise feature aggregation. In our experiments, the proposed block consistently yields significant performance improvements across various backbone networks, resulting in state-of-the-art results in RGB-T and RGB-D crowd counting.

Motivation

Visualization of RGB-T and RGB-D pairs. (a) shows the positive effect of thermal images under extremely poor illumination. (b) shows the negative effect caused by other heat-emitting objects. In scene (c), the depth image provides additional information about the position and size of heads. However, in scene (d), part of the head information in the depth image is corrupted by cluttered background noise.

Framework Overview

Given a pair of inputs from two modalities (an RGB image and a thermal or depth image), our goal is to estimate the crowd count by exploiting both. CSCA blocks are integrated into a modality-specific architecture. Each block first applies spatial-wise cross-modal attention to capture global correlations across the modalities, and then refines the attended features through adaptive channel-wise feature aggregation.
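The two steps named in the abstract, spatial-wise cross-modal attention followed by channel-wise feature aggregation, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it omits all learned projections and the overhead-reduction mechanism, and the function names (`cross_attention`, `channel_aggregate`, `csca_block`), the parameter-free softmax gate, and the symmetric application to both branches are assumptions made for illustration only. A real model would use a tensor library and trainable layers.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_feats, context_feats):
    """Spatial-wise cross-modal attention (simplified, no learned projections).

    Both inputs are lists of N position vectors with C channels each.
    Every query position attends over all positions of the other modality.
    """
    channels = len(query_feats[0])
    scale = math.sqrt(channels)
    attended = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context_feats]
        weights = softmax(scores)
        attended.append([
            sum(w * k[c] for w, k in zip(weights, context_feats))
            for c in range(channels)
        ])
    return attended

def channel_aggregate(own_feats, attended_feats):
    """Channel-wise aggregation with a hypothetical parameter-free gate.

    For each channel, the two sources are weighted by a softmax over their
    global average activations, then summed.
    """
    n = len(own_feats)
    channels = len(own_feats[0])
    gates = []
    for c in range(channels):
        own_mean = sum(p[c] for p in own_feats) / n
        att_mean = sum(p[c] for p in attended_feats) / n
        gates.append(softmax([own_mean, att_mean]))
    return [
        [gates[c][0] * own[c] + gates[c][1] * att[c] for c in range(channels)]
        for own, att in zip(own_feats, attended_feats)
    ]

def csca_block(rgb_feats, aux_feats):
    """Apply both steps to each modality branch (symmetry is an assumption)."""
    rgb_out = channel_aggregate(rgb_feats, cross_attention(rgb_feats, aux_feats))
    aux_out = channel_aggregate(aux_feats, cross_attention(aux_feats, rgb_feats))
    return rgb_out, aux_out

# Toy example: 3 spatial positions, 2 channels per modality.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
thermal = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
rgb_out, thermal_out = csca_block(rgb, thermal)
print(len(rgb_out), len(rgb_out[0]))
```

Because the attention weights and the channel gates are both softmax outputs, each fused value is a convex combination of a branch's own feature and the feature attended from the other modality, so the output keeps the input shape and can be passed to the next backbone stage.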

Effectiveness of CSCA

Comparison with SOTA Methods

Qualitative Results

BibTeX

@inproceedings{zhang2022spatio,
  title={Spatio-channel Attention Blocks for Cross-modal Crowd Counting},
  author={Zhang, Youjia and Choi, Soyun and Hong, Sungeun},
  booktitle={Proceedings of the Asian Conference on Computer Vision},
  pages={90--107},
  year={2022}
}
