RS-CLIP: zero shot remote sensing scene classification via contrastive vision-language supervision

Li, X. (ORCID: https://orcid.org/0000-0002-9946-7000), Wen, C., Hu, Y. and Zhou, N. (2023) RS-CLIP: zero shot remote sensing scene classification via contrastive vision-language supervision. International Journal of Applied Earth Observation and Geoinformation, 124, 103497. ISSN 1872-826X



DOI: 10.1016/j.jag.2023.103497

Abstract/Summary

Zero-shot remote sensing scene classification aims to classify scenes from unseen categories and has attracted considerable research attention in the remote sensing field. Existing methods mostly use shallow networks for visual and semantic feature learning, and the semantic encoder networks are usually fixed during zero-shot learning, so they fail to capture powerful feature representations for classification. In this work, we introduce a vision-language model for remote sensing scene classification based on contrastive vision-language supervision. Our method learns semantic-aware visual representations using a contrastive vision-language loss in the embedding space. By pretraining on large-scale image–text datasets, our baseline method shows good transfer ability to remote sensing scenes. To enable model training in zero-shot settings, we introduce a pseudo-labeling technique that automatically generates pseudo labels from unlabeled data. A curriculum learning strategy is developed to boost zero-shot remote sensing scene classification through multiple stages of model finetuning. We conducted experiments on four benchmark datasets and observed considerable performance improvements on both zero-shot and few-shot remote sensing scene classification. The proposed RS-CLIP method achieves zero-shot classification accuracies of 95.94%, 95.97%, 85.76%, and 87.52% on the novel classes of the UCM-21, WHU-RS19, NWPU-RESISC45, and AID-30 datasets, respectively. Our code will be released at https://github.com/lx709/RS-CLIP.
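To make the approach concrete, the sketch below illustrates the two core ideas the abstract describes: CLIP-style zero-shot classification of a scene against text prompts, and confidence-based pseudo-labeling over unlabeled images. This is a minimal illustration, not the authors' RS-CLIP implementation (their code is at https://github.com/lx709/RS-CLIP); the checkpoint, class names, prompt template, and confidence threshold are all assumptions.

    # Hypothetical sketch: CLIP-style zero-shot scene classification plus
    # confidence-based pseudo-labeling. Not the authors' code; the checkpoint,
    # class names, prompt template, and threshold below are assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    # Assumed remote sensing class names (a subset in the spirit of UCM-21).
    class_names = ["airplane", "beach", "forest", "harbor", "river"]
    prompts = [f"a satellite photo of a {name}" for name in class_names]

    @torch.no_grad()
    def classify(image: Image.Image) -> torch.Tensor:
        # Embed the image and all class prompts, then softmax the
        # image-to-text similarity logits into class probabilities.
        inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
        logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
        return logits.softmax(dim=-1).squeeze(0)

    def pseudo_label(images, threshold=0.9):
        # Keep only high-confidence predictions as pseudo labels; the
        # selected (image, class index) pairs would then serve as training
        # data for the next finetuning stage of a curriculum.
        selected = []
        for img in images:
            probs = classify(img)
            conf, idx = probs.max(dim=-1)
            if conf.item() >= threshold:
                selected.append((img, idx.item()))
        return selected

In this reading of the abstract, each curriculum stage would finetune on the pseudo labels accepted so far and then re-label the remaining unlabeled pool, though the staging details are the paper's, not shown here.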

Item Type: Article
Refereed: Yes
Divisions: No Reading authors. Back catalogue items
ID Code: 119820
Publisher: Elsevier
