RS-CLIP: zero shot remote sensing scene classification via contrastive vision-language supervision

Li, X. (ORCID: https://orcid.org/0000-0002-9946-7000), Wen, C., Hu, Y. and Zhou, N. (2023) RS-CLIP: zero shot remote sensing scene classification via contrastive vision-language supervision. International Journal of Applied Earth Observation and Geoinformation, 124, 103497. ISSN 1872-826X



DOI: 10.1016/j.jag.2023.103497

Abstract/Summary

Zero-shot remote sensing scene classification aims to classify scenes from unseen categories and has attracted considerable research attention in the remote sensing field. Existing methods mostly use shallow networks for visual and semantic feature learning, and the semantic encoder networks are usually fixed during zero-shot learning, so they fail to capture powerful feature representations for classification. In this work, we introduce a vision-language model for remote sensing scene classification based on contrastive vision-language supervision. Our method learns semantic-aware visual representations using a contrastive vision-language loss in the embedding space. By pretraining on large-scale image–text datasets, our baseline method shows good transfer ability to remote sensing scenes. To enable model training in zero-shot settings, we introduce a pseudo-labeling technique that automatically generates pseudo labels from unlabeled data. A curriculum learning strategy is developed to boost zero-shot remote sensing scene classification through multiple stages of model finetuning. We conducted experiments on four benchmark datasets and observed considerable performance improvements on both zero-shot and few-shot remote sensing scene classification. The proposed RS-CLIP method achieves zero-shot classification accuracies of 95.94%, 95.97%, 85.76%, and 87.52% on the novel classes of the UCM-21, WHU-RS19, NWPU-RESISC45, and AID-30 datasets, respectively. Our code will be released at https://github.com/lx709/RS-CLIP.
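To make the approach concrete, the sketch below illustrates the two core ideas the abstract describes: CLIP-style zero-shot classification of a scene against text prompts, and confidence-based pseudo-labeling over unlabeled images. This is a minimal illustration, not the authors' RS-CLIP implementation (their code is at https://github.com/lx709/RS-CLIP); the checkpoint, class names, prompt template, and confidence threshold are all assumptions.

    # Hypothetical sketch: CLIP-style zero-shot scene classification plus
    # confidence-based pseudo-labeling. Not the authors' code; the checkpoint,
    # class names, prompt template, and threshold below are assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    # Assumed remote sensing class names (a subset in the spirit of UCM-21).
    class_names = ["airplane", "beach", "forest", "harbor", "river"]
    prompts = [f"a satellite photo of a {name}" for name in class_names]

    @torch.no_grad()
    def classify(image: Image.Image) -> torch.Tensor:
        # Embed the image and all class prompts, then softmax the
        # image-to-text similarity logits into class probabilities.
        inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
        logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
        return logits.softmax(dim=-1).squeeze(0)

    def pseudo_label(images, threshold=0.9):
        # Keep only high-confidence predictions as pseudo labels; the
        # selected (image, class index) pairs would then serve as training
        # data for the next finetuning stage of a curriculum.
        selected = []
        for img in images:
            probs = classify(img)
            conf, idx = probs.max(dim=-1)
            if conf.item() >= threshold:
                selected.append((img, idx.item()))
        return selected

In this reading of the abstract, each curriculum stage would finetune on the pseudo labels accepted so far and then re-label the remaining unlabeled pool, though the staging details are the paper's, not shown here.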

Item Type: Article
Refereed: Yes
Divisions: No Reading authors. Back catalogue items
ID Code: 119820
Publisher: Elsevier
