Vision-language models in remote sensing: current progress and future trends

Li, Xiang; Wen, Congcong; Hu, Yuan; Yuan, Zhenghang; Zhu, Xiao Xiang

Full text not archived in this repository.

Li, X. ORCID: https://orcid.org/0000-0002-9946-7000, Wen, C., Hu, Y., Yuan, Z. and Zhu, X. X. (2024) Vision-language models in remote sensing: current progress and future trends. IEEE Geoscience and Remote Sensing Magazine, 12 (2). pp. 32-66. ISSN 2168-6831 doi: 10.1109/MGRS.2024.3383473

Abstract/Summary

The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4) have sparked a wave of interest and research in the field of large language models (LLMs) for artificial general intelligence (AGI). These models provide intelligent solutions that are closer to human thinking, enabling us to use general artificial intelligence (AI) to solve problems in various applications. However, in the field of remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in RS focuses primarily on visual-understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models (VLMs) excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. VLMs can go beyond visual recognition of RS images and can model semantic relationships as well as generate natural language descriptions of the image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning and visual question answering (VQA). This article provides a comprehensive review of the research on VLMs in RS, summarizing the latest progress, highlighting current challenges, and identifying potential research opportunities. Specifically, we review the application of VLMs in mainstream RS tasks, including image captioning, text-based image generation, text-based image retrieval (TBIR), VQA, scene classification, semantic segmentation, and object detection. For each task, we analyze representative works and discuss research progress. Finally, we summarize the limitations of existing works and provide possible directions for future development. This review aims to provide a comprehensive overview of the current research progress of VLMs in RS (see Figure 1), and to inspire further research in this exciting and promising field.

Item Type	Article
URI	https://centaur.reading.ac.uk/id/eprint/119819
Identification Number/DOI	10.1109/MGRS.2024.3383473
Refereed	Yes
Divisions	No Reading authors. Back catalogue items
Publisher	IEEE
Date Deposited	18 Dec 2024 14:31
Last Modified	08 Jun 2025 03:19