Vision-language models in remote sensing: current progress and future trendsLi, X. ORCID: https://orcid.org/0000-0002-9946-7000, Wen, C., Hu, Y., Yuan, Z. and Zhu, X. X. (2024) Vision-language models in remote sensing: current progress and future trends. IEEE Geoscience and Remote Sensing Magazine, 12 (2). pp. 32-66. ISSN 2168-6831 Full text not archived in this repository. It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing. To link to this item DOI: 10.1109/MGRS.2024.3383473 Abstract/SummaryThe remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4) have sparked a wave of interest and research in the field of large language models (LLMs) for artificial general intelligence (AGI). These models provide intelligent solutions that are closer to human thinking, enabling us to use general artificial intelligence (AI) to solve problems in various applications. However, in the field of remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in RS focuses primarily on visual-understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-LMs (VLMs) excel as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. VLMs can go beyond visual recognition of RS images and can model semantic relationships as well as generate natural language descriptions of the image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning and visual question answering (VQA). This article provides a comprehensive review of the research on VLMs in RS, summarizing the latest progress, highlighting current challenges, and identifying potential research opportunities. Specifically, we review the application of VLMs in mainstream RS tasks, including image captioning, text-based image generation, text-based image retrieval (TBIR), VQA, scene classification, semantic segmentation, and object detection. For each task, we analyze representative works and discuss research progress. Finally, we summarize the limitations of existing works and provide possible directions for future development. This review aims to provide a comprehensive overview of the current research progress of VLMs in RS (see Figure 1 ), and to inspire further research in this exciting and promising field.
Altmetric Deposit Details University Staff: Request a correction | Centaur Editors: Update this record |