Abstract: Standard attention mechanisms generate only coarse-grained attention regions, failing to capture the geographical relationships between remote sensing objects and leaving the semantic content of remote sensing images underexploited. To address these limitations, a structured image description network named GRSRC (geo-object relational segmentation for remote sensing image captioning) is proposed. Firstly, considering the highly structured nature of remote sensing image features, a feature extraction method based on structured semantic segmentation of remote sensing images is introduced, strengthening the encoder's feature extraction capability for more accurate representation; an attention mechanism then weights the segmented regions so that the model focuses on the most important semantic information. Secondly, exploiting the well-defined spatial relationships among objects in remote sensing images, geographical spatial relations are integrated into the attention mechanism, yielding more accurate and spatially consistent descriptions. Finally, experiments are conducted on three publicly available remote sensing datasets: RSICD, UCM, and Sydney. On the UCM dataset, the model achieves a BLEU-1 of 84.06, METEOR of 44.35, and ROUGE_L of 77.01, improvements of 2.32%, 1.15%, and 1.88%, respectively, over classical models. The experimental results indicate that the model better leverages the semantic content of remote sensing images, demonstrating its superior performance on remote sensing image captioning tasks.
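To make the region-weighting idea sketched in the abstract concrete, the following is a minimal illustrative example of attention over segmented region features with an additive geographic-relation bias. It is a sketch under stated assumptions, not the GRSRC implementation: the module name, tensor shapes, and the 4-dimensional pairwise geometric encoding are all hypothetical.

```python
# Illustrative sketch only: attention over segmented region features with a
# geographic-relation bias. Names, shapes, and the bias form are assumptions,
# not the paper's actual GRSRC architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeoRegionAttention(nn.Module):
    def __init__(self, region_dim: int, hidden_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)  # decoder state -> query
        self.k_proj = nn.Linear(region_dim, hidden_dim)  # region feature -> key
        self.rel_proj = nn.Linear(4, 1)                  # assumed 4-d geo relation -> scalar bias

    def forward(self, regions, decoder_state, geo_rel):
        # regions:       (B, R, region_dim)  features of R segmented regions
        # decoder_state: (B, hidden_dim)     current decoder hidden state
        # geo_rel:       (B, R, 4)           assumed geometric relation of each region
        #                                    to a reference (e.g. normalized offsets, scale)
        q = self.q_proj(decoder_state).unsqueeze(1)            # (B, 1, H)
        k = self.k_proj(regions)                               # (B, R, H)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5           # scaled dot-product scores (B, R)
        scores = scores + self.rel_proj(geo_rel).squeeze(-1)   # add geographic-relation bias
        alpha = F.softmax(scores, dim=-1)                      # attention weights over regions
        context = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)  # weighted region context (B, region_dim)
        return context, alpha

# Usage with random tensors (batch of 2, 16 segmented regions):
attn = GeoRegionAttention(region_dim=512, hidden_dim=512)
ctx, weights = attn(torch.randn(2, 16, 512), torch.randn(2, 512), torch.randn(2, 16, 4))
```

In this sketch the geographic relation enters as an additive bias on the attention scores, so regions that are semantically relevant and spatially consistent with the reference receive higher weight; other ways of injecting spatial relations (e.g. relative-position embeddings) would fit the same interface.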