Abstract: With the rapid development of drone aerial photography, demand has grown for precise recognition of targets such as infrastructure in low-altitude scenarios. However, traditional object detection and semantic segmentation methods still fall short in delineating boundaries and in distinguishing between similar instances. To address these issues, this paper proposes MFFISNet, an improved multimodal feature fusion instance segmentation network that aims to improve the precision and robustness of target segmentation in drone remote sensing images. The method makes three main contributions. First, a dual-path input structure uses both RGB images and DSM information to enrich the multimodal feature representation. Second, HWF-LM and DMSCA are introduced in the DSM branch to strengthen the model’s ability to represent elevation and structural information. Third, an FGCA mechanism is proposed to fuse cross-modal features efficiently, improving instance segmentation accuracy in complex scenarios. On the Drone-OrthoSeg dataset, MFFISNet achieved a bounding box mAP of 42.63% and a mask mAP of 42.69%; on the NWPU VHR-10 dataset, it achieved 77.86% and 72.59%, respectively; and on the foggy maritime scenes of the FoggyShipInsseg dataset, it achieved 63.86% and 59.69%, respectively. The experimental results indicate that the proposed method outperforms existing advanced methods in both accuracy and robustness, providing efficient and reliable technical support for the automated detection and measurement of infrastructure in low-altitude scenarios.