Abstract: Existing multimodal object detection methods are often limited in modality selection, spatial modeling, and cross-modal consistency, particularly under challenging conditions such as low illumination, target occlusion, and complex backgrounds. To address these issues, this paper proposes a UAV object detection method based on modality-guided selection and adaptive contrastive learning. First, a modality-guided selection module is designed, which employs global semantic-aware channel attention to dynamically evaluate the contribution of each modality and fuse features adaptively, thereby addressing the modality imbalance inherent in conventional fixed-weight fusion strategies. Second, a modality enhancement module is introduced, which incorporates locally and globally coordinated relative positional biases and normalized residual connections into a single-branch self-attention structure to enhance spatial perception in complex and occluded scenes. Finally, a detection-aware adaptive cross-modal contrastive learning strategy is proposed, which uses detection boxes together with modality-adaptive temperature scaling to explicitly align multimodal features and improve semantic consistency. Experimental results show that the proposed method achieves an mAP50 of 78.6% on the DroneVehicle dataset and 98.3% on the LLVIP dataset, outperforming existing approaches. Deployed on a real UAV platform, the method runs at 12.61 FPS, validating both the accuracy and the practical utility of the framework.
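To make the cross-modal alignment idea more concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss over box-level features, with a separate temperature for each modality. It is an illustration under stated assumptions, not the formulation used in the paper: the function names, the fixed temperature values `tau_rgb` and `tau_ir`, and the assumption that the i-th RGB feature and the i-th infrared feature were pooled from the same detection box are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (epsilon guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]


def directional_info_nce(anchors, targets, tau):
    """InfoNCE loss: anchors[i] should match targets[i] and repel targets[j != i]."""
    losses = []
    for i, a in enumerate(anchors):
        # Cosine similarities (inputs are unit vectors) scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, t)) / tau for t in targets]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def cross_modal_contrastive_loss(rgb_feats, ir_feats, tau_rgb=0.1, tau_ir=0.2):
    """Symmetric loss over box-level features from two modalities.

    Assumption: rgb_feats[i] and ir_feats[i] come from the same detection box.
    Assumption: the per-modality temperatures are fixed hypothetical values here;
    an adaptive scheme would compute them from the data.
    """
    rgb = [l2_normalize(f) for f in rgb_feats]
    ir = [l2_normalize(f) for f in ir_feats]
    rgb_to_ir = directional_info_nce(rgb, ir, tau_rgb)
    ir_to_rgb = directional_info_nce(ir, rgb, tau_ir)
    return 0.5 * (rgb_to_ir + ir_to_rgb)


# Usage: features that agree across modalities yield a lower loss than mismatched ones.
boxes_rgb = [[1.0, 0.0], [0.0, 1.0]]
boxes_ir_aligned = [[1.0, 0.0], [0.0, 1.0]]
boxes_ir_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = cross_modal_contrastive_loss(boxes_rgb, boxes_ir_aligned)
loss_swapped = cross_modal_contrastive_loss(boxes_rgb, boxes_ir_swapped)
```

In this sketch a smaller temperature sharpens the softmax for that direction, so the modality with the smaller value is penalized more strongly for confusing one box with another.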