Abstract: To address the high computational complexity of the single shot multibox detector (SSD) model and its poor robustness to small and occluded targets, an improved SSD-based infrared human target detection method is proposed to meet the real-time requirements of intelligent surveillance. First, MobileNet V2 replaces the Visual Geometry Group 16 (VGG16) network traditionally used in SSD as the backbone feature extraction network, reducing computational cost through depthwise separable convolutions. Then, a feature pyramid network (FPN) structure is introduced to fuse multi-scale features and strengthen the representational ability of shallow features. Finally, the squeeze-and-excitation (SE) channel attention mechanism is incorporated to dynamically learn channel weights, so that the model emphasizes the most informative channels. Experimental results on the self-built IR-HD dataset show that the improved SSD model raises detection accuracy by 1.3% in AP@0.5 and 14.3% in AP@0.75, while inference speed increases by 3.835 fps. These results indicate that the method, by combining lightweight design, feature fusion, and channel attention, significantly improves both detection accuracy and real-time performance and remains robust in infrared small-target and occlusion scenarios.
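The cost reduction attributed to depthwise separable convolutions can be illustrated with a rough multiply-add count. The following is a minimal sketch, not the authors' implementation: the function names and the layer dimensions in the usage lines are hypothetical, biases are ignored, and the inverted-residual expansion layers of MobileNet V2 are left out so that only the factorization itself is compared with a standard convolution.

```python
def standard_conv_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k standard convolution with an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def cost_ratio(k, c_in, c_out, h, w):
    """Separable cost divided by standard cost; algebraically 1/c_out + 1/k**2."""
    return depthwise_separable_madds(k, c_in, c_out, h, w) / standard_conv_madds(
        k, c_in, c_out, h, w
    )


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 32 -> 64 channels, 38 x 38 output map.
    dims = (3, 32, 64, 38, 38)
    print("standard conv multiply-adds:", standard_conv_madds(*dims))
    print("separable conv multiply-adds:", depthwise_separable_madds(*dims))
    print("cost ratio:", round(cost_ratio(*dims), 4))
```

For a 3 x 3 kernel the ratio approaches 1/9 as the number of output channels grows, which is the source of the savings when a VGG-style backbone is replaced; the end-to-end speedup of a complete detector also depends on memory access and the detection heads.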