Abstract: Convolutional neural networks (CNNs) often fall short when detecting dispersed and fine defects on metal surfaces: their limited ability to capture global features leads to missed detections and loss of detail for defects such as oxidation particles, cracks, and scratches. Transformers can capture comprehensive global information, but at a high computational cost. In pursuit of an efficient and accurate method for metal surface defect detection, this study introduces a novel network architecture, the DPG-Transformer, which combines the local feature extraction capabilities of CNNs with the global modeling strengths of Transformers. This integration is achieved through depthwise separable convolutions (DW-Conv) and pooling grid window attention mechanisms (PGW-Attention). The DPG-Transformer was validated on a proprietary metal defect dataset (ST-DET) and a public dataset (NEU-CLS), achieving defect detection accuracies of 99.3% and 99.6%, respectively, and outperforming several classic networks in accuracy, computational efficiency, and floating-point operations. Visualization experiments further showed that the DPG-Transformer extracts defect features associated with corrosion and scaling more comprehensively than CNN models, and focuses more precisely on the global features of elongated cracks and scratches than Transformer models. These results indicate that the DPG-Transformer reduces computational load and complexity while detecting metal surface defects more comprehensively and precisely, making it well suited to practical metal surface defect detection.
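To illustrate why depthwise separable convolutions can reduce computational load, the following minimal sketch compares the weight count and multiply-accumulate operations (MACs) of a standard convolution with those of a depthwise convolution followed by a 1x1 pointwise convolution. The layer sizes (64 input channels, 128 output channels, a 3x3 kernel, a 56x56 output map) and the function names are illustrative assumptions, not values or code from the DPG-Transformer; biases are ignored.

```python
# Hypothetical layer sizes chosen for illustration only; they are not
# taken from the DPG-Transformer architecture.

def conv_cost(c_in, c_out, k, h_out, w_out):
    """Weights and MACs of a standard k x k convolution (no bias)."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # every weight is used once per output position
    return params, macs


def dw_separable_cost(c_in, c_out, k, h_out, w_out):
    """Weights and MACs of a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = depthwise + pointwise
    macs = params * h_out * w_out
    return params, macs


std_params, std_macs = conv_cost(64, 128, 3, 56, 56)
sep_params, sep_macs = dw_separable_cost(64, 128, 3, 56, 56)
ratio = sep_params / std_params  # equals 1/c_out + 1/k**2

print(f"standard:  {std_params} weights, {std_macs} MACs")
print(f"separable: {sep_params} weights, {sep_macs} MACs")
print(f"cost ratio: {ratio:.4f}")
```

Under these assumed sizes the separable layer needs roughly 12% of the weights and MACs of the standard layer, which matches the general ratio 1/C_out + 1/k^2 for this factorization.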