Abstract: To address the heavy stacking of scrap steel samples and the need for fine-grained classification of scrap steel types, this paper proposes a scrap steel image classification method based on cross-layer fusion of semantic-enhanced features. The method proceeds in several stages, each aimed at improving the accuracy and efficiency of scrap steel classification. First, motion detection extracts scrap steel images free of moving objects such as grapples from video sequences, so that the dataset excludes irrelevant objects and gives a cleaner foundation for subsequent analysis. Next, the state-of-the-art vision model Segment Anything Model (SAM) is applied to these images to segment the individual scrap steel instances. The core contribution of this paper is a scrap steel image classification model, EfficientNetB5-CLFSEF, designed to handle the subtle differences between scrap steel categories and the large morphological variation within each category. The model uses EfficientNetB5 as the feature extractor, chosen for its efficiency and strong performance in visual recognition tasks, and integrates a novel cross-layer fusion of semantic-enhanced features (CLFSEF) module to improve classification accuracy. The CLFSEF module consists of two key components: cross-layer feature fusion (CLF) and semantic-enhanced features (SEF). CLF fuses features from different layers of the EfficientNetB5 feature extractor, enabling the model to capture both deep semantic information and low-level details such as boundaries, which is crucial for distinguishing similar scrap steel categories. The SEF module, in turn, groups the fused features according to the semantic similarity between channels.
This grouping enables the model to focus on the most discriminative features in the image. The SEF module also integrates knowledge distillation and maximum entropy regularization to enhance the model’s ability to recognize the most salient parts of the input scrap steel images. To validate the proposed method, experiments were conducted on a custom-built dataset for scrap steel classification. The baseline EfficientNetB5 achieved an accuracy of 87.98% on the test set. Introducing the CLF module raised the accuracy to 89.63%, and adding the SEF module gave 89.23%; combining CLF and SEF into the complete CLFSEF module raised it to 90.51%. Relative to the baseline, these are gains of 1.65, 1.25, and 2.53 percentage points, respectively. The proposed model also outperforms the comparison classification models.
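
The reported improvements are absolute differences in test-set accuracy, that is, percentage points rather than relative percentages. The short check below recomputes them from the four accuracies quoted above. The names `BASELINE`, `VARIANTS`, and `gains_over_baseline` are hypothetical and introduced only for illustration; they are not part of the proposed method.

```python
from decimal import Decimal

# Test-set accuracies (%) exactly as quoted in the abstract.
# Decimal avoids binary floating-point rounding in the subtraction.
BASELINE = Decimal("87.98")  # EfficientNetB5 without CLF or SEF
VARIANTS = {
    "CLF": Decimal("89.63"),
    "SEF": Decimal("89.23"),
    "CLFSEF": Decimal("90.51"),
}


def gains_over_baseline(baseline, variants):
    """Return the absolute accuracy gain (percentage points) per variant."""
    return {name: acc - baseline for name, acc in variants.items()}


gains = gains_over_baseline(BASELINE, VARIANTS)
for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

The computed differences match the stated gains of 1.65, 1.25, and 2.53.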