Abstract: Traditional fault diagnosis models struggle to extract spatial and temporal features from gearbox vibration signals jointly, and they are not robust in noisy environments. To address these limitations, this paper proposes an intelligent fault diagnosis method that combines a receptive-field attention residual network (ResNet-RFA) with a bidirectional gated recurrent unit with self-attention (BiGRU-SATT). Raw vibration signals are first converted into time-frequency images via the short-time Fourier transform (STFT), while the original 1D time series is retained. A dual-channel network is then constructed: one channel uses ResNet-RFA to extract key spatial features from the time-frequency images, and the other uses BiGRU-SATT to capture temporal dependencies. The resulting spatiotemporal features are merged and fed into a fully connected layer for classification. In experiments, the method achieves 100% accuracy, outperforming the comparison models (Transformer: 91%, Mamba: 96%, SVM: 94%, DBN: 89%), and remains robust under mixed Gaussian-impulse noise at 10 and 20 dB. These results indicate that fusing ResNet-RFA and BiGRU-SATT effectively mines the spatiotemporal features of the signals, making the method suitable for complex industrial environments.
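
The preprocessing step described in the abstract (an STFT image for one channel, the untouched 1D series for the other) can be illustrated with a minimal NumPy sketch. The abstract does not give the window type, frame length, hop size, or sampling rate, so the Hann window, `frame_len=256`, `hop=128`, the 2048 Hz rate, and the function names `stft_magnitude` and `make_dual_inputs` are all assumptions chosen for illustration, not the authors' settings.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Return the STFT magnitude with shape (frame_len // 2 + 1, n_frames).

    Assumed settings (not from the paper): Hann window, no padding.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided FFT per frame; transpose so rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


def make_dual_inputs(signal, frame_len=256, hop=128):
    """Build the two network inputs: a time-frequency image and the raw series."""
    image = stft_magnitude(signal, frame_len, hop)  # input for the image channel
    sequence = np.asarray(signal, dtype=float)      # input for the sequence channel
    return image, sequence


# Synthetic stand-in for a vibration record: a 128 Hz tone sampled at 2048 Hz.
fs = 2048
t = np.arange(fs) / fs
demo_signal = np.sin(2 * np.pi * 128 * t)
image, sequence = make_dual_inputs(demo_signal)
```

In this sketch `image` would go to the ResNet-RFA channel and `sequence` to the BiGRU-SATT channel. In practice the magnitude is often log-scaled and resized to a fixed image size before entering a convolutional network, but the abstract does not say whether the authors do so.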