Abstract: In recent years, transformer-based methods have achieved remarkable progress in 3D human pose estimation. However, current approaches still face two major challenges. First, the quadratic complexity of global self-attention over large joint affinity matrices in video sequences leads to computational inefficiency, which significantly hampers real-time performance. Second, suboptimal spatiotemporal feature fusion limits the model's ability to capture fine-grained motion patterns and structural dependencies between joints, leading to less accurate pose estimates. To address these limitations, this paper proposes a novel architecture, the dual-stage spatio-temporal convolutional transformer (DSTCFormer). The key innovation of DSTCFormer is the decoupling of spatiotemporal feature learning into parallel spatial and temporal pathways. Specifically, a convolutional multi-scale attention (CMSA) module is introduced to hierarchically aggregate local and global correlations through convolution-enhanced multi-head attention. In the spatial pathway, convolutional position embeddings encode skeletal topology, enabling the model to focus on intra-frame joint relationships, while the temporal pathway captures inter-frame motion coherence via axial-specific self-attention. Moreover, a cross-stage fusion mechanism integrates multi-scale spatiotemporal features through depthwise separable convolutions and feature transformation layers, ensuring efficient computation and robust feature representation. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate the superiority of DSTCFormer. Under Protocol 1 (P1), DSTCFormer achieves a state-of-the-art mean per joint position error (MPJPE) of 40.1 mm on Human3.6M with 243 input frames, outperforming PoseFormer (44.3 mm), MixSTE (40.9 mm), and STCFormer (40.5 mm). On the MPI-INF-3DHP dataset, it attains a percentage of correct keypoints at 150 mm (PCK@150mm) of 99.1% and an area under the curve (AUC) of 85.2%, surpassing existing methods by 0.4% and 1.3%, respectively. In summary, the proposed method not only advances theoretical frameworks for spatiotemporal modeling but also offers practical benefits for real-time applications, paving the way for more efficient and accurate 3D human pose estimation in diverse scenarios.
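
To make the dual-pathway idea concrete, the following is a minimal, illustrative PyTorch sketch of factorized spatial (intra-frame, over joints) and temporal (inter-frame, over frames) self-attention with a depthwise-separable-convolution fusion step, applied to a pose sequence of shape (batch, frames, joints, channels). This is not the authors' implementation; the module names, shapes, and hyperparameters (SpatialTemporalBlock, AxialAttention, dim, heads) are hypothetical placeholders chosen only to mirror the design described in the abstract.

```python
# Illustrative sketch of factorized spatial/temporal attention for a pose
# sequence, in the spirit of the dual-pathway design described above.
# NOT the official DSTCFormer code; names and hyperparameters are placeholders.
import torch
import torch.nn as nn


class AxialAttention(nn.Module):
    """Multi-head self-attention applied along one axis (joints or frames)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * other_axis, axis_len, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection


class SpatialTemporalBlock(nn.Module):
    """Parallel spatial and temporal attention pathways, fused by a
    depthwise separable 1D convolution over the frame axis."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spatial = AxialAttention(dim, heads)   # attends over joints
        self.temporal = AxialAttention(dim, heads)  # attends over frames
        # Depthwise separable convolution as a simple stand-in for the
        # cross-stage fusion mechanism described in the abstract.
        self.fuse = nn.Sequential(
            nn.Conv1d(2 * dim, 2 * dim, kernel_size=3, padding=1, groups=2 * dim),
            nn.Conv1d(2 * dim, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim)
        b, t, j, c = x.shape
        # Spatial pathway: attention over joints within each frame.
        s = self.spatial(x.reshape(b * t, j, c)).reshape(b, t, j, c)
        # Temporal pathway: attention over frames for each joint.
        m = x.permute(0, 2, 1, 3).reshape(b * j, t, c)
        m = self.temporal(m).reshape(b, j, t, c).permute(0, 2, 1, 3)
        # Fuse the two pathways along the frame axis, per joint.
        f = torch.cat([s, m], dim=-1)                        # (b, t, j, 2c)
        f = f.permute(0, 2, 3, 1).reshape(b * j, 2 * c, t)   # (b*j, 2c, t)
        f = self.fuse(f).reshape(b, j, c, t).permute(0, 3, 1, 2)
        return f                                             # (b, t, j, c)


if __name__ == "__main__":
    # Example: 243-frame clip, 17 joints, 64-dim features per joint.
    block = SpatialTemporalBlock(dim=64, heads=4)
    poses = torch.randn(2, 243, 17, 64)
    print(block(poses).shape)  # torch.Size([2, 243, 17, 64])
```

Factorizing attention this way replaces one global attention over all frame-joint tokens (quadratic in frames x joints) with two smaller attentions, one quadratic in joints and one quadratic in frames, which is the source of the efficiency gain the abstract refers to.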