3D human pose estimation with dual-stage spatio-temporal convolutional transformer
DOI:
CSTR:
Authors: 邹宇, 周先春, 潘志庚, 蔡创新
Affiliations:

1. School of Artificial Intelligence, Nanjing University of Information Science & Technology, Nanjing 210044, China; 2. Science and Technology Office, Jiangsu Open University, Nanjing 210036, China

Author biography:

Corresponding author:

CLC number:

TN60; TP29

Fund projects:

Supported by the National Natural Science Foundation of China (62072150) and the Jiangsu Provincial Industry-University-Research Cooperation Project (BY20230641)


3D human pose estimation with dual-stage spatio-temporal convolutional transformer
Author:
Affiliation:

1. School of Artificial Intelligence, Nanjing University of Information Science & Technology, Nanjing 210044, China; 2. Science and Technology Office, Jiangsu Open University, Nanjing 210036, China

Fund Project:

    Abstract:

    In the field of 3D human pose estimation, Transformer-based methods have made remarkable progress, but they still face substantial computational challenges when handling large-scale joint affinity matrices, especially in the analysis of dynamic video sequences. To address this, a dual-stage spatio-temporal convolutional Transformer (DSTCFormer) is proposed, aiming to fuse spatio-temporal information efficiently and to improve the accuracy and robustness of 3D pose estimation. Through a parallel spatio-temporal pathway design, a convolution-enhanced attention mechanism, and structure-driven position encoding, the model addresses the low computational efficiency and inadequate spatio-temporal feature modeling of existing methods on long video sequences. In the spatial pathway, a convolutional position embedding module is proposed that explicitly models the topology of the human skeleton through local-neighborhood convolutions; in the temporal pathway, an axis-specific self-attention mechanism is designed to capture cross-frame joint motion trajectories. A convolutional multi-scale attention (CMSA) module is introduced, which combines depthwise separable convolutions with feature transformation layers to achieve cross-fusion of multi-scale spatio-temporal features. In addition, the spatio-temporal dependencies between joints are refined progressively, reducing the computational overhead. Experiments show that on the Human3.6M dataset, DSTCFormer with 243 input frames achieves a mean per joint position error of 40.1 mm under Protocol 1, which is 4.2, 0.8, and 0.4 mm lower than PoseFormer, MixSTE, and STCFormer, respectively; on the MPI-INF-3DHP dataset, it reaches 99.1% PCK@150 mm and an area under the curve (AUC) of 85.2%, outperforming the baseline models by 0.4% and 1.3%, respectively. The proposed method provides an efficient theoretical framework for 3D human pose estimation and lays a technical foundation for applications such as virtual reality and human-computer interaction.
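
    The parallel spatial and temporal pathways described above can be illustrated with a minimal PyTorch-style sketch: spatial self-attention runs over the joints of each frame after a small depthwise convolution injects local skeletal context (standing in for the convolutional position embedding), while temporal self-attention runs over the frames of each joint, and the two pathways are then fused. The module name, feature dimensions, and the simple additive fusion are illustrative assumptions for this sketch, not the exact DSTCFormer design.

import torch
import torch.nn as nn


class AxialSpatioTemporalBlock(nn.Module):
    """Illustrative block: depthwise conv over joints plus axial attention."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Depthwise 1D convolution over the joint axis: a stand-in for the
        # convolutional position embedding that encodes local skeletal topology.
        self.joint_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim)
        b, t, j, c = x.shape

        # Spatial pathway: attend across joints within each frame.
        xs = x.reshape(b * t, j, c)
        xs = xs + self.joint_conv(xs.transpose(1, 2)).transpose(1, 2)
        s = self.norm_s(xs)
        xs = xs + self.spatial_attn(s, s, s)[0]

        # Temporal pathway: attend across frames for each joint.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, c)
        n = self.norm_t(xt)
        xt = xt + self.temporal_attn(n, n, n)[0]

        # Simple additive fusion of the two pathways (the paper uses a richer
        # cross-fusion with depthwise separable convolutions and feature
        # transformation layers).
        return xs.reshape(b, t, j, c) + xt.reshape(b, j, t, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    # A 243-frame clip with 17 joints (Human3.6M convention) and 64-dim features.
    pose_feats = torch.randn(2, 243, 17, 64)
    block = AxialSpatioTemporalBlock(dim=64, heads=4)
    print(block(pose_feats).shape)  # torch.Size([2, 243, 17, 64])

    With T frames and J joints, attending over each axis separately costs on the order of T*J^2 + J*T^2 instead of (T*J)^2 for full joint-time attention, which is the efficiency argument behind decoupling the two pathways.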

    Abstract:

    In recent years, transformer-based methods have achieved remarkable progress in the field of 3D human pose estimation. However, current approaches are still confronted with two major challenges. First, computational inefficiency arises from the quadratic complexity of global self-attention when processing large-scale joint affinity matrices in dynamic video sequences, which significantly hampers the real-time performance of the models. Second, suboptimal spatiotemporal feature fusion restricts the model’s ability to capture fine-grained motion patterns and structural dependencies between joints, leading to less accurate pose estimation results. To tackle these limitations, this paper proposes a novel architecture named the dual-stage spatio-temporal convolutional transformer (DSTCFormer). The key innovation of DSTCFormer lies in its decoupling of spatiotemporal feature learning into parallel spatial and temporal pathways. Specifically, the convolutional multi-scale attention (CMSA) module is introduced to hierarchically aggregate local and global correlations through convolution-enhanced multi-head attention. In the spatial pathway, convolutional position embeddings are utilized to encode skeletal topology, enabling the model to focus on intra-frame joint relationships. Meanwhile, the temporal pathway captures inter-frame motion coherence via axial-specific self-attention. Moreover, a cross-stage fusion mechanism is designed to integrate multi-scale spatiotemporal features through depthwise separable convolutions and feature transformation layers, which ensures efficient computation and robust feature representation. Extensive experiments conducted on the Human3.6M and MPI-INF-3DHP datasets demonstrate the superiority of DSTCFormer. Under Protocol 1 (P1), DSTCFormer achieves a state-of-the-art mean per joint position error (MPJPE) of 40.1 mm on Human3.6M with 243 input frames, outperforming PoseFormer (44.3 mm), MixSTE (40.9 mm), and STCFormer (40.5 mm). On the MPI-INF-3DHP dataset, it attains a percentage of correct keypoints at 150 mm (PCK@150 mm) of 99.1% and an area under the curve (AUC) of 85.2%, surpassing existing methods by 0.4% and 1.3%, respectively. In summary, the proposed method not only advances theoretical frameworks for spatiotemporal modeling but also offers practical implications for real-time applications, paving the way for more efficient and accurate 3D human pose estimation in various scenarios.
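
    The evaluation metrics quoted in both abstracts are standard ones, and the short NumPy sketch below shows how they are typically computed, assuming predicted and ground-truth poses are given as (frames, joints, 3) arrays in millimetres. Under Protocol 1, MPJPE is the mean per-joint Euclidean error after aligning the root joint of the prediction to the ground truth; PCK@150 mm is the fraction of joints whose error falls below 150 mm, and AUC is the area under the PCK curve as this threshold is swept (not computed here). The function names are illustrative, not taken from the paper's code.

import numpy as np


def mpjpe_p1(pred: np.ndarray, gt: np.ndarray, root: int = 0) -> float:
    """Protocol-1 MPJPE: root-align both poses, then average the joint error (mm)."""
    pred = pred - pred[:, root:root + 1]
    gt = gt - gt[:, root:root + 1]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def pck(pred: np.ndarray, gt: np.ndarray, threshold_mm: float = 150.0) -> float:
    """Percentage of joints whose Euclidean error is below the threshold."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return float((err < threshold_mm).mean() * 100.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(scale=300.0, size=(243, 17, 3))     # 243 frames, 17 joints, mm
    pred = gt + rng.normal(scale=20.0, size=gt.shape)   # roughly 20 mm of noise
    print(f"MPJPE (P1): {mpjpe_p1(pred, gt):.1f} mm")
    print(f"PCK@150mm:  {pck(pred, gt):.1f} %")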

Cite this article:

Zou Yu, Zhou Xianchun, Pan Zhigeng, Cai Chuangxin. 3D human pose estimation with dual-stage spatio-temporal convolutional transformer[J]. Journal of Electronic Measurement and Instrumentation, 2025, 39(9): 159-171.


History
  • Received:
  • Revised:
  • Accepted:
  • Online publication date: 2025-12-09
  • Publication date: