Robot indoor scene recognition based on fusion of CNN and Transformer

CLC Number: TP242; TN98

    Abstract:

    To improve the accuracy of robot scene recognition in complex indoor environments, this paper proposes a scene recognition model that fuses a convolutional neural network (CNN) with a visual Transformer structure. The CNN extracts local features of the scene, and the visual Transformer captures long-range dependencies among those features. The proposed visual Transformer consists of three parts: a feature encoding structure (Attention Embedding), an Encoder structure, and a structure that converts high-level semantic features back into pixel-level features (Attention Project). The two components complement each other: the CNN strengthens the visual Transformer's description of local detail features, while the visual Transformer helps the CNN model dependencies between distant features, so the model can effectively characterize and exploit the visual features of images of the robot's working scene. The model is validated experimentally on a dataset collected by the robot in its actual working environment and on the open-source COLD dataset. The experiments indicate higher scene recognition accuracy for the proposed model.
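
    The three-part structure named in the abstract can be illustrated with a minimal sketch. The abstract gives no layer details, so everything below is an assumption made for illustration: the function names (attention_embedding, encoder, attention_project, fuse) are hypothetical, the CNN is replaced by a hand-written feature map, the Encoder is reduced to single-head self-attention with identity projections, and the fusion is a plain residual sum. It shows only the general idea that each output position mixes its local feature with context from every other position.

```python
import math

def attention_embedding(feature_map):
    """Flatten an H x W x C feature map (nested lists) into H*W tokens.
    Stand-in for the paper's Attention Embedding; the real encoding is not
    described in the abstract."""
    return [list(pixel) for row in feature_map for pixel in row]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encoder(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: queries, keys and values are the tokens themselves
    (no learned projection matrices)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[c] for w, value in zip(weights, tokens)) for c in range(dim)]
        )
    return attended

def attention_project(tokens, height, width):
    """Reshape the token sequence back into an H x W x C grid.
    Stand-in for the paper's Attention Project."""
    return [[tokens[r * width + c] for c in range(width)] for r in range(height)]

def fuse(feature_map):
    """Add global context from attention to the local (CNN-like) features."""
    height, width = len(feature_map), len(feature_map[0])
    tokens = attention_embedding(feature_map)
    context = attention_project(encoder(tokens), height, width)
    return [
        [
            [local + glob for local, glob in zip(feature_map[r][c], context[r][c])]
            for c in range(width)
        ]
        for r in range(height)
    ]

# Hypothetical 2 x 2 feature map with 2 channels, standing in for CNN output.
features = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
fused = fuse(features)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

    Because the attention weights for each position sum to one over all positions, changing the feature at one corner of the grid changes the fused output at the opposite corner, which is the long-range dependency a convolution with a small kernel does not provide on its own.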

History
  • Online: September 18, 2023