A Multi-scale Feature Fusion Method for Hand-object Interaction Detection in Complex Backgrounds

  • Abstract: Hand-object interaction detection is degraded by complex backgrounds, such as scene-dependent background noise and lighting variation, and by mutual hand-object occlusion and low resolution. To address these problems, a two-stage multi-scale feature fusion method for hand-object interaction detection is proposed. First, a ResNet50 residual network combined with a feature pyramid is used as the feature-extraction backbone, fusing deep semantic information with shallow detail features across scales to improve small-object detection accuracy. Next, the geometric information of the detected hand and object regions is used to decide whether an interaction occurs, and non-interacting objects are filtered out. Finally, experiments are conducted on a large-scale dataset of human interaction video frames, covering indoor and outdoor scenes and 11 categories of hand-contacted objects. The results show that, compared with two-stage detection methods, the proposed method improves detection accuracy without increasing model complexity. Detection accuracy is also relatively stable across the dataset's categories, indicating improved generalization.
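The second stage, which uses the geometry of detected hand and object regions to discard non-interacting objects, can be illustrated with a minimal sketch. The abstract does not state which geometric cues or thresholds the method uses, so everything here is an assumption made for illustration: the box format `(x1, y1, x2, y2)`, the overlap and center-distance cues, the function names `iou`, `center`, and `filter_interacting`, and the default values of `iou_thresh` and `dist_ratio`.

```python
from math import hypot


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def filter_interacting(hand, objects, iou_thresh=0.05, dist_ratio=1.0):
    """Keep objects that overlap the hand box or lie close to it.

    Both thresholds are illustrative assumptions, not values from the paper.
    An object is kept if its IoU with the hand is at least iou_thresh, or if
    its center is within dist_ratio hand-diagonals of the hand center.
    """
    hx, hy = center(hand)
    hand_diag = hypot(hand[2] - hand[0], hand[3] - hand[1])
    kept = []
    for obj in objects:
        ox, oy = center(obj)
        close = hypot(ox - hx, oy - hy) <= dist_ratio * hand_diag
        if iou(hand, obj) >= iou_thresh or close:
            kept.append(obj)
    return kept


hand_box = (100, 100, 200, 200)
object_boxes = [(150, 150, 250, 250), (600, 600, 700, 700)]
interacting = filter_interacting(hand_box, object_boxes)
```

In this example the first object overlaps the hand box and is kept, while the second lies far from the hand and is filtered out. A learned interaction classifier could replace the fixed thresholds; the rule-based form is used here only because it makes the geometric reasoning explicit.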

     
