Abstract:
Extracting highly discriminative facial features remains challenging in real-world scenarios, where common interference factors such as varying illumination and partial occlusion of facial areas weaken model generalization. To improve both the perception of subtle expressions and overall robustness, this paper proposes ERAnet, an enhanced facial expression recognition model built on the ConvNeXt architecture. Its core contribution is the design and integration of an efficient regional attention module that fuses global semantic information with local fine-grained details through a dynamic region-focusing mechanism: by combining learnable regional masks with channel attention, the model automatically attends to the most discriminative facial regions. The module dynamically adjusts attention across facial regions, reallocates channel weights, and integrates multi-scale grouped convolutions with attention to enrich feature representations while keeping feature extraction efficient under complex conditions. On the publicly available FERPlus and RAF-DB datasets, ERAnet achieves recognition accuracies of 91.45% and 90.29%, respectively, improvements of 1.76 and 2.43 percentage points over strong baseline models. These results demonstrate the effectiveness and practical potential of the proposed method for real-world facial expression recognition.
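To make the described mechanism concrete, the following is a minimal, dependency-free sketch of the general idea behind the regional attention module: a learnable spatial mask gates facial regions, and a squeeze-and-excitation-style bottleneck reweights channels. All names, toy tensor sizes, and the exact gating/bottleneck layout are illustrative assumptions, not the actual ERAnet implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def regional_attention(features, region_mask, fc1, fc2):
    """Hypothetical sketch, not the paper's exact module.

    features:    [C][H][W] nested lists (a toy feature map)
    region_mask: [H][W] learnable logits for the spatial mask
    fc1, fc2:    weights of a two-layer channel-attention bottleneck
    """
    C = len(features)
    H, W = len(features[0]), len(features[0][0])

    # 1) Spatial gating: the learnable regional mask emphasizes
    #    discriminative facial areas (sigmoid keeps gates in (0, 1)).
    gated = [[[features[c][i][j] * sigmoid(region_mask[i][j])
               for j in range(W)] for i in range(H)] for c in range(C)]

    # 2) Squeeze: global average pooling per channel.
    pooled = [sum(gated[c][i][j] for i in range(H) for j in range(W)) / (H * W)
              for c in range(C)]

    # 3) Excite: two-layer bottleneck (ReLU then sigmoid) produces
    #    per-channel attention weights.
    hidden = [max(0.0, sum(fc1[k][c] * pooled[c] for c in range(C)))
              for k in range(len(fc1))]
    weights = [sigmoid(sum(fc2[c][k] * hidden[k] for k in range(len(hidden))))
               for c in range(C)]

    # 4) Reweight the channels of the spatially gated features.
    return [[[gated[c][i][j] * weights[c] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

In a real network both the mask logits and the bottleneck weights would be trained end to end; the sketch only shows how the spatial and channel attention compose multiplicatively.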