Abstract:
Extracting highly discriminative facial features remains challenging in real-world scenarios, where common interference factors such as varying illumination and partial occlusion of facial areas weaken model generalization. To improve both the perception of subtle expressions and overall robustness, this paper proposes ERAnet, an enhanced facial expression recognition model built on the ConvNeXt architecture. Its core contribution is the design and integration of an efficient regional attention module that fuses global semantic information with local fine-grained details through a dynamic region-focusing mechanism: by combining learnable regional masks with channel attention, the model automatically attends to the most discriminative facial regions. The module dynamically adjusts attention across facial regions, reallocates channel weights, and integrates multi-scale grouped convolutions with attention to enrich feature representations while keeping feature extraction efficient under complex conditions. On the publicly available FERPlus and RAF-DB datasets, ERAnet achieves recognition accuracies of 91.45% and 90.29%, respectively, improvements of 1.76 and 2.43 percentage points over strong baseline models. These results demonstrate the effectiveness and practical potential of the proposed method for real-world facial expression recognition.
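To make the described mechanism concrete, the following is a minimal, dependency-free sketch of the general idea behind the regional attention module: a learnable spatial mask gates facial regions, and a squeeze-and-excitation-style bottleneck reweights channels. All names, toy tensor sizes, and the exact gating/bottleneck layout are illustrative assumptions, not the actual ERAnet implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def regional_attention(features, region_mask, fc1, fc2):
    """Hypothetical sketch, not the paper's exact module.

    features:    [C][H][W] nested lists (a toy feature map)
    region_mask: [H][W] learnable logits for the spatial mask
    fc1, fc2:    weights of a two-layer channel-attention bottleneck
    """
    C = len(features)
    H, W = len(features[0]), len(features[0][0])

    # 1) Spatial gating: the learnable regional mask emphasizes
    #    discriminative facial areas (sigmoid keeps gates in (0, 1)).
    gated = [[[features[c][i][j] * sigmoid(region_mask[i][j])
               for j in range(W)] for i in range(H)] for c in range(C)]

    # 2) Squeeze: global average pooling per channel.
    pooled = [sum(gated[c][i][j] for i in range(H) for j in range(W)) / (H * W)
              for c in range(C)]

    # 3) Excite: two-layer bottleneck (ReLU then sigmoid) produces
    #    per-channel attention weights.
    hidden = [max(0.0, sum(fc1[k][c] * pooled[c] for c in range(C)))
              for k in range(len(fc1))]
    weights = [sigmoid(sum(fc2[c][k] * hidden[k] for k in range(len(hidden))))
               for c in range(C)]

    # 4) Reweight the channels of the spatially gated features.
    return [[[gated[c][i][j] * weights[c] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

In a real network both the mask logits and the bottleneck weights would be trained end to end; the sketch only shows how the spatial and channel attention compose multiplicatively.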