International Journal For Multidisciplinary Research
E-ISSN: 2582-2160 • Impact Factor: 9.24
A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal
MMEF-Net: Multimodal Emotion Feature Network with Contextual Enrichment and Dynamic Modality Weighting
| Author(s) | Ms. Harshita Dubey, Mr. Mohit Kadwal |
|---|---|
| Country | India |
| Abstract | Recognizing human emotional states is a foundational challenge in building intelligent and responsive Human-Computer Interaction (HCI) systems. This paper presents MMEF-Net, a Multimodal Emotion Feature Network that integrates audio, visual, and textual modalities through a hierarchical contextual enrichment strategy combined with a dynamic modality weighting mechanism. To overcome the persistent limitation of scarce annotated training data, MMEF-Net employs state-of-the-art pre-trained encoders that provide transferable and discriminative representations for each modality. The audio branch applies HuBERT-Large (Hidden-Unit BERT) with selective extraction of intermediate transformer layers known to encode higher-level prosodic and spectral properties. The visual branch adopts a dual-path encoder pairing the Contrastive Language–Image Pre-Training Vision Transformer Large (CLIP-ViT-Large) for holistic frame-level representations with OpenFace 2.0-derived facial region crops for fine-grained expression analysis. The textual branch incorporates a Large Language Model (LLM)-guided augmentation pipeline in which GPT-4 generates emotion-aware pseudo-labels and salient keywords, while Qwen-Omni contributes video-grounded descriptions and supplementary emotional cues; these enriched signals are jointly encoded by ChineseRoBERTa-wwm-ext-large. Cross-modal integration is achieved through a dynamic weighting module that applies self-attention with residual skip connections, preventing feature degradation during fusion. A multi-source label refinement pipeline further mitigates annotation noise by combining weak-classifier predictions with LLM-generated labels through majority voting. Extensive experiments on the MER2025-SEMI benchmark demonstrate that MMEF-Net attains a Weighted Average F-score (WAF) of 87.52%, a gain of more than ten percentage points over the official baseline of 76.80%, confirming the effectiveness of the proposed design for real-world multimodal emotion recognition. |
| Keywords | Multimodal emotion recognition, MMEF-Net, dynamic modality weighting, contextual enrichment, large language models, self-attention fusion, HuBERT-Large, CLIP-ViT, affective computing, ensemble learning. |
| Field | Computer > Artificial Intelligence / Simulation / Virtual Reality |
| Published In | Volume 8, Issue 3, May-June 2026 |
| Published On | 2026-05-08 |
| DOI | https://doi.org/10.36948/ijfmr.2026.v08i03.77483 |
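The fusion step named in the abstract, a dynamic weighting module that applies self-attention with residual skip connections over the per-modality features, can be illustrated with a minimal sketch. This is not the authors' implementation: the module name `DynamicModalityFusion`, the 512-dimensional embeddings, the number of attention heads, and the softmax-gated modality weights are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's code) of dynamic modality weighting:
# per-modality embeddings are fused with self-attention plus a residual skip
# connection, and a learned gate yields per-sample modality weights.
import torch
import torch.nn as nn


class DynamicModalityFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Self-attention across the three modality tokens (audio, visual, text).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Scalar gate per modality; softmax turns it into dynamic weights.
        self.gate = nn.Linear(dim, 1)

    def forward(self, audio, visual, text):
        # Each input: (batch, dim) projected modality embedding.
        tokens = torch.stack([audio, visual, text], dim=1)   # (batch, 3, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        # Residual skip connection: attended features are added to, not
        # substituted for, the original tokens, avoiding feature degradation.
        tokens = self.norm(tokens + attended)
        weights = torch.softmax(self.gate(tokens), dim=1)    # (batch, 3, 1)
        fused = (weights * tokens).sum(dim=1)                # (batch, dim)
        return fused, weights.squeeze(-1)


# Example: fuse dummy 512-d audio/visual/text embeddings for a batch of 4.
if __name__ == "__main__":
    fusion = DynamicModalityFusion(dim=512)
    a, v, t = (torch.randn(4, 512) for _ in range(3))
    fused, w = fusion(a, v, t)
    print(fused.shape, w.shape)  # torch.Size([4, 512]) torch.Size([4, 3])
```

The residual addition before the layer norm reflects what the abstract's "skip connections" suggest: fusion augments rather than overwrites the modality-specific representations, so no single modality can erase the others during integration.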
All research papers published on this website are licensed under Creative Commons Attribution-ShareAlike 4.0 International License, and all rights belong to their respective authors/researchers.