International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


MMEF-Net: Multimodal Emotion Feature Network with Contextual Enrichment and Dynamic Modality Weighting

Author(s) Ms. Harshita Dubey, Mr. Mohit Kadwal
Country India
Abstract Recognizing human emotional states is a foundational challenge in building intelligent and responsive Human-Computer Interaction (HCI) systems. This paper presents MMEF-Net, a Multimodal Emotion Feature Network that integrates audio, visual, and textual modalities through a hierarchical contextual enrichment strategy combined with a dynamic modality weighting mechanism. To overcome the persistent limitation of scarce annotated training data, MMEF-Net employs state-of-the-art pre-trained encoders that provide transferable and discriminative representations for each modality. The audio branch applies HuBERT-Large (Hidden-Unit BERT) with selective extraction of intermediate transformer layers known to encode higher-level prosodic and spectral properties. The visual branch adopts a dual-path encoder pairing Contrastive Language–Image Pre-Training Vision Transformer Large (CLIP-ViT-Large) for holistic frame-level representations with OpenFace 2.0-derived facial-region crops for fine-grained expression analysis. The textual branch incorporates a Large Language Model (LLM)-guided augmentation pipeline in which GPT-4 generates emotion-aware pseudo-labels and salient keywords, while Qwen-Omni contributes video-grounded descriptions and supplementary emotional cues; these enriched signals are jointly encoded by ChineseRoBERTa-wwm-ext-large. Cross-modal integration is achieved through a dynamic weighting module that applies self-attention with residual skip connections, preventing feature degradation during fusion. A multi-source label refinement pipeline further mitigates annotation noise by combining weak-classifier predictions with LLM-generated labels through majority voting.
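The dynamic weighting module described above can be sketched as self-attention computed across the modality axis, with a residual connection that preserves the original per-modality features. The projection matrices and dimensions below are hypothetical placeholders, not the paper's actual parameters; this is a minimal NumPy illustration of the fusion pattern, not MMEF-Net itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_modality_fusion(feats, Wq, Wk, Wv):
    """Self-attention over modalities with a residual skip connection.

    feats: (M, d) array, one row per modality (e.g. audio, visual, text).
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical learned weights).
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    d = feats.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (M, M) modality weights
    fused = attn @ V + feats                        # residual keeps originals intact
    return fused, attn

rng = np.random.default_rng(0)
d = 8
feats = rng.normal(size=(3, d))  # stand-ins for audio, visual, text embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused, attn = dynamic_modality_fusion(feats, Wq, Wk, Wv)
```

The residual term is what the abstract credits with preventing feature degradation: even if the attention weights collapse toward one modality, each modality's original representation still passes through unchanged.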
Extensive experiments on the MER2025-SEMI benchmark demonstrate that MMEF-Net attains a Weighted Average F-score (WAF) of 87.52%, representing a gain exceeding ten percentage points over the official baseline of 76.80%, thereby confirming the effectiveness of the proposed design for real-world multimodal emotion recognition.
Keywords Multimodal emotion recognition, MMEF-Net, dynamic modality weighting, contextual enrichment, large language models, self-attention fusion, HuBERT-Large, CLIP-ViT, affective computing, ensemble learning.
Field Computer > Artificial Intelligence / Simulation / Virtual Reality
Published In Volume 8, Issue 3, May-June 2026
Published On 2026-05-08
DOI https://doi.org/10.36948/ijfmr.2026.v08i03.77483