International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


MMEF-Net: Multimodal Emotion Feature Network with Contextual Enrichment and Dynamic Modality Weighting

Author(s) Ms. Harshita Dubey, Mr. Mohit Kadwal
Country India
Abstract Recognizing human emotional states is a foundational challenge in building intelligent and responsive Human-Computer Interaction (HCI) systems. This paper presents MMEF-Net, a Multimodal Emotion Feature Network that integrates audio, visual, and textual modalities through a hierarchical contextual enrichment strategy combined with a dynamic modality weighting mechanism. To overcome the persistent limitation of scarce annotated training data, MMEF-Net employs state-of-the-art pre-trained encoders that provide transferable and discriminative representations for each modality. The audio branch applies HuBERT-Large (Hidden-Unit BERT) with selective extraction of intermediate transformer layers known to encode higher-level prosodic and spectral properties. The visual branch adopts a dual-path encoder pairing Contrastive Language–Image Pre-Training Vision Transformer Large (CLIP-ViT-Large) for holistic frame-level representations with OpenFace 2.0-derived facial-region crops for fine-grained expression analysis. The textual branch incorporates a Large Language Model (LLM)-guided augmentation pipeline in which GPT-4 generates emotion-aware pseudo-labels and salient keywords, while Qwen-Omni contributes video-grounded descriptions and supplementary emotional cues; these enriched signals are jointly encoded by ChineseRoBERTa-wwm-ext-large. Cross-modal integration is achieved through a dynamic weighting module that applies self-attention with residual skip connections, preventing feature degradation during fusion. A multi-source label refinement pipeline further mitigates annotation noise by combining weak-classifier predictions with LLM-generated labels through majority voting.
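The dynamic weighting module described above can be sketched as self-attention computed across the modality axis, with a residual connection that preserves the original per-modality features. The projection matrices and dimensions below are hypothetical placeholders, not the paper's actual parameters; this is a minimal NumPy illustration of the fusion pattern, not MMEF-Net itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_modality_fusion(feats, Wq, Wk, Wv):
    """Self-attention over modalities with a residual skip connection.

    feats: (M, d) array, one row per modality (e.g. audio, visual, text).
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical learned weights).
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    d = feats.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (M, M) modality weights
    fused = attn @ V + feats                        # residual keeps originals intact
    return fused, attn

rng = np.random.default_rng(0)
d = 8
feats = rng.normal(size=(3, d))  # stand-ins for audio, visual, text embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused, attn = dynamic_modality_fusion(feats, Wq, Wk, Wv)
```

The residual term is what the abstract credits with preventing feature degradation: even if the attention weights collapse toward one modality, each modality's original representation still passes through unchanged.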
Extensive experiments on the MER2025-SEMI benchmark demonstrate that MMEF-Net attains a Weighted Average F-score (WAF) of 87.52%, representing a gain exceeding ten percentage points over the official baseline of 76.80%, thereby confirming the effectiveness of the proposed design for real-world multimodal emotion recognition.
Keywords Multimodal emotion recognition, MMEF-Net, dynamic modality weighting, contextual enrichment, large language models, self-attention fusion, HuBERT-Large, CLIP-ViT, affective computing, ensemble learning.
Field Computer > Artificial Intelligence / Simulation / Virtual Reality
Published In Volume 8, Issue 3, May-June 2026
Published On 2026-05-08
DOI https://doi.org/10.36948/ijfmr.2026.v08i03.77483