TraMP-LLaMA: Generative Interpretability with Decoupled Instruction Tuning for Facial Expression Quality Assessment

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

TraMP-LLaMA is a unified multimodal framework for Facial Expression Quality Assessment (FEQA). It targets the interpretability gap in existing methods, which output only a severity score with no explicit facial motion evidence. The framework jointly predicts severity scores and generates structured textual reports grounded in facial motion cues, which is particularly relevant for Parkinson's disease assessment. It integrates RGB appearance and landmark trajectory cues and uses a decoupled instruction-tuning strategy to minimize task interference between severity prediction and language generation. To support this, the authors extended the PFED5 dataset with expert-guided textual motion descriptions, creating PFED5-plus. On PFED5-plus, TraMP-LLaMA outperforms competitive video-language baselines in report generation and achieves the best severity prediction performance, improving Spearman's rank correlation by at least 4.39 percent over all compared methods under joint multi-expression training.
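Severity prediction here is reported with Spearman's rank correlation, which measures how well predicted scores preserve the ordering of the expert scores. The following is a minimal sketch of that metric in plain Python, assuming average ranks for ties. The function names and the sample scores are illustrative only and do not come from the paper or from PFED5-plus.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up expert severity labels and model predictions, for illustration only.
expert_scores = [0, 1, 1, 2, 3]
predicted_scores = [0.1, 0.9, 1.2, 2.5, 2.0]
rho = spearman_rho(expert_scores, predicted_scores)
print(f"Spearman rho: {rho:.4f}")
```

Because the metric depends only on ranks, a model can score well even when its raw outputs are on a different scale from the expert labels, as long as it orders the clips by severity correctly.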

Key takeaway

For AI scientists building interpretable facial expression quality assessment models, particularly in clinical contexts such as Parkinson's disease, TraMP-LLaMA offers a concrete design to follow. Consider combining multimodal inputs such as RGB appearance and landmark trajectories with a decoupled instruction-tuning strategy, so that the model produces both a severity score and an explanatory textual report. The report exposes the facial motion evidence behind each score, which a single opaque severity value cannot do.

Key insights

TraMP-LLaMA offers generative interpretability for facial expression quality assessment by jointly predicting severity and generating textual motion reports.

Principles

Method

TraMP-LLaMA integrates RGB appearance and landmark trajectory cues, applying decoupled instruction-tuning to jointly predict severity scores and generate structured textual reports from facial motion evidence.
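This summary does not say how the decoupling is implemented. One possible reading, offered purely as an assumption, is that each annotated clip yields two separate single-task instruction samples (one for the severity score, one for the report) rather than one combined sample, so the two objectives can be batched or scheduled independently. The sketch below illustrates only that data-construction idea. Every name, prompt, and value in it is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedClip:
    """One annotated video clip. All field names are hypothetical."""
    clip_id: str
    expression: str  # illustrative value only, e.g. "smile"
    severity: int    # expert severity score
    report: str      # expert-guided textual motion description


# Hypothetical prompt templates, not the paper's wording.
SCORE_PROMPT = "Rate the severity of the '{expression}' expression in this clip."
REPORT_PROMPT = "Describe the facial motion observed during the '{expression}' expression."


def build_decoupled_samples(clip):
    """Return two single-task samples for one clip instead of one combined sample."""
    # Both tasks refer to the same multimodal inputs (placeholder names here).
    base = {
        "clip_id": clip.clip_id,
        "inputs": ("rgb_frames", "landmark_trajectories"),
    }
    score_sample = {
        **base,
        "task": "severity",
        "instruction": SCORE_PROMPT.format(expression=clip.expression),
        "target": str(clip.severity),
    }
    report_sample = {
        **base,
        "task": "report",
        "instruction": REPORT_PROMPT.format(expression=clip.expression),
        "target": clip.report,
    }
    return [score_sample, report_sample]


# Made-up annotation, for illustration only.
clip = AnnotatedClip(
    clip_id="clip_0001",
    expression="smile",
    severity=2,
    report="Mouth corners rise slowly and with reduced amplitude.",
)
samples = build_decoupled_samples(clip)
for sample in samples:
    print(sample["task"], "->", sample["target"])
```

Keeping the two targets in separate samples means neither output format constrains the other during training. Whether TraMP-LLaMA decouples at the data level, in the loss, or across training stages is not stated in this summary.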

Best for: NLP Engineer, Computer Vision Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.