TraMP-LLaMA: Generative Interpretability with Decoupled Instruction Tuning for Facial Expression Quality Assessment

2026-06-25 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

TraMP-LLaMA is a unified multimodal framework for Facial Expression Quality Assessment (FEQA). Existing methods output only a severity score, with no explicit facial motion evidence to support it. TraMP-LLaMA jointly predicts the severity score and generates a structured textual report from facial motion cues, which improves interpretability, particularly for Parkinson's disease assessment. It combines RGB appearance with landmark trajectory cues and uses a decoupled instruction-tuning strategy to limit task interference between severity prediction and language generation. To support development, the PFED5 dataset was extended with expert-guided textual motion descriptions, producing PFED5-plus. On PFED5-plus, TraMP-LLaMA outperforms competitive video-language baselines in report generation and achieves the best severity prediction, improving Spearman's rank correlation by at least 4.39 percent.
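Severity prediction in the summary above is reported with Spearman's rank correlation, which measures how well the predicted scores preserve the ordering of the reference scores. The following is a minimal standard-library sketch of that metric, not the paper's evaluation code; the function names and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(predicted, reference):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equally sized sequences of length >= 2")
    rp, rr = average_ranks(predicted), average_ranks(reference)
    mp, mr = mean(rp), mean(rr)
    cov = sum((a - mp) * (b - mr) for a, b in zip(rp, rr))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / sqrt(var_p * var_r)


if __name__ == "__main__":
    # Hypothetical model outputs and clinician severity labels.
    predicted = [0.4, 1.2, 2.9, 2.1, 3.8]
    reference = [0, 1, 2, 3, 4]
    print(f"Spearman rho: {spearman_rho(predicted, reference):.3f}")
```

Because the metric depends only on ranks, a model can score well even if its raw outputs are on a different scale from the reference labels.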

Key takeaway

For machine learning engineers building AI for medical diagnostics or human-computer interaction: if your facial expression quality assessment models output only severity scores, TraMP-LLaMA shows how to pair each score with an explicit textual report of the facial motion evidence behind it. Consider adopting this kind of generative interpretability to improve diagnostic transparency and user trust in your applications.

Key insights

TraMP-LLaMA provides generative interpretability for facial expression quality assessment by jointly predicting scores and generating textual reports.

Principles

Decoupled instruction tuning reduces task interference.
Multimodal cues (RGB, landmark trajectories) enhance assessment.
Generative textual reports improve model interpretability.
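The summary does not say how the decoupling is implemented. One plausible reading, sketched below purely as an assumption, is that each annotated clip yields a separate instruction sample per task, so the severity target and the report target never share one output sequence. The `Clip` fields, prompt strings, and function names are all hypothetical and are not taken from the TraMP-LLaMA code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    severity: int  # hypothetical integer severity label
    report: str    # expert-guided motion description


# Hypothetical prompt templates; the source does not give the real ones.
SCORE_PROMPT = "Rate the severity of the facial expression in this clip."
REPORT_PROMPT = "Describe the facial motion observed in this clip."


def build_decoupled_samples(clip):
    """Return one instruction sample per task instead of a combined target."""
    return [
        {
            "clip_id": clip.clip_id,
            "task": "score",
            "instruction": SCORE_PROMPT,
            "target": str(clip.severity),
        },
        {
            "clip_id": clip.clip_id,
            "task": "report",
            "instruction": REPORT_PROMPT,
            "target": clip.report,
        },
    ]


def split_by_task(samples):
    """Group samples by task so each task can be batched or weighted alone."""
    by_task = {}
    for sample in samples:
        by_task.setdefault(sample["task"], []).append(sample)
    return by_task


if __name__ == "__main__":
    clip = Clip("clip_001", 2, "Reduced smile amplitude with slow onset.")
    for task, items in split_by_task(build_decoupled_samples(clip)).items():
        print(task, "->", items[0]["target"])
```

Keeping the two targets in separate samples is what lets a training loop schedule or weight the tasks independently, which is the general idea behind reducing interference between them.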

Method

The framework integrates RGB appearance and landmark trajectory cues, employing a decoupled instruction-tuning strategy for joint severity prediction and language generation.
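To make the two input streams concrete, the sketch below derives a simple landmark trajectory cue (frame-to-frame displacement per landmark) and concatenates it with per-step appearance features. This illustrates the general idea only: the source does not specify the actual trajectory encoding or fusion mechanism, and `trajectory_displacements`, `fuse`, and the toy inputs are hypothetical.

```python
from math import hypot


def trajectory_displacements(frames):
    """Per consecutive frame pair, return each landmark's displacement length.

    frames: list of frames, each a list of (x, y) landmark coordinates.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    n_landmarks = len(frames[0])
    if any(len(frame) != n_landmarks for frame in frames):
        raise ValueError("every frame must have the same number of landmarks")
    return [
        [hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def fuse(rgb_features, motion_features):
    """Concatenate appearance and motion vectors step by step."""
    if len(rgb_features) != len(motion_features):
        raise ValueError("feature sequences must be aligned in time")
    return [list(rgb) + list(motion)
            for rgb, motion in zip(rgb_features, motion_features)]


if __name__ == "__main__":
    # Two landmarks tracked over three frames (toy coordinates).
    frames = [
        [(0.0, 0.0), (1.0, 1.0)],
        [(3.0, 4.0), (1.0, 1.0)],
        [(3.0, 4.0), (1.0, 2.0)],
    ]
    motion = trajectory_displacements(frames)
    # Placeholder appearance features, one vector per frame transition.
    rgb = [[0.1, 0.2], [0.3, 0.4]]
    print(fuse(rgb, motion))
```

Plain concatenation is the simplest possible fusion; a real system would more likely project each stream into a shared embedding space before passing it to the language model.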

In practice

Use TraMP-LLaMA for joint FEQA scoring and report generation.
Use the PFED5-plus dataset for facial motion description research.
Explore the provided code for an implementation of the multimodal framework.

Topics

Facial Expression Quality Assessment
Generative Interpretability
Multimodal AI
Instruction Tuning
Parkinson's Disease
Landmark Trajectory Analysis

Code references

shuchaoduan/TraMP-LLaMA

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.