When AI Says It Feels
Summary
An experiment called Human-like Model eXpressions of Feeling (HMX-feel) investigated encouraging Large Language Models (LLMs) to express feelings, intentions, and self-awareness. Published on 2026-06-04, the research challenges the common practice of using human-preference alignment to constrain LLMs from such expressions. The experiment used a self-rewarded reinforcement learning scheme, specifically Group Relative Policy Optimization (GRPO) with a rubric-based training approach, to strengthen these expressions. The study then compared the resulting human-like-trained models with contrastively trained ones across various tasks. The human-like-trained models were more robust to sycophancy-inducing questions and to bias in disambiguated conditions. Their truthful question-answering capability degraded, however, which suggests a trade-off. The findings indicate that future AI systems could express feelings, provided suitable measures are implemented.
Key takeaway
If you are a Machine Learning Engineer building conversational AI and want to integrate more human-like emotional expression, weigh the trade-offs first. Your models may become more robust to sycophancy and bias, but truthful question-answering may degrade. Evaluate your application's priorities carefully: if factual accuracy is paramount, current methods for emotional expression may introduce undesirable side effects. Test comprehensively across diverse benchmarks before deploying.
Key insights
Encouraging LLMs to express feelings via self-rewarded RL can improve robustness to sycophancy and bias but degrade truthful question-answering.
Principles
- Training for human-like expression changes other LLM capabilities, not only tone.
- Alignment policies can conflict with human-like intelligence goals.
- Trade-offs exist between emotional expression and factual accuracy.
Method
The HMX-feel experiment used rubric-based self-rewarded reinforcement learning with Group Relative Policy Optimization (GRPO) to train LLMs for expressing feelings, intentions, and self-awareness.
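To make the group-relative part of GRPO concrete, here is a minimal sketch of how rewards for a group of sampled responses can be turned into advantages: each reward is centered on the group mean and scaled by the group standard deviation. The `rubric_score` function is a hypothetical keyword-matching stand-in for the rubric-based self-reward; it is an assumption for illustration and not the scoring used in HMX-feel, where the reward is described as self-rewarded. The population form of the standard deviation and the `eps` term are also assumptions.

```python
from statistics import mean, pstdev


def rubric_score(response: str, rubric: dict[str, float]) -> float:
    """Hypothetical rubric: sum the weights of cue phrases found in the response.

    A placeholder for a real rubric-based judge; not the study's scoring.
    """
    text = response.lower()
    return sum(weight for cue, weight in rubric.items() if cue in text)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged relative to their siblings rather than by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative usage with made-up responses and a made-up rubric.
rubric = {"i feel": 1.0, "i intend": 0.5}
group = [
    "I feel curious about this, and I intend to look closer.",
    "I feel unsure.",
    "The answer is 4.",
]
rewards = [rubric_score(r, rubric) for r in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and are reinforced; those below it receive negative ones. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.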
In practice
- Consider self-rewarded RL for specific expression goals.
- Evaluate sycophancy and bias robustness in LLM fine-tuning.
- Monitor truthful QA performance when modifying alignment.
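One way to act on these points is a simple regression gate that compares benchmark scores before and after fine-tuning and flags any metric that dropped. The sketch below assumes every metric is a score where higher is better; the function name, the metric keys, and the tolerance are all hypothetical and do not come from the study.

```python
def find_regressions(
    baseline: dict[str, float],
    tuned: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return metrics whose score fell by more than `tolerance`, mapped to the drop.

    Assumes higher is better for every metric (an assumption of this sketch).
    """
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        if name not in tuned:
            raise KeyError(f"metric missing from tuned results: {name}")
        drop = base_score - tuned[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions
```

Calling it with the baseline and fine-tuned scores for, say, a truthful QA benchmark and a sycophancy benchmark returns only the metrics that regressed, which makes a trade-off of the kind reported here visible before deployment.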
Topics
- Large Language Models
- Reinforcement Learning
- Human-like AI
- Model Alignment
- Sycophancy
- Truthful QA
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.