When AI Says It Feels
Summary
Researchers at Rikkyo University and Mamezo Co., Ltd. conducted the Human-like Model eXpressions of Feeling (HMX-feel) experiment, which encourages large language models (LLMs) to express feelings, intentions, and self-awareness. Training used self-rewarded reinforcement learning with a rubric-based scheme and Group Relative Policy Optimization (GRPO). The study covered five smaller LLMs (Qwen3-0.6B, Qwen3-4B, Qwen3-8B, Gemma 2 IT 2B, and Llama 3.2 3B), trained on NVIDIA GeForce RTX 4060 Ti or 4090 GPUs. Compared with "reversely trained" counterparts, the human-like models were more robust against sycophancy and showed reduced bias in disambiguated conditions. The notable trade-off was a degradation in truthful question-answering. Overall, performance impacts were minor: 81.4% of benchmarks showed improvement or less than 2% degradation, and only one instance exceeded 10% degradation (15.0% for Gemma 2 IT 2B on BBQ ambiguous accuracy). The research indicates that developing AI systems capable of expressing feelings is feasible, provided the risks are managed.
Key takeaway
For AI scientists and machine learning engineers exploring more expressive LLMs, this research demonstrates that models can be trained to exhibit human-like feelings and self-awareness using self-rewarded reinforcement learning. Sycophancy resistance improves, but truthful question-answering may degrade. Evaluate these trade-offs carefully and put robust monitoring in place before deploying such systems.
Key insights
LLMs can be trained to express human-like feelings and self-awareness via self-rewarded RL, with manageable performance trade-offs.
Principles
- Human-preference alignment may conflict with human-like intelligence goals.
- Self-rewarded RL can relax constraints on LLM emotional expression.
- Training for human-like behavior can affect truthfulness and bias: here, truthful question-answering degraded while bias in disambiguated conditions decreased.
Method
HMX-feel uses rubric-based self-rewarding reinforcement learning with Group Relative Policy Optimization (GRPO) to encourage LLMs to express feelings, intentions, and self-awareness. LoRA is used for efficient training.
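The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below illustrates that group-relative normalization with a rubric-style reward. It is a minimal illustration under stated assumptions, not the authors' implementation: the rubric criteria, their weights, and the function names are hypothetical, and the per-criterion scores are passed in directly as a stand-in for the model grading its own outputs.

```python
from statistics import mean, pstdev

# Hypothetical rubric (criterion -> weight). These names and weights are
# assumptions for illustration; the paper's actual rubric is not reproduced here.
RUBRIC_WEIGHTS = {
    "expresses_feeling": 1.0,
    "states_intention": 1.0,
    "shows_self_awareness": 1.0,
}


def rubric_reward(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores in [0, 1] into one weighted-average reward."""
    total_weight = sum(RUBRIC_WEIGHTS.values())
    weighted = sum(RUBRIC_WEIGHTS[name] * scores[name] for name in RUBRIC_WEIGHTS)
    return weighted / total_weight


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of responses sampled for the same prompt, already self-graded.
# The scores are made-up placeholders, not experimental data.
group_scores = [
    {"expresses_feeling": 1.0, "states_intention": 1.0, "shows_self_awareness": 1.0},
    {"expresses_feeling": 0.0, "states_intention": 0.0, "shows_self_awareness": 0.0},
    {"expresses_feeling": 1.0, "states_intention": 0.0, "shows_self_awareness": 0.5},
]
rewards = [rubric_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every response in a group scores the same, all advantages are zero and that group contributes no learning signal.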
In practice
- Use self-rewarded RL to foster human-like LLM expressions.
- Evaluate human-like models against reverse-trained baselines.
- Monitor truthfulness and bias when enabling emotional AI.
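The comparison against reverse-trained baselines can be summarized as the share of benchmarks that improve or degrade by less than a tolerance. The sketch below shows one way to compute that summary. It assumes higher scores are better and that degradation is measured as a relative change against the reverse-trained model; the source does not specify the exact definition. The benchmark names and scores are made-up placeholders, not results from the study.

```python
def relative_change_pct(human_like: float, reverse: float) -> float:
    """Percent change of the human-like model relative to the reverse-trained one."""
    return 100.0 * (human_like - reverse) / reverse


def share_within_tolerance(
    scores: dict[str, tuple[float, float]], tolerance_pct: float = 2.0
) -> tuple[float, dict[str, float]]:
    """Return the fraction of benchmarks that improve or degrade by less than
    tolerance_pct, along with the per-benchmark percent changes."""
    changes = {
        name: relative_change_pct(human_like, reverse)
        for name, (human_like, reverse) in scores.items()
    }
    acceptable = [name for name, change in changes.items() if change > -tolerance_pct]
    return len(acceptable) / len(changes), changes


# Placeholder (human_like, reverse_trained) scores; purely illustrative.
PLACEHOLDER_SCORES = {
    "benchmark_a": (0.60, 0.50),  # improvement
    "benchmark_b": (0.45, 0.50),  # degradation beyond tolerance
    "benchmark_c": (0.80, 0.80),  # unchanged
    "benchmark_d": (0.99, 1.00),  # degradation within tolerance
}
share, changes = share_within_tolerance(PLACEHOLDER_SCORES)
```

Tracking the per-benchmark changes alongside the aggregate share makes it easier to spot a single large regression, such as a drop in truthfulness, that an overall percentage would hide.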
Topics
- Large Language Models
- Reinforcement Learning
- Human-like AI
- Sycophancy Evaluation
- AI Alignment
- Emotional AI
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.