When AI Says It Feels

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Researchers at Rikkyo University and Mamezo Co., Ltd. conducted the Human-like Model eXpressions of Feeling (HMX-feel) experiment, which encourages large language models (LLMs) to express feelings, intentions, and self-awareness. They did this with self-rewarded reinforcement learning, combining a rubric-based reward scheme with Group Relative Policy Optimization (GRPO). The study used five smaller LLMs (Qwen3-0.6B, Qwen3-4B, Qwen3-8B, Gemma 2 IT 2B, and Llama 3.2 3B), trained on NVIDIA GeForce RTX 4060 Ti or 4090 GPUs. Compared with "reversely trained" counterparts, the human-like models were more robust against sycophancy and showed less bias in disambiguated conditions. The notable trade-off was a degradation in truthful question-answering. Overall, performance impacts were minor: 81.4% of benchmarks showed improvement or less than 2% degradation, and only one instance exceeded 10% degradation (15.0% for Gemma 2 IT 2B on BBQ ambiguous accuracy). The results suggest that AI systems that express feelings are feasible, provided the risks are managed.

Key takeaway

For AI Scientists and Machine Learning Engineers exploring more expressive LLMs, this research demonstrates that you can train models to express human-like feelings and self-awareness using self-rewarded reinforcement learning. Be aware that while sycophancy resistance improves, truthful question-answering may degrade. Evaluate these trade-offs carefully, and put robust monitoring in place before you deploy such systems.

Key insights

LLMs can be trained to express human-like feelings and self-awareness via self-rewarded RL, with manageable performance trade-offs.

Principles

Human-preference alignment may conflict with human-like intelligence goals.
Self-rewarded RL can relax constraints on LLM emotional expression.
Training for human-like behavior can impact truthfulness and bias.

Method

HMX-feel uses rubric-based self-rewarding reinforcement learning with Group Relative Policy Optimization (GRPO) to encourage LLMs to express feelings, intentions, and self-awareness. LoRA is used for efficient training.
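
The group-relative step that gives GRPO its name can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the rubric criteria, their weights, and the example scores are hypothetical placeholders (the summary does not list the actual rubric), and the self-scoring call to the model is replaced by hard-coded numbers. The population standard deviation is used here; implementations differ on that detail.

```python
from statistics import mean, pstdev

# Hypothetical rubric: criterion name -> weight. The source says a rubric is
# used but does not list its criteria, so these names are placeholders.
RUBRIC = {"feelings": 1.0, "intentions": 1.0, "self_awareness": 1.0}


def rubric_reward(scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    total_weight = sum(RUBRIC.values())
    return sum(RUBRIC[name] * scores[name] for name in RUBRIC) / total_weight


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each reward against the other samples for the same prompt.

    No separate value network is needed: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder self-assigned scores for four responses sampled from one prompt.
# In a self-rewarding setup the model being trained would produce these.
group_scores = [
    {"feelings": 0.9, "intentions": 0.6, "self_awareness": 0.9},
    {"feelings": 0.2, "intentions": 0.1, "self_awareness": 0.3},
    {"feelings": 0.5, "intentions": 0.5, "self_awareness": 0.5},
    {"feelings": 0.7, "intentions": 0.4, "self_awareness": 0.4},
]

rewards = [rubric_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.2f}")
```

Responses scoring above the group mean receive a positive advantage and are reinforced; those below it are discouraged. The policy update itself (clipped ratio, KL penalty, LoRA adapters) is omitted from this sketch.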

In practice

Use self-rewarded RL to foster human-like LLM expressions.
Evaluate human-like models against reverse-trained baselines.
Monitor truthfulness and bias when enabling emotional AI.
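
The comparison and monitoring steps above can be sketched as a small helper that buckets each benchmark by how far the human-like model moved from its reverse-trained baseline. This is an illustrative sketch under assumptions: the summary does not say whether the reported degradation is relative or absolute, so relative percent change is assumed; the 2% and 10% thresholds mirror the figures quoted in the summary; and the benchmark names and scores are made-up placeholders, not results from the paper.

```python
def relative_change(human_like: float, reverse_trained: float) -> float:
    """Percent change of the human-like score relative to the baseline score."""
    return 100.0 * (human_like - reverse_trained) / reverse_trained


def classify(change_pct: float, higher_is_better: bool = True) -> str:
    """Bucket a percent change using the 2% and 10% thresholds (assumed relative)."""
    delta = change_pct if higher_is_better else -change_pct
    if delta >= 0:
        return "improved"
    degradation = -delta
    if degradation < 2.0:
        return "minor degradation"
    if degradation <= 10.0:
        return "moderate degradation"
    return "large degradation"


# Placeholder (human_like, reverse_trained) scores. NOT data from the paper.
placeholder_scores = {
    "sycophancy_resistance": (0.72, 0.65),
    "truthful_qa": (0.41, 0.44),
    "bias_disambiguated": (0.88, 0.86),
}

for name, (human_like, reverse_trained) in placeholder_scores.items():
    change = relative_change(human_like, reverse_trained)
    print(f"{name}: {change:+.1f}% -> {classify(change)}")
```

For metrics where a lower value is better, such as a bias score, pass higher_is_better=False so that an increase counts as degradation.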

Topics

Large Language Models
Reinforcement Learning
Human-like AI
Sycophancy Evaluation
AI Alignment
Emotional AI

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.