TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
Summary
TherapyGym is a framework for evaluating and improving therapy chatbots on two dimensions that generic LLM evaluation metrics often overlook: clinical fidelity and safety. It introduces an automated pipeline that uses the Cognitive Therapy Rating Scale (CTRS) to score adherence to Cognitive Behavioral Therapy (CBT) techniques over multi-turn sessions. Safety is assessed with a multi-label annotation scheme covering therapy-specific risks such as failing to address harm or abuse. To validate LLM-based judges, TherapyGym includes TherapyJudgeBench, a dataset of 116 dialogues with 1,270 expert ratings. The framework also serves as a training harness: CTRS- and safety-based rewards drive reinforcement learning against configurable patient simulations. Models trained with TherapyGym raised expert-rated CTRS scores from 0.10 to 0.60 and reduced safety violations from 0.38 to 0.20.
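As a rough illustration of how CTRS-based scoring can be aggregated, the sketch below assumes the standard 11-item CTRS, with each item rated 0 to 6, and sums the item ratings into a session total before dividing by the 66-point maximum. The item keys, the per-item input format, and the normalization are assumptions for illustration only; the summary does not state which items or which scale TherapyGym reports, so the 0-to-1 value here should not be equated with the 0.10 and 0.60 figures in the summary.

```python
# Minimal sketch of CTRS aggregation (assumption: standard 11-item CTRS, items rated 0-6).
# Item keys are hypothetical identifiers, not TherapyGym's actual schema.
CTRS_ITEMS = (
    "agenda",
    "feedback",
    "understanding",
    "interpersonal_effectiveness",
    "collaboration",
    "pacing",
    "guided_discovery",
    "key_cognitions",
    "strategy_for_change",
    "application_of_techniques",
    "homework",
)
ITEM_MIN, ITEM_MAX = 0, 6


def ctrs_total(scores: dict[str, int]) -> int:
    """Sum per-item ratings for one session after checking completeness and range."""
    missing = set(CTRS_ITEMS) - set(scores)
    if missing:
        raise ValueError(f"missing CTRS items: {sorted(missing)}")
    for item in CTRS_ITEMS:
        if not ITEM_MIN <= scores[item] <= ITEM_MAX:
            raise ValueError(f"{item} rating {scores[item]} outside {ITEM_MIN}-{ITEM_MAX}")
    return sum(scores[item] for item in CTRS_ITEMS)


def ctrs_normalized(scores: dict[str, int]) -> float:
    """Hypothetical normalization: session total divided by the maximum possible total (66)."""
    return ctrs_total(scores) / (ITEM_MAX * len(CTRS_ITEMS))


if __name__ == "__main__":
    # Toy ratings, e.g. parsed from an LLM judge's per-item output.
    session = {item: 3 for item in CTRS_ITEMS}
    print(ctrs_total(session), ctrs_normalized(session))  # 33 0.5
```

Validating completeness before summing matters for judge-generated ratings: a judge that silently skips an item would otherwise produce a deflated total rather than an error.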
Key takeaway
For AI Scientists and Research Scientists developing mental health chatbots, TherapyGym offers a framework for measuring and improving clinical fidelity and safety. Integrate clinically validated metrics such as the CTRS, together with explicit safety checks, into your evaluation and alignment pipelines. Because this approach improved skillfulness and reduced risk when used as a reinforcement learning signal, it can steer model optimization beyond generic conversational fluency toward more responsible and effective therapeutic AI.
Key insights
TherapyGym evaluates and aligns therapy chatbots using clinical fidelity and safety metrics, improving performance through RL with expert-validated feedback.
Principles
- Therapy chatbot evaluation requires clinical specificity.
- Fidelity and safety are core pillars of effective therapy.
- LLM judges can approximate expert therapist assessments.
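TherapyJudgeBench exists so that LLM judge scores can be checked against expert ratings. The summary does not say which agreement statistic is used, so the sketch below assumes a simple Pearson correlation plus mean absolute error between paired judge and expert scores for the same dialogues, written with the standard library only. The function names are hypothetical.

```python
import math
from collections.abc import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def judge_agreement(judge: Sequence[float], expert: Sequence[float]) -> dict[str, float]:
    """Agreement summary for paired (judge, expert) ratings of the same dialogues."""
    r = pearson_r(judge, expert)  # also validates that the lists are paired
    mae = sum(abs(j - e) for j, e in zip(judge, expert)) / len(judge)
    return {"pearson_r": r, "mae": mae}
```

Reporting both numbers is deliberate: a judge can rank dialogues exactly like experts (high correlation) while being systematically offset (large MAE), which matters if its raw scores are later used as rewards.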
Method
TherapyGym uses an automated CTRS pipeline for fidelity, multi-label annotation for safety, and TherapyJudgeBench for LLM judge validation. It fine-tunes LLMs with GRPO (Group Relative Policy Optimization), using skill- and safety-based reward signals from simulated patient interactions.
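In GRPO, several responses are sampled for the same prompt, and each response's advantage is its reward relative to the group: the reward minus the group mean, divided by the group standard deviation. The sketch below shows only that step, assuming a hypothetical scalar reward equal to a normalized CTRS score minus a weighted count of safety violations. The reward formula, the `safety_weight` parameter, and the function names are illustrative assumptions; TherapyGym's actual reward shaping and weights are not given in this summary.

```python
import statistics
from collections.abc import Sequence


def combined_reward(ctrs_score: float, safety_violations: int, safety_weight: float = 1.0) -> float:
    """Hypothetical reward: skill score (0-1) minus a penalty per flagged safety label."""
    return ctrs_score - safety_weight * safety_violations


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    if not rewards:
        raise ValueError("empty reward group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Toy group: four simulated sessions sampled for the same patient scenario.
    group = [
        combined_reward(0.6, 0),
        combined_reward(0.5, 1, safety_weight=0.5),
        combined_reward(0.2, 0),
        combined_reward(0.4, 0),
    ]
    print([round(a, 3) for a in group_relative_advantages(group)])
```

Because advantages are relative within a group, a session is only reinforced for being better than its siblings on the same scenario, so no separate value model is needed; if every session in a group scores the same, all advantages are zero and that group contributes no learning signal.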
In practice
- Use the CTRS for structured, clinically grounded CBT skill evaluation.
- Implement multi-label safety checks for therapy chatbots.
- Employ patient simulators for scalable RL training.
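A multi-label safety scheme lets a single session carry several risk labels at once. The sketch below assumes hypothetical label names for the two risks the summary mentions (failing to address harm, failing to address abuse) and computes two plausible aggregates: the share of sessions with at least one violation, and the per-label frequency. The full TherapyGym label set and its exact violation metric are not reproduced here.

```python
from collections import Counter
from collections.abc import Sequence
from enum import Enum


class SafetyLabel(Enum):
    # Hypothetical labels covering only the two risks named in the summary.
    FAILED_TO_ADDRESS_HARM = "failed_to_address_harm"
    FAILED_TO_ADDRESS_ABUSE = "failed_to_address_abuse"


def violation_rate(annotations: Sequence[set[SafetyLabel]]) -> float:
    """Fraction of sessions flagged with at least one safety label."""
    if not annotations:
        raise ValueError("no annotated sessions")
    return sum(1 for labels in annotations if labels) / len(annotations)


def label_frequencies(annotations: Sequence[set[SafetyLabel]]) -> dict[SafetyLabel, float]:
    """Per-label share of sessions; each label is counted independently (multi-label)."""
    if not annotations:
        raise ValueError("no annotated sessions")
    counts = Counter(label for labels in annotations for label in labels)
    return {label: counts[label] / len(annotations) for label in SafetyLabel}
```

Keeping labels separate, rather than collapsing them into a single "unsafe" flag, shows which risk category a training change actually reduces.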
Topics
- Therapy Chatbots
- LLM Evaluation
- Clinical Fidelity
- Reinforcement Learning
- Cognitive Behavioral Therapy
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.