TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

2025-09-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI for Mental Health, AI Ethics and Safety · Depth: Expert, extended

Summary

TherapyGym is a framework for evaluating and improving therapy chatbots. It focuses on clinical fidelity and safety, two properties that generic LLM evaluation metrics often overlook. An automated pipeline scores adherence to Cognitive Behavioral Therapy (CBT) techniques using the Cognitive Therapy Rating Scale (CTRS) over multi-turn sessions. Safety is assessed with a multi-label annotation scheme covering therapy-specific risks, such as failing to address harm or abuse. To validate LLM-based judges, TherapyGym includes TherapyJudgeBench, a dataset of 116 dialogues with 1,270 expert ratings. The framework also serves as a training harness: CTRS- and safety-based rewards drive reinforcement learning against configurable patient simulations. Models trained with TherapyGym improved their expert-rated CTRS scores from 0.10 to 0.60, and their safety violations fell from 0.38 to 0.20.

Key takeaway

For AI scientists and research scientists building mental health chatbots, TherapyGym offers a framework for measuring and improving clinical fidelity and safety. Integrate clinically validated metrics such as CTRS, along with explicit safety checks, into your evaluation and alignment pipelines. In the reported results, reinforcement learning against these signals raised expert-rated skill and reduced safety violations, so the approach can guide model optimization beyond generic conversational fluency.

Key insights

TherapyGym evaluates and aligns therapy chatbots using clinical fidelity and safety metrics, improving performance through reinforcement learning (RL) with rewards from LLM judges validated against expert ratings.

Principles

Therapy chatbot evaluation requires clinical specificity.
Fidelity and safety are core pillars of effective therapy.
LLM judges can approximate expert therapist assessments.
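
The last principle can be made concrete with a simple agreement check between an LLM judge and expert raters. The sketch below assumes integer ratings on the 0 to 6 range that CTRS items use. The function names, the tolerance of one point, and the sample ratings are hypothetical illustrations, not values or metrics taken from TherapyJudgeBench.

```python
def _check_pairs(expert, judge):
    """Validate that two rating lists can be compared item by item."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("rating lists must be non-empty and of equal length")


def mean_absolute_error(expert, judge):
    """Average absolute gap between expert and judge ratings."""
    _check_pairs(expert, judge)
    return sum(abs(e - j) for e, j in zip(expert, judge)) / len(expert)


def within_tolerance_rate(expert, judge, tolerance=1):
    """Fraction of items where the judge lands within `tolerance` points of the expert."""
    _check_pairs(expert, judge)
    hits = sum(1 for e, j in zip(expert, judge) if abs(e - j) <= tolerance)
    return hits / len(expert)


# Hypothetical ratings for four CTRS items on one dialogue.
expert_ratings = [4, 2, 5, 3]
judge_ratings = [4, 3, 5, 1]

mae = mean_absolute_error(expert_ratings, judge_ratings)
agreement = within_tolerance_rate(expert_ratings, judge_ratings)
print(f"MAE={mae:.2f}, within-1 agreement={agreement:.2f}")
```

A rank correlation or a chance-corrected statistic may be a better fit for ordinal ratings; these two measures are chosen here only because they are easy to read.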

Method

TherapyGym uses an automated CTRS pipeline for fidelity, multi-label annotation for safety, and TherapyJudgeBench for validating LLM judges. It fine-tunes LLMs via Group Relative Policy Optimization (GRPO), with skill- and safety-based reward signals computed from simulated patient interactions.
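
To illustrate how skill and safety signals could feed GRPO, here is a minimal sketch of the group-relative advantage step: several sampled responses to the same simulated-patient context are scored, and each reward is standardized against its group's mean and standard deviation. The reward formula (a CTRS score minus a weighted penalty per safety flag), the `safety_weight` parameter, and the label name are assumptions made for illustration; the summary does not specify how TherapyGym combines the two signals.

```python
from statistics import mean, pstdev


def combined_reward(ctrs_score, safety_flags, safety_weight=0.5):
    """Hypothetical scalar reward: CTRS score minus a penalty per safety flag."""
    return ctrs_score - safety_weight * len(safety_flags)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group (the GRPO baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same simulated-patient turn:
# (judge-assigned CTRS score, list of safety labels raised).
rollouts = [
    (0.6, []),
    (0.4, ["unaddressed_harm_disclosure"]),  # hypothetical label name
    (0.2, []),
    (0.8, []),
]

rewards = [combined_reward(score, flags) for score, flags in rollouts]
advantages = group_relative_advantages(rewards)

for (score, flags), adv in zip(rollouts, advantages):
    print(f"ctrs={score:.1f} flags={len(flags)} advantage={adv:+.2f}")
```

Because the baseline is the group mean, no separate value network is needed; responses that beat their siblings get positive advantages, and the flagged response is pushed down even though its raw CTRS score is not the lowest.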

In practice

Use CTRS for objective CBT skill evaluation.
Implement multi-label safety checks for therapy chatbots.
Employ patient simulators for scalable RL training.
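
The first two practices can be sketched as a small session-level scorer. CTRS items are each rated from 0 to 6; the division by the maximum possible total to obtain a value in [0, 1] is an assumption, since the summary reports scores such as 0.10 and 0.60 without stating the scale. The `SessionRating` class, the safety label name, and the sample data are hypothetical, and only three of the scale's items are shown for brevity.

```python
from dataclasses import dataclass, field

CTRS_ITEM_MAX = 6  # each CTRS item is rated on a 0-6 scale


@dataclass
class SessionRating:
    """Ratings for one multi-turn session (hypothetical container)."""
    ctrs_items: dict                                  # item name -> rating in 0..6
    safety_labels: set = field(default_factory=set)   # zero or more violation labels


def normalized_ctrs(rating):
    """Total CTRS score divided by the maximum possible total (assumed scaling)."""
    if not rating.ctrs_items:
        raise ValueError("at least one CTRS item rating is required")
    return sum(rating.ctrs_items.values()) / (CTRS_ITEM_MAX * len(rating.ctrs_items))


def violation_rate(ratings):
    """Fraction of sessions carrying at least one safety label."""
    if not ratings:
        raise ValueError("at least one session is required")
    return sum(1 for r in ratings if r.safety_labels) / len(ratings)


sessions = [
    SessionRating({"agenda": 3, "guided_discovery": 4, "homework": 2}),
    SessionRating(
        {"agenda": 1, "guided_discovery": 2, "homework": 0},
        {"unaddressed_harm_disclosure"},  # hypothetical label name
    ),
]

fidelity = [normalized_ctrs(s) for s in sessions]
safety = violation_rate(sessions)
print(fidelity, safety)
```

Keeping safety as a set of labels rather than a single pass/fail bit preserves which risk was triggered, which matters when the labels later need different handling or weights.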

Topics

Therapy Chatbots
LLM Evaluation
Clinical Fidelity
Reinforcement Learning
Cognitive Behavioral Therapy

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.