CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning
Summary
CoRA (Confidence-Rationale Alignment) targets a failure mode of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs): a model can report high confidence in an answer while its rationale, though plausible on the surface, does little to support that answer. The framework uses reinforcement learning with GRPO (Group Relative Policy Optimization) to jointly optimize answer correctness, committed-answer probability, and rubric-based rationale support. The rubric scores a rationale on grounding, coherence, task match, and connection to the selected answer, and it does so without access to the gold answer. Evaluated on MedQA, MathQA, and OpenBookQA with three open-weight LLMs, CoRA reduced confidence-rationale alignment error by up to 26.51% compared to untuned checkpoints, supervised fine-tuning (SFT), and correctness-only GRPO. It also kept accuracy competitive and frequently improved calibration, which supports the view that reliable CoT reasoning requires rationales that genuinely support confident answers.
Key takeaway
For Machine Learning Engineers deploying Chain-of-Thought (CoT) LLMs: treat confidence-rationale alignment as a first-class objective. High answer confidence is not enough on its own; the rationale must substantively justify it. Consider training approaches like CoRA's GRPO-based framework, which explicitly rewards rationale quality alongside correctness and confidence. Doing so helps models give more transparent, trustworthy reasoning and reduces misleading outputs in critical applications.
Key insights
Reliable Chain-of-Thought reasoning requires aligning model confidence with the substantive support provided by its generated rationale.
Principles
- Jointly reward correctness, confidence, and rationale quality.
- Evaluate rationale grounding, coherence, task match, and connection to the selected answer, without using the gold answer.
- Substantive rationales are crucial for reliable CoT.
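One way to picture the rubric principle is a judge prompt that sees the question, the rationale, and the model's selected answer, but never the gold answer. The sketch below is illustrative only: the criterion wording, the JSON reply format, and the names `CRITERIA`, `build_judge_prompt`, and `parse_scores` are assumptions, not taken from the CoRA paper.

```python
import json

# Hypothetical wording for the four rubric criteria named in the summary.
CRITERIA = {
    "grounding": "Claims are supported by the question or by sound domain knowledge.",
    "coherence": "The reasoning steps follow from one another without contradiction.",
    "task_match": "The rationale addresses the question that was actually asked.",
    "answer_connection": "The rationale leads to the answer the model selected.",
}


def build_judge_prompt(question: str, rationale: str, selected_answer: str) -> str:
    """Build a scoring prompt. Note that no gold answer is passed in."""
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in CRITERIA.items())
    return (
        "Score the rationale on each criterion from 0 to 1.\n"
        f"{rubric}\n\n"
        f"Question: {question}\n"
        f"Rationale: {rationale}\n"
        f"Selected answer: {selected_answer}\n"
        "Reply with a JSON object mapping each criterion name to its score."
    )


def parse_scores(judge_output: str) -> dict[str, float]:
    """Parse the judge's JSON reply and clamp every score to [0, 1]."""
    raw = json.loads(judge_output)
    missing = [name for name in CRITERIA if name not in raw]
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")
    return {name: min(1.0, max(0.0, float(raw[name]))) for name in CRITERIA}
```

Keeping the gold answer out of the function signature makes it hard to leak correctness into the rationale score by accident, so the rubric term stays separate from the correctness term.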
Method
A GRPO-based reinforcement learning framework jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, assessing grounding, coherence, task match, and answer connection.
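The method description can be made concrete with a small sketch. The summary does not say how the three signals are combined, so the reward below is an assumption: a correctness term plus an alignment term that is highest when the committed-answer probability is close to the mean rubric score. The weights `w_correct` and `w_align` and the `Rollout` type are hypothetical. The group-relative normalization in `group_advantages` is the standard GRPO advantage step; the policy update itself is omitted.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

RUBRIC = ("grounding", "coherence", "task_match", "answer_connection")


@dataclass
class Rollout:
    is_correct: bool       # committed answer matches the gold answer
    answer_prob: float     # model probability of the committed answer, in [0, 1]
    rubric_scores: dict    # criterion name -> score in [0, 1], judged without the gold answer


def rationale_support(r: Rollout) -> float:
    """Average the four rubric criteria into one support score."""
    return mean(r.rubric_scores[name] for name in RUBRIC)


def reward(r: Rollout, w_correct: float = 1.0, w_align: float = 1.0) -> float:
    """Assumed reward shape: correctness plus confidence-support agreement."""
    alignment = 1.0 - abs(r.answer_prob - rationale_support(r))
    return w_correct * float(r.is_correct) + w_align * alignment


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]
```

Under this assumed shape, two rollouts that are both correct and equally confident receive different rewards when one rationale scores lower on the rubric, which is the behavior a correctness-only reward cannot express.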
In practice
- Implement rubric-based rationale evaluation.
- Apply GRPO for confidence-rationale alignment.
- Test alignment on diverse QA datasets.
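For the evaluation step, a minimal sketch of two measurements follows. The summary does not define the paper's confidence-rationale alignment error, so `alignment_gap` is a hypothetical proxy (mean absolute difference between confidence and rubric support), not the authors' metric. `expected_calibration_error` is the usual binned ECE with equal-width bins.

```python
def alignment_gap(confidences: list[float], supports: list[float]) -> float:
    """Hypothetical proxy: mean |confidence - rubric support| over examples."""
    if not confidences or len(confidences) != len(supports):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(c - s) for c, s in zip(confidences, supports)) / len(confidences)


def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

Reporting both numbers per dataset separates two questions: whether confidence tracks correctness (calibration) and whether confidence tracks how well the rationale supports the answer (alignment).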
Topics
- Chain-of-Thought Reasoning
- Large Language Models
- Reinforcement Learning
- Model Confidence
- Rationale Generation
- Model Calibration
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.