LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
Summary
The LaRA (Layer-wise Representation Analysis) framework, published on 2026-05-28, addresses the critical problem of data contamination in reinforcement learning (RL) post-training for large language models (LLMs). While RL post-training enhances LLM reasoning, contamination can compromise generalization and evaluation reliability. Existing detection methods, which rely on output-level signals like likelihood or entropy, are ineffective for RL-trained models because RL optimizes behavior via trajectory-level rewards. LaRA introduces three complementary metrics—perturbation sensitivity, directional collapse, and local representation rigidity—measured under controlled perturbations. The framework identifies that contamination leads to progressive geometric deviations across model layers, manifesting as amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. A detection protocol aggregates these representation-level deviations, demonstrating superior performance over current output-level baselines in experiments with RL-trained reasoning models.
Key takeaway
For NLP Engineers or AI Scientists developing and deploying RL post-trained LLMs, you should integrate LaRA's layer-wise representation analysis into your model evaluation pipeline. Relying solely on output-level metrics for contamination detection is insufficient for RL-trained models. By monitoring perturbation sensitivity, directional collapse, and local rigidity across layers, you can proactively identify data contamination, thereby safeguarding model generalization and the reliability of your evaluation processes. This ensures robust and trustworthy LLM performance.
Key insights
Data contamination in RL post-trained LLMs can be reliably detected by analyzing layer-wise representation deviations.
Principles
- RL-trained models require representation-level contamination detection, not output-level signals.
- Contamination causes amplified perturbation sensitivity and stronger directional collapse across layers.
- Local representation rigidity is enhanced across layers due to contamination.
Method
LaRA measures perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations, then aggregates these layer-wise deviations across metrics for contamination detection.
In practice
- Apply LaRA's protocol to evaluate the integrity of RL post-trained reasoning models.
- Use layer-wise geometric deviations as indicators for data contamination.
Topics
- Reinforcement Learning
- Large Language Models
- Data Contamination
- Representation Analysis
- LLM Post-training
- Model Evaluation
Best for: Research Scientist, AI Scientist, NLP Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.