LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The LaRA (Layer-wise Representation Analysis) framework, published on 2026-05-28, addresses the critical problem of data contamination in reinforcement learning (RL) post-training for large language models (LLMs). While RL post-training enhances LLM reasoning, contamination can compromise generalization and evaluation reliability. Existing detection methods, which rely on output-level signals like likelihood or entropy, are ineffective for RL-trained models because RL optimizes behavior via trajectory-level rewards. LaRA introduces three complementary metrics—perturbation sensitivity, directional collapse, and local representation rigidity—measured under controlled perturbations. The framework identifies that contamination leads to progressive geometric deviations across model layers, manifesting as amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. A detection protocol aggregates these representation-level deviations, demonstrating superior performance over current output-level baselines in experiments with RL-trained reasoning models.

Key takeaway

For NLP Engineers or AI Scientists developing and deploying RL post-trained LLMs, you should integrate LaRA's layer-wise representation analysis into your model evaluation pipeline. Relying solely on output-level metrics for contamination detection is insufficient for RL-trained models. By monitoring perturbation sensitivity, directional collapse, and local rigidity across layers, you can proactively identify data contamination, thereby safeguarding model generalization and the reliability of your evaluation processes. This ensures robust and trustworthy LLM performance.

Key insights

Data contamination in RL post-trained LLMs can be reliably detected by analyzing layer-wise representation deviations.

Principles

Method

LaRA measures perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations, then aggregates these layer-wise deviations across metrics for contamination detection.

In practice

Topics

Best for: Research Scientist, AI Scientist, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.