LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

2026-05-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LaRA (Layer-wise Representation Analysis), published on 2026-05-28, is a framework for detecting data contamination in reinforcement learning (RL) post-training of large language models (LLMs). RL post-training improves LLM reasoning, but contamination can undermine generalization and make evaluation results unreliable. Existing detection methods rely on output-level signals such as likelihood or entropy, and these are ineffective for RL-trained models because RL optimizes behavior through trajectory-level rewards. LaRA instead measures three complementary metrics under controlled perturbations: perturbation sensitivity, directional collapse, and local representation rigidity. It finds that contamination produces geometric deviations that grow progressively across model layers, appearing as amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. A detection protocol aggregates these representation-level deviations and, in experiments on RL-trained reasoning models, outperforms current output-level baselines.

Key takeaway

NLP engineers and AI scientists who develop or deploy RL post-trained LLMs should consider adding LaRA's layer-wise representation analysis to their model evaluation pipeline. Output-level metrics alone are insufficient for detecting contamination in RL-trained models. Monitoring perturbation sensitivity, directional collapse, and local rigidity across layers can surface data contamination early, which helps protect model generalization and the reliability of evaluation results.

Key insights

Data contamination in RL post-trained LLMs can be detected more effectively from layer-wise representation deviations than from output-level signals.

Principles

RL-trained models require representation-level contamination detection, not output-level signals.
Contamination causes amplified perturbation sensitivity and stronger directional collapse across layers.
Contamination also enhances local representation rigidity across layers.

Method

LaRA measures perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations, then aggregates these layer-wise deviations across metrics for contamination detection.
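
As a rough illustration of what per-layer measurement under perturbation could look like, the sketch below computes three quantities from hidden states that have already been extracted. The summary does not give LaRA's formulas, so every definition here is an assumption chosen for illustration: sensitivity as relative displacement size, collapse as mean pairwise cosine similarity of displacement directions, and rigidity as uniformity of displacement magnitudes. The function name and array layout are hypothetical as well.

```python
import numpy as np


def layer_metrics(clean, perturbed, eps=1e-8):
    """Illustrative per-layer metrics (NOT the paper's definitions).

    clean:     (L, d) hidden state of the original input at each of L layers.
    perturbed: (K, L, d) hidden states for K >= 2 perturbed variants of that input.
    Returns a dict of three arrays, each of shape (L,).
    """
    delta = perturbed - clean[None, :, :]         # (K, L, d) displacement per variant
    delta_norm = np.linalg.norm(delta, axis=-1)   # (K, L)
    clean_norm = np.linalg.norm(clean, axis=-1)   # (L,)

    # Assumed "perturbation sensitivity": mean displacement size relative
    # to the size of the unperturbed representation.
    sensitivity = delta_norm.mean(axis=0) / (clean_norm + eps)

    # Assumed "directional collapse": mean pairwise cosine similarity between
    # displacement directions (1.0 means every perturbation moves the same way).
    unit = delta / (delta_norm[..., None] + eps)
    gram = np.einsum("kld,jld->lkj", unit, unit)  # (L, K, K) cosine matrices
    k = perturbed.shape[0]
    off_diag = gram.sum(axis=(1, 2)) - np.trace(gram, axis1=1, axis2=2)
    collapse = off_diag / (k * (k - 1))

    # Assumed "local rigidity": 1 / (1 + coefficient of variation) of the
    # displacement sizes, so identical magnitudes give 1.0.
    cv = delta_norm.std(axis=0) / (delta_norm.mean(axis=0) + eps)
    rigidity = 1.0 / (1.0 + cv)

    return {"sensitivity": sensitivity, "collapse": collapse, "rigidity": rigidity}
```

In this layout, `clean` and `perturbed` would come from whatever hidden-state extraction the model stack provides. How the perturbations are generated is left open here because the summary does not specify it.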

In practice

Apply LaRA's protocol to evaluate the integrity of RL post-trained reasoning models.
Use layer-wise geometric deviations as indicators for data contamination.
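
One way to turn layer-wise deviations into a single indicator is sketched below. The summary says only that the protocol aggregates deviations across layers and metrics, so the rule used here is an assumption: z-score each value against a reference set believed to be clean, then take a signed average. The function names, the reference set, and the threshold of 2.0 are all hypothetical.

```python
import numpy as np


def contamination_score(sample, reference, eps=1e-8):
    """Illustrative aggregation (NOT the paper's protocol).

    sample:    (M, L) values of M metrics at L layers for one benchmark item.
    reference: (N, M, L) the same metrics for N items assumed to be clean.
    Returns the mean z-score of the sample against the reference.
    """
    mu = reference.mean(axis=0)        # (M, L) per-metric, per-layer mean
    sigma = reference.std(axis=0)      # (M, L) per-metric, per-layer spread
    z = (sample - mu) / (sigma + eps)
    # Signed mean: the summary reports that contamination increases all
    # three metrics, so positive deviations push the score upward.
    return float(z.mean())


def flag_contaminated(samples, reference, threshold=2.0):
    """Flag items whose score exceeds a hypothetical threshold."""
    scores = np.array([contamination_score(s, reference) for s in samples])
    return scores > threshold
```

A signed mean matches the direction of the reported effects, but it weights every layer equally. Since the summary describes deviations as growing progressively across layers, weighting later layers more heavily may be worth testing. The threshold would need calibration on held-out data.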

Topics

Reinforcement Learning
Large Language Models
Data Contamination
Representation Analysis
LLM Post-training
Model Evaluation

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.