Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

2026-05-21 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

The paper proposes an operational criterion for interpretable discriminative text representations built on two properties: conceptual clarity and label disentanglement. Conceptual clarity is measured by chance-adjusted agreement (Cohen's κ) between independent annotators applying the same feature definition, while label disentanglement requires that a feature not merely paraphrase the prediction target. The criterion is instantiated in LLM-assisted Feature Discovery (LFD), an iterative method in which a proposer LLM suggests lexical and semantic features from contrastive text pairs. An independent examiner LLM then screens these candidates, retaining only those with cross-LLM Cohen's κ ≥ 0.70. The surviving features are selected by residual held-out predictive gain. Across ten text-classification tasks spanning seven corpora, LFD achieved predictive performance comparable to a strong Text Bottleneck Model (TBM) baseline while yielding substantially clearer and less label-entangled features. Human audits involving 232 raters further supported these results: LFD features showed higher human-human and human-LLM agreement and less label leakage than the baseline concepts.

Key takeaway

Machine learning engineers building interpretable text classifiers should prioritize features that demonstrate both conceptual clarity and label disentanglement. In practice this means a two-stage LLM process: one LLM proposes features, and a separate, independent LLM validates their definitions, keeping only those with cross-LLM Cohen's κ ≥ 0.70. Features that pass are more reliably measurable and more distinct from the target label, which improves model auditability and trustworthiness.

Key insights

Interpretable text features require both conceptual clarity, measured by inter-annotator agreement, and label disentanglement.

Principles

Conceptual clarity requires chance-adjusted inter-annotator agreement.
Features must be distinct from the target label (label disentanglement).
Cross-LLM agreement screens for feature reliability.
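
For readers who want the agreement statistic spelled out, the standard definition of Cohen's κ for two annotators A and B is given below, followed by a worked example. The counts in the example are hypothetical and chosen only to illustrate the arithmetic; they do not come from the paper.

```latex
% Cohen's kappa for two annotators labelling the same N texts
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[a_i = b_i],
\qquad
p_e = \sum_{k} p_{A}(k)\, p_{B}(k)

% Hypothetical example: N = 100 texts, binary feature
% both "present": 40, both "absent": 45, A only: 10, B only: 5
p_o = \frac{40 + 45}{100} = 0.85, \qquad
p_e = (0.50)(0.45) + (0.50)(0.55) = 0.50, \qquad
\kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70
```

Here 85% raw agreement corresponds to κ = 0.70, because half of the agreement would be expected by chance given the annotators' marginal rates. This is why a chance-adjusted measure is a stricter screen than raw agreement.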

Method

LFD iteratively proposes features from contrastive text pairs using a proposer LLM. An independent examiner LLM screens candidates via cross-LLM Cohen's κ ≥ 0.70. Features are then selected by residual predictive gain.
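
The selection step can be sketched as a greedy forward search: at each round, add the candidate whose inclusion most improves held-out performance over the features already chosen, and stop when no candidate adds enough. This is a minimal sketch under stated assumptions, not the paper's implementation. The majority-vote lookup classifier, the accuracy metric, the `min_gain` stopping rule, and the feature names are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict


def fit_table(rows, labels, feats):
    """Majority-label lookup keyed by the values of the chosen features."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        votes[tuple(row[f] for f in feats)][y] += 1
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen keys
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return table, default


def heldout_accuracy(train, held, feats):
    """Fit on the training split, score on the held-out split."""
    (x_tr, y_tr), (x_ho, y_ho) = train, held
    table, default = fit_table(x_tr, y_tr, feats)
    preds = [table.get(tuple(r[f] for f in feats), default) for r in x_ho]
    return sum(p == y for p, y in zip(preds, y_ho)) / len(y_ho)


def greedy_select(train, held, candidates, min_gain=0.01):
    """Add features one at a time by residual held-out gain (assumed rule)."""
    selected = []
    score = heldout_accuracy(train, held, selected)  # label-prior baseline
    remaining = list(candidates)
    while remaining:
        scores = {f: heldout_accuracy(train, held, selected + [f])
                  for f in remaining}
        best = max(remaining, key=lambda f: scores[f])
        if scores[best] - score < min_gain:
            break  # nothing adds information beyond the selected set
        selected.append(best)
        remaining.remove(best)
        score = scores[best]
    return selected, score


def make_split(rows):
    """Build toy binary feature rows; the second feature duplicates the first."""
    x = [{"mentions_refund": a, "refund_word_present": a, "uses_all_caps": b}
         for a, b, _ in rows]
    y = [label for _, _, label in rows]
    return x, y


# Toy data: (mentions_refund, uses_all_caps, label)
train = make_split([(0, 0, 0)] * 3 + [(0, 1, 1)] + [(1, 0, 1)] * 2 + [(1, 1, 1)] * 2)
held = make_split([(0, 0, 0)] * 2 + [(0, 1, 1), (1, 0, 1), (1, 1, 1)])

selected, score = greedy_select(
    train, held, ["mentions_refund", "refund_word_present", "uses_all_caps"]
)
print(selected, score)
```

On this toy data the duplicated feature is never selected: once `mentions_refund` is in the set, its copy contributes no residual gain, which is the behaviour the residual criterion is meant to produce.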

In practice

Use cross-LLM κ to validate feature definitions.
Design LLM pipelines to separate feature proposal and examination.
Employ contrastive examples for disentangled feature discovery.
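
The separation of proposal and examination described above can be sketched as two independent annotators applying the same feature definition to the same texts, with the definition kept only if their Cohen's κ reaches 0.70. The following is an illustrative sketch, not the paper's code. The two `*_stub` functions are rule-based stand-ins for real LLM calls, and the feature definitions, texts, and the choice to score an undefined κ as 0.0 are all assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List

# An annotator maps (feature definition, text) to 0 or 1.
Annotator = Callable[[str, str], int]


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Chance-adjusted agreement between two label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both annotators gave one identical label everywhere, so kappa is
        # undefined. Scoring it 0.0 (an assumption) makes it fail the screen.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def screen(definitions: List[str], texts: List[str],
           proposer: Annotator, examiner: Annotator,
           threshold: float = 0.70) -> Dict[str, float]:
    """Keep definitions on which the two annotators agree at kappa >= threshold."""
    kept = {}
    for definition in definitions:
        a = [proposer(definition, t) for t in texts]
        b = [examiner(definition, t) for t in texts]
        kappa = cohen_kappa(a, b)
        if kappa >= threshold:
            kept[definition] = kappa
    return kept


def proposer_stub(definition: str, text: str) -> int:
    """Stand-in for the proposer LLM's annotation call (hypothetical)."""
    if definition == "contains a question mark":
        return int("?" in text)
    return int("not" in text.lower())  # one reading of "sounds negative"


def examiner_stub(definition: str, text: str) -> int:
    """Stand-in for the independent examiner LLM (hypothetical)."""
    if definition == "contains a question mark":
        return int("?" in text)
    return int("bad" in text.lower())  # a different reading of the same wording


texts = [
    "Where is my order?",
    "This is not what I wanted.",
    "Bad packaging, but fast delivery.",
    "Can I get a refund?",
    "Great product.",
    "Not bad at all.",
]
definitions = ["contains a question mark", "sounds negative"]

kept = screen(definitions, texts, proposer_stub, examiner_stub)
print(kept)
```

The crisp definition survives because both annotators apply it identically, while the vague one is dropped because each annotator reads it differently. Keeping the examiner independent of the proposer is what lets disagreement expose an ambiguous definition.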

Topics

Interpretable AI
Text Classification
Large Language Models
Feature Discovery
Cohen's Kappa
Label Disentanglement

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.