Interpretable Discriminative Text Representations via Agreement and Label Disentanglement
Summary
The paper proposes a new operational criterion for interpretable discriminative text representations, built on two requirements: conceptual clarity and label disentanglement. Conceptual clarity is measured by chance-adjusted agreement (Cohen's κ) between independent annotators applying the same feature definition; label disentanglement requires that a feature not merely paraphrase the prediction target. The criterion is instantiated in LLM-assisted Feature Discovery (LFD), an iterative method in which a proposer LLM suggests lexical and semantic features from contrastive text pairs and an independent examiner LLM screens the candidates, retaining only those with cross-LLM Cohen's κ ≥ 0.70. Surviving features are then selected by residual held-out predictive gain. Across ten text-classification tasks spanning seven corpora, LFD achieved predictive performance comparable to a strong Text Bottleneck Model (TBM) baseline while yielding substantially clearer and less label-entangled features. Human audits with 232 raters further supported these results: LFD features showed higher human-human and human-LLM agreement and less label leakage than baseline concepts.
Key takeaway
Machine learning engineers building interpretable text classifiers should prioritize features that show both conceptual clarity and label disentanglement. Use a two-stage LLM process: one LLM proposes features, and a separate, independent LLM applies each definition, keeping only features whose cross-LLM Cohen's κ is at least 0.70. This helps ensure that features are reliably measurable and distinct from the target label, which makes the model easier to audit and trust.
Key insights
Interpretable text features require both conceptual clarity, measured by inter-annotator agreement, and label disentanglement.
Principles
- Conceptual clarity requires chance-adjusted inter-annotator agreement.
- Features must be distinct from the target label (label disentanglement).
- Cross-LLM agreement screens for feature reliability.
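As an illustration of the agreement principle, here is a minimal sketch of Cohen's κ for two annotators labelling the same items. The function name and the toy annotations are made up for this example; nothing here is taken from the paper's code.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotation lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        # Assumption for this sketch: report 0.0 so such a feature fails any threshold.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary annotations of one feature on ten texts.
proposer = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
examiner = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(proposer, examiner)  # 9/10 observed, 0.5 expected by chance
```

Here the two annotators agree on 9 of 10 items and chance agreement is 0.5, so κ = (0.9 - 0.5) / (1 - 0.5) = 0.8, which would clear a 0.70 threshold.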
Method
LFD iterates: a proposer LLM suggests candidate features from contrastive text pairs, and an independent examiner LLM screens them, keeping only candidates with cross-LLM Cohen's κ ≥ 0.70. Surviving features are then selected by residual held-out predictive gain.
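The screen-then-select flow can be sketched roughly as follows. This is not the paper's implementation: the summary does not say how residual gain is computed, so the sketch assumes greedy forward selection on held-out accuracy, and it swaps in a toy majority-vote lookup classifier for whatever model the authors use. All function names and data shapes are hypothetical.

```python
from collections import Counter, defaultdict

KAPPA_THRESHOLD = 0.70  # threshold quoted in the summary


def cohens_kappa(a, b):
    """Chance-adjusted agreement between two equal-length label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    # Assumption: when kappa is undefined (p_e == 1), score 0.0 so the feature is rejected.
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def screen(candidates):
    """Keep features whose proposer/examiner annotations agree well enough.

    candidates: dict mapping feature name -> (proposer_labels, examiner_labels).
    """
    return [name for name, (p, e) in candidates.items()
            if cohens_kappa(p, e) >= KAPPA_THRESHOLD]


def heldout_accuracy(selected, train, heldout):
    """Toy stand-in classifier: majority training label per feature-value pattern.

    train / heldout: lists of (feature_dict, label) pairs.
    """
    default = Counter(y for _, y in train).most_common(1)[0][0]
    table = defaultdict(Counter)
    for x, y in train:
        table[tuple(x[f] for f in selected)][y] += 1
    correct = 0
    for x, y in heldout:
        key = tuple(x[f] for f in selected)
        pred = table[key].most_common(1)[0][0] if key in table else default
        correct += pred == y
    return correct / len(heldout)


def greedy_select(names, train, heldout, min_gain=1e-9):
    """Add, one at a time, the feature with the largest held-out gain over the current set."""
    selected, remaining = [], list(names)
    best = heldout_accuracy(selected, train, heldout)
    while remaining:
        scores = {f: heldout_accuracy(selected + [f], train, heldout) for f in remaining}
        top = max(scores, key=scores.get)
        if scores[top] - best <= min_gain:
            break  # no remaining feature adds predictive value beyond those chosen
        selected.append(top)
        remaining.remove(top)
        best = scores[top]
    return selected, best
```

In use, `screen` would receive the proposer's and examiner's annotations for each candidate, and `greedy_select` would receive only the names that survive. A feature that is clear but redundant with those already chosen adds no held-out gain and is left out.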
In practice
- Use cross-LLM κ to validate feature definitions.
- Design LLM pipelines to separate feature proposal and examination.
- Employ contrastive examples for disentangled feature discovery.
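One way to prepare contrastive examples is sketched below. The summary does not describe how LFD forms its pairs, so this assumes the simplest reading: each pair holds one text from each of two different classes. The function name and data layout are hypothetical.

```python
import random
from collections import defaultdict


def sample_contrastive_pairs(texts, labels, n_pairs, seed=0):
    """Sample pairs of (text, label) items whose labels differ.

    Assumption: a "contrastive pair" is one text from each of two distinct classes.
    """
    rng = random.Random(seed)  # local RNG so sampling is reproducible
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    classes = sorted(by_label)
    if len(classes) < 2:
        raise ValueError("need at least two classes to form contrastive pairs")
    pairs = []
    for _ in range(n_pairs):
        label_a, label_b = rng.sample(classes, 2)  # two distinct classes
        pairs.append(((rng.choice(by_label[label_a]), label_a),
                      (rng.choice(by_label[label_b]), label_b)))
    return pairs


# Toy data for illustration only.
texts = ["refund please", "love this product", "item arrived broken", "great service"]
labels = ["complaint", "praise", "complaint", "praise"]
pairs = sample_contrastive_pairs(texts, labels, n_pairs=3)
```

Each pair would then be shown to the proposer LLM with a request to name features that differ between the two texts. The prompt wording is left out here because the summary does not specify it.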
Topics
- Interpretable AI
- Text Classification
- Large Language Models
- Feature Discovery
- Cohen's Kappa
- Label Disentanglement
Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.