Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset
Summary
The paper "Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset" identifies a supervision design flaw in existing formality transfer benchmarks such as GYAFC. These benchmarks frame formality transfer as a symmetric, bidirectional task, but their binary human rewrites capture relative stylistic shifts rather than absolute human perceptions of formality. As a result, models learn to produce "pseudo-formal" language that satisfies benchmark labels but lacks genuine formality. The authors quantify this misalignment and propose a framework that treats formality as a three-level graded dimension (informal, casual, and formal), with "casual" serving as an explicit intermediate state. On this basis they introduce 3LF, a dataset that provides parallel supervision across the three levels. Training on 3LF substantially reduces informal-to-formal failures and improves alignment with human perception, even though 3LF is smaller than GYAFC; for instance, GPT-4.1-nano's F1 score in the informal-to-formal direction rises from 0.06 to 0.88.
Key takeaway
NLP engineers developing controllable text generation systems, especially for formality transfer, should critically assess their benchmark's supervision design. Relying on binary formal/informal labels can lead models to generate pseudo-formal outputs. Consider adopting a graded formality spectrum with an intermediate "casual" state and building datasets along the lines of 3LF. In the paper, this approach improves human alignment and reduces informal-to-formal generation failures: after training on 3LF, GPT-4.1-nano's F1 score in the informal-to-formal direction rises from 0.06 to 0.88.
Key insights
Existing formality transfer benchmarks misalign supervision, leading models to generate pseudo-formal text; a graded "casual" anchor resolves this.
Principles
- Formality is a graded dimension, not a binary one.
- Supervision design shapes stylistic alignment.
- Intermediate states clarify supervision signals.
Method
Reconceptualize formality as three graded levels: informal, casual, and formal. Introduce the 3LF dataset, which provides parallel supervision across these three levels, and train models on it to improve alignment with human perception.
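To make the idea of three-level parallel supervision concrete, here is a minimal Python sketch of how one parallel record might be represented and expanded into directed training pairs. The class name `ThreeLevelRecord`, the helper `directed_pairs`, the field layout, and the example sentences are all hypothetical illustrations; the actual 3LF schema and the set of transfer directions the authors train on are not described in this summary.

```python
from dataclasses import dataclass
from itertools import permutations

# The three graded levels named in the summary.
LEVELS = ("informal", "casual", "formal")


@dataclass(frozen=True)
class ThreeLevelRecord:
    """One parallel example: the same content at each formality level.

    Hypothetical layout, not the published 3LF schema.
    """
    informal: str
    casual: str
    formal: str


def directed_pairs(record):
    """Expand a record into (source_level, target_level, source_text, target_text)
    tuples for every ordered pair of distinct levels (assumed, not confirmed)."""
    return [
        (src, tgt, getattr(record, src), getattr(record, tgt))
        for src, tgt in permutations(LEVELS, 2)
    ]


# Invented example sentences, for illustration only.
example = ThreeLevelRecord(
    informal="gonna be late, sry",
    casual="I'm going to be late, sorry.",
    formal="I apologize, but I will be arriving late.",
)
pairs = directed_pairs(example)
```

With three levels, each record yields six directed pairs, and the "casual" text appears as both a source and a target, which is one way an intermediate anchor could separate informal-to-casual shifts from casual-to-formal ones.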
In practice
- Re-evaluate existing benchmark labels for misalignment.
- Design datasets with graded stylistic dimensions.
- Use "casual" as an intermediate anchor for clarity.
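As a rough illustration of the first point, the sketch below compares a benchmark's labels with human-perceived formality levels and reports how often they disagree. This is not the paper's quantification procedure; the function name `misalignment_report` and the toy labels are assumptions made for the example, and the perceived levels would in practice come from your own annotation pass.

```python
from collections import Counter


def misalignment_report(benchmark_labels, perceived_levels):
    """Return (mismatch_rate, mismatch_counts) for two equal-length label lists.

    mismatch_counts maps (benchmark_label, perceived_level) to a count.
    Hypothetical helper, not taken from the paper.
    """
    if len(benchmark_labels) != len(perceived_levels):
        raise ValueError("label lists must have the same length")
    total = len(benchmark_labels)
    mismatches = Counter(
        (b, p) for b, p in zip(benchmark_labels, perceived_levels) if b != p
    )
    rate = sum(mismatches.values()) / total if total else 0.0
    return rate, mismatches


# Toy data: two rewrites labeled "formal" are perceived as merely "casual".
benchmark = ["formal", "formal", "informal", "formal"]
perceived = ["casual", "formal", "informal", "casual"]
rate, mismatches = misalignment_report(benchmark, perceived)
```

A high count for the `("formal", "casual")` pair would suggest that the benchmark's "formal" side is really capturing a relative shift rather than text people perceive as formal.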
Topics
- Formality Transfer
- Text Generation
- Dataset Design
- Supervision Misalignment
- Natural Language Processing
- GPT-4.1-nano
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.