The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage
Summary
The BD-LSC and ST-WSD datasets were introduced to improve benchmarking for lexical semantic change (LSC) detection, particularly for bi-directional shifts and for words with both slang and standard meanings. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset tracks sense gain, loss, and stability across three time periods, enabling the study of complex semantic trajectories. Its companion, the SlangTrack Word Sense Disambiguation (ST-WSD) dataset, offers fine-grained, instance-level sense annotations for words that have both slang and standard usages. Evaluations spanning unsupervised clustering, supervised machine learning, transformer-based models, and large language models found that few-shot GPT-4o achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy. However, Macro-F1 scores near 0.5 across all systems indicate that rare slang senses remain a significant open challenge.
Key takeaway
For NLP engineers developing semantic change detection models, the BD-LSC and ST-WSD datasets offer benchmarks for bi-directional and slang-inclusive LSC. Focus research effort on improving Macro-F1, since current models, including few-shot GPT-4o, still struggle with rare slang senses. Integrating these datasets into your evaluation pipeline can expose model weaknesses on complex semantic shifts.
Key insights
New benchmarks address bi-directional lexical semantic change and slang usage, revealing challenges with rare slang senses.
Principles
- LSC detection must capture sense gain, loss, and stability.
- Slang and standard word usages complicate LSC analysis.
- Instance-level sense annotations are crucial for robust benchmarking.
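To make the first principle concrete, here is a minimal sketch of how sense gain, loss, and stability could be derived from per-period sense inventories. It does not reproduce the BD-LSC schema: the function `sense_transitions`, the placeholder sense names, and the idea of representing each period as a set of attested senses are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def sense_transitions(periods: List[Set[str]]) -> List[Dict[str, Set[str]]]:
    """Compare each period with the next one and report gained, lost, and stable senses.

    `periods` is a chronologically ordered list of sense inventories
    (an assumed representation, not the dataset's actual format).
    """
    transitions = []
    for earlier, later in zip(periods, periods[1:]):
        transitions.append({
            "gained": later - earlier,   # senses attested only in the later period
            "lost": earlier - later,     # senses attested only in the earlier period
            "stable": earlier & later,   # senses attested in both periods
        })
    return transitions


# Invented inventories for one hypothetical word across three periods.
word_periods = [
    {"standard_sense"},
    {"standard_sense", "slang_sense"},
    {"slang_sense"},
]

transitions = sense_transitions(word_periods)
for step, change in enumerate(transitions, start=1):
    print(f"period {step} -> period {step + 1}: {change}")
```

In this toy trajectory the word first gains a slang sense and then loses its standard sense, which is the kind of two-way movement a gain-only or loss-only label would miss.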
Method
The BD-LSC dataset captures bi-directional semantic change, while ST-WSD provides instance-level sense annotations for slang and standard usage. Models are evaluated across four methodological families: unsupervised clustering, supervised machine learning, transformer-based models, and large language models.
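The summary names Exact Sense Match (ESM) and multi-label accuracy as evaluation metrics but does not define them. The sketch below therefore rests on two assumptions: that an exact-match score counts an item as correct only when the predicted label set equals the gold set, and that multi-label accuracy is the mean Jaccard overlap between the two sets. The function names and toy labels are hypothetical, and the paper's own definitions may differ.

```python
from typing import List, Set


def exact_match(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of items whose predicted label set equals the gold set exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def jaccard_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean Jaccard overlap between gold and predicted label sets.

    This is one common definition of multi-label accuracy, assumed here.
    Two empty sets are treated as a perfect match.
    """
    total = 0.0
    for g, p in zip(gold, pred):
        union = g | p
        total += len(g & p) / len(union) if union else 1.0
    return total / len(gold)


# Invented gold and predicted change labels for three hypothetical words.
gold_labels = [{"gain"}, {"gain", "loss"}, {"stable"}]
pred_labels = [{"gain"}, {"gain"}, {"loss"}]

esm_score = exact_match(gold_labels, pred_labels)
ml_accuracy = jaccard_accuracy(gold_labels, pred_labels)
print(f"exact match: {esm_score:.2f}")           # exact match: 0.33
print(f"multi-label accuracy: {ml_accuracy:.2f}")  # multi-label accuracy: 0.50
```

Under these assumed definitions, the partially correct second prediction earns no exact-match credit but half credit on the overlap-based score, so the two metrics can diverge for words that change in more than one direction.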
In practice
- Use BD-LSC to evaluate models on complex semantic trajectories.
- Use ST-WSD for fine-grained slang/standard WSD benchmarking.
- Prioritize improving Macro-F1 scores for rare slang sense detection.
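The emphasis on Macro-F1 is easier to see with a small worked example. The sketch below computes accuracy and Macro-F1 from scratch on invented labels, where a rare slang sense makes up one instance in ten and a hypothetical model always predicts the standard sense. It only illustrates how the metric behaves and says nothing about how the systems in the paper actually erred.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Return the F1 score for every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label is never hit.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of instances labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented data: nine standard-sense instances and one slang-sense instance.
gold = ["standard"] * 9 + ["slang"]
# A hypothetical model that never predicts the rare slang sense.
pred = ["standard"] * 10

print(f"accuracy: {accuracy(gold, pred):.2f}")  # accuracy: 0.90
print(f"macro-F1: {macro_f1(gold, pred):.2f}")  # macro-F1: 0.47
```

Because Macro-F1 averages per-class scores without weighting by frequency, missing the single slang instance costs as much as failing an entire class, which is why it is a more demanding target than accuracy when rare senses matter.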
Topics
- Lexical Semantic Change
- Slang Detection
- Word Sense Disambiguation
- Benchmarking Datasets
- Large Language Models
- GPT-4o
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.