Training Data Size Sensitivity in Unsupervised Rhyme Recognition

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

RhymeTagger, a language-independent tool for unsupervised rhyme recognition, was evaluated on seven languages: Czech, German, English, French, Italian, Russian, and Slovene. The study examined how training data size and language differences affect its accuracy. To establish a baseline, inter-annotator agreement was measured on a manually annotated subset of poems; the analysis showed that factors such as phonetic similarity and word distance contribute to disagreement among experts. RhymeTagger was also compared with three large language models (LLMs) prompted in a one-shot setting. Given sufficient training data, RhymeTagger consistently surpassed human agreement levels, whereas the LLMs performed poorly, which the study attributes to their lack of phonetic representation.

Key takeaway

If you develop natural language processing tools for poetic analysis, this study suggests that RhymeTagger is a robust, language-independent option for rhyme recognition, provided enough training data is available. Consider integrating phonetic representations into your models, since the LLMs tested, which lack them, performed poorly on this task. Doing so can make automated literary analysis more accurate and reliable.

Key insights

Given sufficient training data, RhymeTagger surpasses human inter-annotator agreement on unsupervised rhyme recognition and outperforms the three LLMs tested.

Principles

Method

RhymeTagger identifies rhymes by detecting repeating patterns in poetry corpora. Its output was compared with human inter-annotator agreement and with three LLMs prompted in a one-shot setting, across seven languages.
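To make the idea of learning rhymes from repeating patterns concrete, here is a minimal toy sketch. It is not RhymeTagger's actual algorithm or API: the functions `final_word`, `train`, and `tag`, the `window` and `min_count` parameters, and the three sample poems are all hypothetical, and the sketch works on raw spelling rather than on any phonetic representation. It only counts how often two line-final words appear on neighbouring lines and treats frequently co-occurring pairs as rhymes.

```python
from collections import Counter


def final_word(line):
    """Return the lowercased last word of a line, stripped of punctuation."""
    words = line.split()
    return words[-1].strip(".,;:!?\"'").lower() if words else ""


def train(corpus, window=1):
    """Count how often two different line-final words occur within
    `window` lines of each other anywhere in the corpus."""
    counts = Counter()
    for poem in corpus:
        finals = [final_word(line) for line in poem]
        for i in range(len(finals)):
            for j in range(i + 1, min(i + window + 1, len(finals))):
                if finals[i] and finals[j] and finals[i] != finals[j]:
                    counts[frozenset((finals[i], finals[j]))] += 1
    return counts


def tag(poem, counts, window=1, min_count=2):
    """Return index pairs of lines whose final words co-occurred
    at least `min_count` times in the training corpus."""
    finals = [final_word(line) for line in poem]
    pairs = []
    for i in range(len(finals)):
        for j in range(i + 1, min(i + window + 1, len(finals))):
            if counts[frozenset((finals[i], finals[j]))] >= min_count:
                pairs.append((i, j))
    return pairs


# Hypothetical sample poems, invented for this illustration only.
poem_a = ["The cat sat in the night,", "It saw a distant light.",
          "It walked along the shore", "And knocked upon the door."]
poem_b = ["A star is burning bright", "Across the silent night.",
          "The waves come in once more", "And break upon the shore."]
poem_c = ["She carried in a light", "To chase away the night,",
          "Then opened up the door", "That faced the empty shore."]

large_model = train([poem_a, poem_b, poem_c])
small_model = train([poem_a])

print(tag(poem_a, large_model))  # [(0, 1), (2, 3)]
print(tag(poem_a, small_model))  # []
```

With all three poems, the pairs night/light and shore/door each occur twice and are tagged; trained on a single poem, no pair reaches the threshold and nothing is tagged. This is the kind of dependence on training data size that the study measures, although the real tool and its evaluation are considerably more involved.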

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.