Label-Aware Pseudo-Training Sample Generation for Text Classification
Summary
The article presents a data augmentation algorithm for Natural Language Processing (NLP) tasks where training data is limited. The method perturbs sentences slightly by inserting random words, then uses Large Language Models (LLMs) to identify the most suitable replacements within their embedding space. Drawing inspiration from Prompt Tuning, the algorithm optimizes the embedding vectors of the inserted tokens, rather than the input prompt itself, to maximize the conditional generation probability. This makes it possible to generate a large volume of samples while implicitly drawing on the knowledge embedded in LLMs. Experiments across several benchmark text classification tasks show a substantial improvement in performance over non-augmented baselines.
Key takeaway
If you are a research scientist developing NLP models with limited training data, consider integrating this LLM-based augmentation method to generate diverse training samples. It has shown substantial performance improvements over non-augmented baselines on text classification benchmarks, and it lets you benefit implicitly from LLM knowledge without extensive prompt engineering.
Key insights
A new data augmentation method uses LLMs to subtly alter sentences, improving text classification performance when training data is limited.
Principles
- LLMs can augment data effectively.
- Optimizing token embeddings enhances generation.
Method
Insert random words into sentences, then use an LLM to find the best replacements: update the embedding vectors of the inserted tokens so as to maximize the conditional generation probability.
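The three steps above (insert, optimize, replace) can be illustrated with a toy sketch. This is not the article's implementation: the two-dimensional `VOCAB` table, the `LABEL_WEIGHTS` scorer, the `log_prob` and `augment` functions, and all hyperparameters are hypothetical stand-ins. The sketch assumes the probability being maximized is that of the class label given the sentence, and that the replacement is chosen as the vocabulary word nearest to the optimized vector; the summary does not confirm either detail. A real system would use an LLM's embedding table and its own gradients instead of the finite-difference gradient used here.

```python
import math
import random

# Hypothetical 2-d embedding table; a real LLM's table is far larger.
VOCAB = {
    "great": [1.0, 0.2],
    "awful": [-1.0, 0.1],
    "movie": [0.0, 1.0],
    "the": [0.1, 0.0],
    "was": [0.0, 0.1],
    "truly": [0.6, 0.3],
    "table": [0.0, -0.8],
}

# Hypothetical linear scorer standing in for the LLM's conditional probability.
LABEL_WEIGHTS = {"positive": [1.0, 0.0], "negative": [-1.0, 0.0]}


def log_prob(label, vectors):
    """Toy log p(label | sentence): softmax over mean-pooled embeddings."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    logits = {
        name: sum(w * m for w, m in zip(weights, mean))
        for name, weights in LABEL_WEIGHTS.items()
    }
    log_z = math.log(sum(math.exp(z) for z in logits.values()))
    return logits[label] - log_z


def augment(tokens, label, steps=50, lr=0.5, eps=1e-4, seed=0):
    """Insert one random word, optimize its embedding, snap to nearest word."""
    rng = random.Random(seed)
    pos = rng.randrange(len(tokens) + 1)          # random insertion point
    vec = list(VOCAB[rng.choice(sorted(VOCAB))])  # start from a random word
    context = [VOCAB[t] for t in tokens]          # original tokens stay fixed

    def score(v):
        return log_prob(label, context[:pos] + [v] + context[pos:])

    before = score(vec)
    for _ in range(steps):
        # Central finite differences approximate the gradient of the score
        # with respect to the inserted token's embedding only.
        grad = []
        for d in range(len(vec)):
            up = vec[:d] + [vec[d] + eps] + vec[d + 1:]
            down = vec[:d] + [vec[d] - eps] + vec[d + 1:]
            grad.append((score(up) - score(down)) / (2 * eps))
        vec = [x + lr * g for x, g in zip(vec, grad)]  # gradient ascent step
    after = score(vec)

    # Assumed replacement rule: nearest vocabulary embedding (squared distance).
    word = min(
        VOCAB,
        key=lambda w: sum((a - b) ** 2 for a, b in zip(VOCAB[w], vec)),
    )
    return tokens[:pos] + [word] + tokens[pos:], before, after
```

Calling `augment(["the", "movie", "was"], "positive", seed=s)` for several values of `s` yields several pseudo-samples for the same label. Only the inserted token's vector is updated; the original tokens and the scorer stay frozen, which mirrors the Prompt Tuning idea of training a small set of embeddings against a fixed model.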
In practice
- Improve text classification with scarce data.
- Apply LLM knowledge for sample generation.
Topics
- Data Augmentation
- Large Language Models
- Natural Language Processing
- Prompt Tuning
- Text Classification
Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Journal of Artificial Intelligence Research.