As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A study by Jasmine Owers, Edwin Simpson, and Martha Lewis investigates how well Large Language Models (LLMs) interpret negation within figurative language. The authors augment the existing Fig-QA dataset, which contains 10,256 figurative sentences, with new annotations for negation, tense, and concreteness. They test a range of models: GloVe, SBERT, Llama-3, Llama-3.1, Llama-3.3, GPT-4o-mini, GPT-4o, and OpenAI o1-mini. The combination of negation and figurativeness, particularly in similes, proves to be a significant challenge for LLMs. Model performance is highly sensitive to prompt style, with "question-answer" methods generally yielding higher accuracy than "mid-phrase" approaches. Human performance on the test set was 0.946. Models struggled more with negated figurative language than with literal negation, which indicates an interaction between negation and figurativeness rather than a general difficulty with negation alone.

Key takeaway

NLP engineers deploying LLMs in real-world applications should prefer "question-answer" prompting styles for tasks that involve complex language interpretation. Expect noticeably lower accuracy on text that combines negation with figurative language, especially similes. Evaluate your models specifically on these interactions to locate failure points, and consider targeted fine-tuning if high accuracy is critical.
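The evaluation suggested above can be sketched as a per-condition accuracy breakdown. This is a minimal illustration, not the authors' evaluation code: the record fields (`figurative`, `negated`, `prediction`, `label`) and the function names are hypothetical, and the "interaction gap" is a simple difference-in-differences chosen here for illustration.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return accuracy for each (figurative, negated) combination.

    Each record is a dict with hypothetical keys:
    'figurative' (bool), 'negated' (bool), 'prediction', 'label'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["figurative"], r["negated"])
        total[key] += 1
        correct[key] += int(r["prediction"] == r["label"])
    return {key: correct[key] / total[key] for key in total}


def interaction_gap(acc):
    """Accuracy drop caused by negation on figurative items, minus the
    drop on literal items. A positive value means negation hurts more
    when the text is figurative."""
    figurative_drop = acc[(True, False)] - acc[(True, True)]
    literal_drop = acc[(False, False)] - acc[(False, True)]
    return figurative_drop - literal_drop


# Toy data for illustration only; not results from the study.
toy_records = [
    {"figurative": True, "negated": False, "prediction": 0, "label": 0},
    {"figurative": True, "negated": False, "prediction": 1, "label": 1},
    {"figurative": True, "negated": True, "prediction": 0, "label": 0},
    {"figurative": True, "negated": True, "prediction": 0, "label": 1},
    {"figurative": False, "negated": False, "prediction": 1, "label": 1},
    {"figurative": False, "negated": False, "prediction": 0, "label": 0},
    {"figurative": False, "negated": True, "prediction": 1, "label": 1},
    {"figurative": False, "negated": True, "prediction": 0, "label": 0},
]

toy_accuracy = accuracy_by_condition(toy_records)
toy_gap = interaction_gap(toy_accuracy)
```

Reporting all four cells, rather than a single overall accuracy, makes it visible whether a model fails on negation in general or only when negation meets figurative language.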

Key insights

LLMs struggle with combined negation and figurative language, with prompt style significantly impacting interpretation accuracy.

Principles

Method

The study adds new annotations to the Fig-QA dataset covering negation, tense, and concreteness. Embedding models are assessed with cosine similarity, and autoregressive LLMs with either log-likelihood scoring or question-answer prompts. The authors also created a small literal negation dataset, which allows negated figurative language to be compared against literal negation.
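The two scoring schemes named above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: it assumes each item pairs a figurative phrase with several candidate interpretations, that embeddings and per-token log-probabilities have already been obtained from some model, and that candidate log-likelihoods are compared by an unnormalized sum. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pick_by_cosine(phrase_vec, candidate_vecs):
    """Embedding-model scoring: choose the candidate interpretation whose
    embedding is most similar to the figurative phrase's embedding."""
    scores = [cosine(phrase_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def pick_by_log_likelihood(token_logprobs_per_candidate):
    """Autoregressive scoring: choose the candidate whose tokens receive
    the highest total log-probability. Summing without length
    normalization is an assumption of this sketch."""
    totals = [sum(lps) for lps in token_logprobs_per_candidate]
    return max(range(len(totals)), key=totals.__getitem__)


# Toy inputs for illustration only.
cosine_choice = pick_by_cosine([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
loglik_choice = pick_by_log_likelihood([[-1.0, -2.0], [-0.5, -0.5]])
```

In both cases the model never generates text; it only ranks the given candidates, which is what distinguishes these schemes from question-answer prompting.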

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.