Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment
Summary
A study investigated the accuracy and quality of multi-turn conversations between developers and LLM-based agents for Non-Functional Requirement (NFR) assessment. Focusing on HIPAA regulatory compliance, researchers hired 49 programmers to interact with GitHub Copilot. These developers assessed 148 HIPAA-derived NFRs against the iTrust codebase across three dimensions: requirement satisfaction level, reasoning, and code localization. The findings indicate that while developers generally agree with the LLM's assessments, the accuracy of those assessments against expert ground truth remains low. The study also modeled user satisfaction: longer system responses and a larger number of information-providing turns reduce satisfaction, whereas proactive interactions increase it. The research highlights critical gaps in current LLM evaluation benchmarks, which primarily focus on functional correctness.
Key takeaway
AI engineers developing LLM-based dialogue systems for complex tasks such as Non-Functional Requirement assessment should prioritize factual accuracy against expert ground truth. Users may agree with an LLM's output even when that output is often incorrect, so user agreement is not evidence of correctness. Beyond accuracy, design for proactive interactions and keep response length and per-turn information density in check, since these factors significantly influence user satisfaction in multi-turn dialogues.
Key insights
LLM-based agents struggle with NFR assessment accuracy even when users agree with them, and user satisfaction depends on response length, the number of information-providing turns, and proactivity.
Principles
- LLM NFR assessments often lack expert-level accuracy despite user concurrence.
- User satisfaction in LLM dialogues decreases with longer responses and more information-providing turns.
- Proactive LLM interactions enhance user satisfaction in assessment tasks.
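The satisfaction factors listed above can be made concrete with a small feature extractor. This is a minimal sketch, not the study's model: this summary does not describe the model's form, so the `Turn` fields, the `provides_info` and `proactive` annotations, and the weights are all hypothetical. Only the signs of the weights follow the reported direction of each effect. Counting information-providing turns across both speakers is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str                 # "user" or "system"
    text: str
    provides_info: bool = False  # hypothetical annotation: the turn supplies information
    proactive: bool = False      # hypothetical annotation: the system takes the initiative


def dialogue_features(turns):
    """Extract the three factors named in the principles from one dialogue."""
    system_turns = [t for t in turns if t.speaker == "system"]
    lengths = [len(t.text.split()) for t in system_turns]
    return {
        "mean_response_words": sum(lengths) / len(lengths) if lengths else 0.0,
        "info_turns": sum(t.provides_info for t in turns),
        "proactive_turns": sum(t.proactive for t in system_turns),
    }


# Placeholder weights: only the signs follow the summary; the magnitudes are invented.
WEIGHTS = {"mean_response_words": -0.01, "info_turns": -0.5, "proactive_turns": 0.5}


def satisfaction_score(features):
    """Toy linear score: higher means higher predicted satisfaction."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


# Invented dialogue, for illustration only.
example = [
    Turn("user", "Does the login module meet the audit requirement?"),
    Turn("system",
         "It appears partially satisfied. Should I list the relevant files?",
         provides_info=True, proactive=True),
    Turn("user", "Yes"),
]

features = dialogue_features(example)
print(features)
print(satisfaction_score(features))
```

In a real evaluation the weights would be fitted to collected satisfaction ratings rather than set by hand.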
Method
The study had 49 programmers use GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase. Each assessment covered three dimensions: requirement satisfaction level, reasoning, and code localization.
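The gap between developer agreement and expert-verified accuracy is easiest to see when the two are computed separately. The sketch below assumes a hypothetical record layout (`Assessment`) and hypothetical label names. The sample records are invented for illustration and are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    requirement_id: str
    llm_label: str        # satisfaction level proposed by the LLM agent
    developer_label: str  # level the developer settled on after the dialogue
    expert_label: str     # expert ground truth


def agreement_rate(items):
    """Share of requirements where the developer kept the LLM's label."""
    return sum(a.developer_label == a.llm_label for a in items) / len(items)


def accuracy(items):
    """Share of requirements where the LLM's label matches the expert label."""
    return sum(a.llm_label == a.expert_label for a in items) / len(items)


# Toy records, invented for illustration only (not data from the study).
sample = [
    Assessment("R1", "satisfied", "satisfied", "not satisfied"),
    Assessment("R2", "satisfied", "satisfied", "satisfied"),
    Assessment("R3", "partially satisfied", "partially satisfied", "not satisfied"),
    Assessment("R4", "not satisfied", "satisfied", "not satisfied"),
]

print(f"agreement: {agreement_rate(sample):.2f}")  # 0.75
print(f"accuracy:  {accuracy(sample):.2f}")        # 0.50
```

Reporting both numbers side by side keeps a high agreement rate from being mistaken for correctness.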
In practice
- Prioritize proactive LLM interactions in NFR assessment tools.
- Optimize LLM response length to improve user satisfaction.
- Design LLM dialogues to manage information density per turn.
Topics
- LLM Dialogue Systems
- Non-Functional Requirements
- HIPAA Compliance
- User Satisfaction
- GitHub Copilot
- LLM Evaluation
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.