Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Legal & Regulatory) · Depth: Expert, quick

Summary

A study examined the accuracy and quality of multi-turn conversations between developers and LLM-based agents for assessing Non-Functional Requirements (NFRs). Focusing on HIPAA regulatory compliance, the researchers hired 49 programmers to interact with GitHub Copilot. The developers assessed 148 HIPAA-derived NFRs against the iTrust codebase along three dimensions: requirement satisfaction level, reasoning, and code localization. Developers generally agreed with the LLM's assessments, yet the accuracy of those assessments against expert ground truth remained low. The study also modeled user satisfaction: longer system responses and a larger number of information-providing turns lowered it, whereas proactive interactions raised it. These findings expose a gap in current LLM evaluation benchmarks, which focus primarily on functional correctness.

Key takeaway

AI engineers building LLM-based dialogue systems for complex tasks such as NFR assessment should prioritize factual accuracy against expert ground truth. User agreement is not a proxy for correctness: developers in the study tended to accept LLM assessments even though those assessments were often wrong. Beyond accuracy, design for proactive interactions and keep response length and the number of information-providing turns in check, since the study found these factors influence user satisfaction in multi-turn dialogues.
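The satisfaction factors named above (system response length, information-providing turns, proactive interactions) can be tracked per dialogue with a few lines of code. The following is a minimal sketch, not the study's method: the `Turn` type, the `act` labels, the word-count measure of length, and the choice to count information-providing turns from both speakers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user" or "system"
    text: str
    act: str      # hypothetical label: "inform", "proactive", or "other"


def dialogue_features(turns):
    """Summarize one dialogue with the three factors discussed above."""
    system_turns = [t for t in turns if t.speaker == "system"]
    # Assumption: response length is measured in whitespace-separated words.
    lengths = [len(t.text.split()) for t in system_turns]
    return {
        "mean_system_response_words": mean(lengths) if lengths else 0.0,
        # Assumption: information-providing turns are counted for both speakers.
        "inform_turns": sum(t.act == "inform" for t in turns),
        "proactive_turns": sum(t.act == "proactive" for t in system_turns),
    }


# Invented example dialogue, for illustration only.
EXAMPLE = [
    Turn("user", "Does the login module enforce automatic logoff?", "other"),
    Turn("system", "I could not find a session timeout. "
                   "Should I check the servlet filters too?", "proactive"),
    Turn("user", "Yes, the filters are under the web package.", "inform"),
    Turn("system", "No timeout is configured in the filters.", "inform"),
]

if __name__ == "__main__":
    print(dialogue_features(EXAMPLE))
```

Features like these could feed a satisfaction model or a dashboard alert, but the thresholds that matter would have to come from your own user data.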

Key insights

LLM-based agents assess NFRs with low accuracy even when users agree with them, and user satisfaction depends on response length, the number of information-providing turns, and proactivity.

Method

In the study, 49 programmers used GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase. Each assessment covered three dimensions: the requirement's satisfaction level, the reasoning behind it, and code localization.
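The distinction between developer agreement and accuracy against expert ground truth is easy to blur, so a small sketch may help. It assumes each assessment is stored as a record with the LLM's verdict, whether the developer agreed, and an expert verdict. The `Assessment` type, the verdict labels, and the sample rows are hypothetical and do not come from the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    requirement_id: str     # hypothetical identifier
    llm_verdict: str        # hypothetical labels, e.g. "satisfied"
    developer_agrees: bool  # did the developer accept the LLM's verdict?
    expert_verdict: str     # ground truth from an expert reviewer


def agreement_rate(items):
    """Fraction of assessments where the developer agreed with the LLM."""
    if not items:
        raise ValueError("no assessments given")
    return sum(a.developer_agrees for a in items) / len(items)


def accuracy(items):
    """Fraction of assessments where the LLM matched the expert verdict."""
    if not items:
        raise ValueError("no assessments given")
    return sum(a.llm_verdict == a.expert_verdict for a in items) / len(items)


# Invented rows, for illustration only.
SAMPLE = [
    Assessment("NFR-1", "satisfied", True, "satisfied"),
    Assessment("NFR-2", "satisfied", True, "not satisfied"),
    Assessment("NFR-3", "partially satisfied", True, "not satisfied"),
    Assessment("NFR-4", "not satisfied", False, "not satisfied"),
]

if __name__ == "__main__":
    print(f"agreement: {agreement_rate(SAMPLE):.2f}")
    print(f"accuracy:  {accuracy(SAMPLE):.2f}")
```

In this invented sample the developer agrees with three of four verdicts while only two of four match the expert, which mirrors the kind of gap the study reports. Reporting both numbers side by side keeps user acceptance from being mistaken for correctness.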

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.