Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Legal & Regulatory · Depth: Expert, quick

Summary

A study investigated the accuracy and quality of multi-turn conversations between developers and LLM-based agents when assessing Non-Functional Requirements (NFRs). Focusing on HIPAA regulatory compliance, researchers hired 49 programmers to interact with GitHub Copilot. These developers assessed 148 HIPAA-derived NFRs against the iTrust codebase across three dimensions: requirement satisfaction level, reasoning, and code localization. The findings indicate that developers generally agree with the LLM's assessments, yet the accuracy of those assessments against expert ground truth remains low. The study also modeled user satisfaction: longer system responses and a larger number of information-providing turns lower satisfaction, whereas proactive interactions raise it. The research highlights a gap in current LLM evaluation benchmarks, which focus primarily on functional correctness.

Key takeaway

AI engineers developing LLM-based dialogue systems for complex tasks such as Non-Functional Requirements assessment should prioritize factual accuracy against expert ground truth. Users may agree with an LLM's output even when that output is incorrect, so user agreement is not a reliable proxy for correctness. Design for proactive interactions, and keep response length and the number of information-providing turns in check, because these factors significantly influence user satisfaction in multi-turn dialogues.

Key insights

LLM-based agents struggle with NFR assessment accuracy even when users agree with them, and response length, the number of information-providing turns, and proactivity all affect user satisfaction.

Principles

LLM NFR assessments often lack expert-level accuracy despite user concurrence.
User satisfaction in LLM dialogues decreases with longer responses and more information-providing turns.
Proactive LLM interactions enhance user satisfaction in assessment tasks.

Method

The study had 49 programmers use GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase. Each assessment covered three dimensions: requirement satisfaction level, reasoning, and code localization.
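
The gap between developer agreement and expert-verified accuracy is easier to see when the two are computed as separate metrics. The following is a minimal sketch, not the study's actual analysis: the `Assessment` schema, the verdict labels, and the toy records are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One NFR judged by the LLM, a developer, and an expert (hypothetical schema)."""
    nfr_id: str
    llm_verdict: str        # hypothetical labels, e.g. "satisfied", "partially", "not_satisfied"
    developer_agrees: bool  # did the developer accept the LLM's verdict?
    expert_verdict: str     # ground-truth label assigned by an expert


def agreement_rate(items):
    """Share of assessments where the developer accepted the LLM verdict."""
    if not items:
        raise ValueError("no assessments given")
    return sum(a.developer_agrees for a in items) / len(items)


def accuracy(items):
    """Share of assessments where the LLM verdict matches the expert label."""
    if not items:
        raise ValueError("no assessments given")
    return sum(a.llm_verdict == a.expert_verdict for a in items) / len(items)


# Toy records, invented purely for illustration (not data from the study).
sample = [
    Assessment("NFR-1", "satisfied", True, "not_satisfied"),
    Assessment("NFR-2", "satisfied", True, "satisfied"),
    Assessment("NFR-3", "partially", True, "not_satisfied"),
    Assessment("NFR-4", "not_satisfied", False, "not_satisfied"),
]

print(agreement_rate(sample))  # 0.75
print(accuracy(sample))        # 0.5
```

In this toy sample the developer accepts three of four verdicts while only two match the expert label, which is the pattern the summary describes: agreement can be high while accuracy stays low.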

In practice

Prioritize proactive LLM interactions in NFR assessment tools.
Keep LLM responses concise, since longer responses were associated with lower user satisfaction.
Limit the number of information-providing turns in a dialogue, which was also associated with lower satisfaction.
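
To act on the recommendations above, a team could track the dialogue features that the study links to satisfaction. The sketch below is an illustration under stated assumptions: the `Turn` schema and the dialogue-act labels (`"inform"`, `"proactive"`, `"other"`) are hypothetical, the summary does not say how the study annotated turns or who speaks the information-providing turns, and the example dialogue is invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user" or "system"
    text: str
    act: str      # hypothetical dialogue-act label: "inform", "proactive", or "other"


def dialogue_features(turns):
    """Compute simple per-dialogue features tied to the satisfaction findings."""
    system_turns = [t for t in turns if t.speaker == "system"]
    if not system_turns:
        raise ValueError("dialogue has no system turns")
    return {
        # Average length of system responses, in whitespace-separated words.
        "mean_response_words": mean(len(t.text.split()) for t in system_turns),
        # Information-providing turns, counted across both speakers (an assumption).
        "inform_turns": sum(t.act == "inform" for t in turns),
        # Proactive turns, counted for the system only (an assumption).
        "proactive_turns": sum(t.act == "proactive" for t in system_turns),
    }


# Invented example dialogue, not taken from the study.
dialogue = [
    Turn("user", "Does the login module log failed attempts?", "other"),
    Turn("system", "Yes, failures are logged in the audit table.", "inform"),
    Turn("system", "Should I also check the session timeout rule?", "proactive"),
]

features = dialogue_features(dialogue)
print(features)
```

Features like these could feed a dashboard or a regression against user ratings; which thresholds count as "too long" or "too many turns" would have to be determined empirically for a given tool.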

Topics

LLM Dialogue Systems
Non-Functional Requirements
HIPAA Compliance
User Satisfaction
GitHub Copilot
LLM Evaluation

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.