Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

2026-04-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

Clinical value set authoring, which involves identifying all standardized vocabulary codes for a clinical concept, is a significant bottleneck in clinical quality measurement and phenotyping. A new approach, Retrieval-Augmented Set Completion (RASC), addresses this by retrieving the K most similar existing value sets from a curated corpus to create a candidate pool, then applying a classifier to each candidate code. This method reduces the effective output space from the full vocabulary to a smaller retrieved pool. RASC was evaluated on 11,803 publicly available VSAC value sets, establishing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieved an AUROC of 0.852 and a value-set-level F1 of 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250). Both classifiers reduced irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieved an F1 of 0.105, with 48.6% of its codes absent from VSAC.
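
The "irrelevant candidates per true positive" figures can be read as precision. The sketch below assumes that metric is the ratio of false positives to true positives, so precision equals 1 / (1 + ratio); this reading and the helper name `precision_from_noise_ratio` are illustrative assumptions, not definitions taken from the paper.

```python
def precision_from_noise_ratio(irrelevant_per_tp: float) -> float:
    """Convert 'irrelevant candidates per true positive' into precision.

    Assumes the ratio is FP / TP, so precision = TP / (TP + FP) = 1 / (1 + ratio).
    """
    if irrelevant_per_tp < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + irrelevant_per_tp)


# Ratios reported in the summary above.
reported = {
    "retrieval-only": 12.3,
    "cross-encoder": 3.2,
    "three-layer MLP": 4.4,
}

for name, ratio in reported.items():
    print(f"{name}: implied precision {precision_from_noise_ratio(ratio):.3f}")
```

Under that assumption, classification lifts implied precision from roughly 8% for the raw retrieved pool to roughly 24% (cross-encoder) and 19% (MLP).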

Key takeaway

For clinical informaticists and AI scientists building automated phenotyping or quality measurement systems, RASC offers an alternative to direct LLM prompting for value set authoring. Consider its retrieve-and-select paradigm when working with large, version-controlled clinical vocabularies: in the VSAC evaluation, the fine-tuned cross-encoder reached a value-set-level F1 of 0.298, versus 0.250 for the three-layer MLP and 0.105 for zero-shot GPT-4o, and cut irrelevant candidates per true positive from 12.3 (retrieval-only) to about 3.2. Nearly half (48.6%) of the codes GPT-4o produced were absent from VSAC, whereas RASC selects only from codes that already appear in retrieved value sets.

Key insights

Retrieval-Augmented Set Completion (RASC) combines retrieval with per-code classification, reaching a value-set-level F1 of 0.298 on the VSAC benchmark versus 0.105 for zero-shot GPT-4o.

Principles

Shrink output space for complex vocabularies.
LLMs struggle with large, version-controlled clinical vocabularies.

Method

RASC retrieves K similar value sets, forms a candidate pool, then classifies each candidate code to complete the set.
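
The retrieve-then-classify flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy two-dimensional embeddings, the placeholder codes (`C1`, `C2`, ...), the function names, and the lookup-table scorer are all hypothetical. In the paper's setup the scorer would be a trained classifier such as the fine-tuned cross-encoder, conditioned on the query concept.

```python
import math
from typing import Callable, Dict, List, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Vector, corpus_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Return the names of the k corpus value sets most similar to the query."""
    ranked = sorted(
        corpus_vecs,
        key=lambda name: cosine(query_vec, corpus_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def complete_set(
    query_vec: Vector,
    corpus_vecs: Dict[str, Vector],
    corpus_codes: Dict[str, Set[str]],
    score: Callable[[str], float],
    k: int = 2,
    threshold: float = 0.5,
) -> Set[str]:
    """Retrieve k similar value sets, pool their codes, keep codes scoring >= threshold."""
    neighbors = retrieve_top_k(query_vec, corpus_vecs, k)
    # Candidate pool: union of codes from the retrieved value sets only,
    # so the classifier never sees the full vocabulary.
    pool: Set[str] = set().union(*(corpus_codes[name] for name in neighbors))
    return {code for code in pool if score(code) >= threshold}


# Toy corpus (hypothetical): value set name -> embedding, and name -> member codes.
corpus_vecs = {"diabetes_a": [1.0, 0.0], "diabetes_b": [0.9, 0.1], "asthma": [0.0, 1.0]}
corpus_codes = {"diabetes_a": {"C1", "C2"}, "diabetes_b": {"C2", "C3"}, "asthma": {"C9"}}

# Stand-in for a trained per-code classifier; scores are made up for illustration.
toy_scores = {"C1": 0.9, "C2": 0.8, "C3": 0.2}

result = complete_set(
    [1.0, 0.05], corpus_vecs, corpus_codes, lambda code: toy_scores.get(code, 0.0), k=2
)
print(sorted(result))
```

Because every output code comes from the retrieved pool, the pipeline cannot emit a code that is absent from the corpus, which is the failure mode reported for zero-shot prompting.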

In practice

Use RASC for clinical value set generation.
Fine-tune a cross-encoder built on SAPBert to classify each candidate code.

Topics

Clinical Value Set Authoring
Retrieval-Augmented Set Completion
SAPBert
VSAC Value Sets
Large Language Models

Code references

mukhes3/RASC

Best for: NLP Engineer, Machine Learning Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.