Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
Summary
This cross-domain empirical study introduces CompCQ, a multi-dimensional framework for systematically characterizing Competency Questions (CQs) generated by Large Language Models (LLMs). CQs are central to requirements elicitation in ontology engineering, but writing them has traditionally been a manual, labor-intensive task. The study evaluates CQs from five LLMs, three open (KimiK2-1T, Llama 3.1-8B, Llama 3.2-3B) and two closed (Gemini 2.5 Pro, GPT 4.1), across five domains, including cultural heritage and healthcare. It quantifies readability (Flesch-Kincaid Grade Level, Dale-Chall Readability Score), structural complexity (requirement, linguistic, syntactic), and relevance to the input text (rated by an LLM on a Likert scale), and it measures semantic diversity and overlap using Sentence-BERT embeddings. The findings indicate that domain characteristics are the primary factor shaping generation behavior; closed models yield more stable and readable CQs, while open models yield more diverse ones.
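To make the readability dimension concrete, here is a minimal sketch of the Flesch-Kincaid Grade Level using its standard published formula. The syllable counter is a rough vowel-group heuristic chosen for illustration, and the function names are hypothetical; this is not necessarily how the study computed its scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Lower scores indicate text that is easier to read. A dedicated readability library with a dictionary-based syllable counter would likely give more reliable values than this heuristic.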
Key takeaway
AI scientists and ontology engineers evaluating LLMs for CQ generation should recognize that models exhibit distinct generation profiles and that those profiles are shaped by the domain. Closed models such as Gemini 2.5 Pro and GPT 4.1 tend to produce more readable and stable CQs, while open models offer greater diversity. Rather than relying on a single model, combine outputs from multiple LLMs and add human review to ensure comprehensive, accurate coverage of requirements.
Key insights
The CompCQ framework systematically characterizes LLM-generated Competency Questions across linguistic, structural, and semantic dimensions.
Principles
- Domain characteristics drive LLM generation profiles.
- No single LLM captures the full requirements space.
- Closed models offer stability and readability; open models offer greater diversity.
Method
The CompCQ framework quantifies CQ readability, complexity (requirement, linguistic, syntactic), and LLM-rated relevance. It uses Sentence-BERT embeddings to measure semantic diversity within a CQ set (APS, ACD, Shannon entropy) and to compare sets pairwise (centroid similarity, coverage, novelty).
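The embedding-based measures can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual definitions: it reads APS as a mean pairwise cosine similarity and ACD as a mean cosine distance to the set centroid, computes Shannon entropy over cluster labels, and defines coverage and novelty with a hypothetical similarity threshold of 0.8. Embeddings are plain lists of floats here; in practice they would come from a Sentence-BERT encoder.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a set of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_pairwise_similarity(vectors):
    """Assumed reading of APS: average cosine similarity over all pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_centroid_distance(vectors):
    """Assumed reading of ACD: average cosine distance to the centroid."""
    c = centroid(vectors)
    return sum(1.0 - cosine(v, c) for v in vectors) / len(vectors)


def shannon_entropy(labels):
    """Entropy in bits of a label distribution (e.g. cluster assignments)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def centroid_similarity(set_a, set_b):
    """Cosine similarity between the centroids of two CQ sets."""
    return cosine(centroid(set_a), centroid(set_b))


def coverage(source, target, threshold=0.8):
    """Fraction of target vectors with a source vector at or above the threshold."""
    hits = sum(1 for t in target if any(cosine(s, t) >= threshold for s in source))
    return hits / len(target)


def novelty(source, target, threshold=0.8):
    """Fraction of source vectors with no close match in target."""
    return 1.0 - coverage(target, source, threshold)
```

Under these definitions, a lower mean pairwise similarity or a higher mean centroid distance indicates a more diverse CQ set, and novelty is the complement of coverage taken in the opposite direction.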
In practice
- Combine multiple LLMs for comprehensive CQ generation.
- Retain human-in-the-loop refinement for accuracy.
- Use CompCQ to benchmark LLM output profiles.
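The first two recommendations can be sketched as a pooling step that merges CQs from several models and groups near-duplicates before human review. This is a minimal sketch with assumptions: it uses token-level Jaccard similarity as a stand-in for embedding similarity, the 0.8 threshold is arbitrary, and the function and model names are hypothetical.

```python
import re


def tokens(question: str) -> set:
    """Lowercased alphanumeric tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pool_questions(outputs: dict, threshold: float = 0.8) -> list:
    """Merge per-model CQ lists, keeping the first question of each
    near-duplicate group and recording which models produced it."""
    pooled = []
    for model, questions in outputs.items():
        for question in questions:
            for entry in pooled:
                if jaccard(tokens(question), tokens(entry["question"])) >= threshold:
                    entry["models"].add(model)
                    break
            else:
                # No near-duplicate found: start a new group.
                pooled.append({"question": question, "models": {model}})
    return pooled


outputs = {
    "model_a": ["Which artists created this artwork?", "Where is the artwork held?"],
    "model_b": ["which artists created this artwork", "What treatments does a patient receive?"],
}
pooled = pool_questions(outputs)
```

Questions produced by only one model are the ones that widen coverage, so a reviewer might inspect those first rather than discard them.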
Topics
- Ontology Engineering
- Competency Questions
- Large Language Models
- CompCQ Framework
- Cross-Domain Empirical Study
Best for: AI Scientist, Research Scientist, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.