Reference-Free Evaluation of Taxonomies

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Pascal Wullschleger et al. (August 2023) introduce two reference-free metrics for evaluating taxonomy quality, addressing the limitations of methods that depend on a gold standard. The first, Concept Similarity Correlation (CSC), assesses robustness by correlating semantic similarity with taxonomic similarity. It identifies misclassified leaf and non-leaf concepts, a type of error that prior metrics such as Semantic Proximity (SP) often miss. The second uses Natural Language Inference (NLI) to evaluate logical adequacy by verifying the "is-a" parent-child relationships in the taxonomy. The metrics were validated empirically on five taxonomies: SemEval-Food, SemEval-Verb, MeSH, a Wikidata-derived taxonomy, and a proprietary CookBook taxonomy. In these experiments, both CSC and the NLI-based logical adequacy metric (NLIV-S) correlated well with F1 scores against gold-standard taxonomies and outperformed SP, especially when non-leaf concepts were mutated. The authors used `bart-large-mnli` for NLI and `all-MiniLM-L6-v2` for semantic similarity.

Key takeaway

For NLP engineers who build or evaluate automated taxonomy generation systems, these reference-free metrics offer an alternative to gold-standard comparison. They let you assess two aspects of taxonomy quality, robustness and logical adequacy, when no reference taxonomy is available. Use Concept Similarity Correlation (CSC) to surface structural inconsistencies and Natural Language Inference (NLI) checks to surface implausible "is-a" relations. Because neither metric needs a manually built gold standard, both can support continuous quality monitoring and iterative improvement of taxonomy models.

Key insights

Reference-free metrics for taxonomy quality can assess robustness and logical adequacy without gold standards.

Principles

Taxonomy quality can be evaluated without a gold standard.
Robustness is assessed by how well semantic similarity correlates with taxonomic similarity.
Logical adequacy verifies "is-a" parent-child relationships.

Method

Robustness is measured by Concept Similarity Correlation (CSC) using Kendall rank correlation of semantic and taxonomic similarities. Logical adequacy uses Natural Language Inference (NLI) to approximate parent-child relation probabilities.
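A minimal sketch of the CSC idea, under stated assumptions: taxonomic similarity is computed here with Wu-Palmer similarity and semantic similarity with cosine similarity over hand-made 2-D vectors, both chosen only for illustration (the summary does not specify the taxonomic measure, and in practice the vectors would come from an embedding model such as `all-MiniLM-L6-v2`). The function names, the toy taxonomy, and the toy vectors are all hypothetical. Kendall's tau is implemented by hand in its tau-b form so the example has no dependencies.

```python
from itertools import combinations
from math import sqrt


def path_to_root(node, parent):
    """Return [node, parent, ..., root] by following child -> parent links."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def taxonomic_similarity(a, b, parent):
    """Wu-Palmer similarity (an assumed choice): 2*depth(lca) / (depth(a) + depth(b)).

    Assumes a single-rooted tree; the root has depth 1.
    """
    path_a, path_b = path_to_root(a, parent), path_to_root(b, parent)
    on_path_b = set(path_b)
    lca = next(n for n in path_a if n in on_path_b)  # lowest common ancestor
    depth_lca = len(path_to_root(lca, parent))
    return 2 * depth_lca / (len(path_a) + len(path_b))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def kendall_tau(xs, ys):
    """Kendall tau-b rank correlation of two paired sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(xs)), 2):
        dx, dy = xs[i] - xs[j], ys[i] - ys[j]
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from every count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def csc(parent, vectors):
    """Correlate semantic and taxonomic similarity over all concept pairs."""
    pairs = list(combinations(sorted(vectors), 2))
    semantic = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    taxonomic = [taxonomic_similarity(a, b, parent) for a, b in pairs]
    return kendall_tau(semantic, taxonomic)


# Toy taxonomy as child -> parent links (hypothetical data).
taxonomy = {
    "apple": "fruit", "pear": "fruit",
    "carrot": "vegetable", "leek": "vegetable",
    "fruit": "food", "vegetable": "food",
}
# Hand-made stand-ins for sentence embeddings (hypothetical data).
vectors = {
    "apple": (1.0, 0.1), "pear": (0.9, 0.2),
    "carrot": (0.1, 1.0), "leek": (0.2, 0.9),
}
# Mutated copy: two leaves are deliberately misclassified.
mutated = {**taxonomy, "pear": "vegetable", "carrot": "fruit"}

print("CSC, original taxonomy:", csc(taxonomy, vectors))
print("CSC, mutated taxonomy: ", csc(mutated, vectors))
```

On this toy data the misclassified copy scores lower than the original, which is the behavior the metric is meant to capture. With real embeddings the same `csc` function would apply unchanged, since it only needs one vector per concept.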

In practice

Apply CSC to detect misclassified leaf and non-leaf concepts.
Use NLI models (e.g., `bart-large-mnli`) for "is-a" relation checks.
Where a gold standard does exist, validate the metrics by checking how well they correlate with F1 scores against it.
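The NLI-based "is-a" check from the list above could be organized roughly as follows. This is a sketch under assumptions, not the authors' procedure: the hypothesis template, the use of the child label as the premise, the 0.5 threshold, and every function name are hypothetical. The entailment scorer is passed in as a plain callable so the example runs without downloading a model. In practice that callable would wrap an NLI model such as `bart-large-mnli` and return its entailment probability; the toy scorer here is only a stand-in.

```python
from typing import Callable, Iterable, List, Tuple

Edge = Tuple[str, str]  # (child, parent)


def isa_hypothesis(child: str, parent: str) -> str:
    """Build the hypothesis sentence for one edge (template is an assumption)."""
    return f"{child} is a kind of {parent}."


def check_edges(
    edges: Iterable[Edge],
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Tuple[float, List[Edge]]:
    """Score every parent-child edge and flag the implausible ones.

    entail_prob(premise, hypothesis) must return a probability in [0, 1].
    Returns the mean entailment score and the edges scoring below threshold.
    """
    edges = list(edges)
    if not edges:
        return 0.0, []
    scores = [entail_prob(child, isa_hypothesis(child, parent)) for child, parent in edges]
    flagged = [edge for edge, score in zip(edges, scores) if score < threshold]
    return sum(scores) / len(scores), flagged


# Stand-in scorer: a lookup table instead of a real NLI model (hypothetical data).
TRUE_STATEMENTS = {
    "apple is a kind of fruit.",
    "carrot is a kind of vegetable.",
}


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Pretend entailment probability: high for known-true statements, low otherwise."""
    return 0.9 if hypothesis in TRUE_STATEMENTS else 0.1


edges = [("apple", "fruit"), ("carrot", "vegetable"), ("leek", "fruit")]
mean_score, flagged = check_edges(edges, toy_entail_prob)
print("mean entailment score:", mean_score)
print("edges to review:", flagged)
```

Keeping the scorer injectable separates the taxonomy traversal from the model call, so the same loop can be reused with a different NLI model or prompt template.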

Topics

Taxonomy Evaluation
Reference-Free Metrics
Natural Language Inference
Semantic Similarity
Taxonomy Robustness
Logical Adequacy

Code references

nichtich/wikidata-taxonomy

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.