# I Got Tired of Saying “This Response Is Better” Without Being Able to Prove It

2026-03-20 · Source: Naturallanguageprocessing on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, medium

Summary

An AI evaluator developed a Python script to score and rank Large Language Model (LLM) responses objectively, moving beyond subjective "vibes." The script evaluates responses across five dimensions: Relevance, Accuracy, Clarity, Completeness, and Conciseness. Scores are assigned manually before the automated analysis runs. The script also automatically flags five common error patterns, each with a severity level: Unsupported Claim, Overconfidence, Vagueness, Unnecessary Refusal, and Repetition. The system was tested using prompts requiring domain expertise in geology and data/tech. The author emphasizes that automated metrics catch patterns, but human evaluators with domain knowledge are needed to catch nuanced meaning. The illustrating example is a case where a model's "vagueness" was actually an incorrect statement. The full script is available on GitHub.

Key takeaway

AI Engineers and Data Scientists building or evaluating LLMs should integrate both human domain expertise and automated error detection into their evaluation pipelines. Relying solely on automated metrics can miss subtle inaccuracies, while purely manual review lacks scalability and an objective vocabulary. An evaluation framework should give specific, data-backed reasons for response quality rather than subjective assessments, in support of robust model performance and reliability.

Key insights

Combining human domain expertise with automated pattern detection improves LLM evaluation accuracy and provides objective vocabulary for subjective judgments.

Principles

Vibes do not scale in AI evaluation.
Automated metrics catch patterns; human evaluators catch meaning.

Method

Manually score LLM responses on five dimensions (Relevance, Accuracy, Clarity, Completeness, Conciseness), then run an automated script to flag five error types (Unsupported Claim, Overconfidence, Vagueness, Unnecessary Refusal, Repetition) and compare results.
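The scoring half of this method can be sketched as follows. This is a minimal illustration, not the author's script: the 1 to 5 scale, the unweighted mean, and the names `ScoredResponse`, `rank_responses`, and `weakest_dimension` are all assumptions made for the example.

```python
from dataclasses import dataclass

# The five dimensions named in the article.
DIMENSIONS = ("relevance", "accuracy", "clarity", "completeness", "conciseness")


@dataclass
class ScoredResponse:
    """One LLM response with manually assigned scores (assumed 1-5 scale)."""
    label: str
    scores: dict  # dimension name -> manual score

    def mean_score(self) -> float:
        # Fail loudly if the human rater skipped a dimension.
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing scores for: {missing}")
        # Unweighted mean is an assumption; a real rubric might weight accuracy higher.
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def rank_responses(responses):
    """Return responses ordered from highest to lowest mean score."""
    return sorted(responses, key=lambda r: r.mean_score(), reverse=True)


def weakest_dimension(response):
    """Name the lowest-scoring dimension, giving a concrete reason for the ranking."""
    return min(DIMENSIONS, key=lambda d: response.scores[d])


# Hypothetical manual scores for two responses to the same prompt.
a = ScoredResponse("model_a", {"relevance": 5, "accuracy": 3, "clarity": 4,
                               "completeness": 4, "conciseness": 4})
b = ScoredResponse("model_b", {"relevance": 4, "accuracy": 5, "clarity": 4,
                               "completeness": 5, "conciseness": 3})

for r in rank_responses([a, b]):
    print(r.label, r.mean_score(), "weakest:", weakest_dimension(r))
```

Reporting the weakest dimension alongside the mean is one way to turn "this response is better" into a statement with a stated reason.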

In practice

Implement a multi-dimensional scoring rubric.
Automate detection of common LLM error patterns.
Integrate domain experts into evaluation workflows.
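Automated detection of the five error patterns could look roughly like the sketch below. The article does not show how the original script detects each pattern, so the keyword lists, the severity labels, and the function names `find_repetition` and `flag_errors` are hypothetical stand-ins chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical trigger phrases and severities; the article says each error type
# has a severity level but does not list the values used.
PATTERNS = {
    "Unsupported Claim": ("high", re.compile(
        r"\b(studies show|research shows|experts agree|it is well known)\b", re.I)),
    "Overconfidence": ("medium", re.compile(
        r"\b(definitely|undoubtedly|always|never|guaranteed)\b", re.I)),
    "Vagueness": ("low", re.compile(
        r"\b(it depends|various factors|in some cases|and so on)\b", re.I)),
    "Unnecessary Refusal": ("high", re.compile(
        r"\b(i cannot help|i can't help|i am unable to|as an ai)\b", re.I)),
}


def find_repetition(text):
    """Return sentences that appear more than once (case-insensitive)."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(sentences)
    return [s for s, n in counts.items() if n > 1]


def flag_errors(text):
    """Return a list of flags, each with the error name, severity, and evidence."""
    flags = []
    for name, (severity, pattern) in PATTERNS.items():
        hits = sorted({m.group(0).lower() for m in pattern.finditer(text)})
        if hits:
            flags.append({"error": name, "severity": severity, "evidence": hits})
    repeated = find_repetition(text)
    if repeated:
        flags.append({"error": "Repetition", "severity": "low", "evidence": repeated})
    return flags


# Hypothetical response text used only to exercise the detector.
sample = ("Studies show that basalt always forms at the surface. "
          "It depends on various factors. It depends on various factors.")

for flag in flag_errors(sample):
    print(flag["severity"].upper(), flag["error"], flag["evidence"])
```

Keyword matching of this kind only sees surface patterns. It cannot tell whether a flagged sentence is true, which is why each flag is best treated as a prompt for a reviewer with domain knowledge rather than as a verdict.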

Topics

LLM Evaluation
Automated Evaluation
Error Detection
Domain Expertise
Hallucination Detection

Code references

renataennes/LLM-response

Best for: AI Engineer, Machine Learning Engineer, Data Scientist


Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.