LLMbench: A Comparative Close Reading Workbench for Large Language Models

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

LLMbench is a browser-based workbench for the comparative "close reading" of large language model (LLM) outputs. In contrast to quantitative evaluation tools such as Google PAIR's LLM Comparator, it is oriented towards digital humanities practice and supports side-by-side annotation of two model responses to the same prompt. It provides four analytical overlays: Probabilities (token-level log-probability inspection), Differences (word-level diffing), Tone (Hyland-style metadiscourse analysis), and Structure (sentence-level parsing with discourse connective highlighting). It also includes five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence) that visualize the probabilistic structure of generated text through continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains. The paper describes the tool's architecture, its modes, and its design rationale, and argues that log-probability data is important for critical studies of generative AI models.

Key takeaway

For AI scientists and digital humanists analyzing LLM behavior, LLMbench offers a way to study generated text that goes beyond aggregate quantitative metrics. Its overlays and modes expose the probabilistic structure of model outputs, showing how small prompt changes or different temperature settings alter the generated text. That gives a richer basis for interpreting and critically assessing a model.

Key insights

LLMbench enables hermeneutic analysis of LLM outputs by visualizing token-level probabilities and textual differences.

Principles

Treat generated text as a research object.
Log-probability data is crucial for critical AI studies.

Method

LLMbench places two LLM responses to the same prompt side by side and offers analytical overlays (Probabilities, Differences, Tone, Structure) and analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, Cross-Model Divergence) that visualize the probabilistic structure of the generated text.
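
To make the role of log-probability data concrete, here is a minimal Python sketch of the kind of per-token quantities a probability heatmap or entropy sparkline could be built from. It is not LLMbench's implementation: the `token_stats` helper, the input layout (a chosen token's log-probability plus the log-probabilities of its top-k alternatives), and the sample values are all assumptions made for illustration. The entropy is computed over the renormalized top-k alternatives only, so it approximates the entropy of the full distribution.

```python
import math

def token_stats(chosen_logprob, top_logprobs):
    """Return (probability of the chosen token, entropy in bits).

    Entropy is taken over the top-k alternatives after renormalizing
    them to sum to 1, so it is an approximation of the full distribution.
    """
    p_chosen = math.exp(chosen_logprob)
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    entropy = -sum((p / total) * math.log2(p / total) for p in probs if p > 0)
    return p_chosen, entropy

# Hypothetical response: (token, chosen log-prob, top-k log-probs).
response = [
    ("The", math.log(0.9), [math.log(0.9), math.log(0.05), math.log(0.05)]),
    ("cat", math.log(0.5), [math.log(0.5), math.log(0.5)]),
]

for token, logprob, top in response:
    p, h = token_stats(logprob, top)
    print(f"{token!r}: p={p:.2f}, entropy={h:.2f} bits")
```

Low-probability, high-entropy tokens are the points where the model was least committed to the word it produced, which is where a close reading is likely to be most informative.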

In practice

Inspect token-level log-probabilities.
Analyze word-level differences between outputs.
Visualize counterfactual text histories.
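
As an illustration of the word-level comparison listed above, the following Python sketch diffs two responses using the standard library's `difflib.SequenceMatcher`. This is one common way to compute such a diff and is not necessarily the algorithm LLMbench uses; the `word_diff` helper and the example sentences are hypothetical.

```python
import difflib

def word_diff(a, b):
    """Return the non-matching spans between two texts, word by word.

    Each entry is (operation, words_from_a, words_from_b), where the
    operation is 'replace', 'delete', or 'insert'.
    """
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
            )
    return changes

# Hypothetical outputs from two models answering the same prompt.
model_a = "the cat sat"
model_b = "the dog sat"
for change in word_diff(model_a, model_b):
    print(change)
```

Splitting on whitespace keeps the sketch short; a real tool would likely need a tokenizer that handles punctuation and casing.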

Topics

LLMbench
Comparative LLM Analysis
Digital Humanities
Log-Probability Data
Generative AI Critical Studies

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.