LLMbench: A Comparative Close Reading Workbench for Large Language Models
Summary
LLMbench is a new browser-based workbench for the comparative "close reading" of large language model (LLM) outputs, in contrast to quantitative evaluation tools such as Google PAIR's LLM Comparator. It is oriented towards digital humanities practice and supports side-by-side annotation of two model responses to the same prompt. Four analytical overlays are available: Probabilities for token-level log-probability inspection, Differences for word-level diffing, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting. LLMbench also includes five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence), which visualize the probabilistic structure of generated text through continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains. The paper describes the tool's architecture, its modes, and its design rationale, and argues that log-probability data is important for critical studies of generative AI models.
Key takeaway
For AI scientists and digital humanists analyzing LLM behavior, LLMbench offers a way to study generated text that goes beyond aggregate quantitative metrics. Its overlays and modes expose the probabilistic structure of model outputs, and they can show how small prompt changes or different temperature settings alter the generated text, giving more context for interpreting and critically assessing a model.
Key insights
LLMbench enables hermeneutic analysis of LLM outputs by visualizing token-level probabilities and textual differences.
Principles
- Treat generated text as a research object.
- Log-probability data is crucial for critical AI studies.
Method
LLMbench compares two LLM responses side-by-side, offering analytical overlays (Probabilities, Differences, Tone, Structure) and modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, Cross-Model Divergence) to visualize probabilistic text generation.
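As a rough illustration of what a per-token probability view and an entropy sparkline involve, the sketch below computes Shannon entropy from a token's top-k log-probabilities and renders one bar per token. It is not LLMbench's implementation: the function names, the data layout, and the sample values are hypothetical, and the entropy is taken over the renormalized top-k candidates only, which understates the entropy of the full vocabulary distribution.

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy in bits over the renormalized top-k candidates.

    `top_logprobs` maps candidate token -> natural-log probability
    (hypothetical layout; real APIs differ in how they return this).
    """
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

def sparkline(values):
    """Map non-negative values to block characters, scaled to the maximum."""
    if not values:
        return ""
    bars = "▁▂▃▄▅▆▇█"
    top = max(values) or 1.0  # avoid division by zero when all values are 0
    return "".join(bars[int(v / top * (len(bars) - 1))] for v in values)

# Invented sample data: one confident token, one evenly split token.
sample = [
    {"The": math.log(0.9), "A": math.log(0.1)},
    {" cat": math.log(0.5), " dog": math.log(0.5)},
]
entropies = [token_entropy(t) for t in sample]
line = sparkline(entropies)
print(entropies, line)
```

Higher bars mark positions where the model's candidates were more evenly weighted, which is the kind of position a close reading would want to inspect.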
In practice
- Inspect token-level log-probabilities.
- Analyze word-level differences between outputs.
- Visualize counterfactual text histories.
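Word-level comparison of two outputs can be approximated with Python's standard `difflib` module. This is a minimal sketch under stated assumptions, not the algorithm LLMbench uses: the `word_diff` helper is hypothetical, it tokenizes on whitespace only, and the two sample sentences are invented.

```python
import difflib

def word_diff(a, b):
    """Return the non-matching spans between two texts, compared word by word.

    Each result is (operation, words_from_a, words_from_b), where operation
    is "replace", "delete", or "insert" as reported by SequenceMatcher.
    """
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # keep only the spans where the outputs diverge
        changes.append((tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return changes

# Invented example: two responses that differ in epistemic stance.
response_a = "the model may suggest a result"
response_b = "the model clearly shows a result"
changes = word_diff(response_a, response_b)
print(changes)
```

Here the only divergence is the hedged phrase in one response against the assertive phrase in the other, the sort of difference that a tone-oriented reading would then examine.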
Topics
- LLMbench
- Comparative LLM Analysis
- Digital Humanities
- Log-Probability Data
- Generative AI Critical Studies
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.