LLMbench: A Comparative Close Reading Workbench for Large Language Models
Summary
LLMbench is a new browser-based workbench for the comparative close reading of large language model (LLM) outputs. Unlike existing quantitative evaluation tools such as Google PAIR's LLM Comparator, it is built around the hermeneutic practices of the digital humanities: two model responses to a single prompt are presented side by side in annotatable panels. It includes four analytical overlays: Probabilities for token-level log-probability inspection, Differences for word-level diff, Tone for metadiscourse analysis, and Structure for sentence parsing with discourse connective highlighting. LLMbench also features five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence) that visualize the probabilistic structure of generated text. The tool treats generated text as a research object, offering visualizations such as continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains to illustrate the counterfactual history of each word.
Key takeaway
For digital humanities researchers and AI ethicists analyzing LLM outputs, LLMbench offers a qualitative complement to quantitative metrics. Its analytical overlays and modes expose the probabilistic nature of generated text, which can inform critical studies of generative AI models. By showing how LLMs construct responses, the tool reveals the "could have been otherwise" aspect of their output.
Key insights
LLMbench enables close reading of LLM outputs by visualizing probabilistic text generation for humanistic analysis.
Principles
- Generated text is a research object.
- Log-probability data is critical for AI studies.
Method
LLMbench compares two LLM outputs side-by-side using analytical overlays (Probabilities, Differences, Tone, Structure) and modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, Cross-Model Divergence) to visualize token-level probabilities and counterfactual histories.
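To make the idea of token-level probabilities concrete, the following is a minimal sketch of how a per-token entropy series (the kind of data an entropy sparkline could plot) might be derived from log-probabilities. It is not LLMbench's implementation: the `positions` data, its field names, and the `topk_entropy` helper are hypothetical, and the sketch assumes each position comes with log-probabilities for a handful of candidate tokens, renormalized over just those candidates.

```python
import math


def topk_entropy(top_logprobs):
    """Shannon entropy in bits of a candidate distribution.

    Assumption: `top_logprobs` holds natural-log probabilities for the top-k
    candidates only, so we renormalize them to sum to 1 before computing entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical example data: the chosen token plus candidate log-probabilities.
positions = [
    {"token": "The", "top_logprobs": {"The": math.log(0.9), "A": math.log(0.1)}},
    {"token": "cat", "top_logprobs": {"cat": math.log(0.5), "dog": math.log(0.5)}},
]

# One entropy value per token position; higher means more plausible alternatives.
entropies = [topk_entropy(list(p["top_logprobs"].values())) for p in positions]

for p, h in zip(positions, entropies):
    print(f"{p['token']:>5}  entropy = {h:.3f} bits")
```

Low entropy marks a position where the model was nearly certain; high entropy marks a position where the text "could have been otherwise".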
In practice
- Inspect token-level log-probabilities.
- Analyze word-level differences.
- Visualize text generation probabilities.
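As an illustration of the word-level comparison listed above, here is a small sketch using Python's standard `difflib` module. It is an assumption about how such a diff could be computed, not a description of how LLMbench's Differences overlay works; the `word_diff` helper and the two sample responses are invented for the example.

```python
from difflib import SequenceMatcher


def word_diff(a: str, b: str):
    """Return (tag, words_from_a, words_from_b) spans for two texts.

    Texts are split on whitespace, so punctuation stays attached to words.
    Tags are difflib's own: 'equal', 'replace', 'delete', 'insert'.
    """
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    return [
        (tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Hypothetical outputs from two models for the same prompt.
diff = word_diff(
    "The model writes a careful answer",
    "The model writes a confident answer",
)

for tag, left, right in diff:
    print(f"{tag:>8}: {left!r} -> {right!r}")
```

The `replace` span isolates the single word on which the two responses diverge, which is the kind of unit a close reader would then annotate.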
Topics
- LLMbench
- Large Language Models
- Digital Humanities
- Log-Probability Data
- Comparative LLM Analysis
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.