Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
Summary
This research introduces an interactive visualization tool designed to help users understand the distributional structure of language model (LM) generations, rather than judging a model from single outputs. LM outputs can show unexpected homogeneity, mode collapse, or inconsistency across responses, all of which are hard to see from individual samples. A formative study with 13 LM researchers found that they reason about LM behavior in distributional terms but that current tools do not support this. The tool represents multiple LM generations as overlapping paths through a text graph, highlighting shared structure, branching points, and clusters while retaining access to the raw outputs. Three crowdsourced user studies, with 47, 44, and 40 participants respectively, compared the tool against a plain list view on diversity comparison, single-distribution comprehension, and two-distribution comparison. Results indicate that graph summaries improve structural judgments such as assessing diversity, whereas direct inspection of outputs remains better for detail-oriented questions, which suggests that a hybrid workflow is most effective.
Key takeaway
For research scientists evaluating language model outputs, relying only on single generations or raw text lists can hide distributional behaviors such as mode collapse or unexpected homogeneity. Consider adding a tool like this one to your workflow to get a "bird's-eye view" of output distributions: use the graph visualization for high-level pattern recognition and diversity assessment, and switch to the raw text list for fine-grained inspection and verification. The study results suggest that this hybrid approach supports more confident prompt iteration and model evaluation.
Key insights
Visualizing language model output distributions as interactive graphs reveals hidden structures and improves diversity assessment.
Principles
- Single LM outputs are misleading for assessing model behavior.
- Hybrid interfaces combining graph summaries and raw text are optimal.
- Visualization effectiveness depends on text distribution structure.
Method
The tool builds a merged token graph from a set of LM outputs in four steps: it tokenizes each output, adds directed edges between consecutive tokens, merges semantically similar tokens, and collapses unbranched chains. A D3 force simulation then lays out the graph, balancing reading order against structural visibility.
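The graph construction can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several simplifying assumptions: tokens are split on whitespace, tokens are merged only when they are identical and share the same prefix (the tool merges semantically similar tokens, which this sketch does not attempt), and layout is omitted entirely. The function names `build_prefix_graph` and `collapse_chains` are hypothetical.

```python
from collections import defaultdict


def build_prefix_graph(outputs):
    """Merge generations into a prefix graph.

    Assumption: a node is identified by the tuple of tokens leading to it,
    so only identical tokens with identical prefixes are merged.
    """
    children = defaultdict(set)  # node -> set of child nodes
    counts = defaultdict(int)    # node -> generations passing through it
    ends = defaultdict(int)      # node -> generations that end at it
    root = ()
    for text in outputs:
        node = root
        counts[root] += 1
        for token in text.split():  # whitespace tokenization (assumption)
            child = node + (token,)
            children[node].add(child)  # directed edge node -> child
            counts[child] += 1
            node = child
        ends[node] += 1
    return children, counts, ends


def collapse_chains(children, counts, ends, root=()):
    """Collapse unbranched chains into single phrase edges.

    Returns a list of (parent_node, phrase, child_node, count) tuples.
    """
    edges = []
    stack = [root]
    while stack:
        start = stack.pop()
        for child in sorted(children[start]):
            phrase = [child[-1]]
            node = child
            # Extend the chain while there is exactly one continuation
            # and no generation stops at the current node.
            while len(children[node]) == 1 and ends[node] == 0:
                node = next(iter(children[node]))
                phrase.append(node[-1])
            edges.append((start, " ".join(phrase), node, counts[node]))
            stack.append(node)
    return edges


outputs = ["the cat sat down", "the cat ran away", "the dog sat down"]
children, counts, ends = build_prefix_graph(outputs)
for parent, phrase, child, count in collapse_chains(children, counts, ends):
    print(f"{' '.join(parent) or '<start>'} -> {phrase!r} (x{count})")
```

Because this sketch merges only shared prefixes, the phrase "sat down" appears on two separate branches; merging similar tokens across branches, as the tool does, would let those paths rejoin.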
In practice
- Use the tool to identify mode collapse or repetitive patterns.
- Filter graph views by selecting nodes to focus on specific phrases.
- Compare output distributions across different prompts or models.
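The filtering and comparison steps above can be approximated on raw text with a short Python sketch. It is an illustration under stated assumptions, not the tool's interface: selecting a node is modeled as keeping the generations that contain a chosen phrase as a contiguous run of whitespace-separated tokens, and the helper names `contains_phrase`, `filter_by_phrase`, and `phrase_share` are hypothetical.

```python
def contains_phrase(text, phrase):
    """True if `phrase` occurs in `text` as a contiguous token sequence."""
    tokens, target = text.split(), phrase.split()
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def filter_by_phrase(outputs, phrase):
    """Keep only the generations that pass through the selected phrase."""
    return [text for text in outputs if contains_phrase(text, phrase)]


def phrase_share(outputs, phrase):
    """Fraction of generations that contain the phrase (0.0 if no outputs)."""
    if not outputs:
        return 0.0
    return len(filter_by_phrase(outputs, phrase)) / len(outputs)


# Made-up generations for two prompts, for illustration only.
prompt_a = ["the cat sat", "the cat ran", "a dog sat"]
prompt_b = ["the cat sat", "the cat sat", "the cat sat"]

print(filter_by_phrase(prompt_a, "the cat"))
print(phrase_share(prompt_a, "the cat"), phrase_share(prompt_b, "the cat"))
```

A phrase share close to 1.0 for one prompt and much lower for another is a rough signal that the first distribution is more concentrated, which is the kind of pattern the graph view is meant to make visible at a glance.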
Topics
- Large Language Models
- Human-AI Interaction
- Text Visualization
- Output Distribution
- Prompt Engineering
Best for: Research Scientist, AI Scientist, Prompt Engineer, Product Designer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.