HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

Tree-of-Writing (ToW) is a new framework for evaluating how close large language models (LLMs) come to human-level writing, particularly on long-form and open-ended texts. ToW explicitly models the aggregation weights of sub-features in a tree-structured workflow, which resolves inconsistencies often found in LLM-as-a-judge methods. Alongside ToW, the authors introduce HoWToBench, a Chinese-language benchmark with 1302 instructions across 12 genres and three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW reached a Pearson correlation of 0.93 with human judgments, which the authors take as evidence that it mitigates judge biases. The research also found that traditional overlap-based metrics and current LLM-as-a-judge practices are vulnerable to textual disturbances, whereas ToW remains robust. Finally, input length correlated negatively with content-related scores in the Guide task, indicating that supplying more input information does not by itself improve performance.
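To make the reported agreement figure concrete, here is a minimal sketch of how a Pearson correlation between human ratings and automatic scores is computed. The function name `pearson` and the score lists are illustrative assumptions; the numbers are made up and are not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five texts (assumption, NOT data from the paper).
human_scores = [6.0, 7.5, 4.0, 9.0, 5.5]
judge_scores = [6.5, 7.0, 4.5, 8.5, 5.0]

r = pearson(human_scores, judge_scores)
print(f"Pearson r = {r:.3f}")
```

The same function could be applied to input lengths and content-related scores to check for the kind of negative correlation described for the Guide task.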

Key takeaway

For research scientists evaluating LLM writing, the Tree-of-Writing (ToW) framework is a candidate replacement for overlap-based metrics and single-pass LLM-as-a-judge scoring, especially on long-form and open-ended generation tasks. In the reported experiments it correlates more closely with human judgment (Pearson 0.93) and holds up under textual disturbances that degrade the other methods, so it should give a more dependable picture of a model's writing proficiency.

Key insights

Tree-of-Writing (ToW) offers a robust, tree-structured method for evaluating LLM writing, correlating highly with human judgment.

Principles

Explicitly model sub-feature aggregation weights.
Overlap-based metrics and current LLM-as-a-judge practices are vulnerable to textual disturbances.
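One way to see why overlap-based metrics can miss a textual disturbance is the sketch below. It uses a simple unigram F1 score as a stand-in for overlap metrics in general, and word-order reversal as the disturbance. Both are illustrative assumptions chosen here; the source does not say which metrics or disturbances the paper tested.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping unigram counts (a simplified overlap metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from the benchmark.
reference = "the river rose slowly and the village waited"
original = "the river rose slowly and the village waited"
# Disturbance: reverse the word order, which destroys readability.
scrambled = " ".join(reversed(original.split()))

print(unigram_f1(original, reference))
print(unigram_f1(scrambled, reference))
```

Because the metric only counts shared words, the scrambled text scores the same as the fluent one, which is the kind of blind spot a robust evaluator needs to avoid.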

Method

Tree-of-Writing (ToW) uses a tree-structured workflow that explicitly models how sub-feature scores are weighted and aggregated into an overall evaluation, which mitigates judge biases and improves correlation with human judgments.
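The general idea of weighted aggregation over a tree can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the feature names, the weights, and the leaf scores are all hypothetical, and the source does not say how ToW actually derives its weights.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    weight: float = 1.0            # weight relative to siblings (hypothetical)
    score: Optional[float] = None  # set on leaves only, e.g. by a judge model
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return the leaf score, or the weighted mean of the children's scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} have no positive weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


# Hypothetical feature tree; names, weights, and scores are made up.
tree = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, children=[
        Node("fluency", weight=0.7, score=9.0),
        Node("style", weight=0.3, score=7.0),
    ]),
])

print(round(aggregate(tree), 2))
```

Keeping the weights on the tree makes the contribution of each sub-feature to the final score explicit, instead of leaving it implicit in a single judge prompt.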

In practice

Use ToW for holistic LLM writing evaluation.
Do not expect longer inputs alone to raise content-related scores.

Topics

HoWToBench
Tree-of-Writing
LLM Writing Evaluation
Human-level Writing
Chinese Language Models

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.