HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
Summary
This paper introduces HoWToBench, a large-scale Chinese writing benchmark, and Tree-of-Writing (ToW), a novel evaluation framework for assessing the human-level writing capabilities of Large Language Models (LLMs). HoWToBench comprises 12 genres and 1,302 instructions across three task categories (contextual completion, outline-guided writing, and open-ended generation), curated from expert-written sources with a 96.85% quality pass rate. ToW addresses the "Negotiation Inconsistency" in LLM-as-a-judge methods by explicitly modeling the aggregation weights of sub-features in a tree-structured workflow, achieving a 0.93 Pearson correlation with human judgments. The framework is robust to textual disturbances, unlike traditional overlap-based metrics and popular LLM-as-a-judge practices. The study also finds a negative correlation between input length and content-related scores in guided tasks, indicating that simply adding input information does not improve nuanced writing quality.
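The human-alignment figure above is a Pearson correlation between evaluator scores and human scores. As a rough illustration of how such a number is computed, here is a minimal sketch; the function name `pearson` and the score lists are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four essays: human raters vs. an automatic judge.
human_scores = [6.0, 7.5, 8.0, 5.0]
judge_scores = [6.5, 7.0, 8.5, 5.5]
print(f"Pearson r = {pearson(human_scores, judge_scores):.3f}")
```

A value near 1 means the automatic scores rise and fall with the human scores; it does not by itself show that the two use the same absolute scale.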
Key takeaway
For research scientists evaluating LLM writing, adopting the Tree-of-Writing (ToW) framework and the HoWToBench benchmark can improve the reliability and human alignment of your assessments; the paper reports a 0.93 Pearson correlation with human judgments. Move beyond overlap-based metrics and implicit LLM-as-a-judge scoring by modeling evaluation criteria and their aggregation weights explicitly, especially for open-ended and nuanced writing tasks. This approach should give more robust and interpretable insights into model performance across diverse genres and task complexities.
Key insights
ToW and HoWToBench offer a robust, human-aligned framework for evaluating LLM writing beyond simple instruction-following.
Principles
- Explicitly model evaluation sub-feature aggregation weights.
- Human-level writing evaluation requires multi-dimensional assessment.
- Longer inputs do not guarantee higher quality in nuanced writing.
Method
Tree-of-Writing (ToW) simulates human judgment via a tree-structured workflow, assigning explicit weights to content, format, and impression nodes, then aggregating scores through depth-first traversal.
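The aggregation step described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: the `Node` class, the weighted-average rule, the sub-feature names under "content", and every weight and score are hypothetical. Only the three top-level nodes (content, format, impression) and the depth-first traversal come from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One evaluation criterion; leaves hold a score, inner nodes hold children."""
    name: str
    weight: float = 1.0            # explicit weight relative to sibling nodes
    score: Optional[float] = None  # set on leaf nodes only
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Depth-first (post-order) weighted average of the scores below a node."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of '{node.name}' need positive total weight")
    # Each child is fully resolved (recursively) before its parent is computed.
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


# Hypothetical tree: all weights, scores, and sub-features are made up.
tree = Node("overall", children=[
    Node("content", weight=0.5, children=[
        Node("relevance", weight=0.6, score=8.0),
        Node("depth", weight=0.4, score=6.0),
    ]),
    Node("format", weight=0.2, score=9.0),
    Node("impression", weight=0.3, score=7.0),
])

print(f"overall score: {aggregate(tree):.2f}")
```

Keeping the weights as explicit fields makes each parent score traceable to its children, which is the property the method relies on when it avoids implicit, in-prompt aggregation by the judge model.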
In practice
- Use ToW for more reliable LLM writing evaluation.
- Consider HoWToBench for diverse Chinese writing tasks.
- Prioritize input quality over input length in guided writing tasks, since longer inputs did not raise content-related scores.
Topics
- Tree-of-Writing Framework
- LLM Writing Evaluation
- HoWToBench Benchmark
- Human Judgment Correlation
- Text Generation Robustness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.