HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
Summary
Tree-of-Writing (ToW) is a new framework for evaluating the human-level writing capabilities of large language models (LLMs), particularly on long-form and open-ended texts. ToW explicitly models the aggregation weights of sub-features in a tree-structured workflow, which resolves inconsistencies often found in LLM-as-a-judge methods. Alongside ToW, the authors introduce HoWToBench, a Chinese-language benchmark of 1,302 instructions spanning 12 genres and three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW achieved a 0.93 Pearson correlation with human judgments, demonstrating its effectiveness in mitigating biases. The research also found that traditional overlap-based metrics and current LLM-as-a-judge practices are vulnerable to textual disturbances, whereas ToW is robust to them. Additionally, input length was negatively correlated with content-related scores in the Guide task, indicating that simply supplying more input information does not improve performance.
Key takeaway
For research scientists evaluating large language model writing, adopting the Tree-of-Writing (ToW) framework can significantly improve the accuracy and robustness of assessments. Your team should consider integrating ToW to overcome the limitations of traditional metrics and existing LLM-as-a-judge methods, especially for long-form and open-ended generation tasks. This approach offers a more reliable correlation with human judgment and resilience against textual disturbances, providing a clearer picture of an LLM's true writing proficiency.
Key insights
Tree-of-Writing (ToW) offers a robust, tree-structured method for evaluating LLM writing, correlating highly with human judgment.
Principles
- Explicitly model sub-feature aggregation weights.
- Treat overlap-based metrics and current LLM-as-a-judge practices as vulnerable to textual disturbances.
Method
Tree-of-Writing (ToW) uses a tree-structured workflow that explicitly models the aggregation weights of sub-features, combining them into an overall text evaluation. This mitigates biases and improves correlation with human judgments.
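The idea of aggregating weighted sub-features through a tree can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the `aggregate` function, the feature names, the weights, and the scores below are all hypothetical, and the normalized weighted average is only one plausible aggregation rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One feature in a hypothetical evaluation tree."""
    name: str
    weight: float = 1.0             # weight relative to sibling nodes (assumed)
    score: Optional[float] = None   # set on leaves only, e.g. by a judge model
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return the node's score: a leaf's own score, or the
    weight-normalized average of its children's aggregated scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} have no positive weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


# Illustrative tree; feature names, weights, and scores are invented.
tree = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, children=[
        Node("fluency", weight=0.75, score=9.0),
        Node("style", weight=0.25, score=5.0),
    ]),
])

overall = aggregate(tree)
print(f"overall score: {overall:.2f}")
```

Making the weights explicit at every level keeps each sub-feature's contribution to the final score inspectable, rather than leaving the combination to a single opaque judgment.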
In practice
- Use ToW for holistic LLM writing evaluation.
- Do not expect longer inputs alone to raise content-related scores.
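Both the agreement with human judgments and the input-length effect are reported as correlations, so a team adopting this kind of evaluation may want to run the same check on its own data. The following is a minimal sketch of the Pearson coefficient in plain Python. The `pearson` helper is an illustrative name, and the input lengths and content scores are invented numbers, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented example data: input length in tokens vs. a content score.
input_lengths = [100, 200, 300, 400]
content_scores = [8.0, 7.0, 6.5, 5.0]

r = pearson(input_lengths, content_scores)
print(f"length vs. content score: r = {r:.2f}")
```

A negative `r` on real data would mirror the pattern described for the Guide task. The same function can compare automatic scores against human ratings.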
Topics
- HoWToBench
- Tree-of-Writing
- LLM Writing Evaluation
- Human-level Writing
- Chinese Language Models
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.