Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?
Summary
Hugging Face recently released BLOOM, a 176B-parameter multilingual large language model developed through an open collaboration of over 1000 researchers. This analysis evaluates BLOOM's real-world performance across seven application categories rather than relying on traditional academic benchmarks, which suffer from flaws such as mislabeled datasets and fail to reflect practical use cases like marketing copy generation. The human evaluation, conducted by "Surgers," assessed BLOOM's zero-shot and few-shot capabilities in seven tasks: categorizing toxic speech, creative writing, question answering, writing marketing copy, named entity recognition, coding, and mathematics. BLOOM showed some surprising strengths, particularly in coding and basic math. Its performance in areas like toxicity classification and creative writing, however, was inconsistent and often fell short of expectations for a model of its scale, indicating a gap between benchmark scores and practical utility.
Key takeaway
AI engineers and NLP engineers evaluating new large language models should prioritize human evaluation over sole reliance on academic benchmarks. A model might perform worse on traditional benchmarks yet be superior in practice. Consider designing diverse prompts that test creativity, humor, and real-world application scenarios, and give human evaluators clear context about the use case so that their feedback is nuanced, relevant, and actionable.
Key insights
Human evaluation is crucial for assessing large language models' real-world performance beyond flawed academic benchmarks.
Principles
- Academic benchmarks often contain hidden flaws.
- Benchmarks may not reflect real-world LLM applications.
- Human evaluation captures creativity and nuance.
Method
In the human evaluation, "Surgers" wrote prompts for BLOOM across seven application categories and rated its performance on zero-shot and few-shot examples with Good/Okay/Bad labels.
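The labeling scheme described above can be tallied with a few lines of Python. This is a minimal sketch, not the article's tooling: the `ratings` list is invented placeholder data that does not reflect the article's results, and the names `tally` and `good_share` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter, defaultdict

# Labels named in the summary.
LABELS = ("Good", "Okay", "Bad")

# Placeholder ratings: (category, prompt mode, label).
# Invented for illustration; NOT results from the article.
ratings = [
    ("coding", "zero-shot", "Good"),
    ("coding", "few-shot", "Okay"),
    ("creative writing", "zero-shot", "Okay"),
    ("creative writing", "few-shot", "Bad"),
    ("creative writing", "few-shot", "Good"),
]


def tally(rows):
    """Count labels for each (category, mode) pair."""
    counts = defaultdict(Counter)
    for category, mode, label in rows:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[(category, mode)][label] += 1
    return counts


def good_share(counter):
    """Fraction of ratings labeled Good; 0.0 when there are no ratings."""
    total = sum(counter.values())
    return counter["Good"] / total if total else 0.0


if __name__ == "__main__":
    for (category, mode), counter in sorted(tally(ratings).items()):
        print(f"{category:20s} {mode:10s} Good share: {good_share(counter):.2f}")
```

Keeping zero-shot and few-shot counts separate makes it possible to see whether adding examples to the prompt changes how evaluators rate a category.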
In practice
- Prioritize human evaluation when assessing LLM performance.
- Design prompts that test creativity and humor.
- Provide human evaluators with use-case context.
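One way to act on the last point is to attach the use-case context to each item an evaluator sees. The sketch below assumes a simple record type; `EvalTask`, its fields, and `render_instructions` are hypothetical names introduced here, and the example task is invented rather than taken from the article.

```python
from dataclasses import dataclass


@dataclass
class EvalTask:
    """One item shown to a human evaluator (hypothetical structure)."""
    category: str       # e.g. "marketing copy"
    mode: str           # "zero-shot" or "few-shot"
    prompt: str         # text sent to the model
    model_output: str   # text the model returned
    use_case: str       # context telling the evaluator what "good" means here


def render_instructions(task: EvalTask) -> str:
    """Build the text an evaluator reads before choosing a label."""
    return "\n".join([
        f"Category: {task.category} ({task.mode})",
        f"Use case: {task.use_case}",
        f"Prompt: {task.prompt}",
        f"Model output: {task.model_output}",
        "Rate the output: Good / Okay / Bad",
    ])


if __name__ == "__main__":
    # Invented example task, for illustration only.
    task = EvalTask(
        category="marketing copy",
        mode="zero-shot",
        prompt="Write a tagline for a reusable water bottle.",
        model_output="Stay hydrated, skip the plastic.",
        use_case="Copy for a product page; it should be short and persuasive.",
    )
    print(render_instructions(task))
```

Stating the use case next to the output gives the evaluator a concrete standard to judge against, which is the kind of context the list above recommends.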
Topics
- BLOOM
- Large Language Models
- Human Evaluation
- Academic Benchmarks
- Natural Language Processing
Best for: AI Engineer, NLP Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.