Building LMs that can model the world and model themselves | ARC Prize @ MIT
Summary
Jacob Andreas, an associate professor at MIT, discusses factual inconsistency and metacognitive errors in large language models (LLMs), arguing that accuracy metrics alone miss both problems. He notes that current LLMs such as GPT-5 can answer complex trivia questions yet often reveal contradictory knowledge and overconfidence when probed with related questions. Andreas argues that models need to be coherent: they need both a robust "world model" (factual consistency) and a reliable "self model" (metacognition about their own knowledge and uncertainty). He presents two methods for optimizing coherence. The first is a self-supervised procedure that improves factual consistency by retraining a model on internally coherent subsets of its own generated facts. The second adjusts the reward function in reinforcement learning to penalize overconfidence. The discussion extends to current reasoning benchmarks such as ARC, suggesting that they should evaluate not just a model's predictions but also its ability to explain its solutions and the underlying algorithms, as Bongard problems do.
Key takeaway
For AI scientists and machine learning engineers developing advanced LLMs, focusing solely on predictive accuracy is insufficient. You should prioritize building models that exhibit strong internal coherence, both factually and in their self-assessment of knowledge. Integrate training procedures that explicitly reward consistent reasoning and calibrated confidence, and design evaluation benchmarks that demand explanations and underlying decision rules, not just correct outputs, to foster more reliable and human-compatible AI systems.
Key insights
LLMs require explicit training for internal coherence, encompassing both factual consistency and accurate self-assessment of knowledge.
Principles
- Coherence is as vital as accuracy for robust AI models.
- Models need both a "world model" and a "self model."
- Explanations are more valuable than mere predictions.
Method
Optimize factual consistency by prompting the LLM to generate facts, identifying subsets of those facts that are mutually consistent, and then retraining the model on these self-curated facts. Improve metacognition by modifying the RL reward function so that an incorrect answer given with low confidence is penalized less than one given with high confidence.
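The subset-selection step of the consistency method can be illustrated with a small sketch. Everything here is an assumption made for illustration and not the procedure from the talk: the function name `most_coherent_subset`, the example facts, their confidence scores, the hand-written contradiction pairs, and the choice to score a subset by summed confidence. In a real pipeline, the facts and the contradiction judgments would come from the model itself, and the selected subset would become fine-tuning data.

```python
from itertools import combinations


def most_coherent_subset(facts, contradictions):
    """Return the contradiction-free subset of facts with the highest total confidence.

    facts: dict mapping a fact string to a confidence in (0, 1] (hypothetical scores).
    contradictions: set of frozensets, each holding two mutually exclusive facts.
    Brute force over all subsets, so only suitable for small illustrative inputs.
    """
    names = sorted(facts)
    best, best_score = (), 0.0
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # Skip any subset that contains a contradictory pair.
            if any(frozenset(pair) in contradictions
                   for pair in combinations(subset, 2)):
                continue
            score = sum(facts[name] for name in subset)
            if score > best_score:
                best, best_score = subset, score
    return set(best)


# Illustrative data only: statements and scores are made up for this sketch.
facts = {
    "Paris is the capital of France": 0.9,
    "Lyon is the capital of France": 0.4,
    "France is in Europe": 0.8,
}
contradictions = {
    frozenset({"Paris is the capital of France",
               "Lyon is the capital of France"}),
}

selected = most_coherent_subset(facts, contradictions)
print(sorted(selected))
```

The brute-force search is exponential in the number of facts; it is used here only because it makes the objective (largest total confidence with no contradictory pair) easy to read.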
In practice
- Implement self-supervised factual consistency training.
- Adjust RL rewards for better confidence calibration.
- Design benchmarks to elicit explanations, not just answers.
Topics
- Language Model Coherence
- Factual Inconsistency
- Metacognition Errors
- World Models
- Self Models
Best for: AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ARC Prize.