Building LMs that can model the world and model themselves | ARC Prize @ MIT

· Source: ARC Prize · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

Jacob Andreas, an associate professor at MIT, discusses factual inconsistency and metacognitive errors in large language models (LLMs), two failure modes that simple accuracy metrics do not capture. He observes that current LLMs such as GPT-5 can answer complex trivia questions yet often reveal contradictory knowledge and overconfidence when probed with related questions. Andreas argues that models need to be coherent in two senses: they need a robust "world model" (factual consistency) and a reliable "self model" (metacognition about their own knowledge and uncertainty). He presents two methods for optimizing coherence. The first is a self-supervised procedure that improves factual consistency by retraining a model on internally coherent subsets of its own generated facts. The second adjusts the reward function used in reinforcement learning so that overconfidence is penalized. The discussion extends to reasoning benchmarks such as ARC, which he suggests should evaluate not only a model's predictions but also its ability to explain its solutions and the underlying algorithms, as in Bongard problems.

Key takeaway

For AI scientists and machine learning engineers developing advanced LLMs, focusing solely on predictive accuracy is insufficient. You should prioritize building models that exhibit strong internal coherence, both factually and in their self-assessment of knowledge. Integrate training procedures that explicitly reward consistent reasoning and calibrated confidence, and design evaluation benchmarks that demand explanations and underlying decision rules, not just correct outputs, to foster more reliable and human-compatible AI systems.
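One way to picture a reward that favors calibrated confidence is the minimal sketch below. It assumes the model reports a confidence in [0, 1] alongside each answer and subtracts a Brier-style squared-error penalty from the correctness reward. The function name `calibrated_reward` and the specific penalty are illustrative assumptions; the summary states only that the reward is adjusted to penalize overconfidence.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier-style penalty on stated confidence.

    Assumption: this penalty form is illustrative, not taken from the talk.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    correct = 1.0 if is_correct else 0.0
    # Squared gap between stated confidence and the actual outcome.
    brier_penalty = (confidence - correct) ** 2
    return correct - brier_penalty


# A confidently wrong answer scores lower than a hesitantly wrong one.
confidently_wrong = calibrated_reward(False, 0.9)
hesitantly_wrong = calibrated_reward(False, 0.1)
confidently_right = calibrated_reward(True, 0.9)
```

Under this sketch a wrong answer never earns positive reward, but expressing low confidence in it costs far less than expressing high confidence, while a correct answer is rewarded most when the stated confidence is high.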

Key insights

LLMs require explicit training for internal coherence, encompassing both factual consistency and accurate self-assessment of knowledge.

Method

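Optimize factual consistency by prompting the LLM to generate facts, identifying internally coherent subsets of those facts, and retraining the model on the self-curated, consistent subset. Improve metacognition by modifying the RL reward function so that overconfidence is penalized: an incorrect answer given with low confidence should cost less than the same incorrect answer given with high confidence.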
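The subset-selection step can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure from the talk: each generated fact carries a score (for example, the model's own likelihood), a hypothetical `contradicts` callable decides whether two statements conflict, and a greedy pass keeps the highest-scoring facts that do not conflict with anything already kept. How coherent subsets are actually identified is not specified in this summary.

```python
from typing import Callable


def select_coherent_subset(
    facts: list[tuple[str, float]],
    contradicts: Callable[[str, str], bool],
) -> list[str]:
    """Greedily keep high-scoring facts that do not conflict with kept ones.

    Assumption: greedy selection and the `contradicts` oracle are
    hypothetical stand-ins for whatever the real procedure uses.
    """
    kept: list[str] = []
    # Visit facts from highest to lowest score.
    for statement, _score in sorted(facts, key=lambda f: f[1], reverse=True):
        if not any(contradicts(statement, other) for other in kept):
            kept.append(statement)
    return kept


# Toy data: two generated facts disagree about the capital of France.
generated = [
    ("Paris is the capital of France", 0.9),
    ("Lyon is the capital of France", 0.3),
    ("France is in Europe", 0.8),
]
conflicts = {
    frozenset({"Paris is the capital of France", "Lyon is the capital of France"})
}
coherent = select_coherent_subset(
    generated, lambda a, b: frozenset({a, b}) in conflicts
)
# coherent == ["Paris is the capital of France", "France is in Europe"]
```

The kept statements would then serve as the retraining set. A greedy pass is simple but not guaranteed to find the largest or best-scoring consistent subset, so it should be read as a placeholder for a more careful selection method.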

Best for: AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ARC Prize.