Training LLMs at test time for Scientific Discovery

2026-02-17 · Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

The Discover methodology trains a Large Language Model (LLM) at test time to improve its problem-solving on a single task, demonstrated here on the Spaceship-Titanic problem from MLE-Bench. Rather than relying on prompt engineering, few-shot examples, conventional fine-tuning, or standard Reinforcement Learning (RL), the model improves continuously while it works on the problem: it generates multiple attempts, evaluates them, and updates its weights. A key innovation is the Entropic objective function, which prioritizes outlier performance over average scores and so encourages risk-seeking strategies that can lead to breakthroughs. Each training step samples generations, filters out uninformative groups, computes advantages with the Entropic objective, attaches those advantages to tokens, and runs a forward-backward pass with an Adam optimizer to update LoRA weights. The approach is effective for discovery tasks, but its limitations include potential mode collapse and loss of general capabilities as the weights drift.

Key takeaway

For research scientists applying LLMs to scientific discovery, the Discover methodology's combination of test-time training and an Entropic objective can improve problem-solving beyond what prompting or conventional fine-tuning achieve. Consider this approach when a single breakthrough is worth more than consistent average performance. Watch for mode collapse, and explore diversity mechanisms to keep the model exploring.

Key insights

Test-time training with an Entropic objective enables LLMs to self-improve and discover novel solutions by prioritizing high-reward outliers.
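
This summary does not state the formula behind the Entropic objective. As an assumption about what is meant, one standard form of an entropic (exponential-utility) objective over a policy \pi_\theta, a problem x, an attempt y, and a reward R(x, y) is sketched below; the symbols and the temperature \beta are illustrative, not taken from the original article.

```latex
J_\beta(\theta) = \frac{1}{\beta} \log \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ e^{\beta R(x, y)} \right], \qquad \beta > 0

\nabla_\theta J_\beta(\theta) = \frac{1}{\beta}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \big( w_\beta(y) - 1 \big)\, \nabla_\theta \log \pi_\theta(y \mid x) \right],
\qquad
w_\beta(y) = \frac{e^{\beta R(x, y)}}{\mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}\!\left[ e^{\beta R(x, y')} \right]}
```

As \beta approaches 0 this objective reduces to the expected reward, and as \beta grows it approaches the best reward the policy can reach, which is why it favors high-reward outliers. The term w_\beta(y) - 1 plays the role of a per-attempt advantage; subtracting 1 is a baseline that leaves the gradient unchanged in expectation.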

Principles

Continuous learning during problem-solving can outperform a static model on the task at hand.
An Entropic objective encourages risk-seeking for discovery tasks.
Verifiable rewards are crucial for iterative improvement.
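
To make the risk-seeking principle concrete, here is a minimal sketch that scores two hypothetical strategies by their plain average and by an entropic value. The reward numbers, the `entropic_value` helper, and the choice of `beta` are assumptions made for illustration; they are not taken from the original article.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def entropic_value(rewards, beta):
    """Return (1/beta) * log(mean(exp(beta * r))) using a stable log-sum-exp."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    m = max(rewards)
    log_mean_exp = beta * m + math.log(mean([math.exp(beta * (r - m)) for r in rewards]))
    return log_mean_exp / beta

# Hypothetical validation scores for two strategies, eight attempts each.
safe = [0.60] * 8            # always decent, never great
risky = [0.30] * 7 + [0.95]  # usually poor, occasionally excellent

print("mean:", round(mean(safe), 3), round(mean(risky), 3))
for beta in (0.1, 10.0):
    print("beta", beta, round(entropic_value(safe, beta), 3), round(entropic_value(risky, beta), 3))
```

With a small `beta` the entropic value stays close to the average and ranks the safe strategy higher. With a large `beta` the single strong attempt dominates and the risky strategy ranks higher, which is the behavior a discovery task wants to reward.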

Method

The Discover methodology samples LLM generations, filters uninformative groups, computes advantages using an Entropic objective, attaches advantages to tokens, and updates LoRA weights via an Adam optimizer.
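
The data-preparation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the referenced repository: "uninformative" is assumed to mean a group whose rewards are all equal, the advantage is assumed to be the normalized exponential weight minus one, and the names `is_informative`, `entropic_advantages`, `attach_to_tokens`, and the sample data are hypothetical. The LoRA update with Adam is not shown.

```python
import math

def is_informative(rewards, tol=1e-8):
    """Assumption: a group with identical rewards carries no learning signal."""
    return max(rewards) - min(rewards) > tol

def entropic_advantages(rewards, beta=5.0):
    """Advantage_i = exp(beta * r_i) / mean_j exp(beta * r_j) - 1."""
    m = max(rewards)  # shift by the max for numerical stability
    exps = [math.exp(beta * (r - m)) for r in rewards]
    mean_exp = sum(exps) / len(exps)
    return [e / mean_exp - 1.0 for e in exps]

def attach_to_tokens(token_ids, advantage):
    """Every token of an attempt shares that attempt's advantage."""
    return [(tok, advantage) for tok in token_ids]

# Hypothetical groups: several attempts per prompt, each with a reward.
groups = {
    "prompt_a": {"rewards": [0.78, 0.78, 0.78], "tokens": [[1, 2], [3], [4, 5, 6]]},
    "prompt_b": {"rewards": [0.70, 0.79, 0.81], "tokens": [[7, 8], [9, 10, 11], [12]]},
}

batch = []
for name, group in groups.items():
    if not is_informative(group["rewards"]):
        continue  # filtered out: all attempts scored the same
    advantages = entropic_advantages(group["rewards"])
    for token_ids, adv in zip(group["tokens"], advantages):
        batch.append(attach_to_tokens(token_ids, adv))

for sample in batch:
    print(sample)
```

In this sketch `prompt_a` is dropped because every attempt scored the same, and the advantages within `prompt_b` sum to zero, so the best attempt is pushed up while the weakest is pushed down. The resulting per-token advantages are what a policy-gradient loss would weight log-probabilities by during the forward-backward pass.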

In practice

Apply test-time training for complex discovery problems.
Use an Entropic objective to encourage outlier solutions.
Implement diversity mechanisms to prevent mode collapse.
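
The summary does not say which diversity mechanisms to use. As one hedged possibility, a simple monitor can flag when sampled attempts start to look alike, which is an early sign of mode collapse. The helpers `distinct_fraction` and `looks_collapsed`, the threshold, and the sample strings below are all hypothetical.

```python
def distinct_fraction(samples):
    """Fraction of samples that are unique after collapsing whitespace."""
    if not samples:
        raise ValueError("need at least one sample")
    normalized = {" ".join(s.split()) for s in samples}
    return len(normalized) / len(samples)

def looks_collapsed(samples, threshold=0.6):
    """Flag a batch whose share of distinct attempts falls below the threshold."""
    return distinct_fraction(samples) < threshold

# Hypothetical attempts sampled early and late in a test-time training run.
early = [
    "use a random forest",
    "use logistic regression",
    "use gradient boosting",
    "use a support vector machine",
]
late = [
    "use gradient boosting",
    "use  gradient boosting",
    "use gradient boosting\n",
    "use a random forest",
]

print(distinct_fraction(early), looks_collapsed(early))
print(distinct_fraction(late), looks_collapsed(late))
```

How to respond to a flagged batch is a design choice outside what the summary covers; raising the sampling temperature or adding an entropy bonus to the loss are common options in RL practice.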

Topics

Test-time Training
Large Language Models
Entropic Objective Function
Scientific Discovery
Mode Collapse

Code references

prannayHexo/discover

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.