OLLM: Options-based Large Language Models
Summary
Options LLM (OLLM) is a method that replaces the single next-token prediction of standard Large Language Models with a set of learned options, indexed by a discrete latent variable. This "plug-in" architecture adds an encoder and decoder before the output head, allowing almost any pretrained LLM to be converted with few additional parameters (e.g., 1.56% trainable parameters on a 1.7B-parameter backbone). OLLM explicitly models variation, so a downstream policy can select or search among multiple plausible next-token options in a low-dimensional latent space. Applied to a distilled Qwen2.5 DeepSeek R1 1.7B model trained on OpenMathReasoning and evaluated on OmniMath, OLLM achieved up to ~70% final answer correctness under optimal latent selection, compared with a peak of 51% for state-of-the-art LoRA-adapted baselines. The approach improves controllability, robustness, and efficiency in math reasoning by constraining policy exploration to behaviors learned during supervised fine-tuning (SFT), which avoids issues such as language switching or degenerate reasoning.
Key takeaway
For research scientists developing or fine-tuning LLMs for complex reasoning tasks, OLLM offers a compelling architectural modification. You should consider integrating OLLM's latent option space to enhance model controllability and robustness, particularly in domains requiring precise, multi-step logic like mathematical problem-solving. This approach can yield significant accuracy gains and more efficient policy learning compared to traditional sampling heuristics or full-vocabulary RL fine-tuning.
Key insights
OLLM uses a discrete latent space to model multiple next-token options, improving LLM control and performance.
Principles
- Explicitly model token variation for enhanced control.
- Constrain policy learning to SFT-learned behaviors.
- Decompose next-token distribution into latent options.
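One way to read the last principle, offered as an assumption rather than the paper's stated formulation: with a discrete latent z taking K values, the next-token distribution can be written as a mixture over options. The symbols K, \pi (the distribution or policy over latents), and p_\theta (the per-option token distribution) are introduced here for illustration only.

```latex
p(x_t \mid x_{<t}) \;=\; \sum_{z=1}^{K} \pi(z \mid x_{<t}) \; p_\theta(x_t \mid x_{<t}, z)
```

Under this reading, a policy that commits to a single z at each step generates from p_\theta(\cdot \mid x_{<t}, z) alone, so its search space is the K options rather than the full vocabulary.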
Method
OLLM inserts an encoder and decoder before an LLM's output head to learn a discrete latent space of next-token options. A policy then selects latents for controlled generation, optimized via cross-entropy loss.
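The idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the sizes, the codebook, the residual decoder, and the function names are all hypothetical, the weights are random rather than trained, and the learned encoder is swapped for an exhaustive search that picks the option with the lowest cross-entropy on a known target token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden width, latent width, number of options, vocabulary.
HIDDEN, LATENT, K, VOCAB = 16, 4, 8, 50

# Hypothetical plug-in parameters (random here, learned in a real system).
codebook = rng.normal(size=(K, LATENT))                   # one embedding per option z
W_dec = rng.normal(size=(HIDDEN + LATENT, HIDDEN)) * 0.1  # decoder weights
W_head = rng.normal(size=(HIDDEN, VOCAB)) * 0.1           # stand-in for the pretrained output head


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def option_log_probs(h):
    """Return a (K, VOCAB) array: one next-token log-distribution per option z."""
    h_tiled = np.tile(h, (K, 1))                          # (K, HIDDEN)
    dec_in = np.concatenate([h_tiled, codebook], axis=1)  # (K, HIDDEN + LATENT)
    h_z = h_tiled + np.tanh(dec_in @ W_dec)               # residual decoder, one state per z
    return log_softmax(h_z @ W_head)


def oracle_latent(h, target_token):
    """Pick the option with the lowest cross-entropy on the target token.

    Assumption: this search stands in for a learned encoder or policy.
    """
    logp = option_log_probs(h)
    z = int(np.argmax(logp[:, target_token]))
    return z, float(-logp[z, target_token])


h = rng.normal(size=HIDDEN)   # stand-in for the backbone's last hidden state
logp = option_log_probs(h)
z_star, loss = oracle_latent(h, target_token=7)
print(f"best option z={z_star}, cross-entropy={loss:.3f}")
```

Selecting the best option per step against a known target mirrors the "optimal latent selection" setting mentioned in the summary; a deployed policy would have to choose z without access to the target.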
In practice
- Convert existing LLMs with minimal parameter additions.
- Apply to math reasoning for improved accuracy.
- Explore for code generation or multi-step planning.
Topics
- Options LLM
- Latent Space Modeling
- Next-Token Prediction
- Math Reasoning
- Policy Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.