OLLM: Options-based Large Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Options LLM (OLLM) replaces the single next-token prediction of a standard Large Language Model with a set of learned options, indexed by a discrete latent variable. This "plug-in" architecture adds an encoder and a decoder before the output head, so that almost any pretrained LLM can be converted with few additional parameters (e.g., 1.56% trainable parameters on a 1.7B-parameter backbone). Because OLLM models variation explicitly, a downstream policy can select among, or search over, multiple plausible next-token options in a low-dimensional latent space. Applied to a distilled Qwen2.5 DeepSeek R1 1.7B model trained on OpenMathReasoning and evaluated on OmniMath, OLLM reached up to ~70% final-answer correctness under optimal latent selection, compared with a peak of 51% for state-of-the-art LoRA-adapted baselines. The approach improves controllability, robustness, and efficiency in math reasoning by constraining policy exploration to behaviors learned during supervised fine-tuning, which avoids failure modes such as language switching or degenerate reasoning.
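To put the 1.56% figure in absolute terms, here is a back-of-the-envelope calculation. The summary does not say whether the percentage is measured against the backbone alone or against the converted model as a whole, so both readings are computed; the variable names are illustrative only. Either way the added modules come to roughly 27 million parameters.

```python
# Figures taken from the summary above.
BACKBONE_PARAMS = 1.7e9        # 1.7B-parameter backbone
TRAINABLE_FRACTION = 0.0156    # 1.56% trainable parameters

# Reading 1 (assumption): the fraction is relative to the backbone size.
added_vs_backbone = TRAINABLE_FRACTION * BACKBONE_PARAMS

# Reading 2 (assumption): the fraction is relative to the total, i.e.
# added / (backbone + added) = fraction, solved for `added`.
added_vs_total = TRAINABLE_FRACTION * BACKBONE_PARAMS / (1.0 - TRAINABLE_FRACTION)

print(f"{added_vs_backbone / 1e6:.1f}M vs {added_vs_total / 1e6:.1f}M added parameters")
```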

Key takeaway

For research scientists developing or fine-tuning LLMs for complex reasoning tasks, OLLM offers a compelling architectural modification. You should consider integrating OLLM's latent option space to enhance model controllability and robustness, particularly in domains requiring precise, multi-step logic like mathematical problem-solving. This approach can yield significant accuracy gains and more efficient policy learning compared to traditional sampling heuristics or full-vocabulary RL fine-tuning.

Key insights

OLLM uses a discrete latent space to model multiple next-token options, improving LLM control and performance.

Principles

Method

OLLM inserts an encoder and decoder before an LLM's output head to learn a discrete latent space of next-token options. A policy then selects latents for controlled generation, optimized via cross-entropy loss.
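The data flow described above can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not the paper's implementation: the class `ToyOptionsHead`, the target-conditioned encoder, the codebook, the residual decoder, and the `first_option_policy` stand-in are all hypothetical choices made here to show how a discrete latent index can turn one hidden state into several next-token distributions.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyOptionsHead:
    """Hypothetical encoder/decoder pair placed before an output head."""

    def __init__(self, hidden, vocab, num_options, seed=0):
        rng = random.Random(seed)
        self.num_options = num_options
        self.w_out = random_matrix(vocab, hidden, rng)            # stand-in for the pretrained output head
        self.token_emb = random_matrix(vocab, hidden, rng)        # target embeddings seen by the encoder
        self.w_enc = random_matrix(num_options, 2 * hidden, rng)  # encoder weights
        self.codebook = random_matrix(num_options, hidden, rng)   # one vector per discrete latent
        self.w_dec = random_matrix(hidden, 2 * hidden, rng)       # decoder weights

    def encode(self, h, target_id):
        """Training-time assumption: pick the latent from (hidden state, target token)."""
        logits = matvec(self.w_enc, h + self.token_emb[target_id])  # `+` concatenates lists
        return max(range(self.num_options), key=lambda z: logits[z])

    def decode(self, h, z):
        """Next-token distribution for hidden state h under option z."""
        delta = matvec(self.w_dec, h + self.codebook[z])
        h_z = [a + b for a, b in zip(h, delta)]  # residual update (assumption)
        return softmax(matvec(self.w_out, h_z))

    def options(self, h):
        """One next-token distribution per latent option."""
        return [self.decode(h, z) for z in range(self.num_options)]


rng = random.Random(1)
head = ToyOptionsHead(hidden=8, vocab=20, num_options=4)
h = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for a backbone hidden state
target_id = 3

option_dists = head.options(h)

# Training: cross-entropy of the target under the option the encoder selected.
z_train = head.encode(h, target_id)
loss = -math.log(option_dists[z_train][target_id])


def first_option_policy(hidden_state, num_options):
    """Placeholder policy; a learned policy would choose z from the hidden state."""
    return 0


# Inference: the policy, not the encoder, chooses the latent.
z_infer = first_option_policy(h, head.num_options)
next_token = max(range(20), key=lambda t: option_dists[z_infer][t])
```

The point of the sketch is the split in responsibilities: the encoder sees the target only during training, while at inference a policy chooses among a small number of options rather than over the full vocabulary.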

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.