MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

2026-05-28 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MIRA (Mid-training Rubric Anchoring for Source-Aware Data Selection) is a framework for selecting data during large language model (LLM) mid-training, the stage in which large-scale data mixtures are curated to strengthen capabilities before final post-training. Selection at this stage is hard because it must scale to the full corpus while applying semantic criteria that adapt to heterogeneous data sources. MIRA addresses the limitations of existing model-based and semantic selection methods by making rubric construction part of the selection process. It first discovers evaluation criteria for each source group, then distills the resulting judgments into scalable student scorers for efficient full-corpus filtering. In code-oriented mid-training experiments spanning 21 sources in 5 source groups, MIRA outperformed selection baselines across nine code benchmarks and reached performance comparable to a full-corpus run while using only half the training tokens.

Key takeaway

For Machine Learning Engineers optimizing LLM mid-training data, MIRA is worth evaluating. If your current data selection struggles with diverse sources or with scaling to the full corpus, consider source-aware rubric construction paired with distilled student scorers. In the reported code-oriented experiments, this approach reached performance comparable to full-corpus training with half the training tokens, which bears directly on training cost and efficiency. Explore MIRA's self-anchored rubric discovery to refine your data curation strategies.

Key insights

MIRA integrates dynamic rubric construction into data selection for LLM mid-training, optimizing for heterogeneous sources.

Principles

Data selection rubrics should adapt to source heterogeneity.
Scalable student scorers can distill expert judgments.
Mid-training data optimization requires semantic criteria.

Method

MIRA discovers evaluation criteria per source group, then distills these judgments into scalable student scorers for full-corpus filtering, making rubric construction part of selection.
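
The two stages described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not MIRA's implementation: the rubrics, the marker strings, the `teacher_score` stub (standing in for an expensive judge such as an LLM applying the rubric), the linear student, and the 0.8 threshold are all hypothetical and do not come from the source.

```python
# Hypothetical rubrics: one per source group, each criterion tied to a cheap text marker.
RUBRICS = {
    "repos": {"has_function": "def ", "has_test": "assert"},
    "forums": {"has_question": "?", "has_snippet": "print("},
}

def featurize(text, rubric):
    """One binary feature per rubric criterion."""
    return [1.0 if marker in text else 0.0 for marker in rubric.values()]

def teacher_score(text, rubric):
    """Stub for an expensive judge; here it simply rewards the fraction of criteria met."""
    hits = featurize(text, rubric)
    return 0.2 + 0.8 * sum(hits) / len(hits)

def fit_student(pairs, n_features, lr=0.1, epochs=500):
    """Distill teacher scores into a linear student via plain SGD on squared error."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in pairs:
            err = b + sum(wi * xi for wi, xi in zip(w, x)) - y
            b -= lr * err
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b

def student_score(student, x):
    w, b = student
    return b + sum(wi * xi for wi, xi in zip(w, x))

def select(corpus, sample, threshold=0.8):
    """Fit one student per source group on a teacher-labeled sample, then filter the corpus."""
    students = {}
    for group, rubric in RUBRICS.items():
        pairs = [(featurize(t, rubric), teacher_score(t, rubric))
                 for g, t in sample if g == group]
        students[group] = fit_student(pairs, len(rubric))
    return [(g, t) for g, t in corpus
            if student_score(students[g], featurize(t, RUBRICS[g])) >= threshold]

# Small labeled sample per group (the only place the teacher is called).
SAMPLE = [
    ("repos", "x = 1"),
    ("repos", "def f(): return 1"),
    ("repos", "assert x == 1"),
    ("repos", "def f(): return 1\nassert f() == 1"),
    ("forums", "thanks"),
    ("forums", "how do I print?"),
    ("forums", "use print(x)"),
    ("forums", "how? use print(x)"),
]
# "Full corpus" scored only by the cheap students.
CORPUS = [
    ("repos", "def g(): return 2\nassert g() == 2"),
    ("repos", "y = 2"),
    ("forums", "why? try print(y)"),
    ("forums", "use print(z)"),
]

kept = select(CORPUS, SAMPLE)
print([group for group, _ in kept])
```

The point of the split is cost: the teacher runs only on the small sample, while the per-group students, which are cheap to evaluate, score every document in the corpus.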

In practice

Apply dynamic rubric generation for diverse datasets.
Use student scorers to scale semantic filtering.
Optimize mid-training data to reduce token usage.
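
As one way to picture the token-reduction step, the sketch below keeps the highest-scoring documents in each source group until a fixed fraction of that group's tokens is used. The per-group 50% budget, the greedy rule, and the field names `group`, `score`, and `tokens` are assumptions made for illustration; the source does not say how MIRA allocates its token budget across sources.

```python
from collections import defaultdict

def select_under_budget(docs, keep_fraction=0.5):
    """Greedy per-group selection: highest student score first, within a token budget."""
    by_group = defaultdict(list)
    for doc in docs:
        by_group[doc["group"]].append(doc)

    kept = []
    for items in by_group.values():
        budget = keep_fraction * sum(d["tokens"] for d in items)
        used = 0
        for doc in sorted(items, key=lambda d: d["score"], reverse=True):
            if used + doc["tokens"] > budget:
                continue  # would overshoot the budget; a smaller doc may still fit
            kept.append(doc)
            used += doc["tokens"]
    return kept

# Hypothetical scored documents (scores would come from the student scorers).
docs = [
    {"group": "repos", "score": 0.9, "tokens": 300},
    {"group": "repos", "score": 0.7, "tokens": 200},
    {"group": "repos", "score": 0.4, "tokens": 500},
    {"group": "forums", "score": 0.8, "tokens": 100},
    {"group": "forums", "score": 0.3, "tokens": 100},
]

kept = select_under_budget(docs)
print(sum(d["tokens"] for d in kept), "of", sum(d["tokens"] for d in docs), "tokens kept")
```

Budgeting per group rather than globally keeps every source group represented, which is one simple reading of "source-aware"; a global threshold would be an equally plausible alternative.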

Topics

LLM Mid-training
Data Selection
Rubric Anchoring
Source-Aware Filtering
Code Benchmarks
Token Efficiency

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.