Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning
Summary
TaskPGM, a novel framework introduced in November 2024, systematically optimizes the composition of training mixtures for fine-tuning large language models (LLMs). It addresses the current manual, heuristic-driven process by selecting continuous task proportions through minimizing an energy function over a Markov Random Field (MRF). TaskPGM quantifies task relationships using behavioral divergences, such as Jensen-Shannon Divergence and Pointwise Mutual Information, derived from the predictive distributions of models fine-tuned on individual tasks. This method yields a closed-form solution under simplex constraints, provably balancing task representativeness and diversity. Empirical evaluations demonstrate consistent performance improvements on Llama-2-7B and Mistral-7B across benchmarks like MMLU and BIG-Bench-Hard, with reported gains up to 4.3 percentage points. Beyond performance, TaskPGM provides interpretable insights into task influence and effective mixture composition.
Key takeaway
For machine learning engineers assembling LLM fine-tuning data, TaskPGM offers a principled, automated alternative to manual, heuristic-driven mixture selection. Consider it when you want the reported gains (up to 4.3 percentage points on benchmarks such as MMLU and BIG-Bench-Hard) along with interpretable insight into task influence. Compared with uniform or size-based sampling, it may also reduce data needs and computational overhead, leading to more robust and efficient model specialization.
Key insights
TaskPGM optimizes LLM fine-tuning data mixtures by balancing task representativeness and diversity through an energy-based probabilistic framework.
Principles
- Optimal fine-tuning mixtures balance task representativeness and diversity.
- Functional task similarity, not just semantic, drives effective mixture composition.
- Energy minimization over MRFs can yield closed-form solutions for mixture optimization.
Method
TaskPGM models tasks as MRF nodes, quantifying pairwise affinities via behavioral divergences (JSD, PMI) from single-task model predictions. It then minimizes an energy function under simplex constraints for optimal continuous task proportions.
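As a rough illustration of the affinity step, the sketch below computes a pairwise task-similarity matrix from Jensen-Shannon divergence. It is not the paper's implementation: the function names (`jsd`, `similarity_matrix`), the `preds` array layout (one predictive distribution per task model per probe input), and the mapping from divergence to similarity (`1 - JSD / ln 2`) are all assumptions made for the example.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute 0; b > 0 wherever a > 0 because b is the mixture.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity_matrix(preds):
    """Hypothetical helper: preds has shape (n_tasks, n_probes, n_outcomes).

    preds[i, k] is the predictive distribution of the model fine-tuned on
    task i for probe input k. Similarity is 1 - mean JSD / ln 2, so it lies
    in [0, 1] (an assumed normalization, not taken from the paper).
    """
    n_tasks, n_probes, _ = preds.shape
    S = np.eye(n_tasks)
    for i in range(n_tasks):
        for j in range(i + 1, n_tasks):
            d = np.mean([jsd(preds[i, k], preds[j, k]) for k in range(n_probes)])
            S[i, j] = S[j, i] = 1.0 - d / np.log(2)
    return S
```

With natural logarithms JSD is bounded by ln 2, which is why dividing by `np.log(2)` keeps the similarity between 0 and 1.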
In practice
- Use JSD or PMI to quantify functional task similarity.
- Apply TaskPGM to optimize fine-tuning mixtures for Llama-2 and Mistral.
- Adjust the β/λ ratio to control the representativeness-diversity tradeoff.
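To make the β/λ tradeoff concrete, here is a minimal sketch that picks mixture weights on the probability simplex from a task-similarity matrix. Everything specific in it is an assumption for illustration: the quadratic energy `E(p) = -beta * r·p + (lam / 2) * pᵀSp`, the representativeness score `r` (mean similarity of each task to all tasks), and the names `project_simplex` and `mixture_weights`. It also uses projected gradient descent rather than the closed-form solution the paper reports, so treat it as a stand-in, not the method itself.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def mixture_weights(S, beta=1.0, lam=1.0, steps=2000, lr=0.05):
    """Minimize an ASSUMED energy over the simplex by projected gradient descent.

    E(p) = -beta * r @ p + (lam / 2) * p @ S @ p
    where r[i] is the mean similarity of task i to all tasks. The first term
    rewards representative tasks; the second penalizes weight placed on
    mutually similar (redundant) tasks. S is assumed symmetric.
    """
    n = S.shape[0]
    r = S.mean(axis=1)
    p = np.full(n, 1.0 / n)  # start from the uniform mixture
    for _ in range(steps):
        grad = -beta * r + lam * (S @ p)
        p = project_simplex(p - lr * grad)
    return p

# Tasks 0 and 1 are near-duplicates; task 2 is distinct.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
p_diverse = mixture_weights(S, beta=1.0, lam=5.0)  # low beta/lam: favors diversity
p_repr = mixture_weights(S, beta=5.0, lam=1.0)     # high beta/lam: favors representativeness
```

Under this assumed energy, lowering β/λ shifts weight toward the distinct task, while raising it concentrates weight on the tasks that resemble the rest of the pool.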
Topics
- LLM Fine-tuning
- Data Mixture Optimization
- Markov Random Fields
- Jensen-Shannon Divergence
- Pointwise Mutual Information
- Task Representativeness
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.