Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, extended

Summary

TaskPGM, a framework introduced in November 2024, systematically optimizes the composition of training mixtures for fine-tuning large language models (LLMs). It replaces the largely manual, heuristic-driven choice of task datasets with continuous task proportions obtained by minimizing an energy function over a Markov Random Field (MRF). Task relationships are quantified with behavioral divergences, such as Jensen-Shannon Divergence (JSD) and Pointwise Mutual Information (PMI), computed from the predictive distributions of models fine-tuned on individual tasks. The method yields a closed-form solution under simplex constraints and provably balances task representativeness and diversity. Empirical evaluations show consistent performance improvements on Llama-2-7B and Mistral-7B across benchmarks such as MMLU and BIG-Bench-Hard, with reported gains of up to 4.3 percentage points. Beyond performance, TaskPGM provides interpretable insights into task influence and effective mixture composition.
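For reference, the two divergence measures named above have the following standard textbook definitions. The symbols P_i and P_j (predictive distributions of models fine-tuned on tasks i and j) and M (their equal-weight mixture) are notation assumed here for illustration; the summary does not say exactly how the paper applies these quantities to model outputs.

```latex
\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}

\mathrm{JSD}(P_i \,\|\, P_j) = \tfrac{1}{2}\,\mathrm{KL}(P_i \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(P_j \,\|\, M),
\qquad M = \tfrac{1}{2}\,(P_i + P_j)

\mathrm{PMI}(x; y) = \log\frac{p(x, y)}{p(x)\,p(y)}
```

JSD is symmetric and bounded between 0 and log 2 (with the natural logarithm), which makes it convenient to rescale into a pairwise similarity score.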

Key takeaway

For machine learning engineers optimizing LLM fine-tuning data, TaskPGM offers a principled, automated alternative to manual, heuristic-driven mixture selection. Consider implementing it to pursue the consistent gains reported in the evaluations (up to 4.3 percentage points across benchmarks such as MMLU and BIG-Bench-Hard) and to gain interpretable insights into task influence. Compared with traditional uniform or size-based sampling strategies, the method can potentially reduce data needs and computational overhead, leading to more robust and efficient model specialization.

Key insights

TaskPGM optimizes LLM fine-tuning data mixtures by balancing task representativeness and diversity through an energy-based probabilistic framework.

Principles

Method

TaskPGM models tasks as nodes of an MRF and quantifies pairwise affinities with behavioral divergences (JSD, PMI) computed from the predictions of models fine-tuned on single tasks. It then minimizes an energy function under simplex constraints (proportions are non-negative and sum to one) to obtain continuous task proportions.
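The pipeline described above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's formulation: the function names, the similarity transform 1 - JSD / ln 2, the quadratic energy E(p) = -mu·p + (lam / 2) p·S·p, the trade-off parameter `lam`, and the toy distributions are all hypothetical. The paper reports a closed-form solution; this sketch substitutes projected gradient descent onto the simplex.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a(x) = 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity_matrix(task_dists):
    """Assumed affinity: S_ij = 1 - JSD_ij / ln 2, so S lies in [0, 1] with S_ii = 1."""
    n = len(task_dists)
    S = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = 1.0 - jsd(task_dists[i], task_dists[j]) / np.log(2)
    return S

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def optimize_mixture(S, lam=2.0, steps=500):
    """Minimize the assumed energy -mu.p + (lam/2) p.S.p over the simplex."""
    n = S.shape[0]
    mu = S.mean(axis=1)                       # representativeness: mean affinity to all tasks
    p = np.full(n, 1.0 / n)                   # start from the uniform mixture
    lr = 1.0 / (lam * np.linalg.norm(S, 2))   # step size from the spectral norm of S
    for _ in range(steps):
        grad = -mu + lam * (S @ p)            # S is symmetric
        p = project_to_simplex(p - lr * grad)
    return p

# Toy stand-ins for the predictive distributions of three single-task models.
# Tasks 0 and 1 are near-duplicates; task 2 behaves differently.
dists = [[0.70, 0.20, 0.10], [0.69, 0.21, 0.10], [0.10, 0.20, 0.70]]
S = similarity_matrix(dists)
p = optimize_mixture(S)
print(np.round(p, 3))
```

In this toy setup the two near-duplicate tasks share weight while the distinct task receives the largest proportion, which is the representativeness-versus-diversity trade-off in miniature. With this particular choice of `mu`, setting `lam = 1` makes the uniform mixture a stationary point, larger values push toward diversity, and smaller values concentrate weight on the most representative task. The energy is only convex when S is positive semidefinite, which this similarity transform does not guarantee in general.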

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.