Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ART (Art-based Reinforcement Training) is a novel fine-tuning technique for Multi-modal Large Language Models (MLLMs) that overcomes limitations of existing Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and Soft Prompting. Unlike these techniques, which necessitate modifications to pre-compiled LLM computational graphs and are thus not fully supported in high-throughput engines such as vLLM, ART injects information by optimizing only the raw visual input of a frozen MLLM. This approach enables a soft-token method on pre-compiled graphs, relying on backpropagation of gradients into a plain pixel array, thereby supporting any fine-tuning objective. The optimized visual input can also be stylized as task-relevant computational artworks. ART's effectiveness is validated across different sizes of the popular open Qwen architecture and several textual benchmarks, demonstrating accuracy competitive with LoRA, particularly in mathematics and structured-tool-use tasks.

Key takeaway

For Machine Learning Engineers deploying Multi-modal LLMs in high-throughput engines such as vLLM, ART is an alternative to traditional PEFT methods. LoRA and Soft Prompting require modifications to the pre-compiled computational graph. ART instead fine-tunes a frozen MLLM by optimizing only its visual input, so the pre-compiled graph is left intact. Its accuracy is competitive with LoRA, particularly in mathematics and structured-tool-use tasks. This can simplify deployment and broaden the fine-tuning options available for MLLM applications.

Key insights

ART fine-tunes MLLMs by optimizing raw visual input, enabling soft-token methods on frozen, pre-compiled models without graph modification.

Principles

Method

ART injects information into a frozen MLLM by optimizing its raw visual input. Gradients are backpropagated through the frozen model into a plain pixel array, and only that array is updated. Because the pre-compiled computational graph is never modified, the approach supports any fine-tuning objective.
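The idea of updating only a pixel array while the model stays frozen can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a hypothetical single linear layer standing in for a frozen MLLM, the objective is a plain cross-entropy loss, and every name (`frozen_weights`, `optimize_pixels`, the sizes and learning rate) is an assumption made for illustration. In a real setup the pixel gradient would come from an autodiff framework rather than the hand-written chain rule below.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, pixels):
    # Toy frozen "model": one linear layer mapping pixels to class logits.
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def loss_and_pixel_grad(weights, pixels, target):
    # Cross-entropy loss and its gradient with respect to the pixels only.
    probs = softmax(forward(weights, pixels))
    loss = -math.log(probs[target])
    # dL/dlogits = probs - onehot(target); chain rule through the linear layer.
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    grad = [
        sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
        for j in range(len(pixels))
    ]
    return loss, grad


def optimize_pixels(weights, pixels, target, steps=300, lr=0.05):
    # Gradient descent on the pixel array; the weights are never written to.
    pixels = list(pixels)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_pixel_grad(weights, pixels, target)
        history.append(loss)
        # Step against the gradient, then clamp to the valid range [0, 1].
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels, history


rng = random.Random(0)
NUM_CLASSES, NUM_PIXELS = 4, 16  # illustrative sizes only
frozen_weights = [
    [rng.uniform(-1.0, 1.0) for _ in range(NUM_PIXELS)]
    for _ in range(NUM_CLASSES)
]
weights_before = [row[:] for row in frozen_weights]  # copy to verify freezing

start_pixels = [0.5] * NUM_PIXELS  # neutral gray starting "image"
tuned_pixels, losses = optimize_pixels(frozen_weights, start_pixels, target=2)

print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
print("weights unchanged:", frozen_weights == weights_before)
```

The only trainable state is `tuned_pixels`. The weights are read but never updated, which is the property that would let a pre-compiled graph be reused as-is. The clamp keeps the result a valid image, so it could be passed to the model like any other visual input.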

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.