Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
Summary
ART (Art-based Reinforcement Training) is a fine-tuning technique for Multi-modal Large Language Models (MLLMs) that addresses a deployment limitation of existing Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA and Soft Prompting. Those methods require modifying the pre-compiled LLM computational graph and are therefore not fully supported in high-throughput engines such as vLLM. ART instead injects information by optimizing only the raw visual input of a frozen MLLM. Gradients are backpropagated into a plain pixel array, which yields a soft-token method that works on pre-compiled graphs and supports any fine-tuning objective. The optimized visual input can also be stylized as task-relevant computational artwork. ART is validated across different sizes of the popular open Qwen architecture and several textual benchmarks, where its accuracy is competitive with LoRA, particularly on mathematics and structured-tool-use tasks.
Key takeaway
For Machine Learning Engineers deploying Multi-modal LLMs in high-throughput engines such as vLLM, ART offers an alternative to traditional PEFT methods. If the computational graph modifications required by LoRA or Soft Prompting are an obstacle in your serving stack, consider ART: it fine-tunes a frozen MLLM by optimizing only the visual input, so the pre-compiled graph stays intact, and it reports LoRA-competitive accuracy on tasks such as mathematics and structured tool use. This can simplify deployment and broaden the fine-tuning options for your MLLM applications.
Key insights
ART fine-tunes MLLMs by optimizing raw visual input, enabling soft-token methods on frozen, pre-compiled models without graph modification.
Principles
- Fine-tuning can be done by optimizing the raw visual input alone.
- Backpropagation into a plain pixel array supports any fine-tuning objective.
- Pre-compiled MLLM graphs can remain unmodified.
Method
ART injects information into a frozen MLLM by optimizing its raw visual input. It uses backpropagation of gradients into a plain pixel array, supporting any fine-tuning objective on pre-compiled computational graphs.
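The core idea, gradient descent on a pixel array while the model weights stay fixed, can be illustrated with a toy sketch. This is not the authors' implementation: a single frozen linear layer stands in for the MLLM, a cross-entropy loss stands in for the fine-tuning objective, and every name and size below (`forward`, `optimize_pixels`, the 4x4 image, the learning rate) is a hypothetical choice made for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, pixels):
    """Frozen toy model (assumption): one linear layer from pixels to class logits."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def loss_and_pixel_grad(weights, pixels, target):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    probs = softmax(forward(weights, pixels))
    loss = -math.log(probs[target])
    # dL/dlogit_k = p_k - 1[k == target]; chain rule through the linear layer.
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    grad = [
        sum(dlogits[k] * weights[k][i] for k in range(len(weights)))
        for i in range(len(pixels))
    ]
    return loss, grad


def optimize_pixels(weights, pixels, target, steps=200, lr=0.1):
    """Gradient descent on the pixel array; the weights are never updated."""
    pixels = list(pixels)
    for _ in range(steps):
        _, grad = loss_and_pixel_grad(weights, pixels, target)
        # Clamp so the result remains a valid image with values in [0, 1].
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels


rng = random.Random(0)
num_classes, num_pixels = 3, 16  # hypothetical sizes: a 4x4 grayscale "image"
weights = [[rng.uniform(-1, 1) for _ in range(num_pixels)] for _ in range(num_classes)]
frozen_copy = [row[:] for row in weights]  # kept to confirm the model is untouched
start = [0.5] * num_pixels  # uniform gray starting image
target = 2

loss_before, _ = loss_and_pixel_grad(weights, start, target)
tuned = optimize_pixels(weights, start, target)
loss_after, _ = loss_and_pixel_grad(weights, tuned, target)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Only `tuned` changes during optimization; `weights` is read but never written, which mirrors the property that the model's graph can stay as compiled. In a real setting the gradient with respect to the pixels would come from an automatic differentiation framework rather than a hand-derived formula.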
In practice
- Fine-tune MLLMs served from pre-compiled graphs by optimizing the visual input.
- Generate task-relevant computational artworks from optimized inputs.
Topics
- Multi-modal LLMs
- Parameter-Efficient Fine-Tuning
- ART (Art-based Reinforcement Training)
- Qwen Architecture
- Visual Input Optimization
- High-throughput Inference
Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.