LISA: Likelihood Score Alignment for Visual-condition Controllable Generation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

LISA (LIkelihood Score Alignment) is a regularization method for visual-condition controllable generation under the prevalent dual-branch paradigm, in which a side network encodes the visual condition and fuses its features into a frozen, pretrained main network. LISA reinterprets this setup: the main network supplies an unconditional score that governs perceptual quality, while the side network implicitly contributes a likelihood score that provides conditional control. The method makes this role explicit by aligning the side network's intermediate features with an approximated likelihood score. Features hooked from a designated side-network layer are projected into the score latent space by a lightweight decoder, and a regularization loss is computed against a constructed likelihood score target. The side network and the decoder are then optimized jointly with the standard diffusion loss and this regularization loss. Across diverse image and video tasks and model types, LISA is reported to accelerate training convergence, improve synthesis quality, and encourage more disentangled conditional features, with negligible additional training cost and no extra inference cost.
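
The split into an unconditional score and a likelihood score follows from Bayes' rule written in score form. The block below is a sketch in assumed notation that is not taken from the original article: x_t is the noisy latent, c the visual condition, h_\ell the hooked side-network feature, D_\phi the lightweight decoder, \hat{s}_{lik} the constructed target, and \lambda a loss weight. The squared-error form of the regularizer is likewise an assumption.

```latex
% Bayes' rule in score form: \log p(c) does not depend on x_t, so its gradient vanishes.
\nabla_{x_t} \log p(x_t \mid c)
  = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score (main network)}}
  + \underbrace{\nabla_{x_t} \log p(c \mid x_t)}_{\text{likelihood score (side network)}}

% Joint objective; \lambda and the squared-error regularizer are assumptions.
\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda \, \mathcal{L}_{\text{reg}},
\qquad
\mathcal{L}_{\text{reg}} = \bigl\| D_\phi(h_\ell) - \hat{s}_{\text{lik}} \bigr\|_2^2
```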

Key takeaway

For machine learning engineers working on visual-condition controllable generation, LISA is a training-time regularizer for existing dual-branch architectures. It is reported to speed up training convergence, improve output quality across diverse image and video tasks, and encourage more disentangled conditional features, while adding negligible training cost and no inference cost.

Key insights

LISA explicitly aligns side network features with an approximated likelihood score for improved visual-condition controllable generation.

Principles

Method

Hook features from a designated side-network layer, project them into the score latent space with a lightweight decoder, and compute a regularization loss against an approximated likelihood score target. Optimize this loss jointly with the standard diffusion loss.
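
The steps above can be sketched in PyTorch as follows. This is a minimal toy under stated assumptions, not the authors' implementation. `TinyMain`, `TinySide`, `build_toy`, `train_step`, the 1x1-convolution decoder, the additive feature fusion, and the weight `lam` are all hypothetical. The summary does not say how the likelihood score target is constructed, so the sketch assumes it is the sampled noise minus the frozen main network's unconditional prediction. Timestep conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMain(nn.Module):
    """Hypothetical stand-in for the frozen, pretrained main network."""

    def __init__(self, ch=4, width=16):
        super().__init__()
        self.inp = nn.Conv2d(ch, width, 3, padding=1)
        self.out = nn.Conv2d(width, ch, 3, padding=1)

    def forward(self, x_t, side_feat=None):
        h = F.silu(self.inp(x_t))
        if side_feat is not None:
            h = h + side_feat  # fuse side-network features by addition (assumed)
        return self.out(h)


class TinySide(nn.Module):
    """Hypothetical side network that encodes the visual condition."""

    def __init__(self, ch=4, cond_ch=3, width=16):
        super().__init__()
        self.enc = nn.Conv2d(ch + cond_ch, width, 3, padding=1)
        self.block = nn.Conv2d(width, width, 3, padding=1)  # layer to hook

    def forward(self, x_t, cond):
        h = F.silu(self.enc(torch.cat([x_t, cond], dim=1)))
        return self.block(h)


def build_toy(ch=4, width=16):
    main, side = TinyMain(ch, width), TinySide(ch, 3, width)
    decoder = nn.Conv2d(width, ch, 1)  # lightweight decoder, used in training only
    for p in main.parameters():
        p.requires_grad_(False)  # the main network stays frozen
    params = list(side.parameters()) + list(decoder.parameters())
    return main, side, decoder, torch.optim.AdamW(params, lr=1e-3)


def train_step(main, side, decoder, opt, x0, cond, lam=0.1):
    # Capture the designated side-network layer's output with a forward hook.
    captured = {}
    hook = side.block.register_forward_hook(
        lambda module, inputs, output: captured.update(feat=output)
    )
    noise = torch.randn_like(x0)
    alpha = torch.rand(x0.shape[0], 1, 1, 1)  # toy per-sample noise level
    x_t = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise

    eps_cond = main(x_t, side(x_t, cond))  # prediction with side features fused
    hook.remove()
    with torch.no_grad():
        eps_uncond = main(x_t)  # frozen main network alone (unconditional)

    diff_loss = F.mse_loss(eps_cond, noise)  # standard diffusion loss
    # ASSUMPTION: approximate the likelihood term as the gap between the
    # regression target and the unconditional prediction.
    target = noise - eps_uncond
    reg_loss = F.mse_loss(decoder(captured["feat"]), target)

    loss = diff_loss + lam * reg_loss  # joint objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return diff_loss.item(), reg_loss.item()
```

A call such as `train_step(*build_toy(), torch.randn(2, 4, 8, 8), torch.randn(2, 3, 8, 8))` runs one joint update of the side network and decoder. Because the decoder only feeds the regularization loss, it can be dropped after training, which is consistent with the reported zero extra inference cost.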

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.