VMix : Improving Text-to-Image Diffusion Model with
Cross-Attention Mixing Control

Shaojin Wu,1 Fei Ding,1,* Mengqi Huang,1,2 Wei Liu,1 Qian He1
1 ByteDance Inc.   2 University of Science and Technology of China


We introduce VMix, which provides improved aesthetic guidance to diffusion models via a novel condition control method called value-mixed cross-attention. VMix serves as a plug-and-play adapter designed to systematically enhance aesthetic quality.

Abstract

While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation.


Existing methods often fail to align with fine-grained human preferences for visually generated content. Images favored by humans should excel across multiple fine-grained aesthetic dimensions simultaneously, such as natural light, coherent color, and reasonable composition. To address this challenge, we introduce VMix, a novel plug-and-play adapter designed to systematically bridge the aesthetic quality gap between generated images and their real-world counterparts across these dimensions.

How does it work?


Illustration of VMix:

(a) In the initialization stage, pre-defined aesthetic labels are encoded by CLIP into [CLS] token embeddings to obtain AesEmb, which only needs to be computed once at the beginning of training.

(b) In the training stage, a projection layer first maps the input aesthetic description y_aes into an embedding f_a with the same token dimension as the content text embedding f_t. Both the content embedding f_t and the aesthetic embedding f_a are then integrated into the denoising network through value-mixed cross-attention.

(c) In the inference stage, VMix extracts all positive aesthetic embeddings from AesEmb to form the aesthetic input, which, together with the content input, is fed into the model for the denoising process. A minimal code sketch of these three stages is given below.
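The snippet below is a minimal PyTorch sketch of one plausible reading of these three stages; it is not the official implementation. The CLIP checkpoint, the aesthetic label list, the module names (AesProjection, ValueMixedCrossAttention), the single-head attention, and the choice to project the aesthetic input to the same token length as f_t are all illustrative assumptions.

# Minimal PyTorch sketch of one plausible reading of the three stages above; not the official implementation.
import torch
import torch.nn as nn
from transformers import CLIPTextModel, CLIPTokenizer

# (a) Initialization: build AesEmb once from pre-defined aesthetic labels (illustrative subset).
aesthetic_labels = ["natural light", "coherent color", "reasonable composition"]
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
with torch.no_grad():
    tokens = tokenizer(aesthetic_labels, padding=True, return_tensors="pt")
    aes_emb = text_encoder(**tokens).pooler_output        # (num_labels, 768), one pooled embedding per label

# (b) Training: project the aesthetic input to the token dimension of f_t (assumed here to mean
# both the token length and channel width), then inject it via value-mixed cross-attention.
class AesProjection(nn.Module):
    def __init__(self, num_labels, emb_dim=768, num_tokens=77):
        super().__init__()
        self.token_proj = nn.Linear(num_labels, num_tokens)   # match f_t's token length (assumption)
        self.channel_proj = nn.Linear(emb_dim, emb_dim)

    def forward(self, aes_emb):                               # (num_labels, emb_dim)
        x = self.token_proj(aes_emb.T).T                      # (num_tokens, emb_dim)
        return self.channel_proj(x).unsqueeze(0)              # (1, num_tokens, emb_dim)

class ValueMixedCrossAttention(nn.Module):
    # Single-head sketch: the attention map is computed from the content embedding f_t only,
    # while a second value branch carries the aesthetic embedding f_a and is merged through a
    # zero-initialized projection, so the adapter initially leaves the base model unchanged.
    def __init__(self, query_dim, context_dim, inner_dim=320):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)      # values from f_t
        self.to_v_aes = nn.Linear(context_dim, inner_dim, bias=False)  # values from f_a
        self.to_out = nn.Linear(inner_dim, query_dim)
        self.to_out_aes = nn.Linear(inner_dim, query_dim)
        nn.init.zeros_(self.to_out_aes.weight)                         # zero-initialized connection
        nn.init.zeros_(self.to_out_aes.bias)

    def forward(self, x, f_t, f_a):
        q, k = self.to_q(x), self.to_k(f_t)
        attn_map = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out_content = attn_map @ self.to_v(f_t)        # ordinary cross-attention branch
        out_aes = attn_map @ self.to_v_aes(f_a)        # aesthetic values, same attention map
        return self.to_out(out_content) + self.to_out_aes(out_aes)

# (c) Inference: use all positive aesthetic embeddings as the aesthetic input (placeholder tensors below).
project = AesProjection(num_labels=len(aesthetic_labels))
f_a = project(aes_emb)                                 # (1, 77, 768), same token dimension as f_t
f_t = torch.randn(1, 77, 768)                          # stand-in for the content text embedding
latent = torch.randn(1, 64, 320)                       # stand-in for flattened U-Net features
vmix_attn = ValueMixedCrossAttention(query_dim=320, context_dim=768)
out = vmix_attn(latent, f_t, f_a)                      # identical to plain cross-attention before training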

Aesthetic Fine-grained Control

VMix achieves fine-grained aesthetic control by adjusting the aesthetic embedding. When only a single-dimensional aesthetic label is used, image quality improves along that specific dimension; when the full set of positive aesthetic labels is employed, the overall visual quality of the images surpasses the baseline.
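Continuing the sketch above, a hypothetical way to exercise this control is to keep the AesEmb layout fixed and replace the labels that should not be emphasized with the embedding of an empty string; the aesthetic_input helper, its label strings, and the empty-string stand-in are assumptions, not the official interface.

# Continues the sketch above (reuses tokenizer, text_encoder, aes_emb, aesthetic_labels, project).
with torch.no_grad():
    null_emb = text_encoder(**tokenizer([""], return_tensors="pt")).pooler_output   # (1, 768) neutral stand-in

def aesthetic_input(active_labels):
    # Keep every slot of AesEmb, but only the selected dimensions keep their positive label.
    rows = [aes_emb[i:i + 1] if label in active_labels else null_emb
            for i, label in enumerate(aesthetic_labels)]
    return project(torch.cat(rows, dim=0))              # (1, 77, 768)

f_a_light = aesthetic_input(["natural light"])          # single-dimensional control
f_a_full = aesthetic_input(aesthetic_labels)            # full positive aesthetic labels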


Prompt: “A girl leaning against a window with a breeze blowing, summer portrait, half-length medium view.” Panels: full, baseline, light, emotion, texture, color, none.

Comparison To Current Methods

Qualitative comparison with various state-of-the-art methods. All results are based on Stable Diffusion.



Qualitative comparison with various state-of-the-art methods. All results are based on SDXL.



Personalized Text-to-Image Model

Images generated by the personalized model with or without VMix.



BibTex

@misc{wu2024vmix,
  title={VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control},
  author={Shaojin Wu and Fei Ding and Mengqi Huang and Wei Liu and Qian He},
  year={2024},
  eprint={2412.20800},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}