Diffusion models have recently emerged as a powerful generative tool. Despite this great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven).
In this work, we present Collaborative Diffusion, in which pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary across the latent denoising steps, and bilateral connections can be established upon this complementarity. Specifically, we propose the dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only combines the generation capabilities of uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency.
We use pre-trained uni-modal diffusion models to perform multi-modal guided face generation and editing. At each step of the reverse process (i.e., from timestep t to t − 1), the dynamic diffuser predicts spatially- and temporally-varying influence functions that selectively enhance or suppress the contribution of each modality.
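To make the collaboration mechanism concrete, below is a minimal sketch of one collaborative denoising step in PyTorch. The names (collaborative_denoise_step, mask_model, text_model, mask_diffuser, text_diffuser) are hypothetical placeholders rather than the official implementation: each frozen uni-modal model predicts its own noise estimate, each dynamic diffuser predicts an influence map, and the maps are normalized across modalities to weight the noise predictions.

import torch
import torch.nn.functional as F

def collaborative_denoise_step(x_t, t, mask_cond, text_cond,
                               mask_model, text_model,
                               mask_diffuser, text_diffuser):
    # Hypothetical sketch: each pre-trained uni-modal collaborator predicts
    # its own noise estimate for the current noisy latent x_t at timestep t.
    eps_mask = mask_model(x_t, t, mask_cond)      # (B, C, H, W)
    eps_text = text_model(x_t, t, text_cond)      # (B, C, H, W)

    # Each dynamic diffuser predicts a spatial-temporal influence map for its
    # modality, conditioned on the noisy latent, the timestep, and the condition.
    infl_mask = mask_diffuser(x_t, t, mask_cond)  # (B, 1, H, W)
    infl_text = text_diffuser(x_t, t, text_cond)  # (B, 1, H, W)

    # Normalize the influences across modalities so the weights at every
    # spatial location sum to one.
    weights = F.softmax(torch.cat([infl_mask, infl_text], dim=1), dim=1)

    # Weighted combination of the uni-modal noise predictions; the result is
    # fed into the standard DDIM/DDPM update to obtain x_{t-1}.
    eps = weights[:, 0:1] * eps_mask + weights[:, 1:2] * eps_text
    return eps

Because only the lightweight dynamic diffusers are trained while the uni-modal collaborators stay frozen, multi-modal control is obtained without re-training the large diffusion models.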
We exploit pre-trained uni-modal diffusion models, which collaborate to achieve multi-modal control without being re-trained.
Collaborative Diffusion can extend an arbitrary uni-modal approach or task (e.g., face generation, face editing) to the multi-modal paradigm.
Our dynamic diffusers implicitly learn "layout first, details later" influence functions for the diffusion model's sampling process.
Our method generates realistic images under different combinations of multi-modal conditions, even for combinations that are relatively rare in the training distribution, such as a man with long hair.
Spatial Variations: The influence of the mask-driven model mainly lies on the contours of facial regions, such as the outlines of the hair, face, and eyes, as these regions are crucial in defining the facial layout. In contrast, the influence of the text-driven model is stronger at skin regions, including the cheeks and chin. This is because attributes related to skin texture, such as age and beard length, are better described by text.
Temporal Variations: The influence of the mask-driven model is stronger at earlier diffusion stages (i.e., larger t), since early stages focus on initializing the facial layout using the mask-driven model's predictions. At later stages, the influence of the text-driven model increases as texture details (e.g., skin wrinkles and beard length) are instantiated using information from the text.
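One way to verify this temporal trend, assuming access to the influence maps predicted by the dynamic diffusers during sampling, is to average each modality's normalized weight over all spatial locations at every DDIM step. The helper below is a hypothetical sketch for such an inspection, not part of the released code.

import torch
import torch.nn.functional as F

def influence_vs_timestep(influence_maps):
    # influence_maps: dict mapping timestep t -> (infl_mask, infl_text),
    # each of shape (B, 1, H, W), as predicted by the dynamic diffusers.
    trend = {}
    for t, (infl_mask, infl_text) in influence_maps.items():
        w = F.softmax(torch.cat([infl_mask, infl_text], dim=1), dim=1)
        # Spatially averaged weight of each collaborator at this timestep.
        trend[t] = (w[:, 0].mean().item(), w[:, 1].mean().item())
    # Expected trend: the mask weight dominates at large t (layout), while the
    # text weight grows at small t (texture details).
    return trend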
The figure below shows the influence functions at each DDIM timestep. Given the mask condition in (a), panel (b) displays the influence functions of the mask-driven collaborator at each DDIM sampling step t = 980, 960, ..., 20, 0, from left to right and top to bottom. Given the text condition in (c), panel (d) displays the influence functions of the text-driven collaborator in the same manner. Panel (f) shows the intermediate diffusion denoising results.
If you find our work useful, please consider citing our paper:
@InProceedings{huang2023collaborative,
author = {Huang, Ziqi and Chan, Kelvin C.K. and Jiang, Yuming and Liu, Ziwei},
title = {Collaborative Diffusion for Multi-Modal Face Generation and Editing},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2023},
}