Collaborative Diffusion for
Multi-Modal Face Generation and Editing

CVPR 2023

Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, Ziwei Liu^†

S-Lab, Nanyang Technological University

We propose Collaborative Diffusion, where users can use multiple modalities to control face generation and editing.
(a) Face Generation. Given multi-modal controls, our framework synthesizes high-quality images consistent with the input conditions. (b) Face Editing. Collaborative Diffusion also supports multi-modal editing of real images with promising identity preservation capability.

Video

Abstract

Diffusion models have recently emerged as a powerful generative tool. Despite this progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven).

In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary in their latent denoising steps, upon which bilateral connections can be established. Specifically, we propose the dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only combines the generation capabilities of uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency.
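
One way to read "predicting the spatial-temporal influence functions for each pre-trained uni-modal model" is as a per-pixel, per-timestep weighted sum of the uni-modal noise predictions. The formulation below is a sketch under that reading, not a statement of the exact equations in the paper. All symbols are assumptions introduced here for illustration: M modalities, latent z_t at timestep t, condition c_m, uni-modal noise predictor \epsilon_{\theta_m}, raw dynamic-diffuser output \hat{I}_{m,t}, and spatial position p. The softmax normalization across modalities is likewise an assumption.

```latex
% Normalize the raw influence maps across modalities at every position p
I_{m,t,p} = \frac{\exp\big(\hat{I}_{m,t,p}\big)}{\sum_{j=1}^{M} \exp\big(\hat{I}_{j,t,p}\big)},
\qquad \sum_{m=1}^{M} I_{m,t,p} = 1

% Blend the uni-modal noise predictions with the normalized influence maps
\epsilon_{t} = \sum_{m=1}^{M} I_{m,t} \odot \epsilon_{\theta_m}(z_t, t, c_m)
```

Because the weights sum to one at every position and timestep, each pixel of the blended prediction is a convex combination of the uni-modal predictions, which is what lets the pre-trained models stay frozen.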

Highlights

We exploit pre-trained uni-modal diffusion models. They can collaborate to achieve multi-modal control without being re-trained.

Collaborative Diffusion can be used to extend an arbitrary uni-modal approach or task (e.g., face generation or face editing) to the multi-modal paradigm.

Our dynamic diffusers implicitly learn "layout first, details later" influence functions for the diffusion model's sampling process.

Multi-Modal-Driven Face Editing

Given a real input image and target conditions, we show the image edited by our method.

BibTeX

If you find our work useful, please consider citing our paper:

 @InProceedings{huang2023collaborative,
      author = {Huang, Ziqi and Chan, Kelvin C.K. and Jiang, Yuming and Liu, Ziwei},
      title = {Collaborative Diffusion for Multi-Modal Face Generation and Editing},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year = {2023},
  }


Framework


Multi-Modal-Driven Face Generation

More Results


Influence Functions

The influence functions record the contributions from each collaborator. They determine when, where, and how much each uni-modal diffusion model contributes to the synthesis process.
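
As a rough illustration of "when, where, and how much", the NumPy sketch below blends the noise predictions of several uni-modal models using per-pixel influence maps. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `softmax_over_modalities` and `collaborate` are hypothetical, the random arrays stand in for real model outputs, and normalizing the maps with a softmax across modalities is an assumption. In a sampling loop, a function like this would be called once per denoising step, which is where the "when" comes from.

```python
import numpy as np


def softmax_over_modalities(logits):
    """Normalize influence logits so the weights at each pixel sum to 1.

    logits has shape (M, H, W): one unnormalized map per modality.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)


def collaborate(noise_preds, influence_logits):
    """Blend per-modality noise predictions with spatially varying weights.

    noise_preds:      (M, C, H, W) noise predicted by each uni-modal model
    influence_logits: (M, H, W)    hypothetical dynamic-diffuser outputs
    Returns the blended prediction (C, H, W) and the normalized maps (M, H, W).
    """
    influence = softmax_over_modalities(influence_logits)
    # Broadcast each modality's map over the channel axis, then sum modalities.
    blended = (influence[:, None, :, :] * noise_preds).sum(axis=0)
    return blended, influence


# Toy stand-ins for one denoising step with two modalities (e.g. text, mask).
rng = np.random.default_rng(0)
M, C, H, W = 2, 3, 4, 4
noise_preds = rng.normal(size=(M, C, H, W))
influence_logits = rng.normal(size=(M, H, W))

blended, influence = collaborate(noise_preds, influence_logits)
```

In this toy setup, `influence[m]` says where modality `m` dominates at the current step, and its magnitude says how much. Equal logits reduce the blend to a plain average of the uni-modal predictions.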
