Mojo: Training-Free Image Editing via Skip Connection Modulation

1 University of Science and Technology of China
2 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology

Variance maps across channels of skip connection features within the U-Net for the text prompt "A superhero in New York City" at the 15th timestep of the diffusion process. The skip connection features exhibit high variance in regions that correspond to the structure of the generated image.
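As a rough illustration of how such a variance map can be computed, the sketch below takes a single skip connection feature tensor of shape (C, H, W) and measures the variance across its channel axis at every spatial location. The function names and the synthetic feature tensor are assumptions made for this example; a real feature would come from a U-Net forward pass, which is not reproduced here.

```python
import numpy as np


def channel_variance_map(features: np.ndarray) -> np.ndarray:
    """Return the per-pixel variance across channels of a (C, H, W) array."""
    if features.ndim != 3:
        raise ValueError("expected a (C, H, W) array")
    return features.var(axis=0)


def normalize_for_display(vmap: np.ndarray) -> np.ndarray:
    """Rescale a map to [0, 1] so it can be shown as a grayscale image."""
    lo, hi = vmap.min(), vmap.max()
    if hi == lo:
        return np.zeros_like(vmap)
    return (vmap - lo) / (hi - lo)


# Synthetic stand-in for a skip connection feature (hypothetical data):
# low-variance background with a high-variance square "object" in the middle.
rng = np.random.default_rng(0)
features = rng.normal(scale=0.1, size=(8, 16, 16))
features[:, 4:12, 4:12] += rng.normal(scale=2.0, size=(8, 8, 8))

vmap = channel_variance_map(features)
display = normalize_for_display(vmap)
```

In this toy setup the central square stands out in `display`, which mirrors the observation in the caption that high-variance regions follow image structure.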

Abstract

Text-to-image diffusion models have recently garnered significant attention for their ability to create diverse and realistic visual content. However, adapting these models to edit real images remains challenging. Existing text-guided image editing methods either struggle to achieve effective edits while maintaining the overall image structure, or require extensive fine-tuning, which makes them impractical for many applications.

To address these challenges, we introduce Mojo, a novel training-free approach for effective and structure-preserving image editing. Mojo incorporates two innovative techniques: Skip Connection Modulation (SCM) and Cross-Image Self-Attention (CISA). SCM leverages the potential of skip connections within the diffusion U-Net. By modulating skip connection features during the image editing process, it retains the source image structure while facilitating successful modifications. CISA further enhances the quality of edited images by improving fine-grained visual details through self-attention transfer.

Extensive experiments show that Mojo outperforms existing image editing methods, delivering superior results across a variety of image editing scenarios.

Framework

Methods

We first invert the source image to its corresponding latent encoding via DDIM inversion and extract skip connection features during this process. Subsequently, we simultaneously reconstruct the source image and perform image editing using our proposed Skip Connection Modulation (SCM) and Cross-Image Self-Attention (CISA).
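The first stage described above can be sketched as follows. This is a minimal NumPy illustration of deterministic DDIM inversion that also caches skip connection features at each step; it is not the authors' implementation. The names `ddim_step`, `ddim_invert`, and `toy_model`, the noise schedule, and the convention that the model returns its skip features alongside the noise prediction are all assumptions made for this example.

```python
import numpy as np


def ddim_step(x, eps, alpha_bar_from, alpha_bar_to):
    """Deterministic DDIM update between two noise levels.

    Moving to a smaller alpha_bar adds noise (inversion);
    moving to a larger alpha_bar removes noise (sampling).
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_bar_from) * eps) / np.sqrt(alpha_bar_from)
    return np.sqrt(alpha_bar_to) * x0_pred + np.sqrt(1.0 - alpha_bar_to) * eps


def ddim_invert(x0, model, alpha_bars):
    """Invert a clean latent x0 and cache skip features per step.

    `model(x, i)` is a hypothetical callable that returns
    (predicted_noise, list_of_skip_features) for step index i.
    `alpha_bars` must decrease from near 1 (clean) toward 0 (noisy).
    """
    x = x0
    cached_skips = []
    for i in range(len(alpha_bars) - 1):
        eps, skips = model(x, i)
        cached_skips.append(skips)  # kept for later use in the editing pass
        x = ddim_step(x, eps, alpha_bars[i], alpha_bars[i + 1])
    return x, cached_skips


def toy_model(x, i):
    """Stand-in for a diffusion U-Net (assumption, for illustration only)."""
    return 0.1 * x, [x.copy()]


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))             # hypothetical source latent
alpha_bars = np.linspace(0.999, 0.05, 10)   # hypothetical schedule
latent, cached_skips = ddim_invert(x0, toy_model, alpha_bars)
```

The cached features stand in for what the editing pass would consume. How SCM modulates them and how CISA transfers self-attention are specified in the paper and are deliberately left out of this sketch.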

Visual comparisons with competing methods

Mojo strikes a good balance between editing fidelity and preservation of image structure.

Visual variations across various editing scenarios

Quantitative comparisons

Quantitative comparisons on the Wild-TI2I, ImageNet-R-TI2I, and ImageNet-Real benchmarks. Mojo strikes a good balance between editing fidelity and preservation of image structure.