Text-to-image diffusion models have recently garnered significant attention for their ability to create diverse and realistic visual content. However, adapting these models to real image editing remains challenging. Existing text-guided image editing methods either struggle to achieve effective edits while preserving the overall image structure, or require extensive fine-tuning, making them impractical for many applications.
To address these challenges, we introduce Mojo, a novel training-free approach for effective and structure-preserving image editing. Mojo incorporates two techniques: Skip Connection Modulation (SCM) and Cross Image Self-Attention (CISA). SCM exploits the skip connections of the diffusion U-Net: by modulating skip-connection features during the editing process, it retains the source image structure while enabling effective modifications. CISA further enhances the quality of edited images by transferring self-attention features, improving fine-grained visual detail.
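A minimal PyTorch sketch of the two ideas, under our own assumptions since the abstract does not give exact formulas: SCM blends the edit pass's U-Net skip-connection features with features cached from the source-image pass, and CISA lets the edit branch attend to the source branch by sharing self-attention keys and values. The names `scm_blend`, `cisa_attention`, and the blending weight `alpha` are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F


def scm_blend(skip_edit: torch.Tensor, skip_source: torch.Tensor, alpha: float = 0.6) -> torch.Tensor:
    """Skip Connection Modulation (sketch): mix source-pass skip features into
    the edit-pass skip features to preserve the original image structure."""
    return alpha * skip_source + (1.0 - alpha) * skip_edit


def cisa_attention(q_edit: torch.Tensor, k_source: torch.Tensor, v_source: torch.Tensor) -> torch.Tensor:
    """Cross Image Self-Attention (sketch): queries come from the edit branch,
    while keys/values come from the source branch, transferring fine details."""
    return F.scaled_dot_product_attention(q_edit, k_source, v_source)


# Toy usage with random tensors standing in for U-Net activations.
skip_edit = torch.randn(1, 320, 32, 32)    # skip features from the edit pass
skip_source = torch.randn(1, 320, 32, 32)  # cached skip features from the source pass
modulated = scm_blend(skip_edit, skip_source)

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, tokens, dim): edit-branch queries
k = torch.randn(1, 8, 1024, 64)  # source-branch keys
v = torch.randn(1, 8, 1024, 64)  # source-branch values
out = cisa_attention(q, k, v)
```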
Extensive experiments show that Mojo outperforms existing image editing methods, delivering superior results across diverse editing scenarios.