LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

1University of Science and Technology of China, 2Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS),
Institute of Computing Technology

Existing training-free layout-to-image synthesis approaches struggle to generate high-quality images that adhere to the given textual prompts and spatial layout. In contrast, LoCo is able to handle various spatial layouts and unusual prompts while maintaining high image quality and precise concept coverage.

Abstract

Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short of precise control over image composition.

In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging the semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure their accurate placement in designated regions. We further propose a Padding Token Constraint (PTC) to exploit the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions.

LoCo seamlessly integrates into existing text-to-image and layout-to-image models, significantly amplifying their performance and effectively addressing semantic failures observed in prior methods. Through extensive experiments, we showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.

Framework

Methods

We propose two loss functions (constraints) that update the image latent during the denoising process of the text-to-image diffusion model, based on the cross-attention maps extracted at each timestep. Optimizing these two losses helps to correctly incorporate the layout instructions into the generative process.
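The latent update described above can be sketched as a standard training-free guidance step: compute a layout loss on the attention maps, backpropagate it to the latent, and take a gradient step before continuing denoising. The sketch below is a minimal illustration of this pattern, not the exact LoCo losses; `layout_loss` is a simplified stand-in (attention mass inside vs. outside the layout mask), and `attn_fn` is a hypothetical hook that returns per-object attention maps for a given latent.

```python
import torch

def layout_loss(attn, mask):
    """Simplified layout constraint (illustrative, not the exact LoCo loss):
    penalize attention mass that falls outside the object's designated region.
    attn: (H, W) non-negative attention map for one object token.
    mask: (H, W) binary mask marking the object's layout box."""
    inside = (attn * mask).sum()
    total = attn.sum() + 1e-8
    return (1.0 - inside / total) ** 2

def guidance_step(latent, attn_fn, masks, lr=0.1):
    """One training-free latent update at a denoising timestep.
    attn_fn is a hypothetical hook: latent -> list of (H, W) attention maps,
    one per object in the layout. The loss is differentiated through the
    attention maps and the latent is nudged along the negative gradient."""
    latent = latent.detach().requires_grad_(True)
    attn_maps = attn_fn(latent)
    loss = sum(layout_loss(a, m) for a, m in zip(attn_maps, masks))
    grad, = torch.autograd.grad(loss, latent)
    return (latent - lr * grad).detach(), loss.item()
```

In a real pipeline this step would run inside the sampler loop (typically only during the early timesteps), with `attn_fn` reading cross-attention maps captured from the UNet for each object's token.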

Visual comparisons with competing training-free methods

Visual variations across complex prompts and layout instructions

Quantitative comparisons

BibTeX


    @article{zhao2023loco,
      title={LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis},
      author={Zhao, Peiang and Li, Han and Jin, Ruiyang and Zhou, S Kevin},
      journal={arXiv preprint arXiv:2311.12342},
      year={2023}
    }