LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

1University of Science and Technology of China, 2Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology
Intro Image

Existing training-free layout-to-image synthesis approaches struggle to generate high-quality images that adhere to the given textual prompts and spatial layout. In contrast, LoCo is able to handle various spatial layouts and unusual prompts while maintaining high image quality and precise concept coverage.

Abstract

Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in accurately conveying fine-grained spatial compositions.

Here, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and spatial layouts. Our method introduces a Localized Attention Constraint to refine cross-attention for individual objects, ensuring their precise placement in designated regions. We further propose a Padding Token Constraint to leverage the semantic information embedded in previously neglected padding tokens, thereby preventing the undesired fusion of synthesized objects.

LoCo seamlessly integrates into existing text-to-image and layout-to-image models, significantly amplifying their performance and effectively addressing semantic failures observed in prior methods. Through extensive experiments, we showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.

Framework

Methods

We propose two loss functions (constraints) that update the image latent during the denoising process of the text-to-image diffusion model, based on the cross-attention maps extracted at each timestep. Optimizing these two losses correctly incorporates the layout instructions into the generative process, as sketched below.
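For intuition only, here is a minimal, hypothetical sketch of this training-free guidance loop. It is not the authors' implementation: the exact loss forms, the `attn_fn` hook, and the step size are assumptions; in practice the cross-attention maps would be collected from UNet attention hooks in a Stable-Diffusion-style pipeline at each denoising timestep.

```python
import torch

def localized_attention_loss(attn_maps, token_ids, box_mask):
    """Illustrative localized constraint: push each object token's
    cross-attention mass inside its layout box. box_mask is 1 inside
    the box, 0 outside; attn_maps has shape [num_tokens, H, W]."""
    loss = 0.0
    for t in token_ids:
        a = attn_maps[t]
        a = a / (a.sum() + 1e-8)            # normalize to a distribution
        inside = (a * box_mask).sum()       # attention mass inside the box
        loss = loss + (1.0 - inside) ** 2   # penalize mass leaking outside
    return loss / len(token_ids)

def padding_token_loss(attn_maps, pad_token_ids, box_masks):
    """Illustrative padding-token constraint: discourage padding tokens
    from attending inside object regions so neighboring objects do not
    fuse. A plausible form, not the paper's exact formula."""
    union = torch.clamp(torch.stack(box_masks).sum(0), max=1.0)
    loss = 0.0
    for t in pad_token_ids:
        a = attn_maps[t]
        a = a / (a.sum() + 1e-8)
        loss = loss + (a * union).sum() ** 2
    return loss / len(pad_token_ids)

def guided_update(latent, attn_fn, loss_fn, step_size=20.0):
    """One guidance step: compute the layout loss from the current
    cross-attention maps and move the latent down its gradient before
    the scheduler's denoising step."""
    latent = latent.detach().requires_grad_(True)
    attn_maps = attn_fn(latent)             # hooked UNet attention (hypothetical)
    loss = loss_fn(attn_maps)
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach()

# Toy usage with random tensors standing in for the UNet's attention:
attn_fn = lambda z: torch.softmax(z.flatten(1), dim=-1).reshape(4, 16, 16)
mask = torch.zeros(16, 16); mask[4:12, 4:12] = 1.0
z = torch.randn(4, 16, 16)
z = guided_update(z, attn_fn, lambda A: localized_attention_loss(A, [1], mask))
```

The design follows the common attention-guidance pattern in training-free layout methods: the diffusion weights stay frozen, and only the latent is nudged at each timestep so that the cross-attention of object tokens concentrates in the designated regions.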

Visual comparisons with competing training-free methods

Visual variations across complex prompts

Quantitative comparisons

BibTeX


    @article{zhao2023loco,
      title={LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis},
      author={Zhao, Peiang and Li, Han and Jin, Ruiyang and Zhou, S Kevin},
      journal={arXiv preprint arXiv:2311.12342},
      year={2023}
    }