STORM

Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

@ CVPR 2025
Woojung Han, Yeonkyung Lee, Chanyoung Kim, Kwanghyun Park, Seong Jae Hwang
Yonsei University
Visualizing STORM and Baselines

We introduce STORM,
Spatial Transport Optimization by Repositioning Attention Map

Abstract

Diffusion-based text-to-image (T2I) models have recently excelled at high-quality image generation, particularly in a training-free manner, enabling cost-effective adaptability and generalization across diverse tasks. However, while existing methods have focused on challenges such as "missing objects" and "mismatched attributes", another critical issue, "mislocated objects", remains: the spatial positions of generated objects fail to align with the text prompt. Surprisingly, ensuring such seemingly basic functionality remains challenging in popular T2I models due to the inherent difficulty of imposing explicit spatial guidance through text. To address this, we propose STORM (Spatial Transport Optimization by Repositioning Attention Map), a novel training-free approach for spatially coherent T2I synthesis. STORM employs Spatial Transport Optimization (STO), rooted in optimal transport theory, to dynamically adjust object attention maps for precise spatial adherence, supported by a custom Spatial Transport (ST) cost function that enhances spatial understanding. Our analysis shows that integrating spatial awareness is most effective in the early denoising stages, while later phases refine details. Extensive experiments demonstrate that STORM surpasses existing methods, effectively mitigating mislocated objects while also improving missing objects and mismatched attributes, setting a new benchmark for spatial alignment in T2I synthesis. The source code will be publicly released.



Video

Method


Overview Pipeline


Our method leverages optimal transport in a training-free manner, allowing the model to accurately reflect relative object positions at each step without additional inputs. Given the prompt "A car to the left of an elephant", our method dynamically adjusts the attention maps to induce the specified spatial relationship. The process starts with the initial attention maps for the car and the elephant for the latent \(z_t\) at time step \(t\). Using the centroids of these attention maps, Spatial Transport Optimization (STO) computes a loss that corrects the positional relationship (e.g., ensuring the car is to the left of the elephant). The updated attention map is then used to refine the latent representation \(z_t\), leading to a final image that adheres to the desired spatial arrangement. The comparison of attention maps before and after STO shows improved alignment, effectively placing the car to the left of the elephant as instructed in the prompt.
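To make the repositioning step concrete, below is a minimal PyTorch sketch of the centroid-based objective described above: it computes the (x, y) centroid of each object's cross-attention map and penalizes layouts where the car's centroid is not left of the elephant's. The function names, the margin value, and the latent-update rule are illustrative assumptions, not the paper's exact Spatial Transport cost.

import torch
import torch.nn.functional as F

def centroid(attn: torch.Tensor) -> torch.Tensor:
    # Weighted (x, y) centroid of a 2D attention map of shape (H, W).
    H, W = attn.shape
    probs = attn.flatten() / (attn.sum() + 1e-8)  # normalize to a distribution
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=attn.dtype, device=attn.device),
        torch.arange(W, dtype=attn.dtype, device=attn.device),
        indexing="ij",
    )
    cx = (probs * xs.flatten()).sum()
    cy = (probs * ys.flatten()).sum()
    return torch.stack([cx, cy])

def left_of_loss(attn_a: torch.Tensor, attn_b: torch.Tensor,
                 margin: float = 4.0) -> torch.Tensor:
    # Zero once object A's centroid sits at least `margin` pixels left of B's.
    ca, cb = centroid(attn_a), centroid(attn_b)
    return F.relu(ca[0] - cb[0] + margin)

# During sampling, the gradient of this loss is pushed back into the latent:
# z_t = z_t - step_size * torch.autograd.grad(loss, z_t)[0]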


Comparison of Attention Map Progression During Denoising


Visualization of the attention maps for "a red bird to the right of a green plant" throughout the denoising process for both Stable Diffusion (a) and our model (b). While Stable Diffusion struggles to distinctly capture the spatial relationship between the bird and the plant, our model effectively aligns the objects according to the specified spatial cue ("to the right of"). The resulting image from our model demonstrates improved spatial accuracy compared to Stable Diffusion.
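The stage-dependent behavior above suggests gating the spatial guidance by timestep. The following sketch applies a spatial loss only during the first portion of denoising and leaves later steps untouched; unet, scheduler, spatial_loss, and guidance_cutoff are hypothetical names following a diffusers-style API, not the released implementation, and the sketch assumes the UNet exposes its cross-attention maps.

import torch

def sample_with_early_guidance(unet, scheduler, z_t, text_emb,
                               spatial_loss, guidance_cutoff=0.5,
                               step_size=20.0):
    num_steps = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        if i < guidance_cutoff * num_steps:  # guide only the early denoising steps
            with torch.enable_grad():
                z_req = z_t.detach().requires_grad_(True)
                _, attn_maps = unet(z_req, t, text_emb)  # assumed: attention maps exposed
                loss = spatial_loss(attn_maps)           # e.g., left_of_loss above
                grad = torch.autograd.grad(loss, z_req)[0]
            z_t = z_t - step_size * grad                 # nudge the latent toward the layout
        with torch.no_grad():
            noise_pred, _ = unet(z_t, t, text_emb)
            z_t = scheduler.step(noise_pred, t, z_t).prev_sample
    return z_t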

Qualitative Results


Qualitative comparison on custom prompts containing both attribute and positional information, evaluating previous state-of-the-art training-free T2I methods (Attend&Excite, Divide&Bind, INITNO, and CONFORM) against ours.



Generated images for prompts containing complex positional details.

Quantitative Results


VISOR


Performance comparison of different models on the VISOR (%) and Object Accuracy (OA, %) metrics, based on Stable Diffusion 1.4 and Stable Diffusion 2.1.


T2I-CompBench


Comparison of methods on T2I-CompBench, reporting attribute binding and spatial relationship scores for models based on Stable Diffusion 1.4 and 2.1.


CLIP Score and TIFA Score


Comparison of various models on CLIP Image-Text, CLIP Text-Text, and TIFA scores across prompts involving Animal-Animal, Animal-Object, and Object-Object pairs.


User Study


User study evaluating model performance on object synthesis, attribute matching, spatial correctness, and overall fidelity.

BibTeX

@InProceedings{storm2025,
    author    = {Han, Woojung and Lee, Yeonkyung and Kim, Chanyoung and Park, Kwanghyun and Hwang, Seong Jae},
    title     = {Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {...}
}