Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
Method
Overview Pipeline

Our method leverages Optimal Transport in a training-free manner, allowing the model to accurately reflect relative object positions at each denoising step without additional inputs. Given the prompt "A car to the left of an elephant", our method dynamically adjusts the cross-attention maps to induce the specified spatial relationship. The process starts with the initial attention maps for the car and the elephant at timestep \(t\). Using the centroids of these attention maps, Spatial Transport Optimization (STO) computes losses that correct the positional relationship (e.g., ensuring the car is to the left of the elephant). The updated attention maps are then used to refine the latent representation \(z_t\), leading to a final image that adheres to the desired spatial arrangement. Comparing the attention maps before and after STO shows improved alignment, effectively placing the car to the left of the elephant as instructed in the prompt.
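To make the pipeline concrete, below is a minimal PyTorch sketch of the centroid-based repositioning signal described above. All names (attention_centroid, left_of_loss, margin, alpha) are illustrative, and the hinge penalty is a simplified stand-in for the paper's Optimal-Transport-based STO loss; the sketch only shows how attention-map centroids can drive a gradient step on the latent \(z_t\).

import torch

def attention_centroid(attn_map: torch.Tensor) -> torch.Tensor:
    """Soft (x, y) centroid of an HxW cross-attention map."""
    h, w = attn_map.shape
    attn = attn_map / (attn_map.sum() + 1e-8)
    ys = torch.arange(h, dtype=attn.dtype, device=attn.device)
    xs = torch.arange(w, dtype=attn.dtype, device=attn.device)
    cy = (attn.sum(dim=1) * ys).sum()   # weight each row by its y index
    cx = (attn.sum(dim=0) * xs).sum()   # weight each column by its x index
    return torch.stack([cx, cy])

def left_of_loss(attn_a: torch.Tensor, attn_b: torch.Tensor,
                 margin: float = 2.0) -> torch.Tensor:
    """Hinge penalty (illustrative, not the paper's OT loss): positive
    whenever object A's centroid is not at least `margin` cells to the
    left of object B's centroid."""
    cx_a = attention_centroid(attn_a)[0]
    cx_b = attention_centroid(attn_b)[0]
    return torch.relu(cx_a - cx_b + margin)

# Assumed latent update at timestep t, in the style of other
# training-free attention-guidance methods:
#   loss = left_of_loss(attn_car, attn_elephant)
#   grad = torch.autograd.grad(loss, z_t)[0]
#   z_t  = z_t - alpha * grad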
Comparison of Attention Map Progression During Denoising

Visualization of the attention maps for "a red bird to the right of a green plant" throughout the denoising process, for (a) Stable Diffusion and (b) our model. While Stable Diffusion struggles to distinctly capture the spatial relationship between the bird and the plant, our model effectively aligns the objects according to the specified spatial cue ("to the right of"). The resulting image from our model demonstrates improved spatial accuracy compared to Stable Diffusion.
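For readers reproducing such visualizations, a small sketch of how a per-token spatial map can be derived from captured cross-attention probabilities follows; it assumes the probabilities (shape heads x (res*res) x text tokens) have already been recorded from one UNet cross-attention layer, e.g., via forward hooks, and the function name is hypothetical.

import torch

def token_attention_map(attn_probs: torch.Tensor, token_idx: int,
                        res: int) -> torch.Tensor:
    """attn_probs: cross-attention probabilities from one UNet layer,
    shape (num_heads, res*res, num_text_tokens). Returns a (res, res)
    map for one prompt token, averaged over heads and min-max
    normalized for display."""
    per_token = attn_probs[..., token_idx]          # (num_heads, res*res)
    avg = per_token.mean(dim=0).reshape(res, res)   # head-averaged map
    avg = avg - avg.min()
    return avg / (avg.max() + 1e-8)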
Qualitative Results

Qualitative comparison on custom prompts that combine attribute and positional information, evaluating previous state-of-the-art training-free T2I methods (Attend&Excite, Divide&Bind, INITNO, and CONFORM) against ours.

Generated images for prompts with complex positional details.
Quantitative Results
VISOR

Performance comparison between different models on the VISOR (%) and Object Accuracy (OA, %) metrics, based on Stable Diffusion 1.4 and Stable Diffusion 2.1.
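As background, VISOR scores whether an object detector finds both mentioned objects and whether their detected positions satisfy the prompt's relation, while OA measures whether both objects are detected at all. The sketch below is a plausible rendering of the centroid-based relation check (detector step omitted; the function name and coordinate convention are assumptions, not the benchmark's exact code).

def relation_holds(box_a, box_b, relation: str) -> bool:
    """box = (x0, y0, x1, y1) in image coordinates (y grows downward).
    Decides whether box_a stands in `relation` to box_b by comparing
    bounding-box centroids."""
    cx_a, cy_a = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cx_b, cy_b = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return {
        "left of":  cx_a < cx_b,
        "right of": cx_a > cx_b,
        "above":    cy_a < cy_b,
        "below":    cy_a > cy_b,
    }[relation]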
T2I-CompBench

Comparison of methods on T2I-CompBench, with attribute binding and spatial relationship scores computed for models based on Stable Diffusion 1.4 and 2.1.
CLIP Score and TIFA Score

Comparison of various models on CLIP Image-Text, CLIP Text-Text, and TIFA scores across prompts involving Animal-Animal, Animal-Object, and Object-Object pairs.
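For reference, the CLIP Image-Text score measures the cosine similarity between CLIP embeddings of a generated image and its prompt. A minimal sketch using the Hugging Face transformers CLIP API is given below; the model checkpoint and scoring protocol are assumptions and may differ from the paper's exact setup.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_text_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()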
User Study

User study evaluating model performance on object synthesis, attribute matching, spatial correctness, and overall fidelity.
BibTeX
@InProceedings{storm2025,
    author    = {Han, Woojung and Lee, Yeonkyung and Kim, Chanyoung and Park, Kwanghyun and Hwang, Seong Jae},
    title     = {Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {...}
}