
DragText: Rethinking Text Embedding in Point-based Image Editing

Round 1 Accept @ WACV 2025
Gayoon Choi*, Taejin Jeong*, Sujung Hong, Jaehoon Joo, Seong Jae Hwang
Yonsei University
Visualizing DragText and Baselines

We introduce DragText: Rethinking Text Embedding in Point-based Image Editing.

Abstract

Point-based image editing enables accurate and flexible control through content dragging. However, the role of the text embedding in the editing process has not been thoroughly investigated, and the interaction between the text and image embeddings in particular remains unexplored. In this study, we show that while the image is progressively edited in a diffusion model, the text embedding remains constant. As the image embedding increasingly diverges from its initial state, the growing discrepancy between the image and text embeddings becomes a significant obstacle. Moreover, we find that the text prompt strongly influences the dragging process, particularly in maintaining content integrity and achieving the intended manipulation. Based on these insights, we propose DragText, which optimizes the text embedding in conjunction with the dragging process so that it remains paired with the modified image embedding. At the same time, we regularize the text optimization to preserve the integrity of the original text prompt. Our approach can be seamlessly integrated with existing diffusion-based drag methods with only a few lines of code.


Method


Point-based Image Editing


Illustration of the drag editing process within the image and text embedding spaces of the diffusion model (DM). During editing, the original latent vector \(\mathbf{z}_t\) naturally deviates toward the dragged latent vector \(\mathbf{\bar{z}}_t\). Without text optimization, the corresponding text embedding \(\mathbf{c}\) becomes decoupled from \(\mathbf{\bar{z}}_t\). Hence, an optimal text embedding \(\mathbf{\hat{c}}\), coupled with the dragged image, must be obtained so that the optimal latent vector \(\mathbf{\hat{z}}_t\) retains the semantics conveyed by the text.
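
As a rough illustration of how \(\mathbf{\hat{z}}_t\) and \(\mathbf{\hat{c}}\) could be obtained jointly while keeping the prompt intact, one schematic objective (written in our own notation, not the paper's exact formulation; \(\mathcal{L}_{\mathrm{drag}}\) stands for the underlying drag method's motion-supervision loss and \(\lambda\) is an assumed regularization weight) is:

\[
\mathbf{\hat{z}}_t,\; \mathbf{\hat{c}} \;=\; \underset{\mathbf{z},\,\mathbf{c}'}{\arg\min}\;\; \mathcal{L}_{\mathrm{drag}}\!\left(\mathbf{z}, \mathbf{c}'\right) \;+\; \lambda \left\lVert \mathbf{c}' - \mathbf{c} \right\rVert_2^2
\]

The first term drives the latent toward the dragged content, while the second keeps the optimized text embedding close to the original prompt embedding \(\mathbf{c}\).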


Pipeline


The pipeline of DragText. The image \(\mathbf{x}_0\) is mapped to a low-dimensional latent space by a VAE encoder, and the text prompt is encoded into the text embedding \(\mathbf{c}\) by a CLIP text encoder. Through DDIM inversion conditioned on \(\mathbf{c}\), the latent vector \(\mathbf{z}_t\) is obtained. At time step \(t=35\), \(\mathbf{z}^0_t\) and \(\mathbf{c}\) are optimized into \(\mathbf{\hat{z}}^k_t\) and \(\mathbf{\hat{c}}\) by iterating motion supervision (M.S.) and point tracking (P.T.) \(k\) times.
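
The sketch below shows how this joint optimization could be wired into a typical drag-editing loop. It is a minimal illustration, not the released implementation: motion_supervision_loss and point_tracking are hypothetical placeholders for the underlying drag method's components, and the hyperparameters (k, learning rates, regularization weight lam) are illustrative values.

import torch

def drag_with_text_optimization(z_t, c, handle_pts, target_pts,
                                unet, k=80, lr_z=0.01, lr_c=0.004, lam=0.1):
    """Jointly optimize the latent z_t and the text embedding c while dragging.

    z_t : inverted latent at the chosen time step (e.g., t = 35)
    c   : CLIP text embedding of the prompt
    motion_supervision_loss and point_tracking are placeholders for the
    drag method's own motion supervision and point tracking steps.
    """
    z_hat = z_t.clone().requires_grad_(True)   # latent being dragged
    c_hat = c.clone().requires_grad_(True)     # text embedding being optimized
    opt = torch.optim.Adam([{"params": [z_hat], "lr": lr_z},
                            {"params": [c_hat], "lr": lr_c}])

    for _ in range(k):
        # Motion supervision: pull features at the handle points toward the
        # targets, conditioned on the *current* text embedding c_hat.
        loss = motion_supervision_loss(unet, z_hat, c_hat, handle_pts, target_pts)
        # Keep the optimized text embedding close to the original prompt embedding.
        loss = loss + lam * (c_hat - c).pow(2).mean()

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Point tracking: update the handle points to follow the dragged content.
        with torch.no_grad():
            handle_pts = point_tracking(unet, z_hat, c_hat, handle_pts, target_pts)

    return z_hat.detach(), c_hat.detach()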

Qualitative Results


Qualitative results of applying DragText to DragDiffusion.



Qualitative results of applying DragText to DragDiffusion, FreeDrag, DragNoise, and GoodDrag.

Application

Controlling Drag


Manipulating the optimized text embedding can control the degree and direction of the drag after editing.
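
One simple way such control could be exposed is sketched below, under the assumption that linearly blending the original embedding \(\mathbf{c}\) with the optimized embedding \(\mathbf{\hat{c}}\) scales the effect of the drag; the blending weight w is an illustrative parameter, not a value prescribed by the paper.

import torch

def blend_text_embeddings(c, c_hat, w=0.5):
    """Interpolate between the original and the optimized text embedding.

    w = 0.0 keeps the original prompt embedding; w = 1.0 uses the fully
    optimized embedding; intuitively, intermediate values should yield
    intermediate drag strength when used as the denoising condition.
    """
    return (1.0 - w) * c + w * c_hat

# Example: denoise with a half-strength condition.
# c_blend = blend_text_embeddings(c, c_hat, w=0.5)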

Quantitative Results


Quantitative results of DragText on the DragBench dataset.

BibTeX

@article{dragtext2024,
  author = {Choi, Gayoon and Jeong, Taejin and Hong, Sujung and Joo, Jaehoon and Hwang, Seong Jae},
  title  = {DragText: Rethinking Text Embedding in Point-based Image Editing},
  month  = {July},
  year   = {2024},
}