Abstract
Method
Point-based Image Editing
Illustration of the drag editing process within the image and text embedding spaces of the diffusion model (DM). During editing, the original image embedding \(\mathbf{z}_t\) gradually drifts toward the dragged image latent vector \(\mathbf{\bar{z}}_t\). Without text optimization, the corresponding text embedding \(\mathbf{c}\) becomes decoupled from \(\mathbf{\bar{z}}_t\). Hence, an optimal text embedding \(\mathbf{\hat{c}}\), coupled with the dragged image, must be obtained to produce the optimal latent vector \(\mathbf{\hat{z}}_t\), which then retains the relevant semantics via text.
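The coupling idea above can be sketched as joint gradient descent of the latent and the text embedding on a shared objective. The quadratic loss, the coupling weight, and all names below are illustrative assumptions, not the paper's actual drag objective:

```python
# Toy sketch: jointly optimize a latent z and text embedding c with gradient
# descent on a shared quadratic "drag" loss. The loss and every name here are
# illustrative assumptions, not the actual DragText objective.

def drag_step(z, c, z_target, lr=0.1, lam=0.5):
    # d/dz [ (z - z_target)^2 + lam * (z - c)^2 ]
    grad_z = 2 * (z - z_target) + 2 * lam * (z - c)
    # d/dc [ lam * (z - c)^2 ] -- the text embedding follows the moving
    # latent instead of staying frozen (the DragText idea)
    grad_c = 2 * lam * (c - z)
    return z - lr * grad_z, c - lr * grad_c

z, c, z_target = 0.0, 0.0, 1.0   # stand-ins for z_t, c, and the dragged latent
for _ in range(300):
    z, c = drag_step(z, c, z_target)

print(round(z, 3), round(c, 3))  # both settle near the target together
```

Because the coupling term pulls \(\mathbf{c}\) along with \(\mathbf{z}\), the two stay consistent at convergence, mirroring how \(\mathbf{\hat{c}}\) and \(\mathbf{\hat{z}}_t\) remain coupled in the figure.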
Pipeline
The pipeline of DragText. The image \(\mathbf{x}_0\) is mapped to a low-dimensional latent space by a VAE encoder, and the text prompt is encoded by a CLIP text encoder into the text embedding \(\mathbf{c}\). Through DDIM inversion with \(\mathbf{c}\), the latent vector \(\mathbf{z}_t\) is obtained. At time step \(t=35\), \(\mathbf{z}^0_t\) and \(\mathbf{c}\) are optimized into \(\mathbf{\hat{z}}^k_t\) and \(\mathbf{\hat{c}}\) by iterating motion supervision (M.S.) and point tracking (P.T.) \(k\) times.
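The alternating loop in the caption can be sketched as follows. The update rules, step sizes, and function names are stand-in assumptions; the real method runs these steps on UNet features of the diffusion latent rather than on toy 2D points:

```python
# Minimal sketch of the k-iteration loop: motion supervision (M.S.) moves the
# latent (and, in DragText, the text embedding) so the handle point approaches
# the target; point tracking (P.T.) then re-locates the handle. All update
# rules here are illustrative assumptions.

def motion_supervision(z, c, handle, target, lr=0.2):
    # Nudge the latent toward moving the handle to the target, and update
    # the text embedding c together with it (the DragText change).
    d = [t - h for t, h in zip(target, handle)]
    z = [zi + lr * di for zi, di in zip(z, d)]
    c = [ci + 0.5 * lr * di for ci, di in zip(c, d)]
    return z, c

def point_tracking(handle, target, step=0.2):
    # Re-locate the handle point after the latent has changed.
    return [h + step * (t - h) for h, t in zip(handle, target)]

z, c = [0.0, 0.0], [0.0, 0.0]         # stand-ins for z_t^0 and c at t = 35
handle, target = [1.0, 1.0], [3.0, 2.0]

for k in range(50):                   # k iterations of M.S. followed by P.T.
    z, c = motion_supervision(z, c, handle, target)
    handle = point_tracking(handle, target)

print([round(h, 2) for h in handle])  # the handle has reached the target
```

The key design point the sketch highlights is that \(\mathbf{c}\) is updated inside the same loop as the latent, rather than being held fixed as in prior drag-editing methods.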
Qualitative Results
Qualitative results of applying DragText to DragDiffusion.
Qualitative results of applying DragText to DragDiffusion, FreeDrag, DragNoise, and GoodDrag.
Application
Controlling Drag
Manipulating the optimized text embedding controls the degree and direction of the drag after editing.
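One way such control could look in practice is linear interpolation (or extrapolation) between the original embedding \(\mathbf{c}\) and the optimized \(\mathbf{\hat{c}}\). The helper name, the toy values, and the extrapolation behavior are illustrative assumptions:

```python
# Sketch of controlling drag strength by blending the original text embedding
# c with the optimized embedding c_hat. Names and values are illustrative.

def control_drag(c, c_hat, alpha):
    """alpha = 0: original semantics; 1: full drag effect; >1: exaggerated
    drag; <0: drag in the reverse direction (assumed behavior)."""
    return [ci + alpha * (hi - ci) for ci, hi in zip(c, c_hat)]

c = [0.2, -0.1, 0.5]      # original embedding (toy values)
c_hat = [0.6, 0.3, 0.1]   # optimized embedding after dragging

print(control_drag(c, c_hat, 0.0))   # recovers the original embedding c
print(control_drag(c, c_hat, 1.0))   # recovers the optimized embedding c_hat
print(control_drag(c, c_hat, 0.5))   # halfway: a weaker drag effect
```

A single scalar `alpha` then serves as the user-facing knob for both degree (its magnitude) and direction (its sign) of the drag.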
Quantitative Results
Quantitative results of DragText on the DragBench dataset.
BibTeX
@article{dragtext2024,
  author = {Choi, Gayoon and Jeong, Taejin and Hong, Sujung and Joo, Jaehoon and Hwang, Seong Jae},
  title  = {DragText: Rethinking Text Embedding in Point-based Image Editing},
  month  = {July},
  year   = {2024},
}