Abstract
Presentation Video
Method
Overall Pipeline
We present CASS, an object-level Context-Aware training-free open-vocabulary Semantic Segmentation model. Our method distills the object-level contextual spectral graph of a vision foundation model (VFM) into CLIP's attention and refines query text embeddings towards object-specific semantics.
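As a concrete illustration of the training-free prediction that these two components feed into, the sketch below matches dense CLIP patch features against class text embeddings by cosine similarity. Tensor shapes and variable names are illustrative assumptions, not the released implementation.

```python
# Minimal runnable sketch of the final training-free prediction step:
# per-patch CLIP features are matched to class text embeddings by cosine
# similarity. Shapes and names here are illustrative assumptions.
import torch
import torch.nn.functional as F

patch_feats = torch.randn(196, 512)   # dense CLIP patch features (14x14 grid)
text_embs = torch.randn(5, 512)       # one CLIP text embedding per class

patch_feats = F.normalize(patch_feats, dim=-1)
text_embs = F.normalize(text_embs, dim=-1)

logits = patch_feats @ text_embs.T               # (196, 5) similarity scores
seg = logits.argmax(dim=-1).reshape(14, 14)      # per-patch class prediction
```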
Spectral Object-Level Context Distillation
Detailed illustration of our proposed training-free spectral object-level context distillation mechanism. By matching the attention graphs of the VFM and CLIP head-by-head to establish complementary relationships and distilling the fundamental object-level context of the VFM graph into CLIP, we enhance CLIP's ability to capture intra-object contextual coherence.
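The sketch below illustrates the low-rank eigenscaling idea on a single token-affinity graph: the leading eigencomponents, which tend to encode coarse object-level grouping, are boosted before the graph is rebuilt. The rank k, scaling factor gamma, and function names are illustrative assumptions rather than the exact formulation used in the paper.

```python
# A minimal sketch of low-rank eigenscaling on a VFM attention graph.
# `k` and `gamma` are illustrative placeholders, not the paper's settings.
import torch

def low_rank_eigenscale(affinity: torch.Tensor, k: int = 8, gamma: float = 2.0) -> torch.Tensor:
    """Boost the leading eigencomponents of a token-affinity graph and
    rebuild it, emphasizing object-level grouping."""
    # Symmetrize so a real eigendecomposition is guaranteed.
    sym = 0.5 * (affinity + affinity.transpose(-1, -2))
    eigvals, eigvecs = torch.linalg.eigh(sym)        # eigenvalues in ascending order
    scaled = eigvals.clone()
    scaled[..., -k:] = scaled[..., -k:] * gamma      # scale the top-k components
    return eigvecs @ torch.diag_embed(scaled) @ eigvecs.transpose(-1, -2)

# Toy usage: one 197x197 token graph.
graph = torch.softmax(torch.randn(197, 197), dim=-1)
object_graph = low_rank_eigenscale(graph)
```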
Object Presence-Driven Object-Level Context
Detailed illustration of our object presence prior-guided text embedding adjustment module. The CLIP text encoder generates a text embedding for each object class, and the object presence prior is derived from both visual and text embeddings. Within hierarchically defined class groups, text embeddings are selected based on the object presence prior and then refined in an object-specific direction to align with components likely present in the image.
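A minimal sketch of this idea, assuming precomputed CLIP image and text embeddings: the presence prior is a softmax over image-text similarities, and likely-present classes are shifted towards the image-specific direction. The temperature, shift weight alpha, and function names are our own placeholders, and the hierarchical class grouping is omitted for brevity.

```python
# Sketch of an object presence prior and text embedding refinement,
# assuming precomputed CLIP embeddings. Hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def object_presence_prior(image_emb: torch.Tensor,
                          text_embs: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Probability that each class is present, from image-text similarity."""
    image_emb = F.normalize(image_emb, dim=-1)     # (D,)
    text_embs = F.normalize(text_embs, dim=-1)     # (C, D)
    return torch.softmax(text_embs @ image_emb / temperature, dim=-1)

def refine_text_embeddings(text_embs: torch.Tensor,
                           image_emb: torch.Tensor,
                           presence: torch.Tensor,
                           alpha: float = 0.3) -> torch.Tensor:
    """Shift likely-present classes towards the image-specific direction."""
    direction = F.normalize(image_emb, dim=-1)            # (D,)
    shift = alpha * presence.unsqueeze(-1) * direction    # (C, D)
    return F.normalize(text_embs + shift, dim=-1)
```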
Visualization
Effect of Spectral Object-Level Context Distillation
Attention score visualization for various query points. Left: Vanilla CLIP (A_CLIP) shows noisy, unfocused attention. Center: VFM-to-CLIP distillation without low-rank eigenscaling shows partial object grouping with limited detail. Right: Incorporating our low-rank eigenscaling captures object-level context, improving grouping within a single object.
Qualitative Results
Datasets
Qualitative comparison across the Pascal VOC, Pascal Context, COCO, and ADE20K datasets using CLIP ViT-B/16.
Open-Vocabulary Semantic Segmentation in the Wild
Quantitative Results
mIoU
Quantitative comparison with state-of-the-art unsupervised open-vocabulary semantic segmentation models on eight datasets.
pAcc
Quantitative results using average pixel accuracy.
Scale-up Version (mIoU)
The scale-up version of CASS uses a larger CLIP encoder (ViT-L/14) to compute the object presence prior, rather than for feature extraction, enabling more accurate object classification. This configuration consistently outperforms CaR across all benchmarks, demonstrating the robustness and adaptability of CASS to variations in encoder capacity.
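A sketch of how this configuration could be set up, assuming the OpenAI clip package: ViT-B/16 remains the backbone for dense features, while ViT-L/14 is loaded only to score which classes are present. The loading code is an illustration, not the released configuration.

```python
# Scale-up configuration sketch: two CLIP encoders with different roles.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
feature_model, preprocess = clip.load("ViT-B/16", device=device)  # dense features for segmentation
prior_model, _ = clip.load("ViT-L/14", device=device)             # object presence prior only
```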
Application
Image Editing & Object Removal
Visualization of image inpainting and object removal using our predicted mask. For image inpainting, we use "red sports car with red wheels" as an input prompt. Note that the mask refinement step is excluded when segmenting the object mask.
Our method generates an accurate and complete object mask directly from the prompt, enabling seamless inpainting and object removal. In contrast, the baseline produces incomplete masks, failing to capture essential components such as wheels and headlights, which negatively impacts the quality of the edited images.
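A hedged usage sketch of this application, assuming the Hugging Face diffusers inpainting pipeline and a binary mask saved by the segmentation step; the file paths and model identifier are placeholders.

```python
# Inpainting with a predicted object mask, assuming the diffusers pipeline.
# Paths and the model id are placeholders for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("car.jpg").convert("RGB")
mask = Image.open("car_mask.png").convert("L")   # white = region to edit

edited = pipe(
    prompt="red sports car with red wheels",
    image=image,
    mask_image=mask,
).images[0]
edited.save("car_edited.jpg")
```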
BibTeX
@article{kim2024cass,
author = {Kim, Chanyoung and Ju, Dayun and Han, Woojung and Yang, Ming-Hsuan and Hwang, Seong Jae},
title = {Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation},
journal = {arXiv preprint arXiv:2411.1715},
year = {2024},
}