CASS

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Chanyoung Kim1, Dayun Ju1, Woojung Han1,
Ming-Hsuan Yang1,2, Seong Jae Hwang1
1 Yonsei University, 2 University of California, Merced
Teaser image: open-vocabulary segmentation results for arbitrary queries such as Pikachu, Mario, Thinker, Space Needle, Pyramid, and Audi.

We introduce CASS,
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation.

Abstract

Open-Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision-language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training-free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet a critical issue persists: a lack of object-level context consideration when segmenting complex objects in the challenging environment of OVSS based on arbitrary query prompts. This oversight limits models' ability to group semantically consistent elements within an object and map them precisely to user-defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images. Specifically, our model enhances intra-object consistency by distilling spectral-driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero-shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object-level contextual knowledge, our proposed approach achieves state-of-the-art performance with strong generalizability across diverse datasets.



Presentation Video

Method


Overall Pipeline

Main figure

We present CASS, an object-level Context-Aware training-free open-vocabulary Semantic Segmentation model. Our method distills the vision foundation model's (VFM) object-level contextual spectral graph into CLIP's attention and refines query text embeddings toward object-specific semantics.


Spectral Object-Level Context Distillation

Main figure

Detailed illustration of our proposed training-free spectral object-level context distillation mechanism. By matching the attention graphs of the VFM and CLIP head-by-head to establish complementary relationships, and distilling the fundamental object-level context of the VFM graph into CLIP, we enhance CLIP's ability to capture intra-object contextual coherence.
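As a rough illustration of this step, the sketch below (PyTorch) eigendecomposes a VFM attention graph, rescales its dominant spectral components, and blends the result into CLIP's attention head-by-head. The function names (`low_rank_eigenscale`, `distill_vfm_graph`), the least-similarity head-matching criterion, and the weights `alpha` and `gamma` are illustrative assumptions, not the released implementation.

```python
import torch

def low_rank_eigenscale(attn, k=16, gamma=2.0):
    """Low-rank eigenscaling of a symmetric attention graph (illustrative).

    Keeps the top-k spectral components and amplifies the dominant
    eigenvalues so that object-level structure is emphasized.
    """
    # Symmetrize so the eigendecomposition is real-valued.
    sym = 0.5 * (attn + attn.transpose(-1, -2))
    evals, evecs = torch.linalg.eigh(sym)              # eigenvalues in ascending order
    evals_k = evals[..., -k:]                          # top-k eigenvalues
    evecs_k = evecs[..., -k:]                          # corresponding eigenvectors
    scaled = evals_k.sign() * evals_k.abs() ** gamma   # emphasize dominant modes
    return evecs_k @ torch.diag_embed(scaled) @ evecs_k.transpose(-1, -2)

def distill_vfm_graph(clip_attn, vfm_attn, alpha=0.5, k=16):
    """Fuse the VFM's object-level graph into CLIP's attention (illustrative).

    clip_attn, vfm_attn: (heads, N, N) patch-token attention graphs from the
    last blocks of CLIP and the VFM, aligned to a shared token count N.
    """
    # Head-by-head matching: pair each CLIP head with the least similar
    # (most complementary) VFM head by cosine similarity of flattened graphs.
    c = clip_attn.flatten(1) / clip_attn.flatten(1).norm(dim=1, keepdim=True)
    v = vfm_attn.flatten(1) / vfm_attn.flatten(1).norm(dim=1, keepdim=True)
    order = (c @ v.t()).argmin(dim=1)                  # complementary VFM head per CLIP head
    matched = vfm_attn[order]
    # Distill the low-rank, eigenscaled VFM graph into CLIP's attention.
    return (1 - alpha) * clip_attn + alpha * low_rank_eigenscale(matched, k=k)
```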


Object Presence-Driven Object-Level Context

Main figure

Detailed illustration of our object presence prior-guided text embedding adjustment module. The CLIP text encoder generates text embeddings for each object class, and the object presence prior is derived from both the visual and text embeddings. Within hierarchically defined class groups, text embeddings are selected based on the object presence prior, then refined in an object-specific direction to align with components likely present in the image.
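The sketch below gives one plausible reading of this module in PyTorch: the presence prior is the softmax-scaled similarity between the global image embedding and the class text embeddings, and within each hierarchically defined group the embeddings are pulled toward the member most likely present. The group structure, temperature `tau`, and mixing weight `beta` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def presence_prior(image_cls, text_emb, tau=0.07):
    """Zero-shot object presence prior (illustrative).

    image_cls: (D,) global CLIP image embedding; text_emb: (C, D) class embeddings.
    Returns a probability over classes indicating how likely each object is present.
    """
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(image_cls, dim=-1)
    return (sim / tau).softmax(dim=0)

def adjust_text_embeddings(image_cls, text_emb, groups, beta=0.3):
    """Object presence prior-guided text embedding adjustment (illustrative).

    groups: list of index lists, hierarchically defined class groups.
    For each group, the member with the highest presence prior anchors an
    object-specific direction toward which the group embeddings are refined.
    """
    prior = presence_prior(image_cls, text_emb)
    adjusted = text_emb.clone()
    for idx in groups:
        idx = torch.as_tensor(idx)
        anchor = idx[prior[idx].argmax()]              # member most likely present
        direction = F.normalize(text_emb[anchor], dim=-1)
        # Pull the group's embeddings toward the object-specific direction.
        adjusted[idx] = F.normalize((1 - beta) * text_emb[idx] + beta * direction, dim=-1)
    return adjusted
```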

Visualization


Effect of Spectral Object-Level Context Distillation

attention

Attention score visualization for various query points. Left: vanilla CLIP attention (A^CLIP) is noisy and unfocused. Center: VFM-to-CLIP distillation without low-rank eigenscaling shows partial object grouping with limited detail. Right: incorporating our low-rank eigenscaling captures object-level context, improving grouping within a single object.

Qualitative Results


Datasets

qualitative comparison

Qualitative comparison across the Pascal VOC, Pascal Context, COCO, and ADE20K datasets using CLIP ViT-B/16.


Open-Vocabulary Semantic Segmentation in the Wild

qualitative comparison

Quantitative Results


mIoU

quantitative comparison

Quantitative results with state-of-the-art unsupervised open-vocabulary semantic segmentation models on eight datasets.


pAcc

pacc comparison

Quantitative results using average pixel accuracy.


Scale-up Version (mIoU)

scale-up version

The scale-up version of CASS uses a larger CLIP encoder (ViT-L/14) to compute the object presence prior, rather than for feature extraction, enabling more accurate object classification. This configuration consistently outperforms CaR across all benchmarks, demonstrating the robustness and adaptability of CASS to variations in encoder capacity.
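A minimal sketch of this two-encoder configuration, using the OpenAI `clip` package: the ViT-B/16 model would supply the dense features for segmentation, while the ViT-L/14 model is queried only to score the object presence prior. The loading code, temperature `tau`, and file handling are illustrative assumptions; in the actual pipeline this prior feeds the text-embedding adjustment described above.

```python
import clip   # OpenAI CLIP package (illustrative choice)
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Base encoder (ViT-B/16) provides the dense patch features for segmentation;
# the larger encoder (ViT-L/14) is used only to compute the object presence prior.
seg_model, seg_preprocess = clip.load("ViT-B/16", device=device)
prior_model, prior_preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def scaled_up_presence_prior(pil_image, class_names, tau=0.07):
    """Score the object presence prior with the larger encoder (illustrative)."""
    image = prior_preprocess(pil_image).unsqueeze(0).to(device)
    tokens = clip.tokenize(class_names).to(device)
    img_emb = prior_model.encode_image(image).float()
    txt_emb = prior_model.encode_text(tokens).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return ((txt_emb @ img_emb.t()).squeeze(-1) / tau).softmax(dim=0)
```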

Application


Image Editing & Object Removal

application

Visualization of image inpainting and object removal using our predicted mask. For image inpainting, we use "red sports car with red wheels" as an input prompt. Note that the mask refinement step is excluded when segmenting the object mask. Our method generates an accurate and complete object mask directly from the prompt, enabling seamless inpainting and object removal. In contrast, the baseline produces incomplete masks, failing to capture essential components such as wheels and headlights, which negatively impacts the quality of the edited images.
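For reference, a predicted mask like ours can be passed to an off-the-shelf inpainting model; the sketch below uses the Hugging Face `diffusers` inpainting pipeline with the same prompt as in the figure. The checkpoint name and file paths are illustrative placeholders, not part of CASS.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Load a generic inpainting model (illustrative checkpoint choice).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("car.jpg").convert("RGB")    # hypothetical input image
mask = Image.open("car_mask.png").convert("L")  # mask predicted for the text query

# Inpaint the masked region; the prompt mirrors the example in the figure above.
edited = pipe(
    prompt="red sports car with red wheels",
    image=image,
    mask_image=mask,
).images[0]
edited.save("car_edited.png")
```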

BibTeX

@article{kim2024cass,
    author  = {Kim, Chanyoung and Ju, Dayun and Han, Woojung and Yang, Ming-Hsuan and Hwang, Seong Jae},
    title   = {Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation},
    journal = {arXiv preprint arXiv:2411.1715},
    year    = {2024},
}