EAGLE🦅: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation

Highlight @ CVPR 2024
Chanyoung Kim*, Woojung Han*, Dayun Ju, Seong Jae Hwang
Yonsei University
Visualizing EAGLE and Baselines

We introduce EAGLE,
Eigen AGgregation LEarning for object-centric unsupervised semantic segmentation.

Abstract

Semantic segmentation has innately relied on extensive pixel-level annotated data, leading to the emergence of unsupervised methodologies. Among them, leveraging self-supervised Vision Transformers for unsupervised semantic segmentation (USS) has been making steady progress with expressive deep features. Yet, for semantically segmenting images with complex objects, a predominant challenge remains: the lack of explicit object-level semantic encoding in patch-level features. This technical limitation often leads to inadequate segmentation of complex objects with diverse structures. To address this gap, we present a novel approach, EAGLE, which emphasizes object-centric representation learning for unsupervised semantic segmentation. Specifically, we introduce EiCue, a spectral technique providing semantic and structural cues through an eigenbasis derived from the semantic similarity matrix of deep image features and the color affinity of an image. Further, by incorporating our object-centric contrastive loss with EiCue, we guide our model to learn object-level representations with intra- and inter-image object-feature consistency, thereby enhancing semantic accuracy. Extensive experiments on the COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate the state-of-the-art USS results of EAGLE, with accurate and consistent semantic segmentation across complex scenes.



Video

Method


Pipeline

Main figure

The pipeline of EAGLE. Leveraging the Laplacian matrix, which integrates hierarchically projected image key features and color affinity, the model applies eigenvector clustering to capture object-level perspective cues, defined as \( \mathcal{M}_{eicue} \) and \( \tilde{\mathcal{M}}_{eicue} \). Distilling knowledge from \( \mathcal{M}_{eicue} \), our model further adopts an object-centric contrastive loss that utilizes the projected vectors \( Z \) and \( \tilde{Z} \). The learnable prototype \( \Phi \), assigned from \( Z \) and \( \tilde{Z} \), acts as a singular anchor that contrasts positive objects against negative objects. Our object-centric contrastive loss is computed in two distinct manners, intra-image (\( \mathcal{L}_{obj} \)) and inter-image (\( \mathcal{L}_{sc} \)), to ensure semantic consistency.
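The prototype-anchored contrast described above can be sketched as follows. This is a minimal NumPy illustration, not the official EAGLE loss: the function name `object_contrastive_loss`, the temperature value, and the simple mean-pooled prototypes are all assumptions made for clarity. It pools the projected features \( Z \) and \( \tilde{Z} \) into one prototype per object (using an EiCue-style assignment) and applies an InfoNCE-style cross-entropy so each patch matches its own object's prototype rather than the others.

```python
import numpy as np

def object_contrastive_loss(Z, Z_tilde, labels, tau=0.07):
    """Hedged sketch of an object-centric contrastive loss (not the official
    EAGLE implementation): Z and Z_tilde are (N, D) projected features from
    two views, `labels` is a per-patch object assignment (e.g. from EiCue)."""
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    Z, Z_tilde = norm(Z), norm(Z_tilde)
    classes = np.unique(labels)
    # Prototype per object: mean of both views' features, re-normalized.
    protos = norm(np.stack([
        (Z[labels == c].mean(0) + Z_tilde[labels == c].mean(0)) / 2
        for c in classes
    ]))
    # Similarity of every patch to every prototype, scaled by temperature.
    logits = Z @ protos.T / tau                  # (N, num_objects)
    target = np.searchsorted(classes, labels)    # each patch's own object
    # Cross-entropy: patch should be closest to its own object's prototype.
    shifted = logits - logits.max(1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(1, keepdims=True))
    return -logp[np.arange(len(labels)), target].mean()
```

In this reading, the prototypes play the role of the singular anchor \( \Phi \): positives are a patch's own object prototype, negatives are the prototypes of every other object.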


Eigen Aggregation Module

Main figure

An illustration of EiCue generation. From the input image, both a color affinity matrix \( A_{color} \) and a semantic similarity matrix \( A_{seg} \) are derived, which are combined to form the Laplacian \( L_{sym} \). An eigenvector subset \( \hat{V} \) of \( L_{sym} \) is clustered to produce EiCue.
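The steps above can be sketched in a few lines of NumPy. This is a simplified illustration under assumed shapes and hyperparameters (the Gaussian bandwidth, the number of eigenvectors `k`, and the inline k-means are choices made here, not the paper's exact implementation): build \( A_{seg} \) from deep patch features and \( A_{color} \) from per-patch colors, combine them into the symmetric normalized Laplacian \( L_{sym} \), and cluster the rows of the smallest-eigenvalue eigenvector subset \( \hat{V} \).

```python
import numpy as np

def eicue(features, colors, k=6, n_clusters=4, seed=0):
    """Sketch of EiCue generation (hypothetical parameters, not the official
    code): features is (N, D) deep patch features, colors is (N, 3) per-patch
    mean colors; returns a per-patch cluster assignment."""
    # Semantic similarity A_seg: rectified cosine similarity of deep features.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    A_seg = np.clip(f @ f.T, 0, None)
    # Color affinity A_color: Gaussian kernel on color distance.
    d2 = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    A_color = np.exp(-d2 / (2 * 0.1 ** 2))
    # Combined affinity and symmetric normalized Laplacian L_sym.
    A = A_seg + A_color
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1) + 1e-8))
    L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    # Eigenvectors with the smallest eigenvalues carry the coarse structure.
    _, vecs = np.linalg.eigh(L_sym)
    V_hat = vecs[:, :k]
    # Simple k-means over the rows of V_hat yields the EiCue assignment.
    rng = np.random.default_rng(seed)
    centers = V_hat[rng.choice(len(V_hat), n_clusters, replace=False)]
    for _ in range(10):
        labels = np.argmin(((V_hat[:, None] - centers[None]) ** 2).sum(-1), 1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = V_hat[labels == c].mean(0)
    return labels
```

Clustering in the eigenbasis of \( L_{sym} \), rather than in raw feature space, is what lets the cue respect both semantic similarity and color boundaries at once.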

Visualization of Primary Elements


Eigenvectors

Visualizing eigenvectors derived from \( S \) in the Eigen Aggregation Module. These eigenvectors not only distinguish different objects but also identify semantically related areas, highlighting how effectively EiCue captures object semantics and boundaries.


EiCue

Comparison between K-means and EiCue. The bottom row presents EiCue, highlighting its superior ability to capture subtle structural intricacies and deeper semantic relationships, which K-means does not achieve as effectively.

Qualitative Results


COCO-Stuff

Coco label

Qualitative results on the COCO-Stuff dataset, trained with a ViT-S/8 backbone.


Cityscapes

Cityscapes label

Qualitative results on the Cityscapes dataset, trained with a ViT-B/8 backbone.


Potsdam-3

Potsdam supplementary visualization

Qualitative results on the Potsdam-3 dataset, trained with a ViT-B/8 backbone.

Quantitative Results


COCO-Stuff

cocostuff table

Quantitative results on the COCO-Stuff dataset.


Cityscapes

cityscapes table

Quantitative results on the Cityscapes dataset.


Potsdam-3

potsdam-3 table

Quantitative results on the Potsdam-3 dataset.

BibTeX

@InProceedings{2024eagle,
      author    = {Kim, Chanyoung and Han, Woojung and Ju, Dayun and Hwang, Seong Jae},
      title     = {EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month     = {June},
      year      = {2024}
}