Vision-Aware Text Features in Referring Image Segmentation:
From Object Understanding to Context Understanding

WACV 2025

1The Hong Kong University of Science and Technology 2University of Science, VNU-HCM, Ho Chi Minh City, Vietnam 3Viet Nam National University, Ho Chi Minh City, Vietnam 4Trinity College Dublin, Ireland 5Stony Brook University, New York, USA
*Equal contribution


Qualitative comparison between LAVT and Ours. Yellow boxes indicate incorrect segmentation results. Both object understanding and context understanding are required to tackle complex and ambiguous language expressions.


TL;DR: VATEX is a novel method for referring image segmentation that leverages vision-aware text features to improve text understanding. By decomposing language cues into object and context understanding, the model can better localize objects and interpret complex sentences, leading to significant performance gains.

Abstract

Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. The complexity of this task increases with the intricacy of the sentences provided.

Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. However, this under-utilization of text understanding limits the model's capability to fully comprehend the given expressions.

In this work, we propose a novel framework, inspired by human cognitive processes, that emphasizes object and context comprehension through Vision-Aware Text Features.

Object Understanding

We introduce a CLIP Prior module to localize the main object of interest and embed the object heatmap into the query initialization process.

Context Understanding

We propose a combination of two components: Contextual Multimodal Decoder (CMD) and Meaning Consistency Constraint (MCC), to further enhance the coherent and consistent interpretation of language cues.

✨ Results: Our method achieves significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref.

🚀 Code Released: Our code and pre-trained weights are available at https://github.com/nero1342/VATEX.

Method

Overall Framework


The overall framework of VATEX processes the input image and language expression through two concurrent pathways. Initially, the CLIP Prior module generates object queries, while traditional Visual and Text Encoders create multiscale visual feature maps and word-level text features. These visual and text features are passed into the Contextual Multimodal Decoder (CMD) to enable multimodal interactions, yielding vision-aware text features and text-enhanced visual features. We then harness the vision-aware text features to ensure semantic consistency across varied textual descriptions that refer to the same object by employing sentence-level contrastive learning, as described in the Meaning Consistency Constraint (MCC) section. In parallel, the text-enhanced visual features and the object queries generated by the CLIP Prior are refined through a Masked-attention Transformer Decoder to produce the final output segmentation masks.
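To make the data flow concrete, the sketch below traces the two pathways at the level of tensor shapes only. All sizes (feature width, number of queries, map resolution) and the final query-pixel dot-product mask readout are illustrative assumptions, not the exact VATEX implementation; the module internals are sketched in the component sections that follow.

```python
import torch

# Shape-level walkthrough of the two pathways (all sizes are illustrative).
B, HW, L, C, Q = 2, 64 * 64, 20, 256, 16   # batch, pixels, words, channels, queries

# Pathway 1: the CLIP Prior turns the image/expression pair into object queries.
object_queries = torch.randn(B, Q, C)

# Pathway 2: the Visual and Text Encoders produce multiscale visual maps
# (flattened to a single scale here for brevity) and word-level text features.
visual_feats = torch.randn(B, HW, C)
word_feats = torch.randn(B, L, C)

# The CMD exchanges information between the two modalities (see its sketch below),
# yielding text-enhanced visual features and vision-aware text features.
text_enhanced_visual = visual_feats        # [B, HW, C] after CMD
vision_aware_text = word_feats             # [B, L, C]  after CMD

# Training: MCC applies a sentence-level contrastive loss on vision_aware_text.
# Decoding: a Masked-attention Transformer Decoder refines object_queries against
# text_enhanced_visual; a query-pixel dot product then gives the mask logits.
mask_logits = torch.einsum("bqc,bnc->bqn", object_queries, text_enhanced_visual)
print(mask_logits.shape)                   # torch.Size([2, 16, 4096])
```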

CLIP Prior

The CLIP Prior module leverages the powerful visual-semantic alignment capabilities of CLIP to generate initial object queries. By utilizing CLIP's pre-trained knowledge, we can better localize the object mentioned in the referring expression before detailed segmentation, which is especially helpful for out-of-vocabulary objects.
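The sketch below illustrates one plausible way such a prior could seed the queries, assuming dense CLIP patch embeddings and a CLIP sentence embedding have already been extracted (obtaining patch tokens from CLIP's ViT requires a hook or a modified forward pass). The heatmap-weighted pooling, the query count, and the noise term are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clip_prior_queries(patch_feats, sent_feat, pixel_feats, num_queries=16):
    """Sketch of a CLIP-prior style query initialization (illustrative).

    patch_feats: [B, N, D]  dense CLIP image-patch embeddings (e.g. ViT tokens)
    sent_feat:   [B, D]     CLIP sentence embedding of the referring expression
    pixel_feats: [B, N, C]  visual-encoder features at the same resolution
    returns:     [B, num_queries, C] object queries biased toward the heatmap
    """
    # Cosine-similarity heatmap between every patch and the sentence.
    patch_feats = F.normalize(patch_feats, dim=-1)
    sent_feat = F.normalize(sent_feat, dim=-1)
    heatmap = (patch_feats @ sent_feat.unsqueeze(-1)).squeeze(-1)   # [B, N]
    heatmap = heatmap.softmax(dim=-1)

    # Pool visual features under the heatmap to get an object-centric vector,
    # then replicate it (with small noise) to seed the query set.
    pooled = torch.einsum("bn,bnc->bc", heatmap, pixel_feats)        # [B, C]
    queries = pooled.unsqueeze(1).expand(-1, num_queries, -1)
    return queries + 0.02 * torch.randn_like(queries)

# toy shapes
B, N, D, C = 2, 196, 512, 256
q = clip_prior_queries(torch.randn(B, N, D), torch.randn(B, D), torch.randn(B, N, C))
print(q.shape)   # torch.Size([2, 16, 256])
```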

Contextual Multimodal Decoder (CMD)

The CMD module enables multimodal interaction by fusing the multiscale visual features with the word-level text features, producing text-enhanced visual features and vision-aware text features that capture the semantic relationships between visual and textual cues.
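A minimal single-layer sketch of this bidirectional interaction is given below, using standard cross-attention. The layer structure, number of heads, and single-scale treatment are assumptions for illustration; the actual CMD operates over multiscale visual features.

```python
import torch
import torch.nn as nn

class ContextualMultimodalLayer(nn.Module):
    """Sketch of one CMD-style layer (illustrative, not the released code):
    text attends to vision to produce vision-aware text features, and vision
    attends back to the text to produce text-enhanced visual features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.text_to_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vision_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, visual_feats, text_feats):
        # visual_feats: [B, HW, C] flattened feature map; text_feats: [B, L, C]
        vat, _ = self.text_to_vision(text_feats, visual_feats, visual_feats)
        vision_aware_text = self.norm_t(text_feats + vat)

        tev, _ = self.vision_to_text(visual_feats, vision_aware_text, vision_aware_text)
        text_enhanced_visual = self.norm_v(visual_feats + tev)
        return text_enhanced_visual, vision_aware_text

# toy usage
layer = ContextualMultimodalLayer()
v, t = layer(torch.randn(2, 64 * 64, 256), torch.randn(2, 20, 256))
print(v.shape, t.shape)   # torch.Size([2, 4096, 256]) torch.Size([2, 20, 256])
```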

Meaning Consistency Constraint (MCC)

The MCC component ensures semantic consistency by employing contrastive learning at the sentence level. This helps the model understand that different textual descriptions referring to the same object should map to similar semantic representations, improving robustness and generalization.
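The sketch below shows one way such a sentence-level constraint can be written as an InfoNCE-style contrastive loss, treating expressions annotated for the same object as positives. The temperature, the pooling of word features into a single sentence embedding, and the handling of batches without positive pairs are illustrative assumptions rather than the exact VATEX loss.

```python
import torch
import torch.nn.functional as F

def meaning_consistency_loss(sent_feats, object_ids, temperature=0.07):
    """Sentence-level contrastive sketch (illustrative).

    sent_feats: [N, C] pooled vision-aware sentence embeddings in the batch
    object_ids: [N]    id of the referred object; expressions sharing an id
                       are treated as positives, all other pairs as negatives
    """
    z = F.normalize(sent_feats, dim=-1)
    logits = z @ z.t() / temperature                        # [N, N] similarities
    logits.fill_diagonal_(float("-inf"))                    # exclude self-pairs

    pos_mask = object_ids.unsqueeze(0) == object_ids.unsqueeze(1)
    pos_mask.fill_diagonal_(False)

    if not pos_mask.any():                                  # no positive pair in batch
        return logits.new_zeros(())

    log_prob = logits.log_softmax(dim=-1)
    return -log_prob[pos_mask].mean()                       # pull positives together

# toy usage: four expressions, the first two describe the same object
loss = meaning_consistency_loss(torch.randn(4, 256), torch.tensor([0, 0, 1, 2]))
print(loss.item())
```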

Main Results



As shown in the table, our method achieves remarkable performance improvements over state-of-the-art methods across all benchmarks in terms of mIoU. Notably, we surpass recent methods such as CGFormer and VG-LAW by substantial margins: +1.23% and +3.11% on RefCOCO, +1.46% and +3.31% on RefCOCO+, and +2.16% and +4.37% on the G-Ref validation splits, respectively. The more complex the expressions, the greater the performance gains achieved by VATEX. Even compared to LISA, a large pre-trained vision-language model, VATEX consistently achieves 3-5% higher performance across all datasets.

Visualizations


Our visualizations demonstrate VATEX's superior performance in complex visual reasoning scenarios. The model excels at:
• Distinguishing Similar Objects: Successfully differentiates between multiple similar instances (e.g., 1st column: identifying the seated person vs. standing person on a tennis court).
• Fine-grained Recognition: Precisely locates specific items among visually similar alternatives (e.g., 2nd column: distinguishing a particular sushi plate among various food dishes).
• Detailed Segmentation: Captures intricate object details beyond ground truth annotations (e.g., 4th column: complete umbrella shaft segmentation).
• Accurate Object Relationships: Better understands complex interactions (e.g., 6th & 7th columns: correctly segmenting the referred bear in the "bear child is hugging" and "bear with 59 tag" scenarios).
Our instance-based architecture produces smoother, more complete masks than the pixel-based approaches of LAVT and CRIS.