Seeing the Abstract: Translating the Abstract Language for Vision Language Models

1Fondazione Bruno Kessler (FBK), 2University of Verona, 3Polytechnic of Turin
*Equal Contribution
🎉 Accepted @ CVPR 2025 🎉
Fashion-ACT teaser image.

Fashion-ACT shifts the representation of the abstract-oriented language found in fashion towards the concrete-oriented one in the latent space of existing VLMs, improving downstream task performance.

Abstract

Natural language goes beyond dryly describing visual content. It contains rich abstract concepts that express feelings, creativity, and properties that cannot be directly perceived. Yet, current research on Vision Language Models (VLMs) has not shed light on abstract-oriented language.

Our research breaks new ground by uncovering, through extensive analysis, its wide presence and underestimated value. In particular, we focus our investigation on the fashion domain, a highly representative field for abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling concrete ones, carrying novel information, and proving useful for the retrieval task. However, a critical challenge emerges: current general-purpose and fashion-specific VLMs are pre-trained on databases whose text corpora lack sufficient abstract words, hindering their ability to effectively represent abstract-oriented language.

We propose a training-free and model-agnostic method, the Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms fine-tuned VLMs in both same- and cross-dataset settings, demonstrating its effectiveness and strong generalization capability. Moreover, the improvement introduced by ACT is consistent across various VLMs, making it a plug-and-play solution.
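To give a feel for the general idea, below is a minimal Python sketch of translating an abstract-oriented text embedding towards concrete ones in a CLIP-style latent space. This is an illustrative assumption, not the actual ACT formulation from the paper: the retrieval-and-mixing rule, the parameters k and alpha, and the random stand-in embeddings are all hypothetical placeholders.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize embeddings, as is standard in CLIP-style latent spaces.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def translate_abstract_to_concrete(query_emb, concrete_bank, k=5, alpha=0.5):
    """Shift an abstract-oriented text embedding towards concrete ones.

    NOTE: illustrative sketch only -- the real ACT translation is defined in
    the paper; k, alpha, and the convex-combination rule are assumptions.
    """
    query_emb = l2_normalize(query_emb)
    concrete_bank = l2_normalize(concrete_bank)

    # Retrieve the k concrete embeddings most similar to the abstract query.
    sims = concrete_bank @ query_emb              # cosine similarity (unit vectors)
    topk = np.argsort(-sims)[:k]
    concrete_anchor = l2_normalize(concrete_bank[topk].mean(axis=0))

    # Move the query towards the concrete anchor and re-normalize.
    return l2_normalize((1 - alpha) * query_emb + alpha * concrete_anchor)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 512                                       # e.g., CLIP ViT-B/32 text dim
    concrete_bank = rng.normal(size=(1000, dim))    # stand-in for concrete caption embeddings
    abstract_query = rng.normal(size=dim)           # stand-in for an abstract-oriented query
    shifted = translate_abstract_to_concrete(abstract_query, concrete_bank)
    print(shifted.shape)                            # (512,)

In a retrieval setting, the shifted query embedding would then be scored against image embeddings exactly as the original text embedding would be, which is what makes such a translation training-free and usable with any pre-trained VLM.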

Qualitatives

Qualitative examples of retrieval using our ACT on the test split of DeepFashion.

Fashion-ACT qualitatives.

Acknowledgment

This study was supported by LoCa AI, funded by Fondazione CariVerona (Bando Ricerca e Sviluppo 2022/23), PNRR FAIR - Future AI Research (PE00000013) and Italiadomani (PNRR, M4C2, Investimento 3.3), funded by NextGeneration EU. We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum5, hosted by BSC, Spain. Finally, we acknowledge HUMATICS, a SYS-DAT Group company, for their valuable contribution to this research.

BibTeX

@inproceedings{talon2025seeing,
      title={Seeing the Abstract: Translating the Abstract Language for Vision Language Models},
      author={Talon, Davide and Girella, Federico and Liu, Ziyue and Cristani, Marco and Wang, Yiming},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2025}
}