Visual abductive reasoning strives to infer the most plausible hypothesis that explains the observed visual context, and has garnered considerable attention in the academic community. However, recent efforts are inherently limited by their exclusive reliance on visual information, overlooking invaluable commonsense knowledge and the semantic/causal relationships among concepts, which leads to inaccurate abductive reasoning outcomes. To tackle this issue, we propose a simple but powerful KNowledge-guided Vision-and-Language Model (KN-VLM), which primarily consists of a visual reasoning branch and a knowledge reasoning branch. The visual reasoning branch employs a powerful visual embedding model followed by a visual Q-Former to capture visual features. The knowledge reasoning branch acquires two complementary types of knowledge: commonsense knowledge and complemented knowledge. The former extracts the intricate and detailed conceptual knowledge embedded in the observed video, deepening the model's comprehension of the presented video content. The latter leverages an external knowledge base to further augment the understanding of the interconnections and causal relationships among these concepts, thereby strengthening the model's abductive reasoning capability. Finally, the two branches are effectively fused to complete the abductive reasoning task, generating descriptions for both the observed events and the explanation events. Experimental results on the VAR and CookReasoning datasets show that our model achieves promising performance.
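
To make the two-branch design above concrete, the following is a minimal, illustrative sketch in PyTorch-style Python of how a visual branch (query tokens attending over visual embeddings, standing in for the visual Q-Former) and a knowledge branch (projections of commonsense and complemented-knowledge features) could be fused. The class and module names, dimensions, and fusion choice are hypothetical assumptions for illustration, not the actual KN-VLM implementation.

```python
import torch
import torch.nn as nn

class KNVLMSketch(nn.Module):
    """Sketch of a two-branch model: visual reasoning + knowledge reasoning,
    fused before a (not shown) language decoder. All choices are illustrative."""

    def __init__(self, dim=768, num_query_tokens=32):
        super().__init__()
        # Visual branch: learnable query tokens attend over pre-extracted
        # visual embeddings (a stand-in for the visual Q-Former).
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, dim))
        self.visual_qformer = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Knowledge branch: commonsense concepts extracted from the video and
        # complemented knowledge retrieved from an external knowledge base,
        # both assumed to be pre-encoded as feature vectors.
        self.commonsense_proj = nn.Linear(dim, dim)
        self.complemented_proj = nn.Linear(dim, dim)
        # Fusion: cross-attention from visual queries to knowledge tokens.
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_feats, commonsense_feats, complemented_feats):
        # visual_feats: (B, N, dim) from a visual embedding model
        B = visual_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(B, -1, -1)
        vis, _ = self.visual_qformer(queries, visual_feats, visual_feats)
        # Concatenate the two complementary knowledge sources.
        knowledge = torch.cat(
            [self.commonsense_proj(commonsense_feats),
             self.complemented_proj(complemented_feats)], dim=1)
        # Fused query features would condition a language decoder that
        # generates descriptions of the observed and explanation events.
        fused, _ = self.fusion(vis, knowledge, knowledge)
        return fused
```

For example, `KNVLMSketch()(torch.randn(2, 100, 768), torch.randn(2, 10, 768), torch.randn(2, 10, 768))` returns a `(2, 32, 768)` tensor of fused query features that a downstream text decoder could attend to.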