Visual abductive reasoning strives to infer the most plausible hypothesis that explains the observed visual context, and has garnered considerable attention in the academic community. However, recent efforts are inherently limited by their exclusive reliance on visual information, overlooking invaluable commonsense knowledge and the semantic/causal relationships among concepts, which leads to inaccurate abductive reasoning outcomes. To tackle this issue, we propose a simple but powerful KNowledge-guided Vision-and-Language Model (KN-VLM), which primarily consists of a visual reasoning branch and a knowledge reasoning branch. The visual reasoning branch employs a powerful visual embedding model followed by a visual Q-Former to capture visual features. The knowledge reasoning branch acquires two complementary types of knowledge: commonsense knowledge and complemented knowledge. The former extracts the intricate and detailed conceptual knowledge embedded in the observed video, deepening the model's comprehension of the presented video content. The latter leverages an external knowledge base to further augment the understanding of the interconnections and causal relationships among these concepts, thereby strengthening the model's abductive reasoning capability. Finally, the two branches are effectively fused to complete the abductive reasoning task, generating descriptions for both the observed events and the explanation events. Experimental results on the VAR and CookReasoning datasets show that our model achieves promising performance.
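
To make the two-branch design above concrete, the following is a minimal, illustrative sketch in PyTorch-style Python of how a visual branch (query tokens attending over visual embeddings, standing in for the visual Q-Former) and a knowledge branch (projections of commonsense and complemented-knowledge features) could be fused. The class and module names, dimensions, and fusion choice are hypothetical assumptions for illustration, not the actual KN-VLM implementation.

```python
import torch
import torch.nn as nn

class KNVLMSketch(nn.Module):
    """Sketch of a two-branch model: visual reasoning + knowledge reasoning,
    fused before a (not shown) language decoder. All choices are illustrative."""

    def __init__(self, dim=768, num_query_tokens=32):
        super().__init__()
        # Visual branch: learnable query tokens attend over pre-extracted
        # visual embeddings (a stand-in for the visual Q-Former).
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, dim))
        self.visual_qformer = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Knowledge branch: commonsense concepts extracted from the video and
        # complemented knowledge retrieved from an external knowledge base,
        # both assumed to be pre-encoded as feature vectors.
        self.commonsense_proj = nn.Linear(dim, dim)
        self.complemented_proj = nn.Linear(dim, dim)
        # Fusion: cross-attention from visual queries to knowledge tokens.
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_feats, commonsense_feats, complemented_feats):
        # visual_feats: (B, N, dim) from a visual embedding model
        B = visual_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(B, -1, -1)
        vis, _ = self.visual_qformer(queries, visual_feats, visual_feats)
        # Concatenate the two complementary knowledge sources.
        knowledge = torch.cat(
            [self.commonsense_proj(commonsense_feats),
             self.complemented_proj(complemented_feats)], dim=1)
        # Fused query features would condition a language decoder that
        # generates descriptions of the observed and explanation events.
        fused, _ = self.fusion(vis, knowledge, knowledge)
        return fused
```

For example, `KNVLMSketch()(torch.randn(2, 100, 768), torch.randn(2, 10, 768), torch.randn(2, 10, 768))` returns a `(2, 32, 768)` tensor of fused query features that a downstream text decoder could attend to.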