Previous research has demonstrated that referential communication tasks (RCTs) can be used to detect language deficits in people with Alzheimer’s Disease (AD). This study carried out a multi-modal vision-and-language analysis of data produced during RCTs. Using the CLIP model, we calculated the association between transcripts of image descriptions collected in RCTs and the images being described. Statistical analyses were conducted to examine differences between people with AD and cognitively healthy older adults. The CLIP association scores differed significantly between the two groups. Moreover, the scores varied significantly across experimental conditions in the cognitively healthy group, but not in the AD group. This is the first study to perform a multi-modal vision-and-language analysis of RCTs using CLIP. The study reveals vision-and-language association deficits in the communication of people with AD. Further research is needed to evaluate the potential of CLIP for automatic dementia screening using interactive image-based description tasks.
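
For concreteness, the following is a minimal sketch of how a transcript-image association score might be computed with CLIP. The checkpoint name, file path, transcript text, and the `clip_association` helper are illustrative assumptions; the paper does not specify its exact CLIP variant or preprocessing pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint: the paper does not state which CLIP variant was used.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_association(transcript: str, image_path: str) -> float:
    """Cosine similarity between a description transcript and the described image."""
    image = Image.open(image_path).convert("RGB")
    # Long transcripts are truncated to CLIP's 77-token text context.
    inputs = processor(text=[transcript], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # text_embeds and image_embeds are unit-normalized projections,
    # so their dot product is the cosine similarity.
    return float((out.text_embeds @ out.image_embeds.T).item())

# Hypothetical usage: score one participant's description against a stimulus image.
score = clip_association("a small dog sitting on a red chair", "stimulus_01.png")
print(f"CLIP association: {score:.3f}")
```

Per-transcript scores of this kind could then feed the group-level statistical comparisons described above.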