Recent advances in vision-enabled large language models have renewed interest in evaluating their capabilities and limitations when interpreting complex visual data. This study employs ImageNet-A, a dataset of natural adversarial examples curated to fool standard classifiers, to test the visual-processing robustness of three prominent models: GPT-4 Vision, Google Gemini 1.5, and Anthropic Claude 3. Quantitative analyses revealed notable disparities in misclassification rates and error types across the models, indicating varying ability to handle adversarial inputs. GPT-4 Vision demonstrated strong robustness, whereas Google Gemini 1.5 excelled in processing speed and efficiency. Anthropic Claude 3, while showing intermediate accuracy, displayed a marked propensity for contextual misinterpretation. Qualitative evaluations further assessed the relevance and plausibility of the models' visual hallucinations, uncovering persistent gaps between model and human understanding of ambiguous or complex scenes. The findings emphasize the need for further improvements in semantic accuracy and contextual understanding. Future directions include enhancing adversarial robustness, refining evaluation metrics to better capture the qualitative aspects of visual understanding, and fostering interdisciplinary collaboration to develop AI systems with more nuanced interpretive abilities. The study highlights both the progress made toward AI models that match human perceptual skills and the considerable challenges that remain.
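
As a concrete illustration of the quantitative protocol, the sketch below computes a per-model misclassification rate over ImageNet-A. This is a minimal sketch rather than the study's actual harness: `IMAGENET_A_DIR` and `query_model` are assumptions, standing in for a local copy of the dataset and for each model's vision API, respectively.

```python
# Minimal sketch of the misclassification-rate evaluation described above.
# Assumptions (not from the paper): ImageNet-A is unpacked locally under
# IMAGENET_A_DIR in its standard layout (one subdirectory per WordNet synset),
# and `query_model` is a hypothetical adapter for each model's vision API.
from pathlib import Path

IMAGENET_A_DIR = Path("imagenet-a")  # assumed local copy of the dataset


def query_model(image_path: Path) -> str:
    """Hypothetical adapter: send the image to a vision model and return the
    predicted synset ID. A real harness would implement this once per model
    (GPT-4 Vision, Gemini 1.5, Claude 3) against the vendor's API."""
    raise NotImplementedError("wire up the model-specific API call here")


def misclassification_rate(root: Path) -> float:
    """ImageNet-A stores each image inside a directory named after its
    ground-truth synset, so a prediction is wrong when it differs from the
    parent directory's name."""
    total = errors = 0
    for image_path in root.glob("*/*.jpg"):
        true_synset = image_path.parent.name
        errors += query_model(image_path) != true_synset
        total += 1
    return errors / total if total else float("nan")
```

In this framing, comparing models reduces to swapping the `query_model` implementation and rerunning the same loop, which keeps the per-model error rates directly comparable.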