A Comprehensive Survey of Multimodal Large Language Models: Concept, Application and Safety

doi:10.21203/rs.3.rs-5270567/v1

Research Article

This work is licensed under a CC BY 4.0 License

Recent advances in multimodal large language models (MLLMs), exemplified by GPT-4o, have made them a significant focus of the research community. MLLMs leverage the general capabilities of large language models (LLMs) to handle tasks across multiple modalities, including text, image, audio, and video. With their ability to understand and generate content, such as composing narratives from visual inputs, MLLMs are attracting substantial interest from both academia and industry. However, the rapid proliferation of MLLM algorithms and techniques has given rise to new architectures, applications, and safety issues. This comprehensive survey aims to document and analyze the latest advances in MLLMs. First, we introduce the fundamental concepts of MLLMs, including the development history of multimodal algorithms, the architecture of MLLMs, and their evaluation and benchmarks. We then explore advanced techniques in MLLMs, such as Multimodal In-Context Learning, Multimodal Chain of Thought, and LLM-aided Visual Reasoning. Following this, we examine the safety aspects of MLLMs, focusing on security issues, potential attacks, and model safety assessments. Finally, we discuss the current challenges and identify potential areas for future research.

Keywords: Multimodal large language models, language models, large models, modalities, survey

No competing interests reported.

Reviewers invited by journal: 05 Nov, 2024
Editor assigned by journal: 22 Oct, 2024
Submission checks completed at journal: 16 Oct, 2024
First submitted to journal: 15 Oct, 2024

Version 1
