With the advent of large language models (LLMs) such as GPT-4, LLaMA, and ChatGLM, leveraging multimodal information (e.g., images and audio) to enhance recommender systems has become feasible. To further improve the performance of LLM-based recommendation, we propose MLLM4Rec, a sequential recommendation framework grounded in LLMs. Specifically, our approach integrates multimodal information, with a focus on image data, into LLMs to improve recommendation accuracy. By combining a hybrid prompt learning mechanism with role-playing during model fine-tuning, MLLM4Rec bridges the gap between textual and visual representations, enabling a text-based LLM to "read" and interpret images. Moreover, the fine-tuned LLM is used to rank retrieved candidates, preserving its generative capabilities while optimizing item ranking according to user preferences. Extensive experiments on three publicly available benchmark datasets demonstrate that MLLM4Rec outperforms traditional sequential recommendation models and pre-trained multimodal models in terms of NDCG, MRR, and Recall.
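As an informal illustration of the prompting idea summarized above (not the paper's exact template), the sketch below assembles a role-playing prompt that combines a user's interaction history, image-derived captions standing in for visual content, and a retrieved candidate list for the LLM to rank. All names here (`Item`, `build_ranking_prompt`, the prompt wording) are hypothetical assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Item:
    title: str          # item's textual title
    image_caption: str  # caption generated from the item image (proxy for visual input)

def build_ranking_prompt(history: List[Item], candidates: List[Item]) -> str:
    """Assemble a hybrid (text + image-caption) role-playing prompt for candidate ranking."""
    lines = [
        # Role-playing instruction: the LLM acts as a recommender.
        "You are a recommendation assistant. Based on the user's interaction history,",
        "rank the candidate items from most to least likely to be interacted with next.",
        "",
        "User history (title | image description):",
    ]
    for it in history:
        lines.append(f"- {it.title} | {it.image_caption}")
    lines.append("")
    lines.append("Candidates to rank:")
    for idx, it in enumerate(candidates, start=1):
        lines.append(f"{idx}. {it.title} | {it.image_caption}")
    lines.append("")
    lines.append("Answer with the candidate numbers in ranked order.")
    return "\n".join(lines)

if __name__ == "__main__":
    history = [
        Item("Red running shoes", "a pair of red mesh sneakers"),
        Item("Sports socks", "white ankle socks on a plain background"),
    ]
    candidates = [
        Item("Yoga mat", "a rolled purple yoga mat"),
        Item("Blue running shorts", "lightweight blue athletic shorts"),
    ]
    print(build_ranking_prompt(history, candidates))
```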