Entity alignment is a critical technique for integrating heterogeneous knowledge graphs. Although existing methods have achieved impressive results on traditional entity alignment, they often struggle with the interactions and dependencies that arise among modalities in multi-modal knowledge. In this paper, we propose ERMF, a novel multi-modal entity alignment model that leverages the distinct modal characteristics of entities to identify equivalent entities across different multi-modal knowledge graphs. Specifically, we first employ modality-specific encoders to extract features from each modality independently. Concurrently, we design a vision-guided negative sample generation strategy based on contrastive learning, which combines visual features with random sampling to guide the model in learning relation embeddings. Subsequently, in the feature fusion stage, we propose a multi-layer feature fusion approach that applies multiple attention mechanisms to hierarchically model the importance weights of, and interactions among, the modalities, thereby obtaining multi-granularity multi-modal features. Extensive experiments on two public datasets show that ERMF significantly outperforms competitive baseline models, confirming its effectiveness.
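To make the vision-guided negative sampling concrete, the following is a minimal PyTorch sketch of one plausible realization, assuming entities carry pre-extracted visual embeddings: for each aligned anchor, entities whose visual embeddings are most similar serve as hard negatives, complemented by randomly sampled ones, and an InfoNCE-style contrastive objective scores the true counterpart against them. All names and hyperparameters here (`vision_guided_negatives`, `contrastive_loss`, `entity_emb`, `visual_emb`, `tau`) are hypothetical illustrations of the strategy described above, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def vision_guided_negatives(anchor_idx, visual_emb, num_hard=4, num_random=4):
    """Pick negatives for one anchor: the visually most similar entities
    (hard negatives) plus uniformly random ones."""
    sim = F.cosine_similarity(visual_emb[anchor_idx].unsqueeze(0), visual_emb, dim=-1)
    sim[anchor_idx] = float("-inf")                # exclude the anchor itself
    hard = sim.topk(num_hard).indices              # visually similar -> hard negatives
    rand = torch.randint(0, visual_emb.size(0), (num_random,))
    return torch.cat([hard, rand])

def contrastive_loss(entity_emb, visual_emb, pairs, tau=0.1):
    """InfoNCE-style loss over aligned pairs (anchor, positive); the positive
    sits at index 0 of the candidate list, negatives fill the rest."""
    losses = []
    for a, p in pairs:
        neg = vision_guided_negatives(a, visual_emb)
        anchor = F.normalize(entity_emb[a], dim=-1)
        cands = F.normalize(entity_emb[torch.cat([torch.tensor([p]), neg])], dim=-1)
        logits = cands @ anchor / tau              # similarity of positive + negatives
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

# Toy usage: 100 entities, 64-d entity embeddings, 32-d visual embeddings.
entity_emb = torch.randn(100, 64, requires_grad=True)
visual_emb = torch.randn(100, 32)
loss = contrastive_loss(entity_emb, visual_emb, pairs=[(0, 1), (2, 3)])
loss.backward()
```

Similarly, the multi-layer attention fusion can be sketched as stacked attention over a set of per-modality embeddings: each layer re-weights the modalities by learned importance scores, and the per-layer fused outputs are concatenated into a multi-granularity representation. Again, `MultiLayerFusion` and its internals are an assumed illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Hierarchical fusion: each layer scores modality importance with
    attention, fuses a weighted sum, then residually updates the inputs;
    per-layer outputs are concatenated as multi-granularity features."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.score = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_layers))
        self.mix = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, feats):                      # feats: (batch, modalities, dim)
        outputs = []
        for score, mix in zip(self.score, self.mix):
            w = torch.softmax(score(feats), dim=1)  # per-modality importance weights
            outputs.append((w * feats).sum(dim=1))  # attention-weighted fusion
            feats = torch.relu(mix(feats)) + feats  # residual cross-modal update
        return torch.cat(outputs, dim=-1)           # (batch, num_layers * dim)

fusion = MultiLayerFusion(dim=64)
out = fusion(torch.randn(8, 3, 64))                # 3 modalities -> (8, 128)
```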