This paper presents the Mutual Feature Fusion Network (MFFNet), a novel cross-modal object detection framework that fuses infrared and visible-light images to enhance detection accuracy. MFFNet employs a dual-stream backbone based on YOLOv5 to extract features from the two modalities independently. The proposed Interassisted Fusion Block (IFB) is embedded in the network's intermediate layers, where it enables the two modalities to assist each other and fuses their complementary features. To address uneven sample difficulty, we introduce a Generalized Efficient Intersection over Union (EIOU) loss, which adaptively reweights anchor boxes to prioritize high-quality ones. Extensive experiments on two public datasets, M3FD and LLVIP, demonstrate that MFFNet achieves state-of-the-art detection accuracy and efficiency. By effectively exploiting the complementary strengths of the infrared and visible-light modalities, MFFNet significantly improves detection performance, especially for small objects and under challenging lighting conditions. The code for this study is available on GitHub.
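
To make the quality-aware reweighting concrete, below is a minimal PyTorch sketch of an EIoU-style bounding-box loss with a focal-style weight that emphasizes high-IoU anchors; it follows the standard EIoU/Focal-EIoU formulation, and the function name `eiou_loss`, the exponent `gamma`, and the (x1, y1, x2, y2) box format are illustrative assumptions, since the paper's exact Generalized EIOU definition is not reproduced in this abstract.

```python
# Sketch of an EIoU-style loss with focal weighting; the paper's exact
# "Generalized EIOU" may differ. Boxes are (x1, y1, x2, y2) tensors of shape (N, 4).
import torch

def eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    # Intersection area
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)

    # IoU from intersection over union
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box, used to normalize the penalty terms
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Normalized squared distance between box centers
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    dist = ((pcx - tcx) ** 2 + (pcy - tcy) ** 2) / (cw ** 2 + ch ** 2 + eps)

    # Width and height consistency penalties
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    wh = dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)

    # Focal-style weight: high-IoU (high-quality) anchors dominate the loss,
    # while poorly matched anchors are down-weighted
    return (iou.detach() ** gamma * (1 - iou + dist + wh)).mean()

# Example: the second pair has zero overlap, so its focal weight is zero
# and the first (high-quality) pair drives the gradient.
pred = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.]])
gt = torch.tensor([[1., 1., 11., 11.], [20., 20., 30., 30.]])
print(eiou_loss(pred, gt))
```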