Existing end-to-end multi-focus image fusion networks fuse two images effectively, but when applied to image stacks they often introduce various forms of image degradation due to error accumulation during iterative pairwise fusion. To address this limitation, we propose a novel approach that directly fuses an entire image stack using a specially designed 3D convolutional neural network. The proposed method leverages an innovative training pipeline based on monocular depth estimation to generate a large-scale dataset, ensuring robust performance across diverse scenarios. Furthermore, to facilitate comprehensive evaluation and comparison, we establish a benchmark for multi-focus image stack fusion and release a toolbox encompassing 12 distinct algorithms. Extensive experimental results demonstrate that our proposed method effectively fuses multi-focus image stacks while mitigating image degradation, achieving state-of-the-art performance in both fusion quality and processing speed. The code is available at https://github.com/Xinzhe99/StackMFF.