Accurate bearing fault diagnosis typically requires accelerometers to be installed in multiple directions simultaneously. However, environmental constraints sometimes make such multi-directional installation impractical. Acoustic sensors, in contrast, avoid the limitations of contact-based measurement but are more susceptible to interference from environmental noise. To address these issues, a novel fault diagnosis method for rolling bearings that integrates acoustic and vibration signals is proposed. First, a 2D convolutional fusion layer processes the two types of signals to achieve an initial fusion. Second, to effectively extract sound-vibration fusion features, a multi-scale CNN-GRU module is introduced to enhance the method's ability to capture features at different scales. Finally, a transfer learning strategy based on model pre-training is adopted, and the method achieves an average accuracy exceeding 90% in experiments.
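
The fusion idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the acoustic and vibration signals are equal-length, time-aligned sequences stacked into a 2 x N array, and that a 2 x k kernel slides along the time axis so that each output sample mixes both modalities. The function names `fuse_2d` and `multi_scale_fuse`, the kernel widths, and the fixed averaging weights are hypothetical; in a trained network the kernel weights would be learned, and the GRU stage is omitted here.

```python
from typing import Dict, List, Sequence


def fuse_2d(acoustic: Sequence[float],
            vibration: Sequence[float],
            kernel: Sequence[Sequence[float]]) -> List[float]:
    """Valid 2D cross-correlation of a 2 x N stacked input with a 2 x k kernel.

    Hypothetical helper: row 0 of the kernel weights the acoustic signal,
    row 1 weights the vibration signal.
    """
    if len(acoustic) != len(vibration):
        raise ValueError("signals must have the same length")
    if len(kernel) != 2 or len(kernel[0]) != len(kernel[1]):
        raise ValueError("kernel must have shape 2 x k")
    k = len(kernel[0])
    n = len(acoustic)
    if k < 1 or k > n:
        raise ValueError("kernel width must be between 1 and the signal length")

    stacked = [acoustic, vibration]
    fused = []
    for start in range(n - k + 1):
        total = 0.0
        for row in range(2):          # both modalities contribute
            for col in range(k):      # window along the time axis
                total += kernel[row][col] * stacked[row][start + col]
        fused.append(total)
    return fused


def multi_scale_fuse(acoustic: Sequence[float],
                     vibration: Sequence[float],
                     widths: Sequence[int] = (1, 3, 5)) -> Dict[int, List[float]]:
    """Apply one averaging kernel per width (assumed, untrained weights)."""
    features = {}
    for k in widths:
        weight = 1.0 / (2 * k)        # uniform average over the 2 x k window
        kernel = [[weight] * k, [weight] * k]
        features[k] = fuse_2d(acoustic, vibration, kernel)
    return features
```

Wider kernels summarize longer time windows, which is the sense in which several widths give features at different scales. In the pipeline described above, such multi-scale feature sequences would then be passed to a recurrent (GRU) stage.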