Predicting surface roughness in milling is important because roughness directly affects product quality, tool life, and manufacturing efficiency. Existing prediction methods, however, are limited in feature extraction and prediction accuracy. This study proposes a surface roughness prediction method based on a hybrid neural network that takes a time-frequency image and a feature vector as multiple inputs and combines convolution with a multi-head self-attention (MHA) mechanism. First, the input signals are denoised using Variational Mode Decomposition (VMD), so that cleaner signal features can be extracted. Next, the Continuous Wavelet Transform (CWT) is applied to generate time-frequency maps of the signals, which provide an information-rich input for the Convolutional Neural Network (CNN). The MHA mechanism is then incorporated to improve the model's capture of global signal characteristics. The resulting multi-input hybrid model, which combines the CNN with MHA, learns the complex relationships between features and achieves high-precision prediction of surface roughness. The results show that the proposed method significantly outperforms single-input models in predictive accuracy, with a root mean square error (RMSE) of 0.0349 and a maximum absolute error (MaxAE) of 0.0683.
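For readers less familiar with the two reported error metrics, the following is a minimal sketch of how RMSE and MaxAE are conventionally computed from measured and predicted roughness values. It is not the authors' evaluation code: the function names `rmse` and `max_abs_error` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def max_abs_error(y_true, y_pred):
    """Largest absolute deviation between prediction and measurement."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical values for illustration only, not data from the study.
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]

print("RMSE: ", rmse(measured, predicted))
print("MaxAE:", max_abs_error(measured, predicted))
```

RMSE summarizes the typical error over all test samples, while MaxAE reports the single worst prediction, so the two together describe both average and worst-case behavior.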