Traffic flow prediction plays a crucial role in improving traffic management efficiency and optimizing the allocation of traffic resources. However, existing traffic flow prediction models often struggle to capture complex spatial–temporal dependencies accurately. To this end, this paper proposes a spatial–temporal fusion Transformer-based graph convolutional network (STFFormer-GCN) for traffic flow prediction, which addresses three issues: long- and short-term dynamic spatial–temporal relationships, traffic propagation delays, and spatial–temporal feature fusion. First, we propose a multi-scale temporal convolution module that employs recursion and sliding windows to model long- and short-term temporal dependencies. Second, we propose a new dynamic relational convolution matrix for learning dynamic spatial dependencies that change with time and road structure. Third, we propose a time-delay cycle module that combines the current traffic flow with spatially propagated time-delay information to model how traffic flow propagates through space. Finally, we employ a Transformer encoder to fuse the dynamic spatial–temporal features with the time-delayed spatial–temporal features. Extensive experiments on four real-world datasets show that STFFormer-GCN achieves state-of-the-art performance; in particular, the MAPE on the PeMS08 dataset is improved by 4.05%. In addition, we conduct an ablation study and a single-step prediction performance study to evaluate the contribution of each component to the prediction accuracy of the model.
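For readers unfamiliar with the reported metric, the following is a minimal sketch of the standard mean absolute percentage error (MAPE). It is not taken from the paper's code: the function name `mape`, the `eps` threshold, and the choice to skip near-zero observations (a common convention on traffic benchmarks) are assumptions made for illustration.

```python
def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error, returned in percent.

    Assumption: observations with |y_true| <= eps are skipped to avoid
    division by zero; the paper may handle such values differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Relative absolute error for every usable (non-zero) observation.
    terms = [abs(t - p) / abs(t)
             for t, p in zip(y_true, y_pred)
             if abs(t) > eps]
    if not terms:
        raise ValueError("no non-zero observations to evaluate")
    return 100.0 * sum(terms) / len(terms)


# Hypothetical flow readings (vehicles per interval), not data from the paper.
observed = [100, 200, 50, 0]
predicted = [110, 180, 55, 3]
print(round(mape(observed, predicted), 2))  # prints 10.0
```

Each of the three usable readings is off by 10% of its observed value, so the sketch reports a MAPE of 10%; the zero reading is excluded under the masking assumption stated above.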