In recent years, advances in deep learning have driven significant progress in semantic segmentation. However, existing models still struggle to exploit spatial information and to capture multi-level, multi-scale context. To address these issues, this paper proposes SCFI-ESeg, a novel semantic segmentation model built on the Segmenter framework that significantly improves segmentation accuracy by integrating spatial information with content features. SCFI-ESeg introduces a Spatial Feature Enhancement Module (SFEM), a Multi-Stage Attention Module (MStA), and Dense Continuous Atrous Spatial Pyramid Pooling (DCASPP), which together strengthen the model's ability to represent spatial and semantic information. The SFEM leverages the encoder's query features to enhance spatial information, improving the model's perception of image details. The MStA strengthens the interaction between high-level and low-level features through multi-stage fusion, improving the integration of features across levels. The DCASPP extracts features under varying receptive fields and merges them with pooling results, improving the network's understanding of multi-scale information. Experimental results show that SCFI-ESeg performs strongly on the public ADE20K and Pascal Context datasets, particularly in complex scenes. Across its variants, the model achieves an average improvement of 1.6% over the baseline on ADE20K, and a 2.5% improvement with the ViT-tiny configuration. Moreover, the model maintains a low computational cost and parameter count, underscoring its practicality.
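To make the DCASPP idea above concrete, the following is a minimal PyTorch sketch of an ASPP-style block in which atrous branches with increasing dilation rates are densely cascaded and then merged with a global-pooling branch. The dilation rates, channel widths, and the dense cascading pattern are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseASPPSketch(nn.Module):
    """Illustrative dense atrous spatial pyramid pooling block (assumed design).

    Each atrous branch sees the input concatenated with all previous branch
    outputs (dense connectivity); a global-average-pooling branch adds
    image-level context before a 1x1 projection fuses everything.
    """

    def __init__(self, in_ch=256, branch_ch=64, dilations=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ))
            ch += branch_ch  # later branches also see this branch's output
        # Image-level context via global average pooling
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.ReLU(inplace=True),
        )
        self.project = nn.Sequential(
            nn.Conv2d(ch + branch_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        pooled = F.interpolate(self.pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))


# Example: fuse multi-scale context on a 256-channel encoder feature map
if __name__ == "__main__":
    x = torch.randn(1, 256, 32, 32)
    print(DenseASPPSketch()(x).shape)  # -> torch.Size([1, 256, 32, 32])
```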