Text detection is a vital step towards the intelligent management of historical archives. With the development of convolutional neural networks (CNNs), numerous effective methods for archive text detection have been proposed. However, text detection networks and publicly available data for Mongolian historical archives remain limited. Moreover, historical archives often feature blurred text and complicated writing backgrounds, which pose significant challenges for text detection. To address these issues, this paper introduces ESFCENet, a text detection network designed specifically for Mongolian historical archives that integrates attention mechanisms with the Swin Transformer. Specifically, we propose an improved Swin Transformer (SWT) module that optimizes multi-scale archive text detection by fusing global text feature information, enhancing text feature extraction in the presence of blurred text and complex backgrounds. In addition, we introduce an Efficient Channel and Self-Attention Integration Module (ECASI), which suppresses redundant background information while enhancing the extraction of foreground features. Finally, to facilitate research in this field, we have compiled a Mongolian dataset. Experimental results demonstrate that the proposed network achieves accurate text detection on Mongolian historical archives, with an F-measure of 99.0%. It also shows promising results on several related benchmarks, with F-measures of 87.9% on ICDAR 2015, 85.7% on CTW1500, and 86.6% on Total-Text.
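
For reference, the F-measure reported above is conventionally the harmonic mean of detection precision and recall. The following is a minimal sketch assuming that standard definition; the function name `f_measure` and the sample values are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score).

    Returns 0.0 when both inputs are 0 to avoid division by zero.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    total = precision + recall
    if total == 0.0:
        return 0.0
    return 2.0 * precision * recall / total


# Hypothetical precision/recall values, NOT results from the paper.
score = f_measure(0.90, 0.80)
print(f"F-measure: {score:.3f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a high F-measure requires both precision and recall to be high.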