Given a query text and a database of images to be searched, a scene text retrieval system returns the images that contain the query word, together with the location of that word within each image. The current state-of-the-art method matches the query against text instances in a shared visual embedding space. This approach effectively mitigates the heterogeneity gap between the two modalities, but it disrupts the visual structure of the query text image. In this paper, we improve upon this method. First, we render the entire query text directly into an image, preserving as much of the original text image's visual information as possible. Then, we devise a Coarse-Grained Feature Rectification (CGFR) module that facilitates visual alignment. Finally, we propose an Adaptive Edit Distance (AED) and improve the main loss function accordingly. With these improvements, our method achieves state-of-the-art performance on multiple benchmark datasets. In particular, the method is well suited to multilingual retrieval, improving the mAP score by 16.98% over the current state-of-the-art method.
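To make the first step concrete, the sketch below shows one way the full query text could be rendered into a single image before encoding, so that its visual structure is kept intact. The font choice, canvas size, and padding are illustrative assumptions rather than the paper's exact settings.

```python
from PIL import Image, ImageDraw, ImageFont


def render_query_image(query: str, height: int = 32, pad: int = 4) -> Image.Image:
    """Render a query string onto a white canvas as a grayscale image."""
    font = ImageFont.load_default()  # assumption: any legible font suffices for the sketch
    # Measure the rendered text to size the canvas.
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    left, top, right, bottom = probe.textbbox((0, 0), query, font=font)
    width = (right - left) + 2 * pad
    img = Image.new("L", (width, height), color=255)
    draw = ImageDraw.Draw(img)
    # Center the text vertically; the downstream encoder then sees the whole word at once.
    y = (height - (bottom - top)) // 2 - top
    draw.text((pad - left, y), query, fill=0, font=font)
    return img


if __name__ == "__main__":
    render_query_image("retrieval").save("query_retrieval.png")
```

Rendering the word as one image, rather than composing per-character glyph features, is what lets the query and the detected text instances be compared in the same visual space.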