In neural decoding research, reconstructing natural images from fMRI signals poses a captivating yet challenging problem. Conventional approaches use basic linear mapping functions to project fMRI signals into a prior latent space (e.g., image and text embeddings) and subsequently use a pre-trained image generation model to create images conditioned on the embedding. While effective at capturing low-level features such as layout, texture, and shape, these methods often fail to capture high-level features such as object categories, spatial positions, and the number of objects. These failures are largely due to the insufficient semantic information carried by the text embeddings. To address these challenges, we focus on improving text conditioning for image generation and enhancing the informativeness of the text embeddings used in the process. We propose the MDIR framework, which employs a unified deep neural network called MindMapper to align fMRI signals with text embeddings more effectively. Our approach enhances the consistency between reconstructed images and the semantic information in text captions, including object categories and their relationships. This method achieves superior semantic fidelity, producing images that closely resemble the ground truth in fine-grained detail.
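To make the conventional baseline concrete, the following is a minimal sketch of the linear-mapping step the abstract describes: a closed-form ridge regression from fMRI voxel patterns to a text-embedding space, whose predictions would then condition a pre-trained image generator. All data shapes and values here are hypothetical placeholders, not the paper's actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_voxels, emb_dim = 200, 128, 16  # hypothetical sizes

fmri = rng.standard_normal((n_samples, n_voxels))     # voxel activations X
text_emb = rng.standard_normal((n_samples, emb_dim))  # target embeddings Y

# Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y
alpha = 1.0
W = np.linalg.solve(fmri.T @ fmri + alpha * np.eye(n_voxels),
                    fmri.T @ text_emb)

# Predicted embeddings; in the conventional pipeline these would be fed
# to a pre-trained image generation model as the conditioning signal.
pred_emb = fmri @ W
print(pred_emb.shape)  # (200, 16)
```

Because such a mapping is purely linear, it tends to transfer coarse structure but little fine-grained semantics, which motivates replacing it with a deeper alignment network such as MindMapper.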