Indoor scene understanding is crucial for intelligent robotics, and RGB-D images provide complementary depth information to enhance semantic segmentation. However, integrating multi-modal data poses challenges due to their inherent differences. In this paper, we propose CRDF, a Cross-modal Fusion framework specifically tailored for indoor scene RGB-D semantic segmentation using Transformers. CRDF introduces the Pattern-Variable Feature Rectification (PV-FR) and Pattern-Variable Feature Fusion (PV-FF) modules, which effectively extract and fuse multi-scale features from RGB and depth images. By reducing attention computations on depth images during downsampling, CRDF achieves fast convergence and robust performance. Experiments on the NYU Depth v2 and SUN-RGBD datasets demonstrate the effectiveness of CRDF, achieving state-of-the-art results with 55.51% mIoU and 67.73% MPA on NYU Depth v2, and 52.19% mIoU and 64.58% MPA on SUN-RGBD. Code is available at: https://github.com/tqqwww/CRDF.