Self-supervised monocular depth estimation leverages geometric constraints to create pseudo-labels, eliminating the need for ground-truth depth during training. Optical flow and iterative closest point (ICP) methods are widely used to extract 3D geometric constraints, but applying them in self-supervised learning increases network complexity by introducing additional submodules and iterative loops. In this paper, we propose a point cloud projection method that replaces the iterative closest point algorithm. Following depth prediction, our method uses a 3D difference-of-Gaussians measure to extract 3D point cloud structures and a 2D difference-of-Gaussians measure for the 2D structures derived from synthesized multiview surfaces. By projecting the 3D structure onto the 2D image plane, the errors between the target point cloud image and the synthesized views are estimated to strengthen self-supervised training. In our approach, the 3D geometric constraints form a point cloud optimization problem, which requires depth prediction networks with large receptive fields that capture global features. Benefiting from the global receptive field of transformers, we adopt a vision transformer for self-supervised learning and further propose an encoder block that more effectively learns the patterns of the 3D geometric constraints. Experiments on the KITTI dataset show strong performance, with Abs Rel, Sq Rel, RMSE, and RMSE log values of 0.095, 0.691, 4.243, and 0.171, respectively, for monocular depth estimation. Moreover, the model outperforms the comparison models in 3D generation tasks on the Make3D dataset. All code is available at https://github.com/weentiaan/MonoViT-3D.
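The abstract alone does not specify the full pipeline, but the core loss idea can be sketched: lift a predicted depth map to a point cloud, project it back to the image plane, and compare difference-of-Gaussians (DoG) structure responses between the target and a synthesized view. The sketch below is a minimal illustrative interpretation under assumed conventions (pinhole intrinsics `K`, a z-buffered nearest-depth projection, and a 2D DoG filter); all function names and parameters are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def backproject(depth, K):
    """Lift a depth map to a 3D point cloud with pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

def project(points, K):
    """Project 3D points back onto the image plane as a depth image."""
    h, w, _ = points.shape
    z = np.clip(points[..., 2], 1e-6, None)
    u = np.round(K[0, 0] * points[..., 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * points[..., 1] / z + K[1, 2]).astype(int)
    img = np.full((h, w), np.inf)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Simple z-buffer: keep the nearest depth landing on each pixel.
    for vi, ui, zi in zip(v[valid], u[valid], z[valid]):
        img[vi, ui] = min(img[vi, ui], zi)
    img[np.isinf(img)] = 0.0
    return img

def dog(img, sigma1=1.0, sigma2=2.0):
    """Difference-of-Gaussians: a band-pass response that highlights structure."""
    return gaussian_filter(img, sigma1) - gaussian_filter(img, sigma2)

def structure_loss(target_depth, synth_depth, K):
    """Mean absolute DoG error between target and synthesized projections."""
    t = dog(project(backproject(target_depth, K), K))
    s = dog(project(backproject(synth_depth, K), K))
    return float(np.abs(t - s).mean())
```

In training, a loss of this form would be added to the usual photometric reprojection objective; because the DoG response is a band-pass filter, globally consistent structure must be predicted to drive it to zero, which motivates the large receptive field of the transformer backbone.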