Multimodal web rumors, which combine images and text, are confusing and often inflammatory, and can therefore harm national security and social stability. Existing web rumor detection methods consider text content thoroughly but largely ignore image content, including text embedded in images. This paper proposes a multimodal web rumor detection method based on a deep neural network that jointly considers image content, image-embedded text, and text content. The method uses a VGG-19 network to extract image content features, a DenseNet to extract features of the text embedded in images, and an LSTM (Long Short-Term Memory) network to extract text content features. The text features are concatenated with the image features, a fully connected layer maps the shared image-text representation to mean and variance vectors, and a random variable sampled from a Gaussian distribution is used to form a reparameterized multimodal feature that serves as the input to the rumor detector. Experiments show that the method achieves an accuracy of 68.5% on Twitter and 79.4% on Weibo.
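The fusion step described above follows the standard reparameterization trick: the concatenated features are projected to a mean vector and a (log-)variance vector, and the multimodal feature is sampled as z = mu + sigma * eps with eps drawn from a standard Gaussian. The following is a minimal NumPy sketch of that step only; the layer dimensions and the random projection weights are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

def reparameterized_fusion(image_feat, text_feat, rng, latent_dim=128):
    """Fuse image and text features into one multimodal feature via the
    reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    # Concatenate the two modalities into a shared representation.
    h = np.concatenate([image_feat, text_feat], axis=-1)
    in_dim = h.shape[-1]
    # In the actual model these projections are learned fully connected
    # layers; random weights here stand in purely for illustration.
    w_mu = rng.standard_normal((in_dim, latent_dim)) * 0.01
    w_logvar = rng.standard_normal((in_dim, latent_dim)) * 0.01
    mu = h @ w_mu              # mean vector of the shared representation
    logvar = h @ w_logvar      # log-variance vector
    eps = rng.standard_normal(mu.shape)      # noise sampled from N(0, I)
    return mu + np.exp(0.5 * logvar) * eps   # reparameterized feature

rng = np.random.default_rng(0)
z = reparameterized_fusion(rng.standard_normal((4, 512)),   # image features
                           rng.standard_normal((4, 256)),   # text features
                           rng)
print(z.shape)  # (4, 128): one multimodal feature per sample
```

The sampled feature `z` would then be fed to the rumor detector (a classifier head); sampling rather than using `mu` directly regularizes the shared representation, as in variational autoencoders.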