Deep saliency models represent the current state of the art for predicting where humans look in real-world scenes. However, for deep saliency models to inform cognitive theories of attention, we need to know how they predict where people look. Here we open the black box of three prominent deep saliency models (MSI-Net, DeepGaze II, and SAM-ResNet) by modeling the association between each model's output and low-, mid-, and high-level scene features. Specifically, we applied a mixed-effects modeling approach to a large eye-movement dataset to measure the association between each deep saliency model and low-level image saliency, mid-level contour symmetry and junctions, and high-level meaning. We found that, despite different architectures, training regimens, and loss functions, all three deep saliency models were most strongly associated with high-level meaning. These findings suggest that deep saliency models are primarily learning image features associated with scene meaning.
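As a rough illustration of the kind of analysis described above, the sketch below shows how the association between a deep saliency model's output and the three feature levels could be estimated with a linear mixed-effects model. This is not the authors' code: the library choice (statsmodels), the data layout, and all column names (deep_saliency, low_saliency, symmetry, junctions, meaning, scene) are assumptions made for illustration only.

```python
# Illustrative sketch only: a linear mixed-effects model relating a deep saliency
# model's output to low-, mid-, and high-level scene features. All file, variable,
# and column names are hypothetical placeholders, not taken from the paper.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per image region, containing the deep
# model's predicted saliency and the corresponding feature-map values per scene.
df = pd.read_csv("scene_features.csv")

# Fixed effects: low-level image saliency, mid-level symmetry and junctions,
# and high-level meaning; a random intercept per scene accounts for repeated
# measurements within the same image.
model = smf.mixedlm(
    "deep_saliency ~ low_saliency + symmetry + junctions + meaning",
    data=df,
    groups=df["scene"],
)
result = model.fit()
print(result.summary())
```

Comparing the fixed-effect estimates for each predictor would then indicate which feature level is most strongly associated with the deep model's output, in the spirit of the analysis summarized here.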