In recent years, deep learning has been the fastest-growing machine learning method. An artificial neural network takes the biological neural network as its model and designs corresponding algorithms to simulate some of the intelligent activities of the human brain. Because it can analyze and reason over large amounts of data, it can perform various high-level tasks and has brought enormous convenience to human life, such as image analysis [1, 2, 3], face recognition [4], autonomous driving [5], and language translation [6].
With the growing demand for data processing capacity and computing resources, the development of electronic chips is approaching its physical limits, and Moore's law of electronic computing is slowing down [7]. Consequently, traditional electronic neural networks are also limited. To address this problem, Lin et al. proposed the Diffractive Deep Neural Network (D2NN) [8], in which the neural network is physically formed by multiple layers of diffractive surfaces that work in collaboration to optically perform the classification of handwritten digits and image reconstruction as a physical auto-encoder. They also showed that the D2NN benefits from depth, with performance improving as the number of layers increases, and that nonlinear materials can be used to realize optical nonlinearity [9].
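For readers less familiar with D2NNs, the sketch below shows how a single diffractive layer is commonly modeled in simulation: free-space propagation between planes (here via the angular spectrum method) followed by a learnable phase mask, one phase value per diffractive "neuron". This is a minimal PyTorch-style illustration under our own assumptions; the grid size, pixel pitch, wavelength, and layer spacing are placeholder values, not the parameters used in [8].

```python
import math
import torch
import torch.nn as nn

class DiffractiveLayer(nn.Module):
    """One diffractive modulation layer: angular-spectrum free-space propagation
    followed by a learnable phase mask (one phase value per optical 'neuron').
    Grid size, pixel pitch, wavelength and spacing are illustrative only."""
    def __init__(self, n=200, pixel_size=4e-4, wavelength=7.5e-4, distance=0.03):
        super().__init__()
        self.phase = nn.Parameter(torch.zeros(n, n))          # learnable phase of each neuron
        fx = torch.fft.fftfreq(n, d=pixel_size)
        fxx, fyy = torch.meshgrid(fx, fx, indexing="ij")
        arg = 1.0 / wavelength ** 2 - fxx ** 2 - fyy ** 2
        propagating = (arg > 0)                               # keep only propagating waves
        # Angular-spectrum transfer function; evanescent components are discarded.
        h = torch.exp(2j * math.pi * distance * torch.sqrt(arg.clamp(min=0.0)))
        self.register_buffer("h", h * propagating.to(h.dtype))

    def forward(self, field):
        # field: complex tensor whose last two dims span the (n, n) optical plane
        field = torch.fft.ifft2(torch.fft.fft2(field) * self.h)   # propagate to this layer
        return field * torch.exp(1j * self.phase)                 # learnable phase modulation
```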
However, the D2NN structure can only operate in real space, which limits its performance on higher-level tasks and leaves it with fewer advantages over traditional electronic neural networks. A D2NN in Fourier space was therefore proposed, which can preserve spatial information and facilitate image-to-image mapping tasks. Although it shows advantages in cell segmentation and handwritten digit recognition, its structure is overly complex: because of the multiple 2f systems, the network needs at least 10 layers to achieve the best performance, which means there are 10 lenses in the network and the spacing between adjacent layers is only 1 mm. By placing the cumbersome diffractive modulation layers in both Fourier space and real space, a ten-layer nonlinear hybrid D2NN configuration reaches a classification accuracy of 98.1%. However, as the number of layers increases, the saliency-detection performance drops and physical implementation becomes less feasible [10].
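The essential difference of the Fourier-space variant is that the learnable modulation is applied at the Fourier plane of a lens system rather than at a free-space plane; in simulation this amounts to modulating the spectrum of the field, roughly as sketched below. This is a simplification under our own assumptions (an ideal lens pair treated as a forward and inverse Fourier transform, with a phase-only mask), not the exact optical configuration of [10].

```python
import torch
import torch.nn as nn

class FourierPlanePhaseLayer(nn.Module):
    """Learnable phase modulation applied at the Fourier plane of an idealized lens pair."""
    def __init__(self, n=200):
        super().__init__()
        self.phase = nn.Parameter(torch.zeros(n, n))

    def forward(self, field):
        spectrum = torch.fft.fftshift(torch.fft.fft2(field))      # first lens: field -> spectrum
        spectrum = spectrum * torch.exp(1j * self.phase)          # modulate in Fourier space
        return torch.fft.ifft2(torch.fft.ifftshift(spectrum))     # second lens: back to image plane
```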
In recent research, many approaches have been proposed to restructure the network [11, 12, 13, 14, 15, 16]. For example, the Res-D2NN framework [17], inspired by ResNet with its residual connections [18], aims to solve the problems of gradient vanishing and gradient explosion through learnable optical shortcut paths. For a shallow neural network, however, the gradient-vanishing issue is not that severe and the contribution of the optical shortcut paths is minor: in the classification task, the accuracy improvement with optical shortcut paths is only 0.5% for the 5-layer and 0.2% for the 10-layer networks, versus 1.1% and 2.4% for the 15-layer and 20-layer networks, respectively, so the number of layers must be increased to obtain a large improvement. As a result, hardware complexity and optical information loss grow because more than 20 layers are needed. In addition, a larger number of optical shortcuts imposes stricter requirements on the physical environment of the model, widening the gap between the actual and simulated performance. Therefore, to avoid the drawbacks of the residual part in actual deployment, the residual block can be designed as part of the electronic neural network [19, 20]. Zhou et al. proposed an in situ optical backpropagation training method [21, 22] to address the error between simulation and actual deployment: the phase of the neurons in each layer is modulated by cascaded phase-only SLMs, the output of each layer is directed through the optical path to the sensor of the corresponding layer, and the output amplitude and phase are calculated by the four-step phase-shift method. The disadvantage of this scheme is that actual deployment requires a large number of SLMs and sensors, which is expensive and can hardly be applied on a large scale in real life. Therefore, the optimal solution may be to let an electronic neural network handle the fine feature extraction, while low-cost diffractive plates fabricated by lithography serve as a light-speed, parallel pre-processing unit.
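As background on the read-out step mentioned above, four-step phase-shifting interferometry records four intensity frames $I_{0}$, $I_{\pi/2}$, $I_{\pi}$, $I_{3\pi/2}$ with the reference beam shifted by $0$, $\pi/2$, $\pi$, $3\pi/2$, and recovers the phase $\varphi$ and (up to a constant) the amplitude $A$ of the object field. In one standard sign convention (the exact form used in [21, 22] may differ),

$$\varphi = \arctan\frac{I_{3\pi/2}-I_{\pi/2}}{I_{0}-I_{\pi}}, \qquad A \propto \sqrt{\left(I_{3\pi/2}-I_{\pi/2}\right)^{2}+\left(I_{0}-I_{\pi}\right)^{2}}.$$

Because four intensity frames must be captured per layer, every layer needs its own sensor, which underlies the cost argument above.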
In this paper, we design hybrid networks in which an all-optical diffractive network is cascaded with a U-net, and demonstrate their application to gray-scale imaging as a lens and to speckle reconstruction using the MNIST and ImageNet datasets. We name them Hybrid Optical Diffractive Neural Networks (HODNNs). The U-net comes in two types, a convolutional block-based U-net and a residual block-based U-net, each pre-cascaded with a multilayer all-optical network. For the training process, we first train the all-optical neural network, normalize the phase information through a Tanh layer, and then feed the result into the U-net for further feature extraction, with Softmax as the final output layer. Compared with the previous D2NN and Fourier-domain D2NN (F-D2NN), HODNNs not only retain the low power consumption, light-speed processing, and high throughput of light but also gain strong feature-extraction ability from the convolutional block-based U-net. They show advantages both in speckle reconstruction and when the network is trained as a lens to enlarge or reduce the physical size of the image. The overall performance for speckle reconstruction and auto-encoding achieves lower complexity and higher precision than the D2NN: for the speckle reconstruction of handwritten digits, the NPCC reaches −0.999, the SSIM reaches 0.984, and the PSNR reaches 30.439 dB. Moreover, there is no need to perform a Fourier transform on the input image. In other words, HODNNs simplify the whole network model, making it more portable, with lower hardware complexity and higher processing efficiency.
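To make the pipeline concrete, the following is a minimal sketch of the hybrid arrangement described above, reusing the DiffractiveLayer sketch from earlier: a cascade of diffractive layers acts as the optical front-end, its output is normalized with Tanh, and a small electronic U-net performs the fine feature extraction. The number of optical layers, the tiny two-level U-net, and the way the optical output is read out are illustrative assumptions rather than the exact HODNN configuration; task-specific output heads (e.g., the Softmax layer mentioned above) are omitted.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Placeholder two-level U-net with convolutional blocks and one skip connection."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.mid = nn.Sequential(nn.MaxPool2d(2),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Upsample(scale_factor=2, mode="nearest"))
        self.dec = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, x):
        e = self.enc(x)
        return self.dec(torch.cat([e, self.mid(e)], dim=1))   # skip connection

class HODNNSketch(nn.Module):
    """Hybrid pipeline: all-optical diffractive front-end -> Tanh -> electronic U-net."""
    def __init__(self, num_optical_layers=5, n=200):
        super().__init__()
        self.optical = nn.ModuleList(DiffractiveLayer(n=n) for _ in range(num_optical_layers))
        self.unet = TinyUNet()

    def forward(self, img):                         # img: (batch, 1, n, n) real-valued
        field = img.to(torch.complex64)             # encode the image as an optical field
        for layer in self.optical:
            field = layer(field)                    # light-speed, parallel pre-processing
        x = torch.tanh(torch.angle(field))          # Tanh-normalized phase information (assumed read-out)
        return self.unet(x)                         # fine feature extraction in electronics

# Example: process a 200x200 gray-scale input
out = HODNNSketch()(torch.rand(1, 1, 200, 200))     # -> (1, 1, 200, 200)
```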