While a single image measures only the lateral position of an object, as in a conventional camera, the differences between images captured at different sensor planes encode the object's depth. Focal stack data can therefore be used to reconstruct the 3D position of a point object. Here we consider three types of point objects: a single-point object, a three-point object, and a two-point object that is rotated and translated in three dimensions.
First, we consider single-point tracking. In this experiment, we scanned the point source (dotted circle in Fig. 2(a)) over a 3D spatial grid of size 0.6 mm × 0.6 mm (x, y axes) × 20 mm (z axis, i.e., the longitudinal direction). The grid spacing was 0.06 mm along the x, y axes and 2 mm along the z axis, giving 1,331 grid points in total. For each measurement, two images were recorded from the graphene sensor planes. We randomly split the data into a training set of 1,131 samples (85% of the total) and a test set of 200 samples (15% of the total); this splitting procedure was used in all experiments. To estimate the three spatial coordinates of the point object from the focal stack data, we trained three separate MLP [15] neural networks (one for each spatial dimension) with a mean-square error (MSE) loss. The results (Fig. 3(a, b)) show that even with the limited resolution of the 4 × 4 arrays and only two sensor planes, the point-object position can be determined very accurately. We used the root-mean-square error (RMSE) to quantify the estimation accuracy on the test set, obtaining RMSE values of 0.012 mm, 0.014 mm, and 1.196 mm along the x-, y-, and z-directions, respectively.
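As an illustration of the per-axis regression described above, the following is a minimal sketch in PyTorch. The layer widths, optimizer, and training settings are illustrative assumptions only; the actual architectures and hyperparameters are given in SI B-IV.

```python
# Minimal sketch of the per-axis MLP regression (PyTorch).
# Layer widths and training settings are assumptions for illustration;
# the architecture used in the experiments is described in SI B-IV.
import torch
import torch.nn as nn

def make_mlp(n_inputs=2 * 4 * 4):
    # Input: flattened two-plane 4 x 4 focal stack (32 readout values).
    # Output: one spatial coordinate (x, y, or z) of the point object.
    return nn.Sequential(
        nn.Linear(n_inputs, 64),
        nn.ReLU(),
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, 1),
    )

def train(model, stacks, coords, epochs=500, lr=1e-3):
    # stacks: (N, 32) focal-stack readouts; coords: (N, 1) ground-truth positions.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(stacks), coords)
        loss.backward()
        opt.step()
    return model

# Three independent regressors, one per spatial axis, as in the experiment.
models = {axis: make_mlp() for axis in ("x", "y", "z")}
```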
Given the good tracking performance with the small-scale (i.e., 4 × 4) graphene transistor focal stack, we studied how the tracking performance scales with array size. We assessed the performance advantages of larger arrays by using conventional CMOS sensors to acquire the focal stack data. For each point-source position, we obtained multi-focal-plane image stacks by multiple exposures at varying CMOS sensor depths (note that focal stack data collected by a CMOS sensor with multiple exposures is comparable to that obtained by the proposed transparent array with a single exposure, as long as the imaged scene is static), and down-sampled the high-resolution (1280 × 1024) CMOS images to 4 × 4, 9 × 9, and 32 × 32. We observed that tracking performance improves as the array size increases; results are presented in SI B-V.
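The down-sampling step can be sketched as simple block averaging over each high-resolution CMOS frame; the exact smoothing and down-sampling procedure used in the experiments is described in the SI, so the function below is illustrative only.

```python
# Sketch of the down-sampling step, assuming simple block averaging;
# the procedure actually applied to the CMOS images is described in SI B-V.
import numpy as np

def downsample(image, out_shape):
    """Average non-overlapping blocks of a 2D image down to out_shape.
    The image is cropped so that each dimension divides evenly."""
    h, w = out_shape
    H = (image.shape[0] // h) * h
    W = (image.shape[1] // w) * w
    blocks = image[:H, :W].reshape(h, H // h, w, W // w)
    return blocks.mean(axis=(1, 3))

high_res = np.random.rand(1024, 1280)          # stand-in for one CMOS frame
coarse = {n: downsample(high_res, (n, n)) for n in (4, 9, 32)}
```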
We next considered the possibility of tracking multi-point objects. Here, the object consisted of three point objects, which can be arranged in three possible relative configurations. We synthesized 1,880 3-point object images as sums of single-point object images from either the graphene detectors or the CMOS detectors (see details of the focal stack synthesis in SI B-II). This synthesis approach is reasonable because the detector response is sufficiently linear, and it avoids the complexity of precisely positioning multiple point objects in the optical setup. To estimate the spatial coordinates of the 3-point synthetic objects, we trained an MLP neural network with an MSE loss that accounts for the ordering ambiguity of the network outputs (see Eq. (1) in SI). We used 3-point object data synthesized from the CMOS-sensor readouts of the single-point tracking experiment (with each CMOS image smoothed by spatial averaging and then down-sampled to 9 × 9). We found that the trained MLP neural network can estimate a multi-point object's position with remarkable accuracy; see Fig. 3(c-d). The RMSE values calculated over the entire test set are 0.017 mm, 0.016 mm, and 0.59 mm along the x-, y-, and z-directions, respectively. As in the single-point object tracking experiment, the multi-point object tracking performance improves with increasing sensor resolution (see SI B-V).
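A loss of this type can be sketched as follows, assuming the network outputs the three predicted point coordinates in an arbitrary order; the exact loss used in the experiments is Eq. (1) in the SI, and the function below is only a minimal illustration of the idea.

```python
# Sketch of a permutation-invariant MSE loss for the 3-point object,
# assuming the network outputs the three points in an arbitrary order;
# the loss actually used is Eq. (1) in the SI.
from itertools import permutations
import torch

def permutation_invariant_mse(pred, target):
    # pred, target: (3, 3) tensors of predicted / true point coordinates.
    # Take the minimum MSE over all orderings of the predicted points.
    losses = [((pred[list(p)] - target) ** 2).mean()
              for p in permutations(range(3))]
    return torch.stack(losses).min()

# Usage sketch with random stand-in coordinates:
pred = torch.rand(3, 3, requires_grad=True)
target = torch.rand(3, 3)
loss = permutation_invariant_mse(pred, target)
```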
Finally, we considered tracking of a two-point object that is rotated and translated in three dimensions. This task aims to demonstrate 3D tracking of a continuously moving object, such as a rotating solid rod. As in the 3-point object tracking experiment, we synthesized 2-point object focal stacks from single-point object focal stacks captured with the graphene transparent transistor array. The two points lie in the same x-y plane and are separated by a fixed distance, as if connected by a solid rod. The rod is allowed to rotate in the x-y plane and translate along the z-axis, forming helical trajectories, as shown in Fig. 3(e). We trained an MLP neural network on 242 training trajectories using an MSE loss to estimate the object's spatial coordinates and tested its performance on 38 rotating test trajectories. Figure 3(e) shows the results for one test trajectory. The neural network estimated the orientation (x- and y-coordinates) and depth (z-coordinate) of the test objects with good accuracy: the RMSE values along the x-, y-, and z-directions for the entire test set are 0.016 mm, 0.024 mm, and 0.65 mm, respectively.
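For illustration, one such helical trajectory can be generated as sketched below; the rod half-length, angular sampling, and z-range are illustrative assumptions rather than the values used in the experiment.

```python
# Sketch of one helical trajectory for the two-point "rod" object:
# the rod rotates in the x-y plane while its center translates along z.
# Half-length, angular step, and z-range are illustrative assumptions.
import numpy as np

def helical_trajectory(n_steps=50, half_length=0.2, z_start=0.0, z_end=20.0):
    angles = np.linspace(0.0, 2.0 * np.pi, n_steps)
    zs = np.linspace(z_start, z_end, n_steps)
    p1 = np.stack([half_length * np.cos(angles),
                   half_length * np.sin(angles), zs], axis=1)
    p2 = np.column_stack([-p1[:, :2], zs])       # opposite end of the rod
    return p1, p2                                # (n_steps, 3) endpoint coordinates
```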
SI B-IV gives further details on the MLP neural network architectures and training.