Since depth estimation is vital for scene understanding and improved 3D perception, the proposed algorithm combines depth estimation with robust object classification using a monocular camera. A single camera offers high resolution, is portable, and is much cheaper than a depth camera, and several state-of-the-art methods already achieve impressive object classification performance from color imagery alone. The goal of this research is to estimate a depth map from a monocular camera and to fuse that depth with color images to improve object classification performance. If a monocular camera can infer depth on its own, the cost of purchasing dedicated depth sensors can be reduced significantly. Accordingly, the proposed algorithm operates on depth inferred from a monocular camera combined with color features, and the work comprises two parts. First, the proposed algorithm estimates depth from monocular camera measurements. Second, it uses this depth information along with conventional color image features to improve object classification performance. Depth and color are fused because the two modalities are expected to carry complementary characteristics that can be exploited for improved object recognition. The following section describes the proposed approach.
For the first part of the research, two state-of-the-art methods are evaluated and their performance compared in order to select the better method for depth estimation. For the second part, relevant depth and RGB features are extracted from the fused RGB and depth images and used together to classify objects. The following subsections describe each approach in detail.
3.1 Depth Estimation from Monocular Camera
Liu et al. [13] show that the continuous nature of depth values allows the depth estimation problem to be formulated as a continuous conditional random field (CRF) learning problem. They propose a deep convolutional neural field model that jointly exploits the capacity of a deep CNN and a continuous CRF: the unary and pairwise potentials of the structured inference (CRF) model are learned jointly within a deep CNN framework. The CNN predicts the unary and pairwise potentials on given superpixels, which are fed into the CRF loss layer, and the loss on the predicted depth map is then backpropagated through the CNN (see Fig. 1). Liu et al. also show that the integral defining the CRF partition function can be calculated in closed form, allowing an exact solution to the maximum-likelihood estimation problem, so no approximate inference is required. To make the problem tractable, the image is over-segmented into many small regions called superpixels; each superpixel defines a homogeneous region where depth is assumed to be constant. Based on this assumption, the CRF model is built by defining unary and pairwise potentials on superpixels and their neighbors. A rectangular patch is extracted around each superpixel to capture its local context and is fed into a deep network (the AlexNet architecture) to estimate the depth of that superpixel; the final depth map is constrained to stay close to these per-superpixel estimates. The task of the deep network is thus to infer depth based on the local context of the region around each superpixel. Once depth values have been inferred by the deep network, the method estimates the final depth map by MAP inference:
$$y^{*}=\underset{y}{\arg\max}\;\Pr\left(y\mid x\right)$$
The joint energy function is the sum of unary and pairwise potentials defined on the superpixels and their neighbors:
$$E\left(y,x\right)=\sum_{p\in\mathcal{N}}U\left(y_{p},x\right)+\sum_{(p,q)\in\mathcal{S}}V\left(y_{p},y_{q},x\right)$$
where \(U\left(y_{p},x\right)\) is given by:
$$U\left(y_{p},x;\theta\right)=\left(y_{p}-z_{p}\left(\theta\right)\right)^{2},\quad\forall p=1,2,\dots,n$$
The unary potential captures how close the inferred depth value is to the ground truth, where \(y_{p}\) is the depth measurement obtained from lidar. If no depth measurement is available in a region, it is filled with a random value, and the optimization is expected to recover the correct depth there. The pairwise potential \(V\left(y_{p},y_{q},x\right)\) is given by:
$$V\left(y_{p},y_{q},x;\beta\right)=\frac{1}{2}R_{pq}\left(y_{p}-y_{q}\right)^{2},\quad\forall p,q=1,2,\dots,n$$
The pairwise potential captures the context and the properties of neighboring superpixels by measuring the similarity of their color features. This similarity is encoded by \(R_{pq}\), where
$$R_{pq}=\beta^{T}\left[S_{pq}^{1},\dots,S_{pq}^{K}\right]^{T}=\sum_{k=1}^{K}\beta_{k}S_{pq}^{k}$$
Here \(\beta\) contains the weights learned by a single-layer network, and
$$S_{pq}^{k}=\exp\left(-\gamma\left\Vert S_{p}^{k}-S_{q}^{k}\right\Vert\right),\quad k=1,2,3$$
where \(S_{p}^{k}\) and \(S_{q}^{k}\) are observations of superpixels \(p\) and \(q\) computed from three similarity cues (\(k=1,2,3\)): color, color histogram, and LBP texture, and \(\Vert\cdot\Vert\) denotes the \(\ell_2\) norm.
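To make the pairwise term concrete, the following is a minimal NumPy sketch of computing \(R_{pq}\) for one pair of neighboring superpixels. The descriptors (mean color, color histogram, LBP histogram), their dimensions, and the weights \(\beta\) are illustrative assumptions, not values from [13].

```python
import numpy as np

def pairwise_similarity(feats_p, feats_q, beta, gamma=1.0):
    """Compute R_pq = sum_k beta_k * exp(-gamma * ||S_p^k - S_q^k||)
    for one pair of neighboring superpixels.

    feats_p, feats_q : lists of K feature vectors (e.g. mean color,
                       color histogram, LBP histogram) -- hypothetical
                       descriptors standing in for S_p^k and S_q^k.
    beta             : length-K weight vector learned by the single-layer
                       network (assumed given here).
    gamma            : bandwidth of the similarity kernel.
    """
    S_pq = np.array([
        np.exp(-gamma * np.linalg.norm(fp - fq))   # S_pq^k for k = 1..K
        for fp, fq in zip(feats_p, feats_q)
    ])
    return float(np.dot(beta, S_pq))               # R_pq = beta^T [S_pq^1 .. S_pq^K]

# Example with random stand-in descriptors (K = 3 similarity cues).
rng = np.random.default_rng(0)
feats_p = [rng.random(3), rng.random(32), rng.random(59)]   # color, histogram, LBP
feats_q = [rng.random(3), rng.random(32), rng.random(59)]
beta = np.array([0.5, 0.3, 0.2])                            # illustrative weights
print(pairwise_similarity(feats_p, feats_q, beta))
```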
The MAP inference can then be carried out on the conditional probability once closed-form expressions are found for the energy function and the partition function:
$$\Pr\left(y\mid x\right)=\frac{\exp\left(-E\left(y,x\right)\right)}{Z\left(x\right)}$$
Liu et al. [13] prove analytically that such a closed-form expression exists. The final learning objective, with regularization on \(\theta\) and \(\beta\), is to minimize:
$$\frac{\lambda_{1}}{2}\left\Vert\theta\right\Vert_{2}^{2}+\frac{\lambda_{2}}{2}\left\Vert\beta\right\Vert_{2}^{2}-\sum_{i=1}^{N}\log\Pr\left(y_{i}\mid x_{i};\theta,\beta\right)$$
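Since the energy is quadratic in \(y\), MAP inference reduces to solving a linear system. The sketch below illustrates this closed-form step, assuming the pairwise sum visits each ordered neighbor pair so that \(E(y)=y^{T}(I+D-R)y-2z^{T}y+z^{T}z\) with \(D_{pp}=\sum_{q}R_{pq}\), which gives \(y^{*}=(I+D-R)^{-1}z\); here \(z\) denotes the CNN-regressed superpixel depths and the numbers used are hypothetical.

```python
import numpy as np

def map_inference(z, R):
    """Closed-form MAP depth estimate for the CRF energy
        E(y) = sum_p (y_p - z_p)^2 + sum_{(p,q)} 0.5 * R_pq * (y_p - y_q)^2,
    assuming the pairwise sum visits each ordered neighbor pair, so that
        E(y) = y^T A y - 2 z^T y + z^T z,   A = I + D - R,
    and therefore y* = A^{-1} z.

    z : (n,) CNN-regressed depth for each superpixel (unary term).
    R : (n, n) symmetric matrix of pairwise weights R_pq
        (zero for non-neighboring superpixels).
    """
    n = len(z)
    D = np.diag(R.sum(axis=1))      # D_pp = sum_q R_pq
    A = np.eye(n) + D - R
    return np.linalg.solve(A, z)    # y* = A^{-1} z

# Tiny example: three superpixels in a chain with equal pairwise weights.
z = np.array([10.0, 12.0, 20.0])                 # hypothetical CNN depths
R = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.8],
              [0.0, 0.8, 0.0]])
print(map_inference(z, R))                       # smoothed depth estimates
```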
3.2 Unsupervised Learning by Deep CNN
The second method evaluated is Garg et al.'s work [14], which is compared against the superpixel-based approach above. It uses an unsupervised learning framework to estimate depth: depth predictions for one image are evaluated by how well the corresponding disparities reconstruct the other image of a stereo pair. The method resembles an autoencoder and uses the left and right cameras provided in the KITTI dataset to compute a reconstruction error. Although no ground-truth (absolute) depth values are explicitly needed, two cameras with known focal length and baseline are required (to fix the scale) in order to train the network efficiently. A color image is fed to the deep CNN and a predicted depth map is obtained at the output. Since no ground-truth depth map is available for supervision, the input (left) image is instead reconstructed by warping the right image according to the per-pixel disparity estimates; the left and right images are captured by two side-by-side cameras with a baseline of 0.5 m, so the concept follows directly from the principle of stereo vision. The network comprises two parts: a convolutional encoder and a deconvolutional decoder. The encoder encodes the color image so that the decoder can infer a depth map of the same size as the color image at its output. A key element is the skip architecture, which sharpens the local details that would otherwise be blurred as the deconvolutional layers upsample the depth map from the encoded representation at the end of the convolutional layers.
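As an illustration of this encoder/decoder idea, the following is a minimal PyTorch sketch of a convolutional encoder, a deconvolutional decoder, and one skip connection; the layer sizes and channel counts are illustrative assumptions and do not reproduce the exact architecture of [14].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthEncoderDecoder(nn.Module):
    """Minimal encoder-decoder with a skip connection: the encoder halves the
    spatial resolution twice, the decoder upsamples back to the input size,
    and an early encoder feature map is concatenated into the decoder to
    sharpen local detail that upsampling alone would blur."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)             # H/2
        self.enc2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)            # H/4
        self.dec1 = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)   # H/2
        self.dec2 = nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1)    # H, 1-channel map

    def forward(self, x):
        e1 = F.relu(self.enc1(x))            # skip source at H/2
        e2 = F.relu(self.enc2(e1))
        d1 = F.relu(self.dec1(e2))           # upsample back to H/2
        d1 = torch.cat([d1, e1], dim=1)      # skip connection (channel concatenation)
        return self.dec2(d1)                 # predicted per-pixel disparity/depth

left = torch.randn(1, 3, 128, 416)           # hypothetical KITTI-sized crop
prediction = DepthEncoderDecoder()(left)
print(prediction.shape)                      # torch.Size([1, 1, 128, 416])
```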
The total training error of the network is given by:
$$E=\sum_{i=1}^{N}\left(E_{\mathrm{reconst}}^{i}+\gamma E_{\mathrm{smooth}}^{i}\right)$$
Here the total error comprises the photometric reconstruction error \(E_{\mathrm{reconst}}\) and the smoothness term \(E_{\mathrm{smooth}}\), summed over the \(N\) training instances. \(E_{\mathrm{reconst}}\) is obtained by warping the right-camera image to the left camera using the disparity map, and is given by:
$$E_{\mathrm{reconst}}^{i}=\int_{\Omega}\left\Vert I_{w}^{i}\left(x\right)-I_{1}^{i}\left(x\right)\right\Vert^{2}dx=\int_{\Omega}\left\Vert I_{2}^{i}\left(x+D^{i}\left(x\right)\right)-I_{1}^{i}\left(x\right)\right\Vert^{2}dx$$
Here \(\Omega\) is the image domain, i.e. the region covering all pixels. Since the warped image is not linear in the disparity, the warp is linearized iteratively:
$$I_{2}\left(x+D^{n}\left(x\right)\right)\approx I_{2}\left(x+D^{n-1}\left(x\right)\right)+\left(D^{n}\left(x\right)-D^{n-1}\left(x\right)\right)I_{2h}\left(x+D^{n-1}\left(x\right)\right)$$
A notable aspect of the method is how the warping function is handled: \(I_{2}\left(x+D\left(x\right)\right)\) can be linearized as a first-order Taylor expansion, as above, provided the change in disparity between two iterations at a pixel remains small; \(I_{2h}\) denotes the horizontal gradient of \(I_{2}\). The disparity \(D^{i}\left(x\right)\) is given by \(fB/d^{i}\left(x\right)\), where \(f\) is the focal length of the camera, \(B\) is the baseline between the two cameras, and \(d^{i}\left(x\right)\) is the predicted depth. The smoothness term is a penalty that keeps the gradient of the depth map small, an important property satisfied by natural depth maps. The penalty is given by:
$$E_{\mathrm{smooth}}^{i}=\left\Vert\nabla D^{i}\left(x\right)\right\Vert^{2}$$
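The following NumPy sketch illustrates how the total per-image error defined above (reconstruction plus smoothness) can be evaluated for rectified images, where disparity is a purely horizontal shift: the right image is warped to the left view using the predicted disparity, the photometric error is summed over pixels, and the smoothness penalty is the squared gradient of the disparity map. It is a discrete stand-in for the integrals above (with simple nearest-neighbor sampling rather than the Taylor-linearized warp), not the training code of [14].

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Sample the right image at x + D(x) along each row (nearest-neighbor
    sampling for simplicity; [14] linearizes the warp instead to keep it
    differentiable)."""
    h, w = right.shape
    cols = np.arange(w)[None, :] + disparity             # x + D(x), per pixel
    cols = np.clip(np.rint(cols).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return right[rows, cols]

def total_loss(left, right, disparity, gamma=0.01):
    """E^i = E_reconst^i + gamma * E_smooth^i for one stereo pair."""
    warped = warp_right_to_left(right, disparity)
    e_reconst = np.sum((warped - left) ** 2)              # photometric error over pixels
    dy, dx = np.gradient(disparity)
    e_smooth = np.sum(dy ** 2 + dx ** 2)                  # ||grad D||^2 penalty
    return e_reconst + gamma * e_smooth

# Hypothetical grayscale stereo pair and predicted disparity map.
rng = np.random.default_rng(0)
left = rng.random((128, 416))
right = rng.random((128, 416))
disparity = np.full((128, 416), 5.0)                      # constant 5-pixel disparity
print(total_loss(left, right, disparity))
```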
3.3 Object Classification from RGB-D Images
Neural networks have a proven record of excellence in object classification. The proposed algorithm therefore uses the rich semantic feature information obtained from a high-performing pre-trained network. Currently, one of the best pre-trained object classifiers available through MatConvNet is ResNet-50 [15]. ResNet-50 is not a conventional convolutional network but a deep residual network. Typical deep convolutional neural networks combine feature information extracted at multiple levels of the network in an attempt to merge high-, mid-, and low-level features. Simply adding layers to make a network deeper only works up to a point, because of vanishing gradients and because there is no guarantee that the additional layers learn anything new of value. Residual networks address both problems: to ensure that an added layer can learn new information, the layer's output is given access to its input before transformation. The intuition behind this architecture is that it is easier to optimize the residual mapping than the original mapping. Deep residual networks have been shown to outperform conventional deep convolutional networks, with the best-performing variant having 152 layers. The proposed algorithm extracts the output of the final fully connected layer of the ResNet-50 architecture and uses it as feature input to an SVM, ensuring that the RGB features used to classify vehicles and pedestrians are rich and discriminative; each training image thus provides 2048 meaningful RGB feature descriptors.

From the depth map, the proposed algorithm generates a 3D point cloud and extracts two sets of features: depth features and normal features. For the depth features, it computes the covariance matrix of the point cloud within the bounded/masked region and uses its entries as features. For the normal features, it computes the mean and variance of the local surface normals estimated for each pixel from the 3D point cloud. Surface normals are computed by extracting a rectangular box, selecting the points whose estimated depth lies within a given threshold, and taking the eigenvector with the smallest eigenvalue of that point cluster; this minimum eigenvector represents the surface normal of the selected points. When training each SVM, the feature values are standardized. Tests are performed using k-fold cross-validation to ensure that no overfitting occurs in the proposed model; in the experiments, k is set to 10.
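The following sketch, using NumPy and scikit-learn in place of the MatConvNet/SVM pipeline described above, illustrates the feature construction and classifier training: the covariance of the masked point cloud, surface normals taken as the minimum-eigenvalue eigenvector of local neighborhoods (summarized by their mean and variance), concatenation with a precomputed 2048-dimensional CNN descriptor, and a standardized SVM evaluated with 10-fold cross-validation. All data here are random stand-ins.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def surface_normal(points):
    """Normal of a local point neighborhood: eigenvector of the covariance
    matrix with the smallest eigenvalue."""
    cov = np.cov(points.T)                      # 3x3 covariance of the neighborhood
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return eigvecs[:, 0]                        # minimum-eigenvalue eigenvector

def depth_features(cloud, neighborhoods):
    """Depth features = flattened covariance of the masked point cloud;
    normal features = mean and variance of the local surface normals."""
    cov_feat = np.cov(cloud.T).ravel()                          # 9 values
    normals = np.array([surface_normal(n) for n in neighborhoods])
    normal_feat = np.concatenate([normals.mean(axis=0), normals.var(axis=0)])
    return np.concatenate([cov_feat, normal_feat])              # 15 values

# Hypothetical training set: precomputed 2048-D ResNet-50 descriptors plus
# depth/normal features for each masked region, with vehicle/pedestrian labels.
rng = np.random.default_rng(0)
n_samples = 40
X = []
for _ in range(n_samples):
    cnn_feat = rng.random(2048)                                 # stand-in CNN descriptor
    cloud = rng.random((200, 3))                                # masked 3D points
    neighborhoods = [rng.random((30, 3)) for _ in range(5)]     # local point patches
    X.append(np.concatenate([cnn_feat, depth_features(cloud, neighborhoods)]))
X = np.array(X)
y = np.array([0, 1] * (n_samples // 2))                         # 0 = vehicle, 1 = pedestrian

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))     # standardize, then SVM
scores = cross_val_score(clf, X, y, cv=10)                      # 10-fold cross-validation
print(scores.mean())
```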