Since depth estimation is vital for scene understanding and improved 3D perception, the proposed algorithm combines depth estimation with robust object classification using a monocular camera. A single camera offers high resolution, is portable, and is much cheaper than a depth camera, and several state-of-the-art methods already achieve impressive object classification performance from color imagery alone. The goal of this research is to estimate a depth map from a monocular camera and to fuse that depth with color images to improve object classification performance. If a monocular camera can infer depth on its own, the cost of purchasing dedicated depth sensors can be reduced significantly. Accordingly, the proposed algorithm operates on depth inferred from a monocular camera combined with color features, and the work comprises two parts. First, the proposed algorithm estimates depth from monocular camera measurements. Second, it uses this depth information along with conventional color image features to improve object classification performance. Depth and color are fused because the two modalities are expected to carry complementary characteristics that can be exploited for improved object recognition. The following section describes the proposed approach.
For the first part of the research, two state-of-the-art methods are evaluated and their performance compared in order to select the better method for depth estimation. For the second part, relevant depth and RGB features are extracted from the fused RGB and depth images and used together to classify objects. The following subsections describe each approach in detail.
3.1 Depth Estimation from Monocular Camera
Liu et al. [13] show that the continuous nature of depth values allows the depth estimation problem to be formulated as a continuous conditional random field (CRF) learning problem. They propose a deep convolutional neural field model that jointly exploits the capacity of a deep CNN and a continuous CRF: the unary and pairwise potentials of the structured inference (CRF) model are learned jointly within a deep CNN framework. The CNN predicts the unary and pairwise potentials on given superpixels, which are fed into the CRF loss layer, and the loss on the predicted depth map is then backpropagated through the CNN (see Fig. 1). Liu et al. also show that the integral defining the CRF partition function can be calculated in closed form, allowing an exact solution to the maximum-likelihood estimation problem, so no approximate inference is required. To make the problem tractable, the image is over-segmented into many small regions called superpixels; each superpixel defines a homogeneous region where depth is assumed to be constant. Based on this assumption, the CRF model is built by defining unary and pairwise potentials on superpixels and their neighbors. A rectangular patch is extracted around each superpixel to capture its local context and is fed into a deep network (the AlexNet architecture) to estimate the depth of that superpixel; the final depth map is constrained to stay close to these per-superpixel estimates. The task of the deep network is thus to infer depth based on the local context of the region around each superpixel. Once depth values have been inferred by the deep network, the method estimates the final depth map by MAP inference:
$$y^{*}=\underset{y}{\arg\max}\;\Pr\left(y\mid x\right)$$
The joint energy function is the sum of unary and pairwise potentials defined on the superpixels and their neighbors:
$$E\left(y,x\right)=\sum_{p\in\mathcal{N}}U\left(y_{p},x\right)+\sum_{(p,q)\in\mathcal{S}}V\left(y_{p},y_{q},x\right)$$
where \(U\left(y_{p},x\right)\) is given by:
$$U\left(y_{p},x;\theta\right)=\left(y_{p}-z_{p}\left(\theta\right)\right)^{2},\quad\forall p=1,2,\dots,n$$
The unary potential captures how close the inferred depth value is to the ground truth, where \(y_{p}\) is the depth measurement obtained from lidar. If no depth measurement is available in a region, it is filled with a random value, and the optimization is expected to recover the correct depth there. The pairwise potential \(V\left(y_{p},y_{q},x\right)\) is given by:
$$V\left(y_{p},y_{q},x;\beta\right)=\frac{1}{2}R_{pq}\left(y_{p}-y_{q}\right)^{2},\quad\forall p,q=1,2,\dots,n$$
The pairwise potential captures the context and the properties of neighboring superpixels by measuring the similarity of their color features. This similarity is encoded by \(R_{pq}\), where
$$R_{pq}=\beta^{T}\left[S_{pq}^{1},\dots,S_{pq}^{K}\right]^{T}=\sum_{k=1}^{K}\beta_{k}S_{pq}^{k}$$
Here \(\beta\) contains the weights learned by a single-layer network, and
$$S_{pq}^{k}=\exp\left(-\gamma\left\Vert S_{p}^{k}-S_{q}^{k}\right\Vert\right),\quad k=1,2,3$$
where \(S_{p}^{k}\) and \(S_{q}^{k}\) are observations of superpixels \(p\) and \(q\) computed from three similarity cues (\(k=1,2,3\)): color, color histogram, and LBP texture, and \(\Vert\cdot\Vert\) denotes the \(\ell_2\) norm.
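To make the pairwise term concrete, the following is a minimal NumPy sketch of computing \(R_{pq}\) for one pair of neighboring superpixels. The descriptors (mean color, color histogram, LBP histogram), their dimensions, and the weights \(\beta\) are illustrative assumptions, not values from [13].

```python
import numpy as np

def pairwise_similarity(feats_p, feats_q, beta, gamma=1.0):
    """Compute R_pq = sum_k beta_k * exp(-gamma * ||S_p^k - S_q^k||)
    for one pair of neighboring superpixels.

    feats_p, feats_q : lists of K feature vectors (e.g. mean color,
                       color histogram, LBP histogram) -- hypothetical
                       descriptors standing in for S_p^k and S_q^k.
    beta             : length-K weight vector learned by the single-layer
                       network (assumed given here).
    gamma            : bandwidth of the similarity kernel.
    """
    S_pq = np.array([
        np.exp(-gamma * np.linalg.norm(fp - fq))   # S_pq^k for k = 1..K
        for fp, fq in zip(feats_p, feats_q)
    ])
    return float(np.dot(beta, S_pq))               # R_pq = beta^T [S_pq^1 .. S_pq^K]

# Example with random stand-in descriptors (K = 3 similarity cues).
rng = np.random.default_rng(0)
feats_p = [rng.random(3), rng.random(32), rng.random(59)]   # color, histogram, LBP
feats_q = [rng.random(3), rng.random(32), rng.random(59)]
beta = np.array([0.5, 0.3, 0.2])                            # illustrative weights
print(pairwise_similarity(feats_p, feats_q, beta))
```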
The MAP inference can then be carried out on the conditional probability once closed-form expressions are found for the energy function and the partition function:
$$\Pr\left(y\mid x\right)=\frac{\exp\left(-E\left(y,x\right)\right)}{Z\left(x\right)}$$
Liu et al. [13] prove analytically that such a closed-form expression exists. The final learning objective, with regularization on \(\theta\) and \(\beta\), is to minimize:
$$\frac{\lambda_{1}}{2}\left\Vert\theta\right\Vert_{2}^{2}+\frac{\lambda_{2}}{2}\left\Vert\beta\right\Vert_{2}^{2}-\sum_{i=1}^{N}\log\Pr\left(y_{i}\mid x_{i};\theta,\beta\right)$$
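Since the energy is quadratic in \(y\), MAP inference reduces to solving a linear system. The sketch below illustrates this closed-form step, assuming the pairwise sum visits each ordered neighbor pair so that \(E(y)=y^{T}(I+D-R)y-2z^{T}y+z^{T}z\) with \(D_{pp}=\sum_{q}R_{pq}\), which gives \(y^{*}=(I+D-R)^{-1}z\); here \(z\) denotes the CNN-regressed superpixel depths and the numbers used are hypothetical.

```python
import numpy as np

def map_inference(z, R):
    """Closed-form MAP depth estimate for the CRF energy
        E(y) = sum_p (y_p - z_p)^2 + sum_{(p,q)} 0.5 * R_pq * (y_p - y_q)^2,
    assuming the pairwise sum visits each ordered neighbor pair, so that
        E(y) = y^T A y - 2 z^T y + z^T z,   A = I + D - R,
    and therefore y* = A^{-1} z.

    z : (n,) CNN-regressed depth for each superpixel (unary term).
    R : (n, n) symmetric matrix of pairwise weights R_pq
        (zero for non-neighboring superpixels).
    """
    n = len(z)
    D = np.diag(R.sum(axis=1))      # D_pp = sum_q R_pq
    A = np.eye(n) + D - R
    return np.linalg.solve(A, z)    # y* = A^{-1} z

# Tiny example: three superpixels in a chain with equal pairwise weights.
z = np.array([10.0, 12.0, 20.0])                 # hypothetical CNN depths
R = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.8],
              [0.0, 0.8, 0.0]])
print(map_inference(z, R))                       # smoothed depth estimates
```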
3.2 Unsupervised Learning by Deep CNN
The second method evaluated is Garg et al.'s work [14], which is compared against the superpixel-based approach above. It uses an unsupervised learning framework to estimate depth: depth predictions for one image are evaluated by how well the corresponding disparities reconstruct the other image of a stereo pair. The method resembles an autoencoder and uses the left and right cameras provided in the KITTI dataset to compute a reconstruction error. Although no ground-truth (absolute) depth values are explicitly needed, two cameras with known focal length and baseline are required (to fix the scale) in order to train the network efficiently. A color image is fed to the deep CNN and a predicted depth map is obtained at the output. Since no ground-truth depth map is available for supervision, the input (left) image is instead reconstructed by warping the right image according to the per-pixel disparity estimates; the left and right images are captured by two side-by-side cameras with a baseline of 0.5 m, so the concept follows directly from the principle of stereo vision. The network comprises two parts: a convolutional encoder and a deconvolutional decoder. The encoder encodes the color image so that the decoder can infer a depth map of the same size as the color image at its output. A key element is the skip architecture, which sharpens the local details that would otherwise be blurred as the deconvolutional layers upsample the depth map from the encoded representation at the end of the convolutional layers.
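As an illustration of this encoder/decoder idea, the following is a minimal PyTorch sketch of a convolutional encoder, a deconvolutional decoder, and one skip connection; the layer sizes and channel counts are illustrative assumptions and do not reproduce the exact architecture of [14].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthEncoderDecoder(nn.Module):
    """Minimal encoder-decoder with a skip connection: the encoder halves the
    spatial resolution twice, the decoder upsamples back to the input size,
    and an early encoder feature map is concatenated into the decoder to
    sharpen local detail that upsampling alone would blur."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)             # H/2
        self.enc2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)            # H/4
        self.dec1 = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)   # H/2
        self.dec2 = nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1)    # H, 1-channel map

    def forward(self, x):
        e1 = F.relu(self.enc1(x))            # skip source at H/2
        e2 = F.relu(self.enc2(e1))
        d1 = F.relu(self.dec1(e2))           # upsample back to H/2
        d1 = torch.cat([d1, e1], dim=1)      # skip connection (channel concatenation)
        return self.dec2(d1)                 # predicted per-pixel disparity/depth

left = torch.randn(1, 3, 128, 416)           # hypothetical KITTI-sized crop
prediction = DepthEncoderDecoder()(left)
print(prediction.shape)                      # torch.Size([1, 1, 128, 416])
```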
The total training error of the network is given by:
$$E=\sum_{i=1}^{N}\left(E_{\mathrm{reconst}}^{i}+\gamma E_{\mathrm{smooth}}^{i}\right)$$
Here the total error comprises the photometric reconstruction error \(E_{\mathrm{reconst}}\) and the smoothness term \(E_{\mathrm{smooth}}\), summed over the \(N\) training instances. \(E_{\mathrm{reconst}}\) is obtained by warping the right-camera image to the left camera using the disparity map, and is given by:
$$E_{\mathrm{reconst}}^{i}=\int_{\Omega}\left\Vert I_{w}^{i}\left(x\right)-I_{1}^{i}\left(x\right)\right\Vert^{2}dx=\int_{\Omega}\left\Vert I_{2}^{i}\left(x+D^{i}\left(x\right)\right)-I_{1}^{i}\left(x\right)\right\Vert^{2}dx$$
Here \(\Omega\) is the image domain, i.e. the region covering all pixels. Since the warped image is not linear in the disparity, the warp is linearized iteratively:
$$I_{2}\left(x+D^{n}\left(x\right)\right)\approx I_{2}\left(x+D^{n-1}\left(x\right)\right)+\left(D^{n}\left(x\right)-D^{n-1}\left(x\right)\right)I_{2h}\left(x+D^{n-1}\left(x\right)\right)$$
A notable aspect of the method is how the warping function is handled: \(I_{2}\left(x+D\left(x\right)\right)\) can be linearized as a first-order Taylor expansion, as above, provided the change in disparity between two iterations at a pixel remains small; \(I_{2h}\) denotes the horizontal gradient of \(I_{2}\). The disparity \(D^{i}\left(x\right)\) is given by \(fB/d^{i}\left(x\right)\), where \(f\) is the focal length of the camera, \(B\) is the baseline between the two cameras, and \(d^{i}\left(x\right)\) is the predicted depth. The smoothness term is a penalty that keeps the gradient of the depth map small, an important property satisfied by natural depth maps. The penalty is given by:
$$E_{\mathrm{smooth}}^{i}=\left\Vert\nabla D^{i}\left(x\right)\right\Vert^{2}$$
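The following NumPy sketch illustrates how the total per-image error defined above (reconstruction plus smoothness) can be evaluated for rectified images, where disparity is a purely horizontal shift: the right image is warped to the left view using the predicted disparity, the photometric error is summed over pixels, and the smoothness penalty is the squared gradient of the disparity map. It is a discrete stand-in for the integrals above (with simple nearest-neighbor sampling rather than the Taylor-linearized warp), not the training code of [14].

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Sample the right image at x + D(x) along each row (nearest-neighbor
    sampling for simplicity; [14] linearizes the warp instead to keep it
    differentiable)."""
    h, w = right.shape
    cols = np.arange(w)[None, :] + disparity             # x + D(x), per pixel
    cols = np.clip(np.rint(cols).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return right[rows, cols]

def total_loss(left, right, disparity, gamma=0.01):
    """E^i = E_reconst^i + gamma * E_smooth^i for one stereo pair."""
    warped = warp_right_to_left(right, disparity)
    e_reconst = np.sum((warped - left) ** 2)              # photometric error over pixels
    dy, dx = np.gradient(disparity)
    e_smooth = np.sum(dy ** 2 + dx ** 2)                  # ||grad D||^2 penalty
    return e_reconst + gamma * e_smooth

# Hypothetical grayscale stereo pair and predicted disparity map.
rng = np.random.default_rng(0)
left = rng.random((128, 416))
right = rng.random((128, 416))
disparity = np.full((128, 416), 5.0)                      # constant 5-pixel disparity
print(total_loss(left, right, disparity))
```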
3.3 Object Classification from RGB-D Images
Neural networks have a proven record of excellence in object classification. The proposed algorithm therefore uses the rich semantic feature information obtained from a high-performing pre-trained network. Currently, one of the best pre-trained object classifiers available through MatConvNet is ResNet-50 [15]. ResNet-50 is not a conventional convolutional network but a deep residual network. Typical deep convolutional neural networks combine feature information extracted at multiple levels of the network in an attempt to merge high-, mid-, and low-level features. Simply adding layers to make a network deeper only works up to a point, because of vanishing gradients and because there is no guarantee that the additional layers learn anything new of value. Residual networks address both problems: to ensure that an added layer can learn new information, the layer's output is given access to its input before transformation. The intuition behind this architecture is that it is easier to optimize the residual mapping than the original mapping. Deep residual networks have been shown to outperform conventional deep convolutional networks, with the best-performing variant having 152 layers. The proposed algorithm extracts the output of the final fully connected layer of the ResNet-50 architecture and uses it as feature input to an SVM, ensuring that the RGB features used to classify vehicles and pedestrians are rich and discriminative; each training image thus provides 2048 meaningful RGB feature descriptors.

From the depth map, the proposed algorithm generates a 3D point cloud and extracts two sets of features: depth features and normal features. For the depth features, it computes the covariance matrix of the point cloud within the bounded/masked region and uses its entries as features. For the normal features, it computes the mean and variance of the local surface normals estimated for each pixel from the 3D point cloud. Surface normals are computed by extracting a rectangular box, selecting the points whose estimated depth lies within a given threshold, and taking the eigenvector with the smallest eigenvalue of that point cluster; this minimum eigenvector represents the surface normal of the selected points. When training each SVM, the feature values are standardized. Tests are performed using k-fold cross-validation to ensure that no overfitting occurs in the proposed model; in the experiments, k is set to 10.
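The following sketch, using NumPy and scikit-learn in place of the MatConvNet/SVM pipeline described above, illustrates the feature construction and classifier training: the covariance of the masked point cloud, surface normals taken as the minimum-eigenvalue eigenvector of local neighborhoods (summarized by their mean and variance), concatenation with a precomputed 2048-dimensional CNN descriptor, and a standardized SVM evaluated with 10-fold cross-validation. All data here are random stand-ins.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def surface_normal(points):
    """Normal of a local point neighborhood: eigenvector of the covariance
    matrix with the smallest eigenvalue."""
    cov = np.cov(points.T)                      # 3x3 covariance of the neighborhood
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return eigvecs[:, 0]                        # minimum-eigenvalue eigenvector

def depth_features(cloud, neighborhoods):
    """Depth features = flattened covariance of the masked point cloud;
    normal features = mean and variance of the local surface normals."""
    cov_feat = np.cov(cloud.T).ravel()                          # 9 values
    normals = np.array([surface_normal(n) for n in neighborhoods])
    normal_feat = np.concatenate([normals.mean(axis=0), normals.var(axis=0)])
    return np.concatenate([cov_feat, normal_feat])              # 15 values

# Hypothetical training set: precomputed 2048-D ResNet-50 descriptors plus
# depth/normal features for each masked region, with vehicle/pedestrian labels.
rng = np.random.default_rng(0)
n_samples = 40
X = []
for _ in range(n_samples):
    cnn_feat = rng.random(2048)                                 # stand-in CNN descriptor
    cloud = rng.random((200, 3))                                # masked 3D points
    neighborhoods = [rng.random((30, 3)) for _ in range(5)]     # local point patches
    X.append(np.concatenate([cnn_feat, depth_features(cloud, neighborhoods)]))
X = np.array(X)
y = np.array([0, 1] * (n_samples // 2))                         # 0 = vehicle, 1 = pedestrian

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))     # standardize, then SVM
scores = cross_val_score(clf, X, y, cv=10)                      # 10-fold cross-validation
print(scores.mean())
```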