Data: A supervised deep learning model learns the mapping between inputs and their corresponding outputs. To develop a deep learning model that predicts the time history of head angular velocities from crash videos, crash videos are required as inputs and the corresponding time histories of angular velocities are required as outputs. No such real-world dataset is available. Data could be taken from NHTSA-commissioned crash tests, where the crash test videos would serve as inputs and the head angular velocities measured on the ATD would serve as outputs. However, such data are limited and cover only a narrow set of test conditions. Hence, FE-based crash simulation data were utilized in this proof-of-concept study.
To generate the data, validated simplified Global Human Body Models Consortium (GHBMC) 50th percentile male [15, 16] and 5th percentile female [17, 18] FE human models were used. These human models were positioned in the driver compartment (Figure 1) that was extracted from the validated FE model of a 2014 Honda Accord [19].
A validated generic seatbelt system with retractor, pretensioner, and load limiter was included in the model along with validated frontal and side airbags [19]. In addition, steering column collapse was implemented in these simulations. The roof rails, side door, B-pillar, and floor were deformable in the full FE model but were made rigid in this study; the knee bolster and A-pillar were kept deformable. The human models were positioned in the driver compartment based on clearance measurements taken from physical crash tests (NHTSA test number 8035 for the 50th percentile male, NHTSA test number 8380 for the 5th percentile female; https://www-nrd.nhtsa.dot.gov/database/veh/veh.htm). The crash pulse used for the simulations was taken from the physical crash test and is shown in Figure 2.
These human models were first evaluated in a full frontal test condition, after which a design of experiments (DOE) study was conducted. For the DOE study, both crash-related and restraint-related parameters were varied (Table 1). The crash-related parameters were Delta-V and the principal direction of force (PDOF). The restraint-related parameters covered both the seatbelt and the airbags. The parameters were varied over a wide range to generate a range of head motions, including cases where the head strikes the steering wheel.
Parameter | Range
Crash-related parameters |
Delta-V | 25 mph - 45 mph
PDOF | -30° (near side) - 30° (far side)
Restraint-related parameters |
Frontal & side airbag mass flow rate | ±25%
Frontal & side airbag firing time | 5 ms - 70 ms
Collapsible column breaking force | 3000 N - 10000 N
Load limiter | 1000 N - 5000 N
Pretensioner limiting force | 1000 N - 3000 N
Friction between head and front/side airbag | 0 - 3

Table 1. Parameters and their ranges
The crash pulse for the same vehicle may differ with PDOF, frontal overlap, and the type and stiffness of the impacting surface. In addition, for the same PDOF, frontal overlap, and impacting surface, the crash pulse can vary across different vehicles of the same size class (e.g., mid-size sedans). To keep the number of variables manageable for the DOE study, the crash pulse shape was kept constant and only its magnitude was scaled to achieve different Delta-Vs.
A total of 1010 scenarios were simulated covering a wide range of crash conditions. Each crash scenario was simulated for a duration of 150 ms. For each simulation, the time history of head angular velocities was computed and four crash videos with different views were generated (Figure 3).
The views chosen were similar to the camera views available from NHTSA crash tests. Since the aim of the study was to predict the time history of head angular velocities from any view, each crash view was treated as a separate sample. Thus, a total of 4040 crash videos were available, each with the corresponding head angular velocity time histories about the three rotational axes (ωx, ωy, ωz). The crash videos were used as inputs for the deep learning model and the corresponding angular velocity time histories were used as the “ground truth” outputs. For the purposes of this study, all crash videos were generated such that only the human model was visible; the vehicle structure and the airbags were removed from the videos to prevent any head occlusion.
Since videos are used as inputs to the deep learning model in the form of sequences of images, an additional pre-processing step was carried out to convert the FE-based crash videos into image sequences. Given that the goal of this study was to predict the time histories of head angular velocities, the motion of the head was extracted as a sequence of images over time from each FE crash video (Figure 4). These image sequences were then used as inputs to the deep learning model.
To extract the motion of the head over time as a sequence of images, the head must be detected in each frame of the crash video; a head detection algorithm may be employed for this purpose. The use of FE-based crash videos in this study offered an additional advantage: a fast and accurate computer-vision color mask could serve as the head detector. In all crash videos, the head of the human model was colored green and the rest of the body was kept gray, so the head could be easily detected in each frame with a bounding box using Contours in OpenCV [20]. Once detected, the head image inside the bounding box was extracted from each frame to obtain a sequence of head images over time. The images were extracted every 2 ms over the 150 ms crash event, so each sequence had a length of 76. The corresponding “ground truth” time histories of angular velocities (outputs or targets) were also sampled every 2 ms to match the image sequences. An example of the input and corresponding output for training the deep learning model is shown in Figure 5.
The Contours-based detection technique gave zero false positives and generated complete sequences without missing any frames. However, it only works if the user has full control over all color aspects of the videos, which was the case in this study. For a “real-world condition” (not based on simulations), a head detection model combined with a head tracking algorithm, such as a Kalman filter, may be more appropriate.
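For illustration, a minimal sketch of this color-mask head extraction is given below; the HSV threshold band, the function name, and the frame handling are assumptions, and an OpenCV 4.x return signature for findContours is assumed.

```python
import cv2

def extract_head_sequence(video_path):
    """Extract the head region from every frame of an FE crash video using
    a green color mask and OpenCV contours (sketch; assumes OpenCV 4.x)."""
    cap = cv2.VideoCapture(video_path)
    head_images = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Threshold the green head in HSV space (the hue band is an assumption)
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, (40, 80, 80), (80, 255, 255))
        # Take the bounding box of the largest contour as the head region
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
            head_images.append(frame[y:y + h, x:x + w])
    cap.release()
    return head_images
```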
Input data transformation: The input data (sequences of images) were RGB (red, green, blue) images with a data type of “uint8.” A “uint8” data type contains pixel values from 0 to 255 (a value of 0 corresponds to the darkest color in the range, while 255 corresponds to the lightest). Deep learning models train better and faster when the input data are on the same scale; thus, all input image sequences were normalized so that the pixel values were in the range 0 to 1 with a data type of “float32.” Due to resource limitations, all images were resized to a height and width of 64 pixels and converted to grayscale, so that each sequence of images had a shape of (76, 64, 64, 1), where 76 is the number of images in a sequence, the two 64s are the image height and width, and 1 represents the grayscale channel (for color images, the 1 would be replaced with 3, one for each channel of the RGB image).
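A sketch of this preprocessing step is shown below, assuming the head crops from the previous step are available as a list of BGR images; the function name is illustrative.

```python
import cv2
import numpy as np

def preprocess_sequence(head_images, size=64):
    """Resize, convert to grayscale, and normalize a sequence of head crops so the
    result has shape (len(head_images), 64, 64, 1) with float32 pixels in [0, 1]."""
    frames = []
    for img in head_images:
        img = cv2.resize(img, (size, size))              # 64 x 64 pixels
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # single grayscale channel
        frames.append(img)
    seq = np.stack(frames).astype("float32") / 255.0     # scale uint8 [0, 255] to [0, 1]
    return seq[..., np.newaxis]                          # add the trailing channel axis
```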
Data splitting: The entire dataset had 4040 samples. The count plot in Figure 6a shows the distribution of data for each human model size and each view. For developing the deep learning model, this dataset was split into three datasets: training, validation, and test. 74% of the data was used for training, 13% for validation, and 13% for testing. Data splitting was carried out using stratified sampling based on human model size and crash view to ensure that both were equally represented in all three datasets (Figure 6b).
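One way to realize this two-stage stratified split with scikit-learn is sketched below; the arrays `X`, `y`, and the combined stratification label `strata` are illustrative names, and the split fractions approximate the 74/13/13 proportions above.

```python
from sklearn.model_selection import train_test_split

# "strata" combines human model size and camera view for each sample, e.g. "F05_view1"
X_train, X_tmp, y_train, y_tmp, s_train, s_tmp = train_test_split(
    X, y, strata, test_size=0.26, stratify=strata, random_state=42)

# Split the held-out 26% evenly into validation and test sets (13% each overall)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=s_tmp, random_state=42)
```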
The training and validation datasets (87% of the data) were used for model development; the validation dataset was used for hyperparameter tuning and was therefore part of model development. The test dataset was not used in model development and was treated as an unseen dataset for evaluating the final performance of the model.
Deep learning model: The overall architecture for a deep learning model depends on the type of input data. The input data in this study are sequences of images over time. Convolutional Neural Networks (CNNs) can capture spatial dependency and are one of the most common types of neural networks used in computer vision to recognize objects and patterns in images. Recurrent Neural Networks (RNNs), on the other hand, can capture temporal dependency and are commonly used for sequential data processing. Thus, to process sequences of images in this study, a deep learning model that combines a CNN [21] with a Long Short-Term Memory (LSTM) based RNN [22] was used (Figure 7). The CNN-LSTM architecture uses CNN layers for feature extraction on the input data combined with LSTMs to support sequence prediction.
Since the best architecture for this problem was not known at the start of model development, a lightweight baseline model (with fewer trainable parameters) was developed and later improved using hyperparameter tuning. For the CNN part of the baseline model, a Visual Geometry Group (VGG) style architecture [23] was used, consisting of three blocks with two convolutional layers per block followed by a max pooling layer. Batch Normalization [24] and a Rectified Linear Unit (ReLU) activation function [25] were used after each convolutional layer. The baseline (initial) numbers of convolutional filters for the three blocks were 16, 32, and 64, respectively. A global average pooling layer was added as the last layer of the CNN to obtain the feature vector. Since each input sample is a sequence of images, the CNN part of the model was wrapped in a Time Distributed layer [26], which applies the same CNN to every temporal slice (image) of the input and thereby yields a feature vector for each image in the sequence. The output of the CNN was used as the input to the LSTM network.
For the LSTM part of the baseline model, one LSTM layer with a hidden size of 128 was used. Since the input sequence has a length of 76 and the goal is to predict a time history of angular velocity, the LSTM output was obtained at every timestep. The LSTM output was then fed to a fully-connected layer with the ReLU activation function, followed by a Dropout layer [27] to control overfitting. The output of the dropout layer was fed to a fully-connected layer with a linear activation function to generate the final output, i.e., the predicted time history of angular velocity. A linear activation generates continuous numerical values and was therefore used in the final output layer, as the angular velocity time history prediction was treated as a regression task.
The Mean Squared Error (MSE) between the actual and predicted time histories was used as the loss function for training the entire model. The adaptive moment estimation (Adam) optimizer [28] was utilized for optimization. Since the ReLU activation was used in the network, the He-Normal initializer [29] was used to initialize the trainable weights of the model. The model was developed using TensorFlow v2.4 [26].
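A condensed Keras sketch of the baseline model described above is shown below; it follows the layer sizes given in the text, but the kernel size, padding, and helper function name are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline(seq_len=76, img_size=64):
    """Baseline CNN-LSTM: VGG-style CNN applied per frame, LSTM over time."""
    # Three VGG-style blocks of two Conv-BN-ReLU layers followed by max pooling
    cnn = models.Sequential()
    cnn.add(layers.Input(shape=(img_size, img_size, 1)))
    for filters in (16, 32, 64):
        for _ in range(2):
            cnn.add(layers.Conv2D(filters, 3, padding="same",
                                  kernel_initializer="he_normal"))
            cnn.add(layers.BatchNormalization())
            cnn.add(layers.Activation("relu"))
        cnn.add(layers.MaxPooling2D())
    cnn.add(layers.GlobalAveragePooling2D())  # feature vector per image

    # Apply the same CNN to every image in the sequence, then model time with an LSTM
    model = models.Sequential([
        layers.Input(shape=(seq_len, img_size, img_size, 1)),
        layers.TimeDistributed(cnn),
        layers.LSTM(128, return_sequences=True),          # output at every timestep
        layers.Dense(80, activation="relu", kernel_initializer="he_normal"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="linear"),             # angular velocity per timestep
    ])
    model.compile(loss="mse",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4))
    return model
```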
Individual deep learning models and training: Since the three components of angular velocity (ωx, ωy, ωz) are independent of each other, three separate deep learning models were trained, one for each component, as opposed to training a single model that predicts the time histories of all three components. The same training and validation inputs were used for all three models; only the “ground truth” targets changed depending on the model. The baseline models for ωx, ωy, and ωz were trained with a learning rate of 0.0001 and a batch size of 8 for a maximum of 80 epochs. Early stopping [26] with a patience of 10 and model checkpointing [26] callbacks were used to save the best model based on validation loss. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates; for this purpose, the ReduceLROnPlateau callback [26] was utilized, which monitors the validation loss and reduces the learning rate if no improvement is seen for 5 epochs.
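A sketch of this training setup with the three callbacks is shown below; the checkpoint file name, the learning rate reduction factor of 0.5, and the use of restore_best_weights are assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ModelCheckpoint("best_wx_model.h5", monitor="val_loss", save_best_only=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),  # factor is illustrative
]

model = build_baseline()
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=80, batch_size=8,
                    callbacks=callbacks)
```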
The hyperparameter values for the CNN, LSTM, and the remaining parts of the baseline model were initial choices and did not necessarily correspond to the best architecture for the problem. To improve the models, hyperparameter tuning was carried out to find the set of hyperparameter values that gave the best results, with the validation loss tracked to identify the best set. Table 2 shows the hyperparameters that were varied along with their corresponding ranges.
Hyperparameter | Baseline value | Range explored
CNN-based | |
Number of VGG blocks | 3 | 1 - 5
Number of convolutional filters per block | 16, 32, 64 | 16 - 64
Pooling type | Max | Max, Average
LSTM-based | |
Number of LSTM layers | 1 | 1 - 2
Number of LSTM units | 128 | 64 - 256
Others | |
Number of units for fully-connected layer | 80 | 64 - 128
Dropout rate | 0.5 | 0.0 - 0.5
Learning rate | 1e-4 | 1e-4 - 1e-2

Table 2. Hyperparameters
Keras-Tuner [30] was used for hyperparameter tuning using Bayesian Optimization [31]. Because of resource limitations, hyperparameter tuning was only performed for the ωx model to find the best set of hyperparameters. This set of hyperparameters was then used to train the final deep learning models for all three components of angular velocity.
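A sketch of how such a search could be set up with Keras Tuner's Bayesian optimization is shown below; it mirrors the baseline architecture (Batch Normalization omitted for brevity), and the trial count, hyperparameter names, directory, and import name (which depends on the Keras-Tuner version) are illustrative.

```python
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(hp):
    """Construct a CNN-LSTM whose sizes are sampled from the ranges in Table 2."""
    cnn = models.Sequential([layers.Input(shape=(64, 64, 1))])
    for i in range(hp.Int("vgg_blocks", 1, 5)):
        filters = hp.Int(f"filters_{i}", 16, 64, step=16)
        for _ in range(2):
            cnn.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        cnn.add(layers.MaxPooling2D())
    cnn.add(layers.GlobalAveragePooling2D())

    model = models.Sequential([
        layers.Input(shape=(76, 64, 64, 1)),
        layers.TimeDistributed(cnn),
        layers.LSTM(hp.Int("lstm_units", 64, 256, step=64), return_sequences=True),
        layers.Dense(hp.Int("dense_units", 64, 128, step=16), activation="relu"),
        layers.Dropout(hp.Float("dropout", 0.0, 0.5, step=0.1)),
        layers.Dense(1),
    ])
    model.compile(loss="mse",
                  optimizer=tf.keras.optimizers.Adam(
                      hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")))
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_loss",
                                max_trials=30, directory="tuning", project_name="wx")
tuner.search(X_train, y_train, validation_data=(X_val, y_val),
             epochs=80, batch_size=8)
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
```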
Combined Model: The three individually trained models for ωx, ωy, and ωz were combined into a single deep learning model as shown in Figure 8. To predict the time histories of the three components of angular velocity from a video of any view, the video (preprocessed as a sequence of images) is passed into the combined model and propagated (forward pass) through the individually trained networks, which output the time histories of ωx, ωy, and ωz.
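One way to realize this combination in Keras is sketched below, assuming the three trained models have been saved to disk; the file names are illustrative.

```python
import tensorflow as tf

# Load the three individually trained models (file names are illustrative)
model_wx = tf.keras.models.load_model("best_wx_model.h5")
model_wy = tf.keras.models.load_model("best_wy_model.h5")
model_wz = tf.keras.models.load_model("best_wz_model.h5")

# Share a single video input and emit the three angular velocity time histories
video_in = tf.keras.Input(shape=(76, 64, 64, 1))
combined = tf.keras.Model(
    inputs=video_in,
    outputs=[model_wx(video_in), model_wy(video_in), model_wz(video_in)])

# Forward pass for one preprocessed video from the test set
wx_pred, wy_pred, wz_pred = combined.predict(X_test[:1])
```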
Model evaluation: The three individually trained deep learning models for ωx, ωy, and ωz were evaluated on the test dataset to assess how well they generalize to unseen data. The actual and predicted time histories for cases from the test dataset were compared quantitatively using CORA [32]. While time histories of angular velocities are important for assessing overall head kinematics, brain injury metrics are usually computed from peak values. For example, the Brain Injury Criteria (BrIC) [4] is computed from the absolute peaks of ωx, ωy, and ωz (equation (1)).
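For reference, BrIC is commonly expressed in the following form, which should be read alongside equation (1); the critical angular velocities listed are the values typically cited for the criterion [4]:

\[
\mathrm{BrIC} = \sqrt{\left(\frac{\max\lvert\omega_x\rvert}{\omega_{xC}}\right)^{2} + \left(\frac{\max\lvert\omega_y\rvert}{\omega_{yC}}\right)^{2} + \left(\frac{\max\lvert\omega_z\rvert}{\omega_{zC}}\right)^{2}}
\]

with critical values commonly taken as ωxC = 66.25 rad/s, ωyC = 56.45 rad/s, and ωzC = 42.87 rad/s.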
To evaluate the prediction of peak angular velocity, the correlation coefficient between the actual and predicted peaks was computed for all three models using the test dataset.
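A sketch of this peak comparison with NumPy is shown below, assuming the test targets and the ωx model predictions have shape (n_samples, 76, 1); the array and model names are illustrative.

```python
import numpy as np

# Absolute peak of each actual and predicted time history
actual_peaks = np.max(np.abs(y_test), axis=(1, 2))
predicted_peaks = np.max(np.abs(model_wx.predict(X_test)), axis=(1, 2))

# Pearson correlation coefficient between actual and predicted peaks
r = np.corrcoef(actual_peaks, predicted_peaks)[0, 1]
```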
The combined model was also evaluated on a few cases from the test dataset; in addition to comparing the time histories using CORA, the actual and predicted BrIC values were compared.