Background
Automatic facial landmark localization in videos is an important first step in many computer vision applications, including the objective assessment of orofacial function. Convolutional neural networks (CNNs) for facial landmark localization are typically trained on faces of young, healthy adults, so their performance is inferior when applied to faces of older adults or of people with diseases that affect facial movements, a phenomenon known as algorithmic bias. Fine-tuning pre-trained CNN models with representative data is a well-known technique for reducing algorithmic bias and improving performance on clinical populations. However, the question of how much data is needed to properly fine-tune such a model remains open.
Methods
In this paper, we fine-tuned a popular CNN model for automatic facial landmark localization using different numbers of manually annotated photographs from patients with facial palsy, and evaluated the effect of the number of photographs used for fine-tuning on model performance by computing the normalized root mean squared error between the facial landmark positions predicted by the model and those provided by manual annotators. Furthermore, we studied the effect of annotator bias by fine-tuning and evaluating the model with data provided by multiple annotators.
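For reference, a common formulation of this error for a photograph with $N$ landmarks is sketched below; the normalization by the inter-ocular distance $d_{\mathrm{IOD}}$ is a typical convention in the landmark-localization literature and is an assumption here rather than a detail stated in this abstract:
\[
\mathrm{NRMSE} = \frac{1}{d_{\mathrm{IOD}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{\mathbf{p}}_i - \mathbf{p}_i \right\rVert_2^2},
\]
where $\hat{\mathbf{p}}_i$ is the position of landmark $i$ predicted by the model and $\mathbf{p}_i$ is the corresponding manually annotated position.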
Results
Our results showed that fine-tuning the model with as few as 8 photographs from a single patient significantly improved its performance on other individuals from the same clinical population, and that the best performance was achieved by fine-tuning the model with 320 photographs from 40 patients. Using more photographs for fine-tuning did not improve performance further. Regarding annotator bias, we found that fine-tuning a CNN model with data from a single annotator resulted in models biased against other annotators; our results also showed that this effect can be diminished by averaging annotations from multiple annotators.
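A natural reading of this averaging step, assuming $K$ annotators each providing a position $\mathbf{p}_i^{(k)}$ for landmark $i$, is the per-landmark mean used as the training target:
\[
\bar{\mathbf{p}}_i = \frac{1}{K} \sum_{k=1}^{K} \mathbf{p}_i^{(k)}.
\]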
Conclusions
It is possible to remove the algorithmic bias of a deep CNN model for automatic facial landmark localization using data from only 40 participants (a total of 320 photographs). These results pave the way to future clinical applications of CNN models for the automatic assessment of orofacial function in different clinical populations, including patients with Parkinson’s disease and stroke.