Alzheimer’s disease and related dementias (ADRD) refers to a set of conditions, most often occurring in later life, that impair cognition and functioning. Nearly 6 million people in the U.S. and 50 million people worldwide are affected.1,2 As the worldwide population ages, the number of people living with ADRD is expected to increase dramatically in the coming years,3 posing substantial challenges to the families and health care systems that support them. Because there is no single test for ADRD, multiple approaches have been developed to screen for dementia. Since the 1990s, the clock-drawing test (CDT), which asks subjects to draw a clock (typically with hands showing ten after 11), has been recognized as a reliable tool for evaluating a wide range of cognitive functions, either by itself or as part of a brief battery, and has been used in clinical research, epidemiologic studies, and panel surveys.4 The CDT has been favored because it requires little training for clinicians and researchers to implement, can be administered in less than two minutes, and requires only paper and a writing utensil.5 Several scoring algorithms for the CDT exist, but in general the clock is assigned an overall ordinal score or a count of ordinal subscores (points) for various elements.6,7
Nevertheless, barriers exist to taking fuller advantage of the CDT. For example, because the CDT requires manual coding, biases may be introduced in large studies if there are systematic differences in how coders implement rules (i.e., “coder effects”). This issue is of particular concern in longitudinal studies, in which clock drawings by the same individual may be coded inconsistently over time by different coders. In addition, some researchers argue that existing CDT scoring algorithms are generally not suitable for detecting mild cognitive impairment (MCI).8,9 Qualitative CDT coding (i.e., subjective evaluation of error sources) can better distinguish MCI but is time-consuming and subject to larger coder effects than standard coding.10
The rapid development of deep learning neural networks (DLNN) in the past several years, and advances in image coding11 in particular, make this technique ripe for application to CDT coding with large-scale, representative samples, which has traditionally relied mainly on manual coding.
The application of DLNN to CDT coding has two potential advantages over manual coding of the CDT. First, DLNN has the potential to reduce the resources needed to incorporate clock codes into large-scale studies. Manual coding currently requires recruitment of coders; training, evaluation, and certification; the coding itself (in which coders review, interpret, and score each CDT); and quality checks. Given its critical role in dementia detection and the wide range of research areas to which CDT data have contributed,5,12–19 it would be advantageous if the resources needed to include the CDT in large-scale studies could be reduced. Second, DLNN has the potential to produce codes with higher reliability and validity than manual coding. Manual coding is subject to errors and inconsistencies because coder interpretations of the CDT and the scoring system may differ or be applied inconsistently. In longitudinal studies, errors in manual coding can be inconsistent over time. In contrast, studies have shown that well-established machine learning methods can classify images more reliably and more accurately than humans.20–23
Attempts to automate CDT coding have mostly involved small, non-representative respondent pools, have thus far yielded inconsistent accuracy rates, and have mainly focused on binary coding: normal vs. abnormal.24–31 While a small number of studies have explored coding the CDT into ordinal categories, with some promising results,32–34 these studies are limited in three important ways. First, despite the ordinal nature of CDT scores, in which class labels carry information about their relative ordering, none of the previous studies considered this ordinal feature; instead, they all trained DLNN with standard nominal classification loss functions such as multi-category cross-entropy. Moreover, CDT-coding studies attempting classification into more than two categories have yielded relatively low accuracy. Second, with very few exceptions,35–37 most of these studies used nonrepresentative, small-scale samples with limited racial/ethnic diversity and narrow ranges of cognitive impairment. Third, many of these studies assume that manually coded CDT scores are error-free and do not acknowledge the possibility of coder effects.
DLNN models for Enhanced CDT-Coding
This study investigated three advanced DLNN models for coding the CDT: ResNet101, EfficientNet, and Vision Transformer (ViT), each combined with transfer learning. ResNet101 and EfficientNet belong to the Convolutional Neural Network (CNN) family, the primary deep learning architecture for computer vision tasks in recent years. While other DLNN models such as GoogLeNet and Inception-v3 aim to “go deeper” while reducing computational cost and parameter counts, ResNet stands out for its ability to train extremely deep networks (comprising hundreds or even thousands of layers) without encountering issues such as vanishing gradients or degradation. EfficientNet was designed to achieve better performance by scaling the network’s depth, width, and resolution simultaneously, resulting in models that are both efficient and effective across a wide range of tasks.38,39 In previous research, ResNet101 and EfficientNet outperformed other well-known NN models (e.g., VGG, GoogLeNet, Inception, MobileNets, DenseNets and NASNet) on a number of computer vision tasks, achieving better accuracy and greater computational efficiency.39,40 Further details are provided in the Methods section.
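As a concrete illustration of the transfer-learning setup, the sketch below loads a ResNet101 backbone pre-trained on ImageNet and replaces only its final layer with a six-way head for the CDT score categories. The choice of PyTorch/torchvision is an assumption for illustration; the study does not prescribe a framework.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 6  # CDT ordinal scores 0-5

# Load a ResNet101 backbone pre-trained on ImageNet and replace its final
# fully connected layer with a fresh 6-way classification head.
model = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Optionally freeze the backbone so that only the new head is updated at
# first; the pre-trained features can be fine-tuned in a later stage.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

# An EfficientNet variant would be set up analogously (e.g., via
# models.efficientnet_b0), replacing the final Linear layer of its classifier.
```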
Integrating Vision Transformer in CDT-coding
Unlike ResNet101 and EfficientNet, ViT is a neural network architecture that applies a pure transformer directly to sequences of image patches for image classification tasks. Previous research41–43 demonstrated that reliance on CNNs is not necessary: when pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), ViT attained excellent results compared with state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. To the best of our knowledge, this is the first study applying ViT to CDT coding.
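The core of the “pure transformer over image patches” idea is the patch-embedding step: the image is cut into fixed-size patches, each projected to a token vector, and the resulting sequence is fed to a standard transformer encoder. The sketch below uses the ViT-Base defaults (16x16 patches, 768-dimensional tokens), which are assumptions for illustration rather than settings from this study.

```python
import torch
import torch.nn as nn

# Patch embedding as used by ViT: a convolution whose kernel size and stride
# both equal the patch size projects each 16x16 patch to a 768-dim token.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)         # one dummy 224x224 RGB image
tokens = patch_embed(image)                 # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens
# `tokens` (plus a class token and position embeddings) is what the
# transformer encoder consumes in place of a CNN feature map.
```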
An innovative multi-class ordinal coding approach
The current practice of image classification through deep learning has predominantly relied on a nominal classification approach, assigning mutually exclusive, unordered labels to images. However, this methodology overlooks the intrinsic ordering of the categories in CDT scores. Consequently, despite employing a more general model, it may encounter challenges such as conflicting probabilities across categories and a greater risk of overfitting.
In this study, we introduce structured ordering into the coding system. The approach has the potential to minimize classification errors as it better mirrors the comparative thinking employed by human coders evaluating the CDT images. The ordinal coding approach also allows researchers to control the direction of errors (minimizing over- or under-estimation).
Ordinal CDT coding belongs to a particular type of supervised learning problem called ranking or ordinal classification, which takes into account the inherent order of outcomes. Specifically, a nominal classification problem can be defined as building a system that maps an input space \(X\) to a class label space \(L = \{l_1, l_2, \dots, l_k\}\). Unlike traditional nominal classification, an ordinal classification system maps the input space \(X\) to an ordinal class label space \(OL = \{l_1, l_2, \dots, l_k \mid l_1 < l_2 < \dots < l_k\}\). While the field of machine learning has developed many powerful algorithms for predictive modeling, most were designed for nominal classification tasks; the loss function commonly used in these algorithms, multi-class cross-entropy, does not capture the ordering properties contained in the label space.
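A small numerical example (with hypothetical probabilities) makes this concrete: cross-entropy depends only on the probability assigned to the true class, so a model that misplaces probability mass on an adjacent score and one that misplaces it on a distant score incur exactly the same loss.

```python
import torch
import torch.nn.functional as F

true = torch.tensor([4])  # true CDT score

# Both models assign the true class probability 0.6; the remaining 0.4 goes
# to the adjacent score (3) in one case and the distant score (0) in the other.
p_adjacent = torch.tensor([[0.0, 0.0, 0.0, 0.4, 0.6, 0.0]])
p_distant  = torch.tensor([[0.4, 0.0, 0.0, 0.0, 0.6, 0.0]])

# Cross-entropy reads off only the log-probability of the true class, so the
# loss is -log(0.6) ~= 0.5108 in both cases, ignoring how far off the error is.
print(F.nll_loss(p_adjacent.log(), true))  # tensor(0.5108)
print(F.nll_loss(p_distant.log(), true))   # tensor(0.5108)
```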
Although no previous studies have applied an ordinal DLNN approach to CDT coding, a number of machine learning techniques have been developed to address ordinal classification problems. Li and Lin (2007) proposed a framework that reduces ordinal regression to binary classification based on extended examples.44 This extended binary classification approach forms the basis of many ordinal regression implementations. However, neural network-based implementations of this approach commonly suffer from inconsistencies among the binary rank classifiers.45 As shown in Cao et al. (2020), this ordering property cannot in general be captured by commonly used loss functions such as multi-category cross-entropy in DLNN classification systems.46 In this paper, we investigate innovative methods for imposing classifier-consistency constraints that can easily be implemented in various deep learning neural network architectures, along with new metrics for evaluating ordinal classification systems.
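To illustrate the consistency constraint, the sketch below follows the spirit of the consistent-rank-logits idea of Cao et al. (2020): the K-1 binary rank classifiers share a single weight vector and differ only in their bias terms, which guarantees that the predicted cumulative probabilities are monotonically ordered. This is a minimal sketch under those assumptions, not necessarily the exact architecture used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalHead(nn.Module):
    """Rank-consistent ordinal head: one shared weight vector, K-1 biases."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(in_features, 1, bias=False)      # shared weights
        self.biases = nn.Parameter(torch.zeros(num_classes - 1))

    def forward(self, x):
        # One logit per rank threshold: models P(y > k) for k = 0..K-2.
        return self.fc(x) + self.biases

def ordinal_targets(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # Extended binary labels: target[i, k] = 1 if labels[i] > k.
    ranks = torch.arange(num_classes - 1, device=labels.device)
    return (labels.unsqueeze(1) > ranks).float()

def ordinal_loss(logits, labels, num_classes):
    # Sum of binary cross-entropies over the K-1 rank classifiers.
    return F.binary_cross_entropy_with_logits(
        logits, ordinal_targets(labels, num_classes))

def ordinal_predict(logits):
    # Predicted score = number of rank thresholds passed.
    return (torch.sigmoid(logits) > 0.5).sum(dim=1)
```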
Research questions
The ordinal nature of CDT scores prompts a choice between deploying a more general classification model with fewer constraints and a less general ordinal model with more constraints. Our research pioneers a comparative study of nominal image classification and ordinal coding using DLNN for assigning CDT scores, providing insights into potential applications of our ordinal coding approach to other image classification tasks with ordinal outcomes.
Using what is believed to be the world’s largest publicly available repository of CDT images, from the 2011–2019 National Health and Aging Trends Study (NHATS), this study attempts to address the following research questions:
- Which DLNN models are most effective in coding CDT scores using 1) binary (impaired, non-impaired) scoring and 2) ordinal scoring with 6 classes? To address this question, we explored state-of-the-art DLNN technology for CDT classification and compared the performance of ResNet101, EfficientNet and Vision Transformer.
- Does ordinal DLNN coding outperform traditional nominal classification? To address this question, we modified DLNN models using an ordinal-coding approach and compared the standard nominal approach with the newly developed ordinal approach. We also performed a sensitivity analysis to demonstrate the ability of the ordinal approach to let researchers shift errors toward over- or under-estimation of cognitive function (see the sketch after this list).
- How do the DLNN-coded CDT scores compare with manually coded scores? To address this question, we evaluated the performance of the ResNet101, EfficientNet and Vision Transformer techniques relative to human coders for both binary and ordinal scoring.
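As referenced in the second question above, one plausible mechanism for shifting the direction of errors, assuming a rank-threshold head like the one sketched earlier, is to move the decision threshold applied to the cumulative probabilities. This is an illustration of the idea rather than the study’s exact procedure.

```python
import torch

def ordinal_predict(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Predicted CDT score = number of rank thresholds passed.

    Raising `threshold` above 0.5 makes each rank harder to pass, shifting
    errors toward under-estimating the score; lowering it below 0.5 shifts
    errors toward over-estimation.
    """
    return (torch.sigmoid(logits) > threshold).sum(dim=1)
```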
DATA
NHATS CDT data collection. We used 9 rounds of data from NHATS, a nationally representative panel study of adults ages 65 and older living in the U.S. NHATS was initiated by the National Institute on Aging in 2008 to guide efforts to reduce disability, maximize health and independent functioning, and enhance quality of life at older ages.47,48 The sample was drawn from the Medicare enrollment file, which covers approximately 96% of older adults in the U.S. Respondents were sampled for Round 1 (2011), and the sample was replenished in Round 5 (2015) and Round 12 (2022), using a stratified three-stage sample design in which individuals at older ages and Black individuals were oversampled.49 In total, 8,245 respondents participated in Round 1. The response rate was 71% in Round 1, 77% in Round 5, and 59% in Round 12, and exceeded 85% in all other rounds.48
NHATS has collected pen-and-paper-based CDT items annually, in which respondents have two minutes to draw a clock showing 10 past 11. In total, more than 47,000 CDT images are available for Rounds 1 to 9. On average across these rounds, about 50% of clocks were drawn by respondents age 80 or older, 60% by females, and about 30% by non-White (21% Black) individuals. Based on the full cognitive battery,50 11% were drawn by someone classified as having probable dementia and 11% as having possible dementia.
CDT Coding. Once collected, clocks were scanned into an online database for coding. Using a coding system developed by Psychological Assessment Resources, Inc.,51 trained lay coders scored each clock on an ordinal scale as follows: 0 = not recognizable as a clock, 1 = severely distorted depiction, 2 = moderately distorted depiction, 3 = mildly distorted depiction, 4 = reasonably accurate depiction, and 5 = accurate depiction of a clock. Illustrations of clocks coded with various scores are included in Appendix Fig. 1. Each round, coders participated in a two-hour CDT training session and were asked to code 219 training clocks. The 219 clocks were also coded by two neuropsychology fellows, whose codes were considered the gold standard given their clinical background in cognitive assessment. Cohen’s weighted Kappa for the inter-coder reliability between each lay coder and the neuropsychology fellows was calculated and used to select qualified coders. The final number of coders selected for each round and the minimum Cohen’s weighted Kappa score used to select coders for each round are included in Table 1.
Table 1
NHATS Clock Repository by Data Collection Year
| Year | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|
| Final # of coders* | 3 | 4 | 3 | 4 | 5 | 4 | 5 | 5 | 4 |
| Minimum Cohen’s weighted Kappa score for coders | 0.73 | 0.69 | 0.75 | 0.75 | 0.75 | 0.76 | 0.72 | 0.72 | 0.73 |
| Number of clocks available | 6,918 | 5,504 | 4,459 | 3,671 | 7,076 | 5,997 | 5,256 | 4,658 | 4,186 |
*Some coders coded clocks at multiple rounds. In total, there are 15 unique coders across the 9 rounds.
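For reference, a Cohen’s weighted kappa of the kind reported in Table 1 can be computed with scikit-learn. The weighting scheme (linear vs. quadratic) used by NHATS is not stated here, so the quadratic weights and the example codes below are assumptions for illustration.

```python
from sklearn.metrics import cohen_kappa_score

gold  = [5, 4, 4, 2, 0, 3, 5, 1]   # hypothetical neuropsychology-fellow codes
coder = [5, 4, 3, 2, 1, 3, 5, 2]   # hypothetical lay-coder codes

# Weighted kappa penalizes distant disagreements more than adjacent ones.
kappa = cohen_kappa_score(gold, coder, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # compare to the ~0.7+ cutoffs in Table 1
```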
Data Description. Two datasets were used in this study. The first dataset contains CDT data collected from NHATS respondents from Round 1 to Round 9; it was used to train the DLNN models and evaluate their performance. As shown in Table 1, over 47,000 CDT images in total were collected in Rounds 1 to 9. To ensure the use of high-quality images and coded scores, we selected CDT images based on the following two criteria: 1) images categorized by coders as having good clarity, and 2) images coded by the eight top-ranked coders, all of whom maintained an average Kappa score exceeding 0.75 across the years. In total, 25,872 CDT images were selected. Selected coders and their Kappa scores are presented in Appendix Table 1.
The second dataset includes the aforementioned 219 CDT images, which were used to evaluate the performance of the DLNN models relative to NHATS coders. We refer to this dataset as the “benchmark data” in this paper. Notably, all coders coded the 219 CDT images in each year that they participated in coding, and the two neuropsychology fellows also coded them. This dataset provides an opportunity to evaluate coder effects and to compare DLNN models with human coders.