A spatially guided machine learning method to classify and quantify glomerular patterns of injury in histology images

doi:10.21203/rs.3.rs-2337818/v1

INTRODUCTION

Pathology diagnosis of glomerular diseases is primarily based on visual assessment of histologic patterns. Semi-quantitative scoring of active and chronic lesions is often required to assess individual characteristics of the disease. The reproducibility of visual scoring systems remains debatable, while digital and machine learning technologies open opportunities to detect, classify, and quantify glomerular lesions, also considering their inter- and intraglomerular heterogeneity.

MATERIALS AND METHODS

We performed a cross-validated comparison of three modifications of a convolutional neural network (CNN)-based approach for the recognition and intraglomerular quantification of nine main glomerular patterns of injury. Reference values provided by two nephropathologists were used for validation. For each glomerular image, visual attention heatmaps were generated with a probability of class attribution for further intraglomerular quantification. The quality of classifier-produced heatmaps was evaluated by the intersection over union (IoU) metric between predicted and ground truth localization heatmaps.

RESULTS

The proposed spatially guided modification of the CNN classifier achieved the highest glomerular pattern classification accuracies, with AUC values up to 0.981. With regard to heatmap overlap area and intraglomerular pattern quantification, the spatially guided classifier achieved a significantly higher generalized mean IoU value than the single-multiclass and multiple-binary classifiers.

CONCLUSIONS

We propose a spatially guided CNN classifier that, in our experiments, shows the potential to achieve high accuracy for intraglomerular pattern localization.

Biological sciences/Computational biology and bioinformatics

Biological sciences/Computational biology and bioinformatics/Image processing

Pathology diagnosis of glomerular diseases is based on visual assessment of histological patterns of injury, commonly represented as categories in classifications of glomerulonephritis (GN).[1–6] Haas et al. [7] described a broad spectrum of histological glomerular lesions with 47 different definitions. Besides deciding on a predominant glomerular pattern of injury in the tissue sample, the pathologist has to take into account many details that may have focal and segmental distributions and reveal important diagnostic and/or prognostic features. The task is further complicated by the occurrence of mixed patterns of injury and potential variance of the findings in consecutive tissue sections. This can be regarded as a phenomenon of intra- and interglomerular heterogeneity, which reduces the accuracy and precision of the assessment.

The increasing need for renal pathology diagnosis to guide therapy decisions has led to the implementation of pathology scoring schemes for different types of GN, in particular lupus nephritis, ANCA GN, and IgA nephropathy.[8, 9] For example, lupus nephritis is categorized into class II, III, or IV based on the presence and extent of global and segmental endocapillary lesions, while chronicity is represented by global/segmental glomerulosclerosis, often in the same glomeruli. This semi-quantitative assessment then converges into broader categories of disease activity and chronicity. [1, 10] In the case of lupus nephritis, the categories are associated with clinical consequences: class II represents a rather indolent renal disease, whereas classes III and IV are related to an increasingly aggressive course. [11]

GN scoring systems have been applied by renal pathologists worldwide, however with medium-to-low reproducibility. [12, 13] A recent systematic review by Dasari et al. indicated poor inter-pathologist agreement in assessing the lupus nephritis activity index [14], raising doubts about whether it accurately represents the level of disease activity for patient management. Furthermore, published consensus definitions of glomerular injury patterns only moderately improved interobserver agreement in the identification of glomerular lesions (from 65.2% to 74.8%). [15]

Recent progress in digital image analysis and machine learning applications for kidney tissue segmentation [16–34] has opened new perspectives for automated renal pathology assays. In particular, convolutional neural networks are applied to glomerular segmentation and classification tasks. Ginley et al. defined a set of digital characteristics that quantify the structural progression of diabetic nephropathy. [35] Zeng et al. proposed a CNN-aided quantitative analysis of glomerular pathological features in IgA nephropathy. [36] Weis et al. developed a CNN-based approach to simultaneously assess various glomerular lesions. [32] Yang et al. explored an integrated classification model to determine various patterns of glomerular disease in whole slide images. [30]

To the best of our knowledge, no CNN-based approach has yet been reported that focuses on intraglomerular classification and quantification of various injury patterns tested against manually predefined regions. Previous experiments achieved acceptable accuracies for glomerular injury classification, using gradient-weighted class activation mapping (Grad-CAM) to visualize the performance of the classifier. [30, 32, 37] This technique is a post hoc neural network attention method, which is not used for classifier training. For glomerular histological pattern segmentation, it can show accurate feature recognition [30, 32, 37], but may sometimes produce false negative intraglomerular segmentation results. [37] However, Grad-CAM heatmaps contain valuable spatial information that could potentially be used to train a CNN-based classifier.

In this study, we present a cross-validated comparison of different modifications of CNN-based glomerular pattern classification models and demonstrate that model performance could be improved by spatially focused guidance. We propose a method of spatially guided multiclass CNN classifier for accurate intraglomerular pattern recognition and quantification, which is validated by comparing classifier-produced attention heatmaps to manual annotations provided by nephropathologists.

Patient specimens, digital image acquisition and image pre-processing

The study is based on a retrospective collection of 695 routine renal biopsies performed and processed at the National Center of Pathology (Vilnius, Lithuania) from 2016 to 2021. The cohort was balanced by final pathology diagnosis and contained 100 cases of IgA nephropathy, 99 cases of membranoproliferative glomerulonephritis, 100 cases of crescentic glomerulonephritis, 100 cases of membranous nephropathy, 96 cases of minimal change disease, 35 cases of secondary focal segmental glomerulosclerosis, and 92 biopsies of endocapillary glomerulonephritis. Formalin-fixed, paraffin-embedded routine renal biopsies were cut into 3-µm-thick sections and stained with a modified Picrosirius Red stain. Digital whole slide images (WSI) were acquired with a ScanScope XT Slide Scanner (Leica Aperio Technologies, Vista, CA, USA) at 20× objective magnification and 0.5-µm resolution, and were subsequently analyzed with HALO™ software (version 3.5.3577.140 and HALO AI 3.5.3577; Indica Labs, Corrales, New Mexico, United States) for glomerular segmentation. Based on manual annotations, a HALO AI DenseNet classifier was trained to recognize and segment glomeruli containing all types of injury patterns. HALO classifier prediction masks were used to crop glomeruli from the original biopsy WSI into 1024 × 1024 pixel images. A total of 27,156 glomerular images were extracted and pre-processed by replacing the surrounding renal cortex tissue with a black background.

Ethics declarations

All tissue samples originated from the Lithuanian National Center of Pathology and the study was performed under permission of Vilnius Regional Biomedical Research Ethics Committee No. 2019/6-1148-637. Informed consent was waived by Vilnius Regional Biomedical Research Ethics Committee and all methods were performed in accordance with the relevant guidelines and regulations.

Defining glomerular injury patterns and datasets for classification

All extracted glomerular images were reviewed and categorized into nine main injury patterns: mesangioproliferative, endocapillary, membranoproliferative GN, membranous, crescentic, segmental glomerulosclerosis, hypertrophy, global glomerulosclerosis, and normal glomeruli (Fig. 1). Glomeruli that represented mixed or ambiguous patterns of injury were not included in the training set. In a second preselection step for the CNN-based approach, two nephropathologists blindly reviewed the preselected glomerular images and, by consensus, selected glomeruli whose injury patterns were as pure (uniform) as possible. Images of cropped glomeruli representing pure patterns were randomly assigned to the testing and training sets.

Fig. 1. Nine classes of glomerular histological patterns: a. Crescentic, b. Endocapillary, c. Mesangioproliferative, d. Membranoproliferative, e. Segmental sclerosis (FSGS), f. Membranous, g. Hypertrophy, h. Normal glomeruli, i. Global sclerosis.

We doubled the number of glomerular images in the training set by rotational augmentation: each original image of a cropped glomerulus was rotated by an individually selected random angle in 90° steps (one of 90°, 180°, or 270°). The complete composition of the training and testing sets is given in Table 1.
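The augmentation step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code; the function names `rotate_augment` and `double_training_set` are hypothetical, and the images are assumed to be square arrays of shape (n, height, width, channels).

```python
import numpy as np

def rotate_augment(images, seed=None):
    """Return one rotated copy per image, each turned by a random multiple of 90 degrees.

    images: array of shape (n, height, width, channels) holding square crops.
    The angle is drawn independently per image from {90, 180, 270} degrees,
    so the identity rotation (0 degrees) is never chosen.
    """
    rng = np.random.default_rng(seed)
    k = rng.integers(1, 4, size=len(images))  # 1, 2 or 3 quarter turns per image
    rotated = np.stack(
        [np.rot90(img, k=int(ki), axes=(0, 1)) for img, ki in zip(images, k)]
    )
    return rotated, k * 90

def double_training_set(images, labels, seed=None):
    """Concatenate originals with their rotated copies; rotation does not change the label."""
    rotated, _ = rotate_augment(images, seed)
    return np.concatenate([images, rotated]), np.concatenate([labels, labels])
```

Applied to the 81 original training glomeruli per class, this yields the 81 + 81 images per class listed in Table 1.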

Table 1

Composition of training and testing data sets.
Glomerular injury pattern	Testing set (original)	Training set (original)	Training set (augmented)	Total original glomeruli
Crescentic	29	81	81	110
Endocapillary	37	81	81	118
FSGS	54	81	81	135
Hypertrophy	33	81	81	114
Membranoproliferative	46	81	81	127
Membranous	35	81	81	116
Mesangioproliferative	42	81	81	123
Normal	96	81	81	177
Sclerosed	45	81	81	126
Total	417	729	729	1146

Multi-class classification of glomerular injury patterns by a single artificial neural network-based classifier

We used the training set to develop classifiers for the nine patterns of glomerular injury. First, we built a single nine-class classifier based on an ImageNet-pretrained Xception neural network architecture. We reconfigured the Xception model to accept input images of 1024 × 1024 pixels. For each glomerulus, the final classification layer of the model was set to generate a nine-class probability output via the softmax activation function. The classifier was trained on a balanced dataset in batches of four images, using a stochastic gradient descent (SGD) optimizer to minimize categorical cross-entropy loss.
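A minimal Keras sketch of such a classifier is shown below. It is an illustration under stated assumptions, not the authors' implementation: the global-average-pooling layer between the Xception backbone and the new output layer, the SGD learning rate and momentum, and the builder name `build_xception_classifier` are all hypothetical, since the text does not specify them. Passing `num_outputs=1` gives the sigmoid head used for the one-vs-rest classifiers.

```python
import tensorflow as tf

def build_xception_classifier(num_outputs=9, input_size=1024, weights="imagenet"):
    """Xception backbone with a new output layer for square RGB glomerulus crops.

    num_outputs=9 gives a softmax head (single nine-class classifier);
    num_outputs=1 gives a sigmoid head (one binary one-vs-rest classifier).
    """
    backbone = tf.keras.applications.Xception(
        include_top=False,                        # drop the 1000-class ImageNet head
        weights=weights,                          # "imagenet" = pretrained, None = random init
        input_shape=(input_size, input_size, 3),  # Xception accepts any size >= 71 px here
    )
    # Assumption: pool the final feature maps before the new dense output layer.
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    multiclass = num_outputs > 1
    outputs = tf.keras.layers.Dense(
        num_outputs, activation="softmax" if multiclass else "sigmoid"
    )(x)
    model = tf.keras.Model(backbone.input, outputs)
    model.compile(
        # Hypothetical hyperparameters; the text only states that SGD was used.
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss="categorical_crossentropy" if multiclass else "binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training in batches of four would then be a call such as `model.fit(images, one_hot_labels, batch_size=4, epochs=n_epochs)`, where the arrays and epoch count are placeholders.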

One-vs-Rest classification of glomerular injury patterns by multiple binary classifiers

Secondly, we performed a binary classification of glomerular injury patterns in a one-vs-rest setting by splitting the nine-class dataset into nine binary classification problems. For this task, we built nine distinct binary Xception-based classifiers, each aimed to discriminate a particular glomerular injury type from remaining classes. For each binary classifier, we again reconfigured the original Xception architecture for 1024 × 1024 pixel-sized input, but the final classification layer was set to generate single-class probability output via the sigmoid activation function. One-vs-rest classifiers were trained by SGD optimizer to minimize the binary cross entropy loss. Binarization of training labels introduces class imbalance; therefore, we balanced the training set by sampling all the glomeruli from the positive (target) class and proportionally subsampling remaining classes to collect equivalent number of glomeruli to represent negative class. During inference for a particular glomerulus, each binary classifier predicts a class membership probability score. The argmax of these scores defines the overall predicted class of glomerulus injury pattern.
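The label binarization, the balancing of negatives, and the argmax combination of the nine binary scores can be sketched in NumPy. This is a hypothetical illustration: the function names are invented, and the largest-remainder rounding of the proportional negative quotas is an assumption. With the balanced training set of Table 1, each of the eight negative classes contributes roughly one eighth of the negatives.

```python
import numpy as np

def one_vs_rest_indices(labels, target, seed=None):
    """Indices and binary labels for a balanced one-vs-rest training set.

    All samples of `target` are positives; an equal number of negatives is drawn
    from the remaining classes in proportion to their sizes (largest-remainder rounding).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == target)
    others = [c for c in np.unique(labels) if c != target]
    sizes = np.array([np.sum(labels == c) for c in others])
    raw = len(pos) * sizes / sizes.sum()          # ideal (fractional) quota per negative class
    quota = np.floor(raw).astype(int)
    remainder = len(pos) - quota.sum()
    quota[np.argsort(raw - quota)[::-1][:remainder]] += 1  # hand out the leftover samples
    neg = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=int(q), replace=False)
        for c, q in zip(others, quota)
    ])
    idx = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return idx, y

def combine_binary_scores(scores):
    """scores: (n_samples, n_classes) sigmoid outputs, one column per binary classifier.
    The predicted injury pattern is the class whose classifier gives the highest score."""
    return np.argmax(scores, axis=1)
```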

Visualizing classifier attention maps with Grad-CAM

To interpret and explain the classification criteria used by classifiers during inference, we visualized discriminative regions of glomeruli images relevant to distinct injury patterns. For this task, we employed gradient-weighted class activation mapping technique that captures spatial information that is preserved through convolutional layers of the trained classifier. [38] The localization heatmap is calculated as a weighted sum of feature maps in the final convolutional layer of the classifier and up-sampled to match the size of the original image. For display, the up-sampled Grad-CAM localization heatmaps are overlayed on top of glomerular images.
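The Grad-CAM combination step can be sketched in NumPy, starting from the final-layer activations and the gradients of the class score with respect to them (in TensorFlow these are typically obtained with `tf.GradientTape`; that part is omitted here). The function name is hypothetical, and the sketch uses nearest-neighbour up-sampling for simplicity, whereas Grad-CAM implementations usually interpolate bilinearly. For a 1024 × 1024 input, the final Xception feature maps are about 32 × 32, so the up-sampling factor is 32.

```python
import numpy as np

def grad_cam(activations, gradients, factor):
    """Grad-CAM heatmap for one image and one target class.

    activations, gradients: arrays of shape (h, w, K): the K feature maps of the
    last convolutional layer and d(score_c)/d(activations) for target class c.
    factor: integer up-sampling factor back to the input resolution.
    """
    alpha = gradients.mean(axis=(0, 1))                         # one weight per feature map
    cam = np.maximum((activations * alpha).sum(axis=-1), 0.0)  # weighted sum, then ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                                   # scale to [0, 1] for display
    # Single-step up-sampling: each coarse cell becomes a factor x factor block,
    # which is why the final heatmap looks blocky at input resolution.
    return np.repeat(np.repeat(cam, factor, axis=0), factor, axis=1)
```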

Spatially guided multiclass classification of glomerular injury patterns

Grad-CAM is a post hoc neural network attention method, which means that it does not participate in classifier training. Its localization heatmaps are not learned explicitly and are not directly optimized during training. Moreover, in Grad-CAM, up-sampling is performed in a single step, from a small heatmap the size of the final convolutional layer up to a heatmap the size of the input image, by an integer factor typically on the order of tens, so the final heatmap is coarse and noisy.

Proposed neural network architecture

Therefore, to focus the classifier on essential parts of the image and increase the granularity of the attention localization heatmap, we built a trainable attention mechanism. For this task, we employed a U-Net-like encoder-decoder structure. We again used the Xception architecture as the base model, but extended it with a U-Net-style decoder and skip connections. The network has three output layers. The final convolutional layer of the Xception encoder branch is connected to an aggregation block consisting of a global average pooling operation and an intermediate densely connected layer (${int}_{1}$). This block feeds the first (auxiliary) densely connected output layer (${clf}_{aux}$), which uses a softmax activation function to produce a nine-class probability output. A similar block (global average pooling followed by a dense intermediate layer, ${int}_{2}$) is added at the end of the decoder branch. The second (main) dense classification layer (${clf}_{main}$) receives the concatenated outputs of the intermediate dense layers ${int}_{1}$ and ${int}_{2}$ and produces another nine-class probability output via a softmax activation function. In parallel to the ${clf}_{main}$ branch, a single-filter 1 × 1 2D convolutional layer acts as the third output layer ($loc$), producing a localization heatmap that exactly matches the input image size. Consequently, the main classification branch (${clf}_{main}$) is conditioned on both the localization ($loc$) and auxiliary classification (${clf}_{aux}$) branches. The detailed schema of the proposed neural network architecture is given in Fig. 2.
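The wiring of the three output heads can be sketched in Keras. To keep the example small and runnable, a tiny convolutional encoder stands in for the Xception encoder; the number of stages, filter counts, intermediate layer width, ReLU activations, and the sigmoid on the $loc$ output are all assumptions, and `build_spatially_guided` is a hypothetical name. Only the head topology follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions, as in a typical U-Net stage."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_spatially_guided(input_size=1024, num_classes=9, base=16, depth=3, int_units=64):
    """input_size must be divisible by 2 ** depth."""
    inp = layers.Input((input_size, input_size, 3))
    # Encoder (small stand-in for Xception); keep each stage output for skip connections.
    skips, x = [], inp
    for d in range(depth):
        x = conv_block(x, base * 2 ** d)
        skips.append(x)
        x = layers.MaxPooling2D()(x)
    x = conv_block(x, base * 2 ** depth)  # plays the role of the final encoder conv layer
    # Aggregation block 1 -> int_1 -> auxiliary classifier.
    int1 = layers.Dense(int_units, activation="relu", name="int_1")(layers.GlobalAveragePooling2D()(x))
    clf_aux = layers.Dense(num_classes, activation="softmax", name="clf_aux")(int1)
    # U-Net-style decoder: up-sample, concatenate the matching encoder features, convolve.
    for d in reversed(range(depth)):
        x = layers.UpSampling2D()(x)
        x = layers.Concatenate()([x, skips[d]])
        x = conv_block(x, base * 2 ** d)
    # Aggregation block 2 -> int_2; main classifier sees both int_1 and int_2.
    int2 = layers.Dense(int_units, activation="relu", name="int_2")(layers.GlobalAveragePooling2D()(x))
    clf_main = layers.Dense(num_classes, activation="softmax", name="clf_main")(
        layers.Concatenate()([int1, int2]))
    # Single-filter 1x1 convolution: per-pixel heatmap at the input resolution.
    loc = layers.Conv2D(1, 1, activation="sigmoid", name="loc")(x)
    return tf.keras.Model(inp, {"clf_aux": clf_aux, "clf_main": clf_main, "loc": loc})
```

Because the decoder features feed both `loc` and `int_2`, gradients from the localization loss shape the same features that the main classifier uses, which is the conditioning described in the text.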

During training, the network learns to assign features of the glomerular images to the ground truth class labels of the distinct glomerular injury patterns (${clf}_{aux}$ and ${clf}_{main}$), as well as a functional mapping between pixels in the glomerular images and pixels in the corresponding ground truth localization heatmaps ($loc$).

The model is configured to accept 1024 × 1024 pixel glomerular images and is trained by an optimizer minimizing three weighted loss functions: categorical cross-entropy for the ${clf}_{aux}$ and ${clf}_{main}$ classification outputs, and binary cross-entropy for the localization heatmap output ($loc$). Weighting the loss functions enables prioritization of network tasks, e.g., main classification over auxiliary classification (${w}_{{clf}_{main}}\gg {w}_{{clf}_{aux}}$) and localization slightly over main classification (${w}_{loc}>{w}_{{clf}_{main}}$). We trained the model with the following constraints: ${w}_{{clf}_{main}}+{w}_{{clf}_{aux}}+{w}_{loc}=1$ and ${w}_{loc}>{w}_{{clf}_{main}}\gg {w}_{{clf}_{aux}}$.
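The weighted objective can be made concrete with a small NumPy sketch. The weights used below (${w}_{loc}=0.5$, ${w}_{{clf}_{main}}=0.45$, ${w}_{{clf}_{aux}}=0.05$) are hypothetical values that merely satisfy the stated constraints; the paper does not report the actual values. Both classification heads share the same one-hot target, and the localization target is the soft ground truth heatmap.

```python
import numpy as np

EPS = 1e-7  # avoids log(0)

def categorical_ce(y_true, y_pred):
    """Mean categorical cross-entropy; y_true one-hot, y_pred softmax probabilities."""
    return float(-np.mean(np.sum(y_true * np.log(np.clip(y_pred, EPS, 1.0)), axis=-1)))

def binary_ce(y_true, y_pred):
    """Mean per-pixel binary cross-entropy; soft targets in [0, 1] are allowed."""
    p = np.clip(y_pred, EPS, 1.0 - EPS)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def total_loss(targets, preds, w_main, w_aux, w_loc):
    """Weighted sum of the three head losses.

    Checks the constraints from the text: weights sum to 1 and w_loc > w_main > w_aux
    (the 'much greater than' part of w_main >> w_aux is not checked numerically).
    """
    if abs(w_main + w_aux + w_loc - 1.0) > 1e-9 or not (w_loc > w_main > w_aux):
        raise ValueError("weights must sum to 1 and satisfy w_loc > w_main > w_aux")
    return (w_main * categorical_ce(targets["clf_main"], preds["clf_main"])
            + w_aux * categorical_ce(targets["clf_aux"], preds["clf_aux"])
            + w_loc * binary_ce(targets["loc"], preds["loc"]))
```

In Keras, the same weighting is usually expressed by passing per-output losses and a `loss_weights` dictionary to `Model.compile`, keyed by the output names.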

Ground-truth localization heatmaps

To produce ground truth localization heatmaps, nephropathologists were first asked to highlight hotspots of segmental glomerular injury patterns by placing a simple free-form annotation (as small as a single pixel) on a copy of the glomerular training set images. Multiple hotspots were allowed. Diffuse glomerular injury patterns and normal glomeruli were automatically annotated with a single pixel at the center of the glomerulus mask.

These annotations were then transformed into heatmaps where every pixel in a heatmap gets a value through a non-linear distance-based function:

$${h}_{x,y}=\frac{1}{1+{e}^{-({c}_{1}\cdot {d}_{x,y}+{c}_{2})}}$$

where ${h}_{x,y}$ is the value of the pixel at position $x,y$ in the image plane, ${d}_{x,y}$ is the Euclidean distance from that pixel to the nearest annotated pixel, and ${c}_{1}$ and ${c}_{2}$ are preselected constants. Because ${c}_{1}$ is negative, pixels closer to the annotation get values closer to 1.0, and pixels farther from the annotation get values closer to 0.0. All pixels outside the glomerulus contour are set to 0.0. Example localization heatmaps are shown in Fig. 3.
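The heatmap construction can be sketched with NumPy and SciPy's Euclidean distance transform. The text does not state the unit of ${d}_{x,y}$; with raw pixel distances and ${c}_{1}=-9$ the heatmap would drop below 0.5 within one pixel, so this sketch assumes, hypothetically, that distances are divided by the largest distance inside the glomerulus mask. The function names are also hypothetical, and at least one annotated pixel is required.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def ground_truth_heatmap(annotation, glom_mask, c1=-9.0, c2=6.0, scale=None):
    """h = 1 / (1 + exp(-(c1 * d + c2))) inside the glomerulus, 0 outside.

    annotation: bool array, True on annotated hotspot pixels (at least one pixel).
    glom_mask:  bool array, True inside the glomerulus contour.
    scale: divisor for pixel distances; hypothetical default = largest distance in the mask.
    """
    # The EDT gives each nonzero pixel its distance to the nearest zero pixel,
    # so the annotation is inverted to make annotated pixels the zeros.
    d = distance_transform_edt(~annotation)
    if scale is None:
        scale = max(d[glom_mask].max(), 1.0)
    d = d / scale
    h = 1.0 / (1.0 + np.exp(-(c1 * d + c2)))
    h[~glom_mask] = 0.0  # pixels outside the glomerulus contour get 0.0
    return h

def center_annotation(glom_mask):
    """Single-pixel annotation at the mask centroid (diffuse patterns, normal glomeruli).
    For strongly non-convex masks the centroid may fall outside the mask."""
    ys, xs = np.nonzero(glom_mask)
    ann = np.zeros(glom_mask.shape, dtype=bool)
    ann[int(round(ys.mean())), int(round(xs.mean()))] = True
    return ann
```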

Fig. 3. Example ground truth localization heatmaps generated with transformation coefficients $({c}_{1},{c}_{2})=(-7,6)$ (left), $(-9,6)$ (middle), and $(-11,6)$ (right). The coefficients used in this paper (${c}_{1}=-9$, ${c}_{2}=6$) were selected by visual assessment of the resulting ground truth localization heatmaps.
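A short derivation makes the role of the two coefficients explicit. The heatmap reaches one half where the exponent vanishes, so the half-value distance depends only on the ratio of the coefficients, while the peak value at the annotation depends only on ${c}_{2}$:

$${h}_{x,y}=\tfrac{1}{2}\iff {c}_{1}\cdot {d}_{x,y}+{c}_{2}=0\iff {d}_{1/2}=-\frac{{c}_{2}}{{c}_{1}},\qquad {h}\big|_{{d}_{x,y}=0}=\frac{1}{1+{e}^{-{c}_{2}}}$$

With ${c}_{2}=6$, the peak value is $1/(1+{e}^{-6})\approx 0.998$, and ${d}_{1/2}=6/7\approx 0.857$, $6/9\approx 0.667$, and $6/11\approx 0.545$ for ${c}_{1}=-7$, $-9$, and $-11$. A more negative ${c}_{1}$ therefore produces a tighter hotspot around each annotation. These distances are in the same units as ${d}_{x,y}$, which suggests that distances are normalized before the transformation rather than measured in raw pixels.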

Metrics

Accuracy was used to monitor all classifier models during the training phase (see Supplementary Fig. 1 for model training metrics). During inference, classifier performance was compared using the area under the ROC curve (AUC), and multiclass confusion matrices were used to gain deeper insight into classification errors and biases. Because we compare classifier experiments in a multiclass setting, the mean per-class classification accuracy over cross-validation folds reported in Table 2 (Results section) is calculated as the true positive rate. The variance in classifier performance is reported as the standard deviation of accuracies (coefficients of variation are used to identify values exceeding the average). The generalized classification accuracy is calculated over the diagonal of an aggregated multiclass confusion matrix. The quality of predicted localization heatmaps was evaluated by the intersection over union (IoU) between predicted and ground truth localization heatmaps at a threshold value. Furthermore, injury pattern area was quantified as the percentage of the glomerulus covered by the predicted intraglomerular pattern heatmap.
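The metrics described above can be sketched in NumPy. This is an illustration with hypothetical function names; the rows-as-true-class convention, the `>=` comparison at the threshold, the population standard deviation in the coefficient of variation, and returning an IoU of 1.0 when both thresholded maps are empty are assumptions not stated in the text.

```python
import numpy as np

def per_class_tpr(cm):
    """Per-class accuracy as true positive rate: diagonal over row sums (rows = true class)."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

def generalized_accuracy(cm):
    """Overall accuracy from the diagonal of an (aggregated) multiclass confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def coefficient_of_variation(values):
    """Standard deviation relative to the mean, in percent (population std, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std() / values.mean()

def heatmap_iou(pred, truth, threshold=0.5):
    """IoU of two heatmaps after binarizing both at the same threshold."""
    p, t = pred >= threshold, truth >= threshold
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else 1.0

def pattern_area_percent(pred, glom_mask, threshold=0.5):
    """Share of the glomerulus area covered by the thresholded pattern heatmap, in percent."""
    covered = np.logical_and(pred >= threshold, glom_mask).sum()
    return 100.0 * covered / glom_mask.sum()
```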

Implementation

All image data manipulation steps (preprocessing of glomerular images, generation of augmented images and ground truth heatmaps, and the manuscript figures) were performed in Python 3.8.10 using the ‘scikit-image’ v.0.19.1, ‘matplotlib’ v.3.5.1, ‘numpy’ v.1.20.0, and ‘scipy’ v.1.7.3 libraries. The classifiers were built, trained, and evaluated in Python 3.8.10 using ‘tensorflow’ v.2.7.0 and ‘scikit-learn’ v.1.0.2 on a high-performance graphics processing unit (Nvidia GeForce RTX 3090).

We conducted all classifier experiments in a five-fold cross-validation setting, and the trained classifiers were evaluated on the testing set. Averaged ROC curves are presented in Fig. 4. Mean classification metrics were obtained by averaging class-specific results over all folds in each classification experiment and are given in Table 2.

Glomeruli pattern classification

Generalized classification accuracies in the different classifier experiments ranged from 0.677 for the ‘multiple-binary’ classifier up to 0.728 for the ‘spatially guided’ classifier. The highest per-class accuracy was observed for the sclerosed glomeruli pattern by all classifiers (mean of all ~ 0.985). Conversely, the lowest accuracy among the classifiers was obtained for the FSGS pattern (mean of all ~ 0.473). We observed an overall tendency towards higher classification accuracy for diffuse glomerular patterns (e.g., membranoproliferative, membranous, hypertrophy; mean ~ 0.817), while segmental injury patterns were more difficult to discriminate (mean ~ 0.626). The ‘spatially guided’ classifier achieved the highest mean classification accuracies for diffuse (~ 0.830) and segmental (~ 0.659) patterns, while the ‘multiple-binary’ classifier achieved the lowest accuracies for both diffuse (~ 0.806) and segmental (~ 0.560) patterns.

Table 2

Mean per-class classification accuracy, AUC, and IoU, with standard deviations in parentheses, calculated over five cross-validation folds. Individual IoU scores were computed at a 0.5 threshold (see Supplementary Tables 1, 2, and 3 for detailed cross-validation classification metrics).
	Segmental injury				Diffuse					
Classifier experiment	Crescentic	Endocapillary	FSGS	Mesangioproliferative	Membranoproliferative	Membranous	Hypertrophy	Normal	Sclerosed	Generalized multiclass
	Mean classification accuracy (standard deviation)
Single-multiclass	0.841 (0.046)	0.730 (0.118)	0.478 (0.076)	0.586 (0.148)	0.765 (0.060)	0.817 (0.066)	0.879 (0.057)	0.640 (0.025)	0.978 (0.000)	0.719 (0.010)
Multiple-binary	0.745 (0.072)	0.573 (0.147)	0.437 (0.025)	0.486 (0.080)	0.757 (0.082)	0.840 (0.048)	0.830 (0.051)	0.625 (0.039)	0.978 (0.000)	0.677 (0.006)
Spatially-guided	0.814 (0.076)	0.676 (0.128)	0.504 (0.072)	0.643 (0.154)	0.739 (0.063)	0.840 (0.082)	0.927 (0.046)	0.644 (0.120)	1.000 (0.000)	0.728 (0.028)
	Mean AUC (standard deviation)
Single-multiclass	0.971 (0.005)	0.964 (0.006)	0.840 (0.014)	0.920 (0.010)	0.965 (0.005)	0.970 (0.004)	0.970 (0.005)	0.943 (0.007)	0.995 (0.000)	0.949 (0.002)
Multiple-binary	0.935 (0.013)	0.919 (0.006)	0.767 (0.006)	0.886 (0.012)	0.948 (0.003)	0.953 (0.001)	0.970 (0.003)	0.935 (0.006)	0.991 (0.000)	0.923 (0.003)
Spatially-guided	0.971 (0.003)	0.971 (0.003)	0.863 (0.020)	0.915 (0.010)	0.956 (0.005)	0.972 (0.003)	0.981 (0.003)	0.964 (0.003)	0.995 (0.000)	0.954 (0.004)
	Mean IoU (standard deviation)
Single-multiclass	0.061 (0.012)	0.050 (0.006)	0.042 (0.003)	0.041 (0.003)	n/a	n/a	n/a	n/a	n/a	0.049 (0.003)
Multiple-binary	0.060 (0.006)	0.052 (0.007)	0.034 (0.012)	0.049 (0.016)	n/a	n/a	n/a	n/a	n/a	0.048 (0.007)
Spatially-guided	0.404 (0.174)	0.379 (0.138)	0.263 (0.116)	0.235 (0.114)	n/a	n/a	n/a	n/a	n/a	0.320 (0.133)

While classification accuracy is a straightforward and intuitive metric that directly indicates the proportion of correct predictions, AUC adds a probabilistic component to the evaluation by assessing how well the predictions are ranked. The mean AUC scores indicate that the ‘spatially guided’ classifier achieved the highest (or tied-highest) AUC for most (seven out of nine) glomerular patterns compared with the other approaches. AUC scores generalized to all glomerular patterns exceeded 0.900 in all classifier experiments, with the ‘spatially guided’ classifier performing best (generalized AUC score of 0.954).

Fig. 4. Averaged ROC curves. The blue curve indicates the performance of the single-multiclass classifier, the orange curve the multiple-binary classifier, and the green curve the spatially guided CNN classifier.

The consistency of all classifiers can be inferred from the standard deviations reported for the classification metrics. Highlighted cells in Table 2 identify values that exceed the mean coefficient of variation for the given experiment (9.90%, 9.59%, and 12.06% for the ‘single-multiclass’, ‘multiple-binary’, and ‘spatially guided’ classifiers, respectively). Higher variance in classifier performance is observed for segmental injury patterns.

Evaluation of localization heatmaps and pattern quantification

Ground truth localization heatmaps were generated for glomerular images in the training set (both original and augmented images) as well as in the testing set, which allowed us to analyze the concordance between ground truth and classifier-produced localization heatmaps. An overview of concordance measured by intersection over union is given in Table 2. In general, gradient-based heatmaps barely overlap expert-annotated areas inside the glomeruli, with mean IoU values below 0.05 for both the ‘single-multiclass’ and ‘multiple-binary’ classifiers. In fact, these classifiers quite often base their pattern classification on areas outside the glomerular contour, thus likely capturing shape and size characteristics. Importantly, the ‘spatially guided’ classifier with trainable localization heatmaps achieved a generalized mean IoU of 0.320. A more in-depth analysis of special cases of classifier localization heatmaps is presented in Tables 3, 4, and 5.

Intraglomerular injury pattern area quantification gave variable results depending on the cross-validation fold. A detailed per-glomerulus comparison of pattern quantification results is presented in Supplementary Fig. 2.

Fig. 5. Confusion matrix highlighting the most frequently confused patterns (averaged over cross-validation folds).

In this study, we exploited several novel opportunities that CNNs offer for the recognition and quantification of glomerular injury patterns. First, we proposed a new spatially guided modification of a CNN classifier for improved glomerular injury pattern classification. Second, we estimated the accuracy of intraglomerular injury pattern localization against ground-truth annotations produced by nephropathologists and showed the potential of intraglomerular pattern quantification.

Our spatially guided CNN classifier showed the best classification results for most of the investigated glomerular injury patterns compared with the single-multiclass and multiple-binary classifiers. Importantly, classification performance was also high for segmental patterns (crescentic, endocapillary, FSGS), which are usually more difficult for automated segmentation but are necessary for a comprehensive assessment of glomerular pathology. The AUC values for the spatially guided classifier are very close to previously reported [30, 37] AUC measurements for the same patterns. However, the glomerular image datasets used for training are relatively small (only 81 glomeruli per injury type without augmentation and 1146 in total) compared with previous studies (for example, a total of 32,267 glomeruli were used by Yang et al. [30]). This could be seen as a limitation of our study; on the other hand, it may indicate an added value of spatial guidance for CNN architectures, enabling satisfactory results from substantially smaller datasets.

Imbalanced datasets, in general, present a major issue for machine learning, computer vision, and pattern recognition tasks. Class imbalance occurs when there is a significant inequality between the number of examples of different classes and, if not addressed, greatly impairs classifier detection accuracy. [39] In CNN applications for glomerular pattern recognition, imbalanced datasets have usually been used [30, 32, 37] because of the low incidence of some patterns, such as pure, non-overlapping endocapillary hypercellularity or crescent formation, which are relatively rare compared with other pathological features. In this study, a total of 695 native renal biopsies and 27,156 glomeruli were used to select 1146 glomeruli, with the training subset balanced across glomerular classes (Table 1). Therefore, we handled the so-called ‘foreground-foreground class’ imbalance [39] at the early sampling stage of the object detection pipeline.

We preprocessed glomerular images by replacing the surrounding renal tissue with a black background to focus our classifiers exclusively on glomerular structures. In contrast, previous studies included surrounding tissue for classifier training and testing. [29, 30, 32] A black background was chosen to avoid any extraglomerular context. To explore the impact of this approach, further investigations are needed to measure the effects of various background preprocessing procedures on classifier performance: background removal or inclusion, background color, cropping, scaling, and the amount of background.

In this study, we compared three different modifications of a CNN classifier in terms of classification and localization accuracy. Although the single-multiclass and multiple-binary classifiers showed high accuracy of glomerular label prediction (Table 2), the concordance between their localization heatmaps and the ground-truth localization heatmaps was considerably worse than for the spatially guided CNN. The differences in heatmap overlap were striking: the ‘spatially guided’ classifier achieved a generalized mean IoU of 0.320, compared with only 0.049 and 0.048 for the single-multiclass and multiple-binary classifiers, respectively. These results indicate the importance of providing manually annotated images for classifier training if pattern segmentation and quantification are sought. It should be noted that heatmap validation was not explored in other studies that investigated the application of CNN classifiers to glomerular pattern segmentation. [29, 30, 32]

The analysis of classifier-produced heatmaps and their comparison with manually annotated images in the validation set revealed several notable observations. First, regardless of correct label prediction, the single-multiclass and multiple-binary classifiers usually produced poor pattern localization heatmaps (Table 3, row 3; Table 4, rows 1–3; Table 5, row 1). Some of these heatmaps even focused on the black background while ignoring the glomerulus itself. Second, for some glomeruli that were labeled incorrectly by the spatially guided CNN, the attention heatmaps nevertheless precisely localized another, less prominent pattern in the glomerulus. For example, some glomeruli labeled as FSGS by an expert were classified as crescentic by the CNN. Further investigation revealed that these glomeruli were taken from a crescentic glomerulonephritis case and, although chosen to represent an FSGS pattern, potentially contained a segmental sclerosed crescentic lesion. Another example was small mesangioproliferative areas picked up by the spatially guided CNN heatmap in a glomerulus presenting FSGS in an IgA nephropathy case. These findings indicate that our spatially guided CNN classifier already recognizes several patterns within a glomerulus; a procedure to extract this information from the classifier remains to be explored. Weis et al. have already developed CNN classifiers that can deal with complex cases and recognize several patterns occurring in the same glomerulus. In contrast to our experiment, that study explored CNN performance more systematically, using a different image dataset with complex changes that cannot be attributed to any single category by the pathologist. [32]

Full validation of the performance of our trained models is limited by several factors. First, our training and validation sets were composed of glomerular images containing histological patterns that were as pure as possible. Further testing on ‘real-life’ cases, overlapping patterns, and whole slide images should be performed. Second, the relevance of the quantification results should be tested on whole slides rather than individual glomeruli and compared with clinical parameters, which were not available in our datasets. Furthermore, unlike other studies [29], our spatially guided approach is fully supervised: prior knowledge of glomerular structures and labels must be supplied for network training.

Significant pairwise misclassification rates (Fig. 5) between endocapillary and crescentic/membranoproliferative patterns, FSGS and mesangioproliferative changes, membranoproliferative and endocapillary, mesangioproliferative and FSGS, and membranous and hypertrophy could be explained by similarities between these histological patterns. However, classifier attention heatmaps revealed that some glomeruli incorrectly assigned to the ‘normal’ class showed accurate detection of glomerular areas with normal capillary loops (Table 5). We suggest that this could be related to an imbalance in our training dataset: the datasets were balanced by the number of glomeruli but not by the area occupied by different patterns within the histology images. For example, the area of normal capillaries was considerably larger than the FSGS area and is found in several pattern subsets (such as hypertrophy, FSGS, mesangioproliferative, and crescentic). This might have influenced the classifier training results. This phenomenon has been noted previously by Selvaraju et al. and defined as an inherent bias in datasets, whereby a CNN focuses on a frequently occurring feature that is not always a true pathological lesion. [38] This should be taken into account when planning such experiments and could be avoided by subdividing the glomerular image into several parts according to the histological lesions and structures present. [40] A recent study by Sato et al. illustrated that, in patch-based analysis, a CNN classifier could correctly give higher attention to image structures such as cellular components, sclerotic regions, or crescent regions than when entire glomerular images were analyzed. [29]

In conclusion, this study established a novel spatially guided CNN classifier that successfully recognized glomerular patterns in relatively small data sets. We also validated the automatically produced classifier heatmaps for glomerular lesion localization and demonstrated that the spatially guided CNN outperformed CNN classifiers with classical architectures (single-multiclass and multiple-binary).
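The heatmap validation summarized above compares predicted and ground-truth localization maps with intersection over union (IoU). The NumPy sketch below shows the two ingredients under stated assumptions, and is not the authors' code:

* The heatmap is Grad-CAM-style (Grad-CAM is listed among the abbreviations), built from one convolutional layer's activations and the gradients of the class score.
* Upsampling to image resolution is omitted, so the heatmap and mask are assumed to share a shape.
* The binarization threshold of 0.5 is hypothetical.

The authors' generalized mean IoU is not reproduced here.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map for one convolutional layer.

    activations: (K, H, W) feature maps A^k for the input image.
    gradients:   (K, H, W) gradients of the class score y_c w.r.t. A^k.
    Returns an (H, W) map scaled to [0, 1] (upsampling omitted).
    """
    alphas = gradients.mean(axis=(1, 2))                 # one weight per channel (spatially averaged gradient)
    cam = np.tensordot(alphas, activations, axes=1)      # sum_k alpha_k * A^k
    cam = np.maximum(cam, 0.0)                           # ReLU keeps positive class evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def heatmap_iou(heatmap, truth_mask, threshold=0.5):
    """IoU between a thresholded heatmap and a binary ground-truth mask."""
    pred = heatmap >= threshold
    truth = truth_mask.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both empty: treat as perfect agreement
    return np.logical_and(pred, truth).sum() / union

# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.array([[[1.0, 1.0], [1.0, 1.0]],
                  [[-1.0, -1.0], [-1.0, -1.0]]])
cam = grad_cam(acts, grads)
truth = np.array([[1, 1], [0, 0]])  # hypothetical ground-truth lesion mask
iou = heatmap_iou(cam, truth, threshold=0.5)
print(cam, iou)
```

Here the heatmap highlights one of the two annotated pixels and nothing outside them, so the IoU is 1/2. In practice, IoU would be computed per glomerulus and per pattern and then averaged.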

Abbreviations

CNN - convolutional neural network

IoU - intersection over union

GN - glomerulonephritis

Grad-CAM - gradient-weighted class activation mapping

WSI - whole slide images

FSGS - focal segmental glomerulosclerosis

SGD - stochastic gradient descent

Acknowledgements

This research was funded by the European Social Fund project No 09.3.3-LMT-K-712-19-0186 under grant agreement with the Research Council of Lithuania.

Authors’ contributions

J.B., A.L., M.M. designed, implemented and coordinated experiments. J.B. developed image analysis algorithms for glomerular segmentation. J.B., A.L. performed manual glomerular image sorting and created image datasets for experiments. M.M. designed and implemented CNN classifier models. M.M. coordinated and performed data analyses, digital image processing and created figures. M.M., A.L. provided assistance in writing the manuscript. All authors critically revised and approved the final version of the manuscript.

Competing interests
The authors declare no competing interests.

Data availability

Glomerular images used for training and testing can be provided upon request.

Code for the proposed method is available at: https://github.com/mindEM/glomnet

References

1. Weening, J.J., et al., The classification of glomerulonephritis in systemic lupus erythematosus revisited. Kidney Int, 2004. 65(2): p. 521–30.
2. Tervaert, T.W., et al., Pathologic classification of diabetic nephropathy. J Am Soc Nephrol, 2010. 21(4): p. 556–63.
3. D'Agati, V.D., et al., Pathologic classification of focal segmental glomerulosclerosis: a working proposal. Am J Kidney Dis, 2004. 43(2): p. 368–82.
4. Berden, A.E., et al., Histopathologic classification of ANCA-associated glomerulonephritis. J Am Soc Nephrol, 2010. 21(10): p. 1628–36.
5. Sethi, S., et al., Mayo Clinic/Renal Pathology Society Consensus Report on Pathologic Classification, Diagnosis, and Reporting of GN. J Am Soc Nephrol, 2016. 27(5): p. 1278–87.
6. Trimarchi, H., et al., Oxford Classification of IgA nephropathy 2016: an update from the IgA Nephropathy Classification Working Group. Kidney Int, 2017. 91(5): p. 1014–1021.
7. Haas, M., et al., Consensus definitions for glomerular lesions by light and electron microscopy: recommendations from a working group of the Renal Pathology Society. Kidney Int, 2020. 98(5): p. 1120–1134.
8. Bertsias, G.K., et al., Joint European League Against Rheumatism and European Renal Association-European Dialysis and Transplant Association (EULAR/ERA-EDTA) recommendations for the management of adult and paediatric lupus nephritis. Ann Rheum Dis, 2012. 71(11): p. 1771–82.
9. Rovin, B.H., et al., Executive summary of the KDIGO 2021 Guideline for the Management of Glomerular Diseases. Kidney Int, 2021. 100(4): p. 753–779.
10. Bajema, I.M., et al., Revision of the International Society of Nephrology/Renal Pathology Society classification for lupus nephritis: clarification of definitions, and modified National Institutes of Health activity and chronicity indices. Kidney Int, 2018. 93(4): p. 789–796.
11. Gasparotto, M., et al., Lupus nephritis: clinical presentations and outcomes in the 21st century. Rheumatology (Oxford), 2020. 59(Suppl 5): p. v39–v51.
12. Bellur, S.S., et al., Reproducibility of the Oxford classification of immunoglobulin A nephropathy, impact of biopsy scoring on treatment allocation and clinical relevance of disagreements: evidence from the VALidation of IGA study cohort. Nephrol Dial Transplant, 2019. 34(10): p. 1681–1690.
13. Restrepo-Escobar, M., P.A. Granda-Carvajal, and F. Jaimes, Systematic review of the literature on reproducibility of the interpretation of renal biopsy in lupus nephritis. Lupus, 2017. 26(14): p. 1502–1512.
14. Dasari, S., et al., A Systematic Review of Interpathologist Agreement in Histologic Classification of Lupus Nephritis. Kidney Int Rep, 2019. 4(10): p. 1420–1425.
15. Haas, M., et al., Impact of Consensus Definitions on Identification of Glomerular Lesions by Light and Electron Microscopy. Kidney Int Rep, 2022. 7(1): p. 78–86.
16. Hermsen, M., et al., Deep Learning-Based Histopathologic Assessment of Kidney Tissue. J Am Soc Nephrol, 2019. 30(10): p. 1968–1979.
17. Sheehan, S.M. and R. Korstanje, Automatic glomerular identification and quantification of histological phenotypes using image analysis and machine learning. Am J Physiol Renal Physiol, 2018. 315(6): p. F1644–F1651.
18. Wilbur, D.C., et al., Using Image Registration and Machine Learning to Develop a Workstation Tool for Rapid Analysis of Glomeruli in Medical Renal Biopsies. J Pathol Inform, 2020. 11: p. 37.
19. Bouteldja, N., et al., Deep Learning-Based Segmentation and Quantification in Experimental Kidney Histopathology. J Am Soc Nephrol, 2021. 32(1): p. 52–68.
20. Jiang, L., et al., A Deep Learning-Based Approach for Glomeruli Instance Segmentation from Multistained Renal Biopsy Pathologic Images. Am J Pathol, 2021. 191(8): p. 1431–1441.
21. Kannan, S., et al., Segmentation of Glomeruli Within Trichrome Images Using Deep Learning. Kidney Int Rep, 2019. 4(7): p. 955–962.
22. Kawazoe, Y., et al., Faster R-CNN-Based Glomerular Detection in Multistained Human Whole Slide Images. Journal of Imaging, 2018. 4(7).
23. Li, X., et al., Deep learning segmentation of glomeruli on kidney donor frozen sections. J Med Imaging (Bellingham), 2021. 8(6): p. 067501.
24. Marsh, J.N., et al., Deep Learning Global Glomerulosclerosis in Transplant Kidney Frozen Sections. IEEE Trans Med Imaging, 2018. 37(12): p. 2718–2728.
25. Bukowy, J.D., et al., Region-Based Convolutional Neural Nets for Localization of Glomeruli in Trichrome-Stained Whole Kidney Sections. J Am Soc Nephrol, 2018. 29(8): p. 2081–2088.
26. Bueno, G., et al., Glomerulosclerosis identification in whole slide images using semantic segmentation. Comput Methods Programs Biomed, 2020. 184: p. 105273.
27. Barros, G.O., et al., PathoSpotter-K: A computational tool for the automatic identification of glomerular lesions in histological images of kidneys. Sci Rep, 2017. 7: p. 46769.
28. Jayapandian, C.P., et al., Development and evaluation of deep learning-based segmentation of histologic structures in the kidney cortex with multiple histologic stains. Kidney Int, 2021. 99(1): p. 86–101.
29. Sato, N., et al., Evaluation of Kidney Histological Images Using Unsupervised Deep Learning. Kidney Int Rep, 2021. 6(9): p. 2445–2454.
30. Yang, C.K., et al., Glomerular disease classification and lesion identification by machine learning. Biomed J, 2022. 45(4): p. 675–685.
31. Altini, N., et al., Semantic Segmentation Framework for Glomeruli Detection and Classification in Kidney Histological Sections. Electronics, 2020. 9(3).
32. Weis, C.A., et al., Assessment of glomerular morphological patterns by deep learning algorithms. J Nephrol, 2022. 35(2): p. 417–427.
33. Altini, N., et al., A Deep Learning Instance Segmentation Approach for Global Glomerulosclerosis Assessment in Donor Kidney Biopsies. Electronics, 2020. 9(11).
34. Ginley, B., et al., Automated Computational Detection of Interstitial Fibrosis, Tubular Atrophy, and Glomerulosclerosis. J Am Soc Nephrol, 2021.
35. Ginley, B., et al., Computational Segmentation and Classification of Diabetic Glomerulosclerosis. J Am Soc Nephrol, 2019. 30(10): p. 1953–1967.
36. Zeng, C., et al., Identification of glomerular lesions and intrinsic glomerular cell types in kidney diseases via deep learning. J Pathol, 2020. 252(1): p. 53–64.
37. Yamaguchi, R., et al., Glomerular Classification Using Convolutional Neural Networks Based on Defined Annotation Criteria and Concordance Evaluation Among Clinicians. Kidney Int Rep, 2021. 6(3): p. 716–726.
38. Selvaraju, R.R., et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. International Journal of Computer Vision, 2020. 128(2): p. 336–359.
39. Oksuz, K., et al., Imbalance Problems in Object Detection: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 43(10): p. 3388–3415.
40. Lateef, F. and Y. Ruichek, Survey on semantic segmentation using deep learning techniques. Neurocomputing, 2019. 338: p. 321–348.

Tables 3–5 are available in the Supplementary Files section.