Data curation
All demographic and clinical parameters (for example, age, diagnosis and quantitative trait values) were reviewed for validity, and erroneous data were cleaned, or flagged and removed. For each quantitative trait, the range of expected values was defined, and values falling outside the range were corrected after consultation with the clinician or, in case of ambiguity, deleted. Unstructured data, for example disease diagnoses, were reclassified using ICD10 codes, and personal and family history of ophthalmic and systemic diseases was annotated with structured terminologies. Ophthatome™ is populated with extensively annotated data that were checked for accuracy and uniformity (Fig. 1). A brief overview of the data types captured in Ophthatome™ is given below.
i. Age
Age was computed from the date of birth and therefore reflects the age of the subject at each visit when longitudinal data are captured. This enables error checking: for example, ages of more than 110 years were marked as outliers, verified for accuracy and corrected.
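The age computation and outlier check can be sketched as follows. This is a minimal illustration, not the pipeline used for Ophthatome™: the function names are hypothetical, and flagging negative ages (a visit dated before birth) is an added assumption beyond the 110-year limit stated above.

```python
from datetime import date

MAX_PLAUSIBLE_AGE = 110  # years; ages above this are marked as outliers


def age_at_visit(date_of_birth: date, visit_date: date) -> int:
    """Return the completed years between date of birth and a visit date."""
    years = visit_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the visit year.
    if (visit_date.month, visit_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def is_age_outlier(age: int) -> bool:
    """Flag ages above the limit; negative ages are flagged too (assumption)."""
    return age < 0 or age > MAX_PLAUSIBLE_AGE
```

Because the age is derived per visit, the same subject yields a different age for each visit date, and a flagged value points back to a date of birth or visit date that needs verification.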
ii. Disease diagnosis
The disease diagnosis in the EMR is based on ICD10 codes.(20) The ICD diagnoses were categorised into disease types and subtypes for comprehensive and informative querying of the data. For example, all types of cataract (cortical, nuclear, anterior subcapsular etc.) were first grouped as cataract and then subclassified into their respective subtypes, as shown in Table 1. Further, disease diagnosis entries that did not follow the ICD10 codes, or that were present in the EMR in an unstructured format or as ‘free text’, were assigned unique codes and grouped under appropriate disease types and subtypes. Unstructured disease diagnosis entries that specified two or more diseases, e.g. a) “Primary open angle glaucoma (POAG) with cataract” or b) “early cataract with mild non-proliferative diabetic retinopathy (NPDR)”, were grouped under both diseases, a) glaucoma and cataract and b) cataract and diabetic retinopathy, respectively, along with the appropriate endophenotypes (Table 2). Each disease diagnosis was also annotated with additional data, for example (i) the affected ocular organ (cornea, conjunctiva, retina etc.), (ii) the age of onset of the disease (congenital, age-related, unspecified etc.) and (iii) possible causes of the disease, such as a) infectious (viral, bacterial, fungal or parasitic); b) genetic (complex, monogenic, somatic, mitochondrial, chromosomal); c) age-related; d) developmental; e) systemic causes; f) secondary to other ocular diseases; g) trauma; i) related to eye surgery; j) inflammation; k) contact lens; l) medications; m) autoimmune disease; n) environmental/allergy; o) unknown. These categorizations and labels allow advanced search functions for identifying specific, well-defined cohorts and for raising specific queries.
Table 1
Categorization of diseases with ICD codes into disease types and subtypes for comprehensive and informative querying in the Ophthatome™
ICD_code | Diagnosis_name | Disease | Sub_type |
H25011 | Cortical age-related cataract, right eye | Cataract | Cortical age-related cataract |
H25012 | Cortical age-related cataract, left eye | Cataract | Cortical age-related cataract |
H25013 | Cortical age-related cataract, bilateral | Cataract | Cortical age-related cataract |
H25019 | Cortical age-related cataract, unspecified eye | Cataract | Cortical age-related cataract |
H25031 | Anterior subcapsular polar age-related cataract, right eye | Cataract | Anterior subcapsular age-related cataract |
H25032 | Anterior subcapsular polar age-related cataract, left eye | Cataract | Anterior subcapsular age-related cataract |
H25033 | Anterior subcapsular polar age-related cataract, bilateral | Cataract | Anterior subcapsular age-related cataract |
H25039 | Anterior subcapsular polar age-related cataract, unspecified eye | Cataract | Anterior subcapsular age-related cataract |
H2504 | Posterior subcapsular cataract | Cataract | Posterior subcapsular cataract |
H25041 | Posterior subcapsular polar age-related cataract, right eye | Cataract | Posterior subcapsular age-related cataract |
H25042 | Posterior subcapsular polar age-related cataract, left eye | Cataract | Posterior subcapsular age-related cataract |
H25043 | Posterior subcapsular polar age-related cataract, bilateral | Cataract | Posterior subcapsular age-related cataract |
H25049 | Posterior subcapsular polar age-related cataract, unspecified eye | Cataract | Posterior subcapsular age-related cataract |
Table 2
Categorization of unstructured disease diagnoses (non-ICD diagnoses) into disease types and subtypes for comprehensive and informative querying in the Ophthatome™
ICD_code | Diagnosis_name | Disease | Sub_type |
UK1486 | Traumatic cataract LE | Cataract | Traumatic cataract |
UK2123 | Partially absorbed traumatic cataract | Cataract | Traumatic cataract |
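The two-level grouping in Tables 1 and 2 amounts to a lookup from a diagnosis code to a disease type and subtype. The sketch below illustrates this with a few rows copied from the tables; the dictionary layout and the function name `codes_for` are hypothetical and do not describe the actual Ophthatome™ schema.

```python
# A few rows taken from Tables 1 and 2: code -> (disease type, subtype).
DIAGNOSIS_MAP = {
    "H25011": ("Cataract", "Cortical age-related cataract"),
    "H25031": ("Cataract", "Anterior subcapsular age-related cataract"),
    "H25041": ("Cataract", "Posterior subcapsular age-related cataract"),
    "UK1486": ("Cataract", "Traumatic cataract"),  # non-ICD, free-text entry
    "UK2123": ("Cataract", "Traumatic cataract"),  # non-ICD, free-text entry
}


def codes_for(disease, subtype=None):
    """Return all codes grouped under a disease type, optionally one subtype."""
    return sorted(
        code
        for code, (d, s) in DIAGNOSIS_MAP.items()
        if d == disease and (subtype is None or s == subtype)
    )
```

Querying by disease type returns ICD and non-ICD codes together, while adding a subtype narrows the result, e.g. `codes_for("Cataract", "Traumatic cataract")`.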
iii. Quantitative traits
(a) Refractive error
Myopia and hyperopia were each classified by degree as low, medium or high, based on the spherical power in diopters (D): low myopia (less than 3.00 D of myopia, i.e. between 0 and -3.00 D), medium myopia (-3.00 D to -6.00 D) and high myopia (more than 6.00 D of myopia, i.e. beyond -6.00 D); low hyperopia (up to +2.00 D), medium hyperopia (+2.25 D to +5.00 D) and high hyperopia (> +5.00 D).(21, 22) Entries that were not valid numeric values, signed positive for hyperopia or negative for myopia, were marked and removed.
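The banding of spherical power can be sketched as below. The function names are hypothetical, and several details are assumptions: the myopia bands are read by magnitude (low is less than 3.00 D of myopia, high is more than 6.00 D), low hyperopia is taken to cover values up to and including +2.00 D, and a sphere of exactly 0 is given a separate label that is not part of the classification in the text.

```python
import math


def parse_sphere(entry: str):
    """Return the spherical power as a float, or None for an invalid entry."""
    try:
        value = float(entry)
    except ValueError:
        return None  # non-numeric entry: would be marked and removed
    return value if math.isfinite(value) else None


def classify_refractive_error(sphere_d: float) -> str:
    """Classify spherical power (D) into myopia and hyperopia grades."""
    if sphere_d < 0:
        magnitude = -sphere_d
        if magnitude < 3.0:
            return "low myopia"
        if magnitude <= 6.0:
            return "medium myopia"
        return "high myopia"
    if sphere_d > 0:
        if sphere_d <= 2.0:
            return "low hyperopia"
        if sphere_d <= 5.0:
            return "medium hyperopia"
        return "high hyperopia"
    return "no spherical error"  # assumption: label not defined in the text
```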
(b) Intraocular pressure
Intraocular pressure (IOP) was classified as low, normal or high for measurement values of < 12 mmHg, 12–21 mmHg and > 21 mmHg, respectively.(23) IOP values greater than 80 mmHg were considered erroneous and were excluded, because the maximum values that can be recorded by applanation tonometry and by the noncontact tonometer (NCT) are 80 and 60 mmHg, respectively.
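A minimal sketch of the IOP rule follows. The function name is hypothetical, and rejecting non-positive values is an added assumption; the text only specifies the 80 mmHg upper limit.

```python
IOP_MAX_RECORDABLE_MMHG = 80.0  # upper limit of applanation tonometry


def classify_iop(iop_mmhg: float):
    """Return 'low', 'normal' or 'high', or None for an excluded value."""
    # Values above 80 mmHg are treated as erroneous and excluded.
    # Non-positive values are also rejected here (assumption, not in the text).
    if iop_mmhg <= 0 or iop_mmhg > IOP_MAX_RECORDABLE_MMHG:
        return None
    if iop_mmhg < 12:
        return "low"
    if iop_mmhg <= 21:
        return "normal"
    return "high"
```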
(c) Central corneal thickness
Central corneal thickness (CCT) was classified as very thin, thin, average, thick or very thick for values of < 510 µm, 510–540 µm, 541–560 µm, 561–600 µm and > 600 µm, respectively.(24) Values < 300 µm or > 700 µm were marked as outliers and either corrected with reference to the longitudinal data, if available, or flagged and removed.
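The CCT bands and outlier limits can be sketched in the same way. The function name is hypothetical, and the sketch only flags outliers by returning `None`; correction against longitudinal data is a manual step that is not modelled. Fractional values between bands (for example 540.5 µm) are assigned to the next band up, which is an assumption.

```python
CCT_MIN_UM = 300.0  # values below this are outliers
CCT_MAX_UM = 700.0  # values above this are outliers


def classify_cct(cct_um: float):
    """Return the CCT category, or None if the value is an outlier."""
    if cct_um < CCT_MIN_UM or cct_um > CCT_MAX_UM:
        return None  # outlier: to be corrected or flagged and removed
    if cct_um < 510:
        return "very thin"
    if cct_um <= 540:
        return "thin"
    if cct_um <= 560:
        return "average"
    if cct_um <= 600:
        return "thick"
    return "very thick"
```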
iv. Systemic diseases
The details of systemic disease(s) were converted into standard terminology, such as diabetes mellitus, hypertension, coronary artery disease, renal stones, asthma etc. These labels improve the search and selection functions in the web portal.
v. Prescription drugs
The EMR database lists the prescribed ophthalmic and non-ophthalmic drugs by their brand names. Generic names were appended to the brand names for all prescribed drugs. Further, the drugs were grouped by their therapeutic or pharmacological class based on the WHO/ATC classification. For combination drugs too, the generic names and the therapeutic or pharmacological classes were appended. This makes it possible to identify cohorts that were not treated with the same generic or brand-name drug, but with drugs that have a similar mechanism of action or similar pharmacological properties.
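The value of appending generic names and ATC classes can be illustrated with a small lookup. The two entries below are illustrative examples chosen for this sketch and are not taken from the Ophthatome™ drug list; the data structure and the function name `brands_in_class` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugAnnotation:
    generic_name: str
    atc_code: str  # WHO/ATC code; its prefixes identify the drug class


# Illustrative entries only: brand name -> appended annotation.
DRUG_ANNOTATIONS = {
    "xalatan": DrugAnnotation("latanoprost", "S01EE01"),
    "timoptic": DrugAnnotation("timolol", "S01ED01"),
}


def brands_in_class(atc_prefix: str):
    """Return brand names whose ATC code falls under the given class prefix."""
    return sorted(
        brand
        for brand, annotation in DRUG_ANNOTATIONS.items()
        if annotation.atc_code.startswith(atc_prefix)
    )
```

Selecting on a class prefix such as `S01E` groups both brands, even though they share neither a brand name nor a generic name, while the longer prefix `S01EE` selects only the first.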
vi. Family history
Details of family history of disease(s) were available for 2.2% (12,465/561,466) of the cohort. The diseases were defined as either ophthalmic or systemic, so that cases with a family history of either or both types of disease can be selected.
vii. Diagnostic procedural images
The diagnostic procedural images, which include electroretinogram (ERG), fundus photograph, fundus fluorescein angiography (FFA), optical coherence tomography (OCT), frequency doubling technology perimetry (FDT), Heidelberg retina tomograph (HRT3), nerve fibre analysis (GDX), confocal microperimetry (MAIA), visual field analysis (HFA), Pentacam (anterior eye segment tomography), amplitude scan (ASCAN), specular microscopy, color vision and electrocardiograph (ECG), are all integrated into the Ophthatome™. The images are presented along with the visit dates as deidentified data, retaining the resolution of the original image.
(a) Quantitative values within the procedural images
Quantitative values such as the visual field index (VFI, as a percentage), mean deviation (MD) and pattern standard deviation (PSD), available in the visual field analysis image, and the average retinal nerve fibre layer (RNFL) thickness in micrometres and the cup/disc ratio, available in the OCT image, were read programmatically and integrated into the database.
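The text does not state how the values were read from the images. As one possible approach, the sketch below assumes that the text of a visual field report has already been extracted (for example by OCR) and pulls out VFI, MD and PSD with regular expressions. The report layout, the field names and the function name are all hypothetical.

```python
import re

# Hypothetical patterns for text such as "VFI: 97%" or "MD: -1.52 dB".
PATTERNS = {
    "vfi_percent": re.compile(r"VFI\D*?(\d+(?:\.\d+)?)\s*%"),
    "md_db": re.compile(r"\bMD\b[^\d+-]*([+-]?\d+(?:\.\d+)?)"),
    "psd_db": re.compile(r"\bPSD\b[^\d+-]*([+-]?\d+(?:\.\d+)?)"),
}


def extract_field_values(text: str) -> dict:
    """Return the numeric value for each field, or None if it is not found."""
    values = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        values[name] = float(match.group(1)) if match else None
    return values
```

A field that cannot be located is returned as `None` rather than guessed, so that missing values can be reviewed before integration into the database.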
viii. Visual impairment
Cases were defined as either normal vision or visually impaired based on the latest unaided or best corrected visual acuity. First, the unaided and best corrected visual acuity values were classified as normal, visual impairment, severe visual impairment or blind based on the standard WHO definition.(25) Cases were defined as normal if the unaided visual acuity was classified as ‘normal’. When the unaided visual acuity was classified as visual impairment, severe visual impairment or blind, the classification of the best corrected visual acuity of the better eye was used to assign the visual impairment status of the individual.
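The two-step decision can be sketched as follows, starting from per-eye categories that have already been assigned. The function names are hypothetical, and taking the better eye for the unaided measurement as well is an assumption; the text specifies the better eye only for the best corrected value.

```python
# Categories ordered from least to most severe.
SEVERITY = {
    "normal": 0,
    "visual impairment": 1,
    "severe visual impairment": 2,
    "blind": 3,
}


def better_eye(right: str, left: str) -> str:
    """Return the less severe of the two per-eye categories."""
    return min((right, left), key=SEVERITY.__getitem__)


def individual_status(unaided_right, unaided_left, bcva_right, bcva_left):
    """Assign the individual's category from unaided, then corrected acuity."""
    # Step 1: a normal unaided category settles the case as normal.
    if better_eye(unaided_right, unaided_left) == "normal":
        return "normal"
    # Step 2: otherwise use the best corrected category of the better eye.
    return better_eye(bcva_right, bcva_left)
```

Under this reading, an individual whose vision is fully corrected in at least one eye is returned as `"normal"`, and only impairment that persists after correction counts towards the visually impaired group.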
ix. Longitudinal data
Approximately fifty percent of the overall cohort in the knowledgebase has longitudinal data, i.e. some clinical data recorded for more than one visit. In the Ophthatome™, for subjects with more than one visit, an internal check for at least two datapoints for refraction, glass prescription or slit-lamp examination was used to shortlist the cohort with longitudinal data.
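The internal check can be sketched as below, under one reading of the rule: a subject qualifies if there is more than one visit and at least one of the three examinations has a recorded value at two or more visits. The record layout, the field names and the function name are hypothetical.

```python
LONGITUDINAL_FIELDS = ("refraction", "glass_prescription", "slit_lamp_examination")


def has_longitudinal_data(visits) -> bool:
    """Check for two or more datapoints of one examination type across visits.

    `visits` is a list of dicts, one per visit, mapping a field name to its
    recorded value (a missing key or None means nothing was recorded).
    """
    if len(visits) < 2:
        return False
    return any(
        sum(1 for visit in visits if visit.get(field) is not None) >= 2
        for field in LONGITUDINAL_FIELDS
    )
```

For example, a subject with refraction recorded at two visits qualifies, whereas a subject with refraction at one visit and only a slit-lamp examination at the other does not under this reading.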