The literature search yielded 23 studies that met the inclusion criteria. Figure 1 presents the flow of studies through the screening process. Of note, one study was excluded as it explored internet search use in patients after diagnosis during a period of treatment.18
An overview of the risk of bias assessments with the Newcastle-Ottawa Scale and the Prediction model Risk Of Bias Assessment Tool (PROBAST) can be found in Figs. 2 and 3, respectively. All diagnostic and predictive studies were rated by PROBAST as having ‘high’ or ‘unclear’ overall risk of bias or applicability concerns.
Study participants
Characteristics of included studies can be found in Table 1. The number of participants with a condition of interest ranged from 20 to 11050.19,20 Most studies involved US-based participants, with the exception of one study in Japan, one in the UK, and one involving international participation.21–23 Age of participants in reported studies ranged from 20.6 to 53 years.23,24 The proportion of males in reported studies ranged from 0% to 81%.22,25
A wide range of conditions of interest was examined. Broadly, included conditions encompassed mental health (i.e., schizophrenia spectrum disorder, suicidality, mood disorders, psychosis, eating disorders),19,22,24,26–30 neurologic conditions (i.e., Parkinson’s disease, amyotrophic lateral sclerosis, stroke, sleeping disorders),31–34 malignancies (i.e., pancreatic cancer, lung cancer, paediatric oncology, breast cancer, colon cancer, gynaecological malignancies),23,25,35–37 miscellaneous conditions (i.e., diabetes, coeliac disease, biliary atresia, pyloric stenosis),20,21,38,39 and a variety of primary and secondary care presentations.40,41
Ten studies recruited participants from inpatient or outpatient settings and obtained their consent face-to-face.19,23–27,29,30,38,40 Three studies recruited participants from online research crowdsourcing platforms or advocacy groups.22,28,34 Two studies recruited participants with targeted advertisements on search engines.33,37 Nine studies extracted search data from unique, non-identifiable users from back-end databases (8 Bing, 1 Yahoo! Japan).20,21,31,32,34–36,39,41 In these studies, investigators identified users who had searched for a ‘diagnosis ascertaining query’ (e.g., ‘I have lung cancer’, ‘I have been diagnosed with diabetes’) and extracted their search data from a period of time prior to the diagnosis ascertaining query.
In studies that sought explicit consent, the proportion of individuals who agreed to the use of their personal search data ranged from 20% to 70%.25,34
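The diagnosis ascertaining query approach described above can be illustrated with a minimal sketch. The included studies do not publish their exact matching rules, so the regular expression, the 365-day look-back window and the function names below are all hypothetical choices made for illustration only.

```python
import re
from datetime import datetime, timedelta

# Hypothetical first-person patterns; real studies used their own (unpublished) rules.
DAQ_PATTERNS = [
    re.compile(
        r"\bi have (been diagnosed with )?(?P<condition>lung cancer|diabetes)\b",
        re.IGNORECASE,
    ),
]


def find_daq(queries):
    """Return (timestamp, condition) of the earliest matching query, or None.

    `queries` is an iterable of (datetime, query_text) tuples for one user.
    """
    for ts, text in sorted(queries):
        for pattern in DAQ_PATTERNS:
            match = pattern.search(text)
            if match:
                return ts, match.group("condition").lower()
    return None


def queries_before_daq(queries, lookback_days=365):
    """Return the user's queries in the look-back window before their DAQ."""
    queries = list(queries)
    daq = find_daq(queries)
    if daq is None:
        return []  # user never issued a diagnosis ascertaining query
    daq_time, _ = daq
    start = daq_time - timedelta(days=lookback_days)
    return [(ts, text) for ts, text in sorted(queries) if start <= ts < daq_time]


# Toy search log for a single anonymous user (invented data).
example_log = [
    (datetime(2018, 1, 1), "weather"),
    (datetime(2020, 1, 5), "always thirsty at night"),
    (datetime(2020, 3, 1), "I have been diagnosed with diabetes"),
    (datetime(2020, 3, 2), "diabetes diet"),
]
prior_queries = queries_before_daq(example_log)
```

In this toy log only the "always thirsty at night" query falls inside the window, since the 2018 query is too old and the 2020-03-02 query follows the diagnosis ascertaining query. A self-reported query of this kind is only a proxy for a clinical diagnosis.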
Search engines and extraction
Search engine, data extraction, ground truth determination and outcomes from included studies are detailed in Table 2.
Ten studies used Bing (Microsoft Corporation),20,31–37,39,41 nine studies used Google Search (Google LLC),19,24–27,29,30,38,40 one study used Yahoo! Search Japan (Yahoo Inc.),21 one study utilised data from both Google and Bing,23 and two studies used unspecified general web browsing history, including search history data.22,28 All Bing search data was extracted using proprietary backend techniques by Microsoft employees or investigators. All Google data was downloaded by research participants themselves using Google Takeout and then provided to academic investigators with participants’ explicit informed consent. Two studies extracted general web browsing data using a commercial executable downloaded and installed by participants on their computers (BrowsingHistoryView, Nirsoft);22,28 participants then uploaded the extracted data onto a study website.
The total number of search queries per study ranged from > 30,000 to 591,421.19,40 The duration of time over which individuals’ search data was examined ranged from seven days to two years.23,26,40
Use of machine learning algorithms
Machine learning was utilised in the included studies to (1) develop diagnostic or predictive classifiers based on participant internet search data, or (2) apply natural language processing (NLP) to categorise search queries. Fifteen studies developed machine learning-driven diagnostic or predictive classifiers.20–23,27,28,31–39 A variety of machine learning algorithms were used, including decision tree, gradient boosted, linear regression, logistic regression, random forest, and support vector machine. Two studies utilised NLP for categorisation and linguistic thematic analysis of search queries.29,38
Ground truth and control
The “ground truth” (i.e., disease status) for included diagnostic or predictive studies was determined by three distinct methods. Six studies ascertained diagnoses from patient health records.19,23,24,26,27,40 Ten studies ascertained a “ground truth” from a user’s diagnosis ascertaining query.20,21,23,31,32,34–36,39,41 Finally, seven studies ascertained ground truth from participant survey scores.22,28–30,33,37,38
Regarding control populations, two studies recruited healthy volunteers as a control group for studies of participants with schizophrenia spectrum disorders and/or mood disorders.26,27 Four studies which included unique, non-identifiable Bing users for conditions of interest (stroke, Parkinson’s disease, amyotrophic lateral sclerosis and gynaecological malignancies) also recruited matched control populations.23,31,32,34
Outcomes
Measures of predictive model performance were heterogeneous: studies variously reported area under the receiver operator characteristic curve (AUC), positive predictive value (PPV, i.e., diagnostic precision), sensitivity, and F1 score.
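For readers less familiar with these measures, the following minimal sketch shows how each can be computed from binary labels and classifier scores. It uses the standard textbook definitions and invented toy data; it is not taken from any included study, and the function names are illustrative. The functions assume both classes are present and at least one positive prediction is made, so no division by zero occurs.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Proportion of true cases that the classifier flags."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Positive predictive value (precision): flagged users who are true cases."""
    return tp / (tp + fp)


def f1_score(tp, fp, fn):
    """Harmonic mean of PPV and sensitivity."""
    return 2 * tp / (2 * tp + fp + fn)


def auc(y_true, scores):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 3 cases, 5 non-cases, classified at an arbitrary 0.5 threshold.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```

AUC is threshold-independent, whereas sensitivity, PPV and F1 score depend on the chosen threshold, and PPV additionally depends on how common the condition is in the tested population. This is one reason why values from studies reporting different measures are difficult to compare directly.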
Table 2
Search engine, data extraction, ground truth determination and outcomes of included studies. Note: API - application programming interface; AUC - area under the receiver operator characteristic curve; DAQ - diagnosis ascertaining query; LIWC - linguistic inquiry and word count; PPV - positive predictive value.
Study | Search Engine | Extraction | Total no. of search queries | Duration of search queries | Ground truth | Use of machine learning | Data extraction and labelling of content | Outcomes |
Paparrizos (2016)35 | Bing | Bing proprietary backend extraction | 479787 | Mean 109 days | Diagnosis ascertaining query for pancreatic cancer | Gradient boosted statistical classifier. | Predictive model trained on extracted features including demographics, search characteristics, symptom characteristics, temporal characteristics and risk factors. | AUC. |
White (2017)36 | Bing | Bing proprietary backend extraction | Not stated. | 1 to 52 weeks | Diagnosis ascertaining query for lung carcinoma | Statistical classifier. | Predictive model trained on extracted features including risk factors and symptoms. Symptoms identified via query terms matching a symptom set defined in literature review. | AUC. |
Asch (2019)40 | Google | Google Takeout | 591421 | 7 days | Diagnosis from medical records | No | Manual thematic analysis of search queries. | Health-relatedness of searches and associations between clinical and demographic characteristics and internet search volume and search content. |
Hochberg (2019)20 | Bing | Bing proprietary backend extraction | Not stated. | Up to 1 year | Diagnosis ascertaining query for diabetes | Four prediction models: linear regression, logistic regression, decision trees, and random forest. | Prediction model trained on extracted queries related to 24 diabetes symptoms. | AUC and PPV. |
Lebwohl (2019)39 | Bing | Bing proprietary backend extraction | Not stated. | 10 months | Diagnosis ascertaining query for coeliac disease | Two predictive models: linear regression and random forest. | Extracted queries related to a list of 195 symptoms and their synonyms. | AUC. |
Phillips (2019)25 | Google | Google Takeout | 81725 | 6 months | Not stated. | No | Manual thematic analysis of search queries. | Description of health-related queries prior to diagnosis. |
Youngmann (2019)31 | Bing | Bing proprietary backend extraction | 327492 | Not stated. | Diagnosis ascertaining query for Parkinson’s disease | Automated supervised long short-term memory model and boosted decision forest classification model. | Manual and automatic feature extraction | AUC. |
Birnbaum (2020)27 | Google | Google Takeout | Not stated. | 4 weeks | Psychosis hospitalisation from medical records | Three prediction models: support vector machine, random forest and gradient boosting. | LIWC extraction of linguistic features. Predictive model trained on 123 extracted linguistic and search behaviour features. | AUC. |
Birnbaum (2020)26 | Google | Google Takeout | 405523 | 52 weeks | First psychiatric hospitalisation from medical records | No | LIWC extracted 93 linguistic features. | Between-group and within-group comparisons of linguistic and search behaviour features. |
Hochberg (2020)41 | Bing | Bing proprietary backend extraction | Not stated. | Not stated. | Diagnosis ascertaining query for 20 medical conditions | No | Extracted queries related to relevant symptoms before DAQ. List of relevant symptoms defined by investigators. | Correlation between queries of symptoms and DAQ. |
Kirschenbaum (2020)19 | Google | Google Takeout | > 30000 | 6 months | First episode psychosis from medical records | No | Manual thematic analysis of search queries. | Qualitative description of search queries |
Sadeh-Sharvit (2020)22 | General browsing history | Nirsoft BrowsingHistoryView | Not stated. | Up to 6 months | Stanford-Washington University Eating Disorder Screen Questionnaire | Three prediction models: multiclass predictor, logistic regression and random forest. | Feature extraction in conjunction with model generation. Extracted features include keywords, demographic data and search behaviour. | Sensitivity. |
Schueller (2020)28 | General browsing history | Nirsoft BrowsingHistoryView | Not stated. | At least 10 days | Perceived Barriers to Psychological Treatment Scale; Patient Health Questionnaire | Three predictive models: linear, classification decision tree and random forest. | Predictive model trained on extracted features including online search behaviour, demographics and keywords. | AUC. |
Yom-Tov (2020)37 | Bing | Bing proprietary backend extraction | Not stated. | 3 months | Suspected Cancer Recognition and Referral Questionnaire | Random forest predictive model. | Predictive model trained on extracted features including query terms, frequency of medical symptoms. Symptoms defined by a list of 195 symptoms and layperson descriptions. | AUC. |
Zhang (2020)29 | Google | Google Takeout | 115493 in first round; 122353 in second round | 2 rounds of 2 months | Patient Health Questionnaire-9; Generalised Anxiety Disorder-7 Questionnaire | Google natural language processing API to categorise search queries. | LIWC extraction of linguistic features from search queries and YouTube video titles. Other features extracted included online behaviours and distribution of online activity over a day. | Changes in online behaviours, LIWC attributes, search categories. |
Arean (2021)30 | Google | Google Takeout | 349922 | Not stated. | Suicide Attempt and Self-Injury Count Questionnaire | No | Vector representations of queries derived. Cue terms were derived manually and iteratively. | Changes in online search behaviour prior to a suicide attempt. |
Moon (2021)24 | Google | Google Takeout | 37738 | 3 months | Mood disorder diagnosis from medical records | No | Manual thematic analysis of search queries. | Qualitative description of search queries. |
Shaklai (2021)32 | Bing | Bing proprietary backend extraction | Not stated. | 30 days | Diagnosis ascertaining query for stroke | Random forest predictive model | Predictive model trained on extracted features chosen to represent cognitive ability including search behaviour, frequency, spelling and mouse positioning. | AUC. |
Yamaguchi (2021)21 | Yahoo! Japan | Not stated. | Not stated. | Not stated. | Diagnosis ascertaining query for biliary atresia or pyloric stenosis | Logistic regression predictive model. | Predictive model trained on extracted search queries. | Sensitivity. |
Zaman (2021)38 | Google | Google Takeout | Not stated. | Not stated. | Promote Health Survey | Two predictive models: logistic regression and support vector machine. Google natural language processing API to classify queries. | LIWC extraction of linguistic features. Predictive models trained on extracted linguistic features and search category features. | F1 score and differences in linguistic attributes and search behaviours. |
CohenZion (2022)33 | Bing | Bing proprietary backend extraction | Not stated. | Up to 1 year | Digital Sleep Questionnaire | Random forest predictive model. | Predictive model trained on search activity and keywords and demographics. | AUC. |
Yom-Tov (2023)34 | Bing | Bing proprietary backend extraction | Not stated. | Not stated. | Diagnosis ascertaining query for ALS. Online individuals who self-identified as having ALS. | Random forest predictive model. | Predictive model trained on session duration, use of autocorrect and search query characteristics. | AUC. |
Barcroft (2024)23 | Google and Bing | Google Takeout; Bing proprietary backend extraction | 519,048 | 2 years | Malignant or benign diagnosis from medical records. | Gradient boosting predictive model. | Predictive models trained on number of search terms in each keyword category or vector-space model of words and word pairs. | AUC. |
Diagnostic or predictive value
The reported predictive or diagnostic value of included studies can be found in Table 3.
AUC was reported by ten studies with wide variability in values, ranging from < 0.53 (identifying users with sustained interest in coeliac disease) to > 0.99 (detection of impending stroke).32,39 Three studies noted that classifier AUC increased as search data came closer to the time of diagnosis of malignancies, including pancreatic cancer (0.832 to 0.911 from 21 weeks to 1 week prior to the diagnosis ascertaining query),35 lung cancer (0.855 to 0.942 from 52 weeks to 1 week prior to the diagnosis ascertaining query),36 and differentiating between malignant and benign gynaecological disease (0.64 to 0.74 from 1 year to 60 days prior to specialist referral).23
Sensitivity was reported by three studies, and ranged from 0.44 for predicting eating disorder risk status to 0.81 for the detection of pyloric stenosis in infants from their parents’ search data.21,22 F1 score was reported by two studies, and ranged from 0.36 for the detection of relapse of schizophrenia spectrum disorders to 0.80 for the likelihood of the presence of intimate partner violence.27,38
Seven studies utilised more than one machine learning algorithm for the same dataset, noting variability in classification performance.20,22,27,28,31,38,39 For example, Birnbaum et al. noted differences in performance of support vector machine, gradient boosted, and random forest classifier models for both diagnosis and relapse prediction of schizophrenia spectrum disorders (AUC 0.66 to 0.74 and 0.69 to 0.71, for diagnosis and relapse, respectively).27
Associations and predictive features
Results of the linguistic, temporal and other associations in the included studies can be found in Supplementary Table 1.
Linguistic associations
Twenty studies observed specific linguistic themes among study participants’ health- and symptom-related searches prior to diagnosis. Asch et al. noted that 63% of people presenting to the emergency department searched proportionally more health-related queries prior to presentation compared to baseline.40 Similarly, Phillips et al. found that 13% of total Google searches by parents of paediatric cancer patients were health-related (of which 31% were symptom or disease related, 29% were hospital logistics related, and 18% were cancer-specific).25 Additionally, the study authors noted that parents’ health-related searches increased prior to diagnosis, from a mean of 0.2 searches/day six months prior to diagnosis to 1.2 searches/day at diagnosis.
Search queries for physical symptoms appeared related to conditions of interest. Hochberg et al. noted search keywords related to symptoms (“impotence”, “malaise”, “polyuria” and “thirst”) were significantly associated with a diagnosis of diabetes.20 Similarly, Lebwohl et al. found that search queries related to the symptom “diarrhoea” were most strongly associated with coeliac disease.39 Finally, Barcroft et al. noted that for patients with suspected gynaecological malignancies, gastrointestinal and pain-related symptoms were apparent up to one year prior to specialist referral and further investigation, whereas urinary, bleeding-related, bloating and other gynaecological and menopausal symptoms became apparent later, up to 70 days prior to referral and diagnosis.23
Qualitative thematic assessments of search queries were also reported in studies of individuals with mental health conditions. Kirschenbaum et al. noted that certain search keyword themes, including “delusions”, “negative symptoms”, “thought process”, “mental health”, “illicit drugs”, “help seeking”, and “suicide”, were searched by a significant proportion of patients prior to a first presentation of psychosis.19 Moon et al. identified themes in the searches of participants hospitalised with suicidal thoughts and behaviours, which included “suicide”, “help seeking”, “substance abuse”, and “mood and anxiety symptoms”.24 Furthermore, Birnbaum et al. demonstrated that the keyword themes most significantly predictive for diagnosis and relapse of schizophrenia spectrum disorders included “inhibition”, “positive affect”, “anxiety”, “sexual”, “health”, “hear”, “anger”, “sadness”, and “perception”.27 Additionally, Sadeh-Sharvit et al. noted that search queries containing themes related to “mental health”, “treatment”, “eating disorder”, “diet”, “body”, “imagery”, “food”, “sex”, “clothing”, “allergy”, and “stimulant drugs” were queried by between 12.2% and 98.6% of users classified with clinical or subclinical eating disorders.22 Finally, Arean et al. noted that searches related to suicide were found in some individuals, but that their proximity to the suicide attempt varied widely.30
Finally, significant linguistic shifts were noted in several included studies. Birnbaum et al. found significant differences in the use of pronouns and lack of punctuation prior to hospitalisation for participants with schizophrenia spectrum disorders and mood disorders, compared to healthy volunteers.26 Similarly, Zaman et al. noted significantly higher “I” and “pronoun” usage for participants experiencing intimate partner violence compared to participants who showed no signs of intimate partner violence.38
Temporal associations
Differences in frequency of search activity were also noted by studies. Birnbaum et al. found significantly lower overall search frequency for participants with schizophrenia and non-psychotic mood disorders up to one year prior to hospitalisation, compared to healthy volunteers.26 Sadeh-Sharvit et al. also noted median web browsing frequency of participants as a predictive feature for eating disorders.22
Changes in timing of search activity were also noted by studies. Birnbaum et al. noted significant shifts in timing of search activity in participants with schizophrenia spectrum disorders and mood disorders prior to hospitalisation compared to healthy volunteers.26 Similarly, Zhang et al. found significant correlations between increased night-time search activity and both PHQ-9 and GAD-7 scores for depression and anxiety, respectively.29 Additionally, Schueller et al. noted diurnal activity was a predictive feature for increased PHQ-9 score.28 Finally, Zaman et al. noted increased search frequency during the weekends was associated with participants at higher risk of experiencing intimate partner violence compared to participants who showed no signs of intimate partner violence.38
Other associations
Shaklai et al. noted the most predictive features for impending stroke were associated with cognitive function, including the number of queries per session, number of spelling mistakes and the number of repeated queries.32
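Session-level behavioural features of this kind can be derived from a timestamped query log. The sketch below is illustrative only: the 30-minute inactivity gap used to delimit a session, the normalisation used to detect repeated queries and all function names are assumptions, not the definitions used by Shaklai et al. Counting spelling mistakes would additionally require a dictionary or spell-checker and is omitted.

```python
from datetime import datetime, timedelta


def split_sessions(queries, gap=timedelta(minutes=30)):
    """Group (timestamp, text) queries into sessions separated by inactivity.

    A new session starts when the time since the previous query exceeds `gap`
    (an assumed threshold, not one reported in the included studies).
    """
    sessions = []
    for ts, text in sorted(queries):
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, text))
        else:
            sessions.append([(ts, text)])
    return sessions


def session_features(queries, gap=timedelta(minutes=30)):
    """Compute queries per session and the number of repeated queries."""
    queries = list(queries)
    sessions = split_sessions(queries, gap)
    if not sessions:
        return {"n_sessions": 0, "mean_queries_per_session": 0.0, "repeated_queries": 0}
    seen = set()
    repeated = 0
    for _, text in sorted(queries):
        key = " ".join(text.lower().split())  # ignore case and extra whitespace
        if key in seen:
            repeated += 1
        seen.add(key)
    return {
        "n_sessions": len(sessions),
        "mean_queries_per_session": len(queries) / len(sessions),
        "repeated_queries": repeated,
    }


# Toy log (invented): three queries in one session, one query two hours later.
toy_log = [
    (datetime(2021, 6, 1, 10, 0), "headache"),
    (datetime(2021, 6, 1, 10, 5), "Headache "),
    (datetime(2021, 6, 1, 10, 20), "sudden headache"),
    (datetime(2021, 6, 1, 12, 0), "headache"),
]
features = session_features(toy_log)
```

Here the toy log yields two sessions with a mean of two queries each, and two repeated queries. Features such as these could then be supplied to a classifier alongside other search characteristics.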
Table 3
Findings of predictive or diagnostic value of included studies. Note: DAQ - diagnosis ascertaining query; DT - decision tree; FPR - false positive rate; GB - gradient boosted; PBPT - perceived barriers to psychological treatment; PHQ - patient health questionnaire; RF - random forest; SVM - support vector machine.
Study | Condition of interest | Area Under the Receiver Operator Characteristic Curve | Positive Predictive Value | Sensitivity | F1 score |
Paparrizos (2016)35 | Pancreatic cancer | 0.91 (1 week before DAQ); 0.83 (21 weeks before DAQ) | Not stated. | Not stated. | Not stated. |
White (2017)36 | Lung cancer | 0.94 (1 week before DAQ); 0.87 (52 weeks before DAQ) | Not stated. | Not stated. | Not stated. |
Hochberg (2019)20 | Diabetes | 0.93 (RF); 0.92 (logistic regression); 0.89 (linear regression); 0.86 (DT) | 0.16–0.28 (at FPR 0.01) | Not stated. | Not stated. |
Lebwohl (2019)39 | Coeliac disease | < 0.53 (both linear regression and RF) | Not stated. | Not stated. | Not stated. |
Youngmann (2019)31 | Parkinson’s disease | 0.93 | 0.25 | Not stated. | Not stated. |
Birnbaum (2020)27 | Schizophrenia spectrum disorders | 0.74 (diagnostic RF); 0.68 (diagnostic GB); 0.66 (diagnostic SVM); 0.69 (relapse RF); 0.71 (relapse GB); 0.71 (relapse SVM) | Not stated. | 0.73 (diagnostic RF); 0.65 (diagnostic GB); 0.65 (diagnostic SVM); 0.61 (relapse RF); 0.65 (relapse GB); 0.63 (relapse SVM) | 0.54 (diagnostic RF); 0.47 (diagnostic GB); 0.49 (diagnostic SVM); 0.53 (relapse RF); 0.57 (relapse GB); 0.36 (relapse SVM) |
Sadeh-Sharvit (2020)22 | Eating disorders | Not stated. | Not stated. | 0.526 (multiclass); 0.44 (logistic regression) | Not stated. |
Schueller (2020)28 | Perceived barriers to psychological treatment | 0.86 (PBPT score); 0.65 (PHQ score) | Not stated. | Not stated. | Not stated. |
Yom-Tov (2020)37 | Lung cancer; breast cancer; colon cancer | 0.66 (total); 0.74 (colon cancer); 0.56 (lung cancer); 0.50 (breast cancer) | Not stated. | Not stated. | Not stated. |
Shaklai (2021)32 | Impending stroke | > 0.99 | Not stated. | Not stated. | Not stated. |
Yamaguchi (2021)21 | Biliary atresia and pyloric stenosis | Not stated. | Not stated. | 0.79 (biliary atresia); 0.81 (pyloric stenosis) | Not stated. |
Zaman (2021)38 | Intimate partner violence | Not stated. | Not stated. | Not stated. | 0.66 (logistic regression); 0.80 (SVM) |
CohenZion (2022)33 | Sleep disorders | 0.62 (insufficient sleep syndrome); 0.65 (delayed sleep phase syndrome); 0.69 (obstructive sleep apnoea); 0.69 (insomnia) | Not stated. | Not stated. | Not stated. |
Yom-Tov (2023)34 | Amyotrophic lateral sclerosis | 0.69 (90 days before DAQ); 0.73 (30 days before DAQ); 0.81 (all data); 0.74 (validation with prospective data) | Not stated. | Not stated. | Not stated. |
Barcroft (2024)23 | Gynaecological cancer | 0.64 (1 year); 0.74 (60 days); 0.82 (sample size-adjusted) | Not stated. | Not stated. | Not stated. |