Keyword characteristics
Keywords were extracted from 15,855 articles published from 1975 to 2020. Of the 5,639 keywords initially screened, 1,027 were analyzed; the detailed selection procedure is described in the Methods section and shown in Supplementary Fig. S1. After preprocessing, 330 keywords (n = 16,213) with a frequency over 12 were analyzed in detail. The keywords were classified into six academic and twelve technical categories and are described in the SI. Among the six academic categories, Genetics had the most keywords (n = 8,522, 52.1%), followed by Medicine (n = 3,344, 20.4%), Proteomics (n = 1,510, 9.24%), General (n = 1,313, 8.03%), Biology (n = 1,025, 6.27%), and Statistics (n = 624, 3.81%). The twelve technical categories were Gene (n = 3,276, 20.2%), Genetics terminology (n = 3,019, 18.6%), Methods (n = 1,725, 10.6%), Database/Software (n = 1,393, 8.59%), Disease (n = 1,204, 7.42%), Clinical (n = 1,103, 6.8%), Proteomics (n = 1,034, 6.37%), Pathogen (n = 1,006, 6.2%), Statistics (n = 720, 4.44%), Metabolite/Biologicals (n = 707, 4.36%), Company/Consortium (n = 536, 3.3%), and Organism (n = 474, 3.02%).
Network full phase
Fig. 1 displays the network of keywords derived from studies published from 1975 to 2020. Full-phase keywords were clustered by modularity value (0 to 7), and each of the eight clusters is shown in a different color (Fig. 1, Table S2).
In the middle of the network, the group with modularity 0 contained the following top ten keywords by PageRank (PR): “genome” (PR 0.0167), “SNP” (PR 0.0148), “disease” (PR 0.0128), “allele” (PR 0.0119), “clinician”, “genomics”, “Illumina”, “Bayesian”, “genetics”, and “bioinformatics”. For modularity value 1, gene and gene-analysis terms were clustered in the pink group. The term “gene” ranked first by PageRank (PR 0.0244). In the Gene category, “mRNA” ranked highest (PR 0.0067), and “cDNA” and “miRNA” were listed in the top ten. Terms related to gene analysis techniques (0.001 < PR < 0.01, Table S2) were, in order of PageRank, “qPCR”, “microarray”, “geNorm”, “gene normalization”, and “NormFinder”. Medicine and oncology terms were clustered in the dark brown group with a modularity of 2. The top-ranked terms were “ng” (PR 0.0159) and “tumor” (PR 0.0105), followed, with PR below 0.01, by “therapy”, “EGFR”, “IHC”, “KRAS”, “NSCLC”, “targeted therapy”, “amplicon”, and “tumor DNA”. For modularity value 3, pathogen and analysis-related terms were clustered in the sky-blue group; in descending order of PR (0.001 < PR < 0.01, Table S2), these were “WGS”, “Escherichia”, “bacteria”, “pathogen”, “Mycobacterium”, “MLST”, “NCBI”, “MiSeq”, “Pseudomonas”, and “Streptococcus”. For modularity value 4, DNA methylation-related medical terms were clustered in the reddish brown group (0.001 < PR < 0.0055). Keywords included “CpG”, “WHO”, “DNA methylation”, “methylation”, “MGMT”, “inhibitor”, “AML”, “ROC”, “TMZ”, and “IDH”. For modularity value 5, represented by the green cluster in the upper left of Fig. 1, the phylogenetic terms were “rRNA”, “nucleotide”, “GenBank”, “codon”, “genotyping”, “mitochondrial genome”, “mtDNA”, “phylogenetic”, “tRNA”, and “RNA”.
For modularity value 6, proteomics terms were clustered in the light purple group, with “protein” having the highest PR (0.0152). Other keywords included “biomarker”, “proteomics”, “algorithm”, “peptide”, “database”, “knowledge”, “reproducibility”, “FDR”, and “measurement” (PR < 0.01, Table S2). The keywords with modularity value 7 were located between the center of Fig. 1 (the modularity 0 group) and the oncology group on the right, and included “diagnosis”, “CNV”, “genomic hybridization”, “genomic DNA”, “STR”, “chromosome”, “BAC”, “aCGH”, “MLPA”, and “haplotype” (PR < 0.01).
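To illustrate how a PageRank ranking of keywords arises from a co-occurrence network, the following is a minimal sketch and not the procedure used in this study. It assumes an undirected, unweighted graph, a damping factor of 0.85, and a fixed number of power iterations; the edge list `toy_edges` is hypothetical and only reuses a few keyword names from the modularity 0 group.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, unweighted graph.

    Assumption: every node appears in at least one edge, so no node is dangling.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = sorted(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor passes on its rank, split equally among its own neighbors.
            incoming = sum(rank[m] / len(neighbors[m]) for m in neighbors[node])
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical co-occurrence pairs (illustration only, not the study's data).
toy_edges = [
    ("genome", "SNP"), ("genome", "disease"), ("genome", "allele"),
    ("genome", "genomics"), ("SNP", "allele"), ("disease", "clinician"),
]
ranks = pagerank(toy_edges)
top_keywords = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph the best-connected keyword ranks first, and the PR values sum to 1, which is why individual values in a network of several hundred keywords are small.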
Similarity by phase
To observe when the research trend changed, a similarity analysis was performed between years. Based on the inflection points of the similarity results, we set five phases with different similarity patterns. Three similarity series were used: the similarity between consecutive years (Fig. 2a; e.g., 2000–2001), the similarity between consecutive 1-year interval similarities (Fig. 2b; e.g., 2000–2001 and 2001–2002), and the similarity between 2-year interval similarities (Fig. 2c; e.g., 2000–2002 and 2001–2003). Before 2000, designated phase 0, the frequency was extremely low; hence, the similarity analysis was performed from 2000 onward. In phase 0, the yearly frequency was less than 10 from 1975 to 1989 and ranged from 10 to a maximum of 72 from 1990 to 1999. The 1-year similarity showed a steep slope between 2003–2004 and 2004–2005, with an increase of approximately 0.1, from 0.294 to 0.396 (Fig. 2a, Table S3). The graph then fluctuates, with decreasing (2011–2012, 0.396) and increasing (2012–2013, 0.484) similarities. The curve rises smoothly to 0.541 in 2017–2018 and then decreases steeply from 0.544 in 2018–2019 to 0.349 in 2019–2020. The graph of the 1-year interval similarities (Fig. 2b) shows its lowest trough (0.518) at 2002:2003–2003:2004, increases to 0.736 at 2016:2017–2017:2018, and decreases to 0.675 at 2018:2019–2019:2020. In Fig. 2c, the 2-year interval similarity also showed its lowest point at 2001:2003–2002:2004 and then increased to 0.770 at 2010:2012–2011:2013. The subsequent curve is smooth: the similarity is reported as decreasing to 0.774 at 2015:2017–2016:2018 and as 0.774, after a slight decrease, at the end of the graph (2017:2019–2018:2020).
In all three graphs, the similarity trend turned from decreasing to increasing at the values involving 2003–2004. Based on these results, we set four phases from 2000 onward: phase 1 (2000–2003), phase 2 (2004–2012), phase 3 (2013–2016), and phase 4 (2017–2020).
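As an illustration of the year-to-year comparison in Fig. 2a, the sketch below computes a similarity between the keyword frequency profiles of consecutive years. The similarity measure is not stated in this section; cosine similarity is assumed here purely for illustration, and the data in `toy` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two keyword -> frequency mappings."""
    keys = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(k, 0) * freq_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty year shares nothing with any other year
    return dot / (norm_a * norm_b)


def consecutive_similarities(freq_by_year):
    """Similarity for each pair of adjacent years, keyed by (year, next_year)."""
    years = sorted(freq_by_year)
    return {
        (y0, y1): cosine_similarity(freq_by_year[y0], freq_by_year[y1])
        for y0, y1 in zip(years, years[1:])
    }


# Hypothetical keyword counts per year (illustration only).
toy = {
    2000: Counter({"gene": 3, "microarray": 2}),
    2001: Counter({"gene": 3, "microarray": 2}),
    2002: Counter({"SNP": 4, "genome": 1}),
}
sims = consecutive_similarities(toy)
```

Identical keyword profiles give a similarity of 1 and disjoint profiles give 0, so a sharp drop or rise between adjacent pairs marks a candidate inflection point of the kind used to delimit the phases.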
Keyword frequency by period
From 1975 to 2019, the frequency of most analyzed keywords increased annually, even though the frequency graph is not cumulative. In 2020, however, keyword frequencies in all categories decreased to just over half of their 2019 values owing to the COVID-19 pandemic (Fig. 3, upper). Despite the decreased frequency in 2020, the R² values of the linear regression in the large class (the six academic categories) ranged from a minimum of 0.587 in the General category to a maximum of 0.764 in Biology. The Genetics category had the second-highest value (R² = 0.717). In the middle class (the twelve technical categories), the minimum was 0.586 in the Company/Consortium category and the maximum was 0.741 in the Organism category (Table 1). The phase-frequency results show a continuously increasing trend in Medicine and General in the large class, and in Genetics term, Clinical, and Disease in the middle class. Among the large classes, Genetics had the highest frequency, followed by Medicine and Proteomics. Among the subcategories, the Gene category (R² = 0.664) showed the highest frequency in phase 2. However, from phase 2 to phase 4, Genetics term exhibited an increasing trend and had the highest keyword frequency in phases 3 and 4. The Organism (R² = 0.741), Genetics term (R² = 0.740), and Pathogen (R² = 0.737) categories had the best-fitting linear models over the whole phase. In the large class (Fig. 3c), Genetics (n = 2776) and Proteomics (n = 656) recorded their highest frequency in phase 2. In the subcategories (Fig. 3d), Gene (n = 1094), Methods (n = 599), and Proteomics (n = 483) recorded their highest frequency in phase 2, and the Database/Software category recorded its highest frequency in phase 3.
Table 1
Linear regression of keyword frequency in each category in the whole phase
| Category | R² |
|---|---|
| Large class | |
| Biology | 0.764 |
| General | 0.587 |
| Genetics | 0.717 |
| Medicine | 0.653 |
| Proteomics | 0.673 |
| Statistics | 0.666 |
| Middle class | |
| Clinical | 0.657 |
| Company/Consortium | 0.586 |
| Database/software | 0.684 |
| Disease | 0.625 |
| Gene | 0.664 |
| Genetics term | 0.740 |
| Metabolite/Biologicals | 0.652 |
| Methods | 0.736 |
| Organism | 0.741 |
| Pathogen | 0.737 |
| Proteomics | 0.678 |
| Statistics | 0.648 |
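The R² values in Table 1 summarize how well a straight line describes keyword frequency against year. The sketch below shows one way such a value can be computed with ordinary least squares; it is an illustration under stated assumptions, not the authors' analysis script, and the yearly counts are hypothetical.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x and its R²."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R² = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical yearly keyword counts with a sharp drop in the last year.
years = [2015, 2016, 2017, 2018, 2019, 2020]
counts = [40, 52, 61, 70, 83, 41]

_, _, r2_with_drop = linear_fit_r2(years, counts)
_, _, r2_without_drop = linear_fit_r2(years[:-1], counts[:-1])
```

In this toy series, a single low final year reduces R² considerably, which is consistent with the whole-phase values in Table 1 being moderate despite the rising trend up to 2019.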
Phase correlation
To estimate the correlation between phases, we analyzed the phase correlation using keyword frequency. According to the results (Fig. 4, Table S4), the General category did not show any correlation between phases, the Proteomics category showed a significant correlation only between phases 3 and 4 (R² = 0.898), and the Statistics category (pink filled circle in Fig. 4) showed a significant correlation between phases 2 and 3 (R² = 1.000). Biology, Genetics, and Medicine showed strong linear correlations throughout all the phases (Table S4). In these three categories, phase 0 showed a correlation with phases 1 and 2, and no correlation was found with phases 3 and 4 in any category.
General linear model
As the whole-phase linear regression results were significant in all categories (R² ≥ 0.586), we analyzed the general linear model within each phase. The frequency linear models for each category and each phase are shown in Fig. 5. There was no linear correlation among academic categories (Table S5), while linear correlations were observed in several technical categories (Table 2, Table 3). The Gene (p = 0.003**) and Pathogen (p = 0.030*) categories were statistically significant in phase 0 (Table 2), and Gene (p = 0.004**) and Proteomics (p = 0.044*) were statistically significant in phase 1. In phase 2, only the Proteomics category (p = 0.001**) was significant in the general linear model. In phase 3 (Table 3), Proteomics (p = 0.045*) and Software (p = 0.004**) were significant, and in phase 4, only Genetics term was significant (p = 0.039*).
Table 2
General linear model results of the technical categories from phase 0 to phase 2. *p < 0.05, **p < 0.01
| Phase | Category | B | SE | t | Sig. | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|---|
| Phase 0 | Biologicals | 0.647 | 0.911 | 0.710 | 0.478 | -1.146 | 2.439 |
| Phase 0 | Clinical | 0.004 | 0.875 | 0.005 | 0.996 | -1.717 | 1.726 |
| Phase 0 | Company/Institute | 0.022 | 1.078 | 0.020 | 0.984 | -2.100 | 2.143 |
| Phase 0 | Data related | 0.089 | 0.991 | 0.090 | 0.928 | -1.861 | 2.040 |
| Phase 0 | Disease | 0.739 | 0.868 | 0.851 | 0.395 | -0.969 | 2.446 |
| Phase 0 | Gene | 2.347 | 0.781 | 3.007 | 0.003** | 0.812 | 3.883 |
| Phase 0 | Genetics term | 0.299 | 0.783 | 0.382 | 0.703 | -1.242 | 1.841 |
| Phase 0 | Methods | 0.504 | 0.763 | 0.661 | 0.509 | -0.996 | 2.005 |
| Phase 0 | Organism | 2.338 | 1.226 | 1.907 | 0.057 | -0.074 | 4.749 |
| Phase 0 | Pathogen | 2.036 | 0.933 | 2.182 | 0.030* | 0.200 | 3.873 |
| Phase 0 | Proteomics | 1.160 | 1.180 | 0.983 | 0.326 | -1.161 | 3.481 |
| Phase 0 | Software | -0.315 | 1.281 | -0.246 | 0.806 | -2.835 | 2.205 |
| Phase 1 | Biologicals | 0.146 | 1.305 | 0.112 | 0.911 | -2.422 | 2.714 |
| Phase 1 | Clinical | 0.876 | 1.254 | 0.698 | 0.485 | -1.591 | 3.342 |
| Phase 1 | Company/Institute | 0.935 | 1.544 | 0.606 | 0.545 | -2.103 | 3.974 |
| Phase 1 | Data related | 0.849 | 1.420 | 0.598 | 0.550 | -1.944 | 3.643 |
| Phase 1 | Disease | 1.320 | 1.243 | 1.062 | 0.289 | -1.125 | 3.765 |
| Phase 1 | Gene | 3.256 | 1.118 | 2.912 | 0.004** | 1.056 | 5.456 |
| Phase 1 | Genetics term | 0.798 | 1.122 | 0.711 | 0.477 | -1.410 | 3.006 |
| Phase 1 | Methods | 0.487 | 1.093 | 0.445 | 0.656 | -1.663 | 2.636 |
| Phase 1 | Organism | 2.653 | 1.756 | 1.511 | 0.132 | -0.801 | 6.108 |
| Phase 1 | Pathogen | 2.320 | 1.337 | 1.735 | 0.084 | -0.311 | 4.951 |
| Phase 1 | Proteomics | 3.420 | 1.690 | 2.024 | 0.044* | 0.095 | 6.745 |
| Phase 1 | Software | -0.555 | 1.835 | -0.303 | 0.762 | -4.165 | 3.055 |
| Phase 2 | Biologicals | -0.318 | 8.863 | -0.036 | 0.971 | -17.757 | 17.120 |
| Phase 2 | Clinical | -1.951 | 8.514 | -0.229 | 0.819 | -18.704 | 14.801 |
| Phase 2 | Company/Institute | 3.622 | 10.490 | 0.345 | 0.730 | -17.017 | 24.260 |
| Phase 2 | Data related | 3.631 | 9.644 | 0.376 | 0.707 | -15.343 | 22.605 |
| Phase 2 | Disease | 1.089 | 8.441 | 0.129 | 0.897 | -15.519 | 17.697 |
| Phase 2 | Gene | 13.437 | 7.594 | 1.769 | 0.078 | -1.504 | 28.377 |
| Phase 2 | Genetics term | 8.486 | 7.622 | 1.113 | 0.266 | -6.511 | 23.483 |
| Phase 2 | Methods | 1.530 | 7.421 | 0.206 | 0.837 | -13.070 | 16.131 |
| Phase 2 | Organism | 8.160 | 11.925 | 0.684 | 0.494 | -15.303 | 31.623 |
| Phase 2 | Pathogen | 3.541 | 9.080 | 0.390 | 0.697 | -14.325 | 21.407 |
| Phase 2 | Proteomics | 38.460 | 11.478 | 3.351 | 0.001** | 15.877 | 61.043 |
| Phase 2 | Software | 11.910 | 12.461 | 0.956 | 0.340 | -12.607 | 36.427 |
Table 3
General linear model results of the technical categories from phase 3 to phase 4. *p < 0.05, **p < 0.01
| Phase | Category | B | SE | t | Sig. | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|---|
| Phase 3 | Biologicals | 0.421 | 7.619 | 0.055 | 0.956 | -14.570 | 15.412 |
| Phase 3 | Clinical | 1.493 | 7.319 | 0.204 | 0.838 | -12.908 | 15.894 |
| Phase 3 | Company/Institute | 3.468 | 9.017 | 0.385 | 0.701 | -14.274 | 21.209 |
| Phase 3 | Data related | 1.631 | 8.290 | 0.197 | 0.844 | -14.680 | 17.941 |
| Phase 3 | Disease | 3.874 | 7.256 | 0.534 | 0.594 | -10.403 | 18.151 |
| Phase 3 | Gene | 11.117 | 6.528 | 1.703 | 0.090 | -1.726 | 23.961 |
| Phase 3 | Genetics term | 11.921 | 6.552 | 1.819 | 0.070 | -0.971 | 24.813 |
| Phase 3 | Methods | -0.081 | 6.379 | -0.013 | 0.990 | -12.632 | 12.471 |
| Phase 3 | Organism | 4.938 | 10.251 | 0.482 | 0.630 | -15.232 | 25.107 |
| Phase 3 | Pathogen | 4.255 | 7.806 | 0.545 | 0.586 | -11.103 | 19.614 |
| Phase 3 | Proteomics | 19.860 | 9.867 | 2.013 | 0.045* | 0.446 | 39.274 |
| Phase 3 | Software | 30.785 | 10.712 | 2.874 | 0.004** | 9.709 | 51.861 |
| Phase 4 | Biologicals | 2.348 | 8.725 | 0.269 | 0.788 | -14.818 | 19.514 |
| Phase 4 | Clinical | 9.000 | 8.381 | 1.074 | 0.284 | -7.490 | 25.490 |
| Phase 4 | Company/Institute | 4.385 | 10.326 | 0.425 | 0.671 | -15.931 | 24.700 |
| Phase 4 | Data related | 6.294 | 9.493 | 0.663 | 0.508 | -12.383 | 24.971 |
| Phase 4 | Disease | 7.179 | 8.309 | 0.864 | 0.388 | -9.170 | 23.527 |
| Phase 4 | Gene | 10.745 | 7.475 | 1.437 | 0.152 | -3.963 | 25.452 |
| Phase 4 | Genetics term | 15.587 | 7.503 | 2.077 | 0.039* | 0.824 | 30.350 |
| Phase 4 | Methods | 1.537 | 7.305 | 0.210 | 0.833 | -12.835 | 15.909 |
| Phase 4 | Organism | 5.778 | 11.738 | 0.492 | 0.623 | -17.318 | 28.873 |
| Phase 4 | Pathogen | 6.952 | 8.938 | 0.778 | 0.437 | -10.634 | 24.539 |
| Phase 4 | Proteomics | 11.700 | 11.299 | 1.036 | 0.301 | -10.530 | 33.930 |
| Phase 4 | Software | 15.750 | 12.256 | 1.284 | 0.200 | -8.384 | 39.884 |
The Gene category showed a significant fit in the general linear model in phases 0 and 1. The Proteomics category fitted the linear model significantly in phases 1 to 3, and the Genetics term category fitted a linear model significantly only in the latest phase, phase 4.
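The columns of Tables 2 and 3 are linked by standard relations: the t statistic is the coefficient divided by its standard error (t = B / SE), and the 95% confidence interval is B plus or minus a critical value times SE. The following sketch checks these relations for the significant rows; the numbers are copied from the tables, while the function name `check_row` and the idea of backing out the critical value from the reported interval are illustrative additions rather than part of the original analysis.

```python
# (phase, category, B, SE, reported t, CI lower, CI upper), copied from Tables 2 and 3.
rows = [
    ("Phase 0", "Gene", 2.347, 0.781, 3.007, 0.812, 3.883),
    ("Phase 2", "Proteomics", 38.460, 11.478, 3.351, 15.877, 61.043),
    ("Phase 3", "Software", 30.785, 10.712, 2.874, 9.709, 51.861),
    ("Phase 4", "Genetics term", 15.587, 7.503, 2.077, 0.824, 30.350),
]


def check_row(b, se, lower, upper):
    """Return the recomputed t statistic and the CI half-width in units of SE."""
    t_value = b / se
    implied_multiplier = (upper - lower) / (2.0 * se)
    return t_value, implied_multiplier


checks = {
    (phase, category): check_row(b, se, lower, upper)
    for phase, category, b, se, _reported_t, lower, upper in rows
}

for (phase, category), (t_value, multiplier) in checks.items():
    print(f"{phase} {category}: t = {t_value:.3f}, CI half-width / SE = {multiplier:.3f}")
```

The recomputed t values agree with the reported ones to within rounding, and the implied multiplier is close to 1.97 in each row, as expected for a two-sided 95% interval with a large number of degrees of freedom.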
Frequency analysis in a keyword
Detailed frequency analysis results for each keyword are presented in Figs. S2, S3, and S4. Among the biology keywords, “Escherichia” showed the highest frequency, especially in phase 2. “Mycobacterium,” including Mycobacterium tuberculosis and the M. tuberculosis complex, recorded the second-highest frequency in phase 4 (n = 41), close to that of “Escherichia” (n = 44). As Escherichia coli and M. tuberculosis cause several infections, they are the most popular species in medical research. Further, “Escherichia” was the most popular research subject as a bacterial model in functional genomics. Arabidopsis thaliana has been widely studied as a model plant in genomics and recorded its highest frequency in phase 2. Keywords in the Organism/Pathogen graph in Fig. S2, including “Escherichia,” had their highest frequency in phase 2 and showed a decreasing pattern from phase 2 to phase 4. Other high-frequency Organism/Pathogen keywords were “bacteria,” “animal,” “microorganism,” “HeLa,” and “taxon,” including “Candida” and “Streptococcus.”
In the Statistics category, “Bayesian” and “algorithm” had their highest frequency in phase 2. However, the frequency of “algorithm” decreased continuously up to phase 4, whereas that of “Bayesian” increased from phase 3 to phase 4. Among general terms, “nanogram” had the highest frequency, and “database” had the second-highest frequency, showing a drastic decrease in phase 3. Further, “dataset” showed a linearly increasing trend, overtaking “database” in phase 4. “Measurement,” “knowledge,” and “software” displayed high frequency in phase 2, followed by a decrease in phase 4. Among the lower-frequency general terms, “NCBI,” “workflow,” “susceptibility,” and “pubmed” presented a continuously increasing trend throughout the phases. In the Proteomics graph, “protein,” “proteomics,” “peptide,” and “proteome” had their highest frequency in phase 2, in that order of frequency (Fig. S2, Terminology-A).
In the Company/Consortium graph in the Genetics category (Fig. S3), the keywords “Illumina” and “Taqman” had the highest frequency, and “Illumina” and “ACMG” in particular showed an increasing trend throughout the whole period. In the “Database” graph, “bioinformatics” shows the highest frequency and a continuously increasing trend. In the “Genetics” category, the “Gene” (A) graph showed that the “gene” and “genome” keywords were the most frequent, with their highest frequency in phase 2. In the “Genetics” term (A) graph, “SNP,” “genomics,” “gene normalization,” and “genetics” were the highest in order of frequency. In the “Genetics” term graph, “genomics,” “genetics,” “DNA methylation,” “methylation,” and “metagenomics” show a continuously increasing pattern. In the other graph, the sequencing-related software “NormFinder,” “geNorm,” and “BestKeeper” were the highest in order of frequency, with their highest frequency in phase 3. The “CNV” and “ClinGen” keywords showed a continuously increasing trend. In the “Methods” graph, “WGS,” “GWAS,” and “MiSeq” exhibited their highest frequency in phase 4; “microarray,” “genomic hybridization,” and “gene microarray” were highest in phase 2; and “qPCR” was highest in phase 3. In the “Medicine” category, the “disease,” “tumor,” “clinician,” “therapy,” “diagnosis,” “biomarker,” and “EGFR” keywords were the highest in order of frequency (graphs Disease (A), Clinical (A), and Metabolites/Biologicals in Fig. S4). Many keywords in the “Biology,” “Statistics,” “General,” “Proteomics,” and “Genetics” categories show their maximum frequency in phase 2. Most keywords in the “Medicine” category show an increasing trend throughout the five phases.