The network analysis
A total of eight clusters were created as shown in Fig. 1, and a sample of PageRank scores of each cluster is listed in Table 1.
Table 1
| C0 keyword | C0 PageRank | C1 keyword | C1 PageRank | C2 keyword | C2 PageRank | C3 keyword | C3 PageRank |
|---|---|---|---|---|---|---|---|
| Genome | 0.0167 | gene | 0.0244 | nano gram | 0.0159 | WGS | 0.0087 |
| SNP | 0.0148 | mRNA | 0.0067 | tumor | 0.0105 | Escherichia | 0.0084 |
| Disease | 0.0128 | qPCR | 0.0066 | therapy | 0.0076 | bacteria | 0.0061 |
| Allele | 0.0119 | microarray | 0.0052 | EGFR | 0.0056 | pathogen | 0.0053 |
| Clinician | 0.0097 | Arabidopsis | 0.0048 | IHC | 0.0043 | Mycobacterium | 0.0051 |
| Genomics | 0.0084 | geNorm | 0.0048 | KRAS | 0.0041 | MLST | 0.0047 |
| Illumina | 0.0075 | gene normalization | 0.0047 | NSCLC | 0.0041 | NCBI | 0.0039 |
| Bayesian | 0.0072 | NormFinder | 0.0045 | targeted therapy | 0.0040 | MiSeq | 0.0038 |
| genetics | 0.0058 | cDNA | 0.0042 | amplicon | 0.0037 | Pseudomonas | 0.0038 |
| bioinformatics | 0.0058 | miRNA | 0.0035 | tumor DNA | 0.0037 | Streptococcus | 0.0037 |

| C4 keyword | C4 PageRank | C5 keyword | C5 PageRank | C6 keyword | C6 PageRank | C7 keyword | C7 PageRank |
|---|---|---|---|---|---|---|---|
| CpG | 0.0053 | rRNA | 0.0042 | protein | 0.0152 | diagnosis | 0.0077 |
| WHO | 0.0051 | nucleotide | 0.0038 | biomarker | 0.0095 | CNV | 0.0056 |
| DNA methylation | 0.0040 | GenBank | 0.0037 | proteomics | 0.0087 | genomic hybridization | 0.0054 |
| methylation | 0.0037 | codon | 0.0036 | algorithm | 0.0075 | genomic DNA | 0.0047 |
| MGMT | 0.0036 | genotyping | 0.0033 | peptide | 0.0063 | STR | 0.0032 |
| inhibitor | 0.0036 | mitochondrial genome | 0.0031 | database | 0.0060 | chromosome | 0.0031 |
| AML | 0.0036 | mtDNA | 0.0031 | knowledge | 0.0047 | BAC | 0.0030 |
| ROC | 0.0034 | phylogenetic | 0.0031 | reproducibility | 0.0046 | aCGH | 0.0029 |
| TMZ | 0.0032 | tRNA | 0.0031 | FDR | 0.0040 | MLPA | 0.0028 |
| IDH | 0.0031 | RNA | 0.0029 | measurement | 0.0040 | haplotype | 0.0028 |
In C0, terms related to genetic materials (e.g., “genome”, “SNP”), clinical terminology (e.g., “disease”), and technology (e.g., “Illumina”, “Bayesian”) are clustered. In C1, genetic materials (“gene”, “mRNA”, and “cDNA”) and gene analysis techniques (“qPCR” and “NormFinder”) are clustered. In C2, the term “nano gram” and oncology-related keywords (“tumor”, “IHC”, “NSCLC”, and “tumor DNA”) are grouped together. In C3, terms related to pathogens are clustered (“Escherichia”, “bacteria”, “pathogen”, and “Mycobacterium”). In C4, DNA methylation-related terms are clustered (“CpG”, “DNA methylation”, “methylation”, and “MGMT”), and in C5, gene-related or phylogenetic terms (“mitochondrial genome”, “mtDNA”, “phylogenetic”, “tRNA”, and “RNA”). In C6, proteomics terms are clustered (“protein”, “biomarker”, “proteomics”, “algorithm”, and “peptide”). In the last cluster, C7, genetic technology and clinically relevant terms are present (“diagnosis”, “CNV”, and “genomic hybridization”).
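To make the PageRank scores in Table 1 more concrete, the following is a minimal sketch of power-iteration PageRank on a small keyword network. It is not the authors' pipeline: the undirected co-occurrence graph, the edge weights in `toy_edges`, the damping factor of 0.85, and the function name `pagerank` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, weighted graph.

    edges: iterable of (node_a, node_b, weight) tuples with positive weights.
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    # Build a symmetric weighted adjacency map.
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w

    nodes = sorted(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    # Total edge weight attached to each node.
    strength = {node: sum(neighbors[node].values()) for node in nodes}

    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor passes on a share of its rank
            # proportional to the weight of the connecting edge.
            incoming = sum(
                rank[other] * w / strength[other]
                for other, w in neighbors[node].items()
            )
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical co-occurrence counts, for illustration only.
toy_edges = [
    ("genome", "SNP", 5),
    ("genome", "disease", 3),
    ("SNP", "allele", 2),
    ("disease", "allele", 1),
]
scores = pagerank(toy_edges)
for keyword, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{keyword}: {score:.4f}")
```

Because the scores sum to 1 across all keywords in the network, individual values become small as the network grows, which is consistent with the magnitudes listed in Table 1.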
Period Analysis based on publication years
We identified three main inflection points for each similarity analysis; these points signify changing trends. In Fig. 2A (1-year windows), the inflection points emerged in 2003–2004 (similarity S = .294), 2012–2013 (S = .485), and 2017–2018 (S = .541), at which point the trend started to plateau. In Fig. 2B (2-year windows), inflection points were identified in 2002:2003–2003:2004 (S = .518), 2011:2012–2012:2013 (S = .684), and 2016:2017–2017:2018 (S = .736). In Fig. 2C (3-year windows), they were observed in 2001:2003–2002:2004 (S = .612), 2010:2012–2011:2013 (S = .770), and 2015:2017–2016:2018 (S = .798). The similarity scores for each period analysis are shown in Table 2.
Table 2
Similarity results based on different year ranges
| Year (1 year) | Similarity (1 year) | Year (2 years) | Similarity (2 years) | Year (3 years) | Similarity (3 years) |
|---|---|---|---|---|---|
| 2000–2001 | 0.267 | 2000:2001–2001:2002 | 0.515 | 2000:2002–2001:2003 | 0.649 |
| 2001–2002 | 0.274 | 2001:2002–2002:2003 | 0.572 | 2001:2003–2002:2004 | 0.612 |
| 2002–2003 | 0.298 | 2002:2003–2003:2004 | 0.518 | 2002:2004–2003:2005 | 0.652 |
| 2003–2004 | 0.294 | 2003:2004–2004:2005 | 0.566 | 2003:2005–2004:2006 | 0.695 |
| 2004–2005 | 0.396 | 2004:2005–2005:2006 | 0.622 | 2004:2006–2005:2007 | 0.735 |
| 2005–2006 | 0.377 | 2005:2006–2006:2007 | 0.624 | 2005:2007–2006:2008 | 0.710 |
| 2006–2007 | 0.389 | 2006:2007–2007:2008 | 0.605 | 2006:2008–2007:2009 | 0.717 |
| 2007–2008 | 0.387 | 2007:2008–2008:2009 | 0.628 | 2007:2009–2008:2010 | 0.732 |
| 2008–2009 | 0.393 | 2008:2009–2009:2010 | 0.651 | 2008:2010–2009:2011 | 0.745 |
| 2009–2010 | 0.434 | 2009:2010–2010:2011 | 0.640 | 2009:2011–2010:2012 | 0.708 |
| 2010–2011 | 0.439 | 2010:2011–2011:2012 | 0.624 | 2010:2012–2011:2013 | 0.770 |
| 2011–2012 | 0.397 | 2011:2012–2012:2013 | 0.684 | 2011:2013–2012:2014 | 0.761 |
| 2012–2013 | 0.485 | 2012:2013–2013:2014 | 0.692 | 2012:2014–2013:2015 | 0.757 |
| 2013–2014 | 0.486 | 2013:2014–2014:2015 | 0.682 | 2013:2015–2014:2016 | 0.766 |
| 2014–2015 | 0.480 | 2014:2015–2015:2016 | 0.689 | 2014:2016–2015:2017 | 0.775 |
| 2015–2016 | 0.491 | 2015:2016–2016:2017 | 0.716 | 2015:2017–2016:2018 | 0.798 |
| 2016–2017 | 0.503 | 2016:2017–2017:2018 | 0.736 | 2016:2018–2017:2019 | 0.796 |
| 2017–2018 | 0.541 | 2017:2018–2018:2019 | 0.737 | 2017:2019–2018:2020 | 0.774 |
| 2018–2019 | 0.544 | 2018:2019–2019:2020 | 0.675 | | |
| 2019–2020 | 0.349 | | | | |
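The sliding-window comparison behind Table 2 can be sketched as follows. This section does not state which similarity measure was used, so cosine similarity over keyword-frequency vectors is an assumption here, as are the helper names (`cosine`, `window`, `sliding_similarity`) and the toy counts in `toy_freq`. Each window of a given width is compared with the window that starts one year later, which reproduces the row layout of Table 2.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two keyword-frequency Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def window(freq_by_year, start, width):
    """Sum keyword counts over the years start .. start + width - 1."""
    total = Counter()
    for year in range(start, start + width):
        total.update(freq_by_year.get(year, {}))
    return total


def sliding_similarity(freq_by_year, width):
    """Compare each window with the window that starts one year later."""
    years = sorted(freq_by_year)
    results = []
    for start in range(years[0], years[-1] - width + 1):
        left = window(freq_by_year, start, width)
        right = window(freq_by_year, start + 1, width)
        label = f"{start}:{start + width - 1}-{start + 1}:{start + width}"
        results.append((label, cosine(left, right)))
    return results


# Hypothetical keyword counts per publication year, for illustration only.
toy_freq = {
    2000: {"gene": 4, "genome": 1},
    2001: {"gene": 3, "SNP": 2},
    2002: {"SNP": 3, "WGS": 2},
}
one_year = sliding_similarity(toy_freq, width=1)
two_year = sliding_similarity(toy_freq, width=2)
for label, score in one_year + two_year:
    print(f"{label}: {score:.3f}")
```

With yearly data from 2000 to 2020, this scheme yields 20, 19, and 18 comparisons for widths of 1, 2, and 3 years, matching the number of filled rows in each column of Table 2. Wider windows share more years between neighbors, which is one plausible reason the 2-year and 3-year similarities are higher overall.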
Content Analysis
The combined frequencies of keywords belonging to each category of CAT1 and CAT2 were computed. Each keyword belongs to only one category.
In CAT1, Genetics had the highest frequency (n = 8,777, 54.1%), followed by Medicine (n = 2,856, 17.6%), Proteomics (n = 2,257, 13.9%), General (n = 992, 6.1%), Biology (n = 707, 4.3%), and Statistics (n = 624, 3.81%).
In CAT2, Gene had the highest frequency (n = 3,276, 20.2%), followed by Genetics terminology (n = 3,019, 18.6%), Methods (n = 1,725, 10.6%), Database/Software (n = 1,393, 8.59%), Disease (n = 1,204, 7.42%), Clinical (n = 1,103, 6.8%), Proteomics (n = 1,034, 6.37%), Pathogen (n = 1,006, 6.2%), Statistics (n = 720, 4.44%), Biologicals (n = 707, 4.36%), Company/Consortium (n = 536, 3.3%), and Organism (n = 490, 3.02%).
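The percentages above are each category's share of the total keyword count. As a small worked check, the sketch below recomputes the CAT2 shares from the counts reported in the text; the variable names are illustrative, and differences in the last digit may simply reflect rounding versus truncation in the original.

```python
# CAT2 keyword counts as reported in the text above.
cat2_counts = {
    "Gene": 3276,
    "Genetics terminology": 3019,
    "Methods": 1725,
    "Database/Software": 1393,
    "Disease": 1204,
    "Clinical": 1103,
    "Proteomics": 1034,
    "Pathogen": 1006,
    "Statistics": 720,
    "Biologicals": 707,
    "Company/Consortium": 536,
    "Organism": 490,
}

total = sum(cat2_counts.values())

# Share of each category as a percentage of all CAT2 keyword occurrences.
shares = {name: 100.0 * n / total for name, n in cat2_counts.items()}

print(f"total keyword count: {total}")
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} n = {cat2_counts[name]:5d}  {pct:5.2f}%")
```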
We examined the trend of each term from phase 0 to phase 4 in CAT2 as follows:
In [Supplementary file, Figure S2], “Escherichia” showed the highest frequency in phase 2, and “Mycobacterium” in phase 4. In Statistics, “Bayesian” and “algorithm” had their highest frequency in phase 2; the frequency of “algorithm” then decreased steadily until phase 4, whereas the frequency of “Bayesian” increased from phase 3 to 4.
In the Company/Consortium graph, “Illumina” and “Taqman” had their highest frequency at phase 4, and “Illumina” and “ACMG” showed an increasing trend over the whole period. In Database, the term “bioinformatics” showed the highest frequency at phase 4. In Gene, the terms “gene”, “genome”, “allele”, “codon”, “cDNA”, “chromosome”, “DNA”, and “mtDNA” exhibited their highest frequencies at phase 2 and started to decrease in frequency from phase 3 to phase 4.
Terms denoting relatively smaller gene fragments, such as “RNA”, “miRNA”, “rRNA”, “exome”, and “tRNA”, showed an increasing trend from phase 3 to 4. In Software, terms referring to gene quantification software (“NormFinder”, “geNorm”, and “BestKeeper”) were highest in frequency at phase 3, and “ClinGen” showed an increasing trend from phase 3 to 4. In Methods, “WGS”, “GWAS”, and “MiSeq” exhibited an increasing trend from phase 2 and peaked in frequency at phase 4.
On the other hand, “microarray”, “genomic hybridization”, and “gene microarray” showed the highest frequency in phase 2, and “qPCR” peaked in frequency in phase 3. In Clinical, “Clinician”, “therapy”, “diagnosis”, “precision”, “targeted therapy”, and “biopsy” all showed an increasing trend until phase 4. In Disease, the term “disease” and oncology-related terms, such as “tumor”, “NSCLC”, “AML”, “GBM”, “tumor DNA”, and “adenocarcinoma”, showed an increasing trend throughout the phases.
Statistical Analysis
Linear Regression without Phase
To evaluate linear trends, linear regression was conducted on keyword frequencies across publication years from 1975 to 2020. Although frequencies decreased in 2020 in both CAT1 and CAT2, all categories in CAT1 and CAT2 showed high R2 values, ranging from 0.586 (Company/Consortium) to 0.764 (Biology), as shown in Table 3 and Fig. 3. All categories showed a positive linear relationship between keyword frequency and publication year.
Table 3
Linear regression based on keyword frequency in CAT1 and CAT2
| Category set | Category | R2 |
|---|---|---|
| CAT1 | Biology | 0.764 |
| CAT1 | General | 0.587 |
| CAT1 | Genetics | 0.717 |
| CAT1 | Medicine | 0.653 |
| CAT1 | Proteomics | 0.673 |
| CAT1 | Statistics | 0.666 |
| CAT2 | Clinical | 0.657 |
| CAT2 | Company/Consortium | 0.586 |
| CAT2 | Database/software | 0.684 |
| CAT2 | Disease | 0.625 |
| CAT2 | Gene | 0.664 |
| CAT2 | Genetics terminology | 0.740 |
| CAT2 | Metabolite/Biologicals | 0.652 |
| CAT2 | Methods | 0.736 |
| CAT2 | Organism | 0.741 |
| CAT2 | Pathogen | 0.737 |
| CAT2 | Proteomics | 0.678 |
| CAT2 | Statistics | 0.648 |
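For readers who want to see how an R2 value like those in Table 3 is obtained, here is a minimal sketch of an ordinary least-squares fit of keyword frequency against publication year. The function name `linear_fit` and the toy counts are hypothetical; the authors' software and exact data preparation are not described in this section.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical yearly keyword counts with a dip in the final year,
# for illustration only.
years = [2015, 2016, 2017, 2018, 2019, 2020]
counts = [10, 14, 19, 25, 30, 22]

slope, intercept, r_squared = linear_fit(years, counts)
print(f"slope = {slope:.3f} keywords/year, R2 = {r_squared:.3f}")
```

A positive slope corresponds to the increasing relationship described in the text, and a dip in the last year lowers R2 without reversing the sign of the slope.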
Generalized Linear Model within Phase
The linear regression analysis without phase demonstrated a high correlation (R2 ≥ .586). To examine linear trends within each phase, we fitted a generalized linear model (GLM) for each category by phase (Fig. 4, Table 4). No significant linear relationship was found in the CAT1 categories (Table S1), whereas significant relationships were observed in several CAT2 categories (Table 4): Gene (P = .003) and Pathogen (P = .030) were statistically significant in phase 0, and Gene (P = .004) and Proteomics (P = .044) were statistically significant in phase 1. In phase 2, only Proteomics (P = .001) was significant. In phase 3, Proteomics (P = .045) and Software (P = .004) were significant, and in phase 4, only Genetics terminology was significant (P = .039).
Table 4
Generalized linear model results of CAT2 from phase 0 to phase 4.
| Phase | Category | B | SE | t | Sig. | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|---|
| Phase 0 | Biologicals | 0.647 | 0.911 | 0.710 | 0.478 | -1.146 | 2.439 |
| Phase 0 | Clinical | 0.004 | 0.875 | 0.005 | 0.996 | -1.717 | 1.726 |
| Phase 0 | Company/Institute | 0.022 | 1.078 | 0.020 | 0.984 | -2.100 | 2.143 |
| Phase 0 | Data related | 0.089 | 0.991 | 0.090 | 0.928 | -1.861 | 2.040 |
| Phase 0 | Disease | 0.739 | 0.868 | 0.851 | 0.395 | -0.969 | 2.446 |
| Phase 0 | Gene | 2.347 | 0.781 | 3.007 | 0.003** | 0.812 | 3.883 |
| Phase 0 | Genetics terminology | 0.299 | 0.783 | 0.382 | 0.703 | -1.242 | 1.841 |
| Phase 0 | Methods | 0.504 | 0.763 | 0.661 | 0.509 | -0.996 | 2.005 |
| Phase 0 | Organism | 2.338 | 1.226 | 1.907 | 0.057 | -0.074 | 4.749 |
| Phase 0 | Pathogen | 2.036 | 0.933 | 2.182 | 0.030* | 0.200 | 3.873 |
| Phase 0 | Proteomics | 1.160 | 1.180 | 0.983 | 0.326 | -1.161 | 3.481 |
| Phase 0 | Software | -0.315 | 1.281 | -0.246 | 0.806 | -2.835 | 2.205 |
| Phase 1 | Biologicals | 0.146 | 1.305 | 0.112 | 0.911 | -2.422 | 2.714 |
| Phase 1 | Clinical | 0.876 | 1.254 | 0.698 | 0.485 | -1.591 | 3.342 |
| Phase 1 | Company/Institute | 0.935 | 1.544 | 0.606 | 0.545 | -2.103 | 3.974 |
| Phase 1 | Data related | 0.849 | 1.420 | 0.598 | 0.550 | -1.944 | 3.643 |
| Phase 1 | Disease | 1.320 | 1.243 | 1.062 | 0.289 | -1.125 | 3.765 |
| Phase 1 | Gene | 3.256 | 1.118 | 2.912 | 0.004** | 1.056 | 5.456 |
| Phase 1 | Genetics terminology | 0.798 | 1.122 | 0.711 | 0.477 | -1.410 | 3.006 |
| Phase 1 | Methods | 0.487 | 1.093 | 0.445 | 0.656 | -1.663 | 2.636 |
| Phase 1 | Organism | 2.653 | 1.756 | 1.511 | 0.132 | -0.801 | 6.108 |
| Phase 1 | Pathogen | 2.320 | 1.337 | 1.735 | 0.084 | -0.311 | 4.951 |
| Phase 1 | Proteomics | 3.420 | 1.690 | 2.024 | 0.044* | 0.095 | 6.745 |
| Phase 1 | Software | -0.555 | 1.835 | -0.303 | 0.762 | -4.165 | 3.055 |
| Phase 2 | Biologicals | -0.318 | 8.863 | -0.036 | 0.971 | -17.757 | 17.120 |
| Phase 2 | Clinical | -1.951 | 8.514 | -0.229 | 0.819 | -18.704 | 14.801 |
| Phase 2 | Company/Institute | 3.622 | 10.490 | 0.345 | 0.730 | -17.017 | 24.260 |
| Phase 2 | Data related | 3.631 | 9.644 | 0.376 | 0.707 | -15.343 | 22.605 |
| Phase 2 | Disease | 1.089 | 8.441 | 0.129 | 0.897 | -15.519 | 17.697 |
| Phase 2 | Gene | 13.437 | 7.594 | 1.769 | 0.078 | -1.504 | 28.377 |
| Phase 2 | Genetics terminology | 8.486 | 7.622 | 1.113 | 0.266 | -6.511 | 23.483 |
| Phase 2 | Methods | 1.530 | 7.421 | 0.206 | 0.837 | -13.070 | 16.131 |
| Phase 2 | Organism | 8.160 | 11.925 | 0.684 | 0.494 | -15.303 | 31.623 |
| Phase 2 | Pathogen | 3.541 | 9.080 | 0.390 | 0.697 | -14.325 | 21.407 |
| Phase 2 | Proteomics | 38.460 | 11.478 | 3.351 | 0.001** | 15.877 | 61.043 |
| Phase 2 | Software | 11.910 | 12.461 | 0.956 | 0.340 | -12.607 | 36.427 |
| Phase 3 | Biologicals | 0.421 | 7.619 | 0.055 | 0.956 | -14.570 | 15.412 |
| Phase 3 | Clinical | 1.493 | 7.319 | 0.204 | 0.838 | -12.908 | 15.894 |
| Phase 3 | Company/Institute | 3.468 | 9.017 | 0.385 | 0.701 | -14.274 | 21.209 |
| Phase 3 | Data related | 1.631 | 8.290 | 0.197 | 0.844 | -14.680 | 17.941 |
| Phase 3 | Disease | 3.874 | 7.256 | 0.534 | 0.594 | -10.403 | 18.151 |
| Phase 3 | Gene | 11.117 | 6.528 | 1.703 | 0.090 | -1.726 | 23.961 |
| Phase 3 | Genetics terminology | 11.921 | 6.552 | 1.819 | 0.070 | -0.971 | 24.813 |
| Phase 3 | Methods | -0.081 | 6.379 | -0.013 | 0.990 | -12.632 | 12.471 |
| Phase 3 | Organism | 4.938 | 10.251 | 0.482 | 0.630 | -15.232 | 25.107 |
| Phase 3 | Pathogen | 4.255 | 7.806 | 0.545 | 0.586 | -11.103 | 19.614 |
| Phase 3 | Proteomics | 19.860 | 9.867 | 2.013 | 0.045* | 0.446 | 39.274 |
| Phase 3 | Software | 30.785 | 10.712 | 2.874 | 0.004** | 9.709 | 51.861 |
| Phase 4 | Biologicals | 2.348 | 8.725 | 0.269 | 0.788 | -14.818 | 19.514 |
| Phase 4 | Clinical | 9.000 | 8.381 | 1.074 | 0.284 | -7.490 | 25.490 |
| Phase 4 | Company/Institute | 4.385 | 10.326 | 0.425 | 0.671 | -15.931 | 24.700 |
| Phase 4 | Data related | 6.294 | 9.493 | 0.663 | 0.508 | -12.383 | 24.971 |
| Phase 4 | Disease | 7.179 | 8.309 | 0.864 | 0.388 | -9.170 | 23.527 |
| Phase 4 | Gene | 10.745 | 7.475 | 1.437 | 0.152 | -3.963 | 25.452 |
| Phase 4 | Genetics terminology | 15.587 | 7.503 | 2.077 | 0.039* | 0.824 | 30.350 |
| Phase 4 | Methods | 1.537 | 7.305 | 0.210 | 0.833 | -12.835 | 15.909 |
| Phase 4 | Organism | 5.778 | 11.738 | 0.492 | 0.623 | -17.318 | 28.873 |
| Phase 4 | Pathogen | 6.952 | 8.938 | 0.778 | 0.437 | -10.634 | 24.539 |
| Phase 4 | Proteomics | 11.700 | 11.299 | 1.036 | 0.301 | -10.530 | 33.930 |
| Phase 4 | Software | 15.750 | 12.256 | 1.284 | 0.200 | -8.384 | 39.884 |
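The columns of Table 4 are related by standard formulas: t is the coefficient B divided by its standard error SE, and the 95% confidence interval is B plus or minus a critical value times SE. The sketch below checks this on three significant rows copied from Table 4. The critical value of about 1.97 is inferred from the reported intervals and is an assumption; the model's degrees of freedom are not given in this section, and the function names are illustrative.

```python
def t_statistic(b, se):
    """t value for a coefficient: estimate divided by its standard error."""
    return b / se


def confidence_interval(b, se, critical):
    """Symmetric interval b +/- critical * se."""
    return b - critical * se, b + critical * se


def implied_critical(lower, upper, se):
    """Critical value implied by a reported symmetric interval."""
    return (upper - lower) / (2.0 * se)


# (phase, category, B, SE, reported t, reported lower, reported upper),
# copied from Table 4.
rows = [
    ("Phase 0", "Gene", 2.347, 0.781, 3.007, 0.812, 3.883),
    ("Phase 2", "Proteomics", 38.460, 11.478, 3.351, 15.877, 61.043),
    ("Phase 3", "Software", 30.785, 10.712, 2.874, 9.709, 51.861),
]

ASSUMED_CRITICAL = 1.967  # inferred from the table, not stated by the authors

for phase, category, b, se, t_rep, lo_rep, hi_rep in rows:
    t = t_statistic(b, se)
    lo, hi = confidence_interval(b, se, ASSUMED_CRITICAL)
    crit = implied_critical(lo_rep, hi_rep, se)
    print(
        f"{phase} {category}: t = {t:.3f} (reported {t_rep}), "
        f"CI = [{lo:.3f}, {hi:.3f}], implied critical value = {crit:.4f}"
    )
```

A coefficient is significant at the 5% level exactly when its 95% interval excludes zero, which is why every starred row in Table 4 has a positive lower bound.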