Network Analysis for Estimating Standardization Trends in Genomics

doi:10.21203/rs.3.rs-657429/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

With the development of genomic biotechnologies such as droplet digital PCR, sequencing devices, and gene analysis software, the clinical application of these newly developed technologies has increased. However, the lack of established international standards for the clinical use of genomics is a concern. To visualize trends in international standards for clinical genomics and to explore subfields in high demand, we performed a social network analysis. We retrieved 16,538 articles using the search keywords “genomics and standard” and “clinical genomic sequence and standard”. All terms extracted from the full-text articles were classified into academic and technical categories and analyzed with a general linear model. The results suggest that research in the software, proteomics, and genetics terminology categories is likely to increase. Our results also indicate that critical international events in genomics, namely the Human Genome Project in 2003, the first US FDA approval of a sequencing device in 2013, and the COVID-19 pandemic in 2020, affected research on standards in genomics. By combining critical social issues with research trends, promising subfields of genomics can be identified and considered when new items are proposed for standardization.

Scientific Communication

Biophysics

Trend analysis

Standards

Genetic informatics

Bioinformatics

Keyword analysis

Network analysis

Genomic sequence analysis

In the early development of medical genetics in the first half of the 20th century, Mendelian inheritance disorders, such as albinism, were studied [1, 2]. Since the mid-twentieth century, human genetics has been developed to scientifically combat eugenics and has been extensively applied to human and medical genetics [3, 4]. Remarkable development in medical genetics, especially in the field of cancer genetics, has occurred since 1980 [5]. In the 21st century, the use of genomic sequencing for diagnostics has grown rapidly after next-generation sequencing (NGS) was approved by the US FDA in 2013 [6]. However, few international standards for the clinical applications of NGS have been established; only guidelines exist [7, 8]. According to Mason et al., after the rapid development of sequencing technologies, standards for DNA and RNA preparation have been well established; however, there are no established international standards or controls for single-cell methods that are applicable to clinics [9]. These authors reported that phased genomic data may be important for a specific type of cancer; however, the data cannot provide sufficient information to understand the disease and its related pathology. Because of the rapid development of genetic analysis technology and its increasing use following the commercialization of advanced technologies, keyword network studies of genomics and standards have been promoted as a way to understand the demand for research items and the trends in emerging international standardization in genomics.

Although network analysis was originally designed for use in physics, it has since been used in various academic fields for scientific and structural insights, including systems biology, genomics [10], public health, medicine [11–15], and music [16]. In genomics, a network displays interactions between genes or between genes and diseases. In keyword analysis, the network displays the closeness between keywords through edge and node connectivity. This network analysis study may identify recent trends in the fluctuation of standards in genomics, thereby pointing to promising standards, especially in medical genetics. Most research trend studies use PubMed, which is the largest public electronic library and database that provides scientific information; therefore, we used MEDLINE data through PubMed to identify research keywords related to “genomics and standards.” To explore promising, high-demand fields and terms in genomics standards, and to estimate the fluctuation of future research trends, we conducted this network study with general linear model analysis. Through this study, we highlighted the terms and subfields with increasing trends in genomics standards and evaluated the growth trend using a linear model.

Keyword characteristics

A total of 15,855 articles from 1975 to 2020 were used to extract keywords, and 5,639 keywords were initially screened, among which 1,027 keywords were analyzed. The analyzed keywords were selected according to the steps shown in Supplementary Fig. S1. Finally, 330 preprocessed keywords with frequencies over 12 (n = 16,213) were analyzed in depth. The detailed procedure of keyword selection is described in the Methods section and Fig. S1. The keywords were classified into six academic and twelve technical categories. The keywords are described in the SI. Among the six academic categories, the Genetics category had the most keywords, with 8,522 keywords (52.1%), followed by the Medicine category with 3,344 keywords (20.4%). The third category was the Proteomics category (n = 1,510, 9.24%), followed by the General (n = 1,313, 8.03%), Biology (n = 1,025, 6.27%), and Statistics (n = 624, 3.81%) categories. The 12 technical categories were the Gene (n = 3,276, 20.2%), Genetics terminology (n = 3,019, 18.6%), Methods (n = 1,725, 10.6%), Database/Software (n = 1,393, 8.59%), Disease (n = 1,204, 7.42%), Clinical (n = 1,103, 6.8%), Proteomics (n = 1,034, 6.37%), Pathogen (n = 1,006, 6.2%), Statistics (n = 720, 4.44%), Metabolite/Biologicals (n = 707, 4.36%), Company/Consortium (n = 536, 3.3%), and Organism (n = 474, 3.02%) categories.

Network full phase

The network built from keywords in studies published from 1975 to 2020 is displayed in Fig. 1, using eight colors that correspond to modularity values 0 to 7. The full-phase keywords were clustered in different colors according to their modularity value (Fig. 1, Table S2).

In the middle of the network, the group with modularity 0 was clustered, with the PageRank (PR) top ten being “genome” (PR 0.0167), “SNP” (PR 0.0148), “disease” (PR 0.0128), “allele” (PR 0.0119), “clinician”, “genomics”, “Illumina”, “Bayesian”, “genetics”, and “bioinformatics”. For modularity value 1 (pink group), gene and gene analysis terms were clustered. The term “gene” ranked highest by PageRank (PR 0.0244). In the Gene category, “mRNA” ranked highest (PR 0.0067), and “cDNA” and “miRNA” were listed in the top ten. Terms related to gene analysis techniques (0.001 < PR < 0.01, Table S2) included “qPCR”, “microarray”, “geNorm”, “gene normalization”, and “NormFinder”, in order of PageRank. The terms of medicine and oncology were clustered in the dark brown group with a modularity of 2. The top-ranked terms were “ng” (PR 0.0159) and “tumor” (PR 0.0105), followed, with PR below 0.01, by “therapy”, “EGFR”, “IHC”, “KRAS”, “NSCLC”, “targeted therapy”, “amplicon”, and “tumor DNA”. For modularity value 3, pathogen and analysis-related terms were clustered in the sky-blue group; in descending order of PR (0.001 < PR < 0.01, Table S2), these were “WGS”, “Escherichia”, “bacteria”, “pathogen”, “Mycobacterium”, “MLST”, “NCBI”, “MiSeq”, “Pseudomonas”, and “Streptococcus”. For modularity value 4, DNA methylation-related medical terms were clustered in the reddish brown group (0.001 < PR < 0.0055). Keywords included “CpG”, “WHO”, “DNA methylation”, “methylation”, “MGMT”, “inhibitor”, “AML”, “ROC”, “TMZ”, and “IDH”. For modularity value 5, which is represented by the green cluster in the upper left of Fig. 1, the phylogenetic terms were “rRNA”, “nucleotide”, “GenBank”, “codon”, “genotyping”, “mitochondrial genome”, “mtDNA”, “phylogenetic”, “tRNA”, and “RNA”. For modularity value 6, proteomics terms were clustered in the light purple group, with “protein” being the keyword with the highest PR (0.0152). Other keywords included “biomarker”, “proteomics”, “algorithm”, “peptide”, “database”, “knowledge”, “reproducibility”, “FDR”, and “measurement” (PR < 0.01, Table S2). The keywords with modularity value 7 were located between the center (the modularity 0 group) and the right of Fig. 1, which is the oncology region, and included “diagnosis”, “CNV”, “genomic hybridization”, “genomic DNA”, “STR”, “chromosome”, “BAC”, “aCGH”, “MLPA”, and “haplotype” (PR < 0.01).

Similarity by phase

To observe when the research trend changed, we performed a similarity analysis between years. Based on the inflection points in the results, we set five phases with different similarity patterns for the keyword research, using the similarity between consecutive years (Fig. 2a; e.g., 2000–2001), the similarity between 1-year interval similarities (Fig. 2b; e.g., 2000–2001; 2001–2002), and the similarity between 2-year interval similarities (Fig. 2c; e.g., 2000–2002; 2001–2003). From 1980 to 2000, named phase 0, the frequency was extremely low; hence, the similarity analysis was performed from 2000 onward. In phase 0, the yearly frequency was less than 10 from 1975 to 1989 and ranged from 10 to a maximum of 72 from 1990 to 1999. The 1-year similarity results showed a steep slope between the 2003–2004 and 2004–2005 similarities, with an increase of approximately 0.1, from 0.294 to 0.396 (Fig. 2a, Table S3). The graph then indicates a fluctuating trend of decreasing (2011–2012, 0.396) and increasing (2012–2013, 0.484) similarities. It follows a smooth curve up to 0.541 in 2017–2018 and then decreases steeply from 0.544 in 2018–2019 to 0.349 in 2019–2020. The graph of the similarity between 1-year interval similarities (Fig. 2b) shows its lowest trough (0.518) at 2002:2003–2003:2004, increases to 0.736 at 2016:2017–2017:2018, and decreases to 0.675 at 2018:2019–2019:2020. In Fig. 2c, the similarity between 2-year interval similarities is also lowest at 2001:2003–2002:2004 and then increases to 0.770 at 2010:2012–2011:2013; the subsequent smooth curve shows a slight decrease, to 0.774, at the end of the graph (2017:2019–2018:2020). For 2015:2017–2016:2018, the similarity value decreased to 0.774.

In all three graphs, the decreasing trend in similarity turned into an increasing trend starting from the similarity values that include 2003–2004. Based on the aforementioned results, we set four phases after phase 0: phase 1 (2000–2003), phase 2 (2004–2012), phase 3 (2013–2016), and phase 4 (2017–2020).

Keyword frequency by period

From 1975 to 2019, an annual increase was observed in most of the analyzed keywords, even though Fig. 1 is not a cumulative graph. However, in 2020, due to the COVID-19 pandemic, all keyword-related research decreased to over half of the keyword frequencies in 2019 (Fig. 3, upper). Despite the decreased frequency in 2020, the linear regression R² values in the large class ranged from a minimum of 0.587 in the General category to a maximum of 0.764 in the Biology category. The Genetics category had the second-highest value (R² = 0.717). In the middle class, the minimum and maximum values were 0.586 in the Company/Consortium category and 0.741 in the Organism category, respectively (Table 1). The phase-frequency results show a continuously increasing trend in Medicine and General in the large class, and in Genetics term, Clinical, and Disease in the middle class. Among the large classes, Genetics has the highest frequency, followed by Medicine and then Proteomics. Among the subcategory results, the Gene category (R² = 0.664) showed the highest value in phase 2. However, from phase 2 to phase 4, Genetics term exhibited an increasing trend and had the highest keyword frequency in phases 3 and 4. The Organism (R² = 0.741), Genetics term (R² = 0.740), and Pathogen (R² = 0.737) categories gave the best-fitting linear models over the whole period. In Fig. 3c, Genetics (n = 2,776) and Proteomics (n = 656) recorded their highest frequency in phase 2 in the large class; in the subcategories (Fig. 3d), Gene (n = 1,094), Methods (n = 599), and Proteomics (n = 483) recorded their highest frequency in phase 2, and the Database/Software category recorded its highest frequency in phase 3.

Table 1

Linear regression of keyword frequency in each category in the whole phase
Class	Category	R²
Large class	Biology	0.764
	General	0.587
	Genetics	0.717
	Medicine	0.653
	Proteomics	0.673
	Statistics	0.666
Middle class	Clinical	0.657
	Company/Consortium	0.586
	Database/software	0.684
	Disease	0.625
	Gene	0.664
	Genetics term	0.740
	Metabolite/Biologicals	0.652
	Methods	0.736
	Organism	0.741
	Pathogen	0.737
	Proteomics	0.678
	Statistics	0.648
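
The R² values in Table 1 summarize how well a straight line describes yearly keyword frequency. As a minimal sketch of that calculation, assuming an ordinary least-squares fit of frequency on publication year (the exact SPSS settings are not stated in the text), the following uses hypothetical yearly counts, not the study's data, to show how a single low year such as 2020 can lower R².

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical yearly frequencies for one category (illustration only)
years = [2015, 2016, 2017, 2018, 2019, 2020]
freq = [40, 55, 61, 70, 82, 35]

r2_all = r_squared(years, freq)                    # includes the 2020 drop
r2_without_2020 = r_squared(years[:-1], freq[:-1])  # steady increase only
print(round(r2_all, 3), round(r2_without_2020, 3))
```

In this toy series the fit is close to linear until the final year, and including the drop reduces R² sharply, which is the kind of effect discussed for 2020.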

Phase correlation

To estimate the correlation between the four phases, we analyzed the phase correlation using keyword frequency. According to the results (Fig. 4, Table S4), the General category did not show any correlation between phases, the Proteomics category showed a significant correlation only between phases 3 and 4 (R² = 0.898), and the Statistics category (pink filled circle in Fig. 4) showed a significant correlation between phases 2 and 3 (R² = 1.000). Biology, Genetics, and Medicine showed strong linear correlations throughout all the phases (Table S4). In these three categories, phase 0 showed a correlation with phases 1 and 2, and no correlation with phases 3 and 4 was found in any category.

General linear model

As the linear regression results over the whole period were significant in all categories (R² ≥ 0.586), we applied the general linear model within each phase. The frequency linear models for each category and each phase are shown in Fig. 5. There was no linear correlation among academic categories (Table S5), while linear correlations were observed in several technical categories (Table 2, Table 3). The Gene (p = 0.003^**) and Pathogen (p = 0.030^*) categories were statistically significant in phase 0 (Table 2), and Gene (p = 0.004^**) and Proteomics (p = 0.044^*) were statistically significant in phase 1. In phase 2, only the Proteomics category (p = 0.001^**) was significant in the general linear model. In phase 3 (Table 3), Proteomics (p = 0.045^*) and Software (p = 0.004^**) were significant, and in phase 4, only Genetics term was significant (p = 0.039^*).

Table 2

General linear model results of the technical categories from phase 0 to phase 2. ^*p < 0.05, ^**p < 0.01
Phase	Category	B	SE	t	Sig.	95% CI lower	95% CI upper
Phase 0	Biologicals	0.647	0.911	0.710	0.478	-1.146	2.439
	Clinical	0.004	0.875	0.005	0.996	-1.717	1.726
	Company/Institute	0.022	1.078	0.020	0.984	-2.100	2.143
	Data related	0.089	0.991	0.090	0.928	-1.861	2.040
	Disease	0.739	0.868	0.851	0.395	-0.969	2.446
	Gene	2.347	0.781	3.007	0.003^**	0.812	3.883
	Genetics term	0.299	0.783	0.382	0.703	-1.242	1.841
	Methods	0.504	0.763	0.661	0.509	-0.996	2.005
	Organism	2.338	1.226	1.907	0.057	-0.074	4.749
	Pathogen	2.036	0.933	2.182	0.030^*	0.200	3.873
	Proteomics	1.160	1.180	0.983	0.326	-1.161	3.481
	Software	-0.315	1.281	-0.246	0.806	-2.835	2.205
Phase 1	Biologicals	0.146	1.305	0.112	0.911	-2.422	2.714
	Clinical	0.876	1.254	0.698	0.485	-1.591	3.342
	Company/Institute	0.935	1.544	0.606	0.545	-2.103	3.974
	Data related	0.849	1.420	0.598	0.550	-1.944	3.643
	Disease	1.320	1.243	1.062	0.289	-1.125	3.765
	Gene	3.256	1.118	2.912	0.004^**	1.056	5.456
	Genetics term	0.798	1.122	0.711	0.477	-1.410	3.006
	Methods	0.487	1.093	0.445	0.656	-1.663	2.636
	Organism	2.653	1.756	1.511	0.132	-0.801	6.108
	Pathogen	2.320	1.337	1.735	0.084	-0.311	4.951
	Proteomics	3.420	1.690	2.024	0.044^*	0.095	6.745
	Software	-0.555	1.835	-0.303	0.762	-4.165	3.055
Phase 2	Biologicals	-0.318	8.863	-0.036	0.971	-17.757	17.120
	Clinical	-1.951	8.514	-0.229	0.819	-18.704	14.801
	Company/Institute	3.622	10.490	0.345	0.730	-17.017	24.260
	Data related	3.631	9.644	0.376	0.707	-15.343	22.605
	Disease	1.089	8.441	0.129	0.897	-15.519	17.697
	Gene	13.437	7.594	1.769	0.078	-1.504	28.377
	Genetics term	8.486	7.622	1.113	0.266	-6.511	23.483
	Methods	1.530	7.421	0.206	0.837	-13.070	16.131
	Organism	8.160	11.925	0.684	0.494	-15.303	31.623
	Pathogen	3.541	9.080	0.390	0.697	-14.325	21.407
	Proteomics	38.460	11.478	3.351	0.001^**	15.877	61.043
	Software	11.910	12.461	0.956	0.340	-12.607	36.427

Table 3

General linear model results of the technical categories from phase 3 to phase 4. ^*p < 0.05, ^**p < 0.01
Phase	Category	B	SE	t	Sig.	95% CI lower	95% CI upper
Phase 3	Biologicals	0.421	7.619	0.055	0.956	-14.570	15.412
	Clinical	1.493	7.319	0.204	0.838	-12.908	15.894
	Company/Institute	3.468	9.017	0.385	0.701	-14.274	21.209
	Data related	1.631	8.290	0.197	0.844	-14.680	17.941
	Disease	3.874	7.256	0.534	0.594	-10.403	18.151
	Gene	11.117	6.528	1.703	0.090	-1.726	23.961
	Genetics term	11.921	6.552	1.819	0.070	-0.971	24.813
	Methods	-0.081	6.379	-0.013	0.990	-12.632	12.471
	Organism	4.938	10.251	0.482	0.630	-15.232	25.107
	Pathogen	4.255	7.806	0.545	0.586	-11.103	19.614
	Proteomics	19.860	9.867	2.013	0.045^*	0.446	39.274
	Software	30.785	10.712	2.874	0.004^**	9.709	51.861
Phase 4	Biologicals	2.348	8.725	0.269	0.788	-14.818	19.514
	Clinical	9.000	8.381	1.074	0.284	-7.490	25.490
	Company/Institute	4.385	10.326	0.425	0.671	-15.931	24.700
	Data related	6.294	9.493	0.663	0.508	-12.383	24.971
	Disease	7.179	8.309	0.864	0.388	-9.170	23.527
	Gene	10.745	7.475	1.437	0.152	-3.963	25.452
	Genetics term	15.587	7.503	2.077	0.039^*	0.824	30.350
	Methods	1.537	7.305	0.210	0.833	-12.835	15.909
	Organism	5.778	11.738	0.492	0.623	-17.318	28.873
	Pathogen	6.952	8.938	0.778	0.437	-10.634	24.539
	Proteomics	11.700	11.299	1.036	0.301	-10.530	33.930
	Software	15.750	12.256	1.284	0.200	-8.384	39.884

The Gene category showed a good fit in the general linear model in phases 0 and 1. The Proteomics category fitted the linear model significantly in phases 1, 2, and 3, and the Genetics term category fitted the linear model significantly only in the latest phase, phase 4.
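
The columns of Tables 2 and 3 can be cross-checked against each other. The sketch below assumes the usual relationships for a linear model, namely t = B / SE and a 95% confidence interval of B ± t_crit × SE; the source does not state these formulas, so this is a consistency check rather than a description of the SPSS computation. The values are the Gene row of phase 0 in Table 2.

```python
def t_statistic(b, se):
    """t value implied by an estimate and its standard error."""
    return b / se


def implied_critical_value(lower, upper, se):
    """Critical t value implied by a symmetric confidence interval."""
    return (upper - lower) / (2.0 * se)


# Gene, phase 0 (Table 2): B, SE, and 95% confidence bounds
b, se, lower, upper = 2.347, 0.781, 0.812, 3.883

t = t_statistic(b, se)                            # table reports 3.007
crit = implied_critical_value(lower, upper, se)   # close to 1.97
midpoint = (lower + upper) / 2.0                  # should be close to B

print(round(t, 3), round(crit, 3), round(midpoint, 4))
```

The small difference between the recomputed t and the tabulated 3.007 is consistent with B and SE having been rounded to three decimals before publication.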

Frequency analysis in a keyword

Detailed frequency analysis results for each keyword are presented in Figs. S2, S3, and S4. Among the biology keywords, “Escherichia” showed the highest frequency, especially in phase 2. “Mycobacterium”, including Mycobacterium tuberculosis and the M. tuberculosis complex, recorded the second-highest frequency in phase 4 (n = 41), which was close to that of “Escherichia” (n = 44). As Escherichia coli and M. tuberculosis cause several infections, they are the most popular species in medical research. Further, “Escherichia” was the most popular research subject as a bacterial model in functional genomics. Arabidopsis thaliana has been widely studied as a model plant in genomics and recorded its highest frequency in phase 2. Keywords in the Organism/Pathogen graph in Fig. S2, including “Escherichia”, had their highest frequency in phase 2 and showed a decreasing pattern from phase 2 to phase 4. Other high-frequency Organism/Pathogen keywords were “bacteria,” “animal,” “microorganism,” “HeLa,” and “taxon,” including “Candida” and “Streptococcus”.

In the Statistics category, “Bayesian” and “algorithm” had the highest frequency in phase 2. However, the frequency of “algorithm” continuously decreased up to phase 4, and that of “Bayesian” increased from phase 3 to phase 4. Among general terms, “nanogram” had the highest frequency, and “database” had the second-highest frequency, showing a drastic decrease in phase 3. Further, “dataset” showed a linear increasing trend over “database” in phase 4. “Measurement,” “knowledge” and “software” displayed high frequency in phase 2, followed by a decrease in phase 4. In the lower level of general terms, “NCBI,” “workflow,” “susceptibility,” and “pubmed” presented a continuously increasing trend throughout the phases. In the Proteomics graph, “protein,” “proteomics,” “peptide,” “proteome” had the highest frequency in phase 2, in order of frequency (Fig. S2, Terminology-A).

In the Company/Consortium graph in the Genetics category (Fig. S3), the keywords “Illumina” and “Taqman” had the highest frequency, and “Illumina” and “ACMG” in particular showed an increasing trend over the whole period. In the “Database” graph, “bioinformatics” shows the highest frequency and a continuously increasing trend. In the “Genetics” category, the “Gene” (A) graph showed that the “gene” and “genome” keywords had the highest frequency, peaking in phase 2. In the “Genetics” term (A) graph, “SNP,” “genomics,” “gene normalization,” and “genetics” were the highest in order of frequency. In the “Genetics” term graph, “genomics,” “genetics,” “DNA methylation,” “methylation,” and “metagenomics” show a continuously increasing pattern. In the other graph, the sequencing-related software “NormFinder,” “geNorm,” and “BestKeeper” were the highest in order of frequency, and they had their highest frequency in phase 3. The “CNV” and “ClinGen” keywords showed a continuously increasing trend. In the “Methods” graph, “WGS,” “GWAS,” and “MiSeq” exhibited their highest frequency in phase 4; “microarray,” “genomic hybridization,” and “gene microarray” were the highest in phase 2; and “qPCR” was the highest in phase 3. In the “Medicine” category, the “disease,” “tumor,” “clinician,” “therapy,” “diagnosis,” “biomarker,” and “EGFR” keywords were the highest in order of frequency (Graphs Disease (A), Clinical (A), and Metabolites/Biologicals in Fig. S4). Many keywords in “Biology,” “Statistics,” “General,” “Proteomics,” and “Genetics” show their maximum frequency in phase 2. Most keywords in the “Medicine” category show an increasing trend throughout the five phases.

We conducted this study to estimate the trend of standardization in genetics and clinical genetics. In line with the search keywords “genomics and standard” and “clinical genomic sequence and standard”, most of the keyword outcomes were related to “Genetics” (related to “genomics” in “genomics and standard”) and “Medicine” (related to “clinical” in “clinical genomic sequence and standard”). The sum of the frequencies of the two categories was approximately 72.5%. Considering the network analysis results, oncology is a promising field of research within medicine for genetics standards. As shown in Fig. 1, oncology accounts for a large portion of the network. Specific terms in oncology, such as “biopsy”, “chemotherapy”, “metastasis”, “immunotherapy”, “tumor”, “NSCLC”, “tumor DNA”, “adenocarcinoma”, “metastatic”, “mCRC”, “KRAS”, and “IHC”, and oncology-related terms, such as “precision” and “targeted therapy” as well as “clinician”, “therapy”, “diagnosis”, and “disease”, showed continuously increasing patterns throughout phases 0 to 4 (Fig. 1, Table S2, Fig. S4). Some studies have reported that the standard of clinical treatment for metastatic cancer, rare cancer, and sequencing methods, except for standard evaluation for those diseases, has not been established [17]. The lack of established international standards in the clinical genetic diagnosis of cancer is expected to increase the demand for standardization in this field.

Considering the similarity results from 1975 to 2020, we assumed that the historical events that occurred in 2003, 2013, and 2017 are important to the field of genetics standards, and we traced the historical events that affected research on standards in genetics. In April 2003, the Human Genome Project, the world’s largest collaborative biological project, which started in 1990, was finally completed [18]. In Fig. S5, for keywords until 1999, standard terms in clinical genetics referred to several types of genomes, such as “haplotype,” “chromosome,” and “genomic DNA,” to pathogens (HBV, bacteria, microorganism, Escherichia, Mycobacterium, Pseudomonas, Streptococcus), and to methods (RFLP, genomic hybridization, PCR, electrophoresis, IHC, chromatography). Most of these phase 0 keywords were clustered in small groups of two to three scattered words, which did not generate a large network. After the completion of the Human Genome Project, the similarity of keyword frequency started to increase. In phase 2, which started in 2004, following the development of sequencing devices and genome analysis techniques throughout the duration of the Human Genome Project, the research trend shifted to terms for smaller sequences, such as miRNA and SNP, to devices (Illumina, MiSeq), and to methods (qPCR, microarray, WGS, gene normalization, geNorm, NormFinder). Based on the similarity results, we consider that research on standards in genetics and clinical genetics increased in 2013, after Illumina's MiSeq device, which allowed numerous genome-based tests, was approved by the US FDA [6]. In phase 4, as the cost of WGS decreased to approximately $1000 in 2015, keyword frequencies related to “WGS” increased considerably, although they decreased in 2020. The degree of similarity has decreased since 2017, and at the end of phase 4 in 2020, the similarity decreased sharply due to the impact of the COVID-19 pandemic, which spurred international research on COVID-19 and infectious viruses.

This situation has led to a decline in other prospective research and clinical trials in the fields of biology and basic science. Because of the unexpectedly long pandemic, research similarity declined steeply between 2019 and 2020, and the general linear model did not fit any keyword category in phase 4 except the Genetics term category. However, this rapid decline appears as only a slight decrease in the 2-year interval similarity graph (Fig. 2c). The general linear model results show that the Proteomics and Software categories fit the model best in phase 3. Although these two categories did not fit the general linear model in phase 4, both terms were highly ranked in the network analysis results of phase 4 (Fig. S5). The Proteomics and Software categories did not fit a linear model in phase 4 because of the sharp decline in research caused by the COVID-19 pandemic; however, given the 2-year interval similarity results and the strong correlation between phases 3 and 4, research in these categories is estimated to increase significantly in the years following phase 4. In addition, critical international social issues, such as the ongoing global COVID-19 vaccinations, could support our findings on future research trends.

In this study, we examined the modern history of standardization trends in clinical genetics from 1975 to 2020. Until 2000, research on standards in genetics and/or clinical genetics, including genomics, showed patterns considerably different from those of phases 3 and 4. This study shows that gene analysis techniques and the genes to be analyzed are changing. For example, the objects of study have shifted from “genomic DNA”, “haplotypes”, and “chromosomes” to relatively smaller gene fragments, such as “miRNA” and “SNP”. The terminology of gene analysis has evolved from “PCR”, “RFLP”, and “IHC” to “qPCR”, “MiSeq”, “microarray”, and “WGS”, and this evolution has also been observed in gene analysis software, such as “BestKeeper”, “geNorm”, and “NormFinder”, and in proteomics terms, such as “shotgun” and “chromatography”. In addition, this study suggests specific fields of research, such as Genetics term, that show an increasing tendency even after the pandemic. According to our results, despite the COVID-19 pandemic, international standardization activities are expected to increase steadily in fields such as software, proteomics, and genetics terminology. The clinical significance of analysis software in bioinformatics and medicine is increasing in the current situation of international standards. In the era of the Fourth Industrial Revolution based on big data, nano-techniques, software and analysis methods, and their markets have been growing faster, and the clinical significance of establishing definitions of these new technologies and standards for genetics terminology is increasing. In addition, this study indicates that international issues in genetics, such as the completion of the Human Genome Project in 2003, the approval of NGS by the US FDA in 2013, and the outbreak of the COVID-19 pandemic in 2020, seem to have considerably influenced standardization research in genomics. Based on our results, we estimated and suggested future trends through comprehensive trend analysis and identified subfields in clinical genetics that are in high demand for international standardization.

Keyword database

Using PubMed, we retrieved 16,550 research articles published between 1975 and Sep. 2020 from the MEDLINE database with the search terms “genomics and standard” and “clinical genomic sequence and standard”. Among the retrieved articles, 10,000 contained “genomics and standard” and 6,550 contained “clinical genomic sequence and standard”. The number of articles containing each search keyword is indicated by the frequency. We extracted all the keywords from the full text of the articles, including reviews, research articles, and perspectives, but excluding reports and news articles. The extracted terms and publication years were used to generate a two-dimensional annual frequency matrix of 330 keywords × 46 years.

Keyword extraction and selection

Initially, 84,644 keywords were extracted from 15,855 articles. Using those keywords and the year of publication of the articles, we generated a two-dimensional annual frequency matrix. After deduplicating and combining keywords, we selected 5,639 keywords, as shown in Fig. S1. Excluded terms comprised verbs, adjectives, adverbs, and non-technical terms such as “scientist”, “concept”, “optimal”, and “consensus”, alongside common terms such as “April”. Compound nouns, such as “genomics proteomics” and “protein gene”, were removed. After the exclusion, 330 keywords with over 11 frequencies were analyzed. The entire procedure was performed and thoroughly reviewed by two authors. All keywords were extracted from the full-text articles using the TextRank algorithm [19] with Corpus 16000.
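
The annual frequency matrix described above can be illustrated with a short sketch. It assumes the extraction step yields (keyword, year) pairs, one per occurrence; the function name, the input format, and the toy records are hypothetical, and the manual curation steps (removing verbs, common terms, and compound nouns) are not reproduced.

```python
from collections import Counter, defaultdict


def annual_frequency_matrix(records, years, min_total):
    """Build {keyword: [count for each year]} and keep frequent keywords.

    records   -- iterable of (keyword, year) pairs (assumed input format)
    years     -- ordered list of years that define the matrix columns
    min_total -- minimum total frequency a keyword needs to be kept
    """
    counts = defaultdict(Counter)
    for keyword, year in records:
        counts[keyword][year] += 1

    matrix = {}
    for keyword, per_year in counts.items():
        row = [per_year.get(year, 0) for year in years]
        if sum(row) >= min_total:
            matrix[keyword] = row
    return matrix


# Toy records (illustration only, not the study's data)
records = [("SNP", 2019), ("SNP", 2020), ("SNP", 2020), ("WGS", 2019)]
matrix = annual_frequency_matrix(records, years=[2019, 2020], min_total=2)
print(matrix)
```

With the threshold stated in the text ("over 11 frequencies"), `min_total` would be 12, and `years` would span the 46 publication years from 1975 to 2020.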

Keyword classification

Each keyword represents a different research topic. The 330 keywords were classified into their respective research areas (Fig. S1, Table S1). The keywords were sorted into six academic categories: Biology, General, Genetics, Medicine, Proteomics, and Statistics. Further, they were divided into 12 technical subcategories: Biologicals/Metabolics, Clinical, Company/Consortium, Database/Software, Disease, Gene, Genetics term, Methods, Organism, Pathogen, Proteomics, and Statistics.

Network analysis

The overall network analysis was performed as in previous studies [20–22]. In the network analysis of research articles, the frequency of keywords indicates the major research topic in a particular year. For the network analysis, we used the weighted Jaccard index to obtain the similarity value between each pair of keywords in the keyword list, which evaluates the closeness between the keywords. For the temporal analysis, we used the same index to calculate the similarity of the total frequency between publication years. The similarity values were obtained in the Pajek NET format. In the following equation, S and T represent each keyword and/or total frequency in the year, respectively.
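
The equation itself is not reproduced in this text. As a minimal sketch, assuming the standard definition of the weighted Jaccard index, J(S, T) = Σ min(S_i, T_i) / Σ max(S_i, T_i), where S_i and T_i are the frequencies of the two items in year i, the similarity and a Pajek NET export could look as follows. The helper names and the toy frequency vectors are hypothetical, and the output uses only the basic `*Vertices` and `*Edges` sections of the Pajek format.

```python
from itertools import combinations


def weighted_jaccard(s, t):
    """Weighted Jaccard index of two equal-length, non-negative vectors."""
    numerator = sum(min(a, b) for a, b in zip(s, t))
    denominator = sum(max(a, b) for a, b in zip(s, t))
    return numerator / denominator if denominator else 0.0


def to_pajek(matrix):
    """Return a Pajek NET string with one weighted edge per keyword pair."""
    labels = sorted(matrix)
    index = {label: i + 1 for i, label in enumerate(labels)}  # 1-based ids
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Edges")
    for a, b in combinations(labels, 2):
        weight = weighted_jaccard(matrix[a], matrix[b])
        if weight > 0:  # skip pairs with no overlap
            lines.append(f"{index[a]} {index[b]} {weight:.4f}")
    return "\n".join(lines)


# Toy yearly frequencies for three keywords (illustration only)
toy = {"SNP": [1, 2, 3], "genome": [2, 2, 1], "WGS": [0, 1, 4]}
net_text = to_pajek(toy)
print(net_text)
```

Identical vectors give a similarity of 1 and vectors with no overlapping non-zero years give 0, so edge weights stay within the interval from 0 to 1.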

As in most networks, ours consists of nodes and edges. Nodes represent keywords, and edges represent the closeness between two keywords, given by their similarity value. Node and edge colors indicate modularity classes, which were generated by a community detection algorithm. Node size was calculated using the PageRank (PR) algorithm. In general, a social network reflects the relationships between its components. In the present study, we visualized the network of relationships between keywords related to standards in genomics using Gephi 0.8.2.
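To make the node-size calculation concrete, the following is a minimal sketch of PageRank by power iteration on a weighted, undirected keyword network. It is not Gephi's implementation: the damping factor of 0.85, the convergence tolerance, the function name, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank for a weighted graph given as {node: {neighbor: weight}}.

    For an undirected network, every edge must be listed in both directions.
    """
    nodes = sorted(weights)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    strength = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(max_iter):
        # Rank held by isolated nodes is spread uniformly over all nodes.
        dangling = sum(rank[v] for v in nodes if strength[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if strength[u] == 0:
                continue
            for v, w in weights[u].items():
                # Each node passes rank to neighbors in proportion to edge weight.
                new_rank[v] += damping * rank[u] * w / strength[u]
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy network (hypothetical): "genome" is linked to two other keywords.
graph = {
    "genome": {"protein": 1.0, "software": 1.0},
    "protein": {"genome": 1.0},
    "software": {"genome": 1.0},
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

In this toy network the hub keyword receives the largest score, which is the property used when PageRank values are mapped to node size.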

Based on the similarity values calculated at 1-year intervals, we defined four phases within the period from 1975 to 2020: phase 1 (2000–2003), phase 2 (2004–2012), phase 3 (2013–2016), and phase 4 (2017–2019). Through phase analysis, we identified change points, that is, the years at which the similarity curve bent steeply. These change points aid the analysis of social events that affect research trends. For each phase, we determined the research areas based on these keywords.
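One simple way to locate a steep bend in a yearly similarity curve is to rank years by the absolute discrete second difference of the series. The sketch below illustrates that heuristic only; the text does not state how the change points were actually determined, and the function name and the similarity values are hypothetical.

```python
def steepest_changes(series, top=1):
    """Return the time labels where the curve bends most sharply.

    series: list of (label, value) pairs ordered in time.
    The bend at an interior point is measured by the absolute second
    difference: |value[k+1] - 2 * value[k] + value[k-1]|.
    """
    scored = []
    for k in range(1, len(series) - 1):
        bend = series[k + 1][1] - 2 * series[k][1] + series[k - 1][1]
        scored.append((abs(bend), series[k][0]))
    scored.sort(reverse=True)
    return [label for _, label in scored[:top]]


# Hypothetical year-to-year similarity values, for illustration only.
similarity_by_year = [
    (2001, 0.50),
    (2002, 0.52),
    (2003, 0.54),
    (2004, 0.80),
    (2005, 0.90),
]
change_points = steepest_changes(similarity_by_year)
print(change_points)
```

In practice, a candidate year flagged this way would still need to be checked against the plotted curve and against known events before being used as a phase boundary.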

Statistical analysis

To statistically estimate research trends, we first performed correlation analysis using Pearson correlation (Biology, General, Proteomics, and Statistics categories) and Spearman correlation (Genetics and Medicine categories). Second, we fitted a univariate general linear model for each academic and technical category. We then examined the tests of between-subjects effects and the parameter estimates using IBM SPSS Statistics version 26.
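The two correlation coefficients named above differ in what they measure: Pearson correlation captures linear association, whereas Spearman correlation is the Pearson correlation of the ranks and therefore captures any monotonic trend. The sketch below shows both in plain Python. The analysis in the study was run in SPSS, so the function names and the toy frequencies here are assumptions for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical keyword frequencies that double every year.
years = [2015, 2016, 2017, 2018, 2019]
frequency = [2, 4, 8, 16, 32]
r_pearson = pearson(years, frequency)
r_spearman = spearman(years, frequency)
print(round(r_pearson, 3), round(r_spearman, 3))
```

For this strictly increasing but non-linear series, the Spearman coefficient equals 1 while the Pearson coefficient is lower, which illustrates why the choice between the two can matter for a given category.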

Acknowledgements

We thank So-Young Shim for help with keyword preparation and Editage (www.editage.co.kr) for English language editing.

Authors’ contributions

S-J.A. conceived the idea and obtained funding for this study; E.B.B. and S.N. designed the methodology and prepared the keyword data; S.N. conducted the network analysis; E.B.B. conducted the statistical analysis; E.B.B. interpreted all the results and led the writing of the manuscript.

Competing interests

The authors declare no competing interests.

Funding

This project was supported by the Korean Ministry of Trade, Industry, and Energy (No. 20011748, No. 20012610).

Additional information

Supplementary information is available for this paper at

Correspondence and requests for materials should be addressed to S-J.A.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Castle, W. E. Mendel's Law of Heredity. Proceedings of the American Academy of Arts and Sciences, 39, 223–242 (1903).
2. Wills, E. D. in Biochemical Basis of Medicine Ch. 41, 511–520 (Elsevier, 1985).
3. Rose, N. Normality and Pathology in a Biomedical Age. Sociol Rev, 57, 66–83 (2009).
4. Koch, T. Eugenics and the Genetic Challenge, Again: All Dressed Up and Just Everywhere to Go. Camb Q Healthc Ethics, 20, 191–203 https://doi.org/10.1017/S0963180110000848 (2011).
5. Hodgson, S. Advances in cancer genetics. Clin Med, 9, 151–153 https://doi.org/10.7861/clinmedicine.9-2-151 (2009).
6. Collins, F. S. & Hamburg, M. A. First FDA Authorization for Next-Generation Sequencer. N Engl J Med, 369, 2369–2371 https://doi.org/10.1056/NEJMp1314561 (2013).
7. World Health Organization. The use of next-generation sequencing technologies for the detection of mutations associated with drug resistance in Mycobacterium tuberculosis complex: technical guide. https://apps.who.int/iris/handle/10665/274443 (2018). License: CC BY-NC-SA 3.0 IGO.
8. Matthijs, G. et al. Guidelines for diagnostic next-generation sequencing. Eur J Hum Genet, 24, 2–5 https://doi.org/10.1038/ejhg.2015.226 (2016).
9. Mason, C. E., Afshinnekoo, E., Tighe, S., Wu, S. & Levy, S. International Standards for Genomes, Transcriptomes, and Metagenomes. J Biomol Tech, 28, 8–18 https://doi.org/10.7171/jbt.17-2801-006 (2017).
10. Ernst, M. et al. FocusHeuristics - expression-data-driven network optimization and disease gene prediction. Sci Rep, 7, 42638 https://doi.org/10.1038/srep42638 (2017).
11. Groen, R. N., Wichers, M., Wigman, J. T. W. & Hartman, C. A. Specificity of psychopathology across levels of severity: a transdiagnostic network analysis. Sci Rep, 9, 18298 https://doi.org/10.1038/s41598-019-54801-y (2019).
12. Li, X., Liu, G., Chen, W., Bi, Z. & Liang, H. Network analysis of autistic disease comorbidities in Chinese children based on ICD-10 codes. BMC Med Inform Decis Mak, 20, 268 https://doi.org/10.1186/s12911-020-01282-z (2020).
13. Li, X. et al. Seven decades of chemotherapy clinical trials: a pan-cancer social network analysis. Sci Rep, 10, 17536 https://doi.org/10.1038/s41598-020-73466-6 (2020).
14. Li, X. et al. Identification of a histone family gene signature for predicting the prognosis of cervical cancer patients. Sci Rep, 7, 16495 https://doi.org/10.1038/s41598-017-16472-5 (2017).
15. Papachristou, N. et al. Network Analysis of the Multidimensional Symptom Experience of Oncology. Sci Rep, 9, 2258 https://doi.org/10.1038/s41598-018-36973-1 (2019).
16. Ortega, J. L. Cover versions as an impact indicator in popular music: A quantitative network analysis. PLoS One, 16, e0250212 https://doi.org/10.1371/journal.pone.0250212 (2021).
17. Colomer, R. et al. When should we order a next generation sequencing test in a patient with cancer? EClinicalMedicine, 25, 100487 https://doi.org/10.1016/j.eclinm.2020.100487 (2020).
18. Riley, N. Out of date: genetics, history and the British novel of the 1990s. Med Humanit, https://doi.org/10.1136/medhum-2020-012022 (2021).
19. Mihalcea, R. & Tarau, P. in Empirical Methods in Natural Language Processing. 404–411 (Association for Computational Linguistics).
20. Ji, Y. A., Nam, S. J., Kim, H. G., Lee, J. & Lee, S. K. Research topics and trends in medical education by social network analysis. BMC Med Educ, 18, 222 https://doi.org/10.1186/s12909-018-1323-y (2018).
21. Son, Y. J., Lee, S-K., Nam, S. J. & Shim, J. L. Exploring Research Topics and Trends in Nursing-related Communication in Intensive Care Units Using Social Network Analysis. CIN, 36, 383–392 https://doi.org/10.1097/CIN.0000000000000444 (2018).
22. Kim, S. K., Oh, Y. & Nam, S. Research trends in Korean medicine based on temporal and network analysis. BMC Complement Altern Med, 19, 160 https://doi.org/10.1186/s12906-019-2562-0 (2019).

210708SupplementaryInformationEBB.docx
