Subject recruitment and sampling
In this study, a total of 43 patients with bilateral CRSwNP, 27 patients with NIP and 34 controls were recruited. All of these subjects underwent surgery, and tissue was retained during surgery for subsequent sequencing analysis. Patients with CRSwNP met the diagnostic criteria of the EPOS2020 guidelines.1 NIP was determined by postoperative histopathology. The turbinate mucosa tissues of patients with a deviated nasal septum and cranial base operation were included in the control group. The study was approved by the Ethics Committee of Tianjin First Central Hospital. All the subjects were informed in advance and signed informed consent forms.
Each patient underwent a physical examination by two rhinologists to confirm the diagnosis, and clinical information such as age, sex, smoking history, medication history, and surgical history was recorded.
The exclusion criteria were as follows: ① Patients with immune-related diseases, genetic disorders, pregnancy, clotting disorders, or cystic fibrosis. ② Patients with unilateral nasal polyps or infection or anatomic sinusitis. ③ Patients who received systemic antibiotics within 12 weeks prior to screening. ④ Patients who received systemic immunosuppressive therapy within 12 weeks prior to screening. ⑤ Patients who received systemic or local steroid hormones within 12 weeks prior to a screening history of relevant surgery.
All the patients underwent surgery. After anesthesia, nose hairs were trimmed, and the skin of the nose and nasal vestibule was disinfected to reduce contamination. Under the guidance of nasal endoscopy, polyp tissue, tumor tissue and normal turbinate mucosa were collected in sterile tubes, and the contaminants on the surface of the retained tissue were washed away with phosphate-buffered saline (PBS). The samples were stored in RNAlater (Sigma-Aldrich) in sterile empty tubes and immediately stored in a refrigerator for subsequent detection and analysis. In addition, simulated samples containing PBS alone were collected at each sampling to assess environmental contamination. Preliminary experiments were performed to assess and confirm the quality and uniformity of the collected samples, as well as the feasibility of the method, before starting a large sample collection. All the procedures were performed in the operating room under sterile conditions.
5R 16S rRNA gene sequencing and analyses
To better characterize the microbiota in nasal tissues, 5R 16S sequencing was used in this study.12 In brief, nasal tissue and negative controls were extracted via the Cetyl Trimethyl Ammonium Bromide method. All negative controls included sampling controls, DNA extraction controls and no-template PCR amplification controls. The modified 5R 16S rRNA gene was amplified. The modified 5R 16S rRNA gene is composed of five regions (the V2, V3, V5, V6, and V8 regions). The amplified products were then purified and quantified by standardized means. The purified products were then sequenced on the Illumina NovaSeq platform supported by Lc-Bio Technologies Co., Ltd. (Hangzhou, China). After the sample sequencing data were removed, the high-quality sequences that remained after the removal of low-quality sequencing results were used for subsequent analysis.
The reads were demultiplexed per sample, filtered and aligned to each of the five amplified regions based on the primer sequences. The SMURF (short multiple regions framework) method was applied to combine read counts from the five regions into coherent profiling results to solve the maximum likelihood problem.13 Then, the taxonomic identification and relative abundance calculation of bacteria were carried out. The database that was used in this project is optimized Greengenes (May 2013 version).
Subsequent microbiome diversity analysis and microbiome difference analysis were carried out on the basis of these data. The richness and uniformity of alpha diversity are mainly reflected by indices such as Chao1 and observed species. Beta diversity analysis usually begins by calculating the distance matrix between environmental samples, which includes the distance between any two samples. Beta diversity, together with alpha diversity, constitutes the overall diversity or the biological heterogeneity of a given environmental community. The Kruskal-Wallis test was used to compare multiple groups with biological duplicate samples.
5R 16S rRNA gene sequencing and analyses of negative controls
To reduce the impact of low-abundance noise on subsequent analysis, the sequence read number of each sample was normalized, and samples with total reads < 1000 (including negative controls) and bacterial data with a relative abundance < 10− 4 were removed. Then, according to the prevalence of bacteria in the negative control samples, we determined which bacteria were the contaminating bacteria at the sampling end and the experimental end. The principle and process of impurity removal were as follows. Five negative control samples were sequenced. The genus-level species that were present in more than 50% of the samples (more than 3 samples and more than 0.001% of the abundance) according to the sequencing results were identified as contaminating bacteria (or heterobacteria). Contaminating bacteria were removed from the nasal tissue sequencing results. The removed species were normalized again to obtain the relative abundance of real species in the sample.
Bioinformatics analysis of RNA-seq data
As previously mentioned14, the RNA was isolated and purified from the total sample using a standardized process. Then, the quantity and purity of the isolated RNA were controlled, and the integrity of the RNA was tested. The concentration was > 50 ng/µL, the RNA integrity number (RIN) was > 7.0, and the total RNA amount was > 1 µg. After two rounds of purification, mRNA with poly(A) (polyadenylate) was specifically captured. The captured mRNA was fragmented at high temperature with the NEBNextR RNA Fragmentation Module (NEB, cat. e6150, USA) at 94°C for 5–7 minutes. cDNA was synthesized from the fragmented RNA with Invitrogen SuperScriptTM II Reverse Transcriptase (Invitrogen, cat. 1896649, USA). E. coli DNA polymerase I (NEB, cat.m0209, USA) and RNase H (NEB, cat.m0297, USA) were then used for two-strand synthesis. To preserve the orientation information of the transcript during transcriptome sequencing, these complex double strands of DNA and RNA were converted into double-stranded DNA, and dUTP Solution (Thermo Fisher, cat. R0133, CA, USA) was added to the double strands to convert the ends of the double-stranded DNA into flat ends. Then, an A base is added to each end so that it could be connected to the terminal with T base joints, and the size of the fragment was screened and purified by magnetic beads. The double-stranded library was digested with UDG enzyme (NEB, cat. m0280, MA, US) and then formed by PCR with a fragment size of 300 bp ± 50 bp (chain-specific library). Finally, we used an Illumina NovaSeqTM 6000 (LC Bio Technology Co., Ltd.; Hangzhou, China) to perform double-end sequencing in PE150 mode according to standard procedures. The sequence information of the transcripts that were obtained by sequencing was only derived from the first strand.
After using Cutadapt to removed unqualified sequences (sequencing joints, low-quality sequences, etc.) from the original data to obtain valid data (clean data), reference genome alignment was performed using HISAT2. Based on the HISAT2 comparison results, Stringtie was used to reconstruct the transcripts and calculate the expression levels of all the genes in each sample. Gene expression level analysis mainly aimed to analyze protein-coding genes (mRNAs) that were annotated by the genome, and the expression levels of genes were statistically measured to evaluate the correlation of gene expression characteristics and differentially expressed genes within and between groups. When measuring gene expression levels, fragments per kilobase million (FPKM) values, which are standardized based on the original read counts of genes, were used as measures of gene expression levels, and gene expression levels in different samples were quantified. In this project, a fold change > = 2 (that is, the absolute value of log2FC > = 1) was considered the change threshold, and a q value < 0.05 (the q value is the correction value of the p value) was considered the criterion for identifying differentially expressed genes (|log2FC|>= 1&q < 0.05). The results of differential expression gene analysis, differential expression gene enrichment analysis and GSVA were obtained in the set comparison group.
Preprocessed microbiome and host transcriptome data
Based on the approach of Priya et al.,11 the following process was performed on microbiome data for each disease cohort separately. First, sequences of archaea, chloroplasts, known contaminants from laboratory reagents or kits, and environmental contaminants that were associated with soil or water were removed from the microbial abundance table. Next, microbial abundance tables were constructed at the phylum, class, order, family, genus, and species levels and filtered based on abundance and prevalence, retaining only taxa that had relative abundances above 0.001 in at least 20% of the samples. The microbiome data of each disease cohort were independently processed to obtain the taxon abundance matrix of each disease cohort, which included 134 taxa in CRSwNP group and 156 taxa in NIP group.
In host transcription data processing, genes with low expression were first removed to ensure that each gene was expressed in at least half of the samples in the disease cohort. We used the R package "DESeq2" (version 1.22.2) for variance stabilizing transformation of the filtered gene expression read count and to filter the 25% of genes that had the least variance. RNA-seq data from each disease cohort were independently subjected to these steps, producing a unique host gene expression matrix for the disease for downstream analysis, including 23,030 genes in CRSwNP group and 22,849 genes in NIP group.
Sparse CCA analysis
We used a machine learning framework that was developed by Priya et al. to integrate high-dimensional datasets of host gene expression and microbiome abundance.11 SparseCCA was used to identify host genes that are associated with microorganisms to characterize pathway-level associations. The analytical framework was applied to paired host gene expression data and microbiome data for each disease cohort and control group, respectively, to avoid potential batch effects. For each disease cohort dataset, we considered only associations that were observed in patients not in controls. Prior to the application of this analytical framework, host gene expression matrix and microbiome abundance matrix data were standardized and normalized to meet the distribution requirements of statistical models.
SparseCCA was used to determine group-level correlations between paired host gene expression and microbiome data in each disease cohort. sparseCCA performs feature selection via the L1 or lasso penalty while maximizing the correlation between the two datasets. Its objective function can be expressed as:
$$\:{maximize}_{u,v}{u}^{T}{X}^{T}Yv\:subject\:to\:{u}^{T}{X}^{T}Xu\le\:1,{v}^{T}{Y}^{T}Yv\le\:1,‖u‖{}_{1}\le\:{{\lambda\:}}_{1},,‖v‖{}_{1}\le\:{{\lambda\:}}_{1},$$
where X and Y represent two data matrices with the same number of samples but different numbers of features (representing microbiome taxa composition data and host gene expression data, respectively). u and v are the canonical load vectors of X and Y, respectively; λ1 and λ2 control the lasso penalty of U and V, respectively. τ represents the transpose of a matrix.
As with the original method, leave-one-out cross-validation was used for a grid-search approach to obtain the optimal hyperparameters. Using this method, λ1 and λ2 for group A were set to 0.2 and 0.15, respectively, and λ1 and λ2 for group B were set to 0.266 and 0.177, respectively. After the sparsity parameters were determined, the sparse CCA model was fitted to obtain a subset of the relevant host genes and microorganisms (called components), calculating only the top 10 components for each disease cohort. Additionally, the leave-one-out cross-validation approach was used to calculate the importance of each component. Cor.test was used to evaluate the true strength and significance of the association, and Benjamini‒Hochberg (FDR) was used to correct the P value of the multiple hypothesis test for each disease cohort. Only significant components with FDR < 0.1 were retained. Based on this method, three important components in group A were identified, with an average of 1739 host genes and 4.5 microorganisms. There were six significant components in group B, with an average of 2,786 host genes and 6.4 microbes. sparseCCA was applied separately for each disease cohort using the R language (version 4.2.0) package "PMA" (version 1.2.1). All the components were visualized using Cytoscape (version 3.10.0).
Inference of microbial-host interaction networks
In addition to identifying differential pathways, identifying important directional edges between pathways can also provide valuable insights when studying microbe-host interactions. The path interaction network (Bayesian Network analysis) was used with the "CBNplot" package to construct the gene interaction network15. Based on the host expression profile data, associated microorganisms were screened out in combination with sparseCCA, the interaction direction was calculated, the bnpathplot function was used to obtain the interaction direction and intensity, and Cytoscape was used to plot the network.
Fluorescence in situ hybridization (FISH)
FISH is performed using probes that target the 16S rRNA gene sequence for a specific bacterial taxon. According to previous methods16, the Geobacillus FISH probe was hybridized to tissue sections and labeled with FITC at the 5' and 3' ends (5’CCGAATCAAGGCAAGCCCCAATC-3’). This probe was designed and synthesized by Exonbio (Gungzhou, China). This probe targets Geobacillus stearothermophilus, and some Geobacillus sp. EUB330 proteins target a conserved domain of bacterial 16S rRNA. FISH images were captured with a Nikon 80i microscope. The images were analyzed and scored according to the fluorescence signal.
Western blotting
Tissues were homogenized in liquid nitrogen, and an appropriate amount of RIPA cell lysis was added. After centrifugation, the concentration of superalbumin was extracted and determined. The protein concentration was determined by the BCA method. The proteins were denatured by boiling after the original volume of buffer was added. After protein electrophoresis and membrane transfer, the membranes were blocked in BSA solution at room temperature, and the membranes were incubated with primary antibodies (rabbit anti-p65, ab32536; rabbit anti-p-p65, ab109458; rabbit anti-β-actin, ab8227) at 4°C overnight. The next day, after the membrane was washed with the washing liquid and incubated at room temperature with the secondary antibody for 1 hour. After the membrane was washed with the washing liquid, ECL was applied for color development. The original gels were showed in Supplymentary Fig. 1.