1. Normal Tissue Profiling
1.1) ddPCR of samples from TRACERx and PEACE studies
Tumour and normal lung tissue samples
This project leverages the infrastructure established by the national pan-cancer research autopsy programme (PEACE, NCT03004755) and the prospective, longitudinal cohort study (TRACERx) of non-small cell lung cancer (NCT01888601)1.
To explore whether clinical disparities in never-smoker lung cancer were reflected in normal lung tissue EGFR mutation status, we sought to assemble a cohort of TRACERx patients that was balanced as far as possible for sex (males vs females), smoking status (never smoker vs ever smoker) and EGFR mutation status in tumour samples (EGFRm vs EGFRwt). To determine whether EGFR mutations were also found in normal lung tissue from patients who never acquired a lung cancer diagnosis in their lifetimes, we also assembled a cohort of PEACE patients.
Based on tissue that was available for study, our dataset consisted of 195 tumour and 195 normal lung tissues from 195 TRACERx patients, and 59 normal lung tissues from 19 PEACE patients (median 3 samples per patient (range 1 to 10)).
In TRACERx, tumour and normal lung tissue were obtained at surgery. Normal lung tissue was collected distally from the primary tumour tissue (at least approximately 2 cm apart). All tissue was initially frozen, and a portion was then fixed and made into an FFPE block. An H&E section of each block was cut, stained and submitted for pathology review. We use ‘normal’ to refer to non-malignant lung tissue. DNA was extracted from both the normal and tumour frozen tissue proximal to these sections. In PEACE, normal lung tissue was collected at post-mortem tissue harvest from patients who never acquired lung cancer in their lifetimes. Each piece of tissue collected was immediately bisected; one half was snap frozen and the other was fixed and made into an FFPE block. An H&E section of each block was cut, stained and submitted for pathology review. DNA was then extracted from an adjacent normal frozen tissue sample.
All aforementioned H&E slides from tissues have undergone central pathology review. In particular, to exclude the possibility of contamination with tumour cells, thoracic pathologists have confirmed that none of the normal lung tissue samples contain any indication of tumour tissue or morphologically defined pre-invasive disease. Thoracic pathologists also identified anthracotic pigment and recorded its presence as a binary score.
EGFR mutation profiling in normal samples (with ddPCR)
DNA was extracted from normal lung tissue samples as previously described1. DNA concentration was measured using Qubit, and up to 3,000 ng of DNA was fragmented to approximately 1,500 bp using the Covaris E220 evolution Focused-ultrasonicator following the manufacturer’s standard protocol. SAGAsafe assays2 for 5 EGFR target variant alleles (EGFR L858R, EGFR Exon 19 del, EGFR S768I, EGFR L861Q and EGFR G719S) were employed (SAGA Diagnostics AB). SAGAsafe is a digital PCR-based ultra-sensitive mutation detection technology that uses an alternative chemistry alongside a modified thermocycling program, such that the true positive variant allele signal is enriched during a linear phase, and signals for both the variant and the wild-type alleles are amplified during the exponential phase. The method effectively suppresses the false positive variant allele signal arising from polymerase base misincorporation errors and DNA damage, making reliable detection of rare mutations possible at exceedingly low limits of detection. The assays were performed on the Bio-Rad QX200 Droplet Digital PCR System. At least 3 positive droplets were required to call a sample positive. Using control experiments containing 265,000-381,000 copies of wild-type genome equivalents per test, the achievable limit of detection for the five EGFR SAGAsafe assays was determined to be at least 0.004% VAF. For each patient sample, 500 ng of fragmented DNA (corresponding to ~150,000 copies of genome equivalents) was analyzed per assay across 4 reaction wells, with positive and negative control samples included in every run.
Calculation of copy number concentration of the variant and the wild-type alleles
Cvi is the copy number concentration of the target (variant or wild-type allele) in the input DNA sample
P is the number of positive droplets for the target
T is the number of total droplets analyzed
Vd is the volume of a droplet (0.85 × 10^-3 μL)
Vr is the total volume of a ddPCR reaction (20 μL)
Vi is the input volume per ddPCR reaction of the input DNA sample
Calculation of the variant allele frequency (VAF)
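The two calculations named above can be sketched as follows. This is a minimal illustration that assumes the standard Poisson correction for droplet digital PCR and a VAF defined as the variant concentration divided by the sum of variant and wild-type concentrations; the exact expressions used in the study are not reproduced in this text. The function names and the example droplet counts are hypothetical.

```python
import math

DROPLET_VOLUME_UL = 0.85e-3   # Vd, as stated in the text
REACTION_VOLUME_UL = 20.0     # Vr, as stated in the text


def copies_per_ul_input(positive, total, input_volume_ul,
                        droplet_volume_ul=DROPLET_VOLUME_UL,
                        reaction_volume_ul=REACTION_VOLUME_UL):
    """Concentration (copies/uL) of a target in the input DNA sample.

    Assumes Poisson loading of droplets (an assumption, not taken from the text):
    mean copies per droplet = -ln(1 - P/T).
    """
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need total > 0 and 0 <= positive < total")
    copies_per_droplet = -math.log(1.0 - positive / total)
    # Concentration in the reaction mix, scaled back to the input sample (Vr / Vi).
    conc_in_reaction = copies_per_droplet / droplet_volume_ul
    return conc_in_reaction * reaction_volume_ul / input_volume_ul


def variant_allele_frequency(c_variant, c_wildtype):
    """VAF as variant / (variant + wild-type); assumed definition."""
    denominator = c_variant + c_wildtype
    if denominator <= 0:
        raise ValueError("no template detected")
    return c_variant / denominator


# Hypothetical droplet counts for one assay, pooled over its reaction wells.
c_var = copies_per_ul_input(positive=4, total=60000, input_volume_ul=5.0)
c_wt = copies_per_ul_input(positive=30000, total=60000, input_volume_ul=5.0)
print(f"VAF = {100 * variant_allele_frequency(c_var, c_wt):.4f}%")
```

Pooling positive and total droplets across wells before applying the correction is one common convention; whether the study pooled wells this way is not stated here.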
EGFR mutation profiling in corresponding tumour tissue (with MiSeq)
For each tumour region and matched germline, capture of a custom panel of genes (including the EGFR locus) was performed on 125ng DNA isolated from genomic libraries. The TruSeq Custom Amplicon Library Preparation method was used. Following cluster generation, samples were 100bp paired-end multiplex sequenced on the Illumina MiSeq at the GCLP lab at University College London, as described previously1. The generated data were aligned to the reference human genome (hg19) achieving a median sequencing depth of 3555X (Range: 1069-13084). Mutations were called as previously described1.
1.2) Duplex-seq of samples from the BDRE study
Normal lung tissue samples
All BDRE cohort patients were enrolled under Biomarker for Dysplastic Epithelium (BDRE) (NCT00900419). The cohort consisted of individuals recommended for CT scan based on age, smoking history or other indications. If a suspicious nodule was detected by CT scan, a navigational bronchoscopy was indicated. The nodule site was sampled for accurate diagnosis. For each patient, a brushing from a remote site in a contralateral lobe was also taken for research, as a representative sample of normal tissue and subsequently profiled for mutations using Duplex-seq. The absence of nodules or masses detected by chest CT scans was indicative of the non-tumor nature of these contralateral samples. To document that the brushings were peripheral, they were performed under fluoroscopic guidance with the brush advanced from the sheath only after documentation that the working channel was in the peripheral airways.
EGFR and KRAS mutation profiling (with Duplex-Seq)
Genomic DNA was extracted from brushings using Qiagen DNeasy Blood & Tissue kit according to manufacturer’s instructions. Duplex libraries were prepared using a commercially available kit from TwinStrand Biosciences, Inc. (Seattle, WA, USA), starting with 250ng of input DNA. Custom probes were designed for targeted capture of EGFR exons 18, 19, 20 and 21, and KRAS exons 2 and 3.
By independently capturing and sequencing the two strands of DNA for selected genomic regions, combined with the use of a common unique molecular identifier for both strands, Duplex-seq allows the detection of rare mutations3,4 with a sensitivity of less than 1 in 10^7. After shearing and capture of gDNA spanning the panel, primers are ligated that allow the two strands of DNA for each segment to be uniquely labelled and matched with the opposing strand. These strands were then amplified, the libraries were sequenced on the NovaSeq 6000 Sequencing System (Illumina Inc., San Diego, CA, USA) and the sequencing data were analyzed on the DNAnexus platform. Samples had an average of 150,000,000 raw reads, yielding a mean on-target duplex depth of 4,500. Duplex-seq reads were processed using an in-house pipeline adapted from Valentine et al.5 We also profiled the involved lung of 15 of 20 cases where the suspicious nodule in the contralateral lung was cancerous, and where tissue was available. These data were processed by the bioinformatics pipeline provided by TwinStrand Biosciences. Using these, we were able to identify mutations that were present in both the involved and contralateral lung samples.
Data Availability
The MiSeq data from the TRACERx and PEACE studies generated, used or analysed during this study are not publicly available and restrictions apply to the availability of these data. Such MiSeq data are available through the Cancer Research UK & University College London Cancer Trials Centre ([email protected]) for academic non-commercial research purposes upon reasonable request, and subject to review of a project proposal that will be evaluated by a TRACERx data access committee, entering into an appropriate data access agreement and subject to any applicable ethical approvals.
The Duplex-seq data for the BDRE study were generated using a larger panel of probes that covered ~50 kb of the genome, spanning hotspots frequently mutated in cancers. All of the data for the EGFR and KRAS regions queried are included in this manuscript. Data for the other regions are not publicly available and restrictions apply to the availability of these data. Such Duplex-seq data are available through Professor James DeGregori ([email protected]) for academic non-commercial research purposes upon reasonable request, entering into an appropriate data access agreement and subject to any applicable ethical approvals.
2. Epidemiological Studies
Study populations
2.1) UK Biobank dataset
The UK Biobank study comprises over 500,000 participants aged 40-69 years who were recruited between 2006 and 2010. Participants provided detailed information regarding a comprehensive set of lifestyle factors, in addition to physical measurements and biological samples. Particulate matter air pollution levels (in 2010) were estimated for addresses within 400 km of the Greater London monitoring area using a land-use regression model developed as part of the ESCAPE study6.
Following a similar method to that described previously7, we first excluded all participants who had missing particulate matter or genetic principal components data. Multiple imputation with chained equations8 was used to impute missing values for the remaining 447,932 participants. The imputation model used the following variables: PM2.5, PM2.5-10, PM10, sex, BMI, ever smoking status, passive smoking (weekly hours of tobacco exposure at home), household income (dichotomised into “below” or “greater than or equal to” £31,000 annually), educational attainment (split into “below” or “degree level and above”), and the first 15 genetic principal components (to account for ethnicity). We imputed the dataset using predictive mean matching and logistic regression for continuous and binary variables, respectively, performing a maximum of 90 iterations. This yielded 5 complete versions of the original dataset in which the missing values had been imputed. Convergence was assessed by inspecting the resulting plot. Each imputed dataset was independently used in the same analysis protocol.
Participants were followed up from recruitment until either the date of each cancer diagnosis or censoring, which was defined as the time of death or latest date of cancer diagnosis, whichever was earlier. We created a multivariate Cox regression model for each imputed dataset and primary cancer type with ≥100 cases, and pooled results across these models, which were consistent for each cancer type, into a single set using Rubin’s rules8. These models included the same covariates as in the imputation model, with the addition of age at the end of follow-up for each cancer. For cancers of the larynx or lip, oral cavity and pharynx, we further corrected for alcohol consumption, excluding those participants with missing alcohol data due to the high missingness of these variables. Schoenfeld residuals were examined to assess the proportional hazards assumption and variables that failed to satisfy this assumption were modelled as time-dependent. Cancer types for which this could not reliably be performed were excluded. Individual models that failed to converge were not included, and if all models for a particular cancer type failed, then that cancer type was excluded. In total, we thus excluded uterine, acute myeloid leukaemia, melanoma, and non-melanoma skin cancers, as well as 4 models from CRC, 3 from renal (excluding pelvis), and 1 from malignant immunoproliferative disease.
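The pooling step can be illustrated with a short sketch of Rubin’s rules for a single coefficient, for example a log hazard ratio for PM2.5 estimated once per imputed dataset. The function name and the input values are hypothetical, and the study itself worked in R, so this shows only the arithmetic rather than the pipeline that was run.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: per-dataset point estimates (e.g. log hazard ratios)
    variances: per-dataset squared standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = sum(estimates) / m
    within = sum(variances) / m                                   # mean within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical log hazard ratios and squared standard errors from 5 imputations.
log_hr = [0.148, 0.152, 0.150, 0.147, 0.153]
se_sq = [0.0016, 0.0015, 0.0016, 0.0017, 0.0016]
estimate, se = pool_rubin(log_hr, se_sq)
print(f"pooled HR = {math.exp(estimate):.3f}, SE(log HR) = {se:.4f}")
```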
An interaction test between PM2.5 and smoking was performed for lung cancer. The approach described above was used to create individual multivariate Cox regression models for each imputed dataset and aggregate the results.
2.2) Within-country datasets
2.2.1) England dataset (Public Health England)
Air pollution, lung cancer incidence and EGFR mutation status could be estimated for 20 cancer alliance regions in England. This was the geographical level at which all three factors could be quantified.
Air pollution: Annual PM2.5 air pollution data (μg/m3) from 2008 to 2017 were obtained at the grid code level (1 km x 1 km) from DEFRA9. Postal code coordinates were sourced from the ONS 2018 Postal Code Directory10. To link every postal code to a grid code with pollution data, the coordinates of every postal code centroid were mapped to those of the nearest grid code centroid using the RANN package in R. The postal codes with pollution data were binned into 1 of 20 Cancer Alliance regions. PM2.5 concentration estimates were then aggregated to the Cancer Alliance region level and averaged over the period 2008 to 2017; these years were selected because they represented the 10 years prior to a lung cancer diagnosis in 2018. The air pollution levels in each Cancer Alliance region were broadly stable (within 5 μg/m3) in this time period.
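The linkage described above can be sketched as a nearest-centroid lookup followed by a per-region average. This illustration uses a brute-force search in place of the RANN package used in the study, and it assumes that regional aggregation is a simple mean over linked postal codes, which the text does not specify. All coordinates, codes and PM2.5 values are hypothetical.

```python
def nearest_grid_code(point, grid_centroids):
    """Return the grid code whose centroid is closest to `point` (Euclidean distance)."""
    px, py = point
    return min(
        grid_centroids,
        key=lambda code: (grid_centroids[code][0] - px) ** 2
        + (grid_centroids[code][1] - py) ** 2,
    )


def region_mean_pm25(postcodes, grid_centroids, grid_pm25):
    """Average the PM2.5 of each postal code's nearest grid cell within each region.

    postcodes: iterable of (region name, (easting, northing)) pairs
    """
    values = {}
    for region, point in postcodes:
        code = nearest_grid_code(point, grid_centroids)
        values.setdefault(region, []).append(grid_pm25[code])
    return {region: sum(v) / len(v) for region, v in values.items()}


# Hypothetical 1 km grid centroids (metres) and annual PM2.5 values (ug/m3).
grid = {"G1": (500.0, 500.0), "G2": (1500.0, 500.0)}
pm25 = {"G1": 10.0, "G2": 12.0}
postcodes = [
    ("Region A", (700.0, 400.0)),
    ("Region A", (1200.0, 900.0)),
    ("Region B", (1600.0, 450.0)),
]
print(region_mean_pm25(postcodes, grid, pm25))
```

For millions of postal codes a spatial index such as a k-d tree would be the practical choice; the brute-force search is used here only to keep the example self-contained.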
Lung cancer incidence: Data on 39290 lung cancers (International Classification of Diseases codes C33 to C34) diagnosed in England between 1 January 2018 and 31 December 2018 were extracted from the National Cancer Registration Dataset (NCRD) [AV2018 in CASREF01 (end of year snapshot)], held by the National Disease Registration and Analysis Service at Public Health England. Lung cancer incidence for each Cancer Alliance region was calculated based on these cases. This represented a predominantly Caucasian cohort - White: 92.03%, Asian: 1.47%, Chinese: 0.26%, Black: 1.19%, Mixed: 0.29%, Other: 1.10%, Unknown: 3.68%.
The age-standardised lung cancer incidence (using population counts obtained from the Office for National Statistics 2019 (2018 mid-year estimates)) was obtained according to each five-year age group and sex. Incidences were then combined across age and sex to yield a single value for each alliance region.
Lung cancer incidence = (sum(wi*xi/di)/sum(wi)) * 100000
wi = European population standard
di = Population Count
xi = Case Count
Rates were standardised according to the 2013 European Standard Population. Confidence intervals for ASR point estimates were calculated using the Dobson method.
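As a worked illustration of the incidence formula above, the sketch below applies direct standardisation to a set of strata (age group by sex). The stratum counts and weights are hypothetical placeholders, not the 2013 European Standard Population, and the Dobson confidence interval is not implemented.

```python
def age_standardised_rate(cases, population, standard_weights, per=100_000):
    """Directly standardised rate: sum(w_i * x_i / d_i) / sum(w_i) * per.

    cases (x_i), population (d_i) and standard_weights (w_i) are aligned by stratum.
    """
    if not (len(cases) == len(population) == len(standard_weights)):
        raise ValueError("inputs must have one entry per stratum")
    weighted = sum(w * x / d for w, x, d in zip(standard_weights, cases, population))
    return weighted / sum(standard_weights) * per


# Hypothetical three-stratum example.
cases = [5, 40, 120]
population = [200_000, 150_000, 80_000]
weights = [5_000, 6_500, 4_000]   # placeholder standard population weights
print(f"ASR = {age_standardised_rate(cases, population, weights):.1f} per 100,000")
```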
EGFR mutation proportion: For lung cancer diagnoses listed above, EGFR mutation statuses were extracted from the NCRD [AT_GENE_ENGLAND table in the CAS2107 monthly snapshot]. Only cases with “Overall: TS” as “a:abnormal” and “b:normal” for EGFR were used in the calculation for EGFR mutation rate (n=8585). The EGFR mutation rate was calculated for each Cancer Alliance region.
EGFR mutation rate = <# a:abnormal> / (<# a:abnormal> + <# b:normal>)
2.2.2) South Korea dataset (Samsung Medical Center)
Air pollution, lung cancer incidence and EGFR mutation status could be estimated for 16 geographical regions in South Korea. This was the geographical level at which all three factors could be quantified.
Air pollution: PM2.5 air pollution data were obtained from Air Korea11 for the years 2015 to 2017 for 16 standard geographical regions across Korea. Within each of the geographical regions, we averaged PM2.5 levels across the 2-year period prior to the year of lung cancer diagnosis. PM2.5 levels between 2015 and 2017 were broadly stable. We were only able to include PM2.5 data for a 2-year period for 2017 and 2018 diagnoses, as air pollution data per Korean region were only available from 2015 onwards.
Lung cancer incidence: Lung cancer incidence data were obtained from the Korean National Cancer Center12 for the years 2017 to 2018 for 16 geographical regions across Korea. Sex and smoking data were not available. Lung cancer incidence was obtained separately for each year and considered independently in Pearson correlations that are described below.
EGFR mutation proportion: Lung cancer EGFR mutation status was obtained from Samsung Medical Center lung cancer diagnoses for the years 2017 to 2018 for 16 geographical regions across Korea. (n=2563)
EGFR mutation rate = <# EGFRm>/(<# EGFRm> + <# EGFRwt>)
2.2.3) Taiwan dataset (Chang Gung Medical Foundation)
Air pollution, lung cancer incidence and EGFR mutation status could be estimated for 12 standard geographical regions in Taiwan. This was the geographical level at which all three factors could be quantified.
Air pollution: Annual PM2.5 air pollution data were obtained for 12 standard geographical regions in Taiwan from the Environmental Protection Administration Executive Yuan R.O.C. (Taiwan)13. PM2.5 (μg/m3) concentration estimates were available for each county in Taiwan from 2006 to 2017. We averaged PM2.5 levels across the 5-year period (before a 2-year washout period) prior to the year of lung cancer diagnosis. For example, for a diagnosis in 2017, 2006-2015 aggregated air pollution levels were used for analysis. A 2-year washout period was necessary to account for dramatic decreases in air pollution levels after 2013.
Lung cancer incidence: Institutional lung cancer incidence and EGFR mutation rates for each of 12 different counties in Taiwan were obtained from the Chang Gung Research Database for the years 2011-2017 (n=4599). Lung cancer incidence was obtained separately for each year and considered independently in Pearson correlations that are described below.
Institutional lung cancer incidence was estimated based on recorded lung cancer diagnoses in all Chang Gung Medical Foundation hospitals (CGMH), and the age-standardised rates (ASR) per 100,000 were calculated using the world (WHO 2000) standard population.
EGFR mutation proportion: EGFR mutation testing data were available for all of these cases. However, only 9 counties had at least 10 cases with EGFR mutation tested per year and comprised more than 5% of the total population; these counties were retained for analysis.
EGFR mutation rate = <# EGFRm>/(<# EGFRm> + <# EGFRwt>)
Relationship between EGFRm lung cancer incidence and PM2.5
Analyses were performed separately for each of the three cohorts: England, South Korea, and Taiwan.
For each geographical region within a country (e.g. each of the 20 Cancer Alliance regions in England), EGFR mutant lung cancer incidence was calculated by multiplying the total lung cancer incidence by the EGFR mutation rate (expressed as a proportion between 0 and 1).
EGFRm lung cancer incidence = <lung cancer incidence>*<EGFR mutation rate>
EGFR mutant lung cancer incidence values were compared with mean PM2.5 values across geographical regions using Pearson correlation tests.
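The regional comparison can be sketched in a few lines: derive EGFRm incidence per region, then compute the Pearson correlation coefficient against mean PM2.5. The regional values below are hypothetical and the helper names are illustrative only. The p-value of the correlation test is not computed here; in practice it would come from a statistics library (for example `cor.test` in R or `scipy.stats.pearsonr` in Python).

```python
import math


def egfrm_incidence(lung_cancer_incidence, egfr_mutation_rate):
    """EGFRm lung cancer incidence = total incidence x mutation rate (proportion 0-1)."""
    if not 0.0 <= egfr_mutation_rate <= 1.0:
        raise ValueError("mutation rate must be a proportion between 0 and 1")
    return lung_cancer_incidence * egfr_mutation_rate


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical regions: (mean PM2.5, incidence per 100,000, EGFR mutation rate).
regions = [(9.5, 70.0, 0.08), (10.8, 74.0, 0.10), (12.1, 81.0, 0.11), (13.0, 79.0, 0.13)]
pm25 = [r[0] for r in regions]
egfrm = [egfrm_incidence(r[1], r[2]) for r in regions]
print(f"Pearson R = {pearson_r(pm25, egfrm):.2f}")
```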
Sensitivity analysis for England and Korea data sets
In the England data set, there were 2 Cancer Alliance regions (South East London and Thames Valley) with sparse data due to data unavailability (<10% of lung tumours have any molecular testing data recorded (2016-2018)). To exclude the possibility of this confounding our analysis, we performed a sensitivity analysis, where we excluded data from these 2 regions. Of note, the correlation between PM2.5 and EGFRm lung cancer incidence was still significant (R=0.55; p=0.019) after these exclusions.
Similarly, in the South Korea data set, Jeju-do (2017) was excluded due to poor data availability. The correlation between PM2.5 and EGFRm lung cancer incidence was still significant (R=0.38; p=0.033) after this exclusion.
However, for completeness, we have reported the full data sets (including these 2 England regions and 1 South Korea region) in the main text.
3. Preclinical studies
Animal Procedures
Animals were housed in ventilated cages with unlimited access to food and water. All animal regulated procedures were approved by The Francis Crick Institute BRF Strategic Oversight Committee, incorporating the Animal Welfare and Ethical Review Body, conforming with UK Home Office guidelines and regulations under the Animals (Scientific Procedures) Act 1986 including Amendment Regulations 2012.
EGFR-L858R [Tg(tet-O-EGFR∗L858R)56Hev] mice were obtained from the National Cancer Institute Mouse Repository. R26tTA mice were obtained from the Jackson Laboratory. Mice were backcrossed onto a C57BL/6J background and further crossed to generate Rosa26LSL-tTa/LSL-tdTomato/Tet(O)EGFRL858R mice. Rosa26rtTa/TetO-EGFRL858R and LSL-KrasG12D mice have been described previously14,15. After weaning, the mice were genotyped (Transnetyx, Memphis, USA) and placed in groups of one to five animals in individually ventilated cages with a 12-hour daylight cycle. Recombination was initiated by adenoviral Cre (Viral Vector Core, University of Iowa, USA) delivered via intratracheal intubation (single dose, 2.5 × 10^7 virus particles/50 μl).
For exposure to fine particulate matter or control, SRM2786 from the National Institute of Standards and Technology (NIST) was resuspended in sterile PBS using sonication, and the particle size distribution was confirmed using a zetasizer. Mice were briefly anaesthetized using 5% isoflurane, then 5 μg or 50 μg of particulate matter, or control PBS, was administered intratracheally and recovery was monitored. SRM2786 has certified mass fraction values of both organic and inorganic constituents from multiple analytical techniques and represents fine PM from a modern urban environment (Schantz et al., 2016).
Fluorescence-activated cell sorting analysis and cell sorting
Mouse lungs were cut into small pieces, incubated with collagenase (1 mg/ml; ThermoFisher) and DNase I (50 U/ml; Life Technologies) for 45 min at 37°C and filtered through 70 µm strainers (Falcon). Red blood cells were lysed for 5 min using ACK buffer (Life Technologies). Cells were stained with fixable viability dye eFluor870 (BD Horizon) for 30 min and blocked with CD16/32 antibody (Biolegend) for 10 min. Cells were then stained with antibodies for 30 min (see Supplementary Table S6). Intracellular staining was performed using the Fixation/Permeabilization kit (eBioscience) according to the manufacturer’s instructions. Samples were resuspended in FACS buffer and analysed using a BD Symphony flow cytometer. Data were analysed using FlowJo (Tree Star).
Immunohistochemistry
Mouse lungs were fixed overnight in 10% formalin and embedded in paraffin blocks. Then 4 μm tissue sections were cut, deparaffinized and rehydrated using standard methods. Antigen retrieval was performed using pH 6.0 citrate buffer, and sections were incubated with the following primary antibodies: EGFR L858R mutant specific (Cell Signaling: 3197, 43B2), anti-RFP (Rockland: 600-401-379) and CD68 (ab283654). Primary antibodies were detected using biotinylated secondary antibodies followed by HRP/DAB or . Slides were imaged using a Zeiss AxioScan.Z1 slide scanner.
RNA-Sequencing (RNA-seq)
Lung CD45−CD31−Ter119−EpCAM+ cells were sorted by flow cytometry from control and PM-exposed mice after PM exposure. Total RNA was isolated using the miRNeasy Micro Kit (Qiagen), according to the manufacturer’s instructions. Library generation was performed using the KAPA RNA HyperPrep with RiboErase (Roche), followed by sequencing on a HiSeq (Illumina), to achieve an average of 25 million reads per sample.
RNA-seq Analysis
The RNA-seq pipeline of the nf-core framework (version 3.3) was launched with Nextflow version 21.04.0 to analyse RNA sequencing data16. Raw reads in fastq files were mapped to GRCm38 with associated Ensembl transcript definitions using STAR version 2.7.6a (ref. 17). BAM files were sorted by chromosome coordinate using samtools version 1.12. RSEM version 1.3.1 was used to calculate estimated read counts per gene and to quantify expression in transcripts per million (TPM)18. Differential expression analysis was performed using the LIMMA package version 3.44.1 (ref. 19) on the R platform version 4.0.3, filtering for an absolute log fold change greater than 1 and a p-value less than 0.05. Gene expression differences between treatment groups were further analysed for pathway enrichment using Gene Set Enrichment Analysis (GSEA).
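The filtering criterion in the last step can be expressed as a small predicate applied to a table of per-gene results. The gene names and values below are hypothetical, and the sketch only illustrates the thresholding; it does not reproduce the LIMMA model fit.

```python
def is_differentially_expressed(log_fc, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """True when |log fold change| > lfc_cutoff and p-value < p_cutoff."""
    return abs(log_fc) > lfc_cutoff and p_value < p_cutoff


# Hypothetical per-gene results: (gene, log fold change, p-value).
results = [
    ("GeneA", 2.3, 0.001),
    ("GeneB", -1.4, 0.020),
    ("GeneC", 0.6, 0.004),   # fails the fold-change cutoff
    ("GeneD", 1.8, 0.210),   # fails the p-value cutoff
]
significant = [gene for gene, lfc, p in results if is_differentially_expressed(lfc, p)]
print(significant)
```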
Comparison to RNA-seq data from never-smokers in COPA study
RNA sequencing was applied to 18 samples of lung brushings from 9 never-smokers from the COPA study after exposure to filtered air and diesel exhaust. Salmon (ref. 20) was used to estimate transcript-level abundance from RNA-seq read data. Differential expression analysis was performed using DESeq2 (ref. 21). The log2 fold change in gene expression before and after exposure to filtered air and diesel exhaust was calculated. P-values were adjusted using the Benjamini-Hochberg method. The log2 fold changes of significantly differentially expressed genes from the T control mouse were compared with the log2 fold changes of the same genes in COPA participants.
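For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, a minimal standalone sketch follows. DESeq2 applies this adjustment internally, so the function below is only an illustration of the procedure; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])
```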
Organoids
Lung tissue was minced manually with scissors and digested with Liberase TM and TH (Roche Diagnostics) and DNase I (Merck Sigma-Aldrich) in HBSS for 30 min at 37 °C in a shaker at 180 r.p.m. Samples were passed through a 100 μm filter and centrifuged at 1,250 r.p.m. for 10 min. The cell-pellet was incubated in Red Blood Cell Lysis buffer (Miltenyi Biotec) for 5 min at room temperature and passed through a 40 μm filter. After centrifugation, cells were washed with magnetic-activated cell sorting (MACS) buffer (0.5% BSA and 250 mM EDTA in PBS) and passed through a 20 μm strainer-capped tube to generate a single-cell suspension. Antibody staining was then performed for cell isolation or for flow cytometry analysis.
Lung organoid co-culture assays have been previously described22. Lung epithelial cells (EpCAM+CD45−CD31−Ter119−) from control or PM-exposed mice underwent fluorescence-activated cell sorting (FACS) and were resuspended in 3D organoid media (DMEM/F12 with 10% FBS, 100 U ml−1 penicillin-streptomycin and insulin/transferrin/selenium (Merck Sigma-Aldrich)). Cells were mixed with murine normal lung fibroblast (MLg) cells and resuspended in GFR Matrigel at a ratio of 1:1. Then 100 μl of this mixture was pipetted into a 24-well transwell insert with a 0.4 μm pore (Corning). In each insert, 2,000-5,000 epithelial cells and 25,000 MLg 2908 cells were seeded. After incubating for 30 min at 37 °C, 500 μl organoid media was added to the lower chamber and media was changed every other day. Bright-field and fluorescent images were acquired after 14 days using an EVOS microscope (Thermo Fisher Scientific) and quantified using Fiji (v.2.0.0-rc-69/1.52r, ImageJ).
For interleukin-1-beta ex vivo treatment of lung alveolar type II cells, digested lung from ET mice (without in vivo Cre induction) was prepared as described above. Alveolar type II (AT2) cells were sort purified as previously described (MHC Class II+CD49flowEpCAM+CD45−CD31−Ter119−)23 and incubated in vitro with 6 × 10^7 PFU/ml of Ad5-CMV-Cre in 100 μl of 3D organoid media per 100,000 cells for 1 h at 37 °C, as detailed previously24. Cells were washed three times in PBS before plating as above, with 20 ng/ml IL-1β added to the organoid media in the lower chamber and changed every other day. tdTomato+ organoids were counted as above and their size was analysed in Fiji. For wholemount staining, organoids were prepared according to previous methods25 and stained with anti-proSPC (Abcam, clone EPR19839) and anti-keratin 8 (DSHB Iowa, clone TROMA-1). 3D confocal images were acquired on an Olympus FV3000 and analysed in Fiji.
Statistics and Reproducibility
Preclinical statistical analyses were performed using Prism (v.9.1.1, GraphPad Software). Epidemiological and mutation/sequence data analyses were performed in R version 3.6.2. Graphs were produced in Prism and illustrative figures were created with Biorender.com. A Kolmogorov–Smirnov normality test was performed before any other statistical test. If any of the comparative groups failed normality (or the number of samples was too low to assess normality), a nonparametric Mann–Whitney test was performed. When groups showed a normal distribution, an unpaired two-tailed t-test was performed. When groups showed a significant difference in variance, we used a t-test with Welch’s correction. When comparing three or more groups, we performed a one-way analysis of variance (ANOVA) or a nonparametric Kruskal–Wallis test.
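The test-selection logic described above can be summarised as a small decision function. This sketch encodes only the choice of test, not the tests themselves, and it assumes that the choice between ANOVA and Kruskal–Wallis for three or more groups follows the same normality criterion as the two-group case, which the text implies but does not state. The function and argument names are hypothetical.

```python
def choose_test(n_groups, all_normal, equal_variance=True):
    """Name the statistical test implied by the selection rules.

    n_groups:       number of groups being compared
    all_normal:     True only if every group passed the normality test
                    (pass False when n is too low to assess normality)
    equal_variance: False if group variances differ significantly
    """
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if n_groups == 2:
        if not all_normal:
            return "Mann-Whitney test"
        if not equal_variance:
            return "t-test with Welch's correction"
        return "unpaired two-tailed t-test"
    # Three or more groups (assumed to follow the same normality criterion).
    return "one-way ANOVA" if all_normal else "Kruskal-Wallis test"


print(choose_test(2, all_normal=True, equal_variance=False))
print(choose_test(3, all_normal=False))
```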
No data were excluded. No statistical methods were used to predetermine sample size in the mouse studies, and mice with matched sex and age were randomized into different treatment groups. All experiments were reliably reproduced. Specifically, all in vivo experiments, except for omics data (RNA-seq), were performed independently at least twice, with the total number of biological replicates (independent mice) indicated in the corresponding figure legends.
Methods References
1. Jamal-Hanjani, M. et al. Tracking the Evolution of Non-Small-Cell Lung Cancer. N. Engl. J. Med. 376, 2109–2121 (2017).
2. Dahlgren, M. et al. Preexisting Somatic Mutations of Estrogen Receptor Alpha (ESR1) in Early-Stage Primary Breast Cancer. JNCI Cancer Spectr. 5, pkab028 (2021).
3. Kennedy, S. R. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 9, 2586–2606 (2014).
4. Schmitt, M. W. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl. Acad. Sci. U. S. A. 109, 14508–14513 (2012).
5. Valentine, C. C. et al. Direct quantification of in vivo mutagenesis and carcinogenesis using duplex sequencing. Proc. Natl. Acad. Sci. U. S. A. 117, 33414–33425 (2020).
6. Eeftens, M. et al. Development of Land Use Regression models for PM(2.5), PM(2.5) absorbance, PM(10) and PM(coarse) in 20 European study areas; results of the ESCAPE project. Environ. Sci. Technol. 46, 11195–11205 (2012).
7. Huang, Y. et al. Air Pollution, Genetic Factors, and the Risk of Lung Cancer: A Prospective Study in the UK Biobank. Am. J. Respir. Crit. Care Med. 204, 817–825 (2021).
8. Buuren, S. van & Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 45, 1–67 (2011).
9. Department for Environment, Food and Rural Affairs (Defra). Modelled background pollution data - Defra, UK. https://uk-air.defra.gov.uk/data/pcm-data#population_weighted_annual_mean_pm25_data.
10. ONS Postcode Directory (Latest) Centroids. https://geoportal.statistics.gov.uk/datasets/ons-postcode-directory-latest-centroids/explore?showTable=true.
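11. Air Korea. https://www.airkorea.or.kr/web.
12. Cancer registration statistics > Korea Central Cancer Registry > National Cancer Control Programme | National Cancer Center. https://ncc.re.kr/cancerStatsList.ncc?sea.
13. Environmental Protection Administration, Executive Yuan. Taiwan Air Quality Monitoring Network. https://airtw.epa.gov.tw/CHT/Query/His_Data.aspx.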
14. Politi, K. et al. Lung adenocarcinomas induced in mice by mutant EGF receptors found in human lung cancers respond to a tyrosine kinase inhibitor or to down-regulation of the receptors. Genes Dev. 20, 1496–1510 (2006).
15. Jackson, E. L. et al. Analysis of lung tumor initiation and progression using conditional expression of oncogenic K-ras. Genes Dev. 15, 3243–3248 (2001).
16. Ewels, P. A. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat. Biotechnol. 38, 276–278 (2020).
17. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
18. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
19. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
20. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
21. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
22. Nolan, E. et al. Radiation exposure elicits a neutrophil-driven response in healthy lung tissue that enhances metastatic colonization. Nat. Cancer 3, 173–187 (2022).
23. Major, J. et al. Type I and III interferons disrupt lung epithelial repair during recovery from viral infection. Science 369, 712–717 (2020).
24. Dost, A. F. M. et al. Organoids Model Transcriptional Hallmarks of Oncogenic KRAS Activation in Lung Epithelial Progenitor Cells. Cell Stem Cell 27, 663-678.e8 (2020).
25. Dekkers, J. F. et al. Long-term culture, genetic manipulation and xenotransplantation of human normal and breast cancer organoids. Nat. Protoc. 16, 1936–1965 (2021).