Selection and recruitment of the Japanese Cent, HA, and Cont groups
For the centenarian group, we used data from two prospective cohort studies of the oldest individuals in Japan, the Tokyo Centenarian Study (TCS) and the Japan Semisupercentenarian Study (JSS)3,44. The cut-off date for the data collected from the Cent group was May 31, 2022. Among the 967 centenarians from whom gDNA was obtained, 964 centenarians were included, with data from three centenarians who had mismatched sex information or were assigned to close relatives (PI HAT of 0.1875 or higher, which represents the half-way point between 2nd- and 3rd-degree relatives) excluded; this resulted in a study population comprising 144 men and 820 women (female-to-male ratio: 0.851) with a median age of 106.0 years [interquartile range (IQR): 103.9-107.1]. Among these individuals, there were 915 deaths (death rate: 0.969), with a median age at death of 107.5 years [IQR: 105.7-109.2] (Table 1, Extended Data Figure 1). To validate CentPGS, an additional 60 Japanese centenarians aged 100-101 years from the 5-COOP project32 and 36 siblings aged 90-104 years were enrolled in this study. These additional samples were not excluded because they met the same exclusion criteria as the centenarians, but the close relative criterion was not applied to the siblings of the centenarians.
For the HA group, data from the Kawasaki Ageing and Wellbeing Project (KAWP), a prospective cohort study of older adults aged between 85 and 89 years with no limitations in performing activities of daily living (ADLs) at baseline, were used (Extended Data Figure 3)45,46. The cut-off date for the KAWP data was Sept. 30, 2022. Among the 1,026 participants in the KAWP, two individuals were excluded due to a lack of permission to determine their gDNA sequence, and eight individuals were excluded due to being close relatives; thus, 1,016 individuals were enrolled as healthy agers (HAs; 513 men and 503 women; female-to-male ratio: 0.495), with a median age of 86.8 years [IQR: 85.9-88.2], 168 deaths (death rate: 0.169), and a median age at death of 90.4 years [IQR: 89.0-91.8] (Table 1, Extended Data Figure 1).
The Tohoku Medical Megabank (TMM) project comprises community-based prospective cohort studies that include a population‐based adult cohort47 and a three‐generation cohort48. Among 30,081 participants, 11,519 individuals from the three-generation cohort were excluded because they were close relatives, 1,237 individuals were excluded due to lack of phenotype, and 9 individuals were excluded because the data were identified as PCA outliers from the Japanese cluster; thus, 17,306 individuals were enrolled as Japanese controls (Cont; 4,728 women (female-to-male ratio: 0.647) with a median age of 45 years [IQR: 32-63]) (Table 1, Extended Data Figure 1). Allele frequency data for ToMMo38k were downloaded from jMorp49 [https://jmorp.megabank.tohoku.ac.jp]. Written informed consent was obtained from the participants. The ethics committee of Tohoku University approved the protocol for all of the cohort studies (ID: 2023-4-097, 2023-4-098), which have been previously described47,48.
Genomic DNA extraction
For individuals in the Cent and HA groups, total gDNA was extracted from whole blood using a FlexGene DNA Kit (Qiagen, Hilden, Germany). We confirmed the quality of the gDNA by agarose gel electrophoresis and found that the gDNA was not degraded. For individuals in the Cont group, total gDNA was extracted from whole blood as previously described49.
Whole-genome DNA sequencing and analysis
For the Cent and HA groups, whole-genome DNA sequencing was performed for 526 centenarians via the HiSeq2500, HiSeqX, or NovaSeq 6000 platforms as previously described50. The whole-genome DNA sequence of 3,340 Japanese controls was determined using whole-genome DNA sequencing with HiSeq 2500 or NovaSeq 6000 as previously described49. Resequencing analysis was performed as described by Tadaka et al. with minor modifications49. In brief, a workflow known as the GATK Best Practices workflow, which is becoming the standard procedure globally for whole-genome resequencing analysis51, was used. Then, base quality score recalibration (BQSR) was applied to effectively reduce sequencer-specific bias.
Genotyping and imputation using a DNA microarray
The genotypes of 0.65 M SNVs of 441 centenarians, 60 additional centenarians, 36 siblings of centenarians, and 26,741 controls were determined using an Axiom Japonica Array NEO according to the manufacturer’s protocol. The genotypes of 0.65 M SNVs of 1,015 individuals in the KAWP were determined using an Infinium Asian Screening Array-24 v1.0 BeadChip Kit according to the manufacturer’s protocol. All DNA microarray images were analysed using previously described protocols 49. Genotypic imputation was performed to estimate the genetic variants among the genotypes identified by a DNA microarray. Prior to genotypic imputation, prephasing was performed by SHAPEIT52 version v2.r387 with the duoHMM method and a window size of 5. After this prephasing step, genotypic imputation was carried out using IMPUTE253 version 2.2.2 with the ToMMo 3.5KJPNv2 haplotype reference panel. For IMPUTE2, we used the following options: prephased haplotypes (-use_prephased_g), effective population size (-Ne) 20000, and number of reference haplotypes (-k_hap) 7000. In the imputation process, we divided each autosome into 3 Mb chunks for entry. After the imputations for each chunk were completed, we concatenated these imputed chunks to reconstruct a contiguous autosome.
Outlier identification by PCA
To identify outliers in the Japanese population, gDNA sequence samples from the Cent, HA, and Cont groups were subjected to PCA along with 2,504 gDNA sequences determined from the 1000 Genomes Project54, which contains data from 26 human races, including Asian, European, and African populations. Common SNVs between samples in the present study and the 1000 Genomes samples were extracted, SNVs were pruned using PLINK (version 1.90), and PCs 1-20 were computed using the "pca" command in PLINK 55.
GWAS and meta-GWAS analysis
To identify relevant SNVs in Japanese centenarians, we compared gDNA sequences between the Cent and Cont groups via a GWAS. WGS and DNA microarray-imputation samples were subjected to GWAS analysis separately using PLINK 1.9055 and merged as a meta-GWAS using N-weighted multivariate GWAMAs25. For N-weighted multivariate GWAMA data, summary statistics for 5.98 M SNVs commonly found between WGS and DNA microarray-imputation samples were extracted. The cross-trait intercept between the WGS and DNA microarray-imputation samples was 0.0606, and the SNP heritability values of the WGS and DNA microarray-imputation samples were 0.3211 and 0.2757, respectively. A Manhattan plot was generated using the "qqman" package (version 0.1.8) in R56. An enlarged view of a Manhattan plot with recombination rate information was generated using LocusZoom (version 1.3)57.
Calculations of SNPh2, lGC and genetic correlation using LDSC and GWAS summary statistics
SNP heritability, lGC, and genetic correlations were calculated using LDSC (version 1.0.1)20. For longevity-associated GWAS summary statistics, the European 90th/99th survival percentiles9 and parental lifespan (PLS)13 were used. Japanese disease and quantitative GWAS summary statistics were downloaded from the RIKEN JENGER server (http://jenger.riken.jp)58, and Japanese GWAS summary statistics for Alzheimer’s disease were downloaded from NBDC (https://humandbs.biosciencedbc.jp/hum0237-v1)59.
Transcript expression and promoter prediction analyses to identify relevant tissue-specific genes
The SNV-associated genes associated with e quantitative trait locus (eQTLs), sQTLs, ieQTLs, and isQTLs were analysed using the Genotype-Tissue Expression (GTEx) portal (V8, https://www.gtexportal.org/home/)26. The promoter predictions were performed according to the Enformer promoter prediction method27. Briefly, the Enformer architecture consists of three parts: (1) 7 convolutional blocks with pooling, (2) 11 transformer blocks, and (3) a cropping layer followed by final pointwise convolutions branching into two organism-specific network heads. Enformer takes a one-hot-encoded DNA sequence as input and predicts 5,313 genomic tracks for the human genome. Enformer generates a score summarizing the effect of a given variant as a single number per gene by comparing expression predictions for both variant and reference sequences. Finally, we extracted the data corresponding to the lead SNVs in the centenarian GWAS.
Generalized gene-based test for Japanese centenarians via MAGMA
A generalized gene-based test for Japanese centenarians in this GWAS was conducted using the FUMA web application (https://fuma.ctglab.nl/)60 to estimate longevity-associated genes. Independent significant SNPs were identified according to their P values (P < 1.0 × 10-2) and independence (r2 < 0.6 in the 1000 Genomes phase 3 ALL reference panel population) within a 250-kb window. The results of MAGMA gene-set analysis28 were assessed with both GO and KEGG pathway enrichment analysis with the "clusterProfiler" package in R. MAGMA tissue-expression analysis28 was conducted with eQTL data from the Genotype-Tissue Expression project (GTEx v8)26.
The ageing-related gene lists for the longevity map (341 genes, build 3), cell age (866 genes, build 3), and GenAge (307 genes, build 21) were downloaded from the Human Ageing Genomic Resource (https://genomics.senescence.info)29. The longevity-regulating pathway genes (89 genes), including those involved in the IGF/insulin, sirtuin, AMP-AMPK, and TOR pathways (hsa04211), were identified in the KEGG pathway database (https://www.genome.jp/)30. The gene list for the Japanese T2D risk locus (286 genes) was obtained from Imamura et al.31. The gene symbols in the gene lists were converted to Entrez IDs using the "bitr" command in the "clusterProfiler" package in R, and the overlapping Entrez IDs were counted in R.
Heritability enrichment analysis against the tissue-specific expressed genes or enhancer regions using LDSC
Heritability enrichment analysis was performed according to the instructions for cell type-specific analyses on GitHub for LDSC (https://github.com/bulik/ldsc/)61. LD score data for the East Asian population, including 10 cell type group-specific annotations and 220 cell type-specific annotations, were downloaded from the RIKEN Jenner server (http://jenger.riken.jp)62.
PRS, Z score of PRS, and lsmean for PRS Z score
PRSs for Japanese centenarians, Japanese diseases, and Japanese quantitative GWAS summary statistics were calculated with PRS-CS (version 1.0.0)63. The SNVs identified from the GWAS summary statistics were screened at a significance level of P < 0.01, the phi parameter was set to 0.01, and the LD reference panels were generated with the EAS reference, which was constructed using the 1000 Genomes Project phase 3 samples. The other parameters used were set to their defaults.
All PRSs were Z score standardized using the mean and standard deviation of the Cont(PRS) group. The average PRS in each group was calculated using the lsmeans method adjusted for sex to account for the different female-to-male ratios among the Cent, HA and Cont groups.
CentPGS
The SNVs identified from the Japanese centenarian GWAS summary statistics were screened at a significance level of P < 0.01. CentPGS was calculated with PRS-CS (version 1.0.0)63 with the phi parameter set to 0.01, and the LD reference panels were generated with the EAS reference. CentPGS was Z score standardized using the mean and standard deviation of the Cont(PRS) group.
Multiple logistic regression analysis for PRS and ROC analysis
For multiple logistic regression analysis using genetic factors, 62 PRSs, the genotypes of 4 SNVs, and CentPGS (only for logistic regression analysis between the HA and Cont(PRS) groups) were evaluated by univariate logistic regression adjusted for sex and LASSO to select covariates for multiple logistic regression analyses among the Cent, HA, and Cont(PRS) groups. All phenotype abbreviations for the 62 PRSs are listed in Supplementary Table 3. After covariates were selected, differences in genetic components among the Cent, HA, and Cont(PRS) groups and in the contributions of PRSs and SNV genotypes were analysed via multiple logistic regression analysis. The ROC plot was generated, and the AUC was calculated using the "pROC" package in R with default parameters.
Kaplan‒Meier and Cox regression analyses and regularized multiple Cox regression survival analyses
For the univariate and multivariate survival analyses, Kaplan‒Meier or regularized multiple Cox regression survival analyses were performed using the "survival" package in R. For the survival analysis endpoints, a telephone survey was used to determine the age of individuals in the Cent group at death, whereas both the telephone survey and medical insurance claims were used to determine the age of those in the HA group at death. Long-term care insurance claims were used to determine age at the end of each individual’s healthspan; specifically, the age at which long-term care grade 2 or greater was confirmed or the age at death in the medical insurance claim was used. These claims data were provided by the local government with the consent of the participants.
Baseline examination for observed traits in the Cent and HA groups
The methods for obtaining all observed traits at the baseline examination except telomere length were described previously3,44-46. Briefly, both Instrumental Activities of Daily Living (IADL) and Activities of Daily Living (ADL) are questionnaires used to assess the functions necessary for daily living. The timed up and GO (TUG) test, which measures the time required to rise from a chair, walk three metres, turn approximately 180 degrees, return to the chair, and sit down while turning 180 degrees, was used to assess mobility. BMI, SBP, and DBP were measured at the baseline health check. The Mini-Mental State Examination (MMSE; 0-30 points) is a questionnaire used to assess cognitive function64. The Extended Clinical Dementia Rating Scale (ex.CDR) is an assessment method for later stages of dementia33. The Geriatric Depression Scale (GDS) is a questionnaire used to assess levels of depression in older individuals65. The frailty index was calculated using the deficit accumulation model proposed by Rockwood66. Blood biomarker concentrations, including NTproBNP, cystatin C, and interleukin-6 (IL-6), were measured via ELISA. Blood tests for triglyceride (TG), high-density lipoprotein cholesterol (HDLC), LDLC, choline esterase (CHE), aspartate aminotransferase (AST), haemoglobin A1c (HbA1c), C-reactive protein (CRP), and albumin (ALB) were performed by SRL, a clinical laboratory in Japan. Medical history, including pneumonia, cancer, and dementia, was collected via interviews with a physician (Supplementary Table 5).
Phenotype-genotype correlations
To evaluate correlations between phenotype and genotype, all baseline observed traits were analysed using a generalized linear model with age at enrolment and sex as moderator variables. To compare the coefficients between observed traits, all scores of the observed traits were standardized.
Percentile analysis for Japanese centenarians
To correct for the variances in age that were due to differences in the number of births in each year, percentiles were calculated to map the age at death to the percentile of the population born in the same year. Demographic data for the number of births and the population by age were extracted from Japanese census data for the years 2000, 2005, 2010, 2015, and 2020 and were downloaded from e-Stat, a portal site for Japanese government statistics (https://www.e-stat.go.jp/en/). Finally, individuals exceeding the 90th, 99th, and 99.9th percentile ages were calculated for each birth year, and thresholds for the 90th, 99th, and 99.9th percentiles were determined. Among Japanese centenarians, 77.7% (749/964) were older than the 99.9th percentile.
Statistical analyses
The baseline characteristics, biomarkers, and medical history are expressed as the median and interquartile range or number with a percentage (Supplementary Table 5). Differences in baseline data were evaluated using the Wilcoxon rank-sum test, chi-square test, and Fisher’s exact test (Supplementary Table 5). All the statistical analyses were performed using R (version 4.2.2) with exactRankTests (wilcox.exact, Wilcoxon rank-sum test [version 0.8-31]), glmnet (LASSO and multivariate analyses [version 4.1-8]), survival (survival analysis [survfit, coxph, and cox.zph] [version 3.2-13]), lsmeans (lsmeans [version 2.30-0]), stats (glm [version 4.2.2]), pROC (roc, [version 1.18.4]) and default packages. The statistical significance threshold for multiple logistic regression, regularized multiple Cox regression, Kaplan‒Meier and Wilcoxon rank sum test analyses was set at 0.05, and the statistical significance threshold for the lsmean, multiple regression and genetic correlation with multiple testing was set at 0.05 with Bonferroni correction.