Evaluation of procedures for normalizing and analysing SomaScan proteomics assay

doi:10.21203/rs.3.rs-4862220/v1


This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Abstract

The aptamer-based SomaScan assay measures thousands of proteins. SomaLogic provides a multi-step pre-processing procedure to reduce the technical variability of these data. This paper evaluates how each step of this procedure affects analysis results. We performed a comparative assessment using data from two randomised clinical trials in weight management. We show that SomaLogic’s adaptive normalization by maximum likelihood (ANML) procedure introduces a bias to fold change estimates, with a median bias of +3.7% and +3.4% in the two trials. The bias was confirmed by a simulation study, where ANML introduced false positive findings. Additionally, the plate scaling procedure has no effect on the data when the calibration step is included. However, SomaLogic’s pipeline excluding ANML does reduce technical variability without a substantial impact on fold change estimates. We recommend that researchers considering the use of ANML in clinical trials verify the absence of this bias.

Biological sciences/Computational biology and bioinformatics

Biological sciences/Computational biology and bioinformatics/Proteome informatics

Biological sciences/Computational biology and bioinformatics/Statistical methods

normalization

proteomics

SomaScan

clinical trials

robust statistics

Introduction

Proteomics is increasingly being used in clinical research and offers potential for drug repurposing, biomarker discovery and precision medicine. Multiple assays are available for high-throughput measurement of proteins¹. This paper focuses on SomaLogic’s SomaScan assay² version 4.2.7 which contains 7289 aptamers that are used to measure 6386 human proteins (unique Entrez Gene Symbols). Alternative assays include Olink’s Proximity Extension Assay (PEA)³ and mass spectrometry-based proteomics assays⁴.

Analytical pipelines typically encompass pre-processing procedures to reduce technical variability caused by sample handling and varying laboratory conditions. However, care must be taken to ensure that these procedures do not affect relevant biological signal.

Assays are often designed to accommodate these pre-processing procedures. Bridging samples are technical replicates that are included across different batches. They can be used to remove variability between runs and, being technical replicates, are unlikely to affect biological signal. Examples of bridging samples include SomaScan’s calibrators, which are used in the plate scaling and calibration steps, and Olink’s inter-plate controls, which serve a similar function³. Another tool is internal controls, which are insensitive to biological differences but capture technical variability between samples. SomaScan’s hybridization control elutions are an example and are used in the hybridization control normalization step. Olink uses extension controls to reduce noise from the extension and amplification steps³.

We use the term sample level normalization for methods that aim to remove technical variability between samples without the use of internal controls. These methods assume that biological differences between samples appear in only a small number of analytes, and they adjust samples to align the general distribution of analytes. An example is probabilistic quotient normalization⁵, which forms the basis for SomaLogic’s processing steps. A multitude of such methods are commonly used in transcriptomics⁶ and mass spectrometry⁷,⁸. SomaLogic’s ANML procedure is an example of sample level normalization.

Previous assessments of SomaScan’s pre-processing procedure focused on how well it reduced variability in technical duplicates⁹. Here, we instead assess how SomaLogic’s pre-processing impacts fold change estimates, whether any steps are redundant, and which methods are appropriate for analysis.

Our evaluation is based on SomaScan data from the clinical trials STEP 1 and STEP 2 on weight management. In both trials, treatment with semaglutide is compared to placebo. In addition, we perform simulation studies to evaluate the pre-processing procedures in two scenarios: in the first, the numbers of upregulated and downregulated aptamers are equal; in the second, only downregulated aptamers are simulated.

SomaLogic’s default normalization procedure consists of 5 steps: hybridization normalization, intraplate normalization, plate scaling, calibration, and adaptive normalization by maximum likelihood (ANML). These steps aim to reduce technical variability and remove batch effects. We use a specialised notation system to refer to various combinations of these steps. This system is explained in Fig. 1a.

Results

Normalization procedures impact on analysis results

We consider how different normalization procedures impact the analysis of differential protein abundances between semaglutide and placebo arms in STEP 1 and STEP 2. Analysis was done by robust regression using bisquare M-estimation. The results of each aptamer are considered a match between different normalization procedures if both are non-significant or if they are both significant with estimated effects in matching directions. We consider an effect significant at a 5% false discovery rate using Benjamini-Yekutieli¹⁰.
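The matching rule and the false discovery rate correction described above can be sketched as follows. This is a minimal Python illustration and not the code used for the analysis, which was done in R. The function names `benjamini_yekutieli`, `category`, and `match_rate` are hypothetical, and the adjusted p-value form of the Benjamini-Yekutieli procedure is assumed.

```python
def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values (step-up, arbitrary dependence)."""
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


def category(log_fc, adj_p, alpha=0.05):
    """Classify one aptamer as non-significant, upregulated or downregulated."""
    if adj_p > alpha:
        return "ns"
    return "up" if log_fc > 0 else "down"


def match_rate(log_fc_a, adj_p_a, log_fc_b, adj_p_b, alpha=0.05):
    """Fraction of aptamers placed in the same category by procedures A and B."""
    matches = sum(
        category(fa, pa, alpha) == category(fb, pb, alpha)
        for fa, pa, fb, pb in zip(log_fc_a, adj_p_a, log_fc_b, adj_p_b)
    )
    return matches / len(log_fc_a)
```

Two aptamers that are both non-significant, or both significant with the same sign, fall into the same category and therefore count as a match.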

Documentation describing the normalization procedure is publicly available²; however, the code itself and the references used are proprietary. We therefore reverse-engineered the procedure, which allowed us to remove individual steps. We also implemented a simpler alternative to ANML, which we refer to as median normalization. Figure 1b shows that the reverse-engineered version of the full pipeline closely matches that of the vendor, with 98.2% of aptamers matching in STEP 1 and 99.2% matching in STEP 2. The estimated log fold changes were also highly correlated, with correlations of 0.999 and 0.997 for STEP 1 and STEP 2, respectively. The remaining discrepancy is likely due to the difference in reference values used. The vendor uses proprietary values for the hybridization normalization and ANML steps, whereas the reverse-engineered implementation uses references based on the median of the pooled STEP 1 and STEP 2 data (using study-specific references made no notable difference).

We now compare different combinations of normalization steps to see how they impact results. Figure 1c shows a substantial difference in STEP 1 between the categorization of results for raw data and for V-HIPC, with only 48.4% of aptamers matching. The overlap is similarly low, at 45.7%, for V-HIPC vs. V-HIPCA. In STEP 2, 93.8% of aptamers match for V-HIPC and V-HIPCA; however, if aptamers that are non-significant for both procedures are removed, the overlap is only 38.3%. A similar trend is seen for raw vs. V-HIPC in STEP 2.

Figure 1d shows that the plate scaling step has no effect on results if the calibration step is applied (RE-HIPC vs. RE-HIC). No aptamers switch category, and the log fold changes are equal, giving a correlation of 1.000. This is because RE-HIPC and RE-HIC produce identical data. Removing the intraplate normalization step has a negligible effect, with 96.5% of aptamers matching in STEP 1. However, additionally removing the hybridization step does lead to a substantial number of mismatching aptamers, with an overlap of 85.4% in STEP 1.

Figure 1e evaluates whether the complexity of ANML is warranted. ANML uses an iterative process based on reference ranges for each analyte. It is compared to a simpler median normalization step. The simpler process gives results similar to those of ANML, with 98.2% overlap in STEP 1.

Figure 2a shows volcano plots for the treatment effect using raw data, V-HIPC, and V-HIPCA. In STEP 1, we see 811 significantly downregulated aptamers with the raw data; this increases to 3422 with V-HIPC but drops to 741 with V-HIPCA. V-HIPC finds 125 significantly upregulated aptamers in STEP 1, and this increases to 632 aptamers with V-HIPCA. This trend, where the numbers of up- and downregulated aptamers are more even with V-HIPCA than with raw data and V-HIPC, is also seen in STEP 2. Figure 2c shows a histogram of the ratio of the fold change estimates of V-HIPCA over V-HIPC, which shows that ANML shifts the estimated fold change of all aptamers upwards. In Supplementary Fig. 1, we see the same phenomenon of ANML shifting fold change estimates for gender and age at baseline.

We investigate this shift in fold change estimates using two simulations. For Fig. 2b and Fig. 2d, we simulated a treatment effect in a portion of the aptamers. The first simulation had an equal number of up- and downregulated aptamers (balanced effects), and the second had only downregulated aptamers (unbalanced effects). The second simulation simplifies the trend seen in STEP 1 and STEP 2, where most aptamers are downregulated. Figure 2d shows that ANML introduces a bias in the simulation with unbalanced effects, whereas it does not in the simulation with balanced effects. With unbalanced effects, ANML introduces 101 false positives (5.0% of significant aptamers) with a positive log fold change, whereas only 1 false positive is found using RE-HIPC. Using RE-HIPC results in 10 false negatives, and ANML increases this to 324.

ANML scale factors

ANML is based on scale factors by which the data are multiplied. Ideally, the scale factors adjust for technical factors while preserving the biological signal. We test whether ANML’s scale factors are associated with biological factors, in particular semaglutide treatment, age, and gender. Table 1 shows that ANML produces scale factors that are significantly associated with treatment, gender, and age. For treatment, the association is seen at Visit 24 but not at Visit 2. This means that the measurements of different groups of subjects are shifted relative to each other, which explains the shift in fold change estimates.

Table 1

Analysis of scale factors of ANML (SomaLogic’s implementation). The log2 scale factors were fitted against different factors using robust regression with a bisquare M-estimator at 90% efficiency. For the analysis of age, the fold change shows the increase for a 10-year difference.
analysis	contrast	visit	study	fold change	p-value
age	-	Visit 2	STEP 1	1.02	P < 0.001
age	-	Visit 2	STEP 2	1.02	P < 0.001
age	-	Visit 24	STEP 1	1.01	P < 0.001
age	-	Visit 24	STEP 2	1.03	P < 0.001
gender	Male / Female	Visit 2	STEP 1	1.02	P < 0.001
gender	Male / Female	Visit 2	STEP 2	1.03	P < 0.001
gender	Male / Female	Visit 24	STEP 1	1.02	P < 0.001
gender	Male / Female	Visit 24	STEP 2	1.03	P < 0.001
treatment	Treatment / Placebo	Visit 2	STEP 1	1.00	P = 0.979
treatment	Treatment / Placebo	Visit 2	STEP 2	1.00	P = 0.549
treatment	Treatment / Placebo	Visit 24	STEP 1	1.04	P < 0.001
treatment	Treatment / Placebo	Visit 24	STEP 2	1.02	P < 0.001

Estimated proportion of technical variability

To estimate the proportion of variation that is technical, we compare the variation of subject samples against that of QC samples. The QC samples are technical replicates and therefore contain no biological variation, whereas the subject samples contain both biological and technical variability. The proportion of technical variability is estimated as the quotient of the median absolute deviation (MAD) of the QC samples over the MAD of the subject samples. Figure 4a shows that the hybridization normalization step reduces the estimated proportion of technical variability from 69.3% (raw) to 39.5% (RE-H). The RE-HIPC procedure further reduces the estimated proportion to 31.3%. However, including ANML on top of this pipeline slightly increases the proportion of technical variability to 32.6%. ANML decreases variability in both subject samples (Fig. 4b) and QC samples (Fig. 4c); however, the increase in the ratio of the two suggests that the procedure is removing biological signal.
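The MAD quotient for a single aptamer can be illustrated with a short sketch. It assumes values on the log2 scale and an unscaled MAD, neither of which is stated explicitly in the text, and the function names `mad` and `technical_proportion` are hypothetical. How the per-aptamer quotients are summarised into the single percentages reported above is not specified here, so the sketch stops at one aptamer.

```python
import statistics


def mad(values):
    """Median absolute deviation around the median (no consistency constant)."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)


def technical_proportion(qc_values, subject_values):
    """Quotient of the MAD of QC replicates over the MAD of subject samples."""
    return mad(qc_values) / mad(subject_values)


# Toy log2 measurements for one aptamer (invented for illustration).
qc = [10.0, 10.1, 9.9, 10.05, 9.95]      # technical replicates only
subjects = [9.0, 9.5, 10.0, 10.5, 11.0]  # technical plus biological spread
proportion = technical_proportion(qc, subjects)
```

If a normalization step shrinks the subject MAD more than the QC MAD, this quotient rises, which is the pattern the text interprets as a loss of biological signal.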

Robust regression

In Fig. 3c we see a QQ-plot for all sample types, all of which have heavy tails. Seeing heavy tails for buffers, QC samples, and calibrators indicates that the extreme values are a result of technical variability and do not reliably reflect biology. This violates the assumption of normally distributed residuals in linear regression. We therefore use robust regression with M-estimation.

Figure 3a shows that robust methods find more aptamers with significant differential abundance. However, linear regression is not simply a more conservative option: 672 aptamers are statistically significant with linear regression but non-significant with M-estimation in STEP 1 (only 9 aptamers in STEP 2).

In Fig. 3b we see that, in STEP 1, linear regression and M-estimation categorize 80.8% of aptamers the same in terms of significance and directionality, with a correlation of 0.857 between the estimated log fold changes. For STEP 2, 93.3% of aptamers match, with a correlation of 0.835 between the fold change estimates.

An alternative method that is robust to outliers is rank-based inverse normal transformation (RBINT), which has previously been applied to SomaScan data¹¹. In Supplementary Fig. 2, we compare RBINT to alternatives using gender as the only covariate. Gender was chosen because it allows us to use the non-parametric Kruskal-Wallis test, giving a well-established alternative against which to compare RBINT. RBINT is based only on the rank of measurements, meaning it does not produce meaningful fold change estimates. It has also previously been shown to inflate type I error in certain circumstances¹². In our comparison, it finds more positive results than Kruskal-Wallis, making it unlikely that it provides type I error control. Research has, however, shown that RBINT outperforms linear regression on raw values in specific simulations¹³.
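To make concrete why RBINT discards the magnitude of measurements, a minimal sketch of the transformation is given below. The offset of 3/8 (the Blom variant) and average ranks for ties are assumptions, since the variant used is not stated here, and the function name `rbint` is hypothetical.

```python
from statistics import NormalDist


def rbint(values, offset=3 / 8):
    """Rank-based inverse normal transformation with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Find the run of tied values and give each the average rank.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


# The extreme value 100.0 is mapped to the same score as 7.0 would be,
# because only the ordering of the measurements is used.
with_outlier = rbint([5.0, 1.0, 100.0])
without_outlier = rbint([5.0, 1.0, 7.0])
```

Because both inputs share the same ordering, they give identical transformed values, which illustrates why a regression coefficient on the transformed scale cannot be read as a fold change.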

Spuriomers

SomaScan includes 20 spuriomers, which are random sequences of DNA that do not target any known human protein and thus should not be associated with biological factors such as treatment, gender, and age. We can therefore use them to assess non-specific binding for the raw data and the normalization procedures V-HIPC and V-HIPCA. We assess treatment, gender, and age to see how biological signal impacts the spuriomers.

Supplementary Fig. 3 shows volcano plots of these 20 spuriomers for treatment, gender, and age. Previously we corrected for multiplicity using all aptamers; in this analysis, however, we only correct for the 20 spuriomers. We see spuriomers with a significant association with treatment for both V-HIPC and V-HIPCA in STEP 1, but none for STEP 2. For gender we see 8 significant spuriomers for V-HIPC in STEP 2, and 1 significant spuriomer for STEP 1 and V-HIPCA. For age we see many significant spuriomers for both studies and all three levels of normalization.

Discussion

Previous work demonstrated that SomaLogic’s normalization pipeline reduces variation in technical replicates⁹. We corroborated a reduction in total variability; however, we also explored how the pipeline impacts fold change estimates and biological variability, based on clinical trial data and simulations.

We found evidence that the ANML step introduces a bias to fold change estimates and likely introduces false positive results when there is a substantial number of treatment-dependent aptamers with effects predominantly skewed in one direction. That ANML distorts biological signal is corroborated by previous research showing that ANML substantially reduces correlation with Olink’s proximity extension assay¹⁴. SomaScan data violates a key assumption of probabilistic quotient normalization, namely that treatment should not impact the overall distribution of features⁵. This violation produces scale factors that are associated with semaglutide treatment, causing the bias. We also find evidence that ANML removes biological signal. This is evidenced by the reduced number of significant aptamers for treatment, as well as the increased proportion of technical variability.
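The mechanism behind this bias can be illustrated with a toy example of median fold change normalization on the log2 scale. All numbers are invented for illustration: 101 aptamers share one reference value, technical noise is spread evenly between -0.5 and 0.5, and roughly 30% of aptamers are downregulated by 0.3 in the treated sample. The function name `median_log_scale_factor` is hypothetical.

```python
import statistics


def median_log_scale_factor(log2_reference, log2_sample):
    """log2 of a median-ratio scale factor: median of log2(reference / sample)."""
    return statistics.median(r - x for r, x in zip(log2_reference, log2_sample))


n_aptamers = 101
reference = [10.0] * n_aptamers
noise = [-0.5 + j / 100 for j in range(n_aptamers)]  # symmetric technical noise

# Placebo sample: reference plus noise, no treatment effect.
placebo = [r + e for r, e in zip(reference, noise)]
# Treated sample: 3 out of every 10 aptamers are downregulated by 0.3 (log2).
treated = [x - 0.3 if j % 10 < 3 else x for j, x in enumerate(placebo)]

log_sf_placebo = median_log_scale_factor(reference, placebo)
log_sf_treated = median_log_scale_factor(reference, treated)
```

In this toy setting the placebo sample receives a log scale factor of zero, while the treated sample receives a positive one. Every aptamer in the treated sample is therefore shifted upwards, including those without any treatment effect, which is the kind of upward shift in fold change estimates described above.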

Having many significant aptamers that are skewed in one direction was also seen in a study that compared an Alzheimer’s cohort with a control cohort¹⁵. That study found 5119 downregulated and 17 upregulated aptamers using SomaScan on plasma. The high relative number of downregulated proteins was also seen with the Olink Proximity Extension Assay and mass spectrometry. The effects were skewed in the same direction in cerebrospinal fluid; however, the skew was considerably less severe.

SomaLogic acknowledges that ANML requires that samples have approximately the same total protein content¹⁶, but it is unclear whether differences in total protein content are causing this bias. SomaScan does not measure protein content directly and only measures a subset of proteins, meaning a sample’s median signal is not necessarily a good proxy for total protein content.

SomaLogic’s normalization procedure has unwarranted complexity. The plate scaling step was seen to be redundant and had no effect on data when the calibration step was used afterwards. The intraplate normalization could also be removed with only a minor change in results.

The reverse-engineered pipeline including ANML produced results similar to those of SomaLogic’s native pipeline. This suggests that the choice of reference population is of minor importance. We also saw that the ANML algorithm gave results similar to those of standard median fold change normalization.

We also showed that SomaScan data has a heavy-tailed distribution and that the extreme values are due to technical variability, since they also appear in calibrator, buffer, and QC samples, which have no biological variation. This violates the normality assumption of linear regression, making it important to use robust methods. We demonstrated that using robust methods like M-estimation has a large impact on results.

Our analysis of spuriomers shows that non-specific binding is strong enough to introduce significant results. We see false positives among spuriomers for both V-HIPC and V-HIPCA, meaning that current methods of normalization do not correct for non-specific binding. Assuming non-specific binding behaves similarly in protein-targeting aptamers, this suggests that scepticism is warranted towards results with fold change estimates similar to those of the spuriomers, as they are potentially due to non-specific binding.

We conclude that SomaLogic's normalization procedure, excluding ANML, effectively reduces technical variability without any negative consequences being observed in this study, despite some redundancy. In contrast, we find that ANML introduces a bias that significantly affects fold change estimates and p-values. We recommend checking for the bias if ANML is being considered for an analysis. We also see that SomaScan has extreme values, and recommend using methods that can handle them.

Method and materials

SomaScan

SomaScan measures relative protein content by binding each targeted protein with a specific short sequence of DNA called an aptamer (commercially referred to as a SOMAmer). The aptamers are designed to bind to specific proteins with minimal off-target affinity. Unbound aptamers are then washed away, and protein abundance (proxied by bound aptamers) is measured using microarrays.

SomaScan samples are processed on 96-well plates, where 85 wells are dedicated to samples and 5 are dedicated to replicated calibrators, in which samples from multiple subjects have been pooled. The calibrators are from the same pool on all plates and are used to correct for variation between plates. Furthermore, 3 wells are blank buffers where no protein has been added, and 3 wells come from a different pool of subjects and are used for quality control. The assay’s aptamers are split into 3 groups, which are applied to the sample at the dilution levels 0.005%, 0.05%, and 20%.

Normalization procedure

SomaScan uses 5 normalization steps, which are different variations of median fold change normalization. This is a type of probabilistic quotient normalization⁵,¹⁷ in which scale factors are derived and then multiplied by the data to obtain the normalized values.

For a collection of values, a scale factor, \(sf_i\), is derived by taking the median ratio between references, \(rf_j\), and measurements, \(x_j\):

\(sf_i = \operatorname{median}_{j \in J(i)} \left( \frac{rf_j}{x_j} \right)\)

(Eq. 1)

The data is then normalized by taking the product of each value and its scale factor. SomaLogic uses fixed in-house references for each of its steps.
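As a worked illustration of Eq. 1, the sketch below derives one scale factor for a toy sample and applies it. The reference values and measurements are invented, and the function names `scale_factor` and `normalize` are hypothetical; this is not the AptamerTools implementation.

```python
import statistics


def scale_factor(references, measurements, index_set):
    """Eq. 1: median over j in J(i) of reference_j / measurement_j."""
    return statistics.median(references[j] / measurements[j] for j in index_set)


def normalize(measurements, sf):
    """Multiply every measurement of the sample by its scale factor."""
    return [sf * x for x in measurements]


references = [100.0, 200.0, 400.0]
sample = [50.0, 100.0, 800.0]  # two analytes at half the reference, one elevated

sf = scale_factor(references, sample, index_set=range(len(sample)))
normalized = normalize(sample, sf)
```

The ratios are 2, 2 and 0.5, so the median ratio is 2. The first two analytes are brought onto their references, while the third, which differs from the rest of the sample, stays elevated relative to its reference. The steps of the pipeline differ mainly in which values enter the index set \(J(i)\) and in whether the scale factor belongs to a sample, a plate or an aptamer.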

Hybridization control elutions are DNA sequences that are added to each sample before measuring fluorescence with microarrays. They are used in the hybridization normalization step to reduce technical variability from the microarray process of measuring DNA content. In this case a scale factor is derived for each sample, where \(J(i)\) is the set of hybridization control elution measurements for sample \(i\). The references, \(rf_j\), are SomaLogic’s internal references based on historical measurements.

The intraplate normalization step does not directly affect measurements of samples. It is only applied to the calibrators, which prepares them for use in the following plate scaling and calibration steps. The step reduces the variability between the calibrators on a given plate.

The plate scaling and calibration steps help reduce technical variability between plates of samples. This includes variability from any source that could introduce a batch effect, including the process of binding aptamers to proteins and measuring the concentration of DNA using microarrays. The plate scaling uses one scale factor for all the aptamers of a plate, whereas the calibration step uses a scale factor for each individual aptamer. This allows the calibration step to remove aptamer specific batch effects.
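The sketch below illustrates why a single plate-wide scale factor is absorbed by a subsequent per-aptamer calibration. It assumes that both steps take their scale factors from the median of the calibrators on the plate relative to a reference, which is a simplification of the description above. The function names and toy values are hypothetical.

```python
import statistics


def calibrator_medians(calibrators):
    """Median of the calibrator wells for each aptamer on one plate."""
    return [statistics.median(column) for column in zip(*calibrators)]


def scale_wells(wells, sf):
    """Multiply every value in every well by one scale factor."""
    return [[sf * x for x in well] for well in wells]


def plate_scale(samples, calibrators, references):
    """One scale factor for the whole plate, applied to samples and calibrators."""
    medians = calibrator_medians(calibrators)
    sf = statistics.median(r / m for r, m in zip(references, medians))
    return scale_wells(samples, sf), scale_wells(calibrators, sf)


def calibrate(samples, calibrators, references):
    """One scale factor per aptamer, applied to the samples."""
    medians = calibrator_medians(calibrators)
    factors = [r / m for r, m in zip(references, medians)]
    return [[f * x for f, x in zip(factors, well)] for well in samples]


references = [100.0, 200.0, 400.0]
calibrators = [[80.0, 150.0, 500.0], [90.0, 160.0, 480.0], [85.0, 170.0, 520.0]]
samples = [[70.0, 210.0, 450.0], [95.0, 180.0, 390.0]]

scaled_samples, scaled_calibrators = plate_scale(samples, calibrators, references)
with_plate_scaling = calibrate(scaled_samples, scaled_calibrators, references)
without_plate_scaling = calibrate(samples, calibrators, references)
```

Under these assumptions the plate-wide factor multiplies both the samples and the calibrator medians, so it cancels in the per-aptamer calibration and the two results agree up to floating point error.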

ANML aims to reduce sample-to-sample technical variability. Whereas the previous steps are based on controls, which ensures that they do not remove biological signal, ANML requires its scale factors to be independent of the treatment effect to ensure this. ANML is an iterative variation of median fold change normalization. ANML follows Eq. 1, where for a sample \(i\), \(J(i)\) includes the aptamers of \(i\) that are within 2 standard deviations of the median. The median and standard deviation are derived from a reference cohort. This step is applied iteratively, where each iteration potentially includes more aptamers in \(J(i)\) for calculating the next scale factor. The step is repeated 100 times. ANML is applied to each dilution level separately.

We also tested a simpler alternative to ANML, which we call median normalization. Here a scale factor is derived for each sample, but \(J(i)\) contains all aptamers measured for that sample.
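The iterative selection of \(J(i)\) can be sketched as follows for one sample and one dilution level. This is an illustration of the description above and not SomaLogic’s implementation or the AptamerTools code. It assumes that the reference median and standard deviation are on the log10 scale, that the loop may stop early once the selected set no longer changes, and that a sample with no aptamer inside its reference range keeps a scale factor of 1. The function name `anml_scale_factor` is hypothetical.

```python
import math
import statistics


def anml_scale_factor(x, ref_median, ref_sd, n_iter=100):
    """Iterative median-ratio scale factor for one sample.

    x          -- measurements of the sample (raw scale)
    ref_median -- reference median per aptamer (log10 scale, assumed)
    ref_sd     -- reference standard deviation per aptamer (log10 scale, assumed)
    """
    log_x = [math.log10(value) for value in x]
    sf = 1.0
    previous = None
    for _ in range(n_iter):
        shift = math.log10(sf)
        # J(i): aptamers whose scaled value lies within 2 SD of the reference median.
        inside = [
            j for j in range(len(x))
            if abs(log_x[j] + shift - ref_median[j]) <= 2 * ref_sd[j]
        ]
        if not inside or inside == previous:
            break
        # Eq. 1 restricted to J(i), with the reference taken back to the raw scale.
        sf = statistics.median(10 ** ref_median[j] / x[j] for j in inside)
        previous = inside
    return sf


# Toy sample: every aptamer is 1.25 times below its reference, except the last
# one, which is 100 times higher than the others would suggest.
ref_median = [2.0, 2.5, 3.0, 3.5, 4.0]
ref_sd = [0.1] * 5
x = [10 ** m / 1.25 for m in ref_median]
x[4] *= 100
sf = anml_scale_factor(x, ref_median, ref_sd)
```

In this toy case the strongly deviating aptamer never enters \(J(i)\), so the scale factor is determined by the remaining four aptamers.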

The R code used to reverse engineer the normalization steps was bundled into the R package AptamerTools. The code is available at https://github.com/mcbg/AptamerTools.

STEP 1 and STEP 2

STEP 1 (NCT03548935) and STEP 2 (NCT03552757) were double-blinded randomised phase 3a clinical trials investigating the effect of once-weekly (OW) semaglutide on weight loss. Study designs and results have been reported earlier¹⁸,¹⁹, and here we give a short description. The proteomic analysis was ethically approved by the Ethics Committee for the Region of Southern Denmark (no. H-21046833), and the clinical trials were performed in accordance with regulations and ethical requirements. Subjects participating in both trials signed informed consent forms.

STEP 1 had 1672 subjects assigned to semaglutide 2.4 mg or placebo in a 2:1 ratio. To be included, subjects either had a BMI above 30, or a BMI above 27 with a listed comorbidity. Type 2 diabetes was an exclusion criterion. Treatment lasted for 68 weeks, including 16 weeks of dose escalation.

STEP 2 enrolled 1210 subjects with type 2 diabetes and a BMI above 27. Subjects were assigned to either semaglutide 2.4 mg, semaglutide 1.0 mg, or placebo in a 1:1:1 ratio and treated for 68 weeks, including 16 weeks of dose escalation.

STEP 1 showed a 12.4% larger weight loss in patients taking semaglutide 2.4 mg OW compared to placebo. For STEP 2 the equivalent result was 6.2%, and 2.7% for semaglutide 1.0 mg OW.

In both trials SomaScan was used on serum samples taken at baseline (pre-treatment) and at end-of-treatment (week 68). For STEP 1, 1310 subjects had data at both visits; for STEP 2, 645 subjects did.

Simulation study

The simulation consists of 20 plates, each of which has 85 samples and 5 calibrators. Each plate had 46 samples in group A and 44 samples in group B. Each sample had 7596 aptamers, including the 12 hybridization control elutions. The data was simulated on the log2 scale. Variation was added on the plate level with a standard deviation of 0.32. On the sample level we added variation to all analytes with a standard deviation of 0.125. We also added variation to all aptamers except the hybridization control elutions with a standard deviation of 0.30. We simulated the mean of each aptamer using a normal distribution with a mean of 10 and a standard deviation of 1.6. The standard deviation of each analyte was simulated using a gamma distribution with a mean of 0.33 and a standard deviation of 0.084. Using these values, variation was added to each combination of sample and analyte. To mimic the outliers seen in the STEP data, a much larger standard deviation of 3.5 was used 1.5% of the time. To simulate treatment effects, 2250 analytes were chosen to have an effect of half of the aptamer’s standard deviation. The rest of the aptamers had no effect. Two versions of the simulation were performed. The first version had balanced effects of treatment, where each effect had a 50% chance of being positive and a 50% chance of being negative. In the second version all effects were negative. The levels of variation were chosen to be roughly similar to those of the pooled STEP 1 and STEP 2 data.
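A data-generating process of this kind could be sketched as below. Several details are assumptions made only for illustration: the plate-level and sample-level variations are modelled as single additive shifts per plate and per well, the first `n_controls` aptamers play the role of the hybridization control elutions, group membership simply alternates between wells, the effect is applied to group B, and calibrators are generated like samples but without a treatment effect. The function name `simulate` and its parameters are hypothetical.

```python
import random


def simulate(n_plates=20, n_samples=85, n_calibrators=5, n_aptamers=7596,
             n_controls=12, n_effects=2250, balanced=True, seed=1):
    """Simulate log2 measurements; returns (rows, effect) where each row is
    (plate, group, values) and effect holds the treatment effect per aptamer."""
    rng = random.Random(seed)
    # Gamma parameters chosen to give mean 0.33 and standard deviation 0.084.
    shape = (0.33 / 0.084) ** 2
    scale = 0.084 ** 2 / 0.33
    mean = [rng.gauss(10, 1.6) for _ in range(n_aptamers)]
    sd = [rng.gammavariate(shape, scale) for _ in range(n_aptamers)]

    # Treatment effects of half a standard deviation in a subset of aptamers.
    effect = [0.0] * n_aptamers
    for j in rng.sample(range(n_controls, n_aptamers), n_effects):
        sign = rng.choice([-1, 1]) if balanced else -1
        effect[j] = sign * 0.5 * sd[j]

    rows = []
    for plate in range(n_plates):
        plate_shift = rng.gauss(0, 0.32)
        for well in range(n_samples + n_calibrators):
            if well >= n_samples:
                group = "calibrator"
            else:
                group = "A" if well % 2 == 0 else "B"
            all_shift = rng.gauss(0, 0.125)     # affects every analyte
            aptamer_shift = rng.gauss(0, 0.30)  # affects all but the controls
            values = []
            for j in range(n_aptamers):
                # Occasionally draw from a much wider distribution (outliers).
                noise_sd = 3.5 if rng.random() < 0.015 else sd[j]
                value = mean[j] + plate_shift + all_shift + rng.gauss(0, noise_sd)
                if j >= n_controls:
                    value += aptamer_shift
                if group == "B":
                    value += effect[j]
                values.append(value)
            rows.append((plate, group, values))
    return rows, effect
```

For a quick check, the function can be called with small sizes, for example `simulate(n_plates=2, n_samples=6, n_calibrators=2, n_aptamers=40, n_controls=4, n_effects=10, balanced=False)`, which gives the unbalanced scenario with only negative effects.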

Reverse engineering

SomaLogic uses proprietary references when normalizing. This means that it is not possible to completely reverse engineer their pipeline. The reverse-engineered implementation used references from the pooled STEP 1 and STEP 2 trials, taking the median value for the relevant sample type. The reverse-engineered normalization pipeline was implemented in R. The ANML procedure was implemented using the Rcpp package²⁰ to improve performance.

Modelling

Robust regression using M-estimation was done in R using the RobStatTM package²¹. M-estimation was done with the bisquare function with an efficiency of 0.85. We estimated differential expression of treatment by fitting the log₂-transformed values with treatment, gender, age, race, and log₂-transformed baseline values as covariates. For STEP 2 the two treatment arms were pooled into a single treatment group. We considered a result significant at a 5% false discovery rate using Benjamini-Yekutieli¹⁰. STEP 1 and STEP 2 were considered separately, meaning we only correct for testing multiple aptamers.
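The idea behind bisquare M-estimation can be illustrated with a small iteratively reweighted least squares sketch for a model with an intercept and a single covariate. It is not the RobStatTM algorithm, which uses its own starting values and scale estimate and handles several covariates. The tuning constant 3.44 is assumed here as an approximate value for 85% efficiency under normal errors, the residual scale is re-estimated in each iteration from the median absolute residual, and the function names and toy data are hypothetical.

```python
import statistics


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bisquare_fit(x, y, c=3.44, n_iter=50, tol=1e-10):
    """M-estimate of (a, b) with Tukey's bisquare weights via IRLS."""
    a, b = weighted_fit(x, y, [1.0] * len(x))  # ordinary least squares start
    for _ in range(n_iter):
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = 1.4826 * statistics.median(abs(r) for r in residuals)
        if scale == 0:
            break
        # Bisquare weights: smooth down-weighting, zero beyond c * scale.
        weights = [
            (1 - (r / (c * scale)) ** 2) ** 2 if abs(r) < c * scale else 0.0
            for r in residuals
        ]
        new_a, new_b = weighted_fit(x, y, weights)
        converged = abs(new_a - a) + abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b


# Toy data: placebo (x = 0) around 10, treatment (x = 1) around 10.5,
# with one extreme value in the treatment group.
noise = [-0.2, -0.15, -0.1, -0.05, 0.0, 0.0, 0.05, 0.1, 0.15, 0.2]
x = [0] * 10 + [1] * 10
y = [10 + e for e in noise] + [10.5 + e for e in noise]
y[-1] = 16.5
ols_intercept, ols_slope = weighted_fit(x, y, [1.0] * len(x))
robust_intercept, robust_slope = bisquare_fit(x, y)
```

In this toy example the single extreme value pulls the least squares slope well above the underlying difference of 0.5, whereas the bisquare fit gives it zero weight and stays close to 0.5.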

Computation

All analysis and data processing were done using R version 4.2.1.

Declarations

Competing interests

The authors declare no competing interests.

Author Contribution

Michael C. B. Galanakis, Milan Geybels and Dirk Valkenborg conceived the study design. Michael C. B. Galanakis wrote the manuscript, performed the statistical analyses, and did the programming. All authors reviewed and approved the manuscript.

Acknowledgement

M.C.B.G., M.G., and D.V. received a grant for this research from the Danish Innovation Fund (204000005B). We also thank the patients and investigators of STEP 1 and STEP 2.

Data Availability

It is possible to request access to a de-identified and anonymized version of data using the access request proposals found at https://www.novonordisk-trials.com/.

References

1. Correa Rojo, A. et al. Towards Building a Quantitative Proteomics Toolbox in Precision Medicine: A Mini-Review. Front. Physiol. 12, 723510. https://doi.org/10.3389/fphys.2021.723510 (2021).
2. Gold, L. et al. Aptamer-based multiplexed proteomic technology for biomarker discovery. Nat. Precedings, 1–1 (2010).
3. Assarsson, E. et al. Homogenous 96-plex PEA immunoassay exhibiting high sensitivity, specificity, and excellent scalability. PLoS One 9, e95192. https://doi.org/10.1371/journal.pone.0095192 (2014).
4. Zhang, F., Ge, W., Ruan, G., Cai, X. & Guo, T. Data-Independent Acquisition Mass Spectrometry-Based Proteomics and Software Tools: A Glimpse in 2020. Proteomics 20, 1900276. https://doi.org/10.1002/pmic.201900276 (2020).
5. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic Quotient Normalization as Robust Method to Account for Dilution of Complex Biological Mixtures. Application in 1H NMR Metabonomics. Anal. Chem. 78, 4281–4290. https://doi.org/10.1021/ac051632c (2006).
6. Abrams, Z. B., Johnson, T. S., Huang, K., Payne, P. R. O. & Coombes, K. A protocol to evaluate RNA sequencing normalization methods. BMC Bioinform. 20, 679. https://doi.org/10.1186/s12859-019-3247-x (2019).
7. Tokareva, A. O. et al. Normalization methods for reducing interbatch effect without quality control samples in liquid chromatography-mass spectrometry-based studies. Anal. Bioanal. Chem. 413, 3479–3486. https://doi.org/10.1007/s00216-021-03294-8 (2021).
8. Reinhold, D., Pielke-Lombardo, H., Jacobson, S., Ghosh, D. & Kechris, K. Pre-analytic Considerations for Mass Spectrometry-Based Untargeted Metabolomics Data. Methods Mol. Biol. 1978, 323–340. https://doi.org/10.1007/978-1-4939-9236-2_20 (2019).
9. Candia, J., Daya, G. N., Tanaka, T., Ferrucci, L. & Walker, K. A. Assessment of variability in the plasma 7k SomaScan proteomics assay. Sci. Rep. 12, 17147. https://doi.org/10.1038/s41598-022-22116-0 (2022).
10. Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).
11. Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721. https://doi.org/10.1038/s41588-021-00978-w (2021).
12. Beasley, T. M., Erickson, S. & Allison, D. B. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav. Genet. 39, 580–595. https://doi.org/10.1007/s10519-009-9281-0 (2009).
13. McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272. https://doi.org/10.1111/biom.13214 (2020).
14. Pietzner, M. et al. Synergistic insights into human health from aptamer- and antibody-based proteomic profiling. Nat. Commun. 12, 6822. https://doi.org/10.1038/s41467-021-27164-0 (2021).
15. Dammer, E. B. et al. Multi-platform proteomic analysis of Alzheimer's disease cerebrospinal fluid and plasma reveals network biomarkers associated with proteostasis and the matrisome. Alzheimers Res. Ther. 14, 174. https://doi.org/10.1186/s13195-022-01113-5 (2022).
16. SomaLogic. SomaScan® v4.0 and v4.1 Data Standardization. (2021).
17. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281–4290. https://doi.org/10.1021/ac051632c (2006).
18. Davies, M. et al. Semaglutide 2·4 mg once a week in adults with overweight or obesity, and type 2 diabetes (STEP 2): a randomised, double-blind, double-dummy, placebo-controlled, phase 3 trial. Lancet 397, 971–984. https://doi.org/10.1016/S0140-6736(21)00213-0 (2021).
19. Wilding, J. P. H. et al. Once-Weekly Semaglutide in Adults with Overweight or Obesity. N. Engl. J. Med. 384, 989–1002. https://doi.org/10.1056/NEJMoa2032183 (2021).
20. Eddelbuettel, D. & François, R. Rcpp: Seamless R and C++ integration. J. Stat. Softw. 40, 1–18 (2011).
21. Maronna, R. A., Martin, R. D., Yohai, V. J. & Salibián-Barrera, M. Robust statistics: theory and methods (with R) (Wiley, 2019).


supplementarymaterial05aug2024.pdf


Editor assigned by journal
12 Nov, 2024
Editor invited by journal
06 Sep, 2024
Submission checks completed at journal
03 Sep, 2024
First submitted to journal
05 Aug, 2024


Status:

Version 1
