Label-free quantification (LFQ) is a standard approach used in proteomics mass spectrometry (MS). Due to the similarity of this data type to expression microarray data, analysis methods from that field are commonly used for LFQ-MS. A major difference, however, is the presence of missing values in MS, but not in microarray data.
It is well established that missing values do not occur entirely at random, but more often at low intensities [1, 2, 3, 4]. The fraction of missing values varies by experimental design, but it is not uncommon to have more than 50% missing values, especially in affinity purification experiments. This issue hence cannot simply be ignored but needs proper handling, and doing so is a central challenge in the statistical analysis of LFQ data, e.g., for identifying proteins that are differentially abundant between conditions. In recent years, several methods have been proposed to tackle this challenge, most of which rely on imputation, i.e., they simply replace missing values with some number that is deemed realistic.
However, a fundamental problem with imputation is that it obscures the amount of available information: imputed values will be considered as certain as actually measured values by any downstream processing (identifying differentially abundant proteins, clustering, quality control). This can invalidate inferential conclusions by underestimating statistical uncertainty, or cause a loss of statistical power. Therefore, we propose a probabilistic dropout model that explicitly describes the available information about the missing values.
Figure 1A demonstrates that missingness carries information: observations in proteins with many missing values (red) have a lower intensity than observations in proteins with only one or no missing values (purple). In addition, Figure 1B illustrates that the ratio of these densities forms a curve with sigmoidal shape, clearly showing how the probability of a value being missing depends strongly on overall intensity.
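As a rough illustration of such a dropout curve (not proDA's actual parameterization; the location and scale values below are hypothetical), a sigmoid mapping log2 intensity to missingness probability can be sketched as follows:

```python
# Minimal sketch of a sigmoidal dropout curve mapping log2 intensity to the
# probability that a value is missing. The location `rho` and scale `zeta`
# are hypothetical illustration values, not fitted parameters.
from scipy.stats import norm

def dropout_probability(log2_intensity, rho=18.0, zeta=-1.0):
    """Probability of a missing value as a function of log2 intensity.

    An inverse-probit curve: values well below `rho` are almost always
    missing, values well above it are almost always observed. The negative
    scale `zeta` makes the curve decrease with increasing intensity.
    """
    return norm.cdf((log2_intensity - rho) / zeta)

print(dropout_probability(15.0))  # low intensity  -> close to 1
print(dropout_probability(22.0))  # high intensity -> close to 0
```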
If sample size is limited, substantial gains in statistical power can be achieved by using shrinkage estimation procedures for variance estimation ("variance moderation") [5]. This approach is widely used in transcriptomics data analysis, e.g., by the limma package [6]. The use of limma or similar approaches for LFQ-MS has been advocated only rather recently (e.g., [7]). For example, the DEP package [8] performs imputation followed by a limma analysis to infer differentially abundant proteins. As stated above, the use of imputation may compromise the validity of limma's statistical inference, and hence, the purpose of the present work is to adapt limma-style inference to account for values missing not at random and so improve the power and reliability of differential abundance analysis for LFQ-MS.
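For concreteness, the moderated variance used in limma-style tests is an empirical Bayes compromise between the protein-specific sample variance and a prior variance estimated from all proteins (shown here only to fix ideas, in the notation common in the limma literature):

$$ \tilde{s}_g^2 = \frac{d_0\, s_0^2 + d_g\, s_g^2}{d_0 + d_g}, $$

where $s_g^2$ is the sample variance of protein $g$ with $d_g$ residual degrees of freedom, and $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated across all proteins. The moderated t-statistic then uses $\tilde{s}_g^2$ in place of $s_g^2$, which stabilizes inference when $d_g$ is small.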
A typical analysis of a label-free tandem mass spectrometry experiment consists of a number of steps. First, peaks in the MS1 spectra need to be identified using the corresponding MS2 spectra. Second, the MS1 peaks need to be quantified. In the literature, two approaches are popular for this task: spectral counting and peak area integration [9]. Abundant peptides are more often recorded by the MS2, so the number of MS2 spectra associated with a peptide can be used as a proxy for its abundance [10]. Alternatively, more abundant proteins cause larger peaks in the MS1, so a second approach is to integrate the peak area of a peptide [11, 12]. Subsequent comparisons of the methods concluded that peak area based methods perform better than spectral counting [13, 14]. Consequently, we will focus only on methods that handle continuous intensities.
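To make the distinction concrete, the toy sketch below contrasts the two quantification proxies for a single hypothetical peptide; the chromatogram values and the MS2 count are invented for illustration and do not correspond to any specific tool:

```python
# Toy contrast of the two MS1 quantification proxies described above.
# `rt` and `signal` form a hypothetical extracted-ion chromatogram for one
# peptide; `spectral_count` is the number of MS2 spectra matched to it.
import numpy as np

rt = np.array([30.0, 30.1, 30.2, 30.3, 30.4])        # retention time (min)
signal = np.array([0.0, 2.0e6, 5.0e6, 2.5e6, 0.0])   # MS1 ion current

# proxy 1: spectral counting (number of matched MS2 spectra)
spectral_count = 3

# proxy 2: peak area, here a simple trapezoidal integration over retention time
peak_area = float(np.sum(0.5 * (signal[1:] + signal[:-1]) * np.diff(rt)))

print(spectral_count, peak_area)
```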
The third important step is the aggregation of the peptide level information to protein level information. Traditionally, the peptide intensities are aggregated to protein intensities, and then, in a separate step, the differential abundance is calculated for each protein. One popular method, which is directly integrated into the MaxQuant platform [15], is MaxLFQ, which uses delayed normalization based on peptide ratios [16]. Alternative methods include summing up the peptide intensities, taking the average of the top three peptides [17], selecting a reference peptide to calculate the protein intensity [18], averaging the ratios [19], or using relative abundances while taking shared peptides into account [20]. More recently, methods have been published that combine both steps directly in order to gain more power. The result of all these steps is a table with intensities for each protein and sample. The values in this table are commonly on a log2 scale in order to account for the mean-variance relationship of the raw data (Supplementary Figure S1).
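As a minimal illustration of two of the simpler aggregation rules mentioned above (summation and "top three" averaging), together with the subsequent log2 transformation, the sketch below uses hypothetical peptide intensities; real pipelines such as MaxLFQ are considerably more involved.

```python
# Illustrative sketch of two peptide-to-protein aggregation rules, followed by
# the log2 transformation. The peptide intensities are hypothetical.
import numpy as np

# raw intensities of the peptides of one protein in one sample
peptide_intensities = np.array([4.1e6, 2.3e6, 9.0e5, 1.2e5, 3.0e4])

protein_by_sum = peptide_intensities.sum()                # sum of all peptides
protein_by_top3 = np.sort(peptide_intensities)[-3:].mean()  # mean of top three

# intensities are usually analyzed on a log2 scale to stabilize the
# mean-variance relationship (cf. Supplementary Figure S1)
print(np.log2(protein_by_sum), np.log2(protein_by_top3))
```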
Several methods have been published in recent years that use these protein intensities to calculate differential abundance. Perseus [21] is a platform with a graphical user interface, developed by the same group as MaxQuant, which provides functionality to normalize the data, impute missing values, identify significant proteins using a t-test, and visualize the results. For multiple testing correction, Perseus offers two options: either the Benjamini-Hochberg procedure [22] or significance analysis of microarrays (SAM), a permutation-based correction originally described in Ref. [23]. As already mentioned, DEP [8] is an R package that provides a similar set of functionalities, but uses the more powerful variance-moderated t-test of the R package limma [6, 24] to identify significant proteins. For multiple testing correction, DEP by default uses the methods in the fdrtool package [25]. In order to handle missing values, it provides an interface to a large number of imputation methods from the MSnbase R package [26]. In contrast, Perseus only provides two imputation methods, which either replace the missing values with a small deterministic value (MinDet) or with random values jittered around that small value (MinProb). DAPAR and ProStaR [27] are complementary software tools: DAPAR is an R package similar to DEP, but with additional imputation methods based on the imp4p package [28]; ProStaR internally uses DAPAR and provides a web-based graphical user interface to make the software more approachable for newcomers.
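The sketch below illustrates the idea behind these two imputation strategies with hypothetical data and parameters (not Perseus's actual defaults): a MinDet-style replacement with a small deterministic value, and a MinProb-style replacement with random draws jittered around such a value.

```python
# Sketch of the two imputation ideas described above; quantile, jitter width
# and data are chosen for illustration only.
import numpy as np

rng = np.random.default_rng(0)
log2_intensities = np.array([22.1, np.nan, 20.5, np.nan, 23.0])
missing = np.isnan(log2_intensities)

# MinDet-style: replace missing values with a low quantile of the observed values
min_det_value = np.nanquantile(log2_intensities, 0.01)
imputed_det = np.where(missing, min_det_value, log2_intensities)

# MinProb-style: replace missing values with random draws around that low value
imputed_prob = np.where(
    missing,
    rng.normal(loc=min_det_value, scale=0.3, size=log2_intensities.size),
    log2_intensities,
)
print(imputed_det, imputed_prob, sep="\n")
```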
So far, approaches that work without imputation are limited. One approach is the "empirical Bayesian random censoring threshold" (EBRCT) model, which avoids imputation by integrating over the inferred intensity distribution for missing values [29]. However, it cannot handle the case where a protein is completely missing in one condition. This can be problematic because, in an affinity purification experiment, those proteins might be exactly the ones we care about most. Another tool that follows a similar idea is QPROT [4], a command line tool that uses empirical Bayesian priors and integrates out the position of missing values using a cumulative normal distribution below a hard limit of detection.
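In schematic form, the idea shared by these approaches is a censored likelihood: for a protein with mean $\mu$ and standard deviation $\sigma$, an observed value $y$ contributes the usual normal density, while a missing value contributes the probability mass below a detection limit $L$ (notation chosen here only for illustration):

$$ \Pr(\text{missing}) = \Pr(y < L) = \Phi\!\left(\frac{L - \mu}{\sigma}\right), $$

where $\Phi$ denotes the standard normal cumulative distribution function. No single value is imputed; instead, the whole range of plausible intensities below $L$ enters the likelihood.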
Lastly, Triqler [31] is a tool that works directly on the peptide level instead of the protein level. This has the advantage that it can incorporate the additional uncertainty arising from the aggregation of multiple peptides into one protein intensity. It is a command line tool written in Python that fits an empirical Bayesian model, integrating out the uncertainty of missing peptide quantifications.
Here, we present proDA (inference of protein differential abundance by probabilistic dropout analysis), a novel method for inferring differential abundance that makes full use of the information inherent in missingness. proDA models the random process resulting in missingness and so avoids the problems discussed above that are inherent to imputation-based inference.
In the following section, we explain the intuition behind proDA and how it differs from existing tools. In the third section, we use a spike-in and a semi-synthetic dataset to perform benchmarks. We show that many existing methods have either low statistical power or serious deficiencies in controlling the false discovery rate (FDR), i.e., they report too many false positives, and that hence the need for a general, powerful, and statistically reliable inference method remains unmet. We show that proDA offers strong performance with reliable FDR control, and is thus suitable to fill this crucial gap in existing methodology. In the fourth section, we demonstrate proDA in an application setting, by analyzing a real dataset studying ubiquitination. We close with a conclusion.