Principal component analysis
Principal component analysis (PCA) is a statistical method that converts a set of observed, possibly correlated variables into linearly uncorrelated variables (i.e., principal components) through an orthogonal transformation. PCA can reveal the internal structure of data and reduce a multivariate data set to low-dimensional data represented by a small number of principal components (PCs). SIMCA software (V15.0.2, Sartorius Stedim Data Analytics AB, Umea, Sweden) was used to log-transform (LOG) and mean-center (CTR) the data, followed by automatic modeling analysis[10]. PCA of the lipidomics data clearly distinguished pancreatic cancer tissues from adjacent tissues (Figure 1). The two groups were distributed in different regions of the score plot, indicating that the lipid metabolism of each group had its own characteristics. The abscissa PC[1] and the ordinate PC[2] represented the scores of the first and second principal components, respectively, and the color and shape of the scatter points represented the experimental group of each sample. All samples were within the 95% confidence interval (Hotelling's T-squared ellipse).
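As an illustration of the preprocessing and projection described here, the following is a minimal sketch and not the SIMCA implementation. It assumes NumPy is available, that the log conversion is base 10, and that rows are samples and columns are lipids; the function name `pca_scores` and the intensity values are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Log-transform, mean-center, and project samples onto the leading PCs."""
    X = np.log10(np.asarray(X, dtype=float))  # LOG conversion (base 10 is an assumption)
    Xc = X - X.mean(axis=0)                   # CTR: subtract each variable's mean
    # SVD of the centered matrix; rows of Vt are the orthogonal loading vectors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # sample scores on PC[1], PC[2]
    explained = (s ** 2) / np.sum(s ** 2)            # fraction of variance per PC
    return scores, explained[:n_components]

# Hypothetical intensities: 4 samples x 3 lipids
X = [[10.0, 200.0, 30.0],
     [12.0, 210.0, 28.0],
     [100.0, 20.0, 31.0],
     [110.0, 25.0, 29.0]]
scores, explained = pca_scores(X)
```

Plotting the first column of `scores` against the second would give a score plot of the kind shown in Figure 1; the score columns are mutually uncorrelated by construction.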
Orthogonal projections to latent structures - discriminant analysis
In high-dimensional data, the variables include not only differential variables related to the categorical variable but also a large number of non-differential variables that may be correlated with one another. Because of these correlated variables, the differential variables are spread across more principal components, which hinders visualization and subsequent analysis. Orthogonal projections to latent structures - discriminant analysis (OPLS-DA) filters out the variation in the metabolites that is orthogonal (unrelated) to the categorical variable and analyzes the non-orthogonal and orthogonal variation separately, yielding more reliable information about the intergroup differences in metabolites and their correlation with the experimental groups. SIMCA software (V15.0.2, Sartorius Stedim Data Analytics AB, Umea, Sweden) was used to log-transform (LOG) and unit-variance (UV) scale the data. First, OPLS-DA modeling was performed on the first principal component, and 7-fold cross-validation was used to assess the quality of the model. The validity of the model was then evaluated by R2Y (how well the model explains the categorical variable Y) and Q2 (the predictive ability of the model) obtained from cross-validation. Finally, a permutation test, in which the order of the categorical variable Y was randomly changed several times to obtain different random Q2 values, was used to further test the validity of the model. To investigate the statistical differences in lipid metabolites between the pancreatic cancer group and the para-cancer group, the multivariate OPLS-DA model was used. Samples from the pancreatic cancer group and the para-cancer group were distributed in two opposite regions of the OPLS-DA score plot (Figure 2A).
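The two quality metrics can be illustrated with a short sketch. It assumes the commonly used definitions (one minus the ratio of a residual sum of squares to the total sum of squares about the mean), with R2Y computed from fitted values and Q2 from held-out cross-validation predictions; the exact computation and fold assignment in SIMCA may differ. The helper names and the predicted values are hypothetical, and no OPLS-DA model is fitted here.

```python
def kfold_indices(n_samples, k=7):
    """Assign sample indices to k folds in round-robin order (an assumed scheme)."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares about the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Class membership coded as 0 (para-cancer) and 1 (cancer); hypothetical predictions
y = [0, 0, 1, 1]
y_fitted = [0.1, 0.0, 0.9, 1.0]   # predictions from a model fitted on all samples
y_cv = [0.2, 0.1, 0.7, 0.8]       # predictions made while each sample was held out

r2y = r2(y, y_fitted)  # goodness of fit
q2 = r2(y, y_cv)       # goodness of prediction
folds = kfold_indices(14, k=7)
```

Because held-out predictions are usually less accurate than fitted values, Q2 is normally no larger than R2Y; a large gap between the two is one sign of overfitting.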
In the figure, the abscissa T[1]P represented the predictive component score of the first principal component, the ordinate T[1]O represented the orthogonal component score, and the shape and color of the scatter points represented the different experimental groups. The OPLS-DA score plot showed a clear separation between the two groups of samples. All samples were within the 95% confidence interval (Hotelling's T-squared ellipse).
In the permutation test, the order of the categorical variable Y was randomly changed several times, and a corresponding OPLS-DA model was established for each permutation to obtain the R2Y and Q2 values of the random models. This test helps to detect overfitting and to evaluate the statistical significance of the model (Figure 2B). The abscissa represented the permutation retention (the proportion of Y variables consistent with the original model; the points at a permutation retention of 1 were the R2Y and Q2 values of the original model). The ordinate represented the value of R2Y or Q2; the green dots represented the R2Y values obtained by the permutation test, the blue squares represented the Q2 values obtained by the permutation test, and the two dotted lines represented the regression lines of R2Y and Q2, respectively. The R2Y of the original model was very close to 1, indicating that the established model fit the sample data well. The Q2 of the original model was close to 1, indicating that a similar distribution would be obtained if new samples were added to the model. The Q2 values of the random models were all smaller than the Q2 value of the original model, and the intercept of the Q2 regression line with the vertical axis was less than zero. In addition, as the permutation retention decreased, the proportion of permuted Y variables increased and the Q2 of the random models decreased. This indicated that the original model was robust and not overfitted. Overall, the original model explained the difference between the two groups of samples well.
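The two quantities read from the permutation plot, the permutation retention and the intercept of the Q2 regression line, can be sketched as follows. This is an illustration under assumptions: retention is computed as the fraction of unchanged labels, following the description in the text, and the (retention, Q2) pairs are hypothetical values rather than results from this study. The function names are also hypothetical, and no model is refitted here.

```python
import random

def permutation_retention(y_original, y_permuted):
    """Fraction of samples whose class label is unchanged after shuffling."""
    same = sum(a == b for a, b in zip(y_original, y_permuted))
    return same / len(y_original)

def regression_intercept(xs, ys):
    """Ordinary least-squares intercept of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x

# Shuffle the class labels once and measure how much of the original Y is retained
rng = random.Random(0)
y = ["cancer"] * 5 + ["para"] * 5
y_perm = y[:]
rng.shuffle(y_perm)
retention = permutation_retention(y, y_perm)

# Hypothetical (retention, Q2) pairs; the pair at retention 1.0 stands for the original model
retentions = [0.2, 0.4, 0.6, 0.8, 1.0]
q2_values = [-0.3, 0.0, 0.3, 0.6, 0.9]
intercept = regression_intercept(retentions, q2_values)
```

A negative intercept means that a model built on fully permuted labels would be expected to have no predictive ability, which is the pattern the text reports for Figure 2B.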
Univariate analysis
Differential metabolites between the cancer group and the para-cancer group were selected according to the following criteria: a Student's t-test P value less than 0.05, a fold change greater than 2, and a variable importance in the projection (VIP) greater than 1 for the first principal component of the OPLS-DA model.
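The three screening criteria can be combined into one rule, sketched below. The treatment of decreases is an assumption: the text states only a fold change greater than 2, and the sketch assumes the symmetric cutoff (fold change below 1/2) for down-regulated metabolites. The function name and the default arguments are hypothetical.

```python
def classify_metabolite(p_value, fold_change, vip,
                        p_cut=0.05, fc_cut=2.0, vip_cut=1.0):
    """Return 'up', 'down', or 'ns' (not selected) for one metabolite."""
    # Both the t-test and the VIP criterion must be met
    if p_value >= p_cut or vip <= vip_cut:
        return "ns"
    if fold_change > fc_cut:
        return "up"
    if fold_change < 1.0 / fc_cut:  # assumption: symmetric cutoff for decreases
        return "down"
    return "ns"

status = classify_metabolite(p_value=0.01, fold_change=3.0, vip=1.5)
```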
Volcano plot
The results of screening differential metabolites in the cancer group versus the para-cancer group were visualized as a volcano plot (Figure 3). Each point in the volcano plot represented a metabolite, the abscissa represented the fold change (FC) of each substance in the comparison (log2 FC), the ordinate represented the P value of the Student's t-test (-log10 P value), and the size of each scatter point represented the VIP value from the OPLS-DA model: the larger the point, the greater the VIP value. The color of each point represented the final screening result. Significantly up-regulated metabolites were shown in red, significantly down-regulated metabolites in blue, and metabolites without a significant difference in gray.
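The mapping from a metabolite's statistics to its position and color in the volcano plot can be sketched as below. The function name `volcano_point` and the status labels are hypothetical; the screening status is taken as an input rather than recomputed.

```python
import math

# Colors as described for the volcano plot
STATUS_COLORS = {"up": "red", "down": "blue", "ns": "gray"}

def volcano_point(fold_change, p_value, status):
    """Return (x, y, color) for one metabolite."""
    x = math.log2(fold_change)   # abscissa: log2 FC
    y = -math.log10(p_value)     # ordinate: -log10 P value
    return x, y, STATUS_COLORS[status]

x, y, color = volcano_point(fold_change=4.0, p_value=0.01, status="up")
```

On these axes a fold change of 1 sits at x = 0, increases lie to the right and decreases to the left, and smaller P values plot higher.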
Heat map of hierarchical clustering analysis
A Euclidean distance matrix was calculated from the quantitative values of the differential metabolites, the differential metabolites were clustered using the complete linkage method, and the results were displayed as a heat map (Figure 4). The abscissa represented the different experimental groups, the ordinate represented the differential metabolites in this comparison, and the color blocks represented the relative expression levels of the metabolites at the corresponding positions.
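The clustering step can be illustrated with a small sketch of agglomerative clustering using Euclidean distance and complete linkage (sometimes rendered as the full chain or farthest-neighbor method). It is written for clarity rather than speed and is not the software used in the study; the function name and the metabolite profiles are hypothetical.

```python
import math

def complete_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical profiles: one row per metabolite, one column per sample
profiles = [[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [5.0, 6.0]]
merges = complete_linkage(profiles)
```

For real data sets, `scipy.cluster.hierarchy.linkage` with `method="complete"` and `metric="euclidean"` performs the same kind of clustering far more efficiently.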
Radar chart
We calculated the ratio of the quantitative values of the differential metabolites between the two groups and applied a base-2 logarithmic transformation; the result was shown in red in the figure, and the corresponding trend in content was displayed in the radar chart (Figure 5).
Heat map of correlation analysis
For the comparison between the cancer group and the para-cancer group, we calculated the correlation coefficients between the quantitative values of the differential metabolites using Pearson's method and presented them as a heat map (Figure 6). The abscissa and ordinate represented the differential metabolites in this comparison, and the color blocks represented the correlation coefficients between the metabolites at the corresponding positions; red represented a positive correlation and blue a negative correlation. Nonsignificant correlations were marked with a cross.
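A minimal sketch of the pairwise Pearson correlation matrix underlying such a heat map is given below. The lipid names and values are hypothetical, and the significance test that decides where a cross is drawn is not included.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(profiles):
    """Nested dict of pairwise correlations between named metabolite profiles."""
    names = list(profiles)
    return {a: {b: pearson_r(profiles[a], profiles[b]) for b in names}
            for a in names}

# Hypothetical quantitative values across four samples
profiles = {
    "lipid_A": [1.0, 2.0, 3.0, 4.0],
    "lipid_B": [2.0, 4.0, 6.0, 8.0],
    "lipid_C": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(profiles)
```

In this toy example `lipid_A` and `lipid_B` rise together and would appear red, while `lipid_A` and `lipid_C` move in opposite directions and would appear blue.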
Bar plot
The lipid class bar plot visualized the results for the cancer group and the para-cancer group using the degree of change in metabolite content together with the classification information (Figure 7). Each bar represented one class of metabolites. The ordinate represented the relative change percentage of the content of each class in this comparison. A relative change percentage of 0 indicated that the content was the same in both groups, a positive value indicated a higher content in the cancer group, and a negative value indicated a higher content in the para-cancer group. The abscissa represented the lipid classification information.
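The text does not give the formula for the relative change percentage. The sketch below assumes one common definition, the difference in content relative to the para-cancer group, which reproduces the sign convention described (0 for equal content, positive when the cancer group is higher). The function name and the values are hypothetical.

```python
def relative_change_percent(cancer_content, para_content):
    """Percentage change of the cancer group relative to the para-cancer group.

    Assumption: the para-cancer group is the baseline; the source does not
    state the exact formula.
    """
    return (cancer_content - para_content) / para_content * 100.0

higher = relative_change_percent(150.0, 100.0)
equal = relative_change_percent(100.0, 100.0)
lower = relative_change_percent(50.0, 100.0)
```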
Bubble plot
The bubble plot visualized the degree of change in metabolite content, the significance of the difference, and the classification information for the cancer group and the para-cancer group (Figure 8). Each point in the bubble plot represented a metabolite. The size of the point represented the P value of the Student's t-test (-log10 P value): the larger the point, the smaller the P value. Gray points represented non-significant differences (P value not less than 0.05), and colored points represented significant differences (P value less than 0.05), with colors assigned according to lipid class. The abscissa represented the relative change percentage of the content of each substance in the comparison (for substances with very large changes in content, the relative change percentage was marked on the corresponding abscissa scale). A relative change percentage of 0 indicated the same content in the two groups, a positive value indicated a higher content in the cancer group, and a negative value indicated a higher content in the para-cancer group. The ordinate represented the lipid classification information. The black lines at the bottom showed the distribution density of the metabolites (each line represented one metabolite).