Study design and population
This study is nested within the Childhood Acute Illness and Nutrition (CHAIN) Network cohort study that aimed to characterise the biomedical and social risk factors for mortality in acutely ill young children17. Briefly, the CHAIN cohort enrolled children aged 2-23 months during admission to one of nine hospitals in six countries in sub-Saharan Africa and South Asia. Acutely-ill children were enrolled and classified into three nutritional strata based on the mid-upper arm circumference (MUAC); not wasted (MUAC ≥12.5 cm for age ≥6 months or MUAC ≥12.0 cm for age <6 months), moderately wasted (MUAC 11.5 cm to <12.5 cm for age ≥6 months or MUAC 11.0 cm to <12.0 cm for age <6 months), and severely wasted or kwashiorkor (MUAC <11.5 cm for age ≥6 months or MUAC <11.0 cm for age <6 months or bilateral pedal oedema unexplained by other medical causes)17. Children were followed up for 6 months after discharge from hospital with planned visits at days 45, 90 and 180. A total of 3101 children were enrolled between November 20, 2016 and January 31, 2019 of whom 5.1% (n = 157) had HIV infection, defined by polymerase chain reaction (PCR) or antigen testing per national protocols. CHAIN collected detailed clinical, demographic, anthropometric, laboratory and social exposures as well as blood, stool and rectal swabs as described previously17.
The current study is a case-control analysis, focussing on how HIV infection modulates systemic biological mechanisms that influence post-discharge growth among children hospitalised with acute illness. We initially categorised all children with HIV infection (n=157) as cases, while controls consisted of children without HIV, 24% of whom were selected using a random number generator. To focus specifically on post-discharge growth, we excluded children who passed away during their hospital stay (n=182, including 45 children with HIV) and those from Asian study sites where HIV prevalence is very low. Consequently, our analysis included a total of 112 children with HIV infection (cases) and 722 children without HIV (controls), as illustrated in Figure 1A. For the mechanistic analysis investigating influence of HIV on post-discharge growth, only severely wasted children (n=217 of whom 38 children had HIV) at discharge that survived up to 3 months and had proteome data were included.
Plasma proteomics and data pre-processing
Proteins in plasma samples obtained from children at discharge from the hospital were quantified using the aptamer-based SomaScan assay as previously described18,19. The SomaScan assay employs Slow-Off rate Modified Aptamers (SOMAmers) to profile plasma proteome. It quantitatively transforms protein epitopes into SOMAmer-based DNA signals, subsequently measured via DNA-hybridisation microarrays. The resultant text-based ADAT files were imported, transformed and annotated using a free and open-source R package called readat20. Subsequently, the non-human protein targets were filtered retaining only proteins from humans (n=7335 proteins) which were log-transformed, scaled to unit variance by autoscaling and mean-centred.
Data analysis
Baseline characteristics
Participant characteristics including clinical, demographic, nutritional status and anthropometry at hospital discharge were summarised using median with interquartile ranges or mean with standard deviations if continuous and proportions if categorical, stratified by HIV status.
Growth from discharge through 6 months of post-discharge follow up
We compared growth between children with and without HIV by examining mid-upper arm circumference (MUAC), weight-for-age z score (WAZ), weight-for-height z score (WHZ) and height-for-age z score (HAZ) from discharge to 180 days post-discharge. Fixed-effects panel model specification was used with individual effects included in the model as unobserved time-invariant characteristics of the child at a given time point. The model included an interaction term of HIV status with time point and adjusted for sex, recruitment site and baseline anthropometry at discharge for each growth parameter as shown in equation 1.
Yit = αi + Timet + baseline anthropometry + sex + site + Timet:HIV status + uit (1)
where Yit is either MUAC, WAZ, WHZ or HAZ at time point t; ai (i = 1...n) is the individual fixed effect; Timet is the time trend variable t; baseline anthropometry is the growth parameter at discharge while sex and site are time-invariant covariates; Timet:HIV status is the interaction term of time with HIV status at a given time point and uit is the error term. Children without HIV were used as the reference group in the interaction part of the model. This was implemented in R using the plm21 package.
Given that the parent CHAIN study ensured accuracy by double measuring and verifying anthropometric data in hospitalised children, no outlier values were excluded from the analysis.
Construction of the protein correlation network
We used a weighted correlation network analysis, which assumes that linkages within biological networks have a non-random scale-free topology22. In biological settings, a networks topography is considered to have reached near scale-free topography at r2 ≳ 0.8. Using the weighted gene correlation network analysis (WGCNA)23 package in R, we reduced the dimensionality of the plasma proteomics dataset into clusters of tightly correlated proteins called modules. Firstly, to achieve a scale-free topology criterion, we generated sets of soft thresholding powers ranging from 1 to 50 (Supplementary Figure 1A). Optimal soft thresholding power of 9 with r2 = 0.85 and minimum mean connectivity was chosen. Secondly, a biweight mid-correlation, 𝑠𝑖𝑗 = bi𝑐𝑜𝑟(𝑥𝑖, 𝑥𝑗) matrix was generated between every pair of proteins (i and j). This correlation matrix was then transformed into an adjacency matrix, 𝑎𝑖𝑗 =|𝑠𝑖𝑗|𝛽 through power transformation using the soft thresholding power of 9. Through power transformation, weak and negative correlations are punished while strong correlations are amplified, this in turn helps to augment signal-noise ratio in the adjacency matrix thus increasing robustness of the network. Among various potent soft thresholding powers, 9 produced a network with fewer unclassified proteins, making it the optimal choice for analysis. For better biological interpretation, a signed network of proteins was constructed. By applying hierarchical clustering algorithm implemented in the WGCNA23 package, tightly correlated proteins were clustered into modules using the one-step automated blockwiseModules function. The minimum number of proteins forming a module was set to 10. Other parameters used for network construction include: maxBlockSize of 20,000 to ensure that the entire proteome is analysed as a single block and not as small blocks. The parameter for merging the modules was set at 0.25.
Association between HIV status and protein modules at discharge
Each protein member of a module is characterised by an eigenprotein, E(q), which is the first principal component of a given module obtained through singular value decomposition. The module eigenprotein primarily represents the collective behaviour of that specific module. The association between individual protein modules, treated as the outcome variable, and HIV status was implemented using multivariate linear regression with inverse probability weighting (IPW) to balance the baseline characteristics between children with and without HIV24,25. Weight for each observation selected into the nested case-control study was generated by computing the probability of a child being HIV positive given their age, sex, nutritional status, clinical syndromes (malaria, diarrhoea, pneumonia) and site of recruitment. The R script for this can be found from the Harvard Dataverse website26. The association between HIV status and protein modules was executed using lmer function in the lme427 R package as shown in equation 2.
ME(q) = HIV status + (1|site) u; weighted (2.0)
where ME(q) is the eigenprotein for the respective protein module; site is the enrolment site here modelled as a random effect and u is the error term.
Significant associations between protein modules and HIV status were determined at p-value <0.05 after adjusting for multiple comparison for the 39 identified modules using Bonferroni correction.
Biological pathway analysis
To uncover biological processes represented within each differentially expressed protein module we assessed Gene Ontology (GO) enriched biological processes using The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 Bioinformatics Resource28, WEB-based Gene SeT AnaLysis Toolkit (WebGestalt)29 2019 version and the STRING database30 version 12.0. STRING is a public database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms and from interactions aggregated from other (primary) databases. Homo sapiens and list of human proteins (n=7335) from the SomaScan platform served as the background for calculating fold enrichment. Significance of GO-enriched biological processes was determined based on a Bonferroni corrected p-value of <0.05 and fold enrichment score. Protein connectivity within the modules was visualised in Cytoscape31 version 3.10.2. Proteins in the network that occurred as singletons (not connected to each other) were removed from the final network output.
Hub proteins identification in modules associated with HIV status
As databases may not be able to ascertain all biological pathways involved in each module, we further identified hub proteins, which aided in understanding the biological mechanisms underlying the modules. To identify hub proteins, we determined the association between every protein in a module and HIV status. For each module, we further defined the quantitative measure of module membership as the correlation between the module eigenprotein and the protein expression profile. Connectivity values were scaled by dividing the within connectivity with maximum connectivity for that module, Ki = ki/kmax. This was implemented using signedKME and intramodularConnectivity.fromExpr functions for a signed network within the WGCNA R package23. The absolute effect sizes of protein-HIV significance were then plotted against the scaled intra-modular connectivity, revealing hub proteins. Subsequently, the expression profiles of the hub proteins were compared between children with and without HIV, with p-values adjusted for multiple testing for the 27 modules associated with HIV status using Bonferroni correction.
Association between HIV-related protein modules and 90-day post-discharge growth
To examine the link between HIV and post-discharge growth among severely wasted children, we initially identified HIV-related protein modules associated with post-discharge anthropometry. Here, anthropometric measurements including MUAC, WAZ, WHZ and HAZ at 90 days after hospital discharge were used. Post-discharge growth was defined as anthropometric measures at day 90 after accounting for baseline anthropometry. To determine which HIV-related protein modules at discharge were associated with 90-day post-discharge growth, mixed-effects linear regression models were employed adjusting for anthropometry at discharge, age at discharge and sex, with site included as a random effect. This was implemented using lmer function in the lmertest32 R package. Models were built for individual anthropometry, as shown in equation 3.0. We used 90-day post-discharge anthropometric measurements in this analysis because they are closer to the sample collection time and are more likely to be related to the biological factors measured at discharge, compared to the 180-day measurements.
Post-discharge anthropometry (at day 90) = HIV-related ME(q) + anthropometry at discharge + age at discharge + sex + (1|site)+ u (3.0)
where post-discharge anthropometry (MUAC, WAZ, WHZ and HAZ) is the anthropometric measurement at day 90 after hospitalisation; HIV-related ME(q) is the eigenprotein of HIV-related protein modules identified using equation 2.0 above; anthropometry at discharge is the baseline anthropometric indices taken upon hospital discharge; site is the random effect of the model and u is the error term.
Candidate hub proteins of anthropometry-associated modules were determined by the correlation between the absolute effect size of protein-anthropometry significance and intra-modular connectivity.
Pathways linking HIV, proteome and post-discharge anthropometry using structural equation modelling
We examined pathway(s) linking HIV to 90-day post-discharge anthropometry using structural equation modelling (SEM). Simple linear regression models were built to assess the effect of HIV on 90-day post-discharge anthropometry based on our hypothesised base path model (Supplementary Figure 2). The primary exogenous and endogenous variables in the SEMs were HIV status and 90-day post-discharge anthropometry, respectively, adjusting for baseline anthropometry. Other variables included HIV-related protein modules significantly associated with 90-day post-discharge anthropometry, baseline anthropometry at discharge, age at discharge, sex and site of recruitment. SEM models included children with missing 90-day post-discharge anthropometry thus, coefficients were estimated using full information maximum likelihood estimator (FIML), accounting for missing observations33. This was implemented in the lavaan34 package in R using the sem function. Standardised path coefficients were used to assess the relationship between exogenous and endogenous variables. Associations with significance levels <0.05 were considered important.
Model fits for the SEM were evaluated using the Chi-square (χ2) test statistic, with a p-value > 0.05 indicating no significant difference between the hypothesized and the fitted models. This implies that the conceptualized model potentially describes a true relationship under investigation in the general population. Other model fit metrics included the comparative fit index (CFI, where CFI > 0.90 indicates good model fit), root mean square error for approximation (RMSEA, < 0.06 represents good fit, < 0.09 reasonable fit) or standardised root mean squared residual (SRMR, < 0.06 represents good fit)35. Supplementary File S2 summarises model parameter estimates and fit measures. A general analysis workflow of the present study is shown in Supplementary Figure 3.