Participants
Twenty-five healthy full-term newborns (mean gestational age 39.24 ± 7.82 weeks, recording age 7.27 ± 11.40 days after birth, 15 males/10 females, head circumference 34.08 ± 1.43 cm, birth weight 3020.20 ± 324.11 g, Apgar score at five minutes 9.79 ± 0.66) were recruited at the maternity ward of the Hospital Clínic Barcelona (Spain). Of this initial sample, 2 newborns were excluded because movement artifacts left insufficient analyzable data. The remaining 23 newborns were included in the final analysis. A single session was recorded per subject.
Infants had been assessed by board-certified neonatologists and diagnosed as healthy term newborns with no major congenital abnormalities or illness since birth. Newborns under medication and/or with congenital malformations, chromosomal abnormalities, hypoxic-ischemic encephalopathy, intraventricular hemorrhage greater than grade 2 or any other type of brain damage, congenital heart disease, or siblings with autism spectrum disorder or another neurodevelopmental disorder were excluded from this study.
Ethical Considerations
The study was conducted following institutional research ethics guidelines and the Declaration of Helsinki. Formal ethical approval was granted by the Local Ethical Committee, Hospital Clínic Barcelona (Ref: NeuroCry/HCB/2021/0843). The consent form documented the study's aims, nature, and data acquisition procedures. Anonymization and data confidentiality were maintained throughout the study. All parents agreed and signed the informed consent form prior to participation. In addition, signed informed consent was obtained from the family to publish the newborn's face in this manuscript as an open access publication.
Procedure
Data collection was performed during the standard routine of newborn nursing (before and after feeding, during some medical procedures, etc.); as such, one session was conducted with each neonate. Synchronized EEG, NIRS, audio, and video recordings were acquired for each newborn, who lay comfortably in a cot in the hospital maternity ward. Continuous single sessions lasting from 20 to 120 minutes were recorded in a paradigm where the newborn could be calm-awake or crying. Within this paradigm, different distress levels were defined as changes in the newborn's status generated by uncomfortable scenarios (i.e., fussiness, stress, pain, etc.), yielding the following conditions: resting, cry, and distress.
To ensure proper cross-referencing among the different data sources, all devices were synchronized via timestamps before each session. In addition, markers were introduced into every signal type. Figure 1 shows the experimental design and overall analysis pipeline.
After extracting features from the neurophysiological and audio signals and analyzing COMFORT scale scores from facial expressions and body movements, we conducted statistical analyses relating the extracted features to the different distress levels defined from the cry sequences.
Audio Analysis Pipeline
Data acquisition. Newborn crying emissions were recorded with a portable high-quality field recorder (ZOOM H1N™) equipped with a unidirectional microphone positioned at a fixed distance (30 cm) from the infant's mouth, and were stored on a multimedia laptop as two-channel .WAV audio tracks with sampling rate Fs = 48 kHz and 24-bit resolution. Cries were never induced for the purpose of the study, as spontaneous vocalizations are part of normal infant behavior. Several audio recordings were registered during each session in order to include various crying episodes, with a suitable amount of time both before and after each cry episode. During the recordings, environmental noises, including human speech and noises from medical machinery, were also captured. Thus, our dataset resembles real-world samples.
Data processing. Segmentation. All audio recordings were manually segmented into cry episodes (CEs: the periods during which the infant cries in each audio recording, delimited by silence periods). CEs were then manually segmented into cry units (CUs: individual cry patterns within a CE separated by an expiration phase). Visual spectrographic analysis was carried out using iZotope RX 7 Audio Editor™. The classification of CEs and CUs was done manually, considering segments with high spectral content and intensity over time as distress cries and those with lower spectral content as normal cries39 (see Fig. 1a). Both the segmentation and the qualitative assessment of every CE and CU were carefully reviewed by at least two cry-signal experts. Cries without unanimous agreement among the experts were excluded from further analyses. Afterwards, the three different distress levels were acoustically identified in every CE:
- resting: no CEs; pause or resting periods with silent audio recordings, during which the newborn is not crying but in an awake/alert state.
- cry: CEs composed of CUs with lower spectral content and milder acoustic intensity.
- distress: more acoustically intense CEs composed of CUs with high spectral content.
Feature extraction.
Cepstrum Analysis. To verify the objectivity of the qualitative labeling, machine learning algorithms were used as an automatic approach to validate the manual audio segmentation. For that purpose, two different approaches were executed. The first uses traditional machine learning, based on a recent study11 of a similar infant cry classification (pain vs. non-pain) that achieved 90.7% accuracy using Random Forest40, which is why this algorithm was considered our baseline. As input features, the first thirteen Mel Frequency Cepstral Coefficients (MFCCs) of every CU were computed using Librosa, a Python 3 package for audio analysis. MFCCs are widely used in acoustic research and have proven to be a reliable feature for classifying audio signals41.
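For illustration, a minimal Python sketch of this baseline is shown below. It is not the study's exact code: `cu_paths` and `cu_labels` are hypothetical placeholders for the segmented cry-unit files and their manual labels, and the Random Forest hyperparameters are assumptions.

```python
# Minimal sketch, not the study's exact code: 13 MFCCs per cry unit with
# Librosa, classified with a Random Forest baseline.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def mfcc_features(wav_path, sr=48000, n_mfcc=13):
    """Load one cry unit and summarize its first 13 MFCCs over time."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (13, n_frames)
    return mfcc.mean(axis=1)                                # 13-dim vector per CU

# cu_paths / cu_labels: hypothetical lists of segmented CU files and their
# manual labels (0 = cry, 1 = distress).
X = np.array([mfcc_features(p) for p in cu_paths])
y = np.array(cu_labels)

# 80% training / 20% validation, as described in the text.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y)
clf = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_va, y_va))
```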
The second approach computes a spectrogram from each CU, which was used as input to a Deep Learning (DL) algorithm. This DL method was employed to validate the manual binary classification of the cry and distress conditions. In this case, we used a Convolutional Neural Network (CNN)42 consisting of 2-dimensional convolutional layers and dense layers. To avoid overfitting, pooling layers were also used, together with batch normalization layers to optimize training. In both approaches, 80% of the samples were used for training and 20% for validation.
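A small Keras sketch of such a CNN is given below, assuming spectrogram inputs resized to 128 × 128; the exact number and size of layers used in the study are not specified, so the architecture here is only indicative.

```python
# Indicative 2-D CNN for binary cry/distress classification from
# spectrograms; layer sizes are assumptions, not the study's architecture.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),        # one spectrogram per CU
        layers.Conv2D(16, 3, activation="relu"),
        layers.BatchNormalization(),            # stabilizes/optimizes training
        layers.MaxPooling2D(),                  # pooling to limit overfitting
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # cry vs. distress
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_cnn()
# model.fit(train_specs, train_labels, validation_split=0.2)  # 80/20 split
```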
Time Analysis. Within CEs, the actual vocalizations (cryCE) are not continuous but are punctuated by inspirations and spontaneous pause or silence periods (unvoicedCE). The frequency of each within CEs was assessed by quantifying their total duration and their percentage of the full CE. Specifically, the duration of every CU and of the unvoiced window between CUs within every CE was computed for every cry pattern. Hence, the following variables were studied for full cry episodes: total duration (in seconds) and percentage of the unvoiced part between CUs per CE, and duration (in seconds) and percentage of crying per CE and of the unvoiced part between CEs. Episodes labeled with the same condition and separated by less than 5 seconds were not included in the study as separate episodes, since they should be considered together as a unique episode43.
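The following sketch illustrates how these timing variables and the 5-second merging rule could be computed, assuming each CE is represented by its onset/offset times and a list of CU (onset, offset) intervals in seconds; these data structures are illustrative assumptions.

```python
# Sketch of the CE timing variables; each CE is assumed to carry its
# (ce_onset, ce_offset) bounds and a list of CU (onset, offset) intervals.
def ce_timing(cu_intervals, ce_onset, ce_offset):
    ce_dur = ce_offset - ce_onset
    cry_dur = sum(off - on for on, off in cu_intervals)  # voiced (cryCE) part
    unvoiced_dur = ce_dur - cry_dur                      # pauses between CUs
    return {
        "ce_duration_s": ce_dur,
        "cry_duration_s": cry_dur,
        "cry_pct": 100 * cry_dur / ce_dur,
        "unvoiced_duration_s": unvoiced_dur,
        "unvoiced_pct": 100 * unvoiced_dur / ce_dur,
    }

def merge_close_episodes(episodes, min_gap=5.0):
    """Treat same-condition episodes separated by < 5 s as one episode.
    `episodes` is an onset-sorted list of (onset, offset, condition)."""
    merged = [episodes[0]]
    for onset, offset, cond in episodes[1:]:
        last_on, last_off, last_cond = merged[-1]
        if cond == last_cond and onset - last_off < min_gap:
            merged[-1] = (last_on, offset, cond)  # extend previous episode
        else:
            merged.append((onset, offset, cond))
    return merged
```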
Frequency Analysis. Several software tools currently estimate F0 and resonance frequencies, but all of them were initially developed for adult voices. Since the adult and infant vocal tracts differ in shape, these tools should be used with caution5. Digital signal processing and frequency analysis of each CU were conducted with the Praat software44, as it is the most commonly used tool. Default values were changed according to the infant cry literature: a band-pass filter between 200–1200 Hz was selected. Audio recordings were collected with a sampling rate of 48,000 Hz, and the signal was low-pass filtered at 10,000 Hz45. The main frequency features computed include F0 and its descriptive statistics (maximum, minimum, mean, standard deviation), the resonance frequencies of the vocal tract (F1, F2, F3), and the percentages of high pitch (F0 ≥ 800 Hz)46 and hyperphonation (F0 ≥ 1000 Hz)47 in each CU. Other voice quality parameters related to the phonation of the vocalization were also included: local jitter (Jitter: micro-variations of F0 measured as pitch period length deviations), local shimmer (Shimmer: amplitude deviations between pitch periods), and harmonic-to-noise ratio (HNR: quantifies the amount of additive noise in the voice signal)48. These perturbation measures are widely used in clinical settings49.
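These measures could be reproduced in Python through parselmouth, a Python interface to Praat; the sketch below assumes this route rather than the Praat GUI used in the study, with the pitch floor/ceiling set to the 200–1200 Hz range noted above and Praat's standard arguments for the jitter and shimmer calls. The file path is a placeholder.

```python
# Hedged parselmouth sketch of the Praat-based measures for one cry unit.
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("cry_unit.wav")           # one segmented CU (assumed path)
pitch = snd.to_pitch(pitch_floor=200, pitch_ceiling=1200)
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                   # drop unvoiced frames

stats = {"f0_mean": f0.mean(), "f0_sd": f0.std(),
         "f0_min": f0.min(), "f0_max": f0.max(),
         "high_pitch_pct": 100 * np.mean(f0 >= 800),
         "hyperphonation_pct": 100 * np.mean(f0 >= 1000)}

# Resonance frequencies (F1-F3), sampled here at the CU midpoint.
formants = snd.to_formant_burg()
t_mid = snd.duration / 2
stats["F1"], stats["F2"], stats["F3"] = (
    formants.get_value_at_time(i, t_mid) for i in (1, 2, 3))

# Perturbation measures with Praat's standard default arguments.
point_process = call(snd, "To PointProcess (periodic, cc)", 200, 1200)
stats["jitter_local"] = call(point_process, "Get jitter (local)",
                             0, 0, 0.0001, 0.02, 1.3)
stats["shimmer_local"] = call([snd, point_process], "Get shimmer (local)",
                              0, 0, 0.0001, 0.02, 1.3, 1.6)
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 200, 0.1, 1.0)
stats["hnr_db"] = call(harmonicity, "Get mean", 0, 0)
```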
EEG Pipeline
Data acquisition. Neurophysiological data were acquired using an ANT Nëo Monitor eego™ (ANT Neuro, Germany; CE mark MDD 93/42/EEC, CE class IIa, FDA 510(k) in the USA) with 8 EEG and 2 aEEG channels mounted in an elastic cap (waveguard™ original, Germany) with high-quality Ag/AgCl sensors. These non-invasive, gel-based electrodes are fixed to the cap and present a very low profile, which makes the cap comfortable for the newborn (e.g., avoiding excessive rubbing and pressure on the scalp). The electrodes were placed according to the extended 10–20 positioning system (channels F3, F4, C3, C4, T7, T8, P3, P4) and were later re-referenced offline to the average reference. Sensor impedance was kept below 10 kΩ, and EEG data were acquired at a sampling rate of 512 Hz. All recordings were done by research assistants/clinicians with EEG acquisition experience.
Data processing. The dataset was analyzed offline using Matlab R2022a with the Brainstorm Toolbox50. A band-pass filter between 1–45 Hz was applied to the EEG data to remove power line contamination and low-frequency artifacts. EEG data were examined by careful visual inspection to detect ocular, muscle, and jump artifacts, confirmed by an EEG expert (SP). We did not use an automatic algorithm because most available methods (e.g., ICA) are developed for adult brain signals acquired in controlled environments, where artifacts are generally easy to detect and correct. In our case, due to the nature of the data acquisition, newborns were crying, sometimes irritably, during the recording; the movements and artifacts generated in that situation are not easy to detect or correct, rendering automatic methods ineffective (see supplementary material).
After that, bad channels were manually identified and interpolated using spherical splines51. A maximum of 1 channel was interpolated; if more channels were found to be bad, the whole trial was rejected from the analysis. The remaining artifact-free data were segmented into four-second epochs52 according to the audio/distress segmentation criteria described above, yielding the following conditions: resting, cry, and distress.
EEG data analysis was performed for the following classical frequency bands: delta (δ: 1–4 Hz), theta (θ: 4–8 Hz), and alpha (α: 8–12 Hz). Higher frequencies, from the beta to the gamma range, were not included in the analysis to avoid contamination by muscle activity.
Additionally, the power spectrum of each EEG sensor was computed using Welch's periodogram method53, taking the 4-s segments tapered with a Hanning window and 50% overlap. For each sensor, relative power was calculated by normalizing the power at each frequency by the total power over the 1–45 Hz range.
To quantify relative power changes across conditions with respect to the resting state, the total relative power of the analyzed frequency bands was taken as 100%, and the percentage of relative power for each frequency band was calculated for each sensor and all conditions.
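A minimal Python sketch of this computation (the study itself used Matlab/Brainstorm) is shown below, assuming `data` is an (n_channels, n_samples) array of artifact-free EEG for one condition, sampled at 512 Hz.

```python
# Sketch of the Welch relative-power computation for one condition.
import numpy as np
from scipy.signal import welch

fs = 512
bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12)}

# Welch periodogram: 4 s Hanning-tapered windows with 50% overlap.
freqs, psd = welch(data, fs=fs, window="hann",
                   nperseg=4 * fs, noverlap=2 * fs, axis=-1)

# Relative power: normalize by total power over the 1-45 Hz range.
broadband = (freqs >= 1) & (freqs <= 45)
total = psd[:, broadband].sum(axis=-1, keepdims=True)

rel_power = {}
for name, (lo, hi) in bands.items():
    idx = (freqs >= lo) & (freqs < hi)
    rel_power[name] = psd[:, idx].sum(axis=-1) / total[:, 0]  # per channel

# Percentage of each band relative to the summed band power (taken as 100%).
band_sum = sum(rel_power.values())
band_pct = {name: 100 * p / band_sum for name, p in rel_power.items()}
```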
NIRS Pipeline
Data acquisition. NIRS data were collected along with the audio and EEG acquisitions, hence sharing the same conditions and timestamps as the other recorded signals. The Root O3™ (Masimo, USA; CE mark G1 092076 0013 Rev. 00) was the equipment selected for NIRS data acquisition. This device uses NIRS forehead sensors to measure regional hemoglobin oxygen saturation (rSO2), i.e., the central oxygenation level. Functional arterial hemoglobin oxygen saturation (SpO2), i.e., the peripheral oxygenation level, and pulse rate (PR-bpm), i.e., the heart rate signal, were continuously and non-invasively monitored with a fingertip sensor on the newborn.
Data processing. rSO2, SpO2, and PR-bpm data were sampled every 2 seconds and saved by the device. These variables were later exported and analyzed offline in Python 3. NIRS data with a standard deviation lower than 0.5 were not considered in the analysis, to eliminate errors from the data acquisition process. Also, the interquartile range (1.5*IQR) method was used to remove outliers. The remaining non-rejected data were segmented into normal cry, distress, and resting episodes based on the timestamps obtained from the audio segmentation criteria explained in the audio signal processing section. The 15 seconds preceding and following each segment were discarded9. In addition, lower-bound thresholds were applied to the corresponding CE intervals, removing SpO2 values lower than 8054, rSO2 values lower than 5055, or PR-bpm values lower than 7056, to eliminate noise and errors derived from the newborn's movements.
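The sketch below illustrates these cleaning steps with pandas; the DataFrame layout and column names ("rSO2", "SpO2", "PR") are assumptions rather than the study's actual export format.

```python
# Illustrative NIRS cleaning: reject near-constant recordings, remove
# 1.5*IQR outliers, then apply the physiological lower bounds.
import pandas as pd

def clean_nirs(df):
    # Reject near-constant recordings (std < 0.5): likely acquisition errors.
    if df[["rSO2", "SpO2", "PR"]].std().min() < 0.5:
        return None

    # Remove outliers with the 1.5 * IQR rule, per variable.
    for col in ["rSO2", "SpO2", "PR"]:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Lower bounds: drop SpO2 < 80, rSO2 < 50, PR < 70 bpm.
    return df[(df["SpO2"] >= 80) & (df["rSO2"] >= 50) & (df["PR"] >= 70)]
```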
Facial Expression & Body Movement Analysis
Nowadays, neonatologists use common tools to measure distress levels in newborns from a qualitative perspective, especially by assessing crying, facial expressions, and body movements. Among them, the COMFORT scale allows assessing distress states, sedation, and pain in nonverbal pediatric patients, with cry characteristics being part of the assessment57,58. The COMFORT scale has been adapted to Spanish and shown to be a valid and reliable tool (Cronbach's alpha coefficient of 0.785 for newborns) to assess comfort in a group of children admitted to a Spanish Intensive Care Unit59,60. The COMFORT scale was used here to qualitatively evaluate the video recordings of facial expressions and body movements during each session and to identify the levels of distress.
Data acquisition and processing. A high-quality video recording of the newborn was acquired during each session, ensuring registration of facial expressions and body movements following a standardized protocol. Afterwards, two experts (AL, IAP) reviewed the videos and assessed the newborns individually according to the COMFORT scale for each cry episode. In case of disagreement between the experts, a third reviewer (AP) was asked to provide an evaluation. The assessment comprises six sections: alertness, agitation, crying, body movements, muscular tone, and facial tension. Each section is rated from 1 (calm infant) to 6 (stressed infant), and the total distress score of each CE ranges from 6 to 30, with larger scores indicating higher arousal.
Statistical Analysis
Statistical analysis was performed using Matlab R2022a, GraphPad Prism 8, and SPSS 22. We conducted statistical comparisons among all three conditions (resting, cry, and distress) and for each pairwise condition for the audio, EEG, and NIRS signals and the COMFORT scale.
The Shapiro-Wilk test was applied to the feature arrays and confirmed that the data were not normally distributed. In addition, due to the nature of the data collection, which consisted of spontaneous cry recordings during the newborns' daily routine, the segments of the three conditions (resting, cry, and distress) were not balanced. We therefore randomly selected a representative number of segments for each signal feature (audio, EEG, NIRS), as described in the Results section below.
Processed audio and NIRS data were compared with an ANOVA and Tukey-Kramer tests for post hoc comparisons, together with a bootstrapping procedure repeated 10,000 times to account for non-normality. EEG and COMFORT scale data were assessed with a Mann-Whitney U-test for pairwise comparisons and a Kruskal-Wallis test when all 3 conditions were compared. For the EEG pairwise comparisons, the Holm-Bonferroni correction method was applied, while for the 3-condition comparisons Dunn's test was selected.
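As an illustration, these nonparametric tests could be run in Python with SciPy as sketched below; `rest`, `cry`, and `distress` are assumed 1-D arrays holding one feature per condition, and the Holm-Bonferroni step-down is written out explicitly.

```python
# Sketch of the nonparametric comparisons for one feature.
import numpy as np
from scipy import stats

# Three-condition comparison: Kruskal-Wallis.
h, p_kw = stats.kruskal(rest, cry, distress)

# Pairwise Mann-Whitney U-tests.
pairs = [("rest-cry", rest, cry), ("rest-distress", rest, distress),
         ("cry-distress", cry, distress)]
pvals = [stats.mannwhitneyu(a, b).pvalue for _, a, b in pairs]

# Holm-Bonferroni step-down: sort p-values ascending, multiply by
# decreasing factors, and enforce monotonicity of the adjusted values.
order = np.argsort(pvals)
adjusted = np.empty(len(pvals))
running_max = 0.0
for rank, i in enumerate(order):
    adj = min((len(pvals) - rank) * pvals[i], 1.0)
    running_max = max(running_max, adj)
    adjusted[i] = running_max
```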
For an integrative approach, the EEG/NIRS features and COMFORT scale results were correlated with the acoustic features using the Spearman (Rho) correlation coefficient. Additionally, the Kendall Coefficient of Concordance (W)61 was calculated to assess the level of agreement between the audio features and the neurophysiological and behavioral data for the cry and distress conditions. We used Cohen's interpretation guideline62, where 0.3 ≤ W < 0.5 and W ≥ 0.5 correspond to moderate and strong agreement effects, respectively.
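A sketch of these two measures is given below; `audio_feat` and `eeg_feat` are assumed paired 1-D feature arrays, `ratings` is an assumed (m_measures, n_episodes) array, and the Kendall W formula shown omits the tie correction.

```python
# Sketch of the Spearman correlation and Kendall's W (no tie correction).
import numpy as np
from scipy.stats import rankdata, spearmanr

rho, p = spearmanr(audio_feat, eeg_feat)  # Spearman correlation per feature pair

def kendalls_w(ratings):
    m, n = ratings.shape                              # m measures, n episodes
    ranks = np.apply_along_axis(rankdata, 1, ratings) # rank within each measure
    rank_sums = ranks.sum(axis=0)                     # total rank per episode
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Cohen's guideline: 0.3 <= W < 0.5 moderate, W >= 0.5 strong agreement.
```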