2.1 Ethical Statement
The study protocol was approved by the institutional ethics committee of the First Affiliated Hospital of Third Military Medical University (also called Army Medical University) on February 20, 2021 (approval No. KY2021060), and written informed consent was obtained from each patient. The clinical trial was registered with the Chinese Clinical Trial Registry (No. ChiCTR2100044138) on March 11, 2021. The principal investigator was Prof. Bin Yi.
2.2 Patient Recruitment and Image Collection
The patient recruitment and image collection phase was conducted at the First Affiliated Hospital of the Third Military Medical University in Chongqing, China, from March 18 to April 26, 2021. The inclusion criteria were: willingness to participate in the research and ability to adhere to the study protocol; need for arterial blood gas (ABG) analysis as part of routine clinical care; and a perioperative hemoglobin (Hb) variation exceeding 1.5 g/dL. The exclusion criteria were: refusal to participate; inability to cooperate due to mental health conditions; eye disease or prior radiotherapy to the eye or face; carbon monoxide or nitrite poisoning, jaundice, or any other condition affecting conjunctival color; or any other factor judged by the researchers to make a participant unsuitable for the study.
To facilitate patient enrollment, image capture, data collection, and image analysis, a standardized research methodology was established. The research team comprised eight members, each assigned specific roles: one for patient recruitment, two for capturing images, two for data collection and management, one for conjunctiva analysis, and two for quality assurance. Prior to the commencement of patient recruitment, all team members underwent training to familiarize themselves with the study's procedures, including the inclusion and exclusion criteria, techniques for conjunctiva exposure and image capture, and conjunctiva analysis standards.
On the day preceding surgery, eligible patients who consented to participate signed a written informed consent form. On the day of surgery, following ABG analysis, the designated team members proceeded to the operating room or the post-anesthetic care unit (PACU) to photograph the patients' right and left facial profiles, ensuring standard conjunctiva exposure under the typical lighting conditions of the operating room and PACU. The interval between the ABG analysis and the image capture did not exceed 10 minutes. All photographs were taken with the patients in a supine position, using the rear camera (20 megapixels, f/1.8 aperture) of the same smartphone under identical settings. Simultaneously, two other team members recorded patient identifiers, gender, Hb levels, age, and other pertinent information.
At the end of each day, the data collection team reviewed the images to identify patients with Hb variations greater than 1.5 g/dL, discarding all unselected images permanently. The quality control team oversaw the entire process, ensuring the integrity of patient recruitment, image quality, and data accuracy throughout the study.
2.3 Workflow and Experimental Methodology
In this study, we developed a smartphone-based system that estimates hemoglobin levels using deep learning. The system uses a smartphone application to capture eyelid images, which are then analyzed by a deep neural network trained on a dataset of Hb measurements obtained from invasive blood tests. The network uses features extracted from the eyelid images to predict Hb concentration. The system's workflow and the experimental setup are described below.
Fig.1 provides a schematic overview of the system's workflow and the study's experimental framework. The system comprises one algorithm dedicated to eyelid segmentation and another for predicting Hb concentration from the segmented eyelid regions (refer to Fig.1). Leveraging deep learning, we achieved rapid and reliable detection of Hb levels in patients undergoing surgery.
To compile the training datasets, we captured eyelid images from patients using several smartphone models, and applied data augmentation to improve the deep learning model's accuracy and robustness. The trained model was assessed on previously unseen data, with its accuracy verified on a set of 265 test samples. To further validate the model, we conducted a comparative analysis between two groups evaluating the same 265 test images: human experts, and the prediction model described in this paper. The medical professionals estimated the Hb concentration range from the patients' eye images, and the accuracy of their estimates was assessed. In parallel, the smartphone application segmented the eye images, predicted the Hb values, and reported the prediction error within the same specified range.
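The within-range comparison between experts and the model can be illustrated with a short sketch. This is hypothetical code, not the study's implementation; the tolerance and sample values are illustrative assumptions:

```python
def within_range_accuracy(predicted, reference, tolerance=1.5):
    """Fraction of Hb predictions whose absolute error against the
    ABG reference is within the given tolerance (g/dL)."""
    assert len(predicted) == len(reference)
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(reference)

# Hypothetical example values (g/dL), not from the study:
preds = [11.2, 9.8, 13.5, 7.9]
refs = [10.5, 10.2, 12.0, 9.6]
accuracy = within_range_accuracy(preds, refs, tolerance=1.5)
```

The same function can score both the experts' range estimates (using their midpoints) and the model's point predictions, making the two groups directly comparable.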
This experimental design not only highlights the potential of mobile technology in medical diagnostics but also showcases the accuracy and efficiency of deep learning algorithms in predicting critical health markers such as hemoglobin levels.
2.4 Image Data Augmentation
To effectively train deep learning models, a substantial volume of training data is essential. Nonetheless, enlarging the training dataset poses a significant challenge. A practical approach to augment the volume of training data involves the reproduction of existing data. This process generates multiple images from a single source by randomly applying a combination of techniques illustrated in Fig.2, which includes: (A) color temperature adjustment, (B) contrast enhancement, (C) brightness alteration, (D) Gaussian blur application, (E) horizontal flipping, and (F) stochastic cropping and resizing. These augmentation techniques are selected to mirror the variety of conditions encountered when photos are captured using smartphones in real-world scenarios.
The rationale behind each technique is as follows:
- Techniques A, B, and C address the variability in color representation across different smartphone models, ensuring the model is not biased towards the color metrics of a specific device.
- Technique C also accounts for the diverse lighting conditions under which photos might be taken, ranging from dimly lit environments to brightly illuminated settings.
- Technique D introduces an element of blur to simulate photos taken out of focus, a common occurrence in hastily captured images.
- Techniques E and F are designed to mimic minor inaccuracies in framing and alignment that can occur during the photo capture process, ensuring the model can accurately process images despite slight imperfections.
By employing these data augmentation techniques, we not only increase the diversity of our training dataset but also enhance the robustness and generalizability of our deep learning model to accurately interpret images under a wide array of conditions typical of smartphone photography.
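As a minimal sketch, several of these augmentations (B, C, E, and F; color-temperature adjustment and Gaussian blur are omitted for brevity) can be expressed with plain NumPy. This is an illustrative reimplementation, not the study's actual pipeline, and the parameter ranges are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Apply a random combination of simple augmentations to an RGB
    image (H x W x 3, uint8), mirroring techniques B, C, E, and F."""
    out = img.astype(np.float32)
    # (C) brightness: random additive intensity shift
    out += rng.uniform(-30, 30)
    # (B) contrast: scale intensities around the mean
    mean = out.mean()
    out = (out - mean) * rng.uniform(0.8, 1.2) + mean
    # (E) horizontal flip with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # (F) random crop to 90% of each side (resizing back omitted)
    h, w = out.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = out[y:y + ch, x:x + cw]
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying `augment` several times to one source image yields multiple distinct training samples, which is the replication strategy described above.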
2.5 Model Optimization for Precise Eyelid Detection and Hemoglobin Concentration Prediction
The effectiveness of many AI-based diagnostic methods diminishes significantly under variations in lighting conditions, camera angles, and other external influences. To counteract these challenges and enhance algorithmic performance, this study employed two distinct algorithms: 1) an eyelid semantic segmentation algorithm, and 2) a prediction algorithm based on color intensity analysis. Model performance was assessed within a deep learning framework implemented directly in a smartphone application, as depicted in Fig.1. First, the Efficient Group Enhanced UNet (EGE-UNet) [18] was deployed to accurately identify the target eyelid regions; the success of the prediction algorithm proved closely tied to precise localization of these regions of interest. The deep learning network then performed the hemoglobin concentration predictions, among which the DHANet model emerged as the superior predictor owing to its accuracy.
In the concluding phase of model optimization, the DHANet model was selected as the definitive choice for hemoglobin concentration prediction, based on its demonstrated accuracy. This underscores the importance of both accurate eyelid segmentation and effective color intensity analysis in improving the performance of AI diagnostic tools, especially in the variable and unpredictable environment of smartphone applications.
2.6 Deep Learning Model Architecture
This study introduces a deep learning model structured in two pivotal stages: a Region of Interest (ROI) cropping stage and a decision-making stage. This division stems from the observation that in diagnostic imaging, as illustrated in Fig.1, the most informative content is localized within a small area, here the eyelid region. Making decisions directly from the original, full-sized image is inefficient because of the small ratio of relevant information to overall image size. The model draws inspiration from human diagnostic practice, where attention is typically narrowed to the informative area. By mimicking this approach, separating the precise cropping of the eyelid area (ROI cropping stage) from the diagnostic analysis of the cropped region (decision stage), we aim to enhance learning efficiency.
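The ROI cropping stage can be sketched as follows: given the binary mask produced by the segmentation network, crop the bounding box of the masked region before passing it to the decision stage. This is a minimal NumPy illustration under that assumption, not the authors' implementation:

```python
import numpy as np

def crop_roi(image, mask, margin=4):
    """Crop the region of interest from an image given a binary
    segmentation mask (1 = region pixel), with a small safety margin."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return image  # no region found: fall back to the full image
    h, w = mask.shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    return image[y0:y1, x0:x1]
```

The cropped patch, rather than the full photograph, is what the decision-stage network sees, which keeps the ratio of informative to background pixels high.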
The EGE-UNet [18] model, an advanced iteration of the traditional U-Net [19] designed to address challenges in medical image segmentation, is deployed during the ROI cropping phase. It incorporates two novel modules: the Group multi-axis Hadamard Product Attention (GHPA) module and the Group Aggregation Bridge (GAB) module. The GHPA module extracts lesion information from multiple perspectives by grouping input features and applying Hadamard product attention operations along different axes, an idea inspired by the multi-head self-attention mechanism. The GAB module fuses semantic and detail features across scales, together with the masks generated by the decoder, through group aggregation, enabling efficient extraction of multi-scale information. EGE-UNet stands out for its segmentation accuracy, low parameter count, and computational simplicity, making it well suited to practical applications.
Fig.3 delineates the EGE-UNet design: a U-shaped layout with symmetrical encoder and decoder. The encoder comprises six stages with varying channel counts; the early stages use standard convolutions, while the later stages use GHPA for multi-perspective representation extraction. Each encoder-decoder junction incorporates a GAB module, improving on the simple skip connections of the original U-Net, and deep supervision provides mask predictions at multiple scales that feed the GAB inputs. These enhancements enable EGE-UNet to surpass previous methods in segmentation efficacy while maintaining a reduced parameter and computational footprint. For an in-depth discussion of the GHPA and GAB modules, refer to reference [18].
The decision stage of this paper introduces a hemoglobin concentration prediction model that adopts a regression-based approach, drawing inspiration from the lightweight Delta Age AdaIN (DAA) network for facial age estimation [20]. That method encodes age in binary form and feeds it into a transfer learning framework to capture continuous age-related feature information. The binary code mapping yields two groups of values corresponding to the means and standard deviations of the comparison ages, respectively; the age decoder computes the delta ages, and the mean over all comparison and delta ages is used for the age prediction. We adapt this methodology to the eyelid prediction stage, as depicted in Fig.4.
The architecture of the eyelid prediction system, as depicted in the diagram, leverages deep learning alongside binary encoding mapping technology, comprising four main components:
1. EyelidEncoder Module: This pivotal module transforms the eyelid image into a comprehensive feature vector, encapsulating essential characteristics of the eyelid. For this purpose, the C3AE network[21] is employed due to its efficiency and compactness, making it particularly suitable for deployment on mobile platforms.
2. Delta Hemoglobin AdaIN (DHA): The DHA component is instrumental in estimating hemoglobin concentrations by juxtaposing the current image against a repository of images representing a spectrum of hemoglobin levels. It facilitates hemoglobin concentration prediction by evaluating the feature discrepancies across images.
3. Binary Encoding Mapping Module: Given that hemoglobin concentration variation is a continuous and gradual phenomenon, an 8-bit binary code is utilized to encapsulate the range of hemoglobin concentrations. This method employs binary encoding to transform the continuous spectrum of hemoglobin levels into a discrete, yet seamless, binary representation, enhancing the model's efficiency and interpretability.
4. EyelidDecoder Module: Acting as the final step in the prediction pipeline, the EyelidDecoder module interprets the outputs from both the EyelidEncoder and the binary encoding mapping modules and, from this consolidated information, predicts the patient's hemoglobin concentration.
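The binary encoding idea in components 2 and 3 can be illustrated with a small sketch. The clinical range [4, 20] g/dL and the uniform 256-level quantization here are assumptions for illustration only; the study does not specify the exact mapping:

```python
def hb_to_binary(hb, lo=4.0, hi=20.0):
    """Quantize a hemoglobin value (g/dL) into 256 levels over an
    assumed clinical range [lo, hi] and return its 8-bit binary code."""
    hb = min(max(hb, lo), hi)
    level = round((hb - lo) / (hi - lo) * 255)
    return [int(b) for b in format(level, '08b')]

def binary_to_hb(bits, lo=4.0, hi=20.0):
    """Inverse mapping: decode an 8-bit code back to a hemoglobin value."""
    level = int(''.join(str(b) for b in bits), 2)
    return lo + level / 255 * (hi - lo)
```

With 8 bits, adjacent codes differ by about 0.06 g/dL under these assumed bounds, so the discretization is far finer than clinically meaningful Hb differences while keeping the representation compact.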
In refining the DHA prediction model, the performance of the EyelidEncoder module was optimized by evaluating the ResNet18 network against the original C3AE model. Additionally, a range of widely used mobile image processing architectures were analyzed for their applicability: MobileNet [22], MobileNetV2 [23], MobileNetV3 [24], ShuffleNetV2 [25], SqueezeNet [26], WideResNet [27], ResNet18_CBAM [28], PFLD [29], and BCNN. Notably, the PFLD model is recognized for its compact structure, suitable for age prediction, while BCNN is a simple five-layer convolutional network developed in-house. The efficacy of these models was assessed using Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination (R²), to ensure a comprehensive evaluation of model performance.
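These are standard regression metrics; a brief sketch of how they are computed (an illustrative helper, not code from the study):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MAPE (in percent), and R-squared for a set of
    hemoglobin predictions against reference values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.abs(err).mean()
    mape = (np.abs(err) / np.abs(y_true)).mean() * 100
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return mae, mape, r2
```

MAE and MAPE measure absolute and relative error respectively (lower is better), while R² measures the fraction of variance in the reference Hb values explained by the predictions (closer to 1 is better).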
2.7 Experiments on the Server
The experiments were implemented in Python using the open-source PyTorch framework. The hardware was a Dawning workstation at the Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, equipped with dual NVIDIA 3090 graphics cards (11 GB of memory each) and running a 64-bit Ubuntu 16.04 operating system.
2.8 Model Porting and Mobilization
A smartphone application for Android systems was developed to facilitate hemoglobin concentration estimation directly from eyelid images. As depicted in Fig.5, the mobile application is divided into two main sections: sampling detection and case management.
Sampling Detection Section:
- Photo-taking Functionality: Users can capture images using both the front and rear cameras of their device. The application features an interface with a target detection box to guide users in framing the eyelid within the photograph. Alternatively, users can select existing images from their photo album for analysis.
- Eye Area Image Display: Captured images of the eye area are displayed through the application's interface for review and further processing.
- Hemoglobin Concentration Recognition: The application employs the developed model to analyze the selected eye area images, determining hemoglobin concentration levels and highlighting specific regions associated with these levels through mask areas.
- Result Display Function: The detected eyelid area, mask area, and the calculated hemoglobin concentration values are presented to the user, enabling easy visualization and understanding of the results.
Case Management Section:
- Users have the capability to store detection outcomes and enter patient details to create a new case file. Future sampling for the same patient can be added and linked to the existing case, allowing for monitoring of hemoglobin level changes over time.
Porting the models to the mobile platform involves several technical steps. First, the segmentation and prediction models are converted to the ONNX format for broader compatibility. Model inference is then invoked via OpenCV, with the inference code written in C++. This inference logic is packaged through NDK cross-compilation into an SDK exposing a standard C interface. Finally, the application interface and related logic are developed in Android Studio, with the SDK performing predictions through that C interface. This approach integrates the deep learning models into a user-friendly mobile application, enhancing accessibility and utility for end users.