4.1 Imaging setup
All experimental data used in this work were captured on our MFLI system; detailed information can be found in [25]. Briefly, the system uses a large-format intensified charge-coupled device (ICCD) camera (Picostar HR, LaVision GmbH, Germany) for wide-field detection over an 8 × 6 cm² field of view, in combination with a digital micro-mirror device (DMD; DLi 4110, Texas Instruments, TX, USA) for wide-field illumination. As the excitation source, we used a tunable Ti:Sapphire laser (Mai Tai HP, Spectra-Physics, CA, USA), which delivers 100 fs pulses at 80 MHz. A gate width of 300 ps and a gate delay of 40 ps were used to capture time-resolved fluorescence decays (for both in vivo and in vitro experiments), with a total of 176 time points, referred to as the number of gates (G = 176). An emission filter centered at 740 nm (FF01-740/13-25, Semrock, Rochester, NY, USA) was used to capture the TPSFs, with the laser set to a 700 nm wavelength, and Alexa Fluor 700 dye was used to obtain the time-resolved fluorescence signals.
4.2 Generation of training data and classical fluorescence lifetime processing
Fluorescence lifetime decays follow exponential kinetics. Depending on the number of fluorophore components present in the sample, the decay can be described by a sum of exponential functions. Most FLI experiments involve up to two components; hence, a bi-exponential model is typically used. The two-component, or bi-exponential, model also includes the mono-exponential cases (where the fractional amplitude AR is one or zero). Mathematically, the TPSF is the convolution of the IRF with the fluorescence decay associated with the lifetime parameters, as shown in Eq. 2, where the lifetimes are denoted τ1 and τ2, and AR is the amplitude fraction.
$$TPSF(t) = IRF(t) * \left(A_R\, e^{-t/\tau_1} + (1 - A_R)\, e^{-t/\tau_2}\right) \quad (2)$$
The in-silico data used for training and validating the proposed model were generated using Eq. 2. Initially, time-resolved fluorescence lifetime images with dimensions of 28×28 pixels were generated using the MNIST dataset. Fluorescence decays were generated for a range of lifetime values commonly encountered in NIR applications: 0.2 ns to 0.8 ns for τ1 (short-lifetime component) and 0.8 ns to 1.5 ns for τ2 (long-lifetime component). The range of AR (fraction amplitude) was set from 0 to 100%, with both bounds corresponding to mono-exponential cases. To ensure that the simulated data accurately represent experimental conditions, pixel-wise IRFs were used. To capture experimental IRFs, a sheet of white diffusing paper was placed on the imaging table and illuminated using the DMD at an excitation wavelength of 700 nm; the reflected light was captured through a neutral density (ND) filter. Each TPSF was then generated by convolving a randomly selected IRF from this dataset with a simulated fluorescence decay profile. To approximate the noise characteristics of real-world measurements, system-derived noise (including read-out noise, dark noise, etc., as explained in [25]) was incorporated into the simulated TPSFs, ensuring that the simulated data closely match the noise dynamics observed in the actual system.
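The simulation pipeline above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the Gaussian IRF shape, its timing parameters, and the simple shot-noise model are assumptions for demonstration, while the 40 ps gate spacing and 176 gates follow the acquisition settings described in Sect. 4.1.

```python
import math
import random

def simulate_tpsf(tau1=0.5, tau2=1.2, a_r=0.6, n_gates=176, dt=0.04,
                  irf_center=1.0, irf_sigma=0.1, peak_counts=1000.0, seed=0):
    """Sketch of Eq. 2: convolve an (assumed Gaussian) IRF with a
    bi-exponential decay, then add approximate shot noise.
    Times are in ns; dt = 0.04 ns matches the 40 ps gate delay."""
    rng = random.Random(seed)
    t = [i * dt for i in range(n_gates)]
    irf = [math.exp(-0.5 * ((ti - irf_center) / irf_sigma) ** 2) for ti in t]
    decay = [a_r * math.exp(-ti / tau1) + (1 - a_r) * math.exp(-ti / tau2)
             for ti in t]
    # Discrete causal convolution, truncated to the gate window
    tpsf = [sum(irf[j] * decay[i - j] for j in range(i + 1))
            for i in range(n_gates)]
    peak = max(tpsf)
    tpsf = [v / peak * peak_counts for v in tpsf]
    # Shot-noise approximation: Gaussian with variance ~ counts
    noisy = [max(0.0, v + rng.gauss(0.0, math.sqrt(max(v, 1.0))))
             for v in tpsf]
    return t, noisy
```

In the actual training set, the IRF is not synthetic but drawn pixel-wise from the measured IRF dataset, and the noise model follows the system characterization in [25].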
To evaluate and compare the model’s performance in the absence of experimental ground truth, we used the NLSF method, which is commonly employed to estimate the FLI parameters described in Eq. 2. Traditionally, FLI parameters are estimated from experimental data through iterative fitting methods such as NLSF with the Levenberg-Marquardt algorithm [26], or through center-of-mass (CMM) analysis [27]. For our NLSF analysis, we used the AlliGator software [28], which allows adjustments and constraints on fitting parameters, including short and long lifetimes, fraction amplitudes, and offsets, depending on experimental conditions. We selected between single- and double-exponential decay models according to the complexity of our datasets. AlliGator also provides an option for offset correction when there is a mismatch between the TPSF and IRF; we evaluated the importance of this correction in our NLSF analysis by comparing results with and without it in our phantom experiments. Additionally, we benchmarked our results against those obtained using FLI-Net to provide a comprehensive evaluation of our approach in the context of established methodologies.
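Of the classical estimators mentioned above, CMM is the simplest: for an ideal background-free mono-exponential decay, the mean photon arrival time equals the lifetime. A minimal sketch (illustrative only; for a measured TPSF, the IRF centroid must additionally be subtracted, and truncation of the decay window biases the estimate slightly low):

```python
import math

def cmm_lifetime(t, counts):
    """First-moment (center-of-mass) lifetime estimate of a decay."""
    total = sum(counts)
    return sum(ti * ci for ti, ci in zip(t, counts)) / total

# Mono-exponential decay with tau = 1.0 ns, sampled at 40 ps over 176 gates
dt, tau = 0.04, 1.0
t = [i * dt for i in range(176)]
counts = [math.exp(-ti / tau) for ti in t]
est = cmm_lifetime(t, counts)  # close to, but slightly below, 1.0 ns
```

The small negative bias comes from discretization and from truncating the decay at about seven lifetimes, which is why iterative fitting (NLSF) remains the reference method for bi-exponential data.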
4.3 Deep learning network architecture
MFliNet was designed to process data using a transformer encoder-decoder architecture following Vaswani et al. [23], as illustrated in Fig. 3. The model’s design includes two inputs (IRF and TPSF) and three outputs: the short lifetime, long lifetime, and fraction amplitude. The architecture of MFliNet allows the network to focus on different parts of the input sequences, thanks to the Transformer’s attention mechanism.
In MFliNet, the attention mechanism is implemented using multi-head self-attention layers. Each layer has 16 attention heads, allowing the model to process various aspects of the input data simultaneously. In the self-attention mechanism, three vectors, Query (Q), Key (K), and Value (V), are generated from the input data. These vectors are created through linear transformations, which means that the same input tensor is multiplied by different learned weight matrices to produce the Q, K, and V vectors:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V \quad (3)$$
where X is the input sequence, and WQ, WK, and WV are the learned weight matrices for the query, key, and value projections, respectively. Thus, although Q, K, and V originate from the same data, they are transformed into different representations that the model uses to compute attention. The self-attention process begins with the dot product of the query vectors with the key vectors, resulting in a score matrix:

$$S = \frac{QK^{T}}{\sqrt{d_k}} \quad (4)$$
where dk is the dimension of the key vectors. This score matrix indicates the relevance of each time step in the sequence relative to the current time step. The scores are scaled by the square root of dk to maintain stable gradients. After scaling, the scores are passed through a softmax function to produce attention weights, which are used to compute a weighted sum of the value vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (5)$$
These weights determine where the model focuses within the input, producing a context vector that highlights key segments of the sequence. This approach enables the model to capture dependencies by recognizing relationships between elements in the sequence: the self-attention mechanism compares each part of the sequence with every other part, allowing the model to identify and leverage patterns spanning the entire sequence that traditional models might overlook. The context vectors from all attention heads are then concatenated and linearly transformed to produce the final output of the multi-head self-attention layer:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \quad (6)$$
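Eqs. 3–6 can be made concrete with a small, dependency-free sketch of scaled dot-product attention (illustrative only: MFliNet’s layers operate on learned projections of the IRF and TPSF sequences, and the per-head outputs are subsequently concatenated and projected as in Eq. 6):

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    """(n x k) @ (k x m) matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention (Eq. 5): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    KT = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, KT)]
    weights = [softmax(row) for row in scores]  # rows sum to 1
    return matmul(weights, V)
```

When a query aligns strongly with one key, the softmax weight for that position approaches 1 and the output reduces to the corresponding value vector, which is exactly the "focus" behavior described above.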
After the tuning process, two identical stacks of transformer encoder layers yielded the best results for our application. Each encoder layer contains a multi-head self-attention mechanism followed by a fully connected feed-forward network (FFN). The FFN, composed of linear transformations and activation functions, further processes the attention outputs, enhancing the model’s capacity to learn complex patterns:
$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2 \quad (7)$$
Residual connections and layer normalization are incorporated into each encoder layer to stabilize training and improve gradient flow. Layer normalization standardizes the outputs, reducing training time and enhancing model performance. In each encoder layer and in the embedding layers, the outputs are kept at the same dimension as the inputs. This consistency, as stated in [23], is crucial for the effective use of residual connections and layer normalization, both essential for the network’s stability and performance. Furthermore, to enhance the model’s ability to discern global and complex relationships across the input sequence, self-attention layers are placed after the encoder layers for both inputs. This structure allows the model to focus on variations in the IRF offset and encode the relationship between these variations and the FLI parameters.
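The add-and-norm operation used in each encoder layer, LayerNorm(x + Sublayer(x)), can be sketched per feature vector as follows (a minimal illustration with an assumed epsilon, not the framework implementation, which also learns scale and shift parameters):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (near-)unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization:
    LayerNorm(x + Sublayer(x)). The skip path lets gradients bypass
    the sublayer, which is what stabilizes deep stacks."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])
```

Because the residual addition requires matching shapes, the encoder keeps every layer’s output at the input dimension, as noted above.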
The outputs from the two paths are concatenated and passed through a convolutional neural network (CNN) residual block to integrate and refine the extracted features. While the self-attention and encoder layers are adept at capturing long-range dependencies within the data, CNNs are better suited to capturing local patterns and short-term temporal dependencies [29, 30]. The CNN residual block is introduced after the attention layers to address this complementary need; it is used to model fine-scale variations in the time-domain data, such as subtle fluctuations in the captured signals. The block contains multiple convolutional layers, each followed by batch normalization, which normalizes the layer outputs, stabilizes the learning process, and improves convergence speed. Skip connections help address the vanishing-gradient problem, allowing gradients to flow more easily through the network and further improving training stability.
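A 1D convolutional residual block of this kind can be sketched as follows (batch normalization is omitted for brevity, and the kernel values are placeholders; in the actual network the filters are learned during training):

```python
def conv1d(x, kernel, bias=0.0):
    """Same-padded 1D convolution over a signal, with an odd-length kernel."""
    k = len(kernel)
    pad = k // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * xp[i + j] for j in range(k)) + bias
            for i in range(len(x))]

def residual_conv_block(x, kernel):
    """Conv -> ReLU -> add skip connection, capturing local temporal
    patterns while the identity path preserves gradient flow."""
    y = [max(0.0, v) for v in conv1d(x, kernel)]
    return [a + b for a, b in zip(x, y)]
```

With the identity kernel [0, 1, 0] and a non-negative input, the block simply doubles the signal, which makes the skip path easy to verify.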
The network then splits into three branches to compute the short lifetime, long lifetime, and fraction amplitude, allowing each output to be analyzed in a focused manner. Each branch passes through additional self-attention layers with layer normalization and transformer decoder layers. By doing so, the model can better filter out noise in the inputs, improving the extraction and prediction of the lifetime parameters. The transformer decoder in MFliNet incorporates both self-attention and cross-attention layers. While self-attention captures in-sequence importance, the cross-attention layer adjusts the weights to align the processed outputs with the decoder’s current state, effectively comparing different sequences and mapping them to the correct FLI parameters. This helps transform and integrate the previously learned characteristics of the input signals; the attention mechanism here follows the same formula as Eq. 5, with the weights adapted accordingly. The final layers in each pathway are CNN layers with a 1×1 kernel and L2 regularization, which produce the final output parameters.
The model was trained on an NVIDIA RTX 3090 GPU. The training data consisted of 2000 samples (2000 28×28 images, yielding 1,568,000 generated TPSFs and corresponding IRFs), of which 10% were used for validation. After hyperparameter tuning, the RMSProp optimizer was chosen, following the previous study, with an adaptive learning rate starting at 0.001. Mean squared error (MSE) was used as the loss function for each output branch. This architecture and training process ensured that MFliNet could accurately model and predict the desired outputs from the FLI data.
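For reference, RMSProp adapts the step size per weight by maintaining a running average of squared gradients and dividing each step by its root. A single-update sketch with the stated initial learning rate of 0.001 (the decay factor rho = 0.9 is an assumed default, not a value reported in the text):

```python
import math

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update for a flat list of weights.
    cache holds the running average of squared gradients."""
    new_cache = [rho * c + (1 - rho) * g * g for c, g in zip(cache, grad)]
    new_w = [wi - lr * g / (math.sqrt(c) + eps)
             for wi, g, c in zip(w, grad, new_cache)]
    return new_w, new_cache

def mse(pred, target):
    """Mean squared error, the per-branch training loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

Dividing by the running root-mean-square gradient gives larger steps for weights with consistently small gradients and smaller steps for noisy ones, which is the "adaptive learning rate" behavior mentioned above.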
4.4 Phantom preparation and the in vivo experiment
For experimental validation, we designed a step-ladder phantom to introduce variations in sample-detector distance, as depicted in Fig. 4(a). A 3D-printable case was designed to accommodate five discrete containers arranged at various heights: ground level, 5 mm, 10 mm, 15 mm, and 20 mm. Each container measured 40×40×10 mm to accommodate tissue-mimicking phantoms. The phantoms were made with agar constituting 1% of the total volume (80 cm³). To prepare the phantoms, agar was first dissolved in distilled water, heated until fully dissolved, and then allowed to cool slightly before further processing. The optical properties of the phantoms (absorption coefficient µa of 0.005 mm⁻¹ and reduced scattering coefficient of 1 mm⁻¹) were controlled through the addition of India ink and intralipid solutions, providing absorption and scattering contrast, respectively. In each ladder step, a specific area was designated for the placement of a cuboidal fluorescence embedding with dimensions of 5×5×40 mm. This embedding consisted of Alexa Fluor 700 dye dissolved in phosphate-buffered saline at a concentration of 20 µM. The embeddings were placed at a depth of 1 mm from the surface of each phantom.
For in vivo MFLI imaging experiments, we imaged HER2+ breast tumor xenografts (HCC1954) in athymic nude mice. The cell line was sourced from ATCC (Manassas, VA, USA) and maintained in RPMI 1640 media enriched with 10% fetal bovine serum (ATCC) and 50 units/mL/50 µg/mL penicillin/streptomycin from ThermoFisher Scientific (Waltham, MA, USA). We initiated tumor xenografts by subcutaneously injecting 5 × 10⁶ HCC1954 cells, suspended in PBS and mixed in a 1:1 ratio with Cultrex BME (R&D Systems Inc, Minneapolis, MN, USA), into the inguinal mammary fat pads of 4-week-old female athymic nude mice (CrTac: NCR-Foxn1nu, Taconic Biosciences, Rensselaer, NY, USA). Tumors were monitored daily for 4 weeks. The mouse was administered retro-orbital injections of AF700 conjugated with MDT-TZM (MDT-TZM-AF700) at 20 µg and AF750 conjugated with MDT-TZM (MDT-TZM-AF750) at 40 µg, a 2:1 acceptor-to-donor ratio, delivered as staggered injections: the donor was injected 2 hours before the acceptor, both via the retro-orbital route. MFLI imaging was conducted 24 hours post-injection using the MFLI imaging setup. Throughout the imaging process, the mouse was anesthetized with isoflurane, and its body temperature was maintained with a Rodent Warmer X2 (Stoelting, IL, USA). All animal procedures were conducted with the approval of the Institutional Animal Care and Use Committee (IACUC) at both Rensselaer Polytechnic Institute and Albany Medical College. The animal facilities of both institutions are accredited by the Association for Assessment and Accreditation of Laboratory Animal Care (AAALAC) International.
To examine the offset variation across different heights on the phantom and in live intact animals, we plotted the pixel-wise IRFs for comparison, as shown in Fig. 4. For each specified height of the step-ladder phantom, we plotted the average IRF in Fig. 4(a). We also illustrate the IRFs of various anatomical regions in live intact animals, including the liver, urinary bladder (UB), and tumors, in Fig. 4(b). Lastly, in Fig. 4(c), we examined the variability of the IRF within a single tumor. This plot contains IRFs at three points within the tumor: the top (an IRF from the upper region of the tumor), the middle (an IRF from the central area in terms of height), and the bottom (an IRF from the lowest part of the tumor).