C. elegans culture and chemical treatment
C. elegans culture and chemical treatment are performed according to previously published protocols [29, 30, 36]. Briefly, embryos of the N2 strain (Caenorhabditis Natural Diversity Resource, CaeNDR) are collected from gravid adults by sodium hypochlorite treatment and hatched overnight into synchronized first larval stage (L1) worms. L1s are placed in a 24-well plate with HB101 food in S medium. The L1 larvae are treated with methylmercury(II) hydroxide (CH3HgOH, CAS# 1184-57-2, Sigma), a known developmental toxicant [37, 38], in conventional 24-well plastic plates alongside solvent controls. The plates are incubated at 20°C for 72 hours, until the worms in the control wells reach day 1 (D1) of adulthood. The experiments are repeated five times using five vivoChip-24x devices on three different days.
High-resolution imaging of C. elegans
To acquire high-resolution images of worms, we load all 24 populations into a 24-well microfluidic device (vivoChip-24x, vivoVerse) in M9 buffer after 72 hours of chemical treatment in conventional plates. Underneath each well of the vivoChip-24x device are 40 parallel, gently tapering, 3 mm long microfluidic channels that trap C. elegans (Figs. 1a-c); the device contains a total of 960 trapping channels. A custom-designed gasket seals the device and provides fluidic connections to all wells. A single input in the gasket applies fluidic pressure that pushes the worms into individual channels (1 animal per channel) using intermittent ON/OFF pressure cycles. Once the worms are immobilized inside the parallel, narrowing channels, a constant fluid pressure holds them still for blur-free imaging. Automated high-resolution imaging is then performed on all 960 channels to collect time-lapse and z-stack brightfield images, and z-stack fluorescence images, within 30 minutes using a customized automated microscope (IX73, Evident) with a fast, large-area, high-quantum-efficiency camera (IRIS15, Teledyne). All 40 channels underneath each well are imaged in 5 FOVs of 8 channels each using a 10×, 0.4 NA objective. The entire worm volume is captured with 10 z-slices at 6-micron steps centered around the best focal plane of a fiduciary marker. We also collect 5 time-lapse 3D hyperstack images at 1-second intervals (Fig. 1d). Following the time-lapse brightfield imaging, a single z-stack of autofluorescence images is acquired with a GFP filter set using the same objective.
Two types of vivoChip-24x devices are used in this study to accommodate complete immobilization of C. elegans of different body sizes. The first, the vivoChip-24x-3L device, has 3-layer microchannels with different heights that immobilize young adult (YA) to day 1 adult (D1) stage worms (Supplementary Fig. 1a). The second, the vivoChip-24x-4L device, has an additional layer (4 layers in total) that further reduces the microchannel dimensions to also immobilize smaller larval stage (L4) worms (Supplementary Fig. 1b). We used the 4-layer (4L) microfluidic chip (vivoChip-24x-4L) for testing toxicants over a wide range of concentrations, which can yield widely different C. elegans body sizes from young L4 up to the D1 adult stage.
Pre-processing of images
Images are automatically uploaded to a local server for processing and analysis. Each channel is then cropped into an individual hyperstack by clipping the full-FOV hyperstack into eight 150 µm wide sections (Fig. 1e). The cropping is centered on each predicted channel centerline, which is determined relative to the fiduciary marker (Fig. 1g).
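For illustration, a minimal sketch of this cropping step is shown below; the pixel size (assuming the IRIS15's 4.25 µm pixels at 10× magnification) and the centerline positions are assumed inputs, not the exact production code:

```python
import numpy as np

PIXEL_UM = 0.425                       # assumed: IRIS15 4.25 um pixels at 10x
HALF_W = int(150 / PIXEL_UM) // 2      # half of a 150 um wide crop, in pixels

def crop_channels(fov_hyperstack, centerlines_px):
    """Split one FOV hyperstack (T, Z, H, W) into 8 per-channel hyperstacks.

    `centerlines_px` holds the x-position of each predicted channel
    centerline, determined relative to the fiduciary marker.
    """
    crops = []
    for cx in centerlines_px:          # 8 channels per FOV
        lo = max(cx - HALF_W, 0)
        crops.append(fov_hyperstack[..., lo:cx + HALF_W])
    return crops
```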
Manual segmentation for ground truth data
Each volumetric image of an individual C. elegans is manually annotated with an in-house graphical user interface (GUI) toolbox (vivoSegmenter). The GUI allows the user to scroll through multiple z-plane images and time points for each cropped channel. Since the morphology of the worm body in a given channel does not change substantially between time points, users only consider the first time point. In the GUI, the user first assigns each channel to one of 3 classes: no worm (empty channel), partial worm (a partially visible worm inside the channel), or full worm (a full worm body present in the channel). Users then segment full worms by clicking multiple points along the worm body in different z-slices of each cropped channel presented by the GUI; the GUI registers the coordinates (x, y, and z) of each clicked point. Once all the points are entered and the contour is closed around the worm, a polygon is created by connecting the points as vertices. Segmenting most C. elegans bodies requires users to examine 3 to 5 z-slices. While depth information is important for clearly delineating boundaries, wide-field microscopy lacks the optical sectioning required for fine-grained segmentation over depth. We therefore collapse all segmentation polygons into a single 2D binary image, which serves as our ground truth. Finally, each channel is assigned a class label (full, partial, or empty); a polygon is generated only for channels with the full worm label.
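A minimal sketch of how the clicked vertices could be collapsed into the 2D ground-truth mask (an assumed implementation of the vivoSegmenter export, not the exact tool code):

```python
import numpy as np
from skimage.draw import polygon

def vertices_to_mask(vertices_xyz, shape_hw):
    """Collapse ordered (x, y, z) click coordinates into a 2D binary mask.

    The z-coordinate of each vertex is dropped: all polygon vertices are
    projected onto a single plane, matching the 2D ground-truth format.
    """
    xs = np.array([v[0] for v in vertices_xyz])
    ys = np.array([v[1] for v in vertices_xyz])
    mask = np.zeros(shape_hw, dtype=bool)
    rr, cc = polygon(ys, xs, shape=shape_hw)   # fill the closed polygon
    mask[rr, cc] = True
    return mask
```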
Data pre-processing for ML analysis
Our network is designed to work with either a full z-stack image volume or a subset of it. Although we collect 10 focal planes per time point during imaging, we have found that most vision tasks are best solved with a subset of the volume centered on the most relevant focal plane (Fig. 1f). The central focal plane of a cropped channel can be found by applying a Laplacian filter to each z-slice and selecting the slice with the highest high-frequency content [39, 40]. We note that the mode (most frequent value) of the z-values from all the vertices users click during worm segmentation matches the best focal plane estimated by this Laplacian criterion. After identifying the central plane, \(N\) planes on each side are collected and stacked along the channel dimension, so that the \(H\times W\) planes together form a 2.5D tensor of shape \((2N+1)\times H\times W\). For each cropped channel, the height (H) of the image is fixed at 5,056 pixels, and the width (W) falls between 340 and 360 pixels depending on the cropping. To obtain a uniform size across all 960 channels that is suitable for our network, images are padded on both sides to a fixed width of 384 pixels. In practice, we found \(N=1\) (3 z-slices) to be the optimal configuration for this problem, as expanding beyond 3 z-slices did not improve performance.
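A minimal sketch of the focus selection and 2.5D stacking described above (assumed implementation; the variance of a Laplacian-filtered slice is one common realization of the focus criterion):

```python
import cv2
import numpy as np

def best_focus_index(zstack):
    """Index of the sharpest z-slice: highest Laplacian (high-frequency) energy."""
    scores = [cv2.Laplacian(z.astype(np.float32), cv2.CV_32F).var()
              for z in zstack]
    return int(np.argmax(scores))

def make_2p5d_tensor(zstack, n=1, width=384):
    """Stack the central plane with n neighbors per side; pad width to 384 px."""
    k = best_focus_index(zstack)
    planes = zstack[max(k - n, 0):k + n + 1]   # (2n+1, H, W) away from edges
    pad = width - planes.shape[-1]             # 340 <= W <= 360 before padding
    return np.pad(planes, ((0, 0), (0, 0), (pad // 2, pad - pad // 2)))
```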
vivoBodySeg architecture for C. elegans body analysis
We propose vivoBodySeg, a 2.5D U-Net with an attention mechanism at the bottleneck for the classification and semantic segmentation of C. elegans [41]. The proposed architecture consists of the following sub-networks: a fully convolutional encoder, a bottleneck consisting of a small vision transformer (ViT), and a fully convolutional decoder (Fig. 2a). The network produces two outputs: a pixel-wise segmentation over classes produced by the decoder and an image-wise classification over classes produced at the bottleneck.
Drawing terminology from previous work, an N.5D CNN refers to a convolutional neural network (CNN) that processes N + 1 dimensions but only N of them in a convolutional manner [42]. The last dimension is stacked over the channel/feature dimension, similar to how spectral information is often treated. Both the encoder and decoder of vivoBodySeg are built from the same residual convolutional layer, whose general design follows the ResNet architecture proposed by He et al. (Fig. 2b) [43]. Following standard practice for semantic segmentation models, the final decoder layer is followed by a set of linear layers and a softmax function to produce a soft segmentation.
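A minimal PyTorch sketch of such a residual layer (an assumed configuration; kernel sizes and normalization choices are illustrative, following the ResNet design [43]):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # 1x1 projection so the skip connection matches the output width.
        self.skip = (nn.Identity() if c_in == c_out
                     else nn.Conv2d(c_in, c_out, 1, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))
```

In the 2.5D setting, the first encoder block simply receives the \(2N+1\) stacked z-slices as its input channels.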
To augment the standard convolutional approach, we introduce a ViT at the bottleneck to enable efficient, long-range communication between embedded voxels (Fig. 2c). While standard images from non-scientific cameras may span hundreds of pixels, our IRIS15 images span over 5,056 pixels along the worm length. By replacing the standard convolutional bottleneck with a ViT, our goal is to ease the classification task, as relevant image patches span thousands of pixels. The output of our ViT is routed to two separate sub-networks: the convolutional decoder and the classification sub-network. The classifier uses the pooled attention mechanism introduced by Lee et al., with a single seed vector followed by a series of linear layers that produce one score per class for our classification task [44].
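A minimal PyTorch sketch of a single-seed pooled-attention classifier head in the spirit of Lee et al. [44] (dimensions and layer counts are assumptions, not the exact vivoBodySeg head):

```python
import torch
import torch.nn as nn

class PooledAttentionClassifier(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_classes=3):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, dim))   # single seed vector
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, n_classes),         # full / partial / empty scores
        )

    def forward(self, tokens):                 # tokens: (B, N, dim)
        seed = self.seed.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(seed, tokens, tokens)   # seed attends over tokens
        return self.head(pooled.squeeze(1))           # (B, n_classes)
```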
Unlike a standard ViT, which generates a tokenized version of the image through a linear embedding of voxels [45] or a secondary generative model such as a VQ-VAE [46], our U-Net encoder serves as the embedding mechanism. Upon reaching the bottleneck, the 4D tensor is rearranged into a tokenized format, \(X_{e}=\mathrm{Flatten}\left(F_{enc}\left(X\right)\right)\in\mathbb{R}^{B\times\left(\frac{H}{S}\cdot\frac{W}{S}\right)\times C}\), where \(S={2}^{Layers}\) and \(C=256\), and combined with learnable positional encodings. The data is then processed as a sequence by a small 4-layer ViT. Following the standard design introduced by Vaswani et al. for natural language processing, data is first normalized and routed to a multi-head self-attention block; the data is then normalized again and routed to a feed-forward network with the standard feature expansion, i.e., \(\left|C_{FFN}\right|=4\left|C_{MHSA}\right|\) [47]. Residual connections link the input and output of both the self-attention and feed-forward blocks. Our code and network configuration files for the vivoBodySeg framework are available for academic use upon request.
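The bottleneck wiring could be sketched as follows (an assumed PyTorch realization; the token count and head count are illustrative, while \(C=256\), 4 layers, and the 4× FFN expansion follow the text):

```python
import torch
import torch.nn as nn

class ViTBottleneck(nn.Module):
    def __init__(self, c=256, n_tokens=2048, n_heads=8, n_layers=4):
        super().__init__()
        # Learnable positional encodings; n_tokens is an assumed maximum
        # for H/S * W/S at the bottleneck resolution.
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, c))
        layer = nn.TransformerEncoderLayer(
            d_model=c, nhead=n_heads, dim_feedforward=4 * c,  # |C_FFN| = 4|C_MHSA|
            batch_first=True, norm_first=True)                # pre-norm blocks
        self.vit = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feat):                        # feat: (B, C, H/S, W/S)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)    # (B, H/S * W/S, C)
        tokens = self.vit(tokens + self.pos[:, : h * w])
        # Tokens feed the classifier head; the spatial form feeds the decoder.
        return tokens, tokens.transpose(1, 2).reshape(b, c, h, w)
```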
Network training, validation, and testing for C. elegans images
The training set includes 3,637 channels acquired with vivoChip-24x-3L devices from experiments covering different treatment conditions. In 81% of the channels, the entire C. elegans body is present within the channel (full worm); in 14%, a worm is only partially visible (partial worm); and 5% are empty channels (no worm). We split the data in an 8:1:1 ratio among the training, validation, and test sets, keeping the class distribution consistent across all sets.
We use horizontal and vertical flips, small rotations, and contrast adjustments to augment the dataset. Each training step randomly selects a mini-batch of 32 images for a single forward pass. We use the AdamW optimizer with an initial learning rate of \(2\times10^{-4}\) and weight decay of \(1\times10^{-2}\) [48]. During training, the learning rate follows cosine annealing with warm restarts (\(W_{0}=10\), \(F=2\)) over 1,200 total epochs [49]. We update the network according to our loss functions for segmentation and image classification; only full worms are routed to the decoder for segmentation learning and weight updates. All vivoBodySeg networks are trained on a computer with 128 GB of memory and an A6000 GPU with 48 GB of VRAM.
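A minimal sketch of this optimization setup (assumed PyTorch implementation; `model`, `train_loader`, `seg_loss`, and `cls_loss` are assumed names defined elsewhere):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-2)
# Cosine annealing with warm restarts: first cycle of 10 epochs (W0 = 10),
# each subsequent cycle twice as long (F = 2).
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)

for epoch in range(1200):
    for step, (images, masks, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        seg_logits, cls_logits = model(images)   # two-headed vivoBodySeg output
        # seg_loss is assumed to mask out non-full-worm examples, so only
        # full worms contribute to the segmentation gradient.
        loss = seg_loss(seg_logits, masks, labels) + cls_loss(cls_logits, labels)
        loss.backward()
        optimizer.step()
        scheduler.step(epoch + step / len(train_loader))
```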
Post-processing and inference to find C. elegans body parameters
During inference or testing, we only consider channels with full worms and apply a simple set of post-processing procedures to clean the data and extract the relevant endpoints. A threshold of 0.50 is applied to the network output to form a binary mask of the C. elegans body. Connected component analysis then identifies large binary objects as worm bodies (the model is trained to detect L4 up to adult stage worms) and removes all small objects outside this mask (laid eggs, small larvae, debris, etc.) present within the channel. The binary mask is used to estimate three body parameters: length, area, and volume. The body length is retrieved from the longest path through the skeleton of the binary mask, the area from the binary object itself, and the total volume from the known height of each pixel inside the predicted C. elegans mask.
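A minimal sketch of these post-processing steps (assumed implementation; the pixel size, size cutoff, and per-pixel height map are illustrative inputs):

```python
from itertools import combinations

import networkx as nx
import numpy as np
from skimage.measure import label, regionprops
from skimage.morphology import remove_small_objects, skeletonize

PIXEL_UM = 0.425       # assumed pixel size in microns
MIN_AREA_PX = 5000     # assumed size cutoff for eggs, small larvae, debris

def body_parameters(prob_map, height_map_um):
    mask = remove_small_objects(prob_map > 0.50, MIN_AREA_PX)
    lbl = label(mask)
    if lbl.max() == 0:
        return None
    worm = max(regionprops(lbl), key=lambda r: r.area)   # largest object
    body = lbl == worm.label

    # Length: longest path between endpoints of the skeleton graph.
    skel = skeletonize(body)
    pts = set(zip(*np.nonzero(skel)))
    g = nx.Graph()
    for y, x in pts:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy or dx) and (y + dy, x + dx) in pts:
                    g.add_edge((y, x), (y + dy, x + dx), weight=np.hypot(dy, dx))
    ends = [n for n in g if g.degree(n) == 1]
    length_px = max((nx.shortest_path_length(g, a, b, weight="weight")
                     for a, b in combinations(ends, 2)), default=0.0)

    return {"length_um": length_px * PIXEL_UM,
            "area_um2": worm.area * PIXEL_UM ** 2,
            "volume_um3": float(height_map_um[body].sum()) * PIXEL_UM ** 2}
```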
Evaluation metrics for model performance
When evaluating our network, we report several metrics to quantify overall segmentation quality. To understand general network performance, we use the Dice score to track validation progress and to quantify performance in the test setting. We use the Wilcoxon signed-rank test to assess whether Dice scores differ between models. We then use the post-processed data to further elucidate model accuracy by reporting the ratio of the predicted skeleton length to the ground truth skeleton length. We also estimate the volume ratio (predicted volume to ground truth volume) and the classification accuracy using the weighted F1 score. To compare model performance with human scorers, we calculate the Dice score between segmentations from multiple scientists and compare it with the Dice score between the predicted mask and the ground truth segmentation.
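For reference, a minimal sketch of the Dice computation and the paired model comparison (assumed implementation; the score arrays are illustrative names):

```python
import numpy as np
from scipy.stats import wilcoxon

def dice_score(pred, truth, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)

# Paired comparison of two models over the same test images:
# stat, p = wilcoxon(dice_scores_model_a, dice_scores_model_b)
```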
Autofluorescence analysis
We measure the autofluorescence signal within the predicted body mask using the fluorescence image captured with a GFP filter set. First, we create a maximum-intensity projection from all 10 z-slices. Using the control wells with 0.2% DMSO treatment, we determine the threshold intensity above which the brightest 5% of pixels lie. These pixels correspond to lysosomal granules in the worm gut, which are major contributors to increased autofluorescence under stress. For each worm, we calculate the average intensity of the pixels that are both above this threshold and within the predicted body mask. We use the average autofluorescence value per unit body length and per unit body area to identify dose-dependent responses to a chemical treatment.
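A minimal sketch of this autofluorescence endpoint (assumed implementation; function and variable names are illustrative):

```python
import numpy as np

def control_threshold(control_mips, control_masks):
    """Intensity above which the brightest 5% of control body pixels lie."""
    pixels = np.concatenate([mip[m] for mip, m in zip(control_mips, control_masks)])
    return np.percentile(pixels, 95)

def autofluorescence_per_length(gfp_zstack, body_mask, threshold, length_um):
    mip = gfp_zstack.max(axis=0)              # max-intensity projection (10 slices)
    bright = (mip > threshold) & body_mask    # bright gut-granule pixels in body
    if not bright.any():
        return 0.0
    return float(mip[bright].mean()) / length_um   # signal per unit body length
```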
Statistical analysis of dose-dependent body parameters
For developmental toxicity assays, body parameters and autofluorescence signals are calculated for each worm. Individual worm values are filtered with Tukey fences (1.5 × the interquartile range) to remove measurements whose body length or autofluorescence signal deviates substantially from the median; worms with measurements outside these fences are removed from the analysis. After filtering, we use the remaining worms to estimate the well average (µ) and standard deviation (σ) for body length, area, and volume. The data is presented as average ± standard error of the mean (SEM) from multiple replicates. We calculate the coefficient of variation (\(CV=\sigma/\mu\)) from all the control wells. The average values for each body parameter and autofluorescence signal are plotted against concentration, and the effective concentration producing a 10% change in each parameter (EC10) is fitted with a 4-parameter, variable-slope Hill function using the “Find ECanything” nonlinear fit function of GraphPad Prism. The bottom of the fit is constrained to zero for length, area, and volume, while the top is left unconstrained for the autofluorescence signal. The EC10 values are presented with ± 95% confidence intervals (CI) for each parameter. To calculate the lowest observable adverse effect level (LOAEL) for each phenotype, we test for normality (Shapiro-Wilk test) and identify the lowest dose at which the phenotype departs significantly (p < 0.05) from the control baseline, using Welch ANOVA with post hoc Dunnett’s T3 multiple comparison tests.
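A minimal sketch of the Tukey-fence filter and per-well summary statistics (assumed implementation):

```python
import numpy as np

def tukey_filter(values, k=1.5):
    """Keep values inside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]

def well_summary(values):
    kept = tukey_filter(values)
    mu, sigma = kept.mean(), kept.std(ddof=1)
    return {"mean": mu, "sd": sigma,
            "cv": sigma / mu,                  # coefficient of variation
            "sem": sigma / np.sqrt(kept.size)}
```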