We explored an end-to-end pixel-wise segmentation method that automatically labels each pixel as panicle, leaf or background under natural field conditions, and then calculated the leaf-to-panicle ratio (LPR) from the number of pixels assigned to each class in each field image. Figure 1 shows the overall workflow of this method, which consists of two parts. Part 1 is the offline training workflow, which builds a deep learning network called FPN-Mask to segment panicles and leaves from field RGB images. Part 2 is the development of a software system, GvCrop, for calculating LPR.
Experimental setup
In 2018, plots of ongoing field experiments at Danyang (31°54′31″N, 119°28′21″E), Jiangsu Province, China were selected for collecting images for the training dataset. Of note, these experiments were not specifically designed for a phenotyping study. In brief, the plant materials were highly diverse in genotypic variation, comprising seven main cultivars of Jiangsu and 195 mutants with contrasting agronomic traits, as reported by Abacar et al. [25]. Further, the seven cultivars were sown on two dates, so that a given genotype exhibited markedly different phenotypes. The diversity in plant architecture and canopy structure of the tested materials therefore provided a wide range of phenotypes for image analysis.
In 2019, three experiments were conducted to test and apply the proposed FPN-Mask model. (1) Genotypic variation in LPR. A total of 192 mutants were investigated. The plot area was 2.4 m × 1.4 m, with a row spacing of 30 cm and a plant spacing of 20 cm. Nitrogen, phosphate (P2O5) and potassium (K2O) fertilizers were applied at rates of 240, 120 and 192 kg ha-1, respectively, split equally between basal fertilizer (before transplanting) and topdressing (at the 4th leaf age in reverse order). (2) N fertilization effects on LPR. A japonica rice cultivar, Wuyunjing 30, was grown in a field experiment with a randomized complete-block design, three replications and a plot area of 2.4 m × 1.4 m. Total N fertilizer was 240 kg ha-1, applied under two N fertilization modes with different base/topdressing ratios: (1) N5-5, base/topdressing 5/5; (2) N10-0, base/topdressing 10/0. (3) Regulation of LPR by plant growth regulators. Solutions of 100 mM gibberellin, 100 mM uniconazole, 25 mM 2,4-epibrassinolide and 25 mM brassinolide, as well as a water control, were prepared in distilled water with 0.5% TWEEN-20. One cultivar from the N treatment, Ningjing 8, was used as the material. Spraying was conducted at a rate of 500 ml m-2 after sunset, three times at 2-day intervals, starting at the booting stage on August 22.
In addition, a dynamic canopy light interception simulating device (DCLISD) was designed to capture images from the sun’s position, with the camera installed on a supporting track. The bottom part consists of four pillars with wheels, and the upper part comprises two arches consolidated by two steel pipes and a movable rail for mounting the RGB camera (Fig. 2A). The sun’s trajectory is simulated by two angles, the elevation angle and the azimuth angle (Fig. 2B, C), which are calculated according to the latitude, longitude and growth periods at the experimental site.
Image acquisition
Images of the training dataset were captured in the 2018 field experiments and reflect large variations in camera shooting angle, solar elevation and azimuth angles, rice genotype, and phenological stage (Fig. 3). Images for validation and application of the proposed model were acquired in 2019. For the three treatments of genotype, N fertilization, and spraying, the tripod was set at an angle of 40°. The height of the camera (Canon EOS 750D, 24.2 megapixels) was 167.1 cm, the average height of a Chinese adult, and the distance between the central point of the target area and the vertical projection of the camera on the ground was 90 cm. The camera settings were as follows: focal length, 18 mm; aperture, f/11; ISO, automatic; exposure time, automatic. In the experiment with the DCLISD, the camera was a SONY DSC-QX100, with the following settings: focal length, 10 mm; aperture, automatic; ISO, automatic; exposure time, 1/200 s.
Dataset preparation
Training dataset: Considering the camera angle, solar angle, panicle type and growth stage (Fig. 3), we selected 360 representative images as the training dataset (Table 1), with GG (green panicle with green leaf), YG (yellow panicle with green leaf) and YY (yellow panicle with yellow leaf) comprising 113, 104 and 143 images, respectively. Fig. 1 (1)-(3) shows the preparation of the training data. Because the original field images are as large as 4864×3648 pixels, they were first cropped into patches with sizes between 150×150 and 600×600 pixels using Paint.Net. Second, each patch was labeled manually using Fluid Mask. Finally, 1896 representative patches were selected as the training sample set; 1210 of these were added incrementally during the daily model testing described below. Further, to increase the diversity of the training dataset and avoid overfitting, we applied basic data augmentation to the training set, including random horizontal/vertical flips, rotation by 90 degrees and histogram equalization, and random brightness enhancement was applied to reduce the effect of illumination, as sketched below. All inputs were resized to 256×256.
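A minimal sketch of this augmentation in Python, using OpenCV and NumPy (the original pipeline does not necessarily use these libraries), applies the listed transforms jointly to an image patch and its label mask; the brightness range and function names are illustrative assumptions.

```python
import cv2
import numpy as np

def augment(image, mask, rng=np.random.default_rng()):
    """Randomly flip/rotate a patch and its label mask together,
    then equalize the histogram and jitter brightness on the image only."""
    # Random horizontal / vertical flip (applied identically to image and mask)
    if rng.random() < 0.5:
        image, mask = cv2.flip(image, 1), cv2.flip(mask, 1)
    if rng.random() < 0.5:
        image, mask = cv2.flip(image, 0), cv2.flip(mask, 0)
    # Random 90-degree rotation
    if rng.random() < 0.5:
        image, mask = np.rot90(image).copy(), np.rot90(mask).copy()
    # Histogram equalization on the luminance channel
    ycrcb = cv2.cvtColor(image, cv2.COLOR_BGR2YCrCb)
    ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])
    image = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    # Random brightness enhancement to reduce illumination effects
    gain = rng.uniform(0.8, 1.2)  # assumed range, not specified in the text
    image = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    # Resize both to the 256x256 network input size
    image = cv2.resize(image, (256, 256), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)
    return image, mask
```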
Testing dataset: Based on the conditions of data acquisition, we divided all collected images into three groups according to growth stage. From each group, we randomly selected 30 images, giving 90 images in total as the testing dataset (Table 2). The captured field images include many other objects, such as tracks, chains, neighboring plots, the color chart and sky, which are not required in our approach. Therefore, a representative region of each plot was selected as the region of interest (ROI) and cropped manually for all selected testing images.
Network structure
In this study, we proposed a deep learning-based method for rice panicle segmentation, called FPN-Mask. The method consists of a backbone network and a task-specific subnetwork. The Feature Pyramid Network (FPN) [27] was selected as the backbone for extracting features over the entire input; originally designed for object detection, it has the advantage of extracting a multi-level feature pyramid from a single-scale input image. The subnetwork is adapted from the Unified Perceptual Parsing network [28] and performs semantic segmentation based on the output of the backbone network (Fig. 4).
Backbone network for feature extraction: The FPN [27] is a standard feature extractor with a top-down architecture and lateral connections. Its bottom-up pathway is based on residual networks (ResNet) [29], which consist of four stages, denoted C2, C3, C4 and C5.
We denote the last feature map of each ResNet stage as {C2, C3, C4, C5}. In our backbone network, we removed the max pooling layer before C2 because it discards semantic information; as a result, the down-sampling rates of the stages {C2, C3, C4, C5} change from {4, 8, 16, 32} to {1, 2, 4, 8}. The down-sampling rates of the feature maps output by the FPN, {P2, P3, P4, P5}, are likewise {1, 2, 4, 8}; that is, P2 has the same size as the original 256×256 image, P3 is 128×128, P4 is 64×64 and P5 is 32×32. Each stage outputs 32 feature maps.
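The following PyTorch sketch illustrates one way to obtain the stated strides and a 32-channel pyramid from a torchvision ResNet-18; it is an illustrative assumption, not the authors' exact implementation. In particular, to reach the stated {1, 2, 4, 8} rates the stem convolution stride is also set to 1 here, since removing only the max pooling layer would give {2, 4, 8, 16}.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FPNBackbone(nn.Module):
    """ResNet-18 bottom-up pathway with an FPN top-down pathway (sketch)."""
    def __init__(self, out_channels=32):
        super().__init__()
        r = torchvision.models.resnet18(weights=None)
        # Stem: stride-1 7x7 conv, no max pooling, so C2 keeps full resolution
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3, bias=False),
            r.bn1, r.relu)
        self.c2, self.c3, self.c4, self.c5 = r.layer1, r.layer2, r.layer3, r.layer4
        # Lateral 1x1 convs reduce each stage to the 32-channel pyramid width
        self.lat = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in (64, 128, 256, 512))
        # 3x3 convs smooth the merged maps
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in range(4))

    def forward(self, x):          # x: (B, 3, 256, 256)
        c2 = self.c2(self.stem(x))  # stride 1 -> 256x256
        c3 = self.c3(c2)            # stride 2 -> 128x128
        c4 = self.c4(c3)            # stride 4 -> 64x64
        c5 = self.c5(c4)            # stride 8 -> 32x32
        # Top-down pathway with lateral connections
        p5 = self.lat[3](c5)
        p4 = self.lat[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lat[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lat[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]
```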
Subnetwork for semantic segmentation: the subnetwork operates on the multi-level features extracted by the backbone described above. All feature levels are fused into a single input feature map for semantic segmentation, which has been shown to outperform using only the highest-resolution feature map [28, 30]. To up-sample the lower-resolution feature maps {P3, P4, P5} to the size of the original image, we directly adopt bilinear interpolation layers instead of time-consuming deconvolution layers, and attach a convolution layer after each interpolation layer to refine the interpolated result. After up-sampling, the different levels of features are concatenated into the final semantic feature. The concatenated multi-level feature is then passed through one convolution layer to refine the result and another to reduce the channel dimension (both convolution layers are followed by a batch normalization layer and a ReLU layer). Finally, we obtain a 3-channel semantic segmentation result; a sketch of this head is given below.
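A minimal PyTorch sketch of this fusion head, assuming the four 32-channel pyramid maps from the backbone sketched above; layer widths other than the 3 output channels are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class SegmentationHead(nn.Module):
    """Fuse P2-P5 into a 3-channel (panicle/leaf/background) prediction."""
    def __init__(self, channels=32, num_classes=3):
        super().__init__()
        # One refinement conv after each bilinear up-sampling of P3-P5
        self.refine = nn.ModuleList(conv_bn_relu(channels, channels) for _ in range(3))
        # Refine the concatenated feature, then reduce channels to class scores
        self.fuse = conv_bn_relu(4 * channels, 4 * channels)
        self.classify = nn.Sequential(
            conv_bn_relu(4 * channels, channels),
            nn.Conv2d(channels, num_classes, 1))

    def forward(self, pyramid):                 # pyramid = [p2, p3, p4, p5]
        p2 = pyramid[0]
        size = p2.shape[-2:]                    # 256x256 target size
        ups = [refine(F.interpolate(p, size=size, mode="bilinear",
                                    align_corners=False))
               for refine, p in zip(self.refine, pyramid[1:])]
        fused = torch.cat([p2, *ups], dim=1)    # concatenate all levels
        return self.classify(self.fuse(fused))  # (B, 3, 256, 256) logits
```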
Loss function for semantic segmentation
The cross-entropy loss is a standard loss function for classification [31]. In practice, because the numbers of pixels in the different classes are highly unbalanced, the loss computed with cross-entropy does not reflect the real situation [32]. We therefore use the focal loss, which was specifically designed to address class imbalance [32]; it focuses training on hard-to-classify locations by re-weighting the contributions of the different classes. For a detailed description, refer to [32]. A sketch of a multi-class focal loss is given below.
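A minimal multi-class focal loss sketch in PyTorch, following the formulation of [32]; the gamma value shown is the default from that paper and is an assumption here, since the section does not state the values used.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Multi-class focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits: (B, C, H, W) raw class scores; target: (B, H, W) integer labels.
    alpha: optional per-class weight tensor of shape (C,).
    """
    log_p = F.log_softmax(logits, dim=1)                      # log probabilities
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt                    # down-weight easy pixels
    if alpha is not None:                                     # optional class weighting
        loss = alpha.to(logits.device)[target] * loss
    return loss.mean()
```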
Training
We experimented with ResNet-18 as the FPN backbone. All convolution layers were initialized as in He et al. [33], and batch normalization layers were initialized with their default weight and bias values. The mini-batch size was 24, the Adam optimizer was used, and the model was trained for 7 days with a base learning rate of 1e-3. To improve the robustness of our model and avoid overfitting, we tested the model performance every day and iteratively added poorly performing samples, 40 samples per day. All experiments in this article were conducted on a high-performance computer with an Intel 3.50 GHz processor and 128 GB of memory. Two NVIDIA GeForce 1080 graphics processing units (GPUs) with 12 GB of memory were used to accelerate training of the model. A sketch of this training setup is shown below.
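A minimal sketch of the stated optimization setup (Adam, mini-batch size 24, base learning rate 1e-3), combined with the FPN-Mask pieces sketched above; the dummy tensors stand in for the labeled field patches and are purely illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# FPNBackbone, SegmentationHead and focal_loss refer to the sketches above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(FPNBackbone(), SegmentationHead()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # base learning rate 1e-3

# Dummy data standing in for the labeled training patches (illustration only)
images = torch.rand(48, 3, 256, 256)
masks = torch.randint(0, 3, (48, 256, 256))
loader = DataLoader(TensorDataset(images, masks), batch_size=24, shuffle=True)

num_epochs = 100  # placeholder; the paper trains by wall-clock time (~7 days)
for epoch in range(num_epochs):
    for batch_images, batch_masks in loader:
        batch_images, batch_masks = batch_images.to(device), batch_masks.to(device)
        loss = focal_loss(model(batch_images), batch_masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```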
During training, we tested the model performance on all collected images every day and selected images on which it did not perform well as supplementary training samples, so that the training samples covered all cases in the 6 GB of collected images. In total, 60 field images, yielding 302 patches, were added as supplementary training samples. Good or bad performance was judged by ourselves through visual observation. Training continued until the segmentation of all test images was visually satisfactory and the loss curve was smooth without fluctuations.
Post-processing
Although deep networks have a strong ability to handle semantic segmentation problems, 100% accuracy cannot be achieved by automatic segmentation alone. It is therefore necessary to provide a tool for manually correcting the segmentation results. To this end, we developed software called GvCrop, which integrates both the pixel-wise segmentation method (Fig. 1, (6)) and interactive correction of the segmentation results (Fig. 1, (7)). Because pixel-level relabeling of wrongly classified locations is time-consuming, we process image regions with homogeneous characteristics instead of single pixels, which greatly accelerates manual labeling (Fig. 1, (7)). Accordingly, based on the image's color space and boundary cues, we used the gSLICr algorithm [34] to group pixels into perceptually homogeneous regions; gSLICr is an implementation of Simple Linear Iterative Clustering (SLIC) [35] on the GPU using the NVIDIA CUDA framework and is about 83× faster than the CPU implementation of SLIC. gSLICr has three parameters: the super-pixel size S, the compactness coefficient C and the number of iterations N. In our work, S was set to 15, C to 0.2 and N to 50. After super-pixel segmentation, users can modify the automatic segmentation results super-pixel by super-pixel.
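For illustration, the following Python sketch reproduces the same super-pixel grouping with scikit-image's CPU implementation of SLIC rather than gSLICr; converting the super-pixel size S into the n_segments argument, and the mapping of the compactness value, are our own assumptions about how the parameters correspond.

```python
import numpy as np
from skimage.io import imread
from skimage.segmentation import slic

image = imread("field_patch.png")   # placeholder file name
S, C, N = 15, 0.2, 50               # parameters reported in the text
n_segments = (image.shape[0] * image.shape[1]) // (S * S)  # approx. SxS regions

# Label each pixel with the id of its perceptually homogeneous region
labels = slic(image, n_segments=n_segments, compactness=C,
              max_num_iter=N, start_label=0)

# Example correction step: reassign every pixel of one super-pixel at once,
# e.g. after a user clicks a wrongly classified region (class 1 = panicle here)
segmentation = np.zeros(labels.shape, dtype=np.uint8)  # stand-in for model output
clicked = labels[labels.shape[0] // 2, labels.shape[1] // 2]
segmentation[labels == clicked] = 1
```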
Accuracy assessment
To quantify the performance of our method, we chose pixel accuracy (P.A.) and mean IoU (mIoU) as the metrics for evaluating the semantic segmentation. These two metrics are standard for semantic segmentation tasks [28]. P.A. is the proportion of correctly classified pixels among all pixels, and mIoU is the intersection-over-union (IoU) between the ground truth and the predicted pixels, averaged over all classes (see Equations 1 and 2 in the Supplementary Files; their standard forms are reproduced below).
where n is the number of classes and p_ij is the number of pixels of class i predicted to belong to class j; thus, for class i, p_ii counts the true positives, p_ij (j ≠ i) the false negatives, and p_ji (j ≠ i) the false positives.
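For completeness, the standard forms of these two metrics, written with the notation defined above (a reconstruction of Supplementary Equations 1 and 2 from their usual definitions), are:

\mathrm{P.A.} = \frac{\sum_{i=1}^{n} p_{ii}}{\sum_{i=1}^{n}\sum_{j=1}^{n} p_{ij}}   (1)

\mathrm{mIoU} = \frac{1}{n}\sum_{i=1}^{n} \frac{p_{ii}}{\sum_{j=1}^{n} p_{ij} + \sum_{j=1}^{n} p_{ji} - p_{ii}}   (2)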
Calculation of leaf-panicle ratio (LPR)
Based on the extraction and identification of leaves and panicles, the software GvCrop was developed to calculate LPR from the number of pixels contained in the leaf and panicle regions of an image. The formula for LPR is LPR = L / P, where L and P are the total numbers of pixels classified as leaf and panicle in the picture, respectively. A minimal sketch of this calculation is shown below.
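A minimal Python sketch of this ratio, assuming a segmentation mask in which the integer labels for leaf and panicle are 2 and 1 respectively (the actual label encoding used by GvCrop is not stated in the text):

```python
import numpy as np

def leaf_panicle_ratio(mask, leaf_label=2, panicle_label=1):
    """Compute LPR = L / P from a per-pixel class mask of shape (H, W)."""
    L = int(np.count_nonzero(mask == leaf_label))     # leaf pixel count
    P = int(np.count_nonzero(mask == panicle_label))  # panicle pixel count
    if P == 0:
        raise ValueError("No panicle pixels found; LPR is undefined.")
    return L / P
```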