Delineating organs at risk (OARs) in computed tomography (CT) images is an essential step for optimizing and quantitatively evaluating a radiation treatment plan[1–3]. Manual delineation remains the main option in clinical practice, but it is time-consuming. Although computer-assisted approaches (e.g. atlas-based algorithms[4–7]) help lessen this burden, they cannot guarantee segmentation accuracy and reproducibility[8–10].
In recent years, with the development of convolutional neural networks (CNNs) in the field of image analysis, more and more networks[11–17] have appeared to segment OARs in CT images automatically, with good results. Feng X et al.[12] used three-dimensional (3D) Unets to first locate thoracic OARs and then segment them. They reported mean Dice coefficients of 0.89, 0.97 and 0.93 and average 95% Hausdorff distances of 1.89 mm, 4.00 mm and 2.10 mm for the spinal cord, lung and heart, respectively. Tao He et al.[11] proposed a U-like network trained under a multi-task learning scheme: the major task was segmentation, and the auxiliary task was global slice classification, under the hypothesis that OARs appear in similar slice orders for most patients. This U-like network reached a heart Dice of 0.95. In addition, a 5-channel CNN fed with multiple images highlighting different tissues achieved average heart, spinal cord and lung Dice of 0.91, 0.76 and 0.95[16]. A Unet-GAN[17] attained mean Dice of 0.85, 0.96 ~ 0.97 and 0.88 for the heart, lung and spinal cord, respectively.
Among the above networks, Unet[18] is a classic architecture with good image segmentation performance. When we trained a two-dimensional (2D) Unet to delineate thoracic OARs, a few cases of mis-identification appeared, as shown in Fig. 1: some pixels were wrongly classified as OARs. This phenomenon may be caused by the CNN's working principle. A CNN treats each pixel's classification as a separate task, based only on the gray-level distribution of a small image region (i.e. the receptive field). No prior knowledge, such as organ shape or the identity of neighboring organs, is involved in the classification[19]. Therefore, the CNN may produce a wrong classification when different organs within a receptive field show similar gray values. As shown in Fig. 2, the lung pixels in the heart pixel's receptive field (rheart) and the air pixels outside the body in the arm pixel's receptive field (rarm) both exhibit a grayscale of 0. Moreover, the heart pixels and the arm pixels show similar image intensities. Without knowing that the 0 grayscales in rheart and in rarm belong to the lung and to the air outside the body respectively, the CNN is highly likely to misclassify pixel B as heart.
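The receptive field mentioned above grows with network depth. As a rough illustration (our own sketch, not the configuration used in this work), the theoretical receptive field of a stack of convolution and pooling layers can be computed from each layer's kernel size and stride:

```python
# Sketch: theoretical receptive-field size of a stack of layers.
# Each layer is given as (kernel_size, stride); the values below are
# illustrative and do not reproduce the exact Unet used in this paper.
def receptive_field(layers):
    rf = 1    # receptive field of a single output pixel
    jump = 1  # cumulative stride (spacing between adjacent outputs)
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two 3x3 convs followed by 2x2 max-pooling, repeated twice
# (a Unet-like encoder fragment)
encoder = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_field(encoder))  # 16
```

Even a shallow encoder like this one sees only a small patch of the slice, which is why pixels in visually similar but anatomically different regions can receive identical local evidence.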
To decrease the above mis-identification, a possible solution is to label different organs with different numbers. Given that the commonly used rectified linear unit (ReLU)[20] activation has no upper bound on its output and may therefore map an image intensity to any positive number, we cannot use a preassigned number to represent an organ. Therefore, we built a multi-output (MO) network whose different outputs correspond to different organs. In this way, the network can learn an optimal number to represent each organ. The MO design is also known as multi-task[21, 22] (MT) or multi-label[12, 23] (ML) learning. Such designs have been reported in several papers[12, 21, 23], but most focused on the overlap between model results and ground truth and paid less attention to their performance in reducing mis-identification.
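The two points above can be made concrete with a minimal NumPy sketch (ours, not the paper's implementation): ReLU is unbounded above, so a single output channel trained to regress preassigned organ numbers has no guarantee of landing on those values, whereas a multi-output head emits one channel per organ and thresholds each channel into its own binary mask:

```python
import numpy as np

# ReLU has no upper limit on its output, so any preassigned organ
# number (e.g. lung = 1, heart = 2) can be overshot arbitrarily.
def relu(x):
    return np.maximum(0.0, x)

# A multi-output (MO) head: one channel per organ. Each channel is
# squashed to (0, 1) with a sigmoid and thresholded into a binary mask,
# so the organ identity is carried by the channel index, not by a
# learned intensity value.
def mo_head(logits):  # logits: (n_organs, H, W)
    prob = 1.0 / (1.0 + np.exp(-logits))  # per-channel sigmoid
    return prob > 0.5                     # one binary mask per organ

logits = np.random.randn(3, 4, 4)  # 3 organs: lung, heart, spinal cord
masks = mo_head(logits)
print(masks.shape)  # (3, 4, 4)
```

Under this scheme the network never has to reproduce a specific number; it only has to push each organ's own channel above or below the threshold.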
To verify the above hypothesis, we modified a classic 2D Unet into a multi-output one (abbreviated as MO-Unet) and trained it to segment three thoracic OARs (lung, heart and spinal cord). We then compared its performance with that of a single-output 2D Unet (abbreviated as SO-Unet). The rest of this paper is organized as follows. The results are shown and discussed in Sect. 2 and 3, respectively. Our conclusion is presented in Sect. 4. Section 5 gives the detailed architecture of MO-Unet and introduces our experiments.