This section describes how the experimental data were generated and how each model was constructed and trained. We also explain the tests conducted to assess the performance of the models and how the results were verified.
4.1 Problem Definition for Case Study Area
The case study organisation for this research is an aircraft manufacturer, and the studied processes are drawn from the wing assembly stage. The current installation process for components on the inside of aircraft wings requires workers to place and secure them in defined positions, as specified by the wing frame suppliers. A component of particular difficulty for the case study organisation's engineers is the wing bracket, which secures electrical and hydraulic lines (Fig 5). The engineer must adhere the bracket to the wing, within the marked boundary, with only limited visibility. A correct application requires that the bracket sit within a 7mm tolerance of the marked boundary and that adhesive flow around the whole circumference of the bracket base. Fig 5 shows these conditions for installation.
In the case study organisation’s drive to automate the verification of aircraft wing assembly, the installation of these brackets will be a key task. The automatic verification of the bracket installation process will have to respect the criteria that the human operators adhere to. Therefore, the proposed solution must be able to determine whether the 7mm tolerance is maintained around the edge of the bracket and whether there is complete adhesion around the whole circumference of the base. If both criteria are met, the bracket can be considered installed correctly. If either criterion is violated, the solution will have to flag the bracket for a second inspection by a human operator, followed by a corrective procedure if necessary.
4.2 Component Representation and Data Generation
A decision was taken to generate a custom dataset for this project because of the lack of open-source datasets containing the brackets used in aircraft wing installation. A suitable mock setup of the bracket installation process was developed: the bracket was represented by a replica built from LEGO® bricks, and the adhesive applied to the bracket base was represented by modelling compound. Modelling compound was chosen over a true adhesive for its ease of removal and reapplication. A solid blank background with a slight shine and texture was used to represent the wing surface, and a suitable boundary marking was applied to locate the placement of the bracket (see Fig 5).
Experimental data was generated by taking images of the bracket installed both correctly and incorrectly. Consideration was given to the subtleties in each of these cases, to identify the sub-cases that might exist within both and to ensure that data was gathered on as many variations of the installation as possible, thereby supporting generalisation of the trained model.
For the correct installation case, this involved all those installations which met the criteria as discussed previously. Special attention was paid to the edge cases, where the bracket was just within or outside the tolerance range from the marking. It was expected that these would be the most difficult for the models to distinguish between correct and incorrect installation, so sufficient images of these were obtained to support any conclusions drawn from the results. In the incorrect installation case, there were several sub-cases identified that the model would have to learn as belonging to the incorrect distribution of images. These were as follows: (1) Inside guide, no adhesion; (2) Outside guide, correct adhesion; (3) Outside guide, incorrect adhesion in one location; (4) Outside guide, incorrect adhesion in many locations; (5) Outside guide, no adhesion (Fig 6).
Images were obtained for each of these sub-cases, again with attention given to including the edge cases. By ensuring that all these cases were covered in the data generation, the risk of inherent bias in the dataset was mitigated; hence, any results obtained in testing would be representative of what would be expected given a true dataset. The dataset was validated by an aerospace expert as representative of what would be expected on a true production line. The images taken were RGB images of shape (256, 256, 3). For training, there were 70 images of correctly installed brackets and 70 images of incorrectly installed brackets, totalling 140 training images.
To replicate a low-data scenario, the 70 images of incorrectly installed brackets comprised 10 images of each of the 7 sub-classes identified previously. For validation, there were 100 images of correctly installed brackets and 100 images of incorrectly installed brackets, totalling 200 validation images. Furthermore, a framework was created to procedurally load the generated images and perform random augmentations. This included the affine transforms of translation, rotation, scaling, shear, and horizontal and vertical flipping, and the colour-space transform of scaling image brightness. The images were also normalised to aid the gradient descent algorithm in minimising the loss function.
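As an illustration, the sketch below shows one way such a loading and augmentation framework could be set up with the Keras ImageDataGenerator; the directory layout (data/train with one sub-folder per class) and the parameter ranges are assumptions made for illustration, not the exact values used in this study.

```python
# Illustrative sketch only: directory layout and augmentation ranges are assumed.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # normalise pixel values to aid gradient descent
    rotation_range=15,            # affine transform: rotation (degrees)
    width_shift_range=0.1,        # affine transform: horizontal translation
    height_shift_range=0.1,       # affine transform: vertical translation
    zoom_range=0.1,               # affine transform: scaling
    shear_range=0.1,              # affine transform: shear
    horizontal_flip=True,         # affine transform: horizontal flip
    vertical_flip=True,           # affine transform: vertical flip
    brightness_range=(0.8, 1.2),  # colour-space transform: brightness scaling
)

train_generator = train_datagen.flow_from_directory(
    "data/train",                 # hypothetical path: one sub-folder per class
    target_size=(256, 256),
    batch_size=16,
    class_mode="binary",          # correct vs incorrect installation
)
```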
4.3 Baseline CNN Model
A baseline CNN architecture was implemented (Fig 7) and trained as a binary classifier. Each convolutional layer was followed by batch normalisation, a ReLU activation function, 30% dropout, and a Keras max-pooling layer with default settings. Optimisation of the model was achieved using the Adam algorithm with a learning rate of 0.0001 and a binary cross-entropy loss function. Training was performed over 25 epochs with a batch size of 16. Images of either correctly installed or incorrectly installed brackets were passed to the network during training, along with their class labels. The model predicted the probability of the image belonging to the positive class, which in this case was an image showing an incorrectly installed bracket. The default threshold of 0.5 was applied to determine the binary class prediction.
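A minimal sketch of a baseline CNN of this kind is given below, using tf.keras. The number of convolutional blocks and the filter counts are assumptions made for illustration (the actual architecture is given in Fig 7); the per-block structure and training options follow the description above.

```python
# Sketch of the baseline binary classifier; block/filter counts are assumed.
from tensorflow.keras import layers, models, optimizers

def conv_block(x, filters):
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)   # batch normalisation after each conv layer
    x = layers.ReLU()(x)                 # ReLU activation
    x = layers.Dropout(0.3)(x)           # 30% dropout
    return layers.MaxPooling2D()(x)      # Keras max pooling, default settings

inputs = layers.Input(shape=(256, 256, 3))
x = conv_block(inputs, 32)
x = conv_block(x, 64)
x = conv_block(x, 128)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # P(incorrectly installed)

baseline_cnn = models.Model(inputs, outputs)
baseline_cnn.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# baseline_cnn.fit(train_generator, epochs=25)  # batch size of 16 set in the generator
```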
4.4 SNN Model
The SNN architecture was implemented with two copies of the baseline CNN (Fig 7) as the twin networks. The dense feature layer of each twin was also followed by 30% dropout; however, no dropout was added to the L1 distance layer, since this layer has no learnable parameters and hence no risk of overfitting. Optimisation of the model was achieved using the same parameters as for the baseline CNN model.
The SNN was trained by passing pairs of images to the twin CNNs (see Section 4.6 for further detail on the selection of this pair), along with a label indicating whether or not they were of the same class. The identical CNNs with shared weights would then perform the same feature extraction on the images, producing feature vectors to be compared in the L1 distance layer. The distance between the feature vectors would be calculated as in Equation 1, where p and q are the two n-dimensional feature vectors. This distance would then be converted to a similarity score at the output layer by passing it through a sigmoid function, where a value closer to 1 denoted a higher confidence that the images were of the same class. A default threshold of 0.5 was used to determine whether the final decision was that the images were of the same class or not.
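The sketch below illustrates one common way of wiring the twin networks, the L1 distance layer, and the sigmoid output in tf.keras; the feature extractor (block and layer sizes) is assumed for illustration rather than taken from Fig 7.

```python
# Sketch of the SNN: shared twin, element-wise L1 distance, sigmoid similarity score.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def build_twin():
    # Assumed feature extractor; in the study this is the baseline CNN of Fig 7.
    inp = layers.Input(shape=(256, 256, 3))
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Dropout(0.3)(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)      # dense feature layer (size assumed)
    x = layers.Dropout(0.3)(x)    # 30% dropout on the dense feature layer
    return models.Model(inp, x)

twin = build_twin()               # one instance => shared weights for both inputs
input_a = layers.Input(shape=(256, 256, 3))
input_b = layers.Input(shape=(256, 256, 3))
feat_a, feat_b = twin(input_a), twin(input_b)

# L1 distance (Equation 1): element-wise |p - q|; no learnable parameters, no dropout.
l1_distance = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([feat_a, feat_b])
similarity = layers.Dense(1, activation="sigmoid")(l1_distance)  # similarity score

snn = models.Model([input_a, input_b], similarity)
snn.compile(optimizer=optimizers.Adam(learning_rate=1e-4), loss="binary_crossentropy")
```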
4.5 Model Training with Transfer Learning
The VGG16 model was selected for use in this study due to its proven ability to work well in transfer learning. The VGG16 architecture (shown in Fig 8) is simple when compared to other state-of-the-art CNNs, such as ResNet [22] and Inception [23], which enabled us to focus on evaluating the overarching SNN architecture rather than the specifics of the underlying feature extractor. The input layer of the VGG16 model was replaced with a new input layer suitable for the images in the bracket installation dataset. The earlier layers of the VGG16 model were then kept fixed throughout training (i.e., given a learning rate of 0), but the last block of convolutional layers was left trainable. This would allow specific high-level features of the brackets and adhesive to be learned. Finally, a randomly initialised dense layer of 128 units was added to the end of the VGG16 model to create the feature vector, and the same output layer as the baseline CNN (Section 4.3) was added. The dense layer after the VGG16 model had a dropout of 50%, and the model was optimised using Adam with a learning rate of 0.0001 and a binary cross-entropy loss function. The model was trained over 25 epochs with a batch size of 16.

For the SNN, each of the twin CNNs was replaced with the VGG16 model, with the input layer of each modified as above. The outputs of each VGG16 model were then passed to separate dense layers of 1024 units with 30% dropout to create the feature vectors. The remainder of the model stayed the same as for the base SNN, and the model was optimised using the same optimiser and training options as before.
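A hedged sketch of the transfer-learning setup for the single-CNN case is given below, using the VGG16 model from tf.keras.applications with ImageNet weights; the pooling choice and the exact mechanism used to freeze layers are assumptions consistent with the description above.

```python
# Sketch of the VGG16 transfer-learning classifier; freezing mechanism assumed.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

# New input shape matching the bracket installation images.
base = VGG16(weights="imagenet", include_top=False, input_shape=(256, 256, 3))

# Freeze the earlier layers; leave the last convolutional block (block5) trainable
# so that specific high-level features of the brackets and adhesive can be learned.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

x = layers.Flatten()(base.output)
x = layers.Dense(128, activation="relu")(x)  # randomly initialised feature layer
x = layers.Dropout(0.5)(x)                   # 50% dropout after the dense layer
outputs = layers.Dense(1, activation="sigmoid")(x)

vgg_classifier = models.Model(base.input, outputs)
vgg_classifier.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# Trained for 25 epochs with a batch size of 16; for the SNN variant the same base
# would feed a 1024-unit dense layer with 30% dropout in each twin.
```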
4.6 SNN Input Pair Consideration
The two images selected at each training step for an SNN are typically randomly sampled from the available classes. In this study, the image pairs could be one of: (correctly installed, correctly installed), (correctly installed, incorrectly installed), (incorrectly installed, correctly installed), or (incorrectly installed, incorrectly installed). Initially, this default method of selecting the images was used.
However, following on from our experimental results, the method of passing in the data was revisited. As discussed previously, the incorrectly installed bracket case had several sub-cases, each of which could differ significantly from the others. It was therefore hypothesised that attempting to label images from different incorrect sub-cases as being similar would cause optimisation problems for the SNN. An additional test was therefore conducted to investigate the performance impact of the input image pair choice. For this, one of the input images was always a reference image of a correctly installed bracket, and it was always presented to the same input of the network. The other input image could be of either a correctly installed bracket or any sub-case of incorrectly installed bracket. This method of passing the input images is referred to as the “custom input image pair” method. It was justified by the way in which the SNN would be used in deployment: one image would be a reference image of a correctly installed bracket and the second would be of a newly installed bracket, and the model would then verify whether the bracket had been installed correctly by comparing the images.
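The difference between the default pairing and the custom input image pair can be illustrated with the short sketch below; the function and variable names are hypothetical, and the image lists are assumed to be pre-loaded arrays.

```python
# Hypothetical pair-sampling helpers illustrating the two strategies described above.
import random

def default_pair(correct_imgs, incorrect_imgs):
    """Default pairing: both inputs drawn at random from either class."""
    pool = [(img, 0) for img in correct_imgs] + [(img, 1) for img in incorrect_imgs]
    (img_a, cls_a), (img_b, cls_b) = random.sample(pool, 2)
    return (img_a, img_b), int(cls_a == cls_b)   # label 1 if same class

def custom_pair(correct_imgs, incorrect_imgs):
    """Custom input image pair: the first input is always a correctly installed
    reference image; the second may show a correct or incorrect installation."""
    reference = random.choice(correct_imgs)
    if random.random() < 0.5:
        other, same_class = random.choice(correct_imgs), 1
    else:
        other, same_class = random.choice(incorrect_imgs), 0
    return (reference, other), same_class
```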
4.7 Edge Case Consideration
The edge case images were of particular interest in this study, as it was expected that these would be where most misclassifications would occur. The edge cases are those where the bracket is placed just within or just outside the tolerance of the marked boundary, with correct adhesion. All the edge case images in the original dataset were moved into a separate dataset, and additional images were then captured to create a suitably sized dataset specifically for the edge cases. The new edge case dataset had 50 images each of correctly and incorrectly installed brackets for training, giving 100 training images. For validation, there were 40 images of each case, giving 80 images in total.
4.8 SNN Similarity Voting
Ensemble methods involve training multiple models on subsets of the training data and then voting on predictions for the same test data. When verifying whether a new test image is the same as a known reference image, the similarity score will depend on the reference image used. This means that the resulting decision of whether the images are the same or not could change depending on which reference image is used and how similar it is to the test image provided. When deployed on aircraft assembly lines, verification will be done by providing the SNN with a reference image of a correctly installed bracket together with an image taken of the newly installed bracket. It is expected that, following training of the model, if the test image and a similar reference image were passed through the SNN, they would be identified as being similar, and the newly installed bracket would be verified as having been installed correctly.
If, however, an image of an incorrectly installed bracket were used as the reference image, the similarity score generated may be more uncertain and lie close to the threshold value of 0.5, and the bracket may then be identified as being installed incorrectly. Furthermore, using bagged model voting as a reference point, the hypothesis here is that the dependency of the verification result on the reference image used would be reduced if the test image were compared to multiple reference images. Given the limited distribution of possible reference images, the test image should be similar to more reference images than not; hence, when compared to enough reference images, the majority of the outcomes would be expected to indicate that the images are similar, and a majority voting rule would then produce the correct result. Fig 9 shows the method used at testing to implement this voting scheme. Compared to the typical bagging method, multiple test input pairs are evaluated on a single model and a vote taken on the final outcome, rather than evaluating multiple models on a single test input.
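A minimal sketch of this voting scheme at test time is given below, assuming a trained two-input SNN as in the earlier sketches; the function name and the set of reference images are hypothetical.

```python
# Hypothetical majority-voting verification over multiple reference images (Fig 9).
import numpy as np

def verify_installation(snn, test_image, reference_images, threshold=0.5):
    """Return True if the majority of reference comparisons judge the test image
    to be similar to a correctly installed bracket."""
    votes = []
    for ref in reference_images:
        pair = [ref[np.newaxis], test_image[np.newaxis]]   # batch dimension of 1
        score = snn.predict(pair, verbose=0)[0, 0]          # similarity score
        votes.append(score >= threshold)                    # similar => vote 'correct'
    return sum(votes) > len(votes) / 2                       # majority voting rule
```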
4.9 Testing and Evaluation
To evaluate how the changes made improved or hindered performance, suitable metrics were required to assess the models. Though the SNN performed image verification rather than classification, the typical classification performance metrics still applied; in particular, the accuracy, precision, and recall metrics were selected. The accuracy metric was selected as an initial indicator of model performance, which allowed the training progress to be monitored and simple comparisons to be made between models. Accuracy by itself, however, was not a sufficient indicator of performance, and so the precision and recall metrics were also observed.
The chosen metrics enabled quantitative assessment of model performance and also carried important industrial implications. The precision of any proposed model indicates what proportion of brackets identified as being incorrectly installed actually were incorrectly installed; from another perspective, it can be inferred from this value how often a bracket flagged as incorrectly installed was in fact correctly installed.
This quantity, known as the False Discovery Rate (FDR), is calculated directly from the precision metric. Aerospace companies would be interested in these two metrics, as they indicate how often a human operator would have to give a second opinion on the automated check. If the precision was too low, or equivalently the FDR was too high, then the solution would be returning many false positives (FPs) and would not be cost effective, as human operators would still be a common necessity in the procedure. Perhaps the most important metric in a safety-critical sector like Aerospace is the recall. Practically, this metric shows what proportion of incorrectly installed brackets are identified as such. From a different perspective, the number of incorrectly installed brackets falsely identified as being correctly installed can be evaluated by another metric, the False Negative Rate (FNR). A low recall or high FNR would indicate that many incorrectly installed brackets were passing through undetected, which could have severe consequences should aircraft be put into operation with these on board.
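For reference, with the incorrectly installed class taken as the positive class, these quantities follow the standard definitions

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{FDR} = \frac{FP}{TP + FP} = 1 - \text{Precision},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{FNR} = \frac{FN}{TP + FN} = 1 - \text{Recall},$$

where TP, FP, and FN denote true positives, false positives, and false negatives respectively.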
4.10 Results Verification on a Comparable Dataset
In order to give confidence in the results obtained using the custom dataset, the decision was made to repeat the experiments on a well-known baseline dataset, so that it could be seen whether the same trends were present in the results. This was an important additional test for this study, where the data used was created specifically for the task: it would provide evidence that the previous results observed on the hand-crafted dataset were not due to any bias or over-simplicity in the data. The Omniglot dataset (see Fig 10) was selected for this, as it is a popular benchmark for SNNs [12].
Modifications to the dataset were necessary to formulate the problem in the same way as the bracket verification task. To make the dataset sufficiently similar to the bracket data, two alphabets that were similar in nature to each other were chosen, namely the Latin and Greek alphabets. This maintained the similarity that was present between the correctly and incorrectly installed brackets. Examples of similar and dissimilar images for this task are shown in Fig 10, where in this case similarity refers to whether or not the characters come from the same alphabet. With the new dataset, the task was to perform verification at the alphabet level: each alphabet was analogous to the correctly or incorrectly installed bracket class, and each character was analogous to one of the different sub-classes of bracket installation described previously.
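To illustrate the analogy, a short hypothetical sketch of how such alphabet-level pairs could be labelled is given below; the loading of the Omniglot character images is assumed to have been done elsewhere, and the function name is illustrative.

```python
# Hypothetical labelling of Omniglot pairs at the alphabet level (Latin vs Greek).
import random

def make_alphabet_pair(latin_imgs, greek_imgs):
    """Sample an image pair and label it 1 if both characters come from the same
    alphabet (analogous to 'same installation class'), else 0."""
    pools = {"Latin": latin_imgs, "Greek": greek_imgs}
    alpha_a = random.choice(list(pools))
    alpha_b = random.choice(list(pools))
    img_a = random.choice(pools[alpha_a])
    img_b = random.choice(pools[alpha_b])
    return (img_a, img_b), int(alpha_a == alpha_b)
```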