In order to test the ability of EFDs to discriminate among the haplogroups of T. dimidiata, we used the images obtained by Gurgel-Goncalves et al. [30], which are available in the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.br14k. Originally, 44, 30 and 40 images of haplogroups 1, 2 and 3 were obtained, respectively, with which the automated identification process tested by Gurgel-Goncalves et al. [30] was performed. For this study, only images that had the necessary characteristics to perform the contour analysis were selected, that is, only images with an unmodified contour and wings that were not broken or overlapped. This filtering process resulted in a total sample of 37, 23 and 36 images for haplogroups 1, 2 and 3, respectively. The conditions under which the photographs were taken and more information about the samples are detailed in Gurgel-Goncalves et al. [30].
The images were preprocessed in Adobe Photoshop CS5. The legs and antennae were removed from each image to leave only the body contour. The brightness and contrast values were adjusted to their minimum and maximum values, respectively, to leave only a binary image (Fig. 1). All images were saved as bitmaps (BMP) in 24-bit RGB format.
SHAPE 1.3 software [41], designed to evaluate the contour shape based on Elliptical Fourier Transform, was then used to quantify the body contours. The mathematical description of contour extraction based on EFDs can be found in Iwata et al. [42].
SHAPE has four subprograms: ChainCoder, Chc2Nef, PrinComp, and PrinPrint, which together facilitate the processing of digital images, acquisition of the chain code and Fourier coefficients, and Principal Component Analysis. It also includes routines for the visualization of the shape from previously digitized data (ChcViewer and NefViewer).
The chain code is a coding system that describes the spatial information of a contour with digits from 0 to 7 [43]. Each digit indicates the direction of the next step around the outline: 0 = one step to the right, 2 = one step up, 4 = one step to the left, 6 = one step down, and the odd digits are the intermediate (diagonal) directions. To obtain this code, the ChainCoder subprogram was run on each haplogroup image. This subprogram reads the BMP images, converts them to grayscale, binarizes them using a threshold value selected from the image histogram, removes any noise in the images using erosion-dilation filters and obtains the chain code, which is saved in an ASCII file with the extension .chc.
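The coding scheme can be illustrated with a minimal Python sketch. It is not part of SHAPE or ChainCoder; the function name `decode_chain`, the toy outline, and the assumption that the odd digits run counter-clockwise through the diagonals (1 = up-right, 3 = up-left, 5 = down-left, 7 = down-right) are ours.

```python
# Freeman chain code: digit -> (dx, dy), with y increasing upward.
# The diagonal assignments (odd digits) are an assumption of this sketch.
STEPS = {
    0: (1, 0),    # right
    1: (1, 1),    # up-right
    2: (0, 1),    # up
    3: (-1, 1),   # up-left
    4: (-1, 0),   # left
    5: (-1, -1),  # down-left
    6: (0, -1),   # down
    7: (1, -1),   # down-right
}


def decode_chain(code, start=(0, 0)):
    """Return the list of (x, y) points visited by a chain code."""
    x, y = start
    points = [(x, y)]
    for digit in code:
        dx, dy = STEPS[digit]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


# Hypothetical toy outline: a 2 x 2 square traced counter-clockwise.
square = [0, 0, 2, 2, 4, 4, 6, 6]
outline = decode_chain(square)
is_closed = outline[0] == outline[-1]  # a valid contour returns to its start
```

A closed body contour always ends on the pixel where it started, which is a quick sanity check for a chain code file before computing descriptors from it.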
Once the chain code file had been generated for each image, the Chc2Nef program [32] was used to calculate the Fourier coefficients for 5, 10, 15, 20, 25 and 30 harmonics in turn, with normalization based on the ellipse of the first harmonic. The Fourier coefficients were stored in ASCII files with the extension .nef, which were used for the subsequent multivariate analysis.
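As an illustration of how four coefficients per harmonic arise from a chain code, the following Python sketch applies the standard elliptic Fourier formulas for a closed polygonal contour. It is not the Chc2Nef implementation: the function name `elliptic_fourier` and the toy square are assumptions, the diagonal step directions are assumed as counter-clockwise, and the normalization on the first-harmonic ellipse is omitted.

```python
import math

# Freeman chain code steps (dx, dy); diagonal directions are assumed.
STEPS = {0: (1, 0), 1: (1, 1), 2: (0, 1), 3: (-1, 1),
         4: (-1, 0), 5: (-1, -1), 6: (0, -1), 7: (1, -1)}


def elliptic_fourier(code, n_harmonics):
    """Return [(a_n, b_n, c_n, d_n), ...] for n = 1..n_harmonics.

    Coefficients are not normalized for size, rotation or starting point.
    """
    deltas = [STEPS[digit] for digit in code]
    lengths = [math.hypot(dx, dy) for dx, dy in deltas]
    perimeter = sum(lengths)

    # Cumulative arc length at the end of each step, starting at 0.
    t = [0.0]
    for dt in lengths:
        t.append(t[-1] + dt)

    coeffs = []
    for n in range(1, n_harmonics + 1):
        k = 2.0 * n * math.pi / perimeter
        scale = perimeter / (2.0 * n * n * math.pi ** 2)
        a = b = c = d = 0.0
        for p, ((dx, dy), dt) in enumerate(zip(deltas, lengths)):
            dcos = math.cos(k * t[p + 1]) - math.cos(k * t[p])
            dsin = math.sin(k * t[p + 1]) - math.sin(k * t[p])
            a += dx / dt * dcos
            b += dx / dt * dsin
            c += dy / dt * dcos
            d += dy / dt * dsin
        coeffs.append((scale * a, scale * b, scale * c, scale * d))
    return coeffs


# Hypothetical toy outline: a 2 x 2 square traced counter-clockwise.
square = [0, 0, 2, 2, 4, 4, 6, 6]
coeffs = elliptic_fourier(square, 5)  # 5 harmonics -> 20 coefficients
```

Because the toy square is centrally symmetric, its even harmonics vanish, which gives a simple check that the sums are being accumulated correctly.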
Given that a large number of variables is produced (four coefficients for each harmonic), a Principal Component Analysis (PCA) was performed using the variance-covariance matrices to reduce the dimensionality and create new derived variables that can be analyzed statistically. This was done using the PrinComp module, as proposed by Rohlf and Archie [34], and component scores were used as shape variables. These derived variables contain all of the information for each haplogroup body shape, as demonstrated by the fact that the contours can be graphically reconstructed from these values using an inverse Fourier transform in the PrinPrint module, according to the procedure of Furuta et al. [35].
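The dimensionality reduction step can be sketched in Python with NumPy as follows. This is an illustration of PCA on a variance-covariance matrix, not the PrinComp code; the function name `pca_scores` and the synthetic input matrix are assumptions and do not reproduce the study data.

```python
import numpy as np


def pca_scores(coefficients):
    """PCA on the variance-covariance matrix of a samples x variables array.

    Returns the component scores and the proportion of variance
    explained by each component, sorted from largest to smallest.
    """
    X = np.asarray(coefficients, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)

    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = centered @ eigvecs
    explained = eigvals / eigvals.sum()
    return scores, explained


# Synthetic stand-in for a table of Fourier coefficients:
# 12 hypothetical specimens x 8 coefficients (2 harmonics x 4).
rng = np.random.default_rng(0)
toy = rng.normal(size=(12, 8))
scores, explained = pca_scores(toy)
```

Using the covariance matrix rather than the correlation matrix keeps the coefficients on their original scale, so low-order harmonics, which carry most of the shape variation, are not down-weighted.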
To evaluate the ability of EFDs to discriminate among the three haplogroups of T. dimidiata, a Discriminant Function Analysis was performed, and the minimum number of harmonics needed to produce satisfactory classifications was determined. For this, the principal component scores obtained from the PrinComp module were used. For the first five harmonics, the number of principal components was 16, while for 10, 15, 20, 25 and 30 harmonics, 30 principal components were recovered. A Canonical Variate Analysis (CVA) was performed with the minimum number of harmonics that allowed the best discrimination of the haplogroups, and the confusion matrix was obtained to estimate the classification errors.
Finally, we compared each of the first five principal components among the three haplogroups to determine whether there were statistically significant differences among them. Because the data were not normally distributed, we used a Kruskal-Wallis test to compare among the three groups.
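For reference, the Kruskal-Wallis statistic can be computed as in the following Python sketch. It is not the software used in the study; the function names and the toy scores are assumptions. The p-value helper uses the chi-square approximation, which for exactly three groups (two degrees of freedom) reduces to exp(-H/2).

```python
import math


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted((value, g) for g, values in enumerate(groups)
                    for value in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0

    i = 0
    while i < n_total:
        # Find the run of tied values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j

    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r * r / len(values) for r, values in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    return h / (1.0 - tie_term / (n_total ** 3 - n_total))


def p_value_three_groups(h):
    """Chi-square approximation with 2 degrees of freedom (three groups)."""
    return math.exp(-h / 2.0)


# Hypothetical component scores for three groups (not the study data).
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
p = p_value_three_groups(h)
```

With small samples the chi-square approximation is only rough, so exact or permutation p-values may be preferable in practice.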
As an alternative method of discrimination and identification, multilayer perceptron neural networks were trained. Artificial neural networks (ANNs) are mathematical models constructed by simulating the functioning of biological neural networks (the nervous system). They consist of a set of processing units called neurons, cells or nodes (each defined by several mathematical equations), linked by connections whose weights modify the values passed between neurons [44]. ANNs have been advocated in many disciplines for addressing complex pattern-recognition problems. The advantages of ANNs over traditional, linear approaches include their ability to model nonlinear associations with a variety of data types (e.g., continuous, discrete) and to accommodate interactions among predictor variables without any a priori specification (Bishop 1995). Neural networks are considered universal approximators of continuous functions, and as such, they exhibit flexibility for modeling nonlinear relationships between variables. For example, ANNs exhibit substantially greater predictive power than traditional, linear approaches when modeling nonlinear data (based on empirical and simulated data; Olden & Jackson 2002b).
The variables used to make the network were the scores of the principal components that contributed most to the total variance, obtained from the Fourier coefficients from 25 harmonics. For the basic topology, the automated search procedure of Statistica version 8.0 software was used, with an input layer of 30 neurons, corresponding to each shape variable, and the output layer with three neurons, one for each haplogroup to identify.
In the exploratory step, the most efficient network was identified by testing hidden layers of between 10 and 40 neurons. Two error functions (sum of squares and cross-entropy) and four activation functions (identity, logistic, hyperbolic tangent and exponential) were used. The learning rate was 0.1, the momentum was 0.66, and training was stopped when the training error fell below 0.001. Network learning was represented using the behavior of the maximum, average and minimum errors. Sixty percent of the data were randomly selected for network training and the remaining 40% were used for validation. Of the 30 networks, the one with the lowest classification error on the validation data was selected as the best. The classification power for the haplogroups was analyzed using the confusion matrix and the percentages of omission and commission errors.
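The error measures derived from the confusion matrix can be sketched in Python as follows. We assume the usual definitions: the omission error of a class is the fraction of its true members assigned to another class, and the commission error is the fraction of specimens assigned to the class that actually belong elsewhere. The function names and the toy labels are hypothetical and are not results from this study.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def omission_commission(matrix):
    """Return a list of (omission, commission) error rates per class."""
    n = len(matrix)
    errors = []
    for k in range(n):
        row_total = sum(matrix[k])                       # true members
        col_total = sum(matrix[r][k] for r in range(n))  # assigned members
        omission = 1 - matrix[k][k] / row_total if row_total else 0.0
        commission = 1 - matrix[k][k] / col_total if col_total else 0.0
        errors.append((omission, commission))
    return errors


# Hypothetical validation labels for three haplogroups.
true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
pred = [1, 1, 1, 2, 2, 2, 2, 3, 3, 1]
matrix = confusion_matrix(true, pred, classes=[1, 2, 3])
errors = omission_commission(matrix)
```

In this toy case one specimen of haplogroup 1 is assigned to haplogroup 2, so haplogroup 1 has an omission error of 25% and haplogroup 2 a commission error of 25%.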