Our results demonstrate that the levels of inter-observer error permeating shape data collated under a collaborative research framework, where the research protocols are outlined in detail, fall within the acceptable threshold. We found that, inevitably, error increases as a consequence of relying on multiple observers, each of whom has access to different equipment, yet we do not deem this increase large enough to substantially distort the results or to lead to a different conclusion about the data. Therefore, our innovative 3D printing approach and the results reported here have important implications for error assessments of linear metric and GMM data when recording lithic shape, as well as for the aggregation of data collected by multiple observers.
Outline-based GMM was found to be slightly more sensitive to inter-observer error than metric methods. As Caple et al. (2018) point out, EFA involves global descriptors capturing around 99% of the variance in the outline shape, and therefore discrepancies between images lead to error in the coefficients dispersed throughout the full outline. Thus, even if the error is not equally distributed, it is measured as such, and consequently outline methods are often more sensitive to error than linear methods that capture only certain dimensions of an object. 2D outline-based GMM provides comprehensive morphological information on the gross outline shape of an object, whereas linear metrics are able to capture aspects of the 3D shape but in much less detail; the increase in the morphological information captured, plus the added potential for automated data capture (e.g. Bonhomme et al. 2014; Matzig 2021) and impressive shape visualization (e.g. Fig. 5), will be worth the potential increase in error with 2D GMM in many scenarios. Our use of PCA to highlight axes of variance within lithic shape assemblages also demonstrates that inter-observer error does not affect all PCs equally. As outlined by Page (1976), subtle errors in each variable are combined in multivariate analyses and can be extracted by a single PC or a small set of PCs, although these PCs may also describe real aspects of covariance and so require careful consideration as to their source. When undertaking metric analyses, it is possible to assess error in each individual measurement; if the metrics are combined via dimension reduction methods such as PCA, the contributions of each individual measurement to each PC are readily identifiable through the PCA coefficients.
This is less feasible with GMM data, particularly when using outlines and semi-landmarks, and in such cases, it is preferable to assess error on each of the leading PCs, as demonstrated above, rather than on each set of coordinates, which can be very numerous. Overall, error is impossible to avoid completely, and indeed the imperfect fidelity of cultural transmission means that copying errors can naturally occur during the knapping process and inflate variance between and within assemblages (Eerkens and Lipo 2005; Schillinger et al. 2014). In this sense, error is certain to arise within a data set capturing lithic variability; however, steps can be taken to ensure it is minimized, such as standardization of data acquisition, processing, and analytical procedures, calibration, high quality equipment, and assessment of error through repeat measures (Evin et al. 2020; Lyman and VanPool 2009; Robinson and Terhune 2017; Yezerinac et al. 1992). In the case of the current study, we determine that inter-observer error is low enough for accurate analyses under both methods, especially as the high \(ICC\) and \(R\) values demonstrate acceptable levels of congruence between the six observers.
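The idea of assessing error on each leading PC, rather than on each raw variable, can be illustrated with a minimal simulated sketch. Everything in it is an assumption made for illustration: the three measurements, their means and spreads, the error level of 0.5 mm, the sample of 30 artefacts, and the one-way form of the ICC (which may differ from the ICC variant used in the analyses reported above). It is not the procedure applied in this study.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC for an (artefacts x observers) array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    row_means = scores.mean(axis=1)
    ms_between = k * np.sum((row_means - scores.mean()) ** 2) / (n - 1)
    ms_within = np.sum((scores - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(1)
n_artefacts, n_observers = 30, 6  # 30 artefacts is a hypothetical sample size

# Hypothetical "true" length, width and thickness (mm); thickness varies least.
true_mean = np.array([60.0, 35.0, 10.0])
true_sd = np.array([10.0, 5.0, 0.3])
true_values = true_mean + true_sd * rng.standard_normal((n_artefacts, 3))

# Every observer adds error of the same size (SD 0.5 mm) to every measurement.
error = 0.5 * rng.standard_normal((n_artefacts, n_observers, 3))
recorded = true_values[:, None, :] + error

# Covariance-based PCA on the pooled, centred measurements via SVD.
pooled = recorded.reshape(-1, 3)
centred = pooled - pooled.mean(axis=0)
_, _, loadings = np.linalg.svd(centred, full_matrices=False)  # rows = PCs
pc_scores = (centred @ loadings.T).reshape(n_artefacts, n_observers, 3)

# Agreement between observers, assessed separately on each PC.
icc_per_pc = [icc_oneway(pc_scores[:, :, j]) for j in range(3)]
for j, icc in enumerate(icc_per_pc, start=1):
    print(f"PC{j}: ICC = {icc:.3f}, loadings = {np.round(loadings[j - 1], 2)}")
```

In this simulated setting the same absolute error weighs far more heavily on the trailing PC, which mostly reflects the least variable measurement, than on the first PC; the printed loadings show which measurements drive each axis.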
Through the development of clear research protocols, our results demonstrate that multiple observers can successfully work together to produce sets of comparable data for aggregation. We believe that collaborative research designs, such as the one reported in Timbrell (2022), play an integral role in addressing the vulnerabilities of international research to disruption, revealed most recently in 2020 by the outbreak of coronavirus (COVID-19), which halted both domestic and international travel as well as social interaction. Our results suggest that, as well as single researchers visiting multiple collections to independently access lithic samples, international colleagues are also able to work together in situ to generate data, thereby building resilience in archaeological practice (Douglass et al. 2020; Scerri et al. 2020). We stress though that collaborative research designs should involve an equitable partnership in relation to the data, following the imminent Cape Town statement (see Else 2022), with all researchers being involved in all stages of the research, from planning and protocol development to publication and dissemination (Chirikure 2015; Douglass et al. 2020). In this way, dual project development can enable local researchers to benefit from international archaeological research, thereby avoiding some (but not all) of the neo-colonial ‘helicopter’ practices that have been hugely criticized in archaeological and anthropological sciences, particularly in Africa (Ackermann 2019; Athreya and Ackermann 2019; Sahle 2021). We have provided here an initial pilot test of collaborative data collection using a 3D printing approach. This approach is unique and, to our knowledge, has not yet been applied in the context of lithic variability or inter-observer error assessments.
We propose that future studies should aim to reproduce our approach with more expanded samples of replica artefacts, and discuss three important aspects of potential future study design below.
The first aspect relates to the use of statistics and simple metrics for reporting inter-observer error. Statistics such as the \(ICC\) and \(\%TEM\) express the error variance relative to the overall variance of the sample; variance is decomposed into that due to genuine variation among the artefacts and that due to variation among the observers (including that due to different individuals, their different cameras, lenses, etc.). Whilst this approach has many advantages, one immediate drawback is that these statistics are directly affected by the magnitude of genuine variation in both the sample of artefacts and in the dimensions measured. A given, constant level of measurement error will appear large when the artefacts measured are highly standardized, but small when the artefacts measured are highly variable. Even if one were to measure the widths and lengths of a set of highly standardized artefacts, a given level of measurement error would appear smaller the further the ratio of width to length is from unity, as this would increase the magnitude of genuine variation in the measurements taken. For this reason, it is always valuable to present simple indices of absolute error (such as standard deviation or variance) for single measurements alongside the indices of relative error variance across all measurements provided by the \(ICC\) and \(\%TEM\). Such simple indices are valuable in assessing inter-observer error even when the ultimate study involves more sophisticated morphological analyses, such as those based on GMM. In the current study, Table 2 presents such indices, and demonstrates that levels of error are minimal (the largest standard deviation among multiple observers for a single measurement = 0.613 mm).
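The dependence of relative error statistics on genuine variation can be shown with a short simulated sketch. The values are hypothetical (30 artefacts, six observers, a constant observer error of SD 0.5 mm, and true spreads of 1 mm versus 10 mm), and the one-way random-effects ICC used here is an assumed form that may not match the exact ICC variant reported in this study. The same error matrix is added to a standardized and to a variable set of artefacts, so that only the genuine variation differs.

```python
import numpy as np

def icc_oneway(x):
    """One-way random-effects ICC for an (artefacts x observers) array."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    ms_between = k * np.sum((row_means - x.mean()) ** 2) / (n - 1)
    ms_within = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def pooled_observer_sd(x):
    """Absolute error: root mean of the among-observer variance per artefact."""
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean(np.var(x, axis=1, ddof=1))))

rng = np.random.default_rng(0)
n_artefacts, n_observers = 30, 6  # hypothetical design

# Shared standardized deviations and a shared observer-error matrix, so the
# two data sets differ only in the magnitude of genuine variation.
z = rng.standard_normal(n_artefacts)
error = 0.5 * rng.standard_normal((n_artefacts, n_observers))

standardized = (50.0 + 1.0 * z)[:, None] + error   # true SD of 1 mm
variable = (50.0 + 10.0 * z)[:, None] + error      # true SD of 10 mm

for label, data in [("standardized", standardized), ("variable", variable)]:
    print(f"{label}: ICC = {icc_oneway(data):.3f}, "
          f"absolute SD = {pooled_observer_sd(data):.3f} mm")
```

Under these assumptions the absolute index is identical for both data sets, whereas the ICC is higher for the more variable set, which is why reporting both kinds of index side by side is informative.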
The second aspect relates to the exploration of the effects of the raw material used for production of the reference collection on the results of comparative studies. In this study, we used flint because it was available and accessible at the University of Liverpool, where the materials were prepared. This fine-grained raw material tends to produce well-defined features and edges, and so it would be interesting to replicate the approach with a more coarse-grained material, such as quartzite, chert, calcrete or sandstone. This is especially pertinent in our case as the shapes obtained from these materials are likely to be more representative of the actual African stone tools that have been recorded in the main project. However, we note that heat-treated silcrete may achieve a grain as fine as flint (Key et al. 2021), and that obsidian can be even finer-grained than flint; since both silcrete and obsidian are raw materials commonly found in African Middle Stone Age assemblages, we suggest that the flint used here acts as a suitable middle ground in terms of granularity and can therefore be considered as broadly comparable to those raw materials studied in the main project.
Finally, an aspect of variation between individual replicas that we did not explicitly measure is that which can arise through 3D printing. Zeng and Zou (2019) outline some of the factors that can affect the precision of 3D printing, which include slicing and support errors. However, we propose that, even if there are printing errors present in our replicas, these are likely minimal due to the highly comparable data obtained across the project. Additionally, printing errors should not contribute to differences between the two data collection strategies as both the multiple observers and the single observer recorded measurements from the same set of replicas.