We first compared the GPRs of both phantom and patient plans to quantitatively determine the agreement between QA systems when passing plans above a given GPR threshold. Basavatia et al. previously concluded that Mobius performed similarly to the other measurement-based systems when comparing whether patient plans passed, using a 90% GPR at the 3%/3 mm gamma criterion as the passing threshold [16]. Our study supports these findings while using the TG218-recommended action limit. When using the TG218-recommended universal tolerance limit, Mobius consistently failed more patient plans than both PD and ArcCHECK, but the vast majority of plans that passed ArcCHECK and PD also passed Mobius. As such, Mobius may be useful as a first screen to determine whether a plan requires a further measurement-based QA check.
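For readers unfamiliar with how a GPR arises from a gamma criterion, the calculation can be sketched as follows. This is a deliberately simplified 1-D global gamma analysis with toy dose profiles; the function name and data are illustrative only and do not reproduce any of the commercial systems' implementations.

```python
import numpy as np

def gamma_pass_rate(ref_dose, eval_dose, positions, dose_tol=0.03, dist_tol=3.0):
    """Simplified 1-D global gamma analysis (3%/3 mm by default).

    For each reference point, search all evaluated points for the minimum
    combined dose-difference / distance-to-agreement metric; a point passes
    if that minimum gamma is <= 1.
    """
    norm = dose_tol * ref_dose.max()  # global dose normalisation
    passed = 0
    for r_pos, r_dose in zip(positions, ref_dose):
        dd = (eval_dose - r_dose) / norm       # dose-difference term
        dta = (positions - r_pos) / dist_tol   # distance term (mm)
        gamma = np.sqrt(dd**2 + dta**2).min()  # gamma index at this point
        if gamma <= 1.0:
            passed += 1
    return 100.0 * passed / len(ref_dose)

# Toy example: evaluated profile shifted 1 mm from the reference
x = np.linspace(0, 100, 101)          # positions in mm, 1 mm spacing
ref = np.exp(-((x - 50) / 20) ** 2)   # reference dose profile
ev = np.exp(-((x - 51) / 20) ** 2)    # shifted evaluated profile
print(f"GPR: {gamma_pass_rate(ref, ev, x):.1f}%")
```

A 1 mm spatial shift sits well inside the 3 mm distance-to-agreement tolerance, so every point passes; clinically relevant failures require larger dose or positional deviations.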
Tolerance limits could be calculated separately for plans of different complexities, as plans with higher complexity may show a larger deviation in GPRs [2]. Both Au et al. and Song et al. stratified the investigated plans by treatment site when comparing the Mobius system with other traditional QA methods, and their data suggest that GPR is treatment site- and QA system-dependent [22, 23]. Our results support these studies: prostate plans were more likely to fail Mobius than ArcCHECK, while HNN plans were more likely to fail ArcCHECK but not Mobius. The prostate plans that failed Mobius generally used very large fields and required long measurements on the ArcCHECK. Given Mobius' limitations in modelling larger field sizes and off-axis regions, these failures are not entirely unexpected, and the ArcCHECK results may be more reliable. Improvements to Mobius' modelling method may resolve such issues in the future. On the other hand, HNN plans tend to be more complex, with higher degrees of MLC modulation, and the ArcCHECK's limited resolution may reduce its reliability in correctly identifying poor patient plans. It would be valuable to study the sources of these discrepancies in more detail. Nevertheless, if Mobius were to be used as the first screen in a PSQA workflow, measuring HNN plans regardless of their pass rates on Mobius may be a good approach.
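As a concrete illustration of how such site- or complexity-specific limits might be set, the sketch below applies a mean-minus-three-standard-deviations lower limit to a set of historical GPRs, in the spirit of the statistical process control approach described by TG218. The GPR values are invented, and this is not our institution's procedure.

```python
import statistics

def lower_tolerance_limit(historical_gprs):
    """One-sided lower tolerance limit (mean - 3*SD) from a treatment
    site's historical GPRs, following a statistical-process-control
    style approach; plans below this limit would trigger review."""
    mean = statistics.fmean(historical_gprs)
    sd = statistics.stdev(historical_gprs)
    return mean - 3 * sd

# Hypothetical historical GPRs (%) for a single treatment site
site_gprs = [96.1, 97.4, 95.0, 98.2, 96.8]
print(f"Lower tolerance limit: {lower_tolerance_limit(site_gprs):.1f}%")
```

Stratifying the input GPRs by site (or by a complexity metric) before computing the limit would yield the separate, complexity-aware tolerance limits suggested above.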
The higher failure rates on Mobius compared with both ArcCHECK and PD may also reflect other differences between the techniques. First, our PD measurements use the perpendicular composite measurement technique, which is not recommended by TG218 but nevertheless remains common in QA practice across institutions worldwide [24, 25]. The fact that these composite PD measurements rarely failed QA provides further evidence that such measurements may not be useful in detecting poor plans. Second, the higher failure rates on Mobius compared with ArcCHECK could reflect the fact that its GPRs are computed by comparing doses on heterogeneous patient CTs rather than on homogeneous phantoms, as is the case for most measurement-based PSQA.
We next investigated the sensitivity of the three QA systems in detecting intentional systematic errors, in terms of whether each system fails a pre-determined GPR threshold. Mobius proved the most sensitive to collimator rotation errors, but ArcCHECK outperformed it in gantry angle error detection, and PD was the most sensitive to MLC bank shift errors. Still, Mobius achieved a sensitivity of at least 70% for 3° gantry and collimator angle errors when using the action limit threshold. These findings partially support Au et al.'s work, which found that Mobius was able to detect 2° collimator angle errors and 3 mm MLC bank shift errors when assessed at the 2%/2 mm gamma criterion [22].
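The sensitivity figure used here reduces to a simple proportion: of the deliveries with a deliberately introduced error, how many fall below the chosen GPR threshold. The sketch below makes that explicit; the GPR values are invented for illustration and are not our measured data.

```python
def error_detection_sensitivity(gprs_with_error, threshold):
    """Fraction of error-introduced deliveries that a QA system flags,
    i.e. whose GPR falls below the chosen action-limit threshold."""
    flagged = sum(1 for gpr in gprs_with_error if gpr < threshold)
    return flagged / len(gprs_with_error)

# Hypothetical GPRs (%) for ten plans delivered with a 3 deg collimator error
gprs = [82.1, 88.5, 91.2, 79.4, 85.0, 93.7, 84.2, 89.9, 86.3, 90.5]
print(error_detection_sensitivity(gprs, threshold=90.0))  # 7 of 10 flagged -> 0.7
```

Note that this definition depends entirely on the threshold: a tighter tolerance limit raises sensitivity to errors but also fails more error-free plans, which is the trade-off underlying the action-limit versus universal-tolerance-limit comparison above.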
An underlying assumption here is that the GPR is sensitive to such errors. This is the only means by which sensitivity can be assessed for ArcCHECK and PD without additional software (such as 3DVH for ArcCHECK). However, GPR has been reported to be insensitive to small errors under several test conditions [7, 11]. It may therefore be prudent to look towards other means of detecting errors.
Mobius offers a solution by determining whether the delivery parameters match those in the original plan, allowing the user to pinpoint the error. However, a core limitation of this method is that the accuracy of the log files is assumed [12]. If the log files are inaccurate because the LINAC calibration is off, Mobius is unlikely to detect the resulting delivery errors. As an example, Agnew et al. previously reported a discrepancy between observed and log file-recorded MLC positions [17]. Daily wear and tear of the motors that control the MLCs will also contribute to discrepancies between log file records and actual MLC positions [26]. This demonstrates the importance of establishing confidence in the accuracy of the log files through rigorous routine machine quality assurance.
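The parameter-matching idea can be sketched as a per-leaf comparison of planned versus log-file MLC positions. The function name and the 0.5 mm tolerance below are our own illustrative choices, not Mobius' internal implementation, and the check is only as trustworthy as the log files themselves.

```python
def flag_mlc_deviations(planned_mm, logged_mm, tol_mm=0.5):
    """Compare planned vs log file-recorded MLC leaf positions (mm) and
    return the indices of leaves exceeding the tolerance. Illustrative
    only: assumes the log file faithfully records actual leaf positions."""
    return [i for i, (p, q) in enumerate(zip(planned_mm, logged_mm))
            if abs(p - q) > tol_mm]

# Toy example: one leaf deviates by 1.0 mm
planned = [10.0, 12.0, 15.0, 18.0]
logged = [10.1, 13.0, 15.0, 18.2]
print(flag_mlc_deviations(planned, logged))  # leaf index 1 exceeds 0.5 mm
```

Unlike a GPR, this kind of check localises the error to a specific parameter, but it inherits the log-file accuracy assumption discussed above.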
We recalculated the plans with errors on the treatment planning system to determine how the DVHs would change and the extent to which these errors resulted in clinically unacceptable plans. Surprisingly, gross errors did not affect clinical goals much, contrary to several previous studies that also investigated systematic errors [27, 28]. Only very large MLC bank shifts (5 mm) and collimator angle errors (> 3°) had a clearly detrimental effect on PTV coverage. These findings could be due to two main factors: the plans we studied did not have small PTVs, and they were also fairly robust, in that they had exceeded their respective planning goals by a large margin. Even so, the mostly inconsequential dosimetric changes were unexpected. A similar observation was made in Lehmann et al.'s study, where systematic errors caused clinically significant changes only some of the time [29]. Nevertheless, given the small number of plans investigated in both studies, it would be worthwhile to include a larger number of plans across more treatment sites for better generalisability.
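The kind of coverage check applied to the recalculated DVHs can be sketched as follows. The D95 metric and the 95%-of-prescription goal below are a common illustrative choice, not the specific clinical goals used in our plans, and the toy dose distribution is invented.

```python
import numpy as np

def d95(voxel_doses):
    """D95: the dose received by at least 95% of the structure's voxels,
    i.e. the 5th percentile of the voxel dose distribution."""
    return np.percentile(voxel_doses, 5)

def meets_coverage_goal(voxel_doses, prescription, fraction=0.95):
    """PTV coverage check: D95 should reach `fraction` of the prescription.
    (Illustrative goal, not our clinical protocol.)"""
    return d95(voxel_doses) >= fraction * prescription

# Toy PTV voxel doses around a 70 Gy prescription
rng = np.random.default_rng(0)
ptv_doses = rng.normal(loc=70.5, scale=0.8, size=10_000)
print(meets_coverage_goal(ptv_doses, prescription=70.0))
```

A plan that exceeds its goals by a wide margin, as ours did, can absorb a sizeable dose perturbation before such a check fails, which is consistent with the robustness observed above.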