The analysis of the cultural adaptation of the internationally validated CCOG questionnaire showed good results. The Cronbach's alpha value for all videos taken together was above 0.80, in line with the recommendation of Streiner10, who suggests that coefficients should fall between 0.80 and 0.90, indicating that the Brazilian version has acceptable internal consistency and reliability.
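As a point of reference for the threshold discussed above, a minimal sketch of how Cronbach's alpha can be computed from a respondents-by-items score matrix is shown below; the data, values, and variable names are purely illustrative and are not taken from the study.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical ratings: 5 completed forms, 4 items scored 0/1/2
forms = np.array([
    [2, 2, 1, 2],
    [1, 1, 1, 1],
    [2, 1, 2, 2],
    [0, 1, 0, 1],
    [2, 2, 2, 1],
])
print(round(cronbach_alpha(forms), 2))  # about 0.81 for this toy matrix
```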
Items 1 (Greets patient) and 2 (Introduces self and role) showed negative loadings in the confirmatory factor analysis, which suggests that they are not adequately measuring the intended construct. The analysis indicates that virtually none of the variance of these two items is explained by the intended construct; their variance is attributable entirely to noise or to some other, unintended construct. The coefficient of determination (R²), which expresses how much of the variance of one variable is explained by another (here, the item by the construct), was also essentially zero. This may be explained by the simulated nature of the station and by the fact that recording began inside the office: some residents greeted the patient before entering the office and before filming started, which interfered with the analysis. Since evaluators were instructed to leave an item blank when the task could not be assessed, these initial consultation-opening items had the largest number of blank responses.
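To make the connection between the factor loading and R² explicit: in a standardized confirmatory factor solution in which each item loads on a single factor, the proportion of item variance explained by the construct is the squared standardized loading, so a loading near zero implies an R² near zero and a residual (noise) variance near one. The notation below is generic rather than the study's own:

$$ R^2_i = \lambda_i^2, \qquad \operatorname{Var}(\varepsilon_i) = 1 - \lambda_i^2 $$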
When the same analysis was repeated without items 1 and 2, model fit improved. The fit indices of the observed data relative to the proposed theoretical model, together with the significantly better fit of this model compared with a unidimensional model, constitute validity evidence based on the internal structure of the tool. We therefore suggest that, in order to evaluate items 1 and 2, the consultation should begin inside the assessed environment, so that the interviewer can be observed greeting the patient and introducing themselves.
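One common way to formalize the comparison with a unidimensional model, assuming the two models are nested and estimated by maximum likelihood, is the chi-square difference (likelihood-ratio) test; the notation below is generic and not taken from the study:

$$ \Delta\chi^2 = \chi^2_{\text{unidimensional}} - \chi^2_{\text{proposed}}, \qquad \Delta df = df_{\text{unidimensional}} - df_{\text{proposed}} $$

A Δχ² exceeding the critical value of the chi-square distribution with Δdf degrees of freedom indicates that the proposed multidimensional model fits the data significantly better.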
The item with the highest agreement among evaluators was the one indicating that the resident did not appear judgmental; this seems to be a clearer parameter, and being judgmental is a behavior residents are well trained to avoid. Among the items with the greatest disagreement among evaluators was one containing two tasks: “identify and confirm the list of issues”. Combining two tasks in a single item may have contributed to the divergent responses. For this reason, we suggest changing the item to “confirm the list of problems”, since in order to confirm the problems the student or resident must already have identified them.
We found significant disagreement on the task “Establish dates”, for which the reviewers suggested changes. The divergent answers most likely stemmed from difficulty in understanding the item, so we modified it in the final version. The last item of the questionnaire, which refers to the important skill of shared decision-making, also showed a high degree of disagreement, probably because reaching full agreement with the patient involves a complexity of dialogue and negotiation that may require better-defined parameters. The intraclass correlation coefficient was low in the domain “Closes the consultation”, probably because of difficulty in interpreting the shared decision-making process in the item “Contracts with the patient the next steps”. The word “contracts” leaves room for different interpretations of what constitutes a satisfactory degree of patient participation in decision-making.
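For illustration, a minimal sketch of how an intraclass correlation coefficient for a domain such as “Closes the consultation” could be computed from long-format ratings is shown below, using the pingouin package; the data and the choice of ICC form are hypothetical and not those of the study.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: each row is one rater scoring one video
# on the "Closes the consultation" domain (0 = no, 1 = yes but, 2 = yes).
ratings = pd.DataFrame({
    "video": ["v1", "v1", "v1", "v2", "v2", "v2", "v3", "v3", "v3"],
    "rater": ["r1", "r2", "r3"] * 3,
    "score": [2, 1, 2, 0, 1, 0, 1, 1, 2],
})

icc = pg.intraclass_corr(data=ratings, targets="video",
                         raters="rater", ratings="score")
# ICC2 ("two-way random effects, single rater") is a common choice when the
# same set of raters scores every video.
print(icc[["Type", "ICC", "CI95%"]])
```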
Other items are more open to subjective interpretation and produced greater differences in the evaluation. The evaluators reported difficulty defining parameters for less objective, behavioral items such as “Demonstrates respect / appears confident / demonstrates empathy”. This may be because such items require further work in defining their parameters among evaluators, according to the learning objectives of each phase of medical training, and because they are complex tasks that are difficult to judge as an external observer attempting to gauge the feelings of the interviewer and the patient. A fully realistic evaluation would require knowing the patient's opinion, for example whether the patient felt confidence in the interviewer or perceived the interviewer as empathetic.
These items in particular should be discussed by the group of evaluators to define what will be considered satisfactory, partially satisfactory, or unsatisfactory. We observed that when an item was not performed, the evaluators found it easy to mark “No”, but when residents did perform the task, evaluators were unsure whether to choose “YES” or “YES, BUT...”, indicating that we need to define more clearly when a task has been accomplished wholly or only partially.
These difficulties may have affected some of the tool's reliability and validity coefficients. However, another scale-validation study, conducted in Germany, reported similar intraclass correlation coefficients, ranging from 0.05 to 0.57. Its authors suggested deleting the item “Negotiates agenda” when using the questionnaire with students at the beginning of the course and using it only toward the end of the course, when students would be able to address multiple topics and perform procedures beyond merely collecting the patient's history. They also attributed the scale's reliability difficulties to the need for further instructions and better-defined parameters among evaluators before administration12.
In addition, the difficulties in evaluating and judging the items may hamper a final summative evaluation. There has been growing discussion about combining item-by-item judgment with a subjective holistic judgment13. Some studies comparing the psychometric properties of checklists and global rating scales in OSCEs scored by experts found that global scales showed higher inter-station reliability and better validity than checklists14,15. In its original version, the tool provides for a global evaluation, without a numerical grade, choosing among “SATISFACTORY”, “SATISFACTORY, BUT...”, and “UNSATISFACTORY”; we did not use this global rating in the study, but we recommend its regular use alongside the questionnaire.
The tool may be used for both formative and summative evaluation, but given the difficulties already discussed in judging the more subjective items, the CCOG may bring greater benefits when applied to formative rather than summative evaluation. We emphasize that, whenever possible, constructive and detailed narrative feedback should accompany a summative assessment, since students prefer feedback to grades and reflective feedback is highly effective16.
Despite the difficulties observed, the reliability coefficients in the Many-Facet Rasch Model were excellent for all facets. These coefficients range from 0 to 1, and the higher the value, the lower the risk of false positives or false negatives and thus the lower the measurement error, demonstrating that the tool has acceptable reliability for reproducibility in other contexts. One limitation of this study is that the evaluators, although all experienced preceptors, had different backgrounds in assessing communication skills. In addition, although about 170 completed forms were obtained, a larger sample could provide more information. Another limitation is that we were unable to conduct a second round of evaluation to confirm the consistency of ratings within each evaluator. Furthermore, the preceptors' evaluation could have been combined and/or compared with evaluations from other perspectives, such as colleagues, staff, and patients, since assessment from multiple sources and at different points in time in Medical Residency Programs has proved to be a good way to evaluate attitudinal skills and complex tasks17,18.
Although the study sample consisted of resident physicians, we believe the tool can also be used with undergraduate students, as demonstrated in other studies, with adaptation and standardization of the item parameters according to the course stage and learning objectives12. We underline the great importance of discussing with the group of evaluators each word in the questionnaire and its subsequent use in practice, which should continue to undergo adjustments based on feedback, for continuous improvement. We suggest further research on evaluation tools for attitudinal skills, refining the definition of subjective items according to the learning objectives of each phase of medical training.
Because the validation of a tool is a continuous process19,20, we recommend that the scale items be continually reevaluated for ongoing improvement, and we emphasize the importance of standardizing evaluation parameters among evaluators for each item before the scale is applied, clarifying the learning objectives required at each training level, especially for the less objective attitudinal items, such as demonstrating respect, confidence, and empathy. We also suggest complementing the evaluation of communication skills with other sources and viewpoints, such as patients, colleagues, and staff.