The oral part of EDIC (Part 2) was sat on 14th July 2018 by 185 candidates, examined by 137 examiners in 6 centers. Most candidates were from Europe (49%), India (37%) or the Middle East (14%). The exam as sat comprised 299 items; 17 (6%) were eliminated post hoc by the examination committee as per the process described above (15 items from CCSs and 2 items from CBSs). Details of the facility index and discrimination characteristics of the eliminated items can be found in Supplementary Appendix Part 2.
| A | Max points | BRM pass mark (points) | BRM pass rate | MAM pass mark (points) | MAM pass mark (% of max) | MAM pass rate |
| --- | --- | --- | --- | --- | --- | --- |
| CBS 1 imaging | 34 | 20 | 71% | 20 | 60% | 71% |
| CBS 2 curves | 26 | 15 | 53% | 14 | 55% | 63% |
| CBS 3 labs | 26 | 15 | 67% | 15 | 59% | 67% |
| All CBSs | | | 67% | | | 70% |
| CCS 1 | 75 | 41 | 63% | 45 | 60% | 47% |
| CCS 2 | 77 | 36 | 56% | 45 | 59% | 14% |
| CCS 3 | 61 | 36 | 56% | 38 | 62% | 45% |
| All CCSs | | | 61% | | | 32% |
| Exam | | | 50% | | | 30% |

| B | Max points | BRM pass mark (points) | BRM pass rate | MAM pass mark (points) | MAM pass rate |
| --- | --- | --- | --- | --- | --- |
| CBS 1 imaging | 33 | 20 | 71% | 20 | 71% |
| CBS 2 curves | 25 | 15 | 53% | 14 | 63% |
| CBS 3 labs | 26 | 15 | 67% | 15 | 67% |
| All CBSs | | | 67% | | 67% |
| CCS 1 | 69 | 41 | 63% | 43 | 59% |
| CCS 2 | 69 | 35 | 59% | 40 | 28% |
| CCS 3 | 60 | 36 | 56% | 37 | 50% |
| All CCSs | | | 62% | | 46% |
| Exam | | | 51% | | 41% |
Table 1. Results of EDIC Spring 2018 series before (A – top) and after (B – bottom) elimination of 17 items. Note: BRM = borderline regression method, MAM = modified Angoff method
The performance of the two standard-setting techniques was compared both for the original set of 299 questions and for the reduced set of 282. Removal of the 17 questions made little difference to the performance of BRM (overall pass rate 50% with all 299 questions and 51% with the 282-question set), but a substantial difference to MAM (pass rate 30% with the 299 questions and 41% with the 282-question set; Table 1). Using MAM without removal of the 17 questions would have resulted in a pass rate well below the lowest in EDIC history and below the expectations of the committee and the examiners. Even after removal of the 17 questions, the pass rate remained lower by MAM than by BRM (41% vs 51%).
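To make the mechanics of this comparison concrete, the sketch below shows how a modified Angoff pass mark and the resulting pass rate can be recomputed over a retained subset of items. It is a minimal illustration on simulated data, not our analysis code: the arrays `angoff` (one averaged judge estimate per item) and `responses` (dichotomously scored answers), and the random choice of 17 "eliminated" items, are hypothetical stand-ins, and the real EDIC procedure (multiple judges, station structure, rounding) is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 185 candidates x 299 dichotomously scored items, plus one
# (averaged) Angoff estimate per item, i.e. the judges' predicted probability
# that a borderline candidate answers the item correctly.
n_candidates, n_items = 185, 299
angoff = rng.uniform(0.4, 0.8, size=n_items)
responses = rng.binomial(1, 0.55, size=(n_candidates, n_items))

def mam_pass_rate(responses, angoff, keep):
    """Modified Angoff: the pass mark is the sum of judge estimates over the
    retained items; the pass rate is the share of candidates at or above it."""
    pass_mark = angoff[keep].sum()
    scores = responses[:, keep].sum(axis=1)
    return (scores >= pass_mark).mean()

all_items = np.arange(n_items)
eliminated = rng.choice(n_items, size=17, replace=False)  # stand-in for the 17 removed items
retained = np.setdiff1d(all_items, eliminated)

print(f"MAM pass rate, 299 items: {mam_pass_rate(responses, angoff, all_items):.0%}")
print(f"MAM pass rate, 282 items: {mam_pass_rate(responses, angoff, retained):.0%}")
```

If the Angoff estimate for a hard item exceeds the proportion of candidates who actually answer it correctly, removing that item lowers the pass mark by more than it lowers candidates' scores, which is the arithmetic behind the shift from 30% to 41% reported above.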
The CCS question type was the main factor behind the different effects seen between the two techniques when the 17 questions were removed. Removing these 17 questions made little difference to the pass mark for CBS questions with either MAM or BRM. With the 17 questions included, MAM suggested an exceptionally low pass rate for CCS questions, moving towards a more usual or expected pass rate only when they were excluded. The pass rate judged by BRM varied little with or without these 17 questions and fell within the expected range either way.
The borderline regression method (BRM) requires, as described above, a contemporaneous judgement by the examiners on a global assessment scale as to where the candidate's performance fell along a pass-fail spectrum. Using this judgement, we performed a subgroup analysis of the candidates deemed borderline. In this subgroup, the pass rate was close to 50% when BRM was used, both for the full 299 questions and for the 282 remaining after the 17 were removed, but much lower when MAM was used to set the standard (see Table 2). Of note, 22%, 73% and 23% of candidates rated as "clear pass" or "superior performance" by examiners during CCSs 1-3, respectively, would have failed that station had MAM been used for standard setting.
| Pass rate of borderline candidates | Borderline Regression Method, all 299 items | Borderline Regression Method, cleaned (282 items) | Modified Angoff Method, all 299 items | Modified Angoff Method, cleaned (282 items) |
| --- | --- | --- | --- | --- |
| CBS 1 (n = 34, eliminated 1) | 51.9% | 51.9% | 51.9% | 51.9% |
| CBS 2 (n = 26, eliminated 1) | 42.3% | 42.3% | 51.9% | 51.9% |
| CBS 3 (n = 26, eliminated 0) | 52.9% | 52.9% | 52.9% | 52.9% |
| CCS 1 (n = 75, eliminated 6) | 35.3% | 35.3% | 11.8% | 29.4% |
| CCS 2 (n = 77, eliminated 8) | 47.4% | 59.6% | 3.5% | 15.8% |
| CCS 3 (n = 61, eliminated 1) | 52.0% | 52.0% | 36.0% | 44.0% |
Table 2. Pass rates of the subgroup of candidates judged "borderline" by the examiner(s) in all exam stations, before and after elimination of the 17 poorly performing items.
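The near-50% borderline pass rates under BRM in Table 2 follow from the construction of the method itself: the pass mark is the score predicted by a regression of station scores on the examiners' global ratings, evaluated at the borderline point of the scale. The sketch below is a minimal illustration on simulated data; the 5-point rating scale with 3 = "borderline" and the simulated scores are assumptions for illustration, not the EDIC scale or data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated station data: an examiner global rating on an assumed 1-5 scale
# (3 = borderline) and a station score loosely related to that rating.
n_candidates = 185
rating = rng.integers(1, 6, size=n_candidates)
score = 8 * rating + rng.normal(0, 6, size=n_candidates)

# Borderline regression: fit score = a + b * rating and take the predicted
# score at the borderline rating as the pass mark.
b, a = np.polyfit(rating, score, 1)
BORDERLINE = 3
pass_mark = a + b * BORDERLINE

overall_pass_rate = (score >= pass_mark).mean()
borderline_pass_rate = (score[rating == BORDERLINE] >= pass_mark).mean()
print(f"BRM pass mark: {pass_mark:.1f} points")
print(f"Overall pass rate: {overall_pass_rate:.0%}; borderline subgroup: {borderline_pass_rate:.0%}")
```

Because the pass mark is anchored at the borderline rating, roughly half of the borderline-rated candidates fall below it, which is why their pass rate clusters around 50% regardless of how many items are removed.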
In summary, our primary results showed that BRM produced believable, acceptable results with or without the 17 questions removed, and for both CCS- and CBS-type questions. In contrast, MAM produced an unacceptable and improbable pass mark, particularly on the full 299-question dataset and particularly for CCS-type questions. We then looked more closely at the CCS-type questions for the reasons behind this.
We noticed that CCSs had a much higher proportion of hard (low facility index), poorly discriminating items (27%) than CBSs (3%); see Supplementary Appendix Part 4 for details. We therefore hypothesized that elimination of poorly performing exam items would affect standard setting by MAM, but not by BRM. To evaluate this further, we examined the effect on the two methods of progressively removing an increasing proportion of the harder items. We iteratively removed items in a stepwise manner, starting with those that were hardest (lowest facility index) and least discriminating (see Fig. 2), and calculated the impact of each step on the hypothetical pass rates obtained when MAM and BRM were applied to set the standard. As shown in Fig. 2, the pass rate set by BRM remained almost unaffected, whilst the pass rate set by MAM increased and converged with the pass rate obtained by BRM. Of note, the baseline difference between the MAM and BRM pass rates was proportional to the proportion of low-facility, low-discrimination items.
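A sketch of this stepwise elimination, again on simulated data, is given below. The facility and discrimination definitions (proportion correct and corrected item-total correlation), the removal order, and the simulated compression of the judges' estimates (positive intercept, slope below 1, mirroring Fig. 3) are all assumptions chosen to illustrate the procedure rather than to reproduce our results.

```python
import numpy as np

def facility(responses):
    """Facility index: proportion of candidates answering each item correctly."""
    return responses.mean(axis=0)

def discrimination(responses):
    """Corrected item-total correlation as a simple discrimination index."""
    total = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
                     for i in range(responses.shape[1])])

def brm_pass_mark(scores, ratings, borderline=3):
    """Pass mark = regression-predicted score at the borderline rating."""
    b, a = np.polyfit(ratings, scores, 1)
    return a + b * borderline

def stepwise_pass_rates(responses, angoff, ratings, steps=(0, 10, 20, 40)):
    """Remove the hardest, least discriminating items in increasing numbers and
    report the pass rate under MAM and BRM at each step."""
    order = np.lexsort((discrimination(responses), facility(responses)))  # worst items first
    results = []
    for k in steps:
        keep = np.setdiff1d(np.arange(responses.shape[1]), order[:k])
        scores = responses[:, keep].sum(axis=1)
        mam_rate = (scores >= angoff[keep].sum()).mean()
        brm_rate = (scores >= brm_pass_mark(scores, ratings)).mean()
        results.append((k, mam_rate, brm_rate))
    return results

# Simulated exam: judges' estimates are compressed towards the middle, so hard
# items are predicted to be easier than they really are (cf. Fig. 3).
rng = np.random.default_rng(2)
true_p = rng.uniform(0.2, 0.9, size=299)
responses = rng.binomial(1, true_p, size=(185, 299))
angoff = np.clip(0.35 + 0.5 * true_p + rng.normal(0, 0.03, size=299), 0, 1)
total = responses.sum(axis=1)
ratings = np.clip(np.round((total - total.mean()) / total.std() + 3), 1, 5).astype(int)

for k, mam, brm in stepwise_pass_rates(responses, angoff, ratings):
    print(f"{k:2d} items removed: MAM pass rate {mam:.0%}, BRM pass rate {brm:.0%}")
```

By construction, the simulated estimates over-predict the hard items, so the MAM pass rate should rise as those items are removed, while the re-fitted BRM pass mark tracks the shrinking scores and its pass rate stays roughly constant, mirroring the pattern in Fig. 2.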
To explore why MAM is influenced by the presence of hard items, we looked more closely at the relation between the observed and the MAM-predicted facility index of individual exam items in the subgroup of candidates marked as borderline by the examiners (on average n = 55 [30%] candidates, range 51 [28%] to 68 [37%] depending on the exam station). Ideally, the intercept of this relation should be 0, as items with an observed facility index of 0 (nobody gave the correct answer) should have a predicted facility index of 0. However, in all exam stations the intercept was in fact positive (0.25 to 0.46). The slope, which should ideally equal 1.0, was in fact lower (0.26 to 0.59). Indeed, as shown in Figure 3, CBSs had a significantly lower intercept and higher slope than CCSs, and were thus significantly (p = 0.012) closer to the ideal line. It can be inferred that MAM leads to the same result as BRM at a real facility index of around 65%. For harder questions (lower facility index), the Angoff ratings underestimate the difficulty of the question (i.e., overestimate the item facility index). In other words, MAM overestimates the percentage of candidates who will give the correct answer to harder questions.
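The geometry behind this inference can be made explicit with a short sketch: fit predicted facility = a + b x observed facility and find where the fitted line crosses the identity line, i.e. the facility at which the Angoff prediction and the observed performance of borderline candidates agree (a + b x F = F, so F = a / (1 - b)). The data below are simulated, with the intercept and slope deliberately placed in the range we report; our actual item-level data are shown in Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated items: observed facility among borderline candidates vs the
# facility predicted by the Angoff judges (compressed towards the middle).
observed = rng.uniform(0.1, 0.95, size=75)
predicted = np.clip(0.35 + 0.5 * observed + rng.normal(0, 0.04, size=75), 0, 1)

b, a = np.polyfit(observed, predicted, 1)   # slope, intercept
crossover = a / (1.0 - b)                   # solve a + b*F = F for F
print(f"intercept = {a:.2f}, slope = {b:.2f}, crossover facility = {crossover:.2f}")
```

For any item with observed facility below this crossover, the fitted line lies above the identity line, i.e. the Angoff prediction is too optimistic about how many borderline candidates will answer correctly.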
The overall hardness of an exam should not affect the pass rate for any given cohort. The ideal standard-setting technique should take this into account, setting a higher pass mark for an easier question set and vice versa. By progressively removing an increasing percentage of questions from the harder end of the spectrum, we were able to test the effect of the two standard-setting methods on the pass mark and pass rate as the question set became, on average, progressively easier. BRM provided a near-constant pass rate as the average difficulty of the question set varied, which was reassuring. MAM, however, produced a progressively higher pass rate as the average question set became easier; conversely, MAM made the exam harder to pass when the question set was harder on average. MAM predicted the harder questions to be easier than they really were, so the greater failure rate on these harder questions pulled the overall pass rate down when they were included.
In our data, when plotting the pass rate, we found that the two techniques concurred when the hardness and discrimination of the question set were such that the pass mark was assessed at around 65%. At the point of convergence of the pass rates of the two methods, the pass mark was in the range 63-66% (see Table S1 in Part 5 of the Supplementary Appendix for detailed results).
The intercept of the line in Fig. 3 is the predicted facility index of a hypothetical item with a real facility index of 0 (no candidate getting it right), for example an extreme question whose answer nobody could possibly guess or know. Yet for such an item MAM still predicts a (albeit low) percentage of correct answers.