Dataset description
We analysed 3605 records consisting of 479760 words, of which 17496 (3.65%) were PII. Record length and PII prevalence differed across datasets (Table S1). The most frequent forms of PII were names (6901/17496, 39.4%), of which the majority were healthcare professional names (6870/6901, 99.6%) (Table S2). In common with previous research, only a minority of names were patient names (31, 0.4%)30.
The next most frequent PII categories were ‘other unique identifiers’ (4758, 27.2%; comprising professional details, names of external healthcare organisations, and names of hospitals or healthcare units), dates (3641, 20.8%), medical record numbers (1408, 8.0%), and telephone numbers (334, 1.9%). All other PII categories had a prevalence of < 1%. There were no occurrences of fax numbers, health plan beneficiary numbers, vehicle/device identifiers, or IP addresses.
Inter-annotator results
There was excellent agreement between clinician annotators. Pairwise F1 for classification of PII/non-PII was 0.977 (0.957–0.991), precision 0.967 (0.932–0.993), and recall 0.986 (0.971–0.997) (Table S2). All discrepancies between annotators were due to cases in which one annotator did not notice a PII word; once identified, there were no disagreements about whether a word should be classified as PII. The BLEU score between the original, unredacted records and the manually redacted records was 0.931 (0.923–0.934), reflecting the prevalence of PII in the dataset (Table 2). The Levenshtein distance was 67.0 (63.3–70.8).
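For illustration, the sketch below shows one way these text-similarity measures can be computed in Python, assuming whitespace tokenisation, NLTK’s smoothed sentence-level BLEU and a standard character-level edit distance; the exact tokenisation and smoothing settings used in the study may differ.

```python
# Illustrative sketch (not the study's exact pipeline): BLEU and Levenshtein
# distance between an original record and its redacted counterpart.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(original: str, redacted: str) -> float:
    """Sentence-level BLEU with the original record as the single reference."""
    reference = [original.split()]  # list containing one reference token list
    hypothesis = redacted.split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical record for demonstration only.
original = "Reviewed by Dr Smith on 03/05/2023. No acute fracture."
redacted = "Reviewed by [REDACTED] on [REDACTED]. No acute fracture."
print(bleu(original, redacted), levenshtein(original, redacted))
```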
Table 2
String similarity between original and redacted text. Per-model BLEU scores and Levenshtein distances are shown.
Model type | Model name | Number of shots | BLEU score (95% CI) | Levenshtein distance (95% CI) |
Inter-annotator | N/A | N/A | 0.931 (0.923–0.934) | 67.0 (63.3–70.8) |
Proprietary de-identification software | Microsoft Azure de-identification service | N/A | 0.929 (0.927–0.932) | 35.7 (34.1–37.5) |
Proprietary de-identification software | AnonCAT (no fine-tuning) | N/A | 0.948 (0.946–0.950) | 30.2 (28.9–31.6) |
Proprietary de-identification software | AnonCAT (fine-tuned) | N/A | 0.948 (0.946–0.950) | 29.3 (28.0–30.6) |
Large language models | Gemma-7b-IT | 0 | 0.071 (0.068–0.075) | 749.3 (726.1–773.9) |
Large language models | Gemma-7b-IT | 1 | 0.032 (0.030–0.035) | 860.4 (838.1–883.8) |
Large language models | Gemma-7b-IT | 5 | 0.025 (0.023–0.026) | 758.2 (735.6–781.6) |
Large language models | Gemma-7b-IT | 10 | 0.031 (0.029–0.033) | 741.8 (717.1–765.3) |
Large language models | Llama-3-8B-Instruct | 0 | 0.259 (0.250–0.269) | 465.0 (448.7–482.2) |
Large language models | Llama-3-8B-Instruct | 1 | 0.500 (0.491–0.510) | 145.2 (139.4–151.0) |
Large language models | Llama-3-8B-Instruct | 5 | 0.769 (0.760–0.778) | 71.3 (67.2–75.1) |
Large language models | Llama-3-8B-Instruct | 10 | 0.693 (0.682–0.703) | 106.0 (99.3–113.5) |
Large language models | Phi-3-mini-128k-instruct | 0 | 0.396 (0.383–0.407) | 836.3 (790.6–882.7) |
Large language models | Phi-3-mini-128k-instruct | 1 | 0.515 (0.498–0.533) | 603.8 (566.2–640.0) |
Large language models | Phi-3-mini-128k-instruct | 5 | 0.482 (0.468–0.497) | 384.8 (369.0–400.3) |
Large language models | Phi-3-mini-128k-instruct | 10 | 0.663 (0.649–0.676) | 281.4 (262.4–302.1) |
Large language models | GPT-3.5-turbo-base | 0 | 0.838 (0.832–0.843) | 87.6 (83.9–92.1) |
Large language models | GPT-3.5-turbo-base | 1 | 0.878 (0.873–0.883) | 73.0 (69.4–77.2) |
Large language models | GPT-3.5-turbo-base | 5 | 0.926 (0.921–0.930) | 51.6 (47.9–56.2) |
Large language models | GPT-3.5-turbo-base | 10 | 0.932 (0.928–0.936) | 47.2 (41.1–52.1) |
Large language models | GPT-4-0125 | 0 | 0.920 (0.915–0.924) | 50.5 (46.5–55.3) |
Large language models | GPT-4-0125 | 1 | 0.922 (0.917–0.927) | 51.4 (47.3–56.1) |
Large language models | GPT-4-0125 | 5 | 0.925 (0.921–0.929) | 54.3 (49.7–59.2) |
Large language models | GPT-4-0125 | 10 | 0.920 (0.915–0.924) | 52.6 (48.4–57.3) |
Model results
PII vs. non-PII
There was substantial variation in performance between comparators (Fig. 1, Table S3). The Microsoft Azure de-identification service had the highest F1 score, 0.933 (95% CI 0.928–0.938), with precision of 0.916 (0.930–0.922) and recall of 0.950 (0.942–0.957), approaching clinician performance. The fine-tuned AnonCAT (FT-AnonCAT) model had an F1 score of 0.873 (0.864–0.882), precision of 0.981 (0.977–0.985) and recall of 0.787 (0.773–0.800) when redaction of healthcare professional titles, such as ‘Dr’, was not required. When professional titles were required to be redacted, the performance of FT-AnonCAT was lower, with an F1 score of 0.800 (0.843–0.858), precision of 0.981 (0.977–0.985) and recall of 0.676 (0.665–0.686).
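For reference, the classification metrics reported here reduce to token-level counts of true positives, false positives and false negatives. The sketch below is a minimal illustration using binary PII/non-PII labels; it is not the study’s exact alignment or scoring code.

```python
# Illustrative token-level scoring for binary PII vs. non-PII labels.
# `gold` and `pred` are parallel lists with 1 = PII token, 0 = non-PII token.

def precision_recall_f1(gold: list[int], pred: list[int]) -> tuple[float, float, float]:
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: the model redacts one non-PII word (false positive)
# and misses one PII word (false negative).
gold = [0, 0, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0]
print(precision_recall_f1(gold, pred))  # approximately (0.667, 0.667, 0.667)
```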
The best performing LLM was GPT-4-0125 with ten-shot learning, with F1 score 0.898 (0.876–0.915), precision 0.924 (0.914–0.933), and recall 0.874 (0.834–0.905) (Figure S1). This was followed by GPT-3.5-turbo-base with ten-shot learning, with F1 score 0.831 (0.807–0.851), precision 0.856 (0.812–0.892), and recall 0.807 (0.788–0.825).
GPT-3.5-turbo-base improved with few-shot learning: the F1 score rose from 0.530 (0.514–0.547) at zero shots to 0.831 (0.807–0.851) at ten shots, driven by improved precision; recall remained similar across all zero- and few-shot settings. On qualitative examination, with no or few in-context examples the LLM over-redacted records, removing clinically relevant information such as diagnoses or details of pathology.
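As an illustration of the zero- to few-shot setup, the sketch below assembles a redaction prompt with a configurable number of in-context examples for a chat-completion API; the system instruction, example records and model identifier are hypothetical placeholders rather than the prompts and deployments used in the study.

```python
# Illustrative few-shot prompt assembly for a chat-completion API.
# The system instruction, example records and model name are placeholders,
# not the study's own prompts.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

SYSTEM_PROMPT = (
    "Replace every piece of personally identifiable information in the clinical "
    "record with [REDACTED]. Return the record otherwise unchanged."
)

# (input record, expected redacted output) pairs used as in-context 'shots'.
SHOTS = [
    ("Reported by Dr Jones, ext. 4521.", "Reported by [REDACTED], ext. [REDACTED]."),
    ("Specimen received 12/04/2022.", "Specimen received [REDACTED]."),
]

def redact(record: str, n_shots: int = 0, model: str = "gpt-3.5-turbo") -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for original, redacted in SHOTS[:n_shots]:
        messages.append({"role": "user", "content": original})
        messages.append({"role": "assistant", "content": redacted})
    messages.append({"role": "user", "content": record})
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message.content
```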
The performance of the other models was more modest. The next best was Phi-3-mini-128k-instruct, which also improved with few-shot learning: the F1 score rose from 0.146 (0.140–0.153) at zero shots to 0.448 (0.430–0.467) at ten shots. However, precision and recall were markedly imbalanced; at ten shots, precision was 0.297 (0.282–0.314) and recall 0.904 (0.892–0.915). This was consistent with our qualitative examination, which showed over-redacted records.
Llama-3-8B-Instruct showed the same pattern of over-redaction, with best performance at five shots: F1 score 0.198 (0.181–0.216), precision 0.077 (0.069–0.085) and recall 0.990 (0.983–0.995). We did not observe any improvement from zero- to few-shot learning with Gemma-7b-IT: the F1 score was 0.089 (0.086–0.092) at zero shots and 0.041 (0.037–0.044) at ten shots, with precision of 0.021 (0.019–0.023) and recall of 0.905 (0.885–0.923) at ten shots. Qualitative examination of Gemma-7b-IT output showed that hallucinatory content was universally present.
Evaluation of text similarity and LLM hallucinations
BLEU scores were high for both proprietary models, reflecting close similarity between the redacted and original text: 0.929 (0.927–0.932) for the Microsoft Azure de-identification service and 0.948 (0.946–0.950) for FT-AnonCAT. These were similar to the BLEU score recorded between clinician-redacted and reference text (Table 2). The Levenshtein distances for the Microsoft Azure de-identification service and FT-AnonCAT were 35.7 (34.1–37.5) and 29.3 (28.0–30.6) respectively, both lower than the distance reported for clinician redaction. Qualitative examination of the output of both proprietary models showed no evidence of hallucination.
The best performing LLMs, GPT-4-0125 and GPT-3.5-turbo-base, had BLEU scores and Levenshtein distances consistently similar to the values recorded for clinician redaction, and showed no evidence of hallucination on qualitative examination.
Both Phi-3-mini-128k-instruct and Llama-3-8B-Instruct showed improved BLEU scores and Levenshtein distances across zero- and few-shot learning. On qualitative examination, both models showed evidence of hallucinatory behaviour at zero-, one- and five-shot learning. This included explanations of the task or output (e.g., ‘This text does not contain explicit identifiers, therefore the text remains unchanged’) alongside nonsensical strings (e.g., long spans of punctuation). We did not observe any hallucinations at ten shots.
We report consistently low BLEU scores and high Levenshtein distances across zero- and few-shot learning for Gemma-7b-IT. The output was grossly hallucinatory, including hallucinated medical history (‘Historical factors include prior trauma-related injury sustained one month back’), translations into other languages, and treatment recommendations (‘She’ll have an appointment to see her Dr tomorrow so we can discuss it then’). We therefore did not include a further evaluation of recall for individual PII categories for Gemma-7b-IT.
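A crude automated screen for this kind of over-generation, sketched below purely for illustration (the assessment in this study was qualitative), is to flag words in a model’s output that appear neither in the original record nor in the redaction placeholder.

```python
# Illustrative heuristic: flag output words that are absent from the original
# record and are not the redaction placeholder. Not the study's review process.
import re

PLACEHOLDER_TOKENS = {"redacted"}

def novel_words(original: str, output: str) -> set[str]:
    tokenise = lambda text: set(re.findall(r"[a-z0-9']+", text.lower()))
    return tokenise(output) - tokenise(original) - PLACEHOLDER_TOKENS

# Hypothetical records for demonstration only.
original = "No fracture seen. Follow-up not required."
output = "No fracture seen. She'll have an appointment to see her Dr tomorrow."
print(novel_words(original, output))  # words introduced by the model
```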
PII redaction per category
Names and dates were redacted consistently by all models (Table 4). However, redaction of medical record numbers, phone numbers, and ‘other unique identifiers’ varied across models. The Microsoft Azure de-identification service, GPT-4-0125, Llama-3-8B-Instruct and Phi-3-mini-128k-instruct had consistently high recall across PII categories; for these models, the lowest recall was for ‘other unique identifiers’. GPT-3.5-turbo-base and FT-AnonCAT showed more variation in recall per PII category, and both had lower recall for ‘other unique identifiers’, at 0.676 (0.648–0.705) and 0.575 (0.542–0.610) respectively.
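The per-category figures in Table 4 amount to recall computed within each PII class. A minimal sketch follows, assuming token-level annotations that carry a category label and a flag indicating whether the model redacted the token; the category names are illustrative.

```python
# Illustrative per-category recall: the fraction of annotated PII tokens in each
# category that the model redacted. Data layout and labels are assumed.
from collections import defaultdict

def recall_per_category(tokens):
    """tokens: iterable of (category, was_redacted) pairs for annotated PII tokens."""
    redacted, total = defaultdict(int), defaultdict(int)
    for category, was_redacted in tokens:
        total[category] += 1
        redacted[category] += int(was_redacted)
    return {category: redacted[category] / total[category] for category in total}

annotated = [("name", True), ("name", True), ("date", True),
             ("other_unique_identifier", False), ("other_unique_identifier", True)]
print(recall_per_category(annotated))
# {'name': 1.0, 'date': 1.0, 'other_unique_identifier': 0.5}
```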
Table 4
Recall per PII category. Results for AnonCAT are shown for fine-tuning with 365 examples; results for LLMs are shown using ten-shot learning.
Model type | Model name | Names | Other unique identifiers | Dates | Medical record numbers | Phone numbers |
Proprietary de-identification software | Microsoft Azure de-identification service | 0.985 (0.981–0.989) | 0.890 (0.870–0.909) | 0.984 (0.979–0.989) | 0.948 (0.933–0.963) | 0.972 (0.954–0.988) |
Proprietary de-identification software | FT-AnonCAT | 0.883 (0.870–0.896) | 0.575 (0.542–0.610) | 0.917 (0.901–0.932) | 0.884 (0.853–0.914) | 0.966 (0.945–0.984) |
Large language models | Llama-3-8B-Instruct | 1.000 (1.000–1.000) | 0.932 (0.891–0.966) | 0.980 (0.963–0.993) | 0.989 (0.971–1.000) | 1.000 (1.000–1.000) |
Large language models | Phi-3-mini-128k-instruct | 0.978 (0.969–0.985) | 0.814 (0.790–0.837) | 0.940 (0.926–0.953) | 0.973 (0.961–0.983) | 0.983 (0.963–0.997) |
Large language models | GPT-3.5-turbo-base | 0.911 (0.893–0.928) | 0.676 (0.648–0.705) | 0.882 (0.860–0.902) | 0.866 (0.840–0.890) | 0.935 (0.900–0.967) |
Large language models | GPT-4-0125 | 0.988 (0.983–0.992) | 0.841 (0.818–0.862) | 0.970 (0.960–0.979) | 0.966 (0.953–0.977) | 0.994 (0.986–1.000) |
PII redaction per dataset
Of the best performing models, AnonCAT showed the least variation in performance across datasets, followed by the Microsoft Azure de-identification service (Fig. 2, Table S4). GPT-4-0125, the best performing LLM, varied widely across datasets, with the highest performance in the general histopathology dataset (F1 score 0.949, 0.938–0.958) and the lowest in the musculoskeletal radiograph dataset (F1 score 0.672, 0.580–0.744). Likewise, GPT-3.5-turbo-base, Llama-3-8B-Instruct and Phi-3-mini-128k-instruct showed greater performance shifts across datasets than either proprietary model.
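The 95% confidence intervals reported in this section can be obtained with a percentile bootstrap over per-record scores; the sketch below illustrates that general approach under assumed resampling settings and is not necessarily the exact procedure used in the study.

```python
# Illustrative percentile bootstrap for a 95% CI around a mean per-record score.
# Resampling unit, iteration count and seed are assumptions for this sketch.
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]  # resample records with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

per_record_f1 = [0.91, 0.88, 0.95, 0.79, 0.93, 0.97, 0.85]  # hypothetical values
print(bootstrap_ci(per_record_f1))
```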