Is Multiclass Automatic Text De-Identification Worth the Effort?

Academic Article


  • © Georg Thieme Verlag KG 2018. Summary Objectives: Automatic de-identification to remove protected health information (PHI) from clinical text can use a binary model that replaces redacted text with a generic tag (e.g., ), or a multiclass model that retains more class information (e.g., ). Binary models are easier to develop, but produce text that is potentially less informative. We investigated whether building a multiclass de-identification model is worth the extra effort. Methods: Using the 2014 i2b2 dataset, we compared the accuracy and the impact on document readability of the two approaches. In the first experiment, we generated one binary and two multiclass versions trained with the same machine-learning algorithm, Conditional Random Field (CRF). Accuracy (recall, precision, F-score) and secondary metrics (e.g., training time, testing time, minimum memory required) were measured. In the second experiment, three reviewers assessed the readability of two documents redacted using the binary and multiclass methods. We calculated a pooled Kappa to estimate the inter-rater agreement. Results: The multiclass model did not demonstrate a clear accuracy advantage, with lower recall (-1.9%) and only slightly better precision (+0.6%), despite requiring additional computing resources. The three raters reached very high agreement (Kappa = 0.975, 95% confidence interval (0.946, 1.00), p < 0.0001) that the binary and multiclass models have the same impact on document readability. Conclusions: This study suggests that developing a more sophisticated classification of PHI may not be worth the effort in terms of either system accuracy or the usefulness of the output.
  • Digital Object Identifier (doi)

  • Author List

  • Bui DDA; Redden DT; Cimino JJ
  • Start Page

  • 177
  • End Page

  • 184
  • Volume

  • 57
  • Issue

  • 4