De-identifying Swedish EHR text using public resources in the general domain

Chomutare, Taridzo; Yigzaw, Kassaye Yitbarek; Budrionis, Andrius; Makhlysheva, Alexandra; Godtliebsen, Fred; Dalianis, Hercules

Published version (PDF)

Date

2020

Type

Journal article
Peer reviewed

Abstract

Developing rule-based systems or training machine learning-based models for de-identifying electronic health record (EHR) clinical notes normally requires sensitive data, which poses significant problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data: (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model based on recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed that precision and recall improved from 55.62% and 80.02%, respectively, with the base EHR embedding layer to 85.01% and 87.15% when Wikipedia word vectors were added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text, which could be useful where the data is sensitive and the language is low-resource.
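
As an illustration of how public word vectors can initialize the embedding layer of such a model, the following is a minimal sketch and not the authors' implementation. The function name `build_embedding_matrix`, the toy vocabulary, and the three-dimensional vectors are all assumptions made for the example; real Wikipedia vectors would be loaded from a file and typically have a few hundred dimensions. Words found in the public vectors reuse them, and the remaining words get small random values that would be learned during training.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build one embedding row per vocabulary word.

    Words present in `pretrained` (e.g. public Wikipedia vectors) reuse
    that vector; all other words get small random values. Returns the
    matrix and the fraction of the vocabulary covered by `pretrained`.
    Illustrative only: names and shapes here are assumptions.
    """
    rng = random.Random(seed)
    matrix = []
    hits = 0
    for word in vocab:
        vec = pretrained.get(word.lower())
        if vec is not None and len(vec) == dim:
            matrix.append(list(vec))  # copy so the source vectors stay untouched
            hits += 1
        else:
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    coverage = (hits / len(vocab)) if vocab else 0.0
    return matrix, coverage


# Toy stand-in for public Swedish word vectors (hypothetical values).
wiki_vectors = {
    "patienten": [0.10, 0.20, 0.30],
    "stockholm": [0.40, 0.50, 0.60],
    "anna": [0.70, 0.80, 0.90],
}

# Toy vocabulary taken from (pseudonymized) clinical text; the last word
# is not in the public vectors and falls back to random initialization.
ehr_vocab = ["patienten", "Stockholm", "Anna", "blodtryck"]

matrix, coverage = build_embedding_matrix(ehr_vocab, wiki_vectors, dim=3)
print(f"rows: {len(matrix)}, coverage: {coverage:.2f}")
```

In a deep learning framework, a matrix like this would usually be passed as the initial weights of the embedding layer that feeds the recurrent network. Whether those weights are then frozen or fine-tuned is a design choice the abstract does not specify.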

Publisher

IOS Press

Citation

Chomutare T, Yigzaw KY, Budrionis A, Makhlysheva A, Godtliebsen F, Dalianis H. De-identifying Swedish EHR text using public resources in the general domain. Studies in Health Technology and Informatics. 2020;270:148-152.
