University of California, Berkeley Palo Alto, CA, United States
Austin Ho1, Zara Izadi2, Gabriela Schmajuk3, Jinoos Yazdany4, Suzanne Tamang5 and Milena Gianfrancesco6, 1University of California, Berkeley, Berkeley, CA, 2University of California San Francisco, San Francisco, CA, 3UCSF / SFVA, San Francisco, CA, 4UCSF, San Francisco, CA, 5Stanford Center for Population Health Sciences, Redwood City, CA, 6University of California, San Francisco, San Francisco, CA
Background/Purpose: Varicella zoster virus (VZV) infection can be associated with significant morbidity in immunosuppressed individuals. However, VZV infections are often documented only in unstructured fields (e.g., clinical notes). Natural language processing (NLP) methods are increasingly being explored in healthcare to efficiently label unstructured data related to medication safety outcomes. In this study, we used NLP methods to capture VZV infection mentions and classify them as active or historical occurrences using EHR data from a large academic center.
Methods: We used structured and unstructured data from an EHR with over 800,000 patients from a university-based health system from 2012-2019. Individuals prescribed an immunosuppressant with ≥ 2 encounters in the EHR ≥ 30 days apart were selected. For training purposes, we used a 20/80 train-test split. A rule-based text-mining algorithm (CLEVER) was first used to identify positive mentions of VZV across clinical notes of this patient population (Figure 1). Using these preprocessed notes, we next determined whether each note contained a zoster vaccine mention or a zoster disease mention. We applied tokenization, lemmatization, and stemming before using the Bag of Words (BOW) model to find the frequency of the word 'vaccine' and its related forms (e.g., vaccination, immunization). We further classified disease mentions as either active or historical occurrences using a k-nearest neighbor (KNN) classifier. Data on antiviral medications used to treat zoster infection, and corresponding dosage information within 30 days of the clinical note mention, were obtained from structured fields and also included in the model. We evaluated our approach by calculating the sensitivity, specificity, and accuracy (proportion of mentions correctly classified) of the BOW model and the KNN classifier in the test set, which was manually chart reviewed.
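The two classification steps described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the stem list `VACCINE_STEMS`, the prefix matching that stands in for lemmatization and stemming, the any-vaccine-term decision rule, and the two KNN features (a count of historical cue words and a flag for an antiviral order within 30 days) are all assumptions made for illustration, and the training rows are invented toy values.

```python
import math
import re
from collections import Counter

# Hypothetical stems; the abstract names only 'vaccine' and related forms.
VACCINE_STEMS = ("vaccin", "immuniz")


def tokenize(text):
    """Lowercase the note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Return token -> frequency for one note."""
    return Counter(tokenize(text))


def vaccine_term_count(text):
    """Sum the frequencies of tokens that start with a vaccine-related stem.

    Prefix matching is a crude stand-in for lemmatization and stemming.
    """
    bow = bag_of_words(text)
    return sum(n for tok, n in bow.items() if tok.startswith(VACCINE_STEMS))


def classify_mention(text):
    """Assumed rule: any vaccine-related term marks a vaccine mention."""
    return "vaccine" if vaccine_term_count(text) > 0 else "disease"


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy features: (historical cue word count, antiviral order within 30 days).
train_X = [(0, 1), (0, 1), (1, 1), (2, 0), (1, 0), (3, 0)]
train_y = ["active", "active", "active", "historical", "historical", "historical"]

if __name__ == "__main__":
    note = "Patient received zoster vaccination; immunization record updated."
    print(classify_mention(note))
    print(knn_predict(train_X, train_y, (0, 1)))
```

In this sketch a disease mention with a recent antiviral order and no historical cue words falls nearest to the toy "active" examples. A real feature set would need to be derived from the annotated notes and the structured medication data.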
Results: The training set and test set included data from 16,344 individuals, 27,977 notes, and 36,042 medication orders. Demographics of these individuals are described in Table 1. Using the BOW model, we were able to categorize zoster vaccine versus disease mentions with an accuracy of 92%. Our KNN classifier achieved an accuracy of 64% in classifying historical versus active occurrences of zoster infection. Sensitivity and specificity results relating to the two models are shown in Table 2.
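For reference, the three reported metrics can be computed from model predictions and chart-review labels as in the sketch below. The function name `evaluate`, the choice of which class counts as positive, and the example labels are illustrative assumptions and are not taken from the study data.

```python
def evaluate(y_true, y_pred, positive):
    """Return sensitivity, specificity, and accuracy for a binary task.

    y_true holds the chart-review labels, y_pred the model labels, and
    `positive` names the class treated as the positive outcome.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    total = len(pairs)
    return {
        # Share of true positives that the model caught.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of true negatives that the model caught.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Proportion of all mentions correctly classified.
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


if __name__ == "__main__":
    truth = ["active", "active", "active", "historical", "historical"]
    preds = ["active", "active", "historical", "historical", "active"]
    print(evaluate(truth, preds, positive="active"))
```

Accuracy alone can hide an imbalance between the two classes, which is why sensitivity and specificity are reported alongside it.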
Conclusion: The NLP approach used in this study distinguished zoster vaccine mentions from zoster disease mentions in our clinical notes with high accuracy. However, there is still room for improvement in classifying historical versus active occurrences of VZV infection. These results demonstrate that automated processes can streamline the labeling of clinical notes for use in clinical research. In the future, adding more features to the KNN classifier and further preprocessing the dataset may help improve performance.
Figure 1. Automated labelling pipeline for clinical notes.
Table 1. Demographics of individuals receiving immunosuppressants in the dataset (N = 16,344).
Table 2. Predictive performance of the BOW model and the KNN model.
Disclosures: A. Ho, None; Z. Izadi, None; G. Schmajuk, None; J. Yazdany, AstraZeneca, Gilead, Bristol-Myers Squibb (BMS), Aurinia, Pfizer; S. Tamang, None; M. Gianfrancesco, Pfizer.