(474.19) Sentiment Analysis Machine Learning Model Congruence: A Case Study Using Neuroscience Module Evaluations
Sunday, April 3, 2022
10:15 AM – 12:15 PM
Location: Exhibit/Poster Hall A-B - Pennsylvania Convention Center
Poster Board Number: C19
Jeffrey Plochocki (UCF College of Medicine), Jonathan Kibble (UCF College of Medicine)
Introduction Medical school faculty and administrators regularly assess sentiment in student-generated textual data, such as instructor and course evaluations. Machine learning models (MLMs) that automate and systematize sentiment analysis are commercially available; however, their congruency has not been extensively tested. We compared sentiment polarity derived from human analysis and five MLMs to test the hypothesis that they yield significantly correlated output.
Methods Student evaluations (n=116) of the neuroscience module at the UCF College of Medicine were collected and anonymized. Students were asked to evaluate the strengths of the module (n=108) and provide suggestions for improvement (n=102). Responses were subjected to sentiment analysis by five commercially available MLMs and two module faculty reviewers. Sentiment was classified as positive (1), neutral (0), or negative (-1).
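Many commercial sentiment services return a continuous score rather than the trichotomous labels used here, so a thresholding step is implied. The sketch below is illustrative only (the abstract does not describe the models' internals); the neutral band of 0.05 is an assumed parameter, not taken from the study.

```python
# Hypothetical sketch: mapping a continuous sentiment score (e.g., in
# [-1, 1], as many commercial models return) onto the trichotomous
# polarity scale used in the study. The 0.05 neutral band is an
# illustrative assumption, not a value from the abstract.

def to_polarity(score: float, neutral_band: float = 0.05) -> int:
    """Classify a continuous score as positive (1), neutral (0), or negative (-1)."""
    if score > neutral_band:
        return 1
    if score < -neutral_band:
        return -1
    return 0

# Example scores for three hypothetical responses
scores = [0.82, 0.01, -0.47]
polarities = [to_polarity(s) for s in scores]  # [1, 0, -1]
```

Narrowing or widening the neutral band shifts how many responses land in the neutral class, which is one plausible source of incongruence between models that use different internal thresholds.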
Results Sentiment polarity as assessed by the two reviewers was significantly correlated (r=0.66, p<0.05). Reviewer assessments were congruent for 73.8% (n=155) of responses. Congruence was greatest for responses on strengths of the module (92.6%, n=100) compared to suggestions for improvement (54.4%, n=55). Congruency among the MLMs was 38.1% (n=80) for all responses, 60.1% (n=65) for module strengths, and 14.7% (n=15) for suggestions for improvement. A correlation matrix showed moderate correlations among the reviewers and MLMs (range of r=0.41-0.62, p<0.05). Congruence among all reviewers and MLMs occurred for 34.3% (n=72) of responses, with maximal incongruence occurring for only 2.4% (n=5) of responses. Again, congruence was greatest for responses on module strengths (58.3%, n=63) compared to suggestions for improvement (8.8%, n=9). All methods assessed the responses on strengths of the module to be more positive than suggestions for improvement. With all methods combined, responses on strengths of the module scored an average of 0.79 compared to suggestions for improvement, which scored -0.34.
Conclusion Sentiment polarity derived from human analysis and MLMs is significantly correlated, although the coefficients reflect only modest linear relationships. The MLMs are less congruent with one another than the human reviewers. All methods demonstrate greater congruence when assessing more positive responses (i.e., module strengths) than more negative ones (i.e., suggestions for improvement). Additional refinement of MLMs may be necessary before they can be applied with consistency in medical education settings.