Dissemination & Implementation Science
McCall Schruff, B.A.
Research Assistant
University of Mississippi
Oxford, Mississippi
Carolyn E. Humphrey
Student
University of Mississippi
Aurora, Illinois
Jeffrey M. Pavlacic, M.A.
PhD Candidate
University of Mississippi
Oxford, Mississippi
John Young, Ph.D.
Professor
University of Mississippi
University, Mississippi
Introduction: Technological innovation has dramatically influenced many aspects of mental health science (Holmes et al., 2014), yet techniques for evaluating patient-provider interaction remain largely unchanged (Campbell et al., 2014). Natural language processing (NLP) and artificial intelligence (AI) software may be particularly useful for efficiently assessing treatment adherence and therapist performance. A commercially available software platform (Lyssn.io) facilitates AI evaluation of therapeutic interactions by automatically assigning a Cognitive Therapy Rating Scale (CTRS; Young & Beck, 1980) score. The CTRS is one of the most commonly used instruments for evaluating general CBT fidelity, which can support the dissemination and implementation of these techniques (Goldberg et al., 2020). However, this aspect of the Lyssn software has not yet been extensively examined for consistency with human raters, which forms the basis for the current study. Given statements of internal piloting and refinement on the Lyssn website, it was expected that scores from all sources would be highly reliable. It should be noted that none of the authors have any affiliation with Lyssn; they are merely users of its commercially available product.
Method: Data collection is ongoing, and only initial pilot data are reported here. These data comprise 20 role-play sessions conducted by three inexperienced undergraduates (all female; ages 21, 22, and 24), each using a self-determined module from the Unified Protocol (Barlow et al., 2017). Two experienced raters (the first and second authors) and Lyssn assigned CTRS scores to each video. The consistency of these ratings was analyzed using mixed-model intraclass correlations (ICCs; McGraw & Wong, 1996; Shrout & Fleiss, 1979), first between the two human raters and then with the AI's scores included.
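To illustrate the reliability analysis, a minimal sketch of the ICC computation is given below, assuming the CTRS totals are arranged in a long-format table with one row per session-rater pair. The column names, placeholder values, and use of the pingouin library are illustrative assumptions rather than the authors' actual analysis code.

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Placeholder CTRS totals for illustration only (not the study's data):
# 20 sessions, each scored by two human raters and the AI.
scores = pd.DataFrame({
    "session": np.tile(np.arange(1, 21), 3),
    "rater": np.repeat(["rater_1", "rater_2", "lyssn"], 20),
    "ctrs": rng.integers(20, 50, size=60),
})

# Reliability between the two human raters only.
humans = scores[scores["rater"] != "lyssn"]
icc_humans = pg.intraclass_corr(data=humans, targets="session",
                                raters="rater", ratings="ctrs")

# Reliability with the AI's scores included.
icc_all = pg.intraclass_corr(data=scores, targets="session",
                             raters="rater", ratings="ctrs")

# The two-way mixed-effects coefficients (McGraw & Wong, 1996) appear in
# the ICC3 (single-rater) and ICC3k (average-of-raters) rows of the output.
print(icc_humans[icc_humans["Type"].isin(["ICC3", "ICC3k"])])
print(icc_all[icc_all["Type"].isin(["ICC3", "ICC3k"])])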
Results: Interrater reliability between the two human raters was good (ICC = 0.76). When the AI's scores were included, reliability remained moderately high (ICC = 0.72), only slightly lower than the human-only value.
Discussion: These results suggest that the AI is reliable and produces ratings similar to those of trained human raters. Automated CTRS ratings of this kind represent a potentially powerful technique for assessing treatment fidelity and therapist competence. This may be particularly useful in settings that lack substantial resources to evaluate these constructs (e.g., community mental health). It may also be valuable in training environments as a tool to provide clinician trainees with detailed, automated feedback on their application of therapeutic techniques at a frequency far exceeding what would be feasible from their human supervisors. The software also offers tools to track changes in performance over time, which could enable evaluation of the efficacy of specific practice exercises directed toward clinical improvement.