Comparing multivariate datasets is the basis for identifying the important factors that affect the structure and composition of ecological, biological and evolutionary systems. Procrustes analysis is a powerful and popular method for matching and comparing two multivariate data matrices. However, it relies on a least-squares criterion, which is sensitive to outliers, extreme values, and missing data. The objective of this study is to evaluate resistant Procrustes methods that are robust to outliers and to demonstrate their application in ecology. Several resistant methods have been proposed in the literature, including Repeated median, Huber, and Biweight. However, their ability to handle outliers when comparing ecological datasets has not been thoroughly evaluated and compared. In this work, we conducted a comprehensive simulation study comparing the performance of these three resistant methods (Huber, Biweight and Repeated median) with that of standard Procrustes analysis. We simulated scenarios of matching data matrices P and Q containing 13%, 20% or 40% outliers.
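To illustrate why the least-squares criterion is sensitive to outliers, the following minimal Python sketch fits a two-dimensional similarity transformation (rotation angle, scaling, translation) from P to Q with an SVD-based least-squares solution, first on clean data and then on contaminated data. Everything in it is an assumption made for illustration and is not taken from the study: the function name `fit_similarity`, the true parameters (30 degrees, scaling 2.0, translation 1.0), the 54 random landmarks, and the choice to create outliers by shifting 20% of the rows of Q by 10 units.

```python
import numpy as np

def fit_similarity(P, Q):
    """Least-squares fit of Q ~ s * P @ R.T + t for 2-D landmarks (hypothetical helper)."""
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mp, Q - mq
    U, S, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection
    D = np.array([1.0, d])
    R = Vt.T @ np.diag(D) @ U.T              # optimal rotation
    s = (S * D).sum() / (Pc ** 2).sum()      # optimal scaling
    t = mq - s * R @ mp                      # optimal translation
    angle = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return angle, s, t

# Illustrative setup (assumed values, not the study's design).
rng = np.random.default_rng(0)
n, frac = 54, 0.20
P = rng.normal(size=(n, 2))
P -= P.mean(axis=0)                          # centre P for simplicity
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
Q_clean = 2.0 * P @ R_true.T + 1.0

# Contaminate 20% of the rows of Q with a gross shift.
Q_bad = Q_clean.copy()
idx = rng.choice(n, size=int(frac * n), replace=False)
Q_bad[idx] += 10.0

clean_fit = fit_similarity(P, Q_clean)
bad_fit = fit_similarity(P, Q_bad)
print("clean fit        (angle, scale, translation):", clean_fit)
print("contaminated fit (angle, scale, translation):", bad_fit)
```

On the clean matrices the fit recovers the assumed parameters, whereas the shifted rows pull the estimated translation away from its true value. This is the behaviour that resistant methods are meant to avoid.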
Results/Conclusions
According to the simulation results, Biweight consistently outperformed the other three methods, followed by Repeated median. Biweight remained resistant even with 40% outliers: its recovered angle, scaling and translation averaged 29.5 degrees, 2.0 and 1.0, respectively. Repeated median performed well in the 13% and 20% outlier scenarios, but its performance decreased significantly as the proportion of outliers increased to 40%. Huber and standard Procrustes analysis were vulnerable to outliers. We also considered the potential influence of sample size (e.g., 15, 54, and 90 landmarks) and found that it did not affect the performance of any of the four methods. Overall, we recommend Biweight and Repeated median for comparing ecological data matrices with substantial outliers.
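One common way to obtain a biweight-type resistant fit is iteratively reweighted least squares with Tukey biweight weights. The Python sketch below shows that general idea only and should not be read as the procedure evaluated in the study. The function names, the tuning constant 4.685, the residual scale taken as 1.4826 times the median residual length, the assumed true parameters (30 degrees, 2.0, 1.0), the noise level, and the 13% contamination by a 10-unit shift are all illustrative assumptions.

```python
import numpy as np

def weighted_fit(P, Q, w):
    """Weighted least-squares fit of Q ~ s * P @ R.T + t (hypothetical helper)."""
    w = w / w.sum()
    mp, mq = w @ P, w @ Q
    Pc, Qc = P - mp, Q - mq
    U, S, Vt = np.linalg.svd((Pc * w[:, None]).T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection
    D = np.array([1.0, d])
    R = Vt.T @ np.diag(D) @ U.T
    s = (S * D).sum() / (w @ (Pc ** 2).sum(axis=1))
    t = mq - s * R @ mp
    return s, R, t

def biweight_procrustes(P, Q, c=4.685, n_iter=50):
    """IRLS with Tukey biweight weights, started from the least-squares fit."""
    w = np.ones(len(P))
    for _ in range(n_iter):
        s, R, t = weighted_fit(P, Q, w)
        r = np.linalg.norm(Q - (s * P @ R.T + t), axis=1)   # residual lengths
        sigma = max(1.4826 * np.median(r), 1e-12)           # assumed robust scale
        u = r / (c * sigma)
        w = np.where(u < 1.0, (1.0 - u ** 2) ** 2, 0.0)     # biweight weights
    angle = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return angle, s, t

# Illustrative data (assumed values, not the study's design).
rng = np.random.default_rng(1)
n = 54
P = rng.normal(size=(n, 2))
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
Q = 2.0 * P @ R_true.T + 1.0 + rng.normal(scale=0.05, size=(n, 2))
idx = rng.choice(n, size=int(0.13 * n), replace=False)
Q[idx] += 10.0                                # gross outliers in 13% of rows

angle, scale, trans = biweight_procrustes(P, Q)
print("biweight fit (angle, scale, translation):", angle, scale, trans)
```

In this sketch the shifted rows receive zero weight after a few iterations, so the estimates are driven by the uncontaminated landmarks. With heavier contamination, a least-squares starting point may no longer be adequate and a more resistant initial fit would be needed.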