Identifying spurious variables in overfit spatial models

Thursday, August 5, 2021

Link To Share This Poster: https://cdmcd.co/dEyWdw
Live Discussion Link: https://cdmcd.co/3wqGX7

Volker Bahn, Department of Biological Sciences, Wright State University, Dayton, OH

Background/Question/Methods
With the advent of large spatial datasets and increased computing power in the 1990s, correlative spatial modeling became an important tool in ecology and conservation. Correlative spatial models search for patterns and explanatory variables in an exploratory fashion. Combined with modern flexible models, this approach is prone to overfitting: finding correlations where there are none, or overstating their explanatory or even predictive power based on inflated goodness-of-fit. I have previously introduced tools for true model evaluation, as opposed to mere goodness-of-fit, namely spatially independent leave-one-out cross-validation (SILOOCV). Here I investigate the usefulness of SILOOCV for identifying variables that appear to have a functional connection to a spatial pattern but actually do not. I identify these variables by omitting them from the model and contrasting the resulting decrease in goodness-of-fit with the decrease in predictive power as determined by SILOOCV. My model system consists of species distribution models fitted within a realistic simulation on a 50 x 50 grid, which lets me know and control the actual functional relevance of candidate variables and thus evaluate how well my approach detects spurious variables with misleadingly high goodness-of-fit.
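The omission comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the linear model, the RMSE metric, the Gaussian-smoothed random fields, the 15 x 15 grid (smaller than the 50 x 50 grid, to keep the run short), the buffer radius, and all function names are hypothetical choices made for the example. SILOOCV is approximated here by excluding every cell within a buffer distance of the held-out cell from the training data.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept (assumed stand-in model)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def rmse(obs, pred):
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def resubstitution_rmse(X, y):
    """Goodness-of-fit: the model is tested on its own training data."""
    return rmse(y, fit_predict(X, y, X))

def siloocv_rmse(X, y, coords, buffer_radius):
    """Leave-one-out CV that also drops all cells within buffer_radius."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        dist = np.hypot(*(coords - coords[i]).T)
        train = dist > buffer_radius  # excludes the focal cell and its neighbors
        preds[i] = fit_predict(X[train], y[train], X[i:i + 1])[0]
    return rmse(y, preds)

def omission_drops(X, y, coords, buffer_radius):
    """For each variable, return the RMSE increase caused by omitting it,
    as (resubstitution increase, SILOOCV increase)."""
    full_fit = resubstitution_rmse(X, y)
    full_cv = siloocv_rmse(X, y, coords, buffer_radius)
    drops = []
    for j in range(X.shape[1]):
        X_red = np.delete(X, j, axis=1)
        drops.append((resubstitution_rmse(X_red, y) - full_fit,
                      siloocv_rmse(X_red, y, coords, buffer_radius) - full_cv))
    return drops

def autocorrelated_field(coords, corr_range, rng):
    """Spatially autocorrelated field: Gaussian-weighted average of white noise."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    field = np.exp(-0.5 * (d / corr_range) ** 2) @ rng.standard_normal(len(coords))
    return (field - field.mean()) / field.std()

rng = np.random.default_rng(0)
side = 15  # assumption: smaller than the 50 x 50 grid, for speed
gx, gy = np.meshgrid(np.arange(side), np.arange(side))
coords = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)

# Three candidate variables; only column 0 drives the simulated response.
X = np.column_stack([autocorrelated_field(coords, 3.0, rng) for _ in range(3)])
y = 1.5 * X[:, 0] + autocorrelated_field(coords, 3.0, rng)

drops = omission_drops(X, y, coords, buffer_radius=6.0)
for j, (d_fit, d_cv) in enumerate(drops):
    print(f"variable {j}: fit RMSE increase {d_fit:.3f}, SILOOCV RMSE increase {d_cv:.3f}")
```

In this sketch, a variable whose omission clearly worsens the resubstitution error but barely changes, or even improves, the SILOOCV error would be a candidate spurious variable. The exact numbers depend on the random seed and on the assumed buffer radius.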
Results/Conclusions
Without independent data in model selection, species distribution models had an elevated probability of selecting functionally unconnected variables. This was caused by the elevated probability that spatially autocorrelated variables correlate with one another by random chance. Variables not functionally connected to the distribution of the virtual species showed a significantly larger drop in performance from an evaluation based on goodness-of-fit (also known as resubstitution, i.e., training data = test data) to an independent evaluation (SILOOCV) than variables that actually were functionally involved in the distribution of the simulated species. In particular, unconnected variables that initially showed a high correlation with distributions by random chance, and thus contributed to the overfit of the model, experienced a significant rise in error estimates when tested rigorously. Variables with matching spatial autocorrelation patterns (range) were prone to correlate by random chance. The performance of SILOOCV in identifying such spuriously fitting variables was comparable to that of an evaluation based on new, independent test data, which are hard to obtain in real life. Therefore, I recommend the use of SILOOCV to prevent spurious results in correlative spatial models, which tend to misidentify unimportant variables as good explanatory variables or predictors.
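The chance correlation between independent, spatially autocorrelated variables can be illustrated with a small simulation. This is a sketch under assumptions that are not part of the study: the fields are Gaussian-weighted averages of white noise on a 15 x 15 grid, both fields in a pair share the same autocorrelation range, and the function names, ranges, and number of pairs are hypothetical. A range of 0 denotes plain white noise without spatial structure.

```python
import numpy as np

def make_smoother(side, corr_range):
    """Gaussian distance-weight matrix that turns white noise into a smooth field."""
    gx, gy = np.meshgrid(np.arange(side), np.arange(side))
    coords = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-0.5 * (d / corr_range) ** 2)

def mean_abs_chance_correlation(side, corr_range, n_pairs, rng):
    """Mean |Pearson r| between pairs of independently generated fields."""
    n = side * side
    # corr_range == 0 means no smoothing, i.e. spatially independent noise.
    w = np.eye(n) if corr_range == 0 else make_smoother(side, corr_range)
    abs_r = []
    for _ in range(n_pairs):
        a = w @ rng.standard_normal(n)
        b = w @ rng.standard_normal(n)  # generated independently of a
        abs_r.append(abs(np.corrcoef(a, b)[0, 1]))
    return float(np.mean(abs_r))

rng = np.random.default_rng(1)
results = {r: mean_abs_chance_correlation(15, r, 200, rng) for r in (0, 2, 4)}
for corr_range, value in results.items():
    print(f"range {corr_range}: mean |r| between unrelated fields = {value:.3f}")
```

Because smoothing reduces the effective number of independent observations on the grid, the unrelated autocorrelated fields should show a noticeably larger mean absolute correlation than the white-noise pairs, even though no pair shares any functional connection.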