|Reineking, B; Schröder, B: Constrain to perform: Regularization of habitat models, Ecological Modelling, 193, 675-690 (2006)|
Predictive habitat models are an important tool for ecological research and conservation. A major cause of unreliable models is excessive model complexity, and regularization methods aim to improve the predictive performance by adequately constraining model complexity. We compare three regularization methods for logistic regression: variable selection, lasso, and ridge. They differ in the way model complexity is measured: variable selection uses the number of estimated parameters, the lasso uses the sum of the absolute values of the parameter estimates, and the ridge uses the sum of the squared values of the parameter estimates. We performed a simulation study with environmental data of a real landscape and artificial species occupancy data. We investigated the effect of three factors on relative model performance: (1) the number of parameters (16, 10, 6, 2) in the 'true' model that determined the distribution of the artificial species, (2) the prevalence, i.e. the proportion of sites occupied by the species, and (3) the sample size (measured in events per variable, EPV). Regularization improved model discrimination and calibration. However, no regularization method performed best under all circumstances: the ridge generally performed best in the 16-parameter scenario. The lasso generally performed best in the 10-parameter scenario. Variable selection with AIC was best at large sample sizes (EPV >= 10) when less than half of the variables influenced the species distribution. However, at low sample sizes (EPV 10), ridge and lasso always performed best, regardless of the parameter scenario or prevalence. Overall, calibration was best in ridge models. Other methods showed overconfidence, particularly at low sample sizes. The percentage of correctly identified models was low for both lasso and variable selection. Variable selection should be used with caution. Although it can produce the best performing models under certain conditions, these situations are difficult to infer from the data. Ridge and lasso are risk-averse model strategies that can be expected to perform well under a wide range of underlying species-habitat relationships, particularly at small sample sizes. (C) 2005 Elsevier B.V. All rights reserved.