A new robust model selection method in GLM with application to ecological data
 D. M. Sakate^{1} and
 D. N. Kashid^{1}
https://doi.org/10.1186/s4006801600607
© Sakate and Kashid. 2016
Received: 23 December 2015
Accepted: 8 February 2016
Published: 24 February 2016
Abstract
Background
Generalized linear models (GLM) are widely used to model social, medical and ecological data. Choosing predictors for building a good GLM is a widely studied problem. Likelihood-based procedures such as the Akaike information criterion and the Bayes information criterion are usually used for model selection in GLM. The non-robustness of likelihood-based procedures in the presence of outliers or of deviation from the assumed distribution of the response is widely studied in the literature.
Results
The deviance-based criterion (DBC) is modified to define a robust and consistent model selection criterion called the robust deviance-based criterion (RDBC). Further, a bootstrap version of RDBC is also proposed. A simulation study is performed to compare the proposed model selection criteria with the existing one. It indicates that the performance of the proposed criteria is comparable with that of the existing one. A key advantage of the proposed criterion is that it is very simple to compute.
Conclusions
The proposed model selection criterion is applied to arboreal marsupials data and model selection is carried out. The proposed criterion can be applied to data from any discipline mitigating the effect of outliers or deviation from the assumption of distribution of response. It can be implemented in any statistical software. In this article, R software is used for the computations.
Keywords
Background
In the last two decades, generalized linear models (GLM) have emerged as a useful tool for developing models from ecological data to explain the nature of ecological phenomena. GLM accommodate a wide range of response variables, such as ‘presence-absence’ and ‘count’ responses. They can also be used to estimate survivorship, as can be seen in the conservation literature. A GLM builds a predictive model for a response variable based on the predictors. Given data on the response and the predictors, the model is fitted using maximum likelihood estimates (MLE) of the unknown regression coefficients. Under certain regularity conditions, the MLE is a consistent and asymptotically normal estimator of the regression coefficients in GLM (McCullagh and Nelder 1989). In the presence of overdispersion, maximum quasi-likelihood estimation (MQLE) (Wedderburn 1974; McCullagh and Nelder 1989; Heyde 1997) is a popular estimation method. In the process of model building, the researcher may be confronted with a pool of predictors, some of which may be redundant. If such predictors are included in the model, the response will be predicted with less accuracy. Redundant predictors in the fitted GLM therefore need to be identified and eliminated from the model on the basis of the observed data.
In the linear regression setup, Murtaugh (2009) evaluated the predictive power of various variable selection methods for ecological and environmental data sets. GLM is a wider class of models, with linear regression as the particular case in which the distribution of the response is normal. Many methods for variable selection in GLM are available in the literature. When the likelihood is known, the Akaike information criterion (AIC) (Akaike 1974), the Bayes information criterion (BIC) (Akaike 1978) and the distribution function criterion (DFC) (Sakate and Kashid 2013) find applications. Sakate and Kashid (2014) proposed a deviance-based criterion (DBC) for model selection in GLM which uses the MLE of the parameters. BIC and DBC are consistent model selection criteria while AIC is not. Sakate and Kashid (2014) empirically established the superiority of DBC over BIC. They also showed that DBC performs better than the \(\bar{R}^{2}\) proposed by Hu and Shao (2008).
In practice, the data available for fitting a GLM may be contaminated and the MLE fit of the GLM may not be appropriate. In fact, both MLE and MQLE share the same non-robustness against contamination. The non-robustness of MLE in GLM is extensively studied in the literature (Pregibon 1982; Stefanski et al. 1986; Künsch et al. 1989; Morgenthaler 1992; Ruckstuhl and Welsh 2001). Hence, the use of MLE or MQLE in the presence of contaminated data may give misleading results. The non-robustness of MLE to contamination carries over to AIC, BIC, DFC and DBC, so using an MLE-based model selection criterion in the presence of contaminated data may be erroneous.
To overcome the problem of contamination in GLM, Cantoni and Ronchetti (2001) introduced robust estimation of regression coefficients. Müller and Welsh (2009) proposed a robust consistent model selection criterion by extending the method in Müller and Welsh (2005) to GLM. It is based on a penalized measure of the predictive ability of the GLM that is estimated using the m-out-of-n bootstrap method. It is flexible as it can be used with any estimator. Further, Müller and Welsh (2009) empirically established that its performance is best with the robust estimator due to Cantoni and Ronchetti (2001). However, this method is computationally intensive.
In this article, we propose a new robust model selection criterion in GLM. We show that it is a consistent model selection criterion in the sense that, as the sample size tends to infinity, the selected model coincides with the true model with probability approaching one. A simulation study is presented to compare its performance with that of its competitors. The proposed model selection criterion, along with the other criteria, is applied to data on the diversity of arboreal marsupials (possums) in montane ash forest (Australia) for model selection.
Results and discussion
Robust estimation
Robust quasideviance
Müller and Welsh (MW) model selection criterion
 Step I: Compute and order the Pearson residuals from the full model.
 Step II: Set the number of strata K between 3 and 8 (Cochran 1977, pp. 132–134) depending on the sample size.
 Step III: Set stratum boundaries at the \(K^{-1}, 2K^{-1}, \ldots, \left( {K - 1} \right)K^{-1}\) quantiles of the Pearson residuals.
 Step IV: Allocate observations to the strata in which their Pearson residuals lie.
 Step V: Sample \(\frac{{\left( {{\text{number}}\;{\text{of}}\;{\text{observations}}\;{\text{in}}\;{\text{stratum}}\;k} \right)m}}{n}\) (rounded if necessary) rows of (y, X) independently with replacement from stratum k, for k = 1, …, K, so that the total sample size is m.
 Step VI: Use these data to construct the estimator \(\widehat{\varvec{\beta}}_{\alpha ,m }^{\varvec{*}}\), repeat steps V and VI B independent times and then estimate the conditional expected prediction loss by \(\widehat{\sigma }^{2} A_{2} \left( {M_{\alpha } } \right)\), where
$$A_{2} \left( {M_{\alpha } } \right) = \frac{1}{n}\sum\limits_{i = 1}^{n} w\left( {\varvec{X}_{i,\alpha } } \right)\rho \left[ {\frac{{y_{i} - g^{-1} \left( {\varvec{X}_{i,\alpha }^{T} \left\{ {\widehat{\varvec{\beta}}_{\alpha ,m }^{\varvec{*}} - E_{\varvec{*}} \left( {\widehat{\varvec{\beta}}_{\alpha ,m }^{\varvec{*}} - \widehat{\varvec{\beta}}_{\alpha } } \right)} \right\}} \right)}}{{\widehat{\sigma }V\left( {\varvec{X}_{i}^{T} \widehat{\varvec{\beta}}} \right)}}} \right]$$ (11)
and \(E_{\varvec{*}}\) denotes expectation with respect to the bootstrap distribution. Müller and Welsh (2009) suggest using \(\frac{n}{4} \le m \le \frac{n}{2}\) for moderate sample size n (50 ≤ n ≤ 200); if n is large, m can be smaller than \(\frac{n}{4}\).
Combining Eqs. (10) and (11), we get an estimate of the criterion function given in Eq. (9) as
$$\hat{A}\left( {M_{\alpha } } \right) = \widehat{\sigma }^{2} \left\{ {A_{1} \left( {M_{\alpha } } \right) + \frac{1}{n}\delta (n)p_{\alpha } + A_{2} \left( {M_{\alpha } } \right)} \right\}$$ (12)
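The resampling part of this algorithm (Steps I to V) can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions, not the implementation used in the article: the function name `stratified_bootstrap_indices` is hypothetical, the Pearson residuals are assumed to have been computed already from the full-model fit, and the strata are formed by cutting the ordered residuals into K groups of (nearly) equal size, which approximates the quantile boundaries of Step III when there are no ties.

```python
import random


def stratified_bootstrap_indices(residuals, n_strata, m, rng):
    """Row indices of one proportionally allocated, stratified m-out-of-n sample.

    residuals: Pearson residuals from the full model (assumed precomputed).
    n_strata:  number of strata K.
    m:         target bootstrap sample size.
    rng:       a random.Random instance.
    """
    n = len(residuals)
    # Steps I, III, IV: order the rows by residual and cut the ordering into
    # K consecutive groups, which stands in for the k/K quantile boundaries.
    order = sorted(range(n), key=lambda i: residuals[i])
    strata = [
        order[(k * n) // n_strata:((k + 1) * n) // n_strata]
        for k in range(n_strata)
    ]
    sample = []
    for stratum in strata:
        # Step V: draw round(n_k * m / n) rows with replacement from stratum k.
        n_draw = round(len(stratum) * m / n)
        sample.extend(rng.choice(stratum) for _ in range(n_draw))
    return sample


# Usage with made-up residuals: n = 64 rows, K = 8 strata, m = 24.
rng = random.Random(0)
fake_residuals = [rng.gauss(0.0, 1.0) for _ in range(64)]
rows = stratified_bootstrap_indices(fake_residuals, 8, 24, rng)
print(len(rows))
```

Step VI would then refit the candidate model on the sampled rows and repeat the whole draw B times; that part depends on the robust estimator and is not shown here.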
The robust model selection criterion A(M_{α}) due to Müller and Welsh (2009) requires computation of the quantities given in Eqs. (10) and (11). In addition, a computer-intensive, proportionally allocated, stratified m-out-of-n bootstrap is required to compute the quantity in Eq. (11). This makes the criterion quite difficult for a researcher to implement, so there is a need for a robust criterion that is easy to implement. We propose a robust version of the deviance-based criterion (DBC), called robust DBC (RDBC).
Proposed robust model selection criterion
Condition 1
Theorem 1
Condition 2
\(C\left( {n,p_{\alpha } } \right) = o\left( n \right)\) and \(C\left( {n,p_{\alpha } } \right) \uparrow \infty\) as \(n \to \infty .\)
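As a brief illustration of Condition 2 (our example, not part of the original statement), consider a BIC-type penalty \(C\left( {n,p_{\alpha } } \right) = p_{\alpha } \log n\) with a fixed number of parameters \(p_{\alpha } \ge 1\). Both requirements hold, since
$$\frac{{p_{\alpha } \log n}}{n} \to 0\quad {\text{and}}\quad p_{\alpha } \log n \uparrow \infty \quad {\text{as}}\;n \to \infty .$$
The same argument applies to \(p_{\alpha } \left( {\log n + 1} \right)\). By contrast, a constant AIC-type penalty such as \(2p_{\alpha }\) is \(o\left( n \right)\) but does not diverge, so it does not satisfy the second requirement.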
Theorem 2
Bootstrap RDBC
It is a well-known fact that robust estimators of the unknown regression coefficients in GLM are not unbiased and that their bias is non-negligible for small to moderate sample sizes. RDBC is based on the robust estimator due to Cantoni and Ronchetti (2001), which is a biased estimator. Therefore, we propose a modified version of RDBC using the proportionally allocated, stratified m-out-of-n bootstrap.
 Step VI*: Use these data to construct the estimator \(\widehat{\varvec{\beta}}_{\alpha , m,j }^{\varvec{*}}\), repeat steps V and VI* B independent times and then compute \(\Lambda_{QM}^{*}\).
Simulation results
Percentage of optimal model selection
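| \(\epsilon\) | γ | n | RDBC, P_{1} | RDBC, P_{2} | BRDBC, P_{1} | BRDBC, P_{2} | A(M_{α}) |
|---|---|---|---|---|---|---|---|
| 0.05 | 2 | 64 | 89 | 93 | 98 | 99 | 98 |
| 0.05 | 2 | 128 | 92 | 95 | 99 | 100 | 99 |
| 0.05 | 2 | 192 | 94 | 96 | 100 | 100 | 100 |
| 0.05 | 5 | 64 | 87 | 93 | 93 | 96 | 99 |
| 0.05 | 5 | 128 | 89 | 93 | 99 | 99 | 100 |
| 0.05 | 5 | 192 | 91 | 93 | 99 | 99 | 99 |
| 0.05 | 10 | 64 | 86 | 91 | 91 | 96 | 98 |
| 0.05 | 10 | 128 | 91 | 94 | 98 | 99 | 99 |
| 0.05 | 10 | 192 | 92 | 95 | 99 | 99 | 100 |
| 0.10 | 2 | 64 | 88 | 91 | 96 | 98 | 97 |
| 0.10 | 2 | 128 | 90 | 93 | 98 | 99 | 99 |
| 0.10 | 2 | 192 | 92 | 95 | 99 | 100 | 99 |
| 0.10 | 5 | 64 | 78 | 85 | 78 | 87 | 99 |
| 0.10 | 5 | 128 | 85 | 89 | 94 | 97 | 100 |
| 0.10 | 5 | 192 | 86 | 90 | 96 | 98 | 100 |
| 0.10 | 10 | 64 | 78 | 83 | 70 | 80 | 98 |
| 0.10 | 10 | 128 | 84 | 88 | 92 | 96 | 99 |
| 0.10 | 10 | 192 | 86 | 91 | 97 | 99 | 100 |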
Real data application
Selected models
| Selection criterion | Selected variables in the best model |
|---|---|
| RDBC | Stags, bark, acacia, habitat, aspect |
| BRDBC | Stags, habitat |
| MW method based on Mallows’ quasi-likelihood estimator | Stags, habitat |
| MW method based on bias corrected Mallows’ quasi-likelihood estimator (stratified bootstrap) | Stags, habitat |
| AIC | Stags, bark, acacia, habitat, aspect |
| BIC | Stags, bark, acacia, aspect |
GLM provide a satisfactory answer to many practical problems in the emerging quantitative analysis in fields such as environmental science and ecology. The data produced in studies of air pollution, ozone exceedance, ground water contamination, avian population monitoring, boreal treeline dynamics, aquatic bacterial abundance, conservation biology, marine and fresh water fish populations, etc. can be analyzed using GLM. GLM provide a satisfactory basis for model-based inference only if the relevant variables alone are included in the model and there is no deviation from the assumed distribution of the response. The proposed criterion can effectively identify and safely remove redundant predictors from the model in the presence of slight deviation from the assumed distribution of the response. Our criterion is robust to outliers, which are common in real data. It is also shown to be a consistent model selection criterion. Hence, our criterion is a good addition to the researcher's toolbox of consistent model selection methods that are easy to implement.
Methods
Simulation design
The empirical comparison of the proposed and existing model selection criteria is done using a simulation study. The simulated data were generated according to a Poisson regression model with canonical (log) link and three predictors with intercept, i.e. \(\log \mu_{i} = \beta_{0} + \beta_{1} X_{i1} + \beta_{2} X_{i2} + \beta_{3} X_{i3}\). The predictors were generated from the standard uniform distribution, i.e. X_{ij} ~ U(0, 1), j = 1, 2, 3. The observations on the responses Y_{i} were generated from the Poisson distribution P(μ_{i}) and from a perturbed distribution of the form \(\left( {1 - \epsilon } \right)P\left( {\mu_{i} } \right) + \epsilon P\left( {\gamma \mu_{i} } \right)\), where \(\epsilon = 0.05, 0.10\) and γ = 2, 5, 10.
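The data-generating mechanism just described can be sketched as follows. This is a minimal standard-library illustration, not the authors' R code: the helper names `poisson_draw` and `simulate` are hypothetical, the default coefficients (1, 1, 2, 0) are the values used in this simulation design, and the Poisson sampler uses Knuth's multiplication method, which is adequate for the means that arise here but is not a general-purpose generator.

```python
import math
import random


def poisson_draw(mean, rng):
    """One Poisson variate via Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    # Multiply uniforms until the running product drops to exp(-mean) or below.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(n, eps, gamma, rng, beta=(1.0, 1.0, 2.0, 0.0)):
    """Return n pairs (y, x) from (1 - eps) P(mu) + eps P(gamma * mu)."""
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]  # X_ij ~ U(0, 1)
        mu = math.exp(beta[0] + sum(b * xj for b, xj in zip(beta[1:], x)))
        # With probability eps the response comes from the contaminating
        # component P(gamma * mu); otherwise from the assumed model P(mu).
        mean = gamma * mu if rng.random() < eps else mu
        rows.append((poisson_draw(mean, rng), x))
    return rows


# Usage: one contaminated data set with n = 64, eps = 0.05, gamma = 5.
data = simulate(64, 0.05, 5, random.Random(1))
print(len(data), data[0])
```

Setting `eps` to 0 recovers the uncontaminated Poisson regression model.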
To simulate the data, the regression parameters were set to β_{0} = 1, β_{1} = 1, β_{2} = 2 and β_{3} = 0. These values were chosen only for the purpose of illustration. We considered three different sample sizes, n = 64, 128 and 192. To compute BRDBC and A(M_{α}), we divided the entire sample into eight equal-sized strata based on the Pearson residuals from the full model. For sample size n = 64, we drew 3 observations from each stratum with replacement, so that the bootstrap sample size is 24. Similarly, for n = 128 and 192, we drew 5 and 7 observations per stratum, giving bootstrap sample sizes of 40 and 56 respectively. This is in accordance with the algorithm given in the section “Results and discussion”. In this way, we obtained B = 50 bootstrap samples for each sample size. To implement RDBC and BRDBC, we used the penalty functions \(P_{1} = p_{\alpha } \log \left( n \right)\) and \(P_{2} = p_{\alpha } \left( {\log \left( n \right) + 1} \right)\) for C(n, p_{α}). The Huber score function with tuning constant c = 2 was used to compute the robust estimator due to Cantoni and Ronchetti (2001). It can be easily computed using the robustbase package (Rousseeuw et al. 2014) in R. This experiment was repeated 1000 times and the percentage of optimal model selection was obtained for RDBC, BRDBC and A(M_{α}).
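For concreteness, the two penalty functions and the Huber score function mentioned above can be written out as below. This is an illustrative sketch with hypothetical function names; it assumes that log denotes the natural logarithm and uses the standard form of the Huber score function, ψ_c(r) = max(−c, min(c, r)). It does not compute the robust estimator itself.

```python
import math


def penalty_p1(p_alpha, n):
    """P1 = p_alpha * log(n); natural logarithm assumed."""
    return p_alpha * math.log(n)


def penalty_p2(p_alpha, n):
    """P2 = p_alpha * (log(n) + 1); natural logarithm assumed."""
    return p_alpha * (math.log(n) + 1.0)


def huber_psi(r, c=2.0):
    """Standard Huber score function: identity on [-c, c], clipped outside."""
    return max(-c, min(c, r))


# Usage: penalties for a model with 3 parameters at the three sample sizes.
for n in (64, 128, 192):
    print(n, round(penalty_p1(3, n), 3), round(penalty_p2(3, n), 3))
```

P_{2} always exceeds P_{1} by exactly p_{α}, so it penalizes each additional parameter slightly more heavily at every sample size.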
Conclusions
We proposed a robust model selection criterion in GLM called RDBC. RDBC takes into account goodness of fit as well as complexity of the model. The consistency property of RDBC is also established. Its performance is evaluated and compared with that of the MW method using a simulation study. These methods are also applied to real ecological data. We also defined a bootstrap version of RDBC, called BRDBC. Any suitable penalty function can be used without changing the form of RDBC and BRDBC.
In quantitative analysis of environmental and ecological data using GLM, the distribution of the response may deviate from the distribution assumed in the model, and the model may contain redundant predictors that need to be identified and safely removed. The proposed criterion can be used effectively to perform model selection in GLM. It is robust to slight deviations from the assumed response distribution and to the presence of outliers in the data. Overall, the proposed model selection criterion is robust, consistent and easier to implement than its competitors.
Declarations
Authors’ contributions
DS defined the proposed method, stated and proved the theorems, performed the simulation study and illustrated model selection for the arboreal marsupials data. DK formulated the concept behind the method and contributed to writing and drafting the manuscript and to revising it critically for intellectual content. Both authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank the Editor and anonymous referees for their suggestions which led to the improvement in the paper.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
 Akaike H (1978) A Bayesian analysis of the minimum AIC procedure. Ann Inst Stat Math 30:9–14
 Cantoni E (2004) Analysis of robust quasi-deviances for generalized linear models. J Stat Softw 10:i04
 Cantoni E, Ronchetti E (2001) Robust inference for generalized linear models. J Am Stat Assoc 96:1022–1030
 Cochran WG (1977) Sampling techniques, 3rd edn. Wiley, New York
 Hampel FR (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69:383–393
 Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York
 Heyde CC (1997) Quasi-likelihood and its application. Springer, New York
 Hu B, Shao J (2008) Generalized linear model selection using R^{2}. J Stat Plan Inference 138:3705–3712
 Huber PJ (1981) Robust statistics. Wiley, New York
 Künsch HR, Stefanski LA, Carroll RJ (1989) Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. J Am Stat Assoc 84:460–466
 Lindenmayer DB, Cunningham RB, Tanton MT, Smith AP, Nix HA (1990) The conservation of arboreal marsupials in the montane ash forests of the Victoria, South-East Australia: I. Factors influencing the occupancy of trees with hollows. Biol Conserv 54:111–131
 Lindenmayer DB, Cunningham RB, Tanton MT, Nix HA, Smith AP (1991) The conservation of arboreal marsupials in the montane ash forests of the Central Highlands of Victoria, South-East Australia: III. The habitat requirements of Leadbeater’s Possum Gymnobelideus leadbeateri and models of the diversity and abundance of arboreal marsupials. Biol Conserv 56:295–315
 McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, London
 Morgenthaler S (1992) Least-absolute-deviations fits for generalized linear models. Biometrika 79:747–754
 Müller S, Welsh AH (2005) Outlier robust model selection in linear regression. J Am Stat Assoc 100:1297–1310
 Müller S, Welsh AH (2009) Robust model selection in generalized linear model selection. Stat Sin 19:1155–1170
 Murtaugh PA (2009) Performance of several variable-selection methods applied to real ecological data. Ecol Lett 12:1061–1068
 Pregibon D (1982) Resistant fits for some commonly used logistic models with medical applications. Biometrics 38:485–498
 Rousseeuw P, Croux C, Todorov V, Ruckstuhl A, Salibian-Barrera M, Verbeke T, Koller M, Maechler M (2014) Robustbase: basic robust statistics. R package version 0.922. http://CRAN.R-project.org/package=robustbase
 Ruckstuhl AF, Welsh AH (2001) Robust fitting of the binomial model. Ann Stat 29:1117–1136
 Sakate DM, Kashid DN (2013) Model selection in GLM based on the distribution function criterion. Model Assist Stat Appl 8:321–332
 Sakate DM, Kashid DN (2014) A deviance-based criterion for model selection in GLM. Statistics 48:34–48
 Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
 Shao J (1993) Linear model selection by cross-validation. J Am Stat Assoc 88:486–494
 Shao J (1996) Bootstrap model selection. J Am Stat Assoc 91:655–665
 Simpson JR, Montgomery DC (1998) The development and evaluation of alternative generalized M-estimation techniques. Commun Stat Simul Comput 27:1031–1049
 Stefanski LA, Carroll RJ, Ruppert D (1986) Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika 73:413–424
 Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61:439–447
 Wisnowski JW, Simpson JR, Montgomery DC, Runger GC (2003) Resampling methods for variable selection in robust regression. Comput Stat Data Anal 43:341–355