Soybean yield modeling using bootstrap methods for small samples

Gustavo H. Dalposso, Miguel A. Uribe-Opazo, Jerry A. Johann


One problem that arises when working with regression models concerns sample size: because the statistical methods used in inferential analyses are asymptotic, a small sample may compromise the analysis because the estimates will be biased. An alternative is the bootstrap methodology, which in its non-parametric version does not require knowing or assuming the probability distribution that generated the original sample. In this work we used a small-sample data set of soybean yield and physical and chemical soil properties to fit a multiple linear regression model. Bootstrap methods were used for variable selection, identification of influential points, and determination of confidence intervals for the model parameters. The results showed that the bootstrap methods enabled us to select the physical and chemical soil properties that were significant in the construction of the soybean yield regression model, to construct confidence intervals for the parameters, and to identify the points that had great influence on the estimated parameters.
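
To make the non-parametric idea concrete, the following is a minimal sketch of case-resampling bootstrap confidence intervals for the coefficients of a multiple linear regression. It is not the authors' implementation: the data are synthetic, the function name `bootstrap_coef_intervals` and all parameter values are hypothetical, and the percentile interval is only one of several bootstrap interval types (the text above does not state which one was used).

```python
import numpy as np


def bootstrap_coef_intervals(X, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for OLS coefficients (case resampling).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend an intercept column
    coefs = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        # Draw n row indices with replacement: no distribution is assumed.
        idx = rng.integers(0, n, size=n)
        beta = np.linalg.lstsq(design[idx], y[idx], rcond=None)[0]
        coefs[b] = beta
    lower = np.percentile(coefs, 100 * alpha / 2, axis=0)
    upper = np.percentile(coefs, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


# Synthetic small sample (assumption): 30 observations, two predictors.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + data_rng.normal(scale=0.5, size=30)

lower, upper = bootstrap_coef_intervals(X, y)
for name, lo, hi in zip(["intercept", "x1", "x2"], lower, upper):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

Resampling whole rows keeps each response paired with its own predictors, which is why no error distribution has to be specified. A coefficient whose interval excludes zero would be a candidate for retention, although the variable selection described above may rely on a different criterion.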


multiple linear regression; model selection; bootstrap global influence diagnosis; bootstrap confidence intervals


DOI: 10.5424/sjar/2016143-8635