One of the problems that arise when working with regression models concerns sample size: since the statistical methods used in inferential analyses are asymptotic, a small sample may compromise the analysis because the estimates will be biased. An alternative is the bootstrap methodology, which in its non-parametric version does not require knowing or guessing the probability distribution that generated the original sample. In this work we used a small data set of soybean yield and physical and chemical soil properties to fit a multiple linear regression model. Bootstrap methods were used for variable selection, identification of influential points and determination of confidence intervals for the model parameters. The results showed that the bootstrap methods enabled us to select the physical and chemical soil properties that were significant in the construction of the soybean yield regression model, to construct confidence intervals for the parameters and to identify the points that had great influence on the estimated parameters.

Soybean (

Considering that determining the values of certain variables is often a burdensome and arduous task, in some cases analyses are carried out on small samples

An alternative to traditional inference methods is the use of the bootstrap, a simulation method developed by

By comparing the results obtained from bootstrap methods with results of asymptotic methods,

The objective of this work was to utilize bootstrap methods to select explanatory variables, investigate the existence of influential points through diagnostic analysis, and obtain confidence intervals for the parameters of a multiple linear regression model for soybean yield considering physical and chemical soil properties as explanatory variables.

The data used are from the agricultural year 2013/2014 and from a commercial farming area of 167.35 hectares located in the western region of Paraná, Brazil, near the city of Cascavel, with center coordinates latitude 24°57’18’’S and longitude 53°34’29’’W and average altitude of 714 m (_{1}, SRP_{2} and SRP_{3} (soil penetration resistances, MPa, from 0 to 0.1 m, 0.1 to 0.2 m and 0.2 to 0.3 m depths, respectively), Ca (calcium, cmol_{c}/dm^{3}), Mg (magnesium, cmol_{c}/dm^{3}), K (potassium, mg/dm^{3}), P (phosphorus, mg/dm^{3}), Mn (manganese, mg/dm^{3}), Des_{1}, Des_{2} and Des_{3} (soil densities, g/cm^{3}, from 0 to 0.1 m, 0.1 to 0.2 m and 0.2 to 0.3 m depths, respectively) have all been considered for each productivity value. The use of physical and chemical soil properties as explanatory variables is common practice in field surveys, as variations in soil properties account for most of crop yield variations, according to

Descriptive statistics of the variables under study were calculated and a multicollinearity

To determine the bootstrap replicates of the parameters of the regression model we used the paired bootstrap method (

(a) Consider the matrix [y, X] and resample its rows with replacement to obtain the bootstrap sample [y^{(1)}, X^{(1)}]; (b) estimate β̂^{(1)} from [y^{(1)}, X^{(1)}]; (c) repeat steps (a) and (b) to obtain β̂^{(b)}, b = 1,…,B, in the same manner.
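As an illustration only, the row resampling described above can be sketched in Python. This is a minimal sketch and not the code used in this work (the analyses were run in R): it assumes a single regressor so that the least-squares fit has a closed form, and the data, the seed and the function names `ols_simple` and `paired_bootstrap` are hypothetical.

```python
import random

def ols_simple(pairs):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def paired_bootstrap(pairs, n_boot, seed=0):
    """Resample whole (x, y) rows with replacement and refit each time."""
    rng = random.Random(seed)
    n = len(pairs)
    replicates = []
    while len(replicates) < n_boot:
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        if len({x for x, _ in sample}) < 2:
            continue  # skip degenerate resamples with no spread in x
        replicates.append(ols_simple(sample))
    return replicates

# Hypothetical data: a response that decreases with the regressor.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
y = [4.6, 4.1, 4.2, 3.7, 3.4, 3.5, 2.9, 2.8, 2.4, 2.3]
replicates = paired_bootstrap(list(zip(x, y)), n_boot=200)
mean_slope = sum(slope for _, slope in replicates) / len(replicates)
```

Each element of `replicates` plays the role of one β̂^{(b)}; resampling rows rather than residuals keeps every response paired with its own regressor values.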

In order to determine the bootstrap intervals for the parameters of the regression models we used the percentile method (_{i}*, i = 1,…,B, and excluding 100(α/2)% of the replicates situated at each of its ends. The technique employed to build the BC confidence interval uses a value known as the bias-correction constant to adjust the bootstrap distribution of ˆ
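The two interval types can be sketched as follows. This is a hedged illustration that assumes the usual percentile construction and Efron's bias-corrected (BC) construction; the function names and the rounding convention used to pick the order statistics are assumptions, not details taken from this work.

```python
from statistics import NormalDist

def percentile_interval(replicates, alpha=0.05):
    """Drop a fraction alpha/2 of the ordered replicates at each end."""
    ordered = sorted(replicates)
    b = len(ordered)
    k = round(b * alpha / 2)  # number of replicates discarded per tail
    return ordered[k], ordered[b - 1 - k]

def bc_interval(replicates, estimate, alpha=0.05):
    """Bias-corrected interval: shift the percentiles by the constant z0."""
    nd = NormalDist()
    ordered = sorted(replicates)
    b = len(ordered)
    # Proportion of replicates below the original estimate; it must lie
    # strictly between 0 and 1, otherwise inv_cdf raises an error.
    prop_below = sum(r < estimate for r in ordered) / b
    z0 = nd.inv_cdf(prop_below)  # bias-correction constant
    p_low = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    p_high = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    k_low = min(b - 1, round(b * p_low))
    k_high = max(0, b - 1 - round(b * (1 - p_high)))
    return ordered[k_low], ordered[k_high]
```

When exactly half of the replicates fall below the original estimate, z0 is zero and the BC interval coincides with the percentile interval.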

For model selection, the bootstrap method was used as proposed by

(a) Consider the matrix [

In order to investigate the existence of influential points, we applied the method proposed by _{i} (

_{i} using JaB.

(a) Fit the proposed model to the original dataset and estimate D_{i}, i = 1,…,n; (b) build B bootstrap samples using the paired bootstrap method; (c) (JaB step) for each observation x_{i} of the original dataset, consider the set of bootstrap samples that do not contain x_{i} (approximately B/e groups_{i} value is outside this interval then x_{i} is marked as an influential point.
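The grouping in the JaB step can be illustrated with a short Python sketch. It covers only step (c), that is, finding for each observation the bootstrap samples in which it does not appear; the values n = 30 and B = 3000 follow the figures reported in this work, while the function names and the seed are hypothetical.

```python
import random

def bootstrap_index_sets(n, n_boot, seed=0):
    """Row indices of n_boot paired-bootstrap samples of size n."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(n_boot)]

def samples_without(index_sets, i):
    """Positions of the bootstrap samples that never drew observation i."""
    return [b for b, idx in enumerate(index_sets) if i not in idx]

n, n_boot = 30, 3000
index_sets = bootstrap_index_sets(n, n_boot)
# One group per observation; each holds roughly n_boot / e samples.
group_sizes = [len(samples_without(index_sets, i)) for i in range(n)]
```

No additional resampling is needed for this step: the B samples from step (b) are simply partitioned by whether they contain x_{i}.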

The JaB technique provides another resource for establishing the effect of individual observations on the bootstrap distribution through the JaB plot (_{(i)},_{(i)}] obtained by deleting the i-th row of the original dataset and calculate the statistic of interest, denoted by s_{(i)}. The jackknife influence function for the statistic of interest is defined by:

where

Intuitively, points with high positive or negative values of u_{i}{s} have a high influence on the calculated statistic. To provide a clearer interpretation, the relative jackknife influence function shown in

After calculating the jackknife influence values for each point i of the dataset, seven ordered pairs are determined, namely (u_{i}^{↑}{s}, P_{k}), k = {5,10,16,50,84,90,95}, where P_{k} represents the k-th percentile of the bootstrap distribution formed with the bootstrap replicates calculated from those bootstrap samples that do not contain point i. For each percentile, neighboring ordered pairs are linked to form curves, which are compared with dashed line segments perpendicular to the ordinate axis at the points P_{k}, k = {5,10,16,50,84,90,95}, calculated from the full bootstrap distribution formed by 3000 bootstrap replicates. The analysis is performed by highlighting the points that surpass the cutoff point and comparing the bootstrap distributions.
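The influence values that place each point on the horizontal axis of the JaB plot can be sketched in Python. The sketch assumes the usual jackknife form u_{i}{s} = (n - 1)(s_{(·)} - s_{(i)}), with s_{(·)} the mean of the leave-one-out statistics, and a relative version obtained by dividing by the square root of the sum of squared u_{j}{s} over n - 1. Both formulas are assumptions here, as are the function name and the toy data.

```python
from math import sqrt
from statistics import mean

def jackknife_influence(data, statistic):
    """Return (raw, relative) jackknife influence values for each point."""
    n = len(data)
    # s_(i): statistic recomputed with the i-th observation deleted.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    s_dot = mean(leave_one_out)
    raw = [(n - 1) * (s_dot - s_i) for s_i in leave_one_out]
    # Scale so that values are comparable across statistics; this is
    # undefined when every raw influence value is zero.
    scale = sqrt(sum(u * u for u in raw) / (n - 1))
    return raw, [u / scale for u in raw]

# Hypothetical data with one extreme observation (last position).
raw, relative = jackknife_influence([1.0, 2.0, 3.0, 4.0, 50.0], mean)
```

For the sample mean the raw value reduces to the deviation of each observation from the mean, so the extreme observation receives the largest positive influence.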

The analyses carried out in this work were developed in R statistical software (

Descriptive statistics of the explanatory variables indicated homogeneous behavior of the variables, with no multicollinearity found. The multiple linear regression model of soybean yield, estimated through OLS considering all explanatory variables, had an adjusted coefficient of determination (R^{2}_{Adj}) of 0.41 and a root mean square error (RMSE) of 0.33.

It could be observed that estimates for those parameters associated with SRP_{1}, SRP_{3}, Des_{1} and Des_{2} variables showed negative signs, indicating that an increase in the value of these variables implies a reduction in soybean yield (_{2} and Des_{3} variables from the

The vast majority of the confidence intervals determined by the bootstrap technique contained zero, indicating that, with the exception of the variable P, the explanatory variables may not be individually significant. In search of a more appropriate multiple linear regression model, the bootstrap model selection method was applied considering 1000 resamples (

Other variables selected in most models were Des_{2}, with a selection percentage of 87%, Ca with 81% and SRP_{1} with 79%. Analyzing the signs of the estimated parameters associated with these variables in the models in which they were selected, in 94% of the models in which the Ca variable was selected the sign of its estimated parameter was positive, suggesting that an increase in this variable contributes to increasing soybean yield. For the estimated parameters associated with the SRP_{1} and Des_{2} variables, the signs were negative in 98% of the models in which they were selected. Some variables may not be useful to explain soybean yield behavior. For example, among the 1000 models obtained, the Des_{1} variable was selected in only 500; in 180 of those the estimated parameter sign was positive and in 320 it was negative. This oscillation indicates that the variable is not significant and can therefore be removed without harming the model. A similar case occurs with the SRP_{3} variable: besides being selected in only 460 models, the appropriate sign of its estimated parameter cannot be identified, since 230 models had a positive sign and 230 had a negative sign. As for the parameter estimates associated with the SRP_{2} and Des_{3} variables, they showed signs opposite to those expected (_{81}, M_{79}, M_{75} and M_{71} (

The regressors present in the M_{81} model explain only 37% of the soybean yield variation, a result lower than that obtained with the model containing all the explanatory variables (R^{2}_{Adj} = 0.41). The M_{75} (R^{2}_{Adj} = 0.42) and M_{71} (R^{2}_{Adj} = 0.49) models explained a greater share of the soybean yield variation than the full model, while the M_{79} model (R^{2}_{Adj} = 0.41) provided an equivalent level of explanation. However, these models had a higher RMSE than the complete model (RMSE = 0.33), a difference that is most evident in the M_{79} model (RMSE = 0.39). As the M_{71} model explained 49% of the soybean yield variation and its RMSE (0.34) is close to that of the complete model (0.33), M_{71} was chosen as the best-fitting model for soybean yield, and an analysis using JaB was performed to investigate the existence of influential points.
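The two criteria used in this comparison can be written out in a few lines of Python. This is a sketch under stated assumptions: the adjusted coefficient of determination is taken in its usual form with k regressors plus an intercept, and RMSE is taken as the square root of the mean squared residual; the exact degrees of freedom used in this work are not given, and the function names are hypothetical.

```python
from math import sqrt

def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k regressors (plus intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def rmse(residuals):
    """Root mean square error of a list of residuals."""
    return sqrt(sum(e * e for e in residuals) / len(residuals))
```

Because the adjustment penalizes each extra regressor, a model with fewer variables can reach a higher adjusted value than the full model even though its unadjusted R^{2} cannot be larger.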

It is noteworthy that no points were detected as influential when the value 1 is established as the cutoff point (_{i} is higher than the median of Snedecor's F distribution with p = 6 and n - p = 24 degrees of freedom, since the cutoff point for these is 2.50; thus they were also not detected as influential points. Considering 4/n ≈ 0.13 as the cutoff point, points 15, 23 and 29 were detected as influential, indicating that these points can change the estimates of the parameters of the regression model, so it is important to investigate the model behavior without them. It should be emphasized that only point 23 was detected as influential through the analysis using Cook’s distance (D_{i}) with the JaB methodology.
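The fixed cutoffs mentioned above (1 and 4/n) can be tried on a toy example. The sketch below computes Cook's distance for a simple regression with one regressor (so p = 2), using the closed-form leverage; it is an illustration with hypothetical data and function names, not the six-variable model analysed here.

```python
def cooks_distances(x, y):
    """Cook's distance of each point in a one-regressor least-squares fit."""
    n, p = len(x), 2  # p counts the intercept and the slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [e * e * h / (p * s2 * (1 - h) ** 2)
            for e, h in zip(residuals, leverage)]

def flag_influential(distances, cutoff):
    """Zero-based positions whose Cook's distance exceeds the cutoff."""
    return [i for i, d in enumerate(distances) if d > cutoff]

# Hypothetical data: the last point departs from the linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.9, 8.1, 9.0, 15.0]
distances = cooks_distances(x, y)
flagged = flag_influential(distances, 4 / len(x))
```

With n = 30, as in this work, the second cutoff is 4/30 ≈ 0.13, far stricter than the value 1, which is why the two rules can disagree on the same data.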

JaB graphs were created to help identify influential points; they give a visual interpretation of how a particular point affects the bootstrap distribution of the parameter estimates in M_{71} (

Two new models were fitted with the variables P, Des_{2}, Ca, SRP_{1}, Mn and Mg to measure the effect of the influential points on the modeling. The M_{71-{15,23,29}} model was fitted to the data set without points 15, 23 and 29, as these were detected as influential by the traditional Cook's distance method with a cutoff point of 4/n. The M_{71-{10,15,23,29}} model was fitted to the data set without points 10, 15, 23 and 29, as these were considered influential by the analysis using JaB (

The M_{71-{15,23,29}} model, fitted to the data set without sample elements 15, 23 and 29, which were identified as influential by the traditional method, is more explanatory than the M_{71} model obtained from the complete set of points: after removal of these points, the percentage of soybean yield variation explained by the regressors increased from 49% to 63% (_{71-{10,15,23,29}} model fitted without points 10, 15, 23 and 29, the adjusted coefficient of determination (0.65) was higher than that obtained from the M_{71-{15,23,29}} model, resulting in a more explanatory model. When comparing the RMSE of the M_{71-{15,23,29}} and M_{71-{10,15,23,29}} models (_{71-{10,15,23,29}} model as the best model for soybean yield and determine bootstrap confidence intervals for the parameters associated with the explanatory variables (

By comparing the confidence intervals of the parameters of the M_{71-{10,15,23,29}} model with the respective intervals obtained from the multiple linear regression model generated with all the explanatory variables and all sampling points (_{71-{10,15,23,29}} model had lower amplitude, indicating that the estimates of this model were more precise.

The average soybean productivity in the monitored area (4.305 t/ha) is considered high compared with other regions, according to data from

The negative sign of the estimates for the parameters associated with the SRP_{1}, SRP_{3}, Des_{1} and Des_{2} variables (_{2} and Des_{3} variables show signs opposite to those expected; however, since multicollinearity was found to be absent, it is prudent to investigate the significance of these variables. The positive sign of the estimated parameter associated with the K variable is expected, in accordance with

The comparison of confidence intervals can be done in terms of their amplitudes according to

The fact that predictor variable P is selected in a large share of models (^{3}) which according to _{1} and Des_{2} variables are used to assess the state of soil compaction, their effect on soybean yield is the opposite, for plants exhibit alterations in depth, branch and distribution of roots in response to soil compaction (

The model selection method using bootstrap is effective in determining the significant variables, resulting in a more parsimonious model. Although the model determined by this method (M_{71}) was the same as that selected by the conventional method using the Akaike criterion, the application of this methodology serves to attest that the model selected by the Akaike criterion is not overparameterized, which can occur when the number of samples is small.

Analyzing _{1}, it is seen that points 15 and 10 are detected as influential. Point 15 has a negative influence (-3.7) and its removal reduces the amplitude of the bootstrap distribution, mainly due to a shift in the lower percentiles: in the empirical distribution formed with all 3000 replicates, P_{5} = -0.373, P_{10} = -0.330 and P_{16} = -0.302, whereas in the empirical distribution formed only by the replicates from bootstrap samples not containing point 15 (1124 samples), P_{5} = -0.336, P_{10} = -0.295 and P_{16} = -0.270. The influence of point 10 is positive (2.6). When considering the bootstrap distribution formed with the bootstrap samples that do not contain point 10 (1039 samples), the percentile values decrease, displacing the distribution and reducing its range from 0.865 to 0.727.

JaB graph in _{2} variables; it also indicates point 23 has a positive influence on bootstrap distributions of the parameters associated with Mg and Mn variables and a negative influence on bootstrap distribution of the parameter associated with variable P. Point 29 has a positive influence on bootstrap distribution of the parameter associated with variable P and point 10 has a positive influence on bootstrap distribution of the parameter associated with variable Des_{2}.

Comparing all the points that were detected as influential in JaB graphs (

Regarding diagnostic analysis it is seen that the influential points determination method using JaB methodology together with Cook’s distance (

Regarding the significance of the explanatory variables in M_{71-{10,15,23,29}} model it is observed that only the parameter associated with Mn variable showed confidence intervals containing zero in both bootstrap techniques (

It is noteworthy that the bootstrap methods were fundamental to obtaining a more descriptive and more accurate model: the M_{71-{10,15,23,29}} model explains a higher percentage of the soybean productivity variation (65%) than the initial model in _{71-{10,15,23,29}} to be satisfactory, taking into account that it was built only from physical and chemical soil properties. The share of soybean productivity variation not explained by this model (35%) is due to variables not considered, for example agrometeorological variables, since climate has a significant impact on the growth and development of crops (

The results showed that the bootstrap methods enabled us to select the physical and chemical soil properties that were significant in the construction of the soybean yield regression model, to construct confidence intervals for the parameters and to identify the points that had great influence on the estimated parameters.

There is no accepted definition of what constitutes a small sample, as sample size depends on a number of factors, including the reliability of the estimate and the relative variance of the variable under consideration (

Multicollinearity refers to high correlation among the independent variables, and its existence tends to inflate the variances of the parameter estimates. It can be assessed with the variance inflation factor, VIF = 1/(1 - R^{2}), where R^{2} is the coefficient of determination of the regression of an independent variable x on all other independent variables in the postulated model. As a rule of thumb, when VIF > 10 we conclude that multicollinearity is a problem and that we should not base our decisions on the magnitude and sign of the regression coefficients (
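The rule of thumb can be made concrete with a small Python sketch, assuming the standard variance inflation factor VIF = 1/(1 - R^{2}); the function name is hypothetical and the R^{2} values are illustrative, not results from this study.

```python
def vif(r_squared):
    """Variance inflation factor from the auxiliary-regression R^2."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1 / (1 - r_squared)

# A regressor that is 90% explained by the others sits at the threshold.
threshold_case = vif(0.9)   # about 10
uncorrelated_case = vif(0.0)  # exactly 1, no inflation
```

In other words, the VIF > 10 rule corresponds to an auxiliary R^{2} above 0.9.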

Given the sample set {y_{1},…,y_{n}}, the probability of y_{j} not being included in a bootstrap sample is (1 - n^{-1})^{n} ≈ e^{-1}; thus, in B bootstrap samples the number of simulations that do not include y_{j} is approximately B·e^{-1} (
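As a worked check of this approximation, the expression can be evaluated for n = 30 (the sample size implied by p = 6 and n - p = 24) and B = 3000 replicates:

```latex
\[
P(y_j \notin \text{bootstrap sample})
  = \left(1 - \frac{1}{n}\right)^{n}
  \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368
\]
\[
n = 30:\quad \left(\tfrac{29}{30}\right)^{30} \approx 0.362,
\qquad
B = 3000:\quad 3000 \times 0.362 \approx 1085,
\qquad
3000\,e^{-1} \approx 1104
\]
```

Both figures are in line with the 1124 and 1039 bootstrap samples reported as not containing points 15 and 10, respectively.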