Prediction of soil macronutrients using fractal parameters and artificial intelligence methods

Aim of study: To evaluate artificial neural networks (ANN), and k- Nearest Neighbor ( k- NN) to support vector regression (SVR) models for estimation of available soil nitrogen (N), phosphorous (P) and available potassium (K). Area of study: Two separate agricultural sites in Semnan and Gorgan, in Semnan and Golestan provinces of Iran, respectively. Material and methods: Complete data set of soil properties was used to evaluate the models’ performance using a k- fold test data set scanning procedures. Soil property measures including clay, sand and silt content, soil organic carbon (SOC), electrical conductivity (EC), lime content as well as fractal dimension (D) were used for the prediction of soil macronutrients. A Gamma test was utilized for defining the optimum combination of the input variables. Main results: The sensitivity analysis showed that OC, EC, and clay were the most significant variables in the prediction of soil macro nutrients. The SVR model was more accurate compared to the ANN and k- NN models. N values were estimated more accurately than K and P nutrients, in all the applied models. Research highlights: The accuracy of models among the test stages illustrated that using a single data set for investigation of model performance could be misleading. Therefore, the complete data set would be necessary for suitable evaluation of the model.


Introduction
Soil nutrients play a considerable role in regulating plant growth. In soils with low nutrient content, limited productivity is expected from plants. On the other hand, excessive use of fertilizers may have detrimental consequences on environmental pollution, in the case they enter into the surface or the groundwater bodies (Wang et al., 2009;Fu et al., 2010). The yield of agricultural crops is completely dependent on the application of soil fertilizers, especially nitrogen (N), phosphorus (P), and potassium (K) (Foth, 1982). Ayoubi & Sahrawat (2011) illustrated that the soil total N and available P are the most important factors in predicting barley yield. Soil properties, including clay content, lime content as well as pH are negatively correlated with the mineral nutrients and may decrease the availability of plant nutrients by fixing them or by forming insoluble compounds (Chaudhari et al., 2012). Additionally, electrical conductivity (EC) has significant influence on crop yield, crop suitability, activity of soil microorganisms, and availability of plant nutrients, which can consequently influence the key soil processes (Adviento-Borbe et al., 2006).
The most accurate way for direct measurement of soil macronutrients is laboratory analysis. However, this analysis is difficult in arid and semi-arid regions of Iran, which have with calcareous soils. Alternatively, during the past three decades, data-driven models (e.g. artificial neural networks (ANN), support vector regression (SVR) and k-Nearest Neighbor (k-NN)) have emerged as rapid mathematical methods, which correlate the role of independent variables to dependent ones (Simon, 1995;Moosavi & Sepaskhah, 2012;Shiri et al., 2017). A limited number of approaches have been proposed for the prediction of soil macronutrients and soil chemical properties utilizing data driven models (Li et al., 2013;Keshavarzi et al., 2015;Jeong et al., 2017). For instance, Jeong et al. (2017) combined different vegetation indices with some attributes derived from digital elevation map in order to predict the spatial soil nutrients. Their results indicated that SVR is an efficient method for prediction of N and P nutrients. Li et al. (2016) investigated the capability of ANN model merged with kriging, in order to predict relationships among soil properties and environmental factors. Keshavarzi et al. (2015) found that the values of soil P could be predicted more accurately by the ANN-based pedotransfer function and by including topographic variables in the model. Safaa & Maxwell (2015) predicted pasture N content using ANN models and thermal images. Their results indicated that ANN models could be well fitted for pasture N content. Zolfaghari et al. (2016) applied the k-NN approach to obtain cation exchange capacity (CEC) using soil particle size distribution (PSD) and soil organic matter (SOM). Their results illustrated that the ANN and k-NN approaches have similar accuracies in estimating soil CEC.
In most reported literature on developing and testing data mining models, a single random data set has been utilized, and subsequently, using a portion of the whole data, models have been trained and been tested using the remained data pattern. In some cases, the efficiency of the models with suitable results, drastically decreases by substitution of the test data. Hence, it is more suitable to determine the efficiency of the models based on the k-fold testing method (Shiri et al., 2017). In this study, we have investigated the capability of the ANN, SVR, and k-NN models in predicting the availability of soil macronutrients, using the k-fold procedure. In addition, the sensitivity analysis of the input variables has been determined using Gamma-test technique. To predict the availability of soil macronutrients, fractal parameters derived from the PSD were also used as the model input variables.

Study area and soil analysis
This study was conducted in two separate agricultural sites in Semnan and Gorgan, respectively from Semnan and Golestan provinces of Iran. Semnan region, in the Semnan province, is located in the central part of Iran. Semnan region has an arid climate with high evapotranspiration and very little precipitation. Soil samples in this area were selected from rangelands and agricultural lands. Rangelands contain poor vegetation cover, while 90% of agricultural lands are used for the production of wheat, barley and corn (Babaei et al., 2018). The mean annual temperature, precipitation and potential evapotranspiration in the Semnan region are 18.5 ºC, 138 mm and 2500 mm yr -1 , respectively. Based on the FAO World Reference Base for Soil Resources (WRB) most soils of Semnan are classified as Regosols, Calcisols and Cambisols.
Gorgan, in the Golestan province, is located in the northern part of Iran. The Gorgan region has a humid climate. Soil samples on this area were selected from agricultural lands which were used for the production of wheat, barley, rapeseed and watermelon. The mean annual temperature, precipitation and potential evapotranspiration in the Gorgan region are 17.5 ºC, 550 mm and 985 mm yr -1 , respectively. Based on the WRB, most soils of this area are classified as Luvisols, Leptosols and Cambisols.
This work was conducted at five steps ( Fig. 1): i) soil properties including pH, EC, OC, CaCO 3, clay, silt and sand were measured; ii) step fractal parameters including D and ω from PSD were determined; iii) the Gamma test was utilized for identifying the relevant variables for modeling of soil macronutrients; iv) soil macronutrients were predicted using ANN, SVR and k-NN models; v) the best model was selected based on the cross validation.
For determining the relation of soil macronutrients (i.e. soil N, available P and available K) with easily measurable soil properties, 130 soil samples from the top 30 cm of soil profile were collected from two separate agricultural 3 Prediction of soil macronutrients using fractal parameters and artificial intelligence methods sites in Semnan and Gorgan, respectively from Semnan and Golestan provinces of Iran. The sampling points were determined using a random sampling scheme.
In the second step the Pore-Solid Fractal model was fitted on PSD data (Bird et al., 2000): where M s(r<rᵢ ) is the cumulative mass of particles smaller than rᵢ, D is the fractal dimension of the PSD, and ω is the composite scaling constant.

Defining the optimal combination with the Gamma test
The Gamma test is an appropriate tool for identifying the optimum combination of the input variables (Yazdani & Zolfaghari, 2017). In this research, the Gamma test was applied in order to find the most effective parameters. Then the least sensitive parameters were excluded from the models. We considered the following set: where x i WRᵐ are considered as m dimensional input vectors (with a record length of M) with real numbers (R) and, accordingly, Y i ϵR are output scalper. In addition, it is assumed that x vectors have the capability of predicting factors affecting the output Y.
The following equation explains the relationship between x and Y vectors.
where ƒ refers to a smooth function and e shows a random noise with zero mean.
The Gamma statistic (Γ) refers to an estimate of output variance in the model. The Gamma test is defined by NE [i,k], that are the k th (1≤ k ≤n) nearest neighbors. Specifically, the Gamma test is derived from the Delta function of the input vectors: where | ... | refers to the Euclidean distance. Subsequently, the corresponding Gamma function of the output values obtains as: where YNE [i,k] is the corresponding Y-value for the k th nearest neighbor of xi in Eq. (9). For calculating Γ, a least-square regression line is determined for the n points (δᴍ (k) , γ ᴍ (k) ) (11) where A indicates the slope of linear regression. Intercept of the above regression line is equal to Γ. A higher slope indicates a higher level of complexity in the model. The Γ statistics can be standardized for the creation of the V ratio. The V ratio was computed as follows: where σ 2 (Y) refers to variance of the output variable. Lower V ratio indicates a better prediction of the output variable, while higher V represents a greater prediction error (Moghaddamnia et al., 2009;Yazdani & Zolfaghari, 2017).
Standard error (SE) of the regression give us an indication of how much the point estimate is likely to vary from the corresponding data. It can be calculated as square root of the estimated error variance on the sum of square of the independent variable. Therefore, if the SE value of linear regression in Eq. (9) is close to zero, we have more confidence in the value of the Gamma statistic as an estimate for the noise variance on the given output.

Artificial neural network (ANN)
ANN include the input, output and the hidden layers we used. Mathematically, it is expressed by: where Ok and Xi are the output and input variables of the neural network, Wij and Wjk are the weights between the input and hidden layers, and the weights between the hidden and output layers, respectively. S refers to a transfer function, m and n, the numbers of input and hidden neurons, respectively (Haykin, 1994). The number of hidden layers represents the complexity of the ANN. It is certified that various soil properties could be successfully predicted using one hidden layer in ANN (Were et al., 2015;Taghizadeh-Mehrjardi et al., 2016). Therefore, in the current study a topology with one hidden layer was utilized. There are various methods for optimizing the number of neurons inside the hidden layer. For obtaining the best neuron number in the hidden layer, ANNs with 2 to 20 neurons in the hidden layer were compared. The best estimated neuron number in the hidden layer was chosen by a cross-validation using a 5-fold process over the training set. In this study, we applied the sigmoid function as the transfer function for prediction of soil properties. ANN can be trained by a back propagation algorithm. Among various back-propagation training algorithms, we Prediction of soil macronutrients using fractal parameters and artificial intelligence methods selected the Levenberg-Marquardt due to its efficiency and simplicity. Neural Network Toolbox of MATLAB was utilized for performing ANN modeling (Mathworks, 2010).

k-nearest neighbors (k-NN)
The k-NN algorithm proposed by Nemes et al. (2006), Botula et al. (2013) and Zolfaghari et al. (2016) was used to predict soil macronutrients. MATLAB environment was utilized for implementing the k-NN algorithm (Mathworks, 2011). This approach does not need any predefined mathematical function for estimating the target variable. In the k-NN technique, soils which have the most similarity to the target soil (test sample) are chosen from a reference data set. The efficiency of the k-NN technique is greatly dependent on the correct selection of the 'most similar' soils. The details of the k-NN algorithm are: -Step 1: Defining the test (target) and the reference data set.
-Step 2: Calculating the Euclidean distance between test and training samples. The similarity between the target (test data set) and the reference soils obtains using the Euclidean distance was calculated: where di refers to the "Euclidean distance" of the i th soil from the target soil and ∆aij indicates the difference among the i th soil and the target soil in the j th soil attribute and np shows the number of soil properties considered for the model. The term 'distance' determines the similarity; the distance will be smaller for soils which have higher similarity with the target soil regarding the input attributes. -Step 3: Determining k nearest neighbors by sorting the distances in ascending order, and then choosing k samples with relative minimum distances.
The normalization procedure is necessary because of significant differences among soil variables. For this reason, the input variables are first transformed to temporary attributes with a distribution having zero mean and one standard deviation.
Finally, the reference soils data set are sorted in ascending order and according to their distance from the target soil. The distances of the selected k neighbors from the reference soil can be obtained using a weighting procedure as below: where k refers to the number of selected neighbors, wi shows the weight associated with the ith nearest neighbor, and di(rel) defines the relative distance of the i th selected neighbor, calculated as: (16) where p is a power term which considers different possible weight-distance relationships.
After calculation of wi, soil macronutrients would be predicted as: (17) where Yi, are the observed and the predicted soil macronutrients for i th soil sample.

Support vector regression (SVR)
SVR is a strong data mining approach that can be generalized to non-linear models applying a kernel function (Vapnik, 1995). The kernel function is a set of mathematical functions, which is used to transform the input data space into a higher dimensional space. In this study ε-SVR was used for the prediction of soil macronutrient (Smola & Schölkopf, 2004). In ε-SVR a threshold (ε) set is given by the user. Data points with an absolute difference greater than the threshold, contribute to the regression fit. Since the squared residuals are not used in ε-SVR, large outliers have limited effects on the regression equation. Also samples which fit suitably on the model (i.e., with small residuals) have no effect on the regression equation. The best SVR fit in the features space utilizes a loss function in order to find the regression line that minimizes the model residual. The loss function in ε-SVR applies absolute residuals over the threshold ε; however it ignores all errors within ± ε. This approach shows robust effects that are not sensitive to outliers on the prediction (Cherkassky & Mulier, 2007). With ε, the loss function can add a penalty (cost) for large residuals. There is a relationship between the ε and the cost parameter (Kuhn & Johnson, 2013). At the initial step, 0.1 (default value) was chosen for ε, and kept fixed while tuning the value of cost via 10-fold cross validation. In this study, the radial basis kernel function (Eq 18) performed the greatest efficiency because its parameters were easier to adjust compared to those of other kernels (Ballabio, 2009;Kuhn & Johnson, 2013).
where G represents the kernel function, x indicates the input vectors, σ refers the scaling parameter, and shows the Euclidean norm. The radial basis function needs to adjust the sigma (σ) parameter that controls its width. In this research, σ was automatically optimized in MATLAB software by minimizing k-fold cross-validation.

Data splitting and model assessment
A k-fold testing data assignment approach was applied for the analysis of efficiency of the studied models, namely k-NN, ANN and SVR, which were used for the prediction of soil macronutrients. In this step, the variables significant for prediction of each nutrient (e.g. N, K, and P), determined using the Gamma test, were used. The complete data was partitioned into 10 subsets. Each time, the models were trained and were tested utilizing a part of the available patterns. Using this approach, all the available input-target patterns were considered by the models to construct the final estimation model (Marti et al., 2013;Shiri et al., 2014Shiri et al., , 2017. Accordingly, the k-NN, ANN and SVR models were trained and tested 30 times (3 approaches 10-folds). Assessing the accuracy of the models' performance using the k-fold testing would circumvent from acquiring relatively rational conclusions that might be drawn using traditional data management scenarios (Marti et al., 2013;Shiri et al., 2014). This was possible because no invisible input patterns would remain in the development of the models. The performance accuracy of the applied models was evaluated based on four statistical metrics: correlation coefficient (R), root mean square error (RMSE), mean error (MR) and relative root mean square error (r-RMSE). MR illustrates systematic errors in the applied approach, RMSE represents the accuracy of the approach; r-RMSE is a dimensionless indicator that can provide a proper and consistent contrast among the models when they depend on different targets. This is extremely decisive as the influence of the targets' magnitude is neglected because in calculation of r-RM-SE, only their mean values are taken into account. These metrics can be obtained by: where n represents the number of data; Y i and î Y refer to the observed and the predicted soil macronutrients for the i th soil sample, respectively; and the ̅ and ̅ are the mean observed and predicted soil macronutrients, respectively.

Data summary statistics
Relevant statistics of input and output soil properties of the existing and the newly evolved models are represented in Table 1. The data sets embed a wide range of soil factors. Among the input variables, soil PSD covers a good range and can be expandable to many regions with different climatic conditions. Soil pH changed from 5.95 to 8, with an average of 7.5. Soil EC varied from 0.34 to 60 dS/m in the studied regions, which indicates a wide and expandable range (Table 1). Moreover, organic carbon had acceptable fluctuation ranging from 0.01 to 7%. The values of macronutrients showed considerable fluctuation. Nitrogen content had negligible values in poor and infertile soils but showed noticeable values in fertile soils. Variation of soil P was higher than the other macronutrients but its average indicated P deficiency in most soils. The texture of the parent material showed high K content. Large variation in soils macronutrients was largely due to the use of chemical, manures, and mismanagement.

Sensitivity analysis
The Gamma value was calculated for a combination of nine input variables listed in Table 1. Recognizing their role, the input variables were omitted one by one from the initial combination (Table 2). In addition to the Gamma values, V ratio and the SE were also calculated. Increase of the Gamma value after omitting a variable means that the omitted variable has a profound positive effect on the model, and therefore it should be included in the initial combination. Conversely, decrease of the Gamma value after omitting a variable, means that the omitted variable has a weak influence on the model and should be excluded from the initial combination.
Based on the results, Gamma values for N, P and K in the first combination were 0.0002, 0.081 and 0.084, respectively. Calculated Gamma values from other combinations were compared with the preceded numbers and subsequently the effective parameters of the model were defined. The results showed that for N, five variables have positive effects on Gamma value. The highest effect was observed from the OC, EC, silt content, lime content and ω (Table 2). Kovačević et al. (2010) and Jeong et al. (2017) reported a strong positive correlation between C and N. For P, four variables (pH, OC, clay and fractal dimension (D)), had significant influence on Gamma value. The most important variables for predicting K were pH, EC, OC, clay and lime content. On the other hand, the results indicated that D had the lowest effect for the prediction of N (Gamma value = 0.0005) while for P and K the lowest effects were related to sand and silt, respectively. The results illustrated the importance of clay content as an effective variable for the prediction of K and P. These findings are reasonable because clay minerals transformation is strongly dependent on K amounts of soil and also the dynamics causing either release or removal of K to or from the soil (Raheb & Heidari, 2012). Raheb & Heidari (2012) reported a strong correlation between K and clay content in the soil. Moreover, EC and OC were evaluated as the most important factors in predicting soil macronutrients. In addition, Gamma values between CaCo3 and N and K values were noticeable. Decomposition of SOM increases ions in soil liquid phase. Such modifications have negative impact on soil EC, which itself is tuned by several soil fertility variables like K, Ca, Mg, pH, N, P, CaCO3, CEC, and OC (Bronson et al., 2005;Aimrun et al., 2009;Peralta & Costa, 2013).
Sensitivity analysis showed that the soil pH was an important variable in the prediction of K since most of the studied soils were slightly alkaline (mean and median of pH in the studied areas were 7.5 and 7.6, respectively). By increasing pH in soils, K enters the hexagonal cavities. Consequently, K fixation increases and it becomes inaccessible. The negative relation between pH and K obtained by the regression analysis proves this hypothesis.
A higher V-ratio also suggests the importance of combination for modeling. This V-ratio corresponds well to the resulting Gamma value (Fig. 2). Therefore, a higher Gamma value corresponds to a higher V-ratio. The greatest and the least V-ratios in the prediction of N were related to OC (0.376) and D, respectively. The highest V-ratio in the prediction of P and K was related to sand (0.251) and silt (0.317), respectively. On the other hand, the lowest V-ratio in the prediction of P and K was related to soil pH (0.50, 0.42). These results explain that soil pH

Comparison of models
ANN, k-NN and SVR models ran using the best input attributes which were selected from the Gamma sensitivity analyses. Three-layer ANN including input layer, hidden layer, and output layer were trained for prediction of soil macronutrients. In the input layer, the number of neurons were defined as the number of independent variables (i.e. five variables). In the output layer, the number of neurons were determined considering the number of dependent variables (i.e. soil macronutrients). Amini et al. (2005) reported that enormous neurons can lead over fitting, whilst too few hidden neurons can cause under fitting. In this study, for finding the most equitable number of neurons in the hidden layer, neural networks with various combinations of the neuron numbers (ranging from 2 to 20) were investigated. The most suitable number of hidden neurons to predict N and K varied from 6 to 9, while the best number of hidden neurons for that of P varied between 6 and 10. Table 3 shows the statistical summary of the studied ANN, k-NN and SVR models. The results indicated that the SVR was the best model in prediction of all the studied nutrients with the lowest median of RMSE. Also comparison of median of RMSE among 10 studied subsets showed that the SVR (median RMSE=3.63) and the k-NN (median RMSE=3.66) models had similar accuracies in the prediction of P nutrient. The k-NN model had smaller RMSE than the ANN for the prediction of P. But the RMSE of ANN in predicting N and K was smaller than that of the k-NN. Comparison of the models' performance among the studied soil nutrients displayed that the most accurate result was obtained for the N prediction models, Figure 2. Sensitivity analysis of the output variables (nitrogen, phosphorus and potassium nutrients) to the input attributes of the models. EC: electrical conductivity; OC: organic carbon; D: fractal dimension of the particle size distribution (PSD); ω: composite scaling constant. which might be linked to the relationship among the input and the output variables. Measured and estimated values of the studied nutrients (N, P, and K) are compared in Fig. 3. As it can be illustrated from the scatterplots, the SVR has the nearest estimates to the measured values of all the three nutrients. Fig. 3 shows that, for N nutrient, all the applied models illustrated the least scattered estimates and their estimates were close to the measured N values compared to those of K and P nutrients. For the prediction of N, the k-NN method gave the worst estimates. In this case, RMSE of the SVR model was 0.63 mg g -1 , which was lower than that of the ANN (0.91 mg g -1 ) and the k-NN (0.97 mg g -1 ) models (Table 4). This means that the SVR model had 31% and 35% less RMSE than the ANN and the k-NN models, respectively. According to Table 4, the values of RMSE for ANN, k-NN, and SVR for K were 212, 204.4 and 186 (mg kg -1 ), respectively. These results demonstrate that the k-NN model had lower RMSE compared to the ANN model. For K, the SVR model had 12% and 8% less RMSE than ANN and k-NN models, respectively. For P, SVR also revealed a lower RMSE compared to k-NN and ANN models but both ANN and k-NN models had similar RMSE. In this case, the SVR model had 8% less RMSE than both ANN and k-NN models. MR for estimating N and P values using the three models was negative (Table 4). These findings indicate that the studied models overestimated the values of P and N. For K, the ANN and k-NN models had negative MR but SVR had a positive MR, showing that both ANN and k-NN models overestimate the K values but SVR underestimates the K values. Zolfaghari et al. (2016) reported that the bias in k-NN in the prediction of CEC was positive. Nemes et al. (2006) also reported a positive MR for estimation of soil moisture content using k-NN. Fig. 4 illustrates r-RMSE values split up per test stage for all three nutrients and models. For N, the individual accuracy of the ANN model per test stages fluctuated more than k-NN and SVR models. But the k-NN model approach illustrated the maximum r-RMSE (0.76) value (minimum degree of accuracy). For K, the values from the ANN models fluctuated more than k-NN and SVR models. For this nutrient, the maximum r-RMSE (0.46) was obtained by both k-NN and ANN models. For P, r-RMSE fluctuation of all the studied models was the same. Our results indicate that the SVR tends to be a more accurate model for prediction of macronutrients. These results were different from Shiri et al. (2017), who reported that support vector machine (SVM) gives the worst estimates of soil CEC. The performance fluctuations among test stages indicate the need for evaluation of the models' performances using k-fold testing data set assignment approach and not only considering a single data set assignment.  Prediction of soil macronutrients using fractal parameters and artificial intelligence methods In summary, easily measurable soil properties, including soil PSD, pH, EC, CaCO 3, OC and fractal parameters (D and ω) were used to predict soil macronutrients (i.e. N, P, and K) using nonparametric methods like ANN, k-NN and SVR. Sensitivity analysis obtained by Gamma test showed that OC, CaCO 3, EC, ω, and silt for N; pH, EC, OC, clay and D for P; pH, EC, CaCO 3, OC and clay for K increased the Gamma value and subsequently had profound positive effects on the performance of the studied models. Results demonstrated that SVR has the best performance for the prediction of all studied nutrients. Also, there was no significant difference in accuracy of SVR and k-NN in prediction of P. In k-NN method the user can incorporate additional data by appending to or replacing the reference database without the need for developing new equations. In addition, the user is able to improve a specific local data by inserting available local data to a reference database without any significant influence on other available parts of the data, in the reference database. Therefore, we suggest to use the k-NN approach for prediction of P nutrient. There was no significant difference in the performance of k-NN and ANN models in prediction of N and K nutrients. Thus, we conclude that the k-NN approach could well compete with many other methods of pedotransfer functions, like ANN models. The lowest and the highest uncertainties were found in the prediction of N and K nutrients, respectively.