PCA versus ICA for the reduction of dimensions of the spectral signatures in the search of an index for the concentration of nitrogen in plant

Spectral vegetation indices have been widely used as estimators of the nutritional status of crops. This study evaluated whether the effectiveness of these indices for estimating nitrogen concentration can be improved by processing the spectral signatures with dimension reduction techniques. The model was also required to be valid over a wide range of growing conditions and phenological stages, which strengthens the guarantee of predictive power and reduces the implementation effort. The work used an agronomic trial with dual-purpose triticale (X Triticosecale Wittmack) whose design included plots with different planting densities, numbers of grazings and fertilizer doses. The spectral signatures of the leaves were recorded with an ASD FieldSpec 3 spectroradiometer, and nitrogen concentrations were determined by the Kjeldahl method. The factors affecting nitrogen concentration were identified by analysis of variance and pairwise comparisons, and the mean spectral signature was then calculated for each of the resulting groups. Dimension reduction was performed with both PCA and ICA. The analysis of the relationships between components and nitrogen concentration showed that only the components obtained with PCA generated a significant model (p = 0.00) with R² = 0.68. The best spectral vegetation index in this test, the reflectance in green, obtained R² = 0.31. Although further confirmation is needed, this study shows that PCA may be a viable alternative to spectral vegetation indices.

Estimation of the nitrogen nutritional status by spectral analysis with PCA and ICA

Abbreviations used: Adj R² (adjusted squared correlation coefficient); ICA (independent component analysis); MSE (mean square error); NDVI (normalized difference vegetation index); PCA (principal component analysis); RMSE (square root of the mean square error); SNR (signal-to-noise ratio); SWIR (shortwave infrared); VNIR (visible and near-infrared).

This work follows the line of research that applies radiometry and artificial intelligence to precision agriculture. The purpose of this study is to present a methodology that, under conditions easily reproducible on farms, improves on the effectiveness of methods based on the classical spectral vegetation indices and offers greater guarantees of spatial and temporal generalization.
With dual-purpose triticale (X Triticosecale Wittmack), the crop used in this study, livestock can be allowed to graze on more than one occasion without ruining the final harvest. After each grazing the plant has to regenerate its above-ground part, and with it the ability to synthesize chlorophyll, but the below-ground part is unaffected, so the plant's capacity to absorb nitrogen remains intact. The result is an imbalance in the first weeks after each cut, which manifests as a yellowing of the plant. This is not a symptom of nitrogen deficiency, since there has been neither a decrease in the concentration of this element nor an interruption in plant growth.
The loss of greenness is therefore not always related to a nutritional deficiency. This peculiarity of the crop chosen for the study requires that the radiometric model be based on spectral features different from those used in many spectral vegetation indices, such as the index of reflectance in green. These new spectral features have to be more closely related to nitrogen concentration, because they must respond correctly under conditions in which the classical indices err.
There is no competition between the old and new spectral features; the choice is a necessity dictated by the crop. When models are developed for other cereals that are not dual-purpose, both kinds of feature can be integrated to obtain better effectiveness.
One way to increase the guarantees of predictive power is to limit the number of models: only one model will be developed for the entire period during which a nitrogen deficiency in the crop could still be corrected. The changes in the plant during its development (Marschner, 1995; Azcón-Bieto and Talón, 2003) complicate the development of generic models, but if such a model can be developed it provides greater guarantees, by Occam's razor. In addition, a single development would reduce the involvement of specialists in the implementation, which would reduce costs and facilitate technology transfer.

Introduction
The crop's demand for nitrogen throughout its development is well known (Alaru et al., 2004), and both excess and deficit have a negative impact on production, operating costs and environmental conservation.
The technology available today can provide huge volumes of information at a reasonable cost; the challenge is the correct interpretation of that information (Moran et al., 1997). Reflectance measurements can be used to obtain the values of the most widely used spectral indices (Guyot et al., 1988; Yoder and Pettigrew-Crosby, 1995; Gitelson and Merzlyak, 1996; Blackburn, 1998; Gitelson et al., 1999; Daughtry et al., 2000; Ustin et al., 2004) as indicators of chlorophyll concentration. The suitability of these indices for estimating the concentration of nitrogen in plants is limited; Li et al. (2010) showed that their predictive power can reach R² = 0.5 for certain growth stages. Heege et al. (2008) reached higher values, but they linked the spectral signature with the dose of fertilizer applied and not with the nutritional status of the plant.
The work of Li et al. (2010) revealed one of the challenges to overcome: the poor spatial and temporal generalization of the models developed. Another difficulty was evidenced in the work of Rodriguez-Moreno and Llera-Cid (2011): the tests are conducted under conditions that are difficult to reproduce on real farms. On a real farm, for example, there is no panel of experts dedicated to calibrating the methodology, so the tasks of identifying acceptable shortcuts in the procedure and finding out the true effectiveness of the methodology are left to farmers.
In this context, only large farms can commit strongly to these new technologies, because small improvements per unit area add up to a significant increase in production and profit, which compensates for the salaries of specialist staff and the equipment costs required for implementation. This is unfortunate, since precision agriculture, besides trying to maximize profits, also seeks sustainability and environmental protection (Moran et al., 1997).
Studies, some of them six years old (Waheed et al., 2006), have shown the high potential of radiometry and artificial intelligence in the field of precision agriculture. The huge volume of data obtained in a hyperspectral sampling is very complex to analyze, and its processing has a high computational cost. Dimension reduction techniques, which filter noise, identify redundancies and reveal hidden structure, were devised for contexts such as this. The most widely used are principal component analysis (PCA) (Rao, 1964) and independent component analysis (ICA) (Hyvärinen and Oja, 2000). This study examines whether the components identified by these techniques contain the information necessary to estimate, by a linear regression model and under the conditions described above, the nitrogen concentration in the plants.
The anticipated results would provide evidence for three hypotheses: that the spectral signatures contain features closely related to nitrogen concentration, that models valid over a wide phenological range can be developed, and that dimension reduction techniques can process the spectral signatures while preserving the information on nutritional status.

Material and methods
As part of a study at the "La Orden-Valdesequera" Research Centre (Badajoz, Spain) to determine the optimal combination of factors for the cultivation of triticale, the reflectance of the leaves was measured at different stages of crop development.
The experimental design was a split-split-plot with four replicates. The first factor was seeding density (400, 500 and 600 plants m⁻²), the second the number of times the crop was cut to simulate grazing (0, 1 and 2 grazings), and the third the dose of nitrogen fertilizer (0, 75 and 125 kg ha⁻¹). With three levels per factor and four replicates, there were 108 experimental plots in total (3 × 3 × 3 × 4), each of 30 m². Leaf reflectance measurements were made at 80, 117, 132 and 164 days after seeding. Together with these measurements, crop samples were taken and sent to the laboratory for determination of total nitrogen concentration by the Kjeldahl method. The correspondence between the number of days after seeding and the crop's phenological stage, along with its description, is presented in Table 1. The growth stages were determined using the Zadoks (Zadoks et al., 1974) and Feekes (Large, 1954) scales.
In accordance with the experimental design, the influence of each factor on the nitrogen concentration measured in each of the 108 plots was analyzed for each sampling day, since it was not known in advance whether all the factors, at all levels, affect nitrogen concentration. The split-split-plot analysis of variance (ANOVA) and the pairwise comparisons (Fisher's LSD procedure) determined a grouping of the plots according to nitrogen concentration (significance level 0.05). Figure 1 is a flowchart explaining this analysis. This and the rest of the calculations were done using R 2.9 (R Development Core Team, 2004).
On each sampling date, 20 leaves were collected at random in each of the 108 elementary plots. Ten estimates of the reflectance (each averaging 50 readings) were made on these samples with the ASD FieldSpec 3 spectroradiometer. This device has a spectral range of 350-2500 nm, a sampling interval (the spacing between sample points in the spectrum) of 1.4 nm for the range 350-1000 nm and 2 nm for the range 1000-2500 nm, and a spectral resolution (the full width at half maximum of the instrument response to a monochromatic source) of 3 nm at 700 nm and 10 nm from 1400 nm to 2100 nm. Readings were performed using a plant probe plus leaf clip. The light source of the plant probe is a halogen bulb with a colour temperature of 2901 ± 10 K.
The reflectances at the wavelengths where the transitions between the spectroradiometer sensors occur (VNIR-SWIR1 and SWIR1-SWIR2) were removed from the spectral signatures, since instrumental errors can be found in these regions.
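The removal of the sensor-transition bands can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the exact wavelengths or widths of the removed regions, so the transition centres (1000 and 1800 nm), the ±5 nm window and the 1 nm wavelength grid are hypothetical placeholders to be replaced with the values documented for the instrument, and the function name is invented for the example.

```python
import numpy as np

def drop_transition_bands(wavelengths, reflectance, transitions, half_width):
    """Remove the bands lying within half_width nm of each detector transition.

    wavelengths: 1-D array of band centres (nm).
    reflectance: array whose last axis matches wavelengths.
    """
    keep = np.ones(wavelengths.shape, dtype=bool)
    for centre in transitions:
        # Keep only bands strictly farther than half_width from this transition.
        keep &= np.abs(wavelengths - centre) > half_width
    return wavelengths[keep], reflectance[..., keep]

# Synthetic stand-in data: three flat spectra on an assumed 1 nm grid.
wavelengths = np.arange(350, 2501)
reflectance = np.full((3, wavelengths.size), 0.4)

# Hypothetical transition centres and window; not taken from the study.
kept_wl, kept_refl = drop_transition_bands(
    wavelengths, reflectance, transitions=(1000, 1800), half_width=5
)
```

Masking on the last axis lets the same function handle a single spectrum or a whole matrix of spectra.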
For each of the groups of elementary plots, formed according to their nitrogen concentration on each sampling date, the mean reflectance was calculated by averaging the readings taken in their respective plots. Because the number of models was limited to one, from this point on all the pairs of nitrogen concentration and spectral signature were pooled into a single set, regardless of the sampling date, since the model had to provide a correct estimate for all of them without knowing that information.
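The averaging step can be illustrated with a short sketch. The group labels, the array layout (one row per reading, one column per wavelength) and the function name are assumptions made for the example; the study itself performed its calculations in R.

```python
import numpy as np

def mean_signature_per_group(signatures, group_labels):
    """Average the spectral signatures that share a group label.

    signatures: 2-D array, one row per reading, one column per wavelength.
    group_labels: 1-D array giving the group of each row.
    Returns the sorted unique labels and one mean signature per label.
    """
    labels = np.asarray(group_labels)
    groups = np.unique(labels)
    means = np.vstack([signatures[labels == g].mean(axis=0) for g in groups])
    return groups, means

# Tiny synthetic example: four readings, two wavelengths, two groups.
signatures = np.array([[0.10, 0.40],
                       [0.20, 0.50],
                       [0.30, 0.60],
                       [0.50, 0.80]])
group_labels = np.array([0, 0, 1, 1])
groups, means = mean_signature_per_group(signatures, group_labels)
```

Each row of `means` then plays the role of the representative signature paired with one nitrogen concentration.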
Dimension reduction techniques filter noise, identify redundancies and reveal hidden structure. There are different strategies; each makes different assumptions about the original components and the mixing process, and these mathematical assumptions may not fit a given data set perfectly. This is why two techniques with such different assumptions were tested.
Principal component analysis (PCA) searches for a new basis that best expresses the data, restricted to linear combinations of the original basis (the one in which the data were collected); this restriction simplifies the search. For PCA, the dynamics of interest are those with the best signal-to-noise ratio (SNR), which is the search criterion for the new basis. The number of dimensions is reduced by eliminating noise and redundancy (Rao, 1964).
Independent component analysis (ICA) was the other technique tested. This dimension reduction technique seeks new components whose information is statistically independent (Hyvärinen and Oja, 2000). ICA has proved successful in many cases in which PCA fails (Ozdogana, 2010).
PCA returns as many components as inputs. In this study, the smallest group of new components needed to explain at least 95% of the total variance was identified and the remaining components were discarded; this is how dimension reduction was achieved. ICA is different: the number of independent components to generate must be specified. In this study, the number of principal components retained was used as an estimate of the number of independent components needed, and several tests were performed around that number.
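The selection rule for the number of principal components can be sketched as follows. This is a minimal illustration using a singular value decomposition in NumPy, not the procedure actually run in R for the study; the function name, the synthetic data and the default 0.95 threshold argument are assumptions made for the example.

```python
import numpy as np

def pca_components_for_variance(X, threshold=0.95):
    """Project X onto the fewest principal components reaching `threshold`.

    X: 2-D array, one row per observation, one column per wavelength.
    Returns (scores, k, cumulative), where k is the smallest number of
    components whose cumulative explained-variance fraction >= threshold.
    """
    Xc = X - X.mean(axis=0)                       # centre each wavelength
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)               # variance fraction per component
    cumulative = np.cumsum(explained)
    # First index where the cumulative fraction reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, s.size)                            # guard against rounding at 1.0
    scores = U[:, :k] * s[:k]                     # coordinates in the new basis
    return scores, k, cumulative

# Synthetic stand-in: 14 "signatures" driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(14, 3))
mixing = rng.normal(size=(3, 200))
X = latent @ mixing + 0.01 * rng.normal(size=(14, 200))

scores, k, cumulative = pca_components_for_variance(X, threshold=0.95)
```

The columns of `scores` are the candidate regressors; because they come from orthogonal directions, they are mutually uncorrelated, which keeps the later regression well conditioned.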
A linear regression model of nitrogen concentration (%) was built for each of the sets of components generated (PCA and ICA). The goodness of fit of each model was evaluated by calculating the squared correlation coefficient (R²), the adjusted squared correlation coefficient (adj R²), the square root of the mean square error (RMSE) and the statistical significance (p-value) of the model (analysis of variance). Figure 2 is a flowchart that summarizes the whole process.
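The goodness-of-fit measures named above can be computed as in the following sketch, which fits an ordinary least-squares model with an intercept. It is an illustration, not the study's R code: the function name and the synthetic data are hypothetical, the RMSE here divides the residual sum of squares by the number of observations (some conventions divide by the residual degrees of freedom instead), and the p-value of the model is omitted because it requires the F distribution.

```python
import numpy as np

def fit_linear_model(components, y):
    """Least-squares fit of y on the given components plus an intercept.

    Returns (coef, r2, adj_r2, rmse); the last entry of coef is the intercept.
    """
    n, p = components.shape
    A = np.column_stack([components, np.ones(n)])    # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)            # residual sum of squares
    ss_tot = float(np.sum((y - y.mean()) ** 2))      # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    rmse = float(np.sqrt(ss_res / n))                # assumption: divide by n
    return coef, r2, adj_r2, rmse

# Synthetic check: a response that is exactly linear in two components.
rng = np.random.default_rng(1)
components = rng.normal(size=(14, 2))
y = 2.0 * components[:, 0] - 1.0 * components[:, 1] + 0.5
coef, r2, adj_r2, rmse = fit_linear_model(components, y)
```

The adjusted R² penalizes the number of regressors, which matters when the number of components is not small relative to the number of observations.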
The score to beat is 0.31, the R² obtained by the green reflectance index, which is the highest correlation found between nitrogen concentration and the classical spectral indices. This value was calculated with the same dataset as this study and evaluated under the same conditions. The list of spectral indices analyzed in the comparison and other details of that study are given in Rodriguez-Moreno and Llera-Cid (2011).

Results and discussion
The split-split-plot ANOVA (Table 2) and the pairwise comparisons (Fisher's LSD procedure) identified the factors affecting nitrogen concentration (significance level of 0.05 in all tests).
Seeding density had no effect at any time; the levels chosen may not have been appropriate, or the effects may have appeared later.
The effect of the number of grazings was not analysed in the first dataset, since the first cut was made after this sampling. In the datasets of the second and third samplings, with two levels (0 and 1 cut), the number of grazings was found to have an effect on nitrogen concentration. The analysis of the fourth dataset, the first with three levels since the second cut was made after the third sampling, revealed that the three levels of the factor number of grazings had a significant effect on plant nitrogen concentration.

[Figure 1 (flowchart of this analysis); recovered box text: Plots that differ from each other only by a factor level without effect are repetitions of the same configuration. A given factor level could have a significant effect at one time and not at another, so the analysis was performed for each sampling day. A model valid for the 4 sampling days was sought, so all the data were grouped into a single set. The 432 determinations of nitrogen (108 experimental plots × 4 sampling days) yielded 14 different values. The factors number of grazings (by livestock) and fertilizer dose have effects, but not at all levels. Planting density has no effect on plant nitrogen concentration.]
Fertilizer dose had an effect on nitrogen concentration in all the datasets, but only in the fourth dataset did the 75 and 125 kg ha⁻¹ levels have different effects. That the effects of the two higher doses do not differ is to be expected: triticale takes up nitrogen from the soil throughout its development, but most absorption occurs at a developmental stage later than the dates on which the samplings were done (Lance et al., 2007).
Plots that differed only by a factor or level without effect were repetitions, so the representative nitrogen concentration and spectral signature were obtained by averaging the data from plots with the same configuration. The 14 different nitrogen concentrations recorded in the 108 plots over the four sampling days were: 1.02, 1.19, 1.22, 1.31, 1.36, 1.42, 1.52, 1.54, 1.64, 1.95, 2.32, 2.74, 3.14 and 3.40% nitrogen.
The cumulative variance explained by the first 10 components obtained using PCA reached 99.5% of the total variance. This percentage was considered sufficient, so only the first 10 components were used in developing the linear regression.
Since the same test could not be used to determine the appropriate number of components for ICA, 7 linear regressions were calculated (using 7 to 13 components). If 99.5% of the variance of the reflectance could be explained with only 10 components in the case of PCA, ICA was expected to need a similar number.
A linear regression of nitrogen concentration (%) was built for each of the sets of components generated (PCA and ICA). The goodness of fit of each model (Table 3) was evaluated by calculating the R², the adj R², the RMSE and the p-value of the model.

[Figure 2 (flowchart summarizing the whole process); recovered box text:
Spectral signatures: 432 (108 plots × 4 sampling days).
Calculation of the spectral signatures representative of each nitrogen concentration, averaging all signatures obtained on plots with the same concentration.
Principal component analysis (PCA): the criterion that guides the transformation is improving the signal-to-noise ratio. The new basis should be a linear combination of the original. Number of components: the smallest set that explains 95% of the variance.
Independent component analysis (ICA): the information contained in each of the new components is statistically independent. A good estimate of the number of independent components is the optimal number of principal components (ONPC), determined in the PCA. Seven ICAs were done, with the following numbers of independent components: ONPC, ONPC+1, ONPC+2, ONPC+3, ONPC-1, ONPC-2, ONPC-3.
Independent variable: principal components.
Independent variable: independent components (7 datasets).
Dependent variable: plant nitrogen concentration (%).
Calculation of linear regression: [Nitrogen concentration] = a₁ × Component 1 + … + aₙ × Component n + b. Relationship evaluated by the squared correlation coefficient, the adjusted squared correlation coefficient, the square root of the mean square error and the statistical significance of the model.]

The best fit, R² = 0.68, was reached in the linear regression with the PCA components. This result indicates that the spectral signatures of crops meet the assumptions on which PCA is based (linearity in the change of basis, higher variance meaning greater importance of the variable in the dynamics, and orthonormal principal components).
The best model is presented in Table 4. All components except the fourth and the ninth were included in the model (significance level of 0.05). The eighth component has the highest coefficient, but the seventh, tenth and second are close, so nitrogen concentration cannot be identified with any particular component; the derived model must be used.
This study provides evidence that data mining is an effective technique for analyzing spectral signatures in the search for estimators of the nutritional status of the crop.
This work shows that the changes in the crop throughout its development are not sufficient to prevent the development of a single model. This means that implementing this methodology would require, at most, one calibration study per crop season. It is worth recalling here the high variability of growing conditions among the experimental plots, which suggests that a single model adjustment could be valid for a large area.
In an evaluation, under the same conditions and with the same dataset, of the potential of spectral vegetation indices to estimate plant nitrogen concentration, the best index was the reflectance in green, which reached R² = 0.31 (Rodriguez-Moreno and Llera-Cid, 2011). The results of this study place the PCA strategy above spectral indices such as NDVI. This is not surprising, because it is not the first work to improve on their effectiveness. An example is the result obtained by Waheed et al. (2006), who developed a decision tree with a classification hit rate above 90% in a similar experiment. Yet the models developed by Waheed et al. (2006), more than 5 years ago, have been unable to replace the spectral vegetation indices; NDVI keeps its hegemony in scientific and commercial uses.
NDVI needs the reflectance at only two wavelengths, whereas the methodology presented in this article needs the complete hyperspectral signature. More information improves the estimates, but to displace NDVI, simplicity and cost effectiveness are as important as effectiveness.
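For reference, the two-band nature of NDVI can be seen in its standard definition, (NIR - red) / (NIR + red). The sketch below uses illustrative reflectance values only; the article does not state which red and near-infrared wavelengths would be used, so the choice of bands is left to the caller.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from two reflectances.

    nir: reflectance in a near-infrared band.
    red: reflectance in a red band.
    """
    return (nir - red) / (nir + red)

# Illustrative values, not measurements from the study.
example = ndvi(nir=0.5, red=0.1)
```

The whole index is one arithmetic expression on two numbers, which is the simplicity that a model based on the full hyperspectral signature has to compete with.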
The complexity of the new methodology could be reduced by identifying the wavelengths with the greatest weight in the components included in the model and using them to develop a model that, instead of estimating nitrogen concentration, estimates whether the plant is deficient in nitrogen. In that case only the reflectance at a few wavelengths would be needed, and the data processing would be easier.
Competing with NDVI in terms of profitability would become possible with more studies supporting the greater effectiveness of this method in all scenarios and confirming that it is valid for large areas with only one calibration study per crop season. As long as these two points do not translate into greater simplicity and profitability, the methodology presented will not end the hegemony of the spectral vegetation indices. Li et al. (2010) presented the model with the most predictive power; their development tested both spectral vegetation indices and a brute-force search, and given the difficulties of the brute-force search, the methodology presented in this article can be said to have a similar complexity. The models developed by Li et al. (2010) are specific to certain phenological stages and their best model reaches R² = 0.5; this work achieves a small improvement, R² = 0.68, with a model valid for the whole period during which a deficiency in the crop could be corrected.
It is very likely that this study has determined the lower bound of efficacy. Dual-purpose triticale tolerates grazing by livestock throughout its development without ruining the final harvest, which makes it a special crop with additional difficulties for the development of radiometric models (details are given in the introduction). Models developed for other cereals are expected to achieve higher scores, but studies are needed to quantify this.
Developing this model required the processing of over nine million data points. If a similar model were developed with data from various scenarios (other locations, different weather, varieties, etc.), the data volume would grow exponentially, making it impossible to process even for supercomputing centres. Proving that dimension reduction techniques are effective is the first step required to initiate such studies.
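The order of magnitude of this figure can be checked with a short calculation. The plot, sampling-day and estimate counts come from the Material and methods section; the number of bands per spectrum is an assumption (one value per nanometre over the 350-2500 nm range, before removing the sensor-transition regions), so the result is a plausibility check rather than the authors' own count.

```python
plots = 108                 # experimental plots (from the text)
sampling_days = 4           # sampling dates (from the text)
estimates_per_plot = 10     # reflectance estimates per plot and date (from the text)

# Assumption: one reflectance value per nanometre from 350 to 2500 nm.
bands_per_spectrum = 2500 - 350 + 1

total_values = plots * sampling_days * estimates_per_plot * bands_per_spectrum
print(total_values)         # 9292320
```

Under that assumption the count is linear in each factor, so the total scales with the product of the numbers of plots, dates, readings and bands.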
The progress made has the limitation of requiring the spectral signatures of the leaves; an ongoing investigation will determine whether the model can be adjusted to operate with measurements of vegetation canopy reflectance.