Screening of transgenic maize using near infrared spectroscopy and chemometric techniques

The applicability of near infrared (NIR) spectroscopy combined with chemometrics was examined to develop fast, low-cost and non-destructive spectroscopic methods for classification of transgenic maize plants. The transgenic maize plants containing both cry1Ab/cry2Aj-G10evo proteins and their non-transgenic parent were measured in the NIR diffuse reflectance mode with the spectral range of 900–1700 nm. Three variable selection algorithms, including weighted regression coefficients, principal component analysis loadings and second derivatives, were used to extract sensitive wavelengths that contributed the most discrimination information for these genotypes. Five classification methods, including K-nearest neighbor, Soft Independent Modeling of Class Analogy, Naive Bayes Classifier, Extreme Learning Machine (ELM) and Radial Basis Function Neural Network, were used to build discrimination models based on the preprocessed full spectra and sensitive wavelengths. The results demonstrated that ELM had the best performance of all methods, even though the model’s recognition ability decreased when the number of input variables was reduced by using only the sensitive wavelengths. The ELM model calculated on the calibration set showed classification rates of 100% based on the full spectrum and 90.83% based on sensitive wavelengths. NIR spectroscopy combined with chemometrics offers a powerful tool for evaluating large numbers of samples from maize hybrid performance trials and breeding programs. Additional keywords: facile screening method; Zea mays; transgenic maize selection; discrimination model.
Abbreviations used: ANN (Artificial Neural Network); Bw (Weighted Regression Coefficient); ELM (Extreme Learning Machine); KNN (K-nearest Neighbor); KS (Kennard-Stone); NBC (Naive Bayes Classifier); NIR (Near Infrared); PC (Principal Component); PCA (Principal Component Analysis); PCR (Polymerase Chain Reaction); PLS-DA (Partial Least Squares Discrimination Analysis); RBFNN (Radial Basis Function Neural Network); SIMCA (Soft Independent Modeling of Class Analogy); SVM (Support Vector Machine). Authors’ contributions: Conceived and designed the experiments: XF and HY. Performed the experiments: XF, CZ and CP. Analyzed the data: XF. Wrote the paper: XF and YH. Citation: Feng, X.; Yin, H.; Zhang, C.; Peng, C.; He, Y. (2018). Screening of transgenic maize using near infrared spectroscopy and chemometric techniques. Spanish Journal of Agricultural Research, Volume 16, Issue 2, e0203. https://doi.org/10.5424/sjar/201816211805 Supplementary material (Fig. S1) accompanies the paper on SJAR’s website. Received: 31 May 2017. Accepted: 19 Jun 2018. Copyright © 2018 INIA. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC-by 4.0) License. Funding: 863 National High-Tech Research and Development Plan (Project No: 2013AA10030401); State Key Laboratory Breeding Base for Zhejiang Sustainable Pest and Disease Control (No. 2010DS700124-KF1712). Competing interests: The authors declare no competing financial interests. Correspondence should be addressed to Yong He: yhe@zju.edu.cn


Introduction
Plant breeding uses molecular biology to produce new crop varieties or lines with desirable properties by using techniques to select and introduce genetic modifications and desirable traits into plants (Liu et al., 2015; Yadav et al., 2015; Yang et al., 2017). One major technique of plant breeding is selection, the process of selectively propagating plants with desirable traits and eliminating those with less desirable traits (Schart et al., 2016). This requires plant breeders to screen large populations of crops for individuals that possess the characteristics of interest. Currently, there are various molecular methodologies for plant breeding, such as polymerase chain reaction (PCR) (Taverniers et al., 2004), enzyme linked immunosorbent assays (Kamle et al., 2011) and microarrays (Xu et al., 2005). However, these DNA- and protein-based methods for identification of transgenic plants are time consuming and costly when studying large numbers of samples, and thus unsuitable for on-line application. Therefore, a method for the selection of transgenic samples after transformation that does not require any wet chemistry, particularly the procedure of leaf DNA extraction, would be advantageous where many sample analyses are required.
Near infrared (NIR) spectroscopy is an alternative to traditional chemistry procedures for qualitative and quantitative analysis of biological materials (Wu et al., 2014). The NIR region of 700-2500 nm can gather information related to different hydrogen-containing bonds (C-H, N-H and O-H), which are the primary structures of organic molecules. In contrast to biochemical assays, NIR spectroscopy does not require technical expertise or complex techniques, and the spectrophotometer can be installed anywhere with no requirement of reagents or complicated protocols (García-Molina et al., 2016).
NIR spectroscopy has been widely used for decades for qualitative and quantitative analysis in agriculture and food research, and has been used for determining the moisture content of peanut kernels (Jin et al., 2015), rice wine composition (Yu et al., 2015), vine water potential (De Bei et al., 2011) and, more recently, to estimate carotenoids in tomato products (Saad et al., 2017) and berry shrivel (Beghi et al., 2015). The application of NIR spectroscopic technology in the genetic field and especially in transgenic foods is now feasible (Alishahi et al., 2010). García-Molina et al. (2016) applied NIR spectroscopy to discriminate transgenic wheat lines with low gliadin content from non-transgenic lines. Guo et al. (2014) identified clear differences between transgenic and non-transgenic tomatoes using VIS-NIR together with discriminant partial least squares regression with excellent classification accuracy of up to 100%. The basis of this technology for application in the transgenic field is that it can identify phenotypic changes caused by genotypic changes that ultimately bring about changes in organic molecular bonds (Alishahi et al., 2010). However, due to the overlapping bands in the NIR region, the spectral analysis is not straightforward and requires chemometric methods to extract important information and classify the mass data set from transgenic and non-transgenic samples (Murayama et al., 2000). Chemometric approaches applied to spectra, using principal component analysis (PCA) and partial least squares discrimination analysis (PLS-DA) as well as support vector machines (SVM), have proved effective in distinguishing transgenic plants and food from non-transgenic samples (Liu et al., 2014; García-Molina et al., 2016; Feng et al., 2017).
Thus, the objectives of this study were to (1) evaluate the possibility and accuracy of using NIR spectra to discriminate transgenic maize plants for breeding screening purposes, (2) identify sensitive wavelengths that account for the differences between transgenic and non-transgenic maize plants and (3) evaluate the performance of five discriminate models and establish an optimal model for classification.

Leaf samples
Seeds of transgenic maize (Zea mays L.) (containing both cry1Ab/cry2Aj-G10evo genes) and its parental line were provided by the Institute of Insect Sciences, Zhejiang University, China. The transgenic maize line contained both herbicide and insect tolerance traits created by Agrobacterium tumefaciens mediated transformation. The seeds were sown in plastic buckets in a 1:1:1 mix of soil:calcined clay:torpedo sand. The plants were grown in a greenhouse for 2 months. The youngest fully expanded leaf on a shoot and the second or third leaf formed were selected for NIR scanning. PCR was used to check the integrity of copies of the genes introduced during the breeding phase and the expression of the inserted exogenous gene.

NIR scanning and pretreatment
Maize leaf samples were scanned using a field portable NIR spectroradiometer NIRez (Isuzuoptics, Taiwan, China) with a spectral range of 900-1700 nm. Reflectance spectra were collected every 10 nm within 900-1700 nm. Each sample was measured in triplicate to reduce measurement errors. Maize leaf samples were placed directly in the diffuse reflection accessory. A total of 326 maize leaves were sampled, comprising 163 transgenic and 163 non-transgenic samples, with at least one leaf collected from each plant. Using the Kennard-Stone (KS) algorithm (Saptoro et al., 2012), the whole dataset was divided into two groups: calibration and prediction sets. The KS algorithm calculates the Euclidean distance between every pair of NIR spectra and chooses the two spectra that are farthest apart as the first pair; it then calculates the Euclidean distances between the remaining samples and those already selected and adds further samples in turn, so that the samples in both sets are representative of the population, which helps to avoid overfitting to some extent. Based on the KS method, 120 transgenic and 120 non-transgenic samples were chosen for the calibration set. The remaining 43 transgenic and 43 non-transgenic samples formed the prediction set. Samples were classified according to their genetic background using a classification model, whose predicted values were expected to be close to the values used to codify each class. Unscrambler X 10.1 (CAMO PROCESS AS, Oslo, Norway) and MATLAB version R2010b (The MathWorks, Natick, MA, USA) were used to process the data. In addition, Origin Pro 7.0 SR0 (Origin Lab Corporation, Northampton, MA, USA) software was used to design graphs. Model performances were evaluated by the classification accuracy of the calibration and prediction sets.
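The Kennard-Stone selection described above can be illustrated with a minimal Python sketch. It follows the usual max-min formulation of the algorithm; the function name kennard_stone and the toy two-wavelength "spectra" are hypothetical and are not taken from the study, which applied the method to the measured leaf spectra.

```python
import math


def kennard_stone(spectra, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(spectra)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all spectra.
    dist = [[math.dist(a, b) for b in spectra] for a in spectra]
    # Start with the two spectra that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Repeatedly add the sample whose nearest selected neighbour is farthest away.
    while len(selected) < n_select:
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy "spectra" with two wavelengths each (hypothetical values).
toy = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [4.0, 0.0], [9.0, 0.0]]
calibration_idx = kennard_stone(toy, 3)
prediction_idx = [i for i in range(len(toy)) if i not in calibration_idx]
```

In this toy case the two extreme samples are picked first and the third pick is the sample lying farthest from both, so the calibration set spans the whole range while the prediction set falls inside it.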

Chemometrics and data analysis
The first step involving classification was carried out using an exploratory analysis with PCA (Bryant & Yarnold, 1995). The PCA developed on the whole NIR spectral data was used to visualize the possible clusters and trends in the PCA score plot. In the second step, five classification methods including K-nearest neighbor (KNN) (Gil-Pita & Yao, 2009), Soft Independent Modeling of Class Analogy (SIMCA) (Waddell et al., 2014), Naive Bayes Classifier (NBC) (Islam et al., 2007), Extreme Learning Machine (ELM) (Huang et al., 2012) and Radial Basis Function Neural Network (RBFNN) (Kosic, 2015) were applied on the original raw spectral data (90 bands) to identify the transgenic samples. Variable (wavelength) selection in multivariate analysis is an important step because the removal of highly correlated variables produces better predictions and a simpler process. Here, three variable selection algorithms [weighted regression coefficient (Bw), PCA-loadings and second derivative (2nd derivative)] were used to extract sensitive wavelengths that contributed the most discrimination information to these genotypes. In the final stage of this study, the actual roles of the extracted sensitive wavelengths were evaluated by establishing discrimination models based on the sensitive wavelengths. Classification methods were carried out using only a few wavelengths selected in the previous step as input, and the results were compared with the classification obtained by using the whole spectra. Figure 1 illustrates the main steps for the whole procedure.

PCA
PCA was used to reduce the dimensions of the original spectra into a low-dimensional subspace, in which the data are projected onto an alternative set of coordinates called principal components (PCs) (Rinnan et al., 2009). The number of PCs is less than or equal to the number of original variables, and the first few PCs contain most of the spectral information. For visual discrimination, we projected each of the spectra in the newly formed coordinate space of selected PCs (score plot), and the scores of the most significant PCs corresponding to each NIR spectrum were used. PCA is described in detail by Rinnan et al. (2009).

Important wavelength selection
Variable selection is quite efficient in spectral analysis for handling collinearity problems and extracting the most important information. Many approaches are available for selecting sensitive wavelengths, and identifying prominent peaks and/or valleys with Bw, 2nd derivative and PCA-loadings are among the most commonly used (Barbin et al., 2012; Rodríguez-Pulido et al., 2013; Zhang et al., 2015). In the present study, important wavelengths were selected from the Bw plot in the PLS regression model (Zhang et al., 2015). The 2nd derivative by the Savitzky-Golay method was used to identify key wavelengths related to variations in classification (Barbin et al., 2012). Loadings resulting from PCA of the raw spectral data represent the regression coefficient, and indicate the most dominant wavelength (Rodríguez-Pulido et al., 2013). Simplified classification models were then developed using the selected wavelengths from the above three methods, and the results were compared with the classification accuracy obtained with the whole spectral data.

Discriminate models
To accurately identify transgenic plants from the parental line, pattern recognition approaches, including KNN, SIMCA, NBC, ELM and RBFNN, were used to establish discriminate models. These methods are among the most commonly used for classification. The details of the related theory for these methods are found in the literature (Islam et al., 2007; Gil-Pita & Yao, 2009; Huang et al., 2012; Waddell et al., 2014; Kosic, 2015). Other discriminate models such as PLS-DA and SVM have been used by other researchers for discrimination of transgenic maize kernels and transgenic rice seeds (Liu et al., 2014; Feng et al., 2017).
The KNN method is used to classify objects based on the closest training examples in the feature space. By comparing the distance between unknown samples (testing set) and samples in the training set, samples are classified based on proximity to training set samples (Gil-Pita & Yao, 2009). For each row (spectral data) in the target dataset (the set to be classified), the K closest members (i.e. the KNNs) of the training dataset are located. A Euclidean distance measure is used to calculate how close each member of the training set is to the target row that is being examined.
SIMCA is performed to describe each group separately based on their similarities in a principal component space (Waddell et al., 2014). Objects are considered to belong to the class if their Euclidean distance from the constructed PC space is not significantly larger than the Euclidean distance of the class objects from their PC space.
NBCs are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features (Islam et al., 2007). NBC is calculated based on the simplifying assumption that the attribute values are conditionally independent given the target value.
ELMs are feedforward neural networks for classification or regression with a single layer of hidden nodes, where the weights connecting inputs to hidden nodes are randomly assigned and never updated (Huang et al., 2012). ELM has one input layer and one hidden layer; the weights between the input and hidden layers are randomly chosen, and the output weights are then determined by the minimal norm least square method.
RBFNN can separate a set of objects having different class memberships, and presents certain advantages including better approximation capabilities and shorter computational time (Kosic, 2015). In RBFNN, a radial basis function is used as the activation function for each node in the hidden layer; the network applies a nonlinear transformation from the input space to the hidden unit space, and its output is a linear combination of the radial basis functions.

Spectroscopic analysis
The spectral data were collected over the range of 900-1700 nm. Only spectra of 947.07-1666.49 nm were used for analysis as the head and the end of the spectra showed obvious noise caused by the instrument and the environment (Fig. S1 [suppl]). To eliminate the noise of the spectral data and improve the predictive ability for samples, raw spectra went through noise suppression by the Savitzky-Golay smoothing algorithm with a window size of 7 and polynomial of order 2 (Pan et al., 2010). The trend of spectra between transgenic and non-transgenic plants was very similar, with similar peak and valley positions (Fig. 2A & B). Slight differences were found between the mean spectral reflectance value of transgenic and non-transgenic maize (Fig. 2C). As most of the spectral information overlapped, it was difficult to discriminate the transgenic maize plants directly by their characteristic spectral feature. Therefore, chemometric methods were introduced to build a qualitative model for classification.
A PC model for exploratory purposes was first created to examine the qualitative difference of transgenic and non-transgenic maize leaves in PC space (Fig. 3). No distinct clustering was shown by scatter plots of PC1 vs. PC2 and PC3 vs. PC4 of transgenic and non-transgenic maize plants after PCA (Fig. 3A & B). Transgenic and non-transgenic maize were clustered together in the projection of PC5 with PC6 and could not be effectively separated (Fig. 3C). The discrimination based on PCA was not effective in classifying transgenic samples. It is worth mentioning that the overexpression of the cry1Ab/cry2Aj-G10evo gene by transgenic editing technology improves glyphosate and insect resistance and ultimately changes organic molecular bonds, but there is no other phenotypic difference between transgenic and non-transgenic maize (Feng et al., 2017). As PCA failed to separate transgenic maize from its parental line, other discriminant models were utilized for improved separation.

Classification performance based on entire spectral bands
Five discriminate models (KNN, SIMCA, NBC, ELM and RBFNN) were established on the full NIR spectra to evaluate the classification performance, and their comparisons are listed in Table 1. The prediction accuracy for each model was analyzed by the accuracy (in percentage) for the calibration and prediction sets. The accuracy of the classification was expressed as the fraction of correctly predicted samples to the total samples. The accuracy showed significant differences among the discriminate models calculated on the entire spectral bands.
The best performance was for ELM, with classification accuracy of calibration and prediction sets exceeding 95%. The RBFNN model was less accurate than the ELM model, but was still acceptable. RBFNN and ELM are typical artificial neural networks (ANNs) (Lian et al., 2014) and can learn nonlinear functions from the NIR spectral data. In the calibration set, the respective accuracies were both 100% for the two ANNs. The SIMCA, KNN and NBC models of the two sample sets were not satisfactory, with classification accuracies of the calibration set less than 80%. The discrimination performance by NBC was the lowest with accuracy of approx. 55%, since many problems encountered by modern analytical chemists are nonlinear and approaches such as NBC do not apply well. It is noteworthy that previous attempts to discriminate transgenic plants have also shown that linear classification methods were less satisfactory compared to those of SVM (Liu et al., 2014).
We used a chemometrics approach because the discriminate models were used to highlight the chemical differences between transgenic and non-transgenic maize plants. NIR spectroscopy can be used to identify transgenic samples as this technology can capture the phenotypic changes in chemical bonding of organic molecules that are altered as a result of genetic changes (Alishahi et al., 2010). Feng et al. (2017) developed a successful model to discriminate transgenic maize kernels based on NIR hyperspectral imaging with the spectral range of 874.41-1733.91 nm. They demonstrated that SVM and PLS-DA models established on the full range of NIR spectra had good classification performance. The hyperspectral imaging they used had the advantage of acquiring spectral and spatial information, which allowed the identification of transgenic maize kernels on the prediction maps. Compared to hyperspectral imaging, our simple instrument acquires small point-source information from the sample and does not contain spatial information, which is also important for discrimination. However, the NIR system that we used was portable and could be used from a USB flash drive without need of any installation, which is very helpful for the transgenic crop selection purposes of crop breeding laboratories. García-Molina et al. (2016) used spectral sensing in the region of 400-2500 nm to discriminate transgenic wheat grain with excellent accuracy. Moreover, NIR combined with chemometrics has proved effective in identification of transgenic soybean oils (Luna et al., 2013), rice mutant seeds (Liu et al., 2014) and transgenic tomato (Xie et al., 2007). That is to say, a discriminant analysis model based on NIR spectra obtained enough information to discriminate the transgenic from parental samples because of their differences in chemical components. This suggests that application of NIR spectroscopy with chemometrics could successfully identify transgenic crops, and it has advantages of being fast, time-saving and low cost compared with molecular methods.
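As an illustration of the ELM classifier that performed best on the full spectra, the following is a minimal Python sketch of the general ELM idea (Huang et al., 2012): random, fixed input-to-hidden weights and output weights obtained by least squares. It is not the authors' MATLAB implementation. The class name TinyELM, the tanh activation, the toy three-band spectra and the small ridge term (used here in place of a pseudoinverse so that a plain Gaussian-elimination solver stays stable) are all assumptions made for the example.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


class TinyELM:
    """Two-class ELM sketch: labels are 0 or 1 (hypothetical coding)."""

    def __init__(self, n_inputs, n_hidden, ridge=1e-3, seed=0):
        rng = random.Random(seed)
        # Input-to-hidden weights and biases are drawn once and never updated.
        self.w = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.ridge = ridge  # assumption: small ridge instead of a pseudoinverse
        self.beta = [0.0] * n_hidden

    def _hidden(self, x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def fit(self, spectra, labels):
        h = [self._hidden(x) for x in spectra]
        t = [1.0 if y == 1 else -1.0 for y in labels]
        k = len(self.w)
        # Regularised normal equations: (H^T H + ridge I) beta = H^T t.
        hth = [[sum(row[i] * row[j] for row in h) + (self.ridge if i == j else 0.0)
                for j in range(k)] for i in range(k)]
        htt = [sum(row[i] * ti for row, ti in zip(h, t)) for i in range(k)]
        self.beta = solve_linear(hth, htt)
        return self

    def predict(self, spectra):
        scores = [sum(hi * bi for hi, bi in zip(self._hidden(x), self.beta))
                  for x in spectra]
        return [1 if s > 0 else 0 for s in scores]


# Toy reflectance values at three bands (hypothetical, well separated).
train_x = [[0.20, 0.30, 0.25], [0.22, 0.28, 0.24], [0.18, 0.31, 0.27],
           [0.80, 0.70, 0.85], [0.78, 0.72, 0.83], [0.82, 0.69, 0.86]]
train_y = [0, 0, 0, 1, 1, 1]
model = TinyELM(n_inputs=3, n_hidden=5).fit(train_x, train_y)
pred = model.predict(train_x)
accuracy = sum(p == y for p, y in zip(pred, train_y)) / len(train_y)
```

Only the output weights are fitted, which is why training reduces to a single linear solve. The number of hidden nodes is the tuning parameter reported for ELM in Table 1.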

Sensitive wavelength selection and classification analysis based on feature wavelengths
Neighboring NIR wavelengths are highly collinear; therefore, effective wavelength methods are applied to determine the contributions of individual wavelengths for identification (Feng et al., 2017). Certain wavelengths with obvious peaks and valleys were selected as sensitive wavelengths. Figure 4 shows the effective wavelengths that were selected by 2nd derivative, PCA-loadings and Bw with the preprocessing method. The number of sensitive wavelengths was reduced to seven for PCA-loading, ten for 2nd derivative and for Bw. The loading line plot for these selection methods showed similar prominent positive peaks at 1125.6, 1167.55, 1413.97, 1444.34 and 1520.78 nm. The band at around 1125 nm belongs to the second overtone of the C-H stretch (Kumaravelu et al., 2017). The peak near 1167 nm is caused by the C-H stretching 2nd overtone of CH3 and -CH2- groups, and that at 1413 nm by the C-H stretching and C-H deformation vibration of CH3 and -CH2- groups, respectively (Schaefer et al., 2013). The peak near 1444 nm is consistent with the N-H stretch (Boyd et al., 2006). Furthermore, a peak near 1520 nm is assigned to N-H stretch vibration (Minami & Iwahashi, 2011). These wavelengths are believed to correspond to NIR spectral bands relevant to maize property changes caused by the transgenic event.
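The idea of picking sensitive wavelengths at peaks and valleys of a 2nd derivative curve can be sketched as follows. This is only an illustration: the 7-point quadratic Savitzky-Golay weights are the standard tabulated ones, but the window used for the derivative in the study is not stated, and the function names, the ranking by absolute value and the synthetic single-band spectrum are assumptions.

```python
import math

# Standard 7-point quadratic Savitzky-Golay weights for the second derivative
# (to be divided by 42 and by the squared sampling step).
SG_D2_WEIGHTS = [5, 0, -3, -4, -3, 0, 5]


def second_derivative(values, step=1.0):
    """Second derivative at the interior points values[3:-3]."""
    half = len(SG_D2_WEIGHTS) // 2
    out = []
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        total = sum(c * v for c, v in zip(SG_D2_WEIGHTS, window))
        out.append(total / (42.0 * step ** 2))
    return out


def sensitive_wavelengths(wavelengths, values, n_keep):
    """Return the n_keep wavelengths with the strongest derivative extrema."""
    half = len(SG_D2_WEIGHTS) // 2
    d2 = second_derivative(values, step=wavelengths[1] - wavelengths[0])
    inner = wavelengths[half:len(wavelengths) - half]
    # A local peak or valley is a point where the slope of the curve changes sign.
    extrema = [i for i in range(1, len(d2) - 1)
               if (d2[i] - d2[i - 1]) * (d2[i + 1] - d2[i]) < 0]
    extrema.sort(key=lambda i: abs(d2[i]), reverse=True)
    return sorted(inner[i] for i in extrema[:n_keep])


# Synthetic reflectance spectrum with one absorption band at 1200 nm
# (hypothetical values on a 10 nm grid).
wavelengths = list(range(1000, 1401, 10))
reflectance = [0.5 - 0.2 * math.exp(-((w - 1200) / 40) ** 2) for w in wavelengths]
strongest = sensitive_wavelengths(wavelengths, reflectance, 1)
top_three = sensitive_wavelengths(wavelengths, reflectance, 3)
```

For this synthetic band the strongest extremum sits at the band centre, with two weaker side lobes on either flank. On real spectra the list would contain one entry per resolved band, and the number kept is a choice left to the analyst.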
Normally, the full spectra can contain hundreds of variables. According to Dai et al. (2015), sensitive wavelengths might be equally or more efficient than full spectra in multivariate analysis. The reduced number of wavelengths was sufficient to characterize most classification tasks. Judicious selection of wavelengths decreases sensitivity to non-linearity, and discarding the uninformative wavelengths can expedite data processing and improve model accuracy and robustness. In the final stage of this study, the actual roles of the sensitive wavelengths selected by the above-mentioned three methods were evaluated. The newly proposed combined discriminate models were compared: PCA-loadings-SIMCA, 2nd derivative-SIMCA, Bw-SIMCA, PCA-loadings-KNN, 2nd derivative-KNN, Bw-KNN, PCA-loadings-NBC, 2nd derivative-NBC, Bw-NBC, PCA-loadings-ELM, 2nd derivative-ELM, Bw-ELM, PCA-loadings-RBFNN, 2nd derivative-RBFNN, and Bw-RBFNN (Table 2). Sensitive wavelength selection algorithms can improve model performance, but some algorithms can reduce the recognition ability of the model. The strongest discriminant model was developed by Bw-ELM with a classification rate of 90.83% for the calibration set and 86.90% for the prediction set. A correct classification rate of 95% was obtained in the calibration set based on the ELM discriminant model, which indicated that these selected peaks had reliable discrimination power for distinguishing transgenic maize plants. The RBFNN model established on sensitive wavelengths had poorer classification accuracy than that established on all wavelengths. The NBC and SIMCA models had poor classification performance, showing correct classification rates in the range of 64.17%-83.00%, although they had better recognition capability when using the sensitive wavelengths for the calibration and prediction sets. The recognition ability of the KNN model established on sensitive wavelengths selected by 2nd derivative was higher than that for all wavelengths, with classification rates of 78.33% and 88.10% for the calibration and prediction sets, respectively. These results showed that the most appropriate classification technique for the classification task was the ELM model, which tended to produce more robust results, although a good performance on the prediction set was also obtained with RBFNN.
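Building a classifier on the selected wavelengths only, as in the combined models above, amounts to dropping all other spectral columns before training. The following Python sketch shows this with a Euclidean-distance KNN and the accuracy measure used in this study (fraction of correctly classified samples). The helper names, the toy four-band spectra, the chosen column indices and K = 3 are hypothetical.

```python
import math
from collections import Counter


def keep_columns(spectra, columns):
    """Keep only the selected wavelength columns of each spectrum."""
    return [[row[c] for c in columns] for row in spectra]


def knn_predict(train_x, train_y, sample, k=3):
    """Majority vote among the k training spectra closest to the sample."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], sample))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of correctly classified samples."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy spectra with four bands; bands 1 and 3 carry the class difference,
# bands 0 and 2 vary independently of the class (hypothetical values).
train_x = [[0.9, 0.10, 0.1, 0.20], [0.1, 0.12, 0.9, 0.22], [0.5, 0.11, 0.5, 0.21],
           [0.9, 0.30, 0.1, 0.40], [0.1, 0.32, 0.9, 0.42], [0.5, 0.31, 0.5, 0.41]]
train_y = ["transgenic"] * 3 + ["parental"] * 3
test_x = [[0.9, 0.11, 0.1, 0.21], [0.1, 0.29, 0.9, 0.39]]
test_y = ["transgenic", "parental"]

selected = [1, 3]  # indices of the sensitive wavelengths (assumed)
train_sel = keep_columns(train_x, selected)
test_sel = keep_columns(test_x, selected)
predicted = [knn_predict(train_sel, train_y, s, k=3) for s in test_sel]
rate = accuracy(predicted, test_y)
```

With this definition, a rate of 90.83% corresponds, for example, to 218 correct assignments out of 240 samples.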
In summary, using NIR spectroscopy allowed us to monitor phenotypic changes in maize plants as a consequence of genetic changes. Five classification methods were tested to determine which provided the best results. First, they were used on the entire spectral bands acquired by the system and then using only the most important selected wavelengths. Thus, in addition to obtaining the best combination of methods to select features and classify genotypes, the performance of the selected wavelengths was evaluated. The results showed an excellent classification by the neural network models ELM and RBFNN. An ELM model using the spectra and feature peaks after appropriate data pretreatment had valuable and robust calibration and prediction abilities with a classification accuracy exceeding 90% on the calibration set. The use of NIR combined with chemometrics for screening transgenic maize in plant breeding programs is a very attractive platform and has potential for wide use in rapid and on-site screening because it is non-invasive, cost-effective and does not require sample pretreatment.

Figure 1. Flowchart of NIR spectral data analysis for discrimination of transgenic maize plants.

Figure 2. Profiles of original spectra (A: transgenic maize, B: non-transgenic maize) and mean spectra of transgenic and non-transgenic maize plants (C). The shaded areas represent the standard deviation at each wavelength.

Figure 4. Distribution of sensitive wavelengths of transgenic and non-transgenic maize leaves selected by 2nd derivative, PCA-loadings and Bw.

Table 1. Discriminant analysis results of transgenic and non-transgenic maize leaves based on entire spectral bands.
Par shows the parameters of the discrimination models: number of PCs for SIMCA, number of selected nearest neighbors for KNN, optimum number of hidden nodes for ELM and spread values for RBFNN.

Table 2. Results of discriminate models using important wavelengths. Model parameters and abbreviations as in Table 1.