Determination of soluble solids content in Prunus avium by Vis/NIR equipment using linear and non-linear regression methods

Aim of study: To develop models to determine soluble solids content (SSC) in cherries by means of Vis/NIR spectroscopy. Area of study: The Autonomous Community of Aragón (Spain). Material and methods: Vis/NIR spectroscopy was applied to Prunus avium fruit ‘Chelan’ (n=360) to predict total SSC using the range 400-2420 nm. Linear (PLS) and non-linear (LS-SVM) regression methods were applied to establish prediction models. Main results: The two regression methods obtained similar results (R²cv = 0.97 and R²cv = 0.98, respectively). The range 700-1060 nm gave the best results for predicting SSC in different seasons. Forty variables chosen by the variable selection method achieved an R²cv of 0.97, similar to that of the full range. Research highlights: The development of this methodology is of great interest to the fruit sector in the area, facilitating the harvest for future seasons. Further work is needed on the development of the NIRS methodology and on new calibration equations for other varieties of cherry and other species.


Introduction
The cherry tree (Prunus avium L.) is one of the leading stone fruit species worldwide (Bujdosó and Hrotkó, 2017) and, in Spain alone, production reaches 114,433 tons (FAOSTAT 2017; http://www.fao.org/home/en/). One of the main cherry-producing areas in Spain is the Ebro Valley, which includes Aragon and Catalonia. Since cherries are not climacteric fruits and thus do not ripen once picked from the tree, they should be harvested once they achieve the desired physico-chemical and organoleptic characteristics. This characteristic allows their sale and consumption for up to 7 days after harvest. As the cherry ripens on the tree, the soluble solids content (SSC) increases, while the acidity and firmness decrease. One of the indexes most commonly used to determine the optimal time of cherry picking is the SSC. The increase in the SSC, due to starch degradation, has traditionally been determined destructively through refractometry. As quality is increasingly demanded by the consumer, the fruit and vegetable industry requires more effective non-destructive quality-control systems. Optical techniques for the study of fruit maturity have been developed since the 1990s. One example is near-infrared spectroscopy (NIRS), which offers a number of advantages over previous techniques: it avoids the destruction of the fruit, provides faster in situ measurement, reduces cost, and determines several quality parameters in a single measurement. This technique consists of analysing the behaviour of a light beam incident on the surface of a test sample: part of the incident light undergoes specular reflection and is responsible for the gloss; another part is absorbed selectively by the pigments; the rest of the incident light is diffusely reflected by the sample, producing the Visible/Near Infrared Reflectance (Vis/NIR) spectra. In the NIRS technique, these Vis/NIR spectra are analysed and the desired information is gathered.
Near-infrared spectroscopy has been widely used to analyse a large variety of fruits and vegetables, such as apples (Torres et al., 2016), pears, tomatoes (Tiwari et al., 2013), avocados (Clark et al., 2003), oranges (Ncama et al., 2017), and cherries (Escribano et al., 2017).
The current trend is to look for non-destructive maturity control methods that compare the Vis/NIR spectral information of the samples with reference values obtained with destructive techniques. In this way, statistical models are created to predict the desired quality parameter. These models are constructed using linear multivariate calibration methods, such as multiple linear regression (MLR; Jha et al., 2014), principal components regression (PCR; Hamshidi et al., 2012), or partial least squares (PLS; Nicolaï et al., 2006). In many cases the relationship between spectra and reference values may not be strictly linear, so non-linear methods such as least-squares support vector machines (LS-SVM; Chauchard et al., 2004) and artificial neural networks (ANN; Pérez-Marín et al., 2007; Shao et al., 2008) have also been proposed in a number of works.
LS-SVM is a regression model that has been used in recent years to predict parameters related to fruit ripening and other chemical and physical properties. Previous research proved the potential of this non-linear regression model for several quantitative applications in agro-food products (Altieri et al., 2017; Zhang et al., 2019). The LS-SVM model was developed with the radial basis function (RBF) kernel. Additionally, grid search and cross-validation (LS-SVMLab25) were used to find the optimal combination of the gamma (γ) and sigma (σ) hyper-parameters of the model; gamma balances model performance against model complexity, while sigma is proportional to the width of the RBF kernel.
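As a rough illustration of the LS-SVM regression described above, the following dependency-free Python sketch trains an LS-SVM with an RBF kernel by solving the usual linear system in the bias and the support values. It is not the LS-SVMLab implementation used in this work: all function names are hypothetical, the kernel is assumed to be exp(-||x - z||² / (2σ²)) (toolboxes differ in how σ is scaled), and the grid search over γ and σ is omitted.

```python
import math


def rbf_kernel(x, z, sigma):
    """RBF kernel, assumed here as exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lssvm_fit(X, y, gamma, sigma):
    """Return (bias, alphas) solving [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # regularisation term on the diagonal
        A.append(row)
    solution = solve_linear_system(A, [0.0] + list(y))
    return solution[0], solution[1:]


def lssvm_predict(X_train, bias, alphas, sigma, x_new):
    """Predict one sample: bias + sum_i alpha_i * K(x_new, x_i)."""
    return bias + sum(a * rbf_kernel(x_new, xi, sigma)
                      for a, xi in zip(alphas, X_train))
```

A larger γ penalises training errors more heavily (a tighter fit), while a larger σ widens the kernel and smooths the fitted function; in practice both would be tuned by grid search with cross-validation, as described in the text.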
PLS is a regression method often used in agro-food applications to construct prediction models for reference parameters established by destructive techniques. Generally, PLS is used to model the relationship between variable X (spectra) and variable Y (physicochemical parameter of interest), and it has already been used successfully in a variety of studies on the prediction of indexes in fruits and vegetables (Lafuente et al., 2014; Li et al., 2016; Altieri et al., 2017; Tilahun et al., 2018).
The PLS regression method is probably the most widely used technique for dealing with multivariate data in chemometrics. It is particularly effective for building a predictive model of variables (y) when there are many factors (x) and these are highly collinear. The X matrix is substituted by a matrix of latent variables (LVs), which are themselves linear combinations of the x vectors that maximize the covariance with the Y matrix, using least squares to fit both the latent variables and the regression coefficients, with a high coefficient of determination (r²). The main characteristic of this approach is that it seeks the maximum correlation between the spectra (x variables) and the characteristic to be determined (y variables). This technique has been used to forecast quality indexes in fruits and vegetables (Nicolaï et al., 2006; Sánchez et al., 2012; Ribera-Fonseca et al., 2016; Tilahun et al., 2018). L-fold cross-validation (L = 20) was used to calculate the optimum number of latent variables and to avoid overfitting in the development of the calibration models (Xiaobo et al., 2007; Zhang et al., 2013; Altieri et al., 2017).
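The latent-variable construction and the cross-validated choice of the number of LVs can be sketched as follows. This is a minimal pure-Python version of the NIPALS algorithm for a single response (PLS1), written only for illustration; it is not the MATLAB code used in this work. The function names are hypothetical, the folds are assumed to be interleaved, and the cross-validation error is assumed to be the root mean squared error over all held-out samples.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_lv):
    """Fit PLS1 by NIPALS; returns (x_mean, y_mean, [(w, p, q), ...])."""
    n, n_vars = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(n_vars)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(n_vars)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                     # centred y
    components = []
    for _ in range(n_lv):
        # Weight vector: direction in X with maximum covariance with y
        w = [dot([E[i][j] for i in range(n)], f) for j in range(n_vars)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left in X that covaries with y
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(n_vars)]
        q = dot(f, t) / tt
        # Deflate X and y before extracting the next latent variable
        E = [[E[i][j] - t[i] * p[j] for j in range(n_vars)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x, n_lv=None):
    """Predict one sample using the first n_lv latent variables (all if None)."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components[:n_lv]:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


def secv_by_n_lv(X, y, max_lv, n_folds):
    """Cross-validation error for 1..max_lv latent variables."""
    n = len(X)
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    press = [0.0] * max_lv
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = pls1_fit([X[i] for i in train_idx],
                         [y[i] for i in train_idx], max_lv)
        for i in test_idx:
            for a in range(max_lv):
                press[a] += (y[i] - pls1_predict(model, X[i], a + 1)) ** 2
    return [math.sqrt(value / n) for value in press]
```

With the settings described in the text, `n_folds` would be 20, and the number of LVs would typically be chosen at or near the minimum of the returned error curve.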
Normally, multivariate regression methods make use of all the variables of the spectrum when building the calibration models. Applying these methods partially limits the impact of problems such as collinearity, band overlaps, and interactions. However, variables that are collinear or that do not contain relevant information may impair the construction of effective models (Xiaobo et al., 2010). Since some variables provide useful information whereas others do not, the prior selection of a small number of variables is necessary to achieve a better and simpler calibration.
The aim of this work was to develop models to determine SSC in cherries by means of Vis/NIR spectroscopy. The calibration models were built using the PLS and LS-SVM regression methods. In addition, we proposed a sub-band selection of variables in order to detect the most influential wavelength interval for the calibration model, and to obtain more stable and simpler models.

Reference data
The total SSC was determined using a digital refractometer (PR-101 model, ATAGO Co., Tokyo, Japan). The refractive index accuracy was ± 0.2 and the °Brix (%) range was 0-53%, with automatic temperature compensation. Two SSC values were recorded for each sample. First, a piece of peel was removed and a piece of flesh was extracted; the flesh tissue was then squeezed to extract the juice. The measurement was performed on a drop of cherry juice, running the analysis twice. The same experienced user performed all SSC determinations.

Chemometric data treatment
The calibration equations were formulated using two methods of multivariate analysis: PLS and LS-SVM. The software used was MATLAB R2014a (The MathWorks, Natick, USA), with its corresponding programming.
Different pretreatments were tested on the study spectra before establishing the calibration models: standard normal variate (SNV), multiplicative scatter correction (MSC), first- and second-order derivatives, and normalization. Finally, normalization was applied to each of the variables of the spectrum, scaling each to zero mean and unit standard deviation (SD). This type of normalization was applied individually to each variable considered, whether spectral or otherwise (Rossi et al., 2006).
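The per-variable normalization described above (zero mean, unit SD), together with SNV, one of the pretreatments that was tested, can be sketched in Python as follows. The function names are hypothetical, and the use of the sample SD (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics


def autoscale_fit(spectra):
    """Per-variable mean and sample SD from calibration spectra (rows = samples)."""
    columns = list(zip(*spectra))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return means, sds


def autoscale_apply(spectra, means, sds):
    """Scale each variable to zero mean and unit SD; constant variables become 0."""
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, sds)]
            for row in spectra]


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    m = statistics.fmean(spectrum)
    s = statistics.stdev(spectrum)
    return [(v - m) / s for v in spectrum]
```

Note the difference in direction: autoscaling works column-wise (one mean and SD per wavelength, estimated on the calibration set and reused for new samples), whereas SNV works row-wise on each spectrum independently.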

Fruit sample and measurements
A total of 360 cherries of Prunus avium cv. 'Chelan' cultivated on a commercial farm were collected in three different years: Season 1 (year 2011), Season 2 (year 2012) and Season 3 (year 2016). The farm is located at the boundary of La Almunia de Doña Godina (Zaragoza, Spain) (UTM: 41.500581, -1.324072). The trees were planted in the year 2000, on a planting grid of 5 × 3 m. Drip irrigation was applied at a rate of 25,000 L ha⁻¹ h⁻¹. In the first season, the cherries were collected weekly during the harvest period (May-June), at a rate of 50 cherries/week. In the second year, 35 samples were harvested per collection day, every three days for two weeks (May-June). In 2016 (June), 78 cherries were collected on optimal harvest dates (i.e., when the ripeness parameters of the fruit, SSC and firmness, were within the commercialization range). All samples were immediately transported to the analysis laboratory and underwent near-infrared spectroscopy. After collecting the Vis/NIR spectra, the SSC was determined for each fruit at the exact same points used for NIR analysis.

NIR analysis
The equipment used was a QualitySpec Pro 2600 modular reflectance instrument (Analytical Spectral Devices, Inc., Colorado, USA) (Figure 1). Fast-scanning spectrophotometry was performed with a measuring range of 350-2500 nm, a spectral resolution of 1 nm and a tungsten halogen lamp (12V/45W) as light source. The detection system consists of a monochromator with a double InGaAs detector. The scanning speed was 10 scans/sec. The light energy was collected through a bundle of specially formulated optical fibres. The fibre-optic cable has a conical view subtending a full angle of approximately 25 degrees. The spectrometer was calibrated using dark and white spectral energy (Alamar et al., 2007).
The equipment measures the reflectance spectra in units of optical density (OD = log(1/R), where R is the reflectance). Two spectral measurements were made per fruit at two opposing fixed positions on the equator of the fruit, and the mean of the two spectra was used for the calibration processes. The initial and final parts of the spectra were removed to reduce noise, so the working range was 400-2420 nm. For model calibration, the samples were divided into two subsets: calibration and validation groups. The calibration group was used to develop the calibration model and included data from fruit samples belonging to the first and second seasons (years 2011 and 2012). The validation group included data from fruit samples belonging only to year 2016. The samples were divided according to the ratio 2:1.
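The spectral handling described above (conversion to optical density, averaging the two readings per fruit, and trimming to the working range) can be sketched as follows. The function names are hypothetical, and the base-10 logarithm is an assumption based on the usual definition of optical density; the text only states log(1/Reflectance).

```python
import math


def reflectance_to_od(reflectance):
    """Convert reflectance values (0 < R <= 1) to optical density, log10(1/R)."""
    return [math.log10(1.0 / r) for r in reflectance]


def mean_spectrum(spec_a, spec_b):
    """Average the two spectra taken at opposing positions on the same fruit."""
    return [(a + b) / 2.0 for a, b in zip(spec_a, spec_b)]


def trim_range(wavelengths, spectrum, low=400, high=2420):
    """Keep only the points whose wavelength (nm) lies within [low, high]."""
    kept = [(w, v) for w, v in zip(wavelengths, spectrum) if low <= w <= high]
    return [w for w, _ in kept], [v for _, v in kept]
```

With a 1 nm sampling step over 350-2500 nm, trimming to 400-2420 nm leaves 2021 variables per spectrum.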

Performance metrics
For both PLS and LS-SVM methods, the calibration models were tested to predict the SSC of the samples within the validation set. The best calibration models were selected based on the highest coefficient of determination for cross-validation (R²cv), together with the lowest standard error of cross-validation (SECV) (Williams, 2001).
Additionally, the residual predictive deviation (RPD), calculated as the ratio of the SD of the reference data to the SECV (Williams, 2001), was used. This statistic standardizes the SECV, facilitating the comparison of results obtained with sets of different means (Williams, 2001). Williams (2001) points out that RPD values between 3 and 5 indicate a good Vis/NIR prediction efficiency.
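The statistics defined above can be computed along the following lines. This is only an illustrative Python sketch: the function names are hypothetical, the error term is taken as the root mean squared difference between reference and cross-validated predictions (some SECV definitions also subtract the bias), and the coefficient of determination is assumed to be 1 - SS_res/SS_tot rather than a squared correlation. The numbers in the usage note are made up for illustration and are not data from this study.

```python
import math
import statistics


def rmse(reference, predicted):
    """Root mean squared error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_ref = statistics.fmean(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


def rpd(reference, predicted):
    """Residual predictive deviation: SD of the reference data / prediction error."""
    return statistics.stdev(reference) / rmse(reference, predicted)
```

For example, with reference values `[10, 12, 14, 16, 18]` °Brix and predictions `[11, 11, 14, 17, 17]`, the error is about 0.89 °Brix, the coefficient of determination is 0.90 and the RPD is about 3.54, which would fall within the 3-5 range that Williams (2001) associates with good prediction efficiency.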
With respect to the validation, the effect of the different settings on the performance of the model was evaluated by comparing the root mean square error of prediction (RMSEP) and the coefficient of determination for external validation (r²p).
Figure 2 shows examples of cherry absorption spectra obtained with the Vis/NIR equipment. As can be observed, within the visible range, the absorption peak of chlorophyll (≈675 nm) decreases as the cherry ripens. In the infrared range (700-2000 nm), several bands with higher absorption appear at 980, 1470, and 1940 nm. The absorption peak around 980 nm may be associated with the second vibrational overtones of O-H bond stretching associated with water absorption (Shuxiang et al., 2016). The peaks around 1470 nm and 1940 nm are related to water absorption bands (Williams, 2001). These peaks are characteristic of sugar absorption, although the latter band was also strongly influenced by the high water content. Spectra with similar bands for cherries were reported by Lu (2001) and Escribano et al. (2017).
Table 1 shows a description of the samples included in the calibration and validation groups. The SSC range of the samples in the validation group lies within the SSC range of the samples in the calibration group, which helps improve the accuracy of the prediction model. The values of standard deviation and coefficient of variation indicate high variability in each group.

Calibration models
Different spectral ranges were tested to develop calibration models: the full spectrum, 400-700 nm, 700-1060 nm, and 700-2420 nm. In terms of R²cv and SECV, the spectral range of 700-1060 nm provided the best results. The selection of this specific spectral range is in agreement with previous studies (Carlini et al., 2000; Zude et al., 2006; Nicolaï et al., 2008; Travers et al., 2014; Escribano et al., 2017).
Very similar performance was obtained with both regression models (PLS and LS-SVM) when determining SSC (R²cv = 0.97 vs 0.98 and SECV = 0.86 °Brix vs 1.03 °Brix, respectively) (Table 2). The RPD values ranged between 4.5 and 5.5. R²cv values were very similar, whereas SECV and RPD were better for the PLS method: its SECV was 0.86 °Brix, lower than for the LS-SVM method, and its RPD was one point above that of LS-SVM, showing a better calibration for PLS. These results are consistent with studies such as Carlini et al. (2000), who obtained models with high accuracy (R²c = 0.97; standard error of calibration, SEC = 0.49). Similarly, Lu (2001) developed very robust models for two cherry varieties, with Rc ranging from 0.83 to 0.94. Compared with these studies, we recorded improved R²cv values. However, the fact that different cherry varieties ('Hedelfinger' and 'Sam') were used in these studies might also contribute to the differences found.
The correlation coefficients were used to weigh the sensitivity of each wavelength with respect to the fruit quality parameter SSC. The sensitive wavelengths identified fell within a single region, 940-980 nm. Model performance was evaluated using these 40 wavelengths (r²p = 0.88). Similar results were reported by Qing et al. (2007), Travers et al. (2014) and Escribano et al. (2017) with a similar range of variables.
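One simple way to rank wavelengths by correlation, in the spirit of the selection described above, is sketched below in Python. The text does not specify exactly which correlation coefficients were used, so this sketch assumes the Pearson correlation between the absorbance at each wavelength and the reference SSC; the function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0.0 or var_v == 0.0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def select_wavelengths(wavelengths, spectra, ssc, n_keep):
    """Return the n_keep wavelengths with the largest |correlation| to SSC, sorted."""
    columns = zip(*spectra)  # one column of absorbances per wavelength
    scores = [(abs(pearson(col, ssc)), w) for w, col in zip(wavelengths, columns)]
    top = sorted(scores, reverse=True)[:n_keep]
    return sorted(w for _, w in top)
```

Calling `select_wavelengths(wavelengths, spectra, ssc, 40)` on the calibration set would return a 40-variable subset, which can then be passed to the regression step in place of the full spectrum.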

Validation
Validation is necessary to ensure an independent measurement of the precision of the calibration models through RMSEP and r²p. Figure 3 shows the prediction statistics for SSC of the set of cherries from the third season using Vis/NIR calibrations.
The predictive ability of the different models was validated by RMSEP and r²p, using validation data from a different season. The validation statistics were similar for both regression methods, though somewhat better for the linear method (PLS) when estimating the total SSC. RMSEP was lower for the PLS model than for LS-SVM (1.15 °Brix and 1.27 °Brix, respectively), and RPDv was higher (2.78 for PLS vs 2.52 for LS-SVM), which translates into greater prediction accuracy (Fig. 3) and points to a linear relationship between SSC and spectra. A similar study (Escribano et al., 2017) analysing the same cherry variety ('Chelan') developed a model that reached RMSEP = 1.01 °Brix, r²p = 0.69 and RPDv = 1.28. The difference in results could be due to the use of a different selection of variables (729-975 nm). Including noisy regions of the spectra may reduce the accuracy and robustness of the predictions.
In conclusion, NIRS technology using different regression strategies provided highly effective models for the prediction of SSC in 'Chelan' cherry. The development of this methodology is of great interest to the fruit sector in the area, facilitating the harvest for future seasons. In this study we developed predictive models to determine SSC in Prunus avium cv. 'Chelan' via NIR spectroscopy using linear (PLS) and non-linear (LS-SVM) regression methods. The non-linear regression method (LS-SVM) performed similarly to PLS. An important outcome of this work is the use of data from samples belonging to a different harvest season as the validation set; the results show that an effective prediction of SSC can be attained using calibration models built with data from previous seasons. Finally, the selection of variables (940-980 nm) reduced the number of wavelengths to ~40, with results similar to those obtained with the entire interval. This enables the development of much simpler equipment for determining the maturity index of the cherry. Further work is needed on the development of the NIRS methodology and on new calibration equations for other varieties of cherry and other species.