A quantitative multivariate methodology for unsupervised class identification in pistachio ( Pistacia vera L.) plant leaves size

Aim of study: Genetic diversity of pistachio, can be evaluated by using different descriptors, as adopted in international certification systems. Mainly the descriptors are morphological traits as leaf, which represents an important organ for its sensibility to growth conditions during the expansion phase. This study adopted a rapid and quantitative non-hierarchic clustering classification (k-means), to extract size classes basing on the contemporary combination of different morphological traits (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) of a varietal collection composed by 21 pistachio cultivars. Area of study: Worldwide. Material and methods: The unsupervised non-hierarchic clustering technique was adopted to the entire samples of pistachio leaves from k=2 to k=15 for both four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) and three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio). Main results: A classification model only on the three morphological variables (for results of statistical analysis in which the groups resulted to be more separated and different for all the variables), with k= 5 (five groups), was constructed using a non-linear artificial neural network approach. The percentages of bad prediction in both training and testing resulted equal to 0%. The “terminal leaf length” returned the higher impact (44.89%). Research highlights: The contemporary combination of different morphological leaf traits, allowed to create an automatic classification of size classes of great importance for cultivar identification and comparison.


Introduction
Pistachio (Pistacia vera L.) is one of the most popular tree nuts in the world, appreciated for its nutritional value, its health and sensorial attributes and its economic importance (Kashaninejad & Tabil, 2011). It includes about twenty species but only P. vera is edible and worldwide marketable (Fares et al., 2009). The pistachio is native to the Central Asia and was introduced into Mediterranean Europe by the Romans at the beginning of the Christian era (Crane, 1978). Pistachio cultivation extended from its origin to Italy, Spain, and other Mediterranean regions of Southern Europe, North Africa, and the Middle East, as well as to China and, more recently, to the United States and Australia (Hormaza et al., 1998). Nowadays, Iran, the United States, Turkey, and Syria are the main pistachio producers in the world (FAOSTAT, 2017). As reported by Massimo et al. (2020), Italy has a large pistachio production, especially in Bronte (Sicily).
However, as reported by Sheikhi et al. (2019) its production is affected by the undesired physiological characteristics of alternate bearing, shell indehiscence, blank nuts and susceptibility to abiotic stresses and fungal foliar and root diseases. For these reasons, genetic improvement should be an attempt for future breeding to produce superior pistachio cultivars. Generally, plant genetic resources preserved in ex situ gene bank collection could provide material for breeders in the development of cultivars with improved qualities such as increased productivity, adaptability to different agro-climatical and agro-pedological contexts, better resistance to diseases, and higher qualitative, organoleptic and nutritional characteristics (Bacchetta et al., 2015;Gharaghani et al., 2017).
Genetic diversity can be determined by evaluating morphological (Kafkas et al., 2002;Hofer et al., 2014), phenological (Chao et al., 2003;Chatti et al., 2017) and agro-pomological characteristics (Asma & Ozturk, 2005;Scheldeman et al., 2006) as well as determined by the application of DNA markers (Pazouki et al., 2010;Hofer & Peil, 2015). Part of these studies on morphological traits are based on descriptors which have been adopted by the International Union for the Protection of New Varieties of Plants (UPOV, 2020). A descriptors list regarding morphological and carpological traits of pistachio was developed by the International Plant Genetic Resources Institute (IPGRI, 1997). As reported by Antonucci et al. (2012), this document provides an international format producing a universally understood "language" for plant genetic resources data collection assisting, with the standardization of descriptor definitions, both the researcher, for the management and maintenance of the collection, and the users of the plant genetic resources.
Usually, in plant species morphological characterization is evaluated by analyzing leaf, flower and fruit descriptors (Hassoon et al., 2018). Leaves represent an extremely important organs for plant (both trees and herbaceous species) because it is very sensitive to growth conditions during the expansion phase (Bayramzadeh et al., 2008). As consequence, leaf characteristics could effectively be used to classify different species (Lin et al., 1984;Kafkas et al., 2002), and to discriminate among varieties (Chatti et al., 2017). Sabzi et al. (2020) designed an imaging computer vision system to automatically classify different types of tree leaf images. The analysis of morphological leaf traits supplies deep insight into the taxonomy, genetics, biogeography and evolution (Balduzzi et al., 2017), which are parts of the major classification of scientific areas related to a successful conservation of natural ecosystems. When using leaf to discriminate varieties, various methods to quantitatively evaluate botanical shapes have been suggested. The most common methodology is based on elliptic Fourier descriptors (Costa et al., 2011;Sun et al., 2012;Chitwood & Otani, 2017), which was successfully used on leaves (Jensen et al., 2002;Neto et al., 2006;Wu et al., 2007;Chitwood et al., 2014;Kadir, 2015;López-Santos & Page, 2018), leaflets (Furuta et al., 1995;Olsson et al., 2000), kernels (Iwata et al., 2015), roots (Iwata et al., 1998), flowers (Yoshioka et al., 2004), and fruit (Currie et al., 2000;Goto et al., 2005;Antonucci et al., 2012). This method mathematically describes the entire shape of an object by transforming the contour into Fourier coefficients.
For efficient management and effective utilization, germplasm collection can be studied through morphological, biochemical and/or genetic methods (Berthaud, 1997;Badenes et al., 1998;Asma & Ozturk, 2005;Scheldeman et al., 2006;Bassil et al., 2009;Hofer et al., 2014;Bacchetta et al., 2015;Gharaghani et al., 2017). In crops species as well as in pistachio, morphological characterization is time-consuming because a lot of traits must to be registered. To reduce time of analysis, in this study the diversity in pistachio germplasm collection maintained at National Fruit Tree Germplasm Collection based on morphological leaf parameters was analyzed. The aim was to adopt a rapid and quantitative non-hierarchic clustering classification (k-means), to extract size classes basing on the contemporary combination of different morphological traits (i.e., leaf stalk length, terminal leaf length, terminal leaf width, and terminal leaf ratio) of a varietal collection composed by 21 cultivars of pistachio.
In particular, leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio ( Fig. 1) were measured with a digital caliper, on 10 fully developed leaves, from the middle third of current season shoots, on about three to five samples for each cultivar for two years (2018 and 2019).

3
Automatic size classes identification of Pistachio leaves

Cluster analysis
In this study it was adopted the unsupervised non-hierarchic clustering technique (k-means) implemented in the study of Antonucci et al. (2012). Generally, in k-means technique the clusters are represented by centers of mass of their members. The algorithm assigns cluster membership for each data vector to the nearest cluster center. Then, it computes the center of each cluster as the centroid of its member data vectors as equivalent to minimize the sum of distances from each object to its cluster centroid, over all clusters (Zha et al., 2002). Moreover, this algorithm moves objects between clusters until the sum cannot be decreased further and the result is a set of clusters that are as compact and well separated as possible. Using the distances of the points from their cluster center it is possible to determine whether the clusters are compact.  In particular, the intra-cluster distance is the distance between a point and its cluster center meanwhile, the inter-cluster distance, or the distance between clusters, should be as big as possible and it is calculated as the distance between cluster centers and takes the minimum of this value. Only the minimum of this value was taken and since both of these measures determine a good clustering, the ratio between these two measures was calculated and indicated as "validity": The clustering which gives a minimum value for the validity measure is the ideal value of k in the k-means procedure. This measurement was proposed by Ray & Turi (1999) and modified in the work of Antonucci et al. (2012). This procedure, which introduces an iteration (1,000 cycles) to smooth the stochastic attitude of the k-means procedure, was adopted to the entire samples of pistachio leaves from k=2 to k=15 for both four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) and three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio). The coefficients of morphological variables for each sample were clustered (k-means) with the unsupervised k-means clustering technique, attributing the group which follows the mode value for each sample.

Statistical analysis
To visualize the groups (extracted from the cluster analysis) distribution considering both four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) and three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio), principal component analysis (PCA) were carried out. In addition, box plots and Analysis of Variance (ANOVA) were performed to evaluate statistical significance differences among the same groups. All these analyses were performed with the software PAST (v. 2.17; Hammer et al., 2001).

Classification analysis
Basing on the k-means grouping suggestion a classification model has been constructed using an artificial intelligence approach. k-means clustering suggested two different kinds of classification based on four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) or on three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio). Once chosen the best number of variables (basing on the results of the statistical analysis of PCA, box plot and ANOVA) a classification model will be constructed using a non-linear artificial neural network (ANN) approach.
The ANNs were developed basing on an input layer (x-block) to estimate the dummy output layer (groups attribution; y-block) developing a multi-layer perceptron (MLP) structure. The MLP is a feedforward neural network, which is widely used in classification and pattern recognition problems (for the procedure see Mossalam & Arafa, 2017;Proto et al., 2020). From the 400 observations, to avoid overfitting only 320 samples (80%) were used to construct the models. The remaining 80 samples (20%) were then used to test the performance of the models (internal test). The partitioning of the two datasets was optimally chosen with Euclidean distances, based on the algorithm developed by Kennard & Stone (1969), which selects objects without a priori knowledge of a regression model. The percentage of bad prediction on train and test sub-sets were reported. To better visualize the results, some numerical statistical validation measures for the test (i.e., accuracy, sensitivity and specificity) and some powerful classification performances as the receiver operating curve (ROC), the area under the curve (AUC) and the f1-score were extracted. The ROC represents a technique for visualizing, organizing and selecting classifiers based on their performance (Fawcett, 2006), while the AUC represents the degree or measure of separability describing how much the model is capable of distinguishing between classes. The ROC curve was obtained by plotting the true positive rate (TPR) as a function of the false positive rate (FPR), and then the area under curve (AUC) was calculated (Guan et al., 2020). In addition, the confusion matrices of both training and the test of the MLP model were reported.
Moreover, the variable impact neural network analysis was performed to assess the relative importance of each variable (Abdou et al., 2012). This index is similar to the linear regression variable importance in the projection (VIP) scores (Febbi et al., 2015). The ANN analysis has been performed using Palisade Neural Tools 7.6.
The variable impact Δ k for the k-th independent variable, was calculated in the following way (Palisade Knowledge Base, 2020). The training set was considered made of m samples. Each sample in the training set is a row vector x with n columns. The samples were stored in a matrix X having m rows and n columns. As a result, a generic element of that matrix is X . The output y is a column vector with m rows obtained by applying the operator N to the matrix X:

y=NX
(2) In particular, y=NX k is a row number representing the class of the k-th element of the training set and X k is the row vector representing the k-th row of the X matrix. In this study N is a nonlinear operator and can be expressed as the tensor product of several linear and nonlinear operators. It represents indeed the MLP. The first row X 1 of the matrix X was considered and the operator N was applied n times to X 1 choosing each time a different value of X 1 1 among the values of X 1 , the latter being the first column of the matrix X. The 1-case dependent variable impact of the first variable ∆ 1 1 was then defined as: This procedure was repeated for all the n variables over all the training set. The i-case dependent variable impact of the k-variable ∆ was then defined as: Finally, the variable impact of the k-th dependent variable was then obtained by averaging ∆ over all the m training cases:

Cluster analysis
The results of the procedure of validation to find the best number of k-clusters are reported in Fig. 2. The best number of clusters to be used on this dataset was chosen above the second negative peak for both A) four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) and for B) three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio). Considering four variables, the analysis extracted a value of k=6 (six groups), while using three variables k is equal to 5 (five groups) ( Fig. 2A and B respectively).

Statistical analysis
Four morphological variables (6 groups) Figure 3 reports the PCA performed on four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) for the six groups extracted from the cluster analysis. The interactions among the morphological variables and the six groups were highlighted by the convex hulls. The first component (PC1), reported an explained variance equal to 54.8%, and was related with the leaf stalk length, terminal leaf length and terminal leaf width. The second component (PC2) (presenting an explained variance equal to 24.3%) was mainly related with the terminal leaf ratio. Five groups (1, 3, 4, 5 and 6) resulted partially overlapped and related to high values of leaf stalk length, terminal leaf length and terminal leaf width. Meanwhile, the group 2 positioned on the negative side of PC1 associating with high value of terminal leaf ratio and low values of leaf stalk length, terminal leaf length and terminal leaf width. Figure 4 reports the box plots performed on the four morphological variables [i.e., leaf stalk length (A), terminal leaf length (B), terminal leaf width (C) and terminal leaf ratio (D)]. It was possible to observe as only for the terminal leaf ratio the six groups resulted to be all similar. Table 2 reports the results of ANOVA performed on the four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) for the six groups extracted from cluster analysis. Figure 2. Results of the validation procedure to find the best number of k-clusters (highlighted by a circle) for A) four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) and for B) three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio).
Generally, all the six groups statistically differentiated for all the four variables except for the group 1 and 4 for the variable "leaf stalk length", for the groups 3 and 5 for the "terminal leaf length" and the "terminal leaf width". Meanwhile, resulted few statistically significant differences between groups for the variable "terminal leaf ratio". Figure 5 shows the PCA performed on three morphological for the five groups (1, 2, 3, 4 and 5) extracted from the cluster analysis. The interactions among the morphological variables and the five groups were highlighted by the convex hulls. In this case, the five groups resulted well separated on the first axes PC1 (explained variance equal to 67.8%) and in particular with the trend (from low to high values of terminal leaf length and terminal leaf   Different letters in the same row denote significant differences among samples means (p < 0.05). Automatic size classes identification of Pistachio leaves width and from low to high value of terminal leaf ratio) of groups 4, 2, 3, 1 and 5. Figure 6 reports the box plots performed on the three morphological variables. Also here, as in Fig. 4, it was observed that the five groups resulted to be all similar only for the terminal leaf ratio. Table 3 shows the results of ANOVA performed on the three morphological variables for the five groups extracted from cluster analysis. Also in this case, all the five groups presented statistically significant differences for all the variables except for the "terminal leaf ratio".

Classification
It has been chosen to classify with the MLP model only the three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio) since from the results of the statistical analysis (i.e., PCA), the groups extracted from clustering were more separated than in the four morphological variables one with a better statistically significant differences for all the variables (i.e., ANOVA). Table 4 reports these results about the performances of the MLP model (training and test) to predict the classification of the five groups. The best model resulted to be constructed with 6 nodes. The percentages of bad prediction in both training and testing resulted equal to 0% (0% incorrect). In addition, the training time was of 00:19:59. Table 5 reports the confusion matrices for both the training and the test set of the MLP model. The accuracy (number of TPR and FPR divided by total number of cases) resulted equal to 1. To better analyze the model, 10,500 random trials were run. The training set size/total number of samples ratio varied between 0.75 and 0.85 (step of 0.5) and each time 500 trials were run, the dataset was reshuffled. For each trial the ROC curve was calculated. The curves obtained by averaging the results are reported in Figure 7 for each class. The mean AUC value for each class was also calculated and reported in Table 6 together with the relative standard deviations. Generally, the 10,500 AUC values resulted strictly < 1. However, being their distribution highly asymmetric, the sum of the standard deviation and the mean value were > 1 for four classes out of five. terminal leaf length, terminal leaf width and terminal leaf ratio) for the 5 groups [group 1 (red), group 2 (blue), group 3 (purple), group 4 (green) and group 5 (brown)] extracted from the cluster analysis. Figure 6. Box plots performed on the three morphological variables for the five groups extracted from cluster analysis.

Spanish Journal of Agricultural Research
December 2020 • Volume 18 • Issue 4 • e0208 Figure 8 shows the variable impact on the MLP model underlining that the "terminal leaf length" returned the higher impact (44.89%) for the five groups classification. This variable was followed by "terminal leaf width" and the "terminal leaf ratio" (38.97% and 16.14% respectively) which also have a high impact. Different letters in the same row denote significant differences among samples means (p < 0.05). Table 3. Results of analysis of variance (ANOVA; p<0.05) performed on the three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio) for the five groups extracted from cluster analysis.

Number of cases 320
Training time 00:19:59 Number of trials 1,534,909 Bad predictions 0%

Number of cases 80
Bad predictions 0%

Discussion
Generally, the automatic classification of size classes basing on the contemporary combination of different morphological traits (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) could be a valid and efficient instrument for the germplasm collection researches. As reported by Sun et al. (2012) shape, in terms of length, width, volume or ratios is a crucial aspect to identify species, and cultivar authenticity. The IPGRI located particular emphasis to collect, conserve and promote the utilization of the world's plant germplasm (Ayad, 1986). As reported by Chatti et al. (2017) for pistachio, being leaf an important growing organ, it could be used to classify different species and to discriminate among varieties. In particular leaf stalk length, terminal leaf length, width and ratio are highly discriminating quantitative characters which are continuously variable; so, they could be measured in a group of plants and recorded in scales, to assess phenotypes and discriminate among varieties.
Certification systems based on the technical protocols of the UPOV (at international level) and Community Plant Variety Office (CPVO, at European one), identify appropriate characteristics for variety description set out in visual charts (e.g., specific technical protocols or technical protocols for different crops specie) and objective measurable observations against a calibrated linear scale, made on a large varietal collection.
The pistachio morphological characterization follows the IPGRI protocols, which are time consuming because a lot of traits must be separately considered. This required the collection of a high number of descriptors resulting time consuming. This aspect could be by-passed automatically establishing size classes in crops specie. As also reported in the study of Lootens et al. (2013), where chicory roots morphological traits were studied using elliptic Fourier descriptors, finding that it is possible to objectively use the root shape also to characterize varieties.
The importance of automatically establishing size classes of a pistachio germplasm collection basing on different morphological leaf variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) simultaneously and not only separately, was one of the fundamental aspects of this study.
In our study we adopted a similar approach based on different morphological leaf variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) simultaneously and not only separately. The aim was to adopt a rapid and quantitative non-hierarchic clustering classification (k-means), already experimented for the extraction of almond shape classes in the study of Antonucci et al. (2012), in combination with the use of an algorithm of the artificial neural network (MLP). Moreover, an automatic leaves image selective classification system was presented by Arribas et al. (2011) to discriminate among sunflower crops using neural networks.
In particular, when the four morphological variables (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) were considered, the cluster analysis extracted six weakly differentiated groups; meanwhile considering the three morphological variables (i.e., terminal leaf length, terminal leaf width and terminal leaf ratio) the analysis extracted five groups more statistically differentiated. These results indicated that these three morphological variables can be used to classify groups of leaf size classes with the MLP. This method could be generalized for germplasm collections of different fruits and morphological traits, just taking into consideration a very important variable, i.e. the occurrence of leaf variations under different ecological conditions and environmental factors which could induce structural variations (Belhadj et al., 2007). Quantifying the shape in its multivariate and multidimensional complexity is a very important aspect for a quality grading in the industry (Sun et al., 2012;Lootens et al.,  2013). In addition, the selection of suitable pistachio phenotypes, basing on specific morphological traits which reflect different environmental and soil conditions and diseases, are important for increasing yield efficiency and the property of this important crop (Karimi et al., 2009). In this study, the contemporary combination of different morphological leaf traits (i.e., leaf stalk length, terminal leaf length, terminal leaf width and terminal leaf ratio) recorded on the pistachio germplasm collection, allowed to create an automatic classification of size classes of great importance for cultivar identification and comparison. The proposed quantitative methodology is rapid, non-hierarchic and provides k-clusters successfully used for pistachio leaf size classification, representing a practical efficient instrument in many different research fields, such as genetics and agronomy. At k=5 (three morphological variables) the system performed better than k=6 (four morphological variables), because the five groups extracted from cluster analysis resulted well separated and presented statistically significant differences for all the variables except for the "terminal leaf ratio".