In-depth knowledge of the variables affecting production is required in order to predict global production and take decisions in agriculture. Machine learning is a technique used in agricultural planning and precision agriculture. This work (i) studies the effectiveness of machine learning techniques for predicting orchard production and (ii) identifies the variables affecting this production. Data from 964 orchards of lemon, mandarin, and orange in Corrientes, Argentina, are analysed. Graphic and analytical descriptive statistics, correlation coefficients, principal component analysis and Biplot were used. Production was predicted via M5-Prime, a model regression tree constructor which produces a classification based on piecewise linear functions. For all the species studied, the most informative variable was tree age; in mandarin and orange orchards, age was followed by between- and within-row distances; irrigation also affected mandarin production. The performance of M5-Prime in predicting production is adequate, as measured by correlation coefficients (~0.8) and relative mean absolute error (~0.1). These results show that M5-Prime is an appropriate method to classify citrus orchards according to production and that it also allows the most informative variables affecting production by tree to be identified.

**Citation:** Díaz, I.; Mazza, S. M.; Combarro, E. F.; Giménez, L. I.; Gaiad, J. E. (2017). Machine learning applied to the prediction of citrus production. Spanish Journal of Agricultural Research, Volume 15, Issue 2, e0205.

Agriculture involves high levels of production risk, and many variables must be considered to take decisions. In order to define management strategies and development programmes, adequate knowledge of the variables most directly affecting production is essential. Understanding the behaviour of these variables is difficult due to the complexity of their relationships and the number of factors involved. Citrus production is a special challenge due to the significant spatial and temporal variability present in orchards.

In citrus orchards, production is primarily defined by the amount and size of fruits. Production can be affected by both endogenous and exogenous factors. Endogenous factors are, for instance, genetic characteristics of species or varieties, and physiological issues. Among the exogenous factors, environmental and crop conditions, especially irrigation and fertilisation, are highlighted (

Citrus trees’ development is possible between 10°C and 40°C and optimised between 24°C and 32°C. Fruit size and final set depend, among other factors, on the availability of carbohydrates for developing flowers. Thermal influence is very limited in the range of 22°C to 30°C. However, if leaf temperature rises above 32°C, the CO₂ assimilation rate decreases. Thermal influence on growth and competition between vegetative and reproductive development accentuate the problems caused by limited CO₂ fixation, such as alternation of productivity between seasons and reductions in fruit size and final fruit set (

Maximum and average temperatures, reference evapotranspiration, wind speed, and relative humidity are the meteorological variables with the greatest influence on the fresh mass and equatorial diameter of fruits. In citrus orchards growing in temperate climates, autumn rains improve the fruits’ final size and juice content, and reduce the concentration of sugars and free acids. Total annual rainfall between 900 mm and 1200 mm is enough to ensure fruit development. On the other hand, drought periods (even if short) tend to reduce fruit size. When rainfall is lower or dry seasons occur, complementary irrigation is needed (

Many variables must be considered prior to making decisions about the planting framework. Tree vigour and growth habit, as influenced by variety and rootstock, are important, and site quality in terms of climate, soil characteristics, and water availability must be considered. In general, higher density plantings that rapidly develop into a hedgerow appear to be advantageous, especially at the beginning of the trees’ productive life. However, vigorous combinations with more spreading growth habits should be planted with wider spacing (

Machine Learning (ML) is a branch of artificial intelligence that provides methods with the ability to learn from or to make predictions on data. These methods build a model from example inputs in order to make predictions or to take decisions (

In particular, some of these methods have been applied for comprehensive agricultural planning in precision farming (PF) (

Among the huge number of issues related to PF, pest prediction is a task where different ML techniques have been successfully applied. In particular, Bayesian techniques have been adopted (

When the variable to predict is continuous, the ML methods most commonly used are CART (

The model tree technique (see, for example,

Regarding model tree techniques, the strategy to construct the tree is similar for all of them (

As production-predictive tasks require the learned model to predict a numeric value associated with a variable rather than the class the example belongs to, model regression trees are proposed. Hence, this work checks the effectiveness of ML techniques in order to determine the affecting variables and classify citrus orchards according to production. In particular, the predictive mechanism established in this work to characterise the variables involved, and to identify the most important factors affecting citrus production, is based on the M5-Prime method.

The studies were conducted during the 2013 and 2014 seasons, with field information from 964 Citrus orchards in the province of Corrientes, Argentina, located at latitudes 27°S to 31°S and longitudes 57°W to 59°W. Orchard tree canopies belong to several varieties of three species: lemon (

Every orchard was characterised by the following variables: global position (latitude and longitude in degrees, minutes and seconds); annual minimum and maximum average temperatures (°C), annual total rainfall (mm) and annual total frost-free days, defined from the corresponding isolines at the orchard's location; environment, species, variety, age of trees, planting framework (between-row distance, m; within-row distance, m), presence or absence of irrigation (binary) and production by tree (kg/tree).
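As an illustration of how one orchard record with these variables could be held in code, the following Python sketch defines a record type and a helper that converts degrees, minutes and seconds into signed decimal degrees. The class name, field names, and the derived planting density are assumptions made for this example; they are not taken from the study's data files.

```python
from dataclasses import dataclass


def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds to signed decimal degrees.

    Southern and western hemispheres are returned as negative values.
    """
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if hemisphere.upper() in ("S", "W") else value


@dataclass
class OrchardRecord:
    """One orchard, with hypothetical field names for the variables listed above."""
    latitude: float              # decimal degrees, negative = south
    longitude: float             # decimal degrees, negative = west
    min_temp_c: float            # annual minimum average temperature
    max_temp_c: float            # annual maximum average temperature
    rainfall_mm: float           # annual total rainfall
    frost_free_days: int         # annual total frost-free days
    environment: str
    species: str
    variety: str
    tree_age_years: float
    between_row_m: float         # planting framework: distance between rows
    within_row_m: float          # planting framework: distance within rows
    irrigated: bool
    production_kg_per_tree: float

    def trees_per_hectare(self) -> float:
        """Planting density implied by the framework (a derived quantity, assumed here)."""
        return 10_000.0 / (self.between_row_m * self.within_row_m)


# Hypothetical example values, chosen only to exercise the code.
example = OrchardRecord(
    latitude=dms_to_decimal(28, 30, 0, "S"),
    longitude=dms_to_decimal(58, 0, 0, "W"),
    min_temp_c=15.0, max_temp_c=27.0, rainfall_mm=1100.0, frost_free_days=330,
    environment="savanna", species="lemon", variety="Eureka",
    tree_age_years=12.0, between_row_m=7.0, within_row_m=4.0,
    irrigated=False, production_kg_per_tree=80.0,
)
```

Storing coordinates as signed decimal degrees keeps latitude and longitude usable as ordinary numeric attributes.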

Lemon was present in 94 orchards (9.6%), placed at 28°S to 30°S and 57°W to 59°W, in Mesopotamic Park and savanna environments, with annual average temperatures between 18°C and 21°C, total annual rainfall between 1000 mm and 1200 mm, and 320 to 340 frost free days in the year. Two varieties of lemon were found in the studied orchards: 'Eureka' (71% of orchards) and 'Genova' (26% of orchards). In addition, 3% of orchard varieties could not be identified (Unknown). Only 26.5% of the orchards were under irrigation, with similar percentages in all varieties. The characteristics of these orchards are presented in

Mandarin was present in 364 orchards (37.6%), placed at 28°S to 30°S and 57°W to 59°W, with annual average temperatures between 18°C and 21°C, total annual rainfall between 1000 mm and 1200 mm and 320 to 360 frost free days in the year, in mesopotamic park and savanna environments (however, 'Clemenules', 'Murcott', 'Criolla', 'Nova', 'Dancy' and 'Okitsu' varieties appeared in all locations; W Murcott is present only at 59°W, 29°S in mesopotamic park environment and the others only at 57°W, 30°S in savanna environment). Twelve varieties of mandarin were found in the studied orchards: 'Murcott' (24% of orchards), 'Ellendale' (20%), 'Okitsu' (15%), 'Nova' (12%), 'Dancy' (8%), 'Clemenules' (6%), 'Criolla' (5%), 'Encore' (3%), 'Ortanique' (2%), 'Malvacio' (1%), 'Montenegrina' (1%) and 'W Murcott' (1%). In 1% of orchards, the variety could not be identified (Unknown). Irrigation was present in 45% of orchards, with higher percentages in 'Montenegrina', 'W Murcott', 'Clemenules', 'Nova' and 'Murcott'.

Orange was present in 509 orchards (52.8%), placed at 28°S to 30°S and 57°W to 59°W, in mesopotamic park and savanna environments, with annual average temperatures between 18°C and 22°C, total annual rainfall between 1000 mm and 1400 mm, and 320 to 360 frost free days in the year. Fourteen varieties of orange were found in the studied orchards: Valencia late (50% of orchards), Salustiana (8%), Valencia seedless (7%), Washington navel (7%), Delta seedless (5%), Valencia frost (4%), Criolla (3%), Lane late (2%), Navel late (2%), Navelina (2%), Robertson navel (1%), Newhall (0.2%), Hamlin (0.2%) and Westin (0.2%). In 7% of the orchards, the variety could not be identified (Unknown). Irrigation was present in 42.4% of the orchards, with higher percentages in Salustiana, Midknight, Navelina, Robertson Navel, and Newhall. Description of these orchards is presented in

Graphic and analytical descriptive statistical tools were used, and Pearson correlation coefficients (

Based on endogenous (species, varieties, age of trees) and exogenous factors (global position, annual minimum and maximum average temperatures, total rainfall and total frost-free days, environment, planting framework and irrigation) (

M5-Prime is a learner which constructs regression trees producing a classification, based on piece-wise linear functions (

M5-Prime selects the split that maximises the expected error reduction. Once the tree is constructed, a multivariate linear model is computed for the examples at each tree node with standard regression techniques and using only attributes that are referenced by tests or linear models somewhere in the sub-tree under this node. The main characteristics of this method are:

1. Regression tree construction:

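a) Splitting criterion: Maximise SDR, the standard deviation reduction,

$$\mathrm{SDR} = \mathrm{sd}(T) - \sum_{i} \frac{|T_{i}|}{|T|}\,\mathrm{sd}(T_{i}),$$

with $T$ the set of examples that reach the node and $T_{1}$, $T_{2}$, … the subsets resulting from the node split according to the chosen attribute.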

b) Stopping criterion: Standard deviation below a given threshold (small enough) in all nodes.

c) Pruning: Heuristic estimation of absolute error of linear regression models.

with

d) Smoothing is used to compensate discontinuities between the adjacent linear models at the leaves of the pruned tree. The smoothing process uses first the leaf model to compute the predicted value, and then it filters that value along the path back to the root, combining it with the value predicted by the linear model for that node. The modified prediction p’ is computed by

with

2. The value at each leaf is estimated using a linear regression function.

3. At each node, it uses only a subset of the attributes occurring in the sub-tree.
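As an illustration of the splitting and smoothing steps listed above, the following minimal Python sketch computes the standard deviation reduction, SDR = sd(T) − Σ (|Tᵢ|/|T|)·sd(Tᵢ), for candidate binary splits on one numeric attribute, and then applies a smoothing step. It is not the RWeka implementation. The function names, the toy ages and productions, the use of the population standard deviation, and the smoothing form p' = (n·p + k·q)/(n + k) with k = 15 (the form usually given for M5-type trees) are assumptions made for this example.

```python
import statistics


def sdr(parent, subsets):
    """Standard deviation reduction of splitting `parent` into `subsets`."""
    n = len(parent)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in subsets if s)
    return statistics.pstdev(parent) - weighted


def best_threshold(xs, ys):
    """Return (threshold, sdr) for the split `x <= threshold` with the largest SDR.

    Candidate thresholds are midpoints between consecutive distinct values of xs.
    """
    values = sorted(set(xs))
    best = (None, float("-inf"))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = sdr(ys, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best


def smooth(p_below, q_node, n_below, k=15.0):
    """Blend the prediction coming from below with the current node's model.

    p_below: prediction passed up from the child node
    q_node:  prediction of the linear model at the current node
    n_below: number of training examples that reach the child node
    k:       smoothing constant (assumed value)
    """
    return (n_below * p_below + k * q_node) / (n_below + k)


# Toy data (hypothetical): tree age in years and production in kg/tree.
ages = [3, 5, 8, 12, 20, 25]
production = [10, 12, 15, 60, 65, 70]
threshold, gain = best_threshold(ages, production)
```

On this toy data the largest reduction is obtained by separating the three youngest orchards from the three oldest, which is the kind of age-based first split reported for the citrus orchards.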

The experiments were conducted with the RWeka package, using the M5-Prime function with the standard configuration,

The accuracy of this method was studied in terms of root mean square error (RMSE), correlation coefficient (R) and the relative mean absolute error (MAE). RMSE measures the difference between the real and the estimated value and MAE compares the average of the differences between the real and the estimated values to the average of the estimated values (
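The three measures can be sketched in a few lines of Python. The relative MAE below follows the description given above, that is, the mean absolute difference divided by the mean of the estimated values; the function names and the toy numbers are assumptions made for this example.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between real and estimated values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between real and estimated values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def relative_mae(actual, predicted):
    """Mean absolute error divided by the mean of the estimated values."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return mae / (sum(predicted) / len(predicted))


# Toy values (hypothetical), in kg/tree.
real = [100.0, 200.0, 300.0]
estimated = [110.0, 190.0, 300.0]
scores = (rmse(real, estimated), pearson_r(real, estimated), relative_mae(real, estimated))
```

A relative MAE of about 0.1, as reported in the abstract, would mean that the average absolute error is roughly one tenth of the average estimated production.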

All techniques related age to production by tree. However, in Biplot, other variables showed smaller angles with production, indicating stronger association. On the other hand, M5-Prime allows for grouping orchards according to production by tree and highlights age as the best classification variable.

In addition, M5-Prime defines groups primarily based on trees’ age. Minimum and maximum temperatures, despite being below optimum values (

Differences between L1 and L2 were mainly based on differences in the weights associated with the within-row distance (see

Orchards with tree age of over 21 years (L3), with the least dense planting framework, showed the largest production by tree. In this group, the main factor affecting production was irrigation. This can be deduced from the value of the corresponding coefficient in the regression tree and from the fact that 76% of orchards in this group are irrigated. By contrast, fewer than 25% of the orchards in L1 and L2 are irrigated; this difference suggests that irrigation is necessary and improves yield, according to

Differences in regression coefficients with L3 were mostly based on the inclusion of the 'Eureka' variety and latitude degree coefficients. In addition, the weight of irrigation, distance within rows and constant coefficients also influence these differences. Note that latitude and variety (specifically 'Eureka') are not relevant variables for trees with age over 21 years.

According to

Results obtained from M5-Prime indicate that tree age was the most informative variable to classify mandarin production, followed by irrigation and by between- and within-row distances, as shown in

M5-Prime classified mandarin orchards into eight groups. For instance, M1, with tree age of 11.5 years or below, comprised the largest number of orchards, with one of the smallest productions by tree and high variation. For the other seven groups, associated with orchards older than 11.5 years, the most relevant variable was irrigation. Descriptive statistics of production by group are presented in

In mandarin orchards, not all techniques related age to production by tree. Age and production were not significantly associated according to

PCA and Biplot indicated a high association of production by tree with age, irrigation, and within- and between-row distances, matching the variables selected by M5-Prime.

M5-Prime classified orchards into eight groups: M1, with tree age of 11.5 years or below, and seven groups associated with orchards older than 11.5 years. Orchards with tree age under 11.5 years could be considered to be at the beginning of commercial production (according to

For the orchards whose trees were older than 11.5 years, irrigation was an important characteristic. Groups M2, M3, and M4 are irrigated, but their productions were not higher, indicating that annual rainfall between 1000 and 1200 mm could be enough for citrus growth and production (

Age was again an important variable for the groups with no irrigation, with 13.5 years as the split point. For the groups associated with ages below 13.5 years (M5, M6 and M7), the distance between rows was relevant. The most productive orchards, belonging to groups M5 and M6, had ages below 13.5 years and between-row distances below 6.5 m. These results strongly agree with the results of
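The upper levels of the mandarin tree described in this section can be read as a short sequence of tests. The Python sketch below routes an orchard to the set of groups named in the text for each branch; it stops there, because the tests that separate the groups inside a branch (for example, the 6.5 m between-row distance) are not fully specified here. The function name, the handling of values exactly at a split point, and the label used for the branch whose group is not named are assumptions made for this example.

```python
def mandarin_branch(age_years: float, irrigated: bool) -> tuple:
    """Return the candidate groups for a mandarin orchard.

    Split points (11.5 and 13.5 years) are taken from the text; which side
    includes the boundary value is an assumption.
    """
    if age_years <= 11.5:
        # Youngest orchards: a single group.
        return ("M1",)
    if irrigated:
        # Older, irrigated orchards.
        return ("M2", "M3", "M4")
    if age_years < 13.5:
        # Older, non-irrigated orchards below the second age split;
        # distance between rows separates these groups further.
        return ("M5", "M6", "M7")
    # Non-irrigated orchards of 13.5 years or more: group not named in the text.
    return ("not named in the text",)


# Hypothetical orchards: (age in years, irrigated?)
branches = [mandarin_branch(age, irr) for age, irr in [(10, False), (12, True), (12, False), (20, False)]]
```

In a model tree each leaf also carries its own linear regression, so reaching a group determines which linear model produces the predicted production by tree.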

Production by tree was significantly associated, in a positive way, with age (

Orange PCA Biplot is shown in

According to

M5-Prime classifies orchards into five groups. O1, O2, and O3 are groups with tree age below 8.5 years, with the lowest values of production, and the other groups are over this age. Descriptive statistics of production by groups are presented in

Not all techniques related age with production by tree. The

Groups O1, O2, and O3, with tree age of 8.5 years or below (that can be considered by

Thus, M5-Prime is demonstrated to be appropriate to classify citrus orchards and allows for defining more informative, i.e., more relevant, variables affecting tree production. For all the studied species, the most informative variable is tree age; in mandarin and orange orchards, age is followed by between- and within-row distances; irrigation also affects mandarin production.

In this work, the factors affecting sweet orange, lemon, and mandarin production were studied using different techniques. In particular, statistical methods such as correlation coefficient, principal component analysis, and Biplot were employed to identify such factors. In addition, in order to provide a more complete and interpretable point of view, a machine learning technique (known as M5-Prime) was applied.

M5-Prime is demonstrated to be appropriate to classify citrus orchards and allows for defining more informative, i.e., more relevant, variables affecting tree production.

In all species studied, higher production in younger orchards is associated with higher planting densities, mainly through the distance within rows.

Future studies will involve a more thorough investigation into the possibility of using ML techniques for the prediction of citrus yield, comparing the effectiveness and efficiency of several different paradigms and learning methods, such as regression trees, SVR, neural networks… as well as combinations of them with techniques such as bagging, boosting or random forests.

New, complementary variables will also be incorporated, such as those obtained from hyperspectral satellite imagery, which have been already used successfully in Precision Farming problems (