A comparison of empirical BLUP with different considerations of residual error variance for genotype evaluation of multi-location trials

The empirical best linear unbiased prediction (eBLUP) is usually based on the assumption that the residual error variance (REV) is homogeneous. This assumption may be unrealistic and may therefore limit the accuracy of genotype evaluations in multi-location trials, where the REV often varies across locations. The objective of this contribution was to investigate the direct implications of using the eBLUP with different considerations of the REV, based on the mixed model, for the evaluation of genotype simple effects (i.e. genotype effects at individual locations). A series of 12 multi-location trials from a rape-breeding program in the north of China, conducted from 2012 to 2014 using a randomized complete block design at each location, were analyzed. According to model fitting statistics, the model with heterogeneous REV was more appropriate than the one with homogeneous REV in all of the trials. Whether the REV differences across locations were accounted for in the analysis procedure influenced the variance estimates of the related random effects and the test of the variance of genotype-location (G-L) interactions. Ignoring REV differences when using the eBLUP could result not only in an inflation or deflation of statistical Type I error rates for pair-wise testing but also in an inaccurate ranking of genotype simple effects for these trials. Therefore, it is suggested that, when the eBLUP is applied to evaluate genotype simple effects in multi-location trials, the heterogeneity of REV should be accounted for by mixed model approaches with an appropriate variance-covariance structure.
Additional keywords: rape; genotype-location interaction; variance structure; mixed model.
Abbreviations used: AIC (Akaike Information Criterion); BIC (Bayesian Information Criterion); BLUE (best linear unbiased estimation); BLUP (best linear unbiased prediction); eBLUP (empirical BLUP); G-E (genotype-environment); G-L (genotype-location); LRT (likelihood-ratio test); REML (restricted maximum likelihood); REV (residual error variance).
Authors’ contributions: Conception and design of the study, and data collection: XH and RH. Analysis and interpretation of data and wrote the paper: XH.
Citation: Zhang, R.; Hu, X. Y. (2019). A comparison of empirical BLUP with different considerations of residual error variance for genotype evaluation of multi-location trials. Spanish Journal of Agricultural Research, Volume 17, Issue 1, e0701. https://doi.org/10.5424/sjar/2019171-13907
Received: 06 Sep 2018. Accepted: 26 Feb 2019.
Copyright © 2019 INIA. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC-by 4.0) License.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Correspondence should be addressed to Xiyuan Hu: xiyuanhu@aliyun.com


Introduction
Best linear unbiased prediction (BLUP), as originally suggested by Henderson (1975) and verified by Harville (1976), has a clearly understood theoretical basis. It is derived such that the correlation between the true and predicted effects is maximized and the mean squared error of prediction is minimized among all linear unbiased predictors, provided the assumed model holds and the parameters of the model are known (Searle et al., 1992; Mrode, 2005). If the parameters are estimated, this optimality no longer holds, but it can be hoped that the performance of the so-called eBLUP (empirical BLUP) is not far from optimal (Piepho, 1998). Many studies (Cornelius et al., 1994; Piepho, 1994, 1998; Piepho & Möhring, 2005) have shown that the predictive accuracy of the eBLUP based on a two-way analysis of variance (ANOVA) model was better than that of least-squares estimators based on the same model and on other models, such as the additive main effects multiplicative interaction (AMMI) models (Piepho, 1998). Therefore, the eBLUP has recently gained increasing acceptance and use for genotype evaluation in plant breeding trials (Smith et al., 2005; Piepho et al., 2008; Kleinknecht et al., 2013).
In the analysis of yield data from multiple-location trials it is common to assume a mixed linear model in which genotypes are fixed while locations and interactions are random (Cochran & Cox, 1957; Shukla, 1972; Steel & Torrie, 1980; Kelly et al., 2007). In this context, the genotype simple effects at given locations can be evaluated using the eBLUP. One advantage of the eBLUP based on the mixed model is its applicability to unbalanced data. Another salient feature is that it makes it possible not only to consider the correlation (or variance-covariance) structure of the genotype-location (G-L) interaction but also to account for residual error variance (REV) heterogeneity between trials conducted at different locations with different levels of precision, and eventually to consider spatial variation of the error terms. In addition, t-tests using the eBLUP can be constructed as a worthwhile alternative method for hypothesis tests about genotype effects within the mixed model framework (Littell et al., 1996, 2006), although BLUP was originally developed for ranking and selection (Robinson, 1991). Some authors have examined the usefulness of eBLUP t-tests based on the mixed model (Forkman & Piepho, 2013; Hu, 2015).
In China, in the traditional analysis of variety trial data with random locations, if there is evidence of variety-location interaction, differences between variety simple effects at specific locations are tested by analyzing each location separately. Such an approach is not only inconsistent with mixed model theory but can also limit the power and precision of inference at each location (Littell et al., 1996). With random locations, the appropriate method of inference for variety simple effects at specific locations is BLUP, which permits location-specific inference using information from the entire trial for all locations simultaneously (Atlin et al., 2000a; Piepho & Möhring, 2005; Leiser et al., 2012; Windhausen et al., 2012; Kleinknecht et al., 2013). In China, and also in some other countries, statistical tests of differences between effects that are random or associated with random effects are in practice usually either not performed or not based on BLUP (Littell et al., 1996; Smith et al., 2005).
The usual application of the eBLUP and of eBLUP t-tests, as considered in most previous studies, assumes that the REV is homogeneous. However, data from multiple-location trials are often characterized by strongly heterogeneous error variation across environments (Piepho, 1995; Casanoves et al., 2005; Hu et al., 2013; Singh et al., 2013). The implication of REV heterogeneity for the evaluation of genotype effects by the eBLUP has not yet been examined. The objective of this contribution was to compare the eBLUP under different considerations of the REV, i.e. the eBLUP based on the mixed model with homogeneous REV and that with heterogeneous REV, with respect to the ranking and pair-wise testing of genotype simple effects, using diverse data sets from realistic multi-location trials, and hence to convince practitioners to use the appropriate procedure for genotype evaluation, in which the variance heterogeneity of the residual errors is accounted for. The study comprises three consecutive steps: (1) fitting the mixed model to each data set using restricted maximum likelihood (REML) under two different considerations of the REV, one assuming homogeneous REV and the other assuming heterogeneous REV, and comparing the appropriateness of the two models; (2) examining the influence of the two considerations on the estimation and testing of the related variances; (3) comparing the eBLUP under the two considerations of the REV with respect to the ranking and difference testing of genotype simple effects.

Trials and data analysis
The data sets used in this study came from multi-location trials in a rape (Brassica napus L.) breeding program in northern China conducted from 2012 to 2014. There were four trial groups (A, B, C and D) in these regions for different production types during each year. Therefore, in total there were 12 data sets (3 years × 4 groups). Some 12-13 genotypes were tested at 10-12 locations each year. The genotypes were totally different each year apart from a control variety. All trials at each location were laid out as a randomized complete block design with three replicates. All trial plots were 20 m², planting density was 2.8×10⁵ plants/ha and yield data were expressed in kilograms of seed per plot. The details of the data set structure are described in Table 1.
Each of the 12 year-group combinations was treated as an independent data set and analyzed separately based on the mixed model, in which genotype effects were fixed, and block, location and G-L interaction effects were random. Treating locations as random has the advantage that broad-space inference about genotype main effects (i.e. genotype effects averaged over the entire population represented by the locations) applies to the entire target population of locations, both observed and unobserved; this approach has been adopted by organizers of variety trials in many countries (Littell et al., 1996). Two procedures were used to evaluate the simple effects of genotypes at specific locations or sets of locations (i.e. location-specific genotype effects). The first was the eBLUP based on a two-way ANOVA mixed model with a homogeneous REV structure across locations, and the second was the eBLUP based on the same model with a heterogeneous REV structure across locations.
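As an illustrative sketch only, the two models compared here can be written as follows. The notation is an assumption introduced for this illustration (the text does not give symbols), and blocks are assumed to be nested within locations, as is usual for a randomized complete block design laid out separately at each location.

```latex
% Assumed notation: y_{ijk} is the yield of genotype i in block k at location j.
\begin{align*}
  y_{ijk} &= \mu + g_i + l_j + b_{k(j)} + (gl)_{ij} + e_{ijk}, \\
  l_j &\sim N(0, \sigma^2_{l}), \quad
  b_{k(j)} \sim N(0, \sigma^2_{b}), \quad
  (gl)_{ij} \sim N(0, \sigma^2_{gl}), \\[4pt]
  \text{homogeneous REV:} \quad & e_{ijk} \sim N(0, \sigma^2_{e}), \\
  \text{heterogeneous REV:} \quad & e_{ijk} \sim N(0, \sigma^2_{e(j)}),
  \qquad j = 1, \dots, J .
\end{align*}
% g_i is the fixed genotype main effect; the genotype simple effect at
% location j is the combination g_i + (gl)_{ij}, predicted by the eBLUP.
```

Under this sketch the homogeneous model is the special case in which all J location-specific residual variances are equal, so the two models differ by J - 1 variance parameters.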
The models and procedures considered here were implemented within the mixed linear model framework using PROC MIXED of the SAS System, version 9.2 (SAS Inst., 2011). The t-test of the eBLUP was constructed using the "ESTIMATE" statement of SAS PROC MIXED. The denominator degrees of freedom of the t-test were determined using the Kenward-Roger method (Kenward & Roger, 1997) as implemented in the SAS System. This approximation builds on the basic idea of Satterthwaite (1941). Relative to the Satterthwaite-type method of Giesbrecht & Burns (1985) and Fai & Cornelius (1996), it additionally applies the asymptotic correction of Kackar & Harville (1984) to the estimated standard errors of model effects, which is relevant for small and/or unbalanced data structures.

Assessment of REV and G-L interaction
The Akaike Information Criterion (AIC) (Oman, 1991) was used to evaluate the models with homogeneous and heterogeneous REV. The smaller the AIC, the better the model performs. Since REML was used, only models with the same fixed-effects structure can be compared. AIC is preferred over the Bayesian Information Criterion (BIC) because the latter has a penalty that involves the sample size in terms of independent observational units, and the concept of "effective" sample size is not well defined for mixed models, where random effects give rise to possibly complex dependencies among observations (Raman et al., 2011). In fact, there is no established definition of BIC for mixed models (Pauler, 1998).
Since the model with homogeneous REV is a reduced (nested) version of the model with heterogeneous REV, we also used the likelihood-ratio test (LRT) to assess the relative goodness of fit of the two models. By the same principle, the LRT was also used to determine whether the variance of the G-L interaction, i.e. the effect of the G-L interaction, was statistically significant.
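The two criteria can be illustrated with a minimal Python sketch. It assumes that the fitting software reports the -2 restricted log-likelihood of each model and that the AIC penalty counts covariance parameters only, which is a common convention under REML. The function names and all numbers below are invented for illustration and are not taken from the trials.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values of the regularized upper incomplete gamma Q(a, half):
    # Q(1/2, h) = erfc(sqrt(h)) for odd df, Q(1, h) = exp(-h) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def aic(minus2_res_loglik, n_cov_params):
    """AIC in 'smaller is better' form, counting covariance parameters only (assumption)."""
    return minus2_res_loglik + 2 * n_cov_params

def lrt(minus2_ll_reduced, minus2_ll_full, df):
    """Likelihood-ratio statistic and chi-square p-value for two nested models."""
    stat = minus2_ll_reduced - minus2_ll_full
    return stat, chi2_sf(stat, df)

# Hypothetical trial with 10 locations (all values invented):
# homogeneous REV: location, block, G-L and one residual variance = 4 parameters;
# heterogeneous REV: location, block, G-L and 10 residual variances = 13 parameters.
aic_hom = aic(1250.0, 4)
aic_het = aic(1190.0, 13)
stat, p_value = lrt(1250.0, 1190.0, df=13 - 4)
```

In this made-up example the heterogeneous model has the smaller AIC and the LRT p-value is far below 0.001, which is the pattern of evidence the criteria above are meant to detect.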

Appropriateness of procedures with different considerations of REV
The AIC value of the model with heterogeneous REV was substantially smaller than that of the model with homogeneous REV for all data sets (Table 2), which implies that the REV of the trials varied across locations and that the analysis procedures with heterogeneous REV were more appropriate than their homogeneous REV versions. This is further supported by the fact that the model with heterogeneous residual variances fitted the data significantly better (p < 0.001 in the LRT, Table 3) than the model with homogeneous residual variances in all of the trials.

Estimate and test of variances under different considerations of REV
As is well known, the eBLUP is based on estimates of variances, and an evaluation of the simple effects of genotypes at specific locations is meaningful only when the variance of the G-L interaction is statistically significant in multi-location trials. Therefore, a comparative investigation of the estimation and testing of the G-L interaction variance under different considerations of the residual error effects may be valuable.
Table 4 gives the percentage differences of the variance estimates under the model with homogeneous REV relative to those under the model with heterogeneous REV. There was some discrepancy in the variance estimates between the two considerations of the REV. This discrepancy was large for the block variance, ranging from -57.0% (for the trial of group B in 2012) to 380.0% (for the trial of group C in 2013), intermediate for the G-L interaction variance, ranging from -0.7% (for the trial of group B in 2014) to 13.1% (for the trial of group A in 2013), and very small for the location variance, ranging from -0.8% (for the trial of group C in 2013) to 1.7% (for the trial of group C in 2012). This suggests that whether the REV variation across locations was considered affected mainly the variance estimates for block and G-L interaction effects and only slightly the variance estimate for location effects.
The p-value of the LRT for the variance of G-L interaction effects was smaller than 0.0001 under both considerations of the REV in all of the considered trials (Table 5), which is extremely small compared to α = 0.01 and means that the variance of G-L interaction effects was highly significant in these trials. However, the χ² values clearly differed between the two considerations of the REV. This suggests that whether the variation of the REV is considered would also influence the test of the variance of G-L interaction effects. Only because the p-values of the LRT were extremely small did this difference not show at the α = 0.01 level in these particular cases.
Note to Table 5: The df of the LRT for the variance of G-L interaction effects should theoretically be 1 in all cases. Because the block variance was estimated as zero in some cases when the G-L interaction effects were dropped from the analysis, the df of the LRT became 2 in these cases.

Ranking of genotypes using eBLUP with different considerations of REV
As shown above, the variance of the G-L interaction was highly significant in all of the trials analyzed in this study. Therefore, evaluations of genotype simple effects at specific locations were necessary. One such evaluation is the ranking of genotypes. As a detailed example of the differences in genotype ranking between the eBLUP with the two considerations of the REV, Table 6 shows the ranking of genotype simple effects at different locations using the eBLUP with homogeneous REV and that with heterogeneous REV for the trial of group B in 2013. There was some rank discrepancy between the two eBLUP versions. For example, at location L1, genotype 12 ranked first by the eBLUP with homogeneous REV and second by the eBLUP with heterogeneous REV; at location L2, genotype 12 ranked fifth by the eBLUP with homogeneous REV and third by the eBLUP with heterogeneous REV. For this trial the proportion of locations with rank discrepancy of the genotype simple effect between the two eBLUP versions reached 60.0% (6 locations out of 10, i.e. locations L1, L2, L4, L6, L8 and L10). At the locations with rank discrepancy of the genotype simple effect, the proportion of genotypes with rank discrepancy between the two eBLUP versions ranged from 33.3% (4 genotypes out of 12, i.e. genotypes 2, 6, 10 and 12 at location L1, and genotypes 2, 3, 7 and 9 at location L8) to 58.3% (7 genotypes out of 12, i.e. genotypes 1, 4, 6, 7, 9, 11 and 12 at location L2, and genotypes 3, 4, 5, 6, 7, 8 and 10 at location L10). For all trials, the proportion of locations and genotypes with rank discrepancy between the two eBLUP versions is summarized in Table 7. It can be seen that there was rank discrepancy between the two eBLUP versions in all of the trials. The proportion of locations with rank discrepancy of the genotype simple effect between the two eBLUP versions ranged from 18.2% (for the trial of group D in 2014) to 100% (for the trial of group A in 2014). At the locations with rank discrepancy of the genotype simple effect, the proportion of genotypes with rank discrepancy between the two eBLUP versions ranged from 15.4% (for the trial of group A in 2012) to 58.3% (for the trial of group B in 2013).
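The rank-discrepancy proportions reported for the ranking comparison can be computed along the following lines. This is a minimal sketch, not the authors' procedure: the function names, the data layout (one list of predicted simple effects per location, in the same genotype order for both eBLUP versions) and the numbers are hypothetical.

```python
def ranks(values):
    """Rank 1 = largest predicted effect; ties are broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def rank_discrepancy(pred_hom, pred_het):
    """Compare genotype rankings of two eBLUP versions location by location.

    pred_hom, pred_het: dict mapping location -> list of predicted simple effects.
    Returns (share of locations with any rank change,
             dict location -> share of genotypes whose rank changed).
    """
    per_location = {}
    for loc in pred_hom:
        r_hom = ranks(pred_hom[loc])
        r_het = ranks(pred_het[loc])
        changed = sum(a != b for a, b in zip(r_hom, r_het))
        per_location[loc] = changed / len(r_hom)
    n_discrepant = sum(share > 0 for share in per_location.values())
    return n_discrepant / len(per_location), per_location

# Invented predictions for 4 genotypes at 2 locations.
pred_hom = {"L1": [5.0, 4.0, 3.0, 2.0], "L2": [1.0, 2.0, 3.0, 4.0]}
pred_het = {"L1": [4.5, 4.8, 3.0, 2.0], "L2": [1.1, 2.1, 3.2, 3.9]}
share_locations, share_genotypes = rank_discrepancy(pred_hom, pred_het)
```

In the invented data, the first two genotypes swap ranks at L1 while the ranking at L2 is unchanged, so half of the locations and, at L1, half of the genotypes show a discrepancy.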

Testing of genotype simple effects using eBLUP with different considerations of REV
We also tested genotype simple effects, given that the variance of the G-L interaction was significant. To illustrate the difference in pair-wise testing of genotype simple effects between the two eBLUP versions, the ratio of the number of genotype pairs with significant (α = 0.05) differences based on the eBLUP with heterogeneous REV to that based on its homogeneous version is given in Table 8. With the exception of L1-L2 locations for the trials of groups A, C and D in 2012, group C in 2013, and groups C and D in 2014, where the number of genotype pairs with significant differences was the same (i.e. the ratio of the number of genotype pairs with significant differences between the two eBLUP versions was unity), there was a substantial discrepancy (i.e. the ratio was not unity) in the number of genotype pairs with significant differences between the two eBLUP versions at most locations for these trials and at all locations for the other six trials. We also examined other statistics, e.g. the estimates of genotype simple effect differences, the standard errors of these estimates, the degrees of freedom, and the t-values of the t-test, for the two eBLUP versions (results not shown). All of these statistics differed between the two eBLUP versions. This suggests that whether the heterogeneity of REV is accounted for when using the eBLUP affects the t-test of genotype simple effects in several respects, which together resulted in the discrepancy in the number of genotype pairs with significant differences between the two eBLUP versions.
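The ratio used in this comparison can be sketched as follows, assuming the pair-wise p-values of the eBLUP t-tests at one location are already available from the fitted models. The function names and p-values are hypothetical.

```python
def count_significant(p_values, alpha=0.05):
    """Number of genotype pairs whose difference is significant at level alpha."""
    return sum(p < alpha for p in p_values)

def significance_ratio(p_het, p_hom, alpha=0.05):
    """Ratio of significant pairs: heterogeneous-REV eBLUP over homogeneous-REV eBLUP.

    Returns None when the homogeneous version has no significant pair,
    because the ratio is then undefined.
    """
    n_het = count_significant(p_het, alpha)
    n_hom = count_significant(p_hom, alpha)
    if n_hom == 0:
        return None
    return n_het / n_hom

# Invented p-values for the same four genotype pairs at one location.
p_het = [0.01, 0.20, 0.04, 0.03]
p_hom = [0.03, 0.04, 0.06, 0.30]
ratio = significance_ratio(p_het, p_hom)
```

A ratio of unity means both versions flag the same number of pairs; it does not by itself show that they flag the same pairs.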

Discussion
In this work, the models with heterogeneous REV fitted the data better than their homogeneous REV versions for all of the considered trials according to both the AIC and the LRT. This further illustrates that heterogeneity of REV across locations generally exists in multi-location trials, that assuming a homogeneous REV is generally not realistic, and that the procedure with heterogeneous REV is therefore the more appropriate choice. Previous work (Hu et al., 2013) has shown that failing to take REV variation across locations into account when using best linear unbiased estimation (BLUE) could result in an inflation or deflation of statistical Type I error rates for the pair-wise difference test of genotype simple effects, depending on the specific location. With the eBLUP in the present study, the ratio of the number of genotype pairs with significant differences between the two eBLUP versions was mostly not unity. Ratios smaller and larger than 1 indicate an inflation and a deflation of statistical Type I error rates (Hu et al., 2013), respectively, for pair-wise testing of genotype simple effects by the eBLUP with homogeneous REV in comparison with that with heterogeneous REV. The reason for this discrepancy is that the error variance varies across locations and the eBLUP with homogeneous REV fails to consider this variation. In addition, the present study showed that whether the heterogeneity of REV was accounted for in the analysis procedure affected the variance estimates of random effects, the test of the variance of G-L interaction effects, and the ranking of genotype simple effects by the eBLUP.
In this context, it can be said that accounting for the heterogeneity of REV is more essential when using the eBLUP than when using BLUE, because with the latter the heterogeneity of REV influences only the pair-wise t-test of genotype simple effects, whereas with the former it influences not only the pair-wise t-test but also the ranking of genotype simple effects.
Mixed model equations developed by Henderson (1975) are a useful tool for analyzing trials with heterogeneous REV (Henderson, 1975; Harville, 1976, 1977; McLean et al., 1991; Marx & Stroup, 1993). Solutions to the mixed model equations give BLUE for fixed effects and BLUP for random effects (Searle et al., 1992). Generally, when a REML-based mixed model package such as MIXED is employed, the user need not worry about how to account for the heterogeneity of REV. It is accounted for automatically on the basis of the mixed model with a heterogeneous structure for the residual error effects. Besides, the mixed model framework also allows analysis procedures to be assessed using likelihood-based criteria (Wolfinger, 1993). This study used the AIC and the LRT to assess the appropriateness of the analysis procedures with different considerations of the REV. This may be preferable in practice to computer-intensive cross-validation (Piepho, 1998). Therefore, the mixed model should be routinely used for genotype evaluation in multi-location trials.
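For reference, the mixed model equations in their standard textbook form are sketched below. The matrix notation is an assumption of this illustration (the text introduces no symbols): X and Z are the design matrices of fixed and random effects, G is the variance-covariance matrix of the random effects and R that of the residual errors.

```latex
% Linear mixed model y = X\beta + Zu + e with Var(u) = G and Var(e) = R.
\begin{equation*}
\begin{bmatrix}
  X^{\top} R^{-1} X & X^{\top} R^{-1} Z \\
  Z^{\top} R^{-1} X & Z^{\top} R^{-1} Z + G^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{u} \end{bmatrix}
=
\begin{bmatrix} X^{\top} R^{-1} y \\ Z^{\top} R^{-1} y \end{bmatrix}
\end{equation*}
% Homogeneous REV:   R = \sigma^2_{e} I_n
% Heterogeneous REV: R = \bigoplus_{j=1}^{J} \sigma^2_{e(j)} I_{n_j},
%                    one residual variance per location j with n_j plots.
```

With heterogeneous REV, observations from imprecise locations receive less weight through the inverse of R, which is how the heterogeneity enters both the BLUE and the BLUP. In practice G and R are replaced by their REML estimates, which gives the empirical versions.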
This paper focused exclusively on the ANOVA-type mixed model, which implies a simple variance-covariance structure for the G-L interaction effects. There are other, more complex structures, e.g. the factor-analytic variance-covariance structure (Piepho, 1998). A complex variance-covariance structure, if viewed from a mixed-model perspective, implies heterogeneity of the variance-covariance of the G-L interaction effects. There are studies on the impact of this heterogeneity on the estimation of genotype effects in multi-location trials (Piepho, 1994, 1998). An analysis procedure that simultaneously accounts for the heterogeneity of the variances of both the G-L effects and the residual error effects, and a simulation study on the precision and efficiency of this procedure, would be worthwhile. This will be the subject of further research. Most studies on the eBLUP focus exclusively on the estimation of genotype simple effects. This paper examined the impact of the heterogeneity of REV not only on the ranking but also on the statistical hypothesis testing of genotype simple effects in multi-location trials. The latter is especially important for the analysis of late-stage variety evaluation trials or some agronomy trials, where the number of varieties or treatments is smaller and hypothesis testing is more relevant. For example, in trials for the commercial release and recommendation of varieties to farmers (e.g. on-farm trials) in China, statistical hypothesis testing of genotypes is in routine use. The trials used in this study are only some examples of these scenarios. It should also be mentioned that the evaluation of genotype main effects is usually one of the important objectives in multi-location trials. Genotype main effects are usually considered fixed and are evaluated using BLUE. For information on the impact of the heterogeneity of REV on the evaluation of genotype main effects, readers are referred to Hu et al. (2013).
Conventionally, hypothesis tests are defined for fixed parameters only. Just as a BLUP is not an estimate, a hypothesis test based on BLUP is not a true test as conventionally defined (Littell et al., 1996). The distribution theory associated with BLUP is not nearly as well understood as that of conventional estimable functions, and there are no exact methods for statistical inference on random effects (Littell et al., 1996). Notwithstanding these limitations, t-tests based on BLUP can be very useful in assessing variety simple effects at specific locations (Littell et al., 1996).
In addition to the yield comparison of genotypes, there is often a question regarding the stability of genotypes in multi-location trials. By assessing the genotype simple effects using the eBLUP, the stability issue can also be addressed using mixed models with random effects for the G-L interaction (Littell et al., 1996, 2006). For information on the impact of the heterogeneity of REV on the evaluation of genotype stability, readers are referred to Hu et al. (2014).
In China, and as shown in this paper, genotypes are modeled as fixed and locations as random. In contrast, in Australia genotypes are generally modeled as random and locations as fixed (Smith et al., 2001, 2005). Which of these choices is more reasonable, in particular whether genotypes should be assumed fixed or random, is still a controversial topic among statisticians. Piepho (1994) showed that the predictive accuracy of the eBLUP based on a two-way ANOVA model differed only slightly depending on whether genotypes, environments, or both were regarded as random, and that the most important assumption was that interactions are random. This paper mainly investigated the properties of the eBLUP of interaction effects. On this basis, the conclusions about the eBLUP from this work are also applicable to the Australian case, because random genotypes and fixed locations also imply random G-L interactions, and realized values of random variables are commonly predicted by BLUPs.
In multi-environment trials, the presence of genotype-environment (G-E) interaction is a constant concern, since the performance of a variety can vary significantly when the G-E interaction effect is pronounced, and since it is difficult to evaluate the differences among genotypes in all environments, which makes the selection process laborious. Thus, the G-E interaction imposes real difficulties on the breeder's work; however, it is also an excellent opportunity to exploit its positive effects through specific recommendations for mega-environments (Annicchiarico & Perenzin, 1994; Annicchiarico & Piano, 2005). This paper has been restricted to the problem of obtaining good estimates of genotype effects in trial environments. Clearly, these estimates, and therefore any recommendations based on them, apply only to the environments under trial, not to 'new' environments. At times, the main interest is in estimates for new environments not under trial. For example, farmers are interested in appropriate estimates of genotype performance in their own fields, which are not exactly the same locations as those of the trials and where G-L interaction may therefore be present. This problem was dealt with by Annicchiarico & Perenzin (1994), Weber & Westermann (1994) and Piepho et al. (1998). Even in the presence of G-E interaction, it is usually required to find stable, high-performing genotypes across environments. In this case, the best that can be done is to estimate variety effects across environments by considering the main effects and treating the different environments as a sample from a target population of environments. Information on this issue can be found in Atlin et al. (2000b). There may be scope to improve predictions by stratifying the target population of environments into ecological zones according to similarity in agroclimatic conditions and production constraints, as in the paper by Kleinknecht et al. (2013). However, each zone would still be represented by a random sample of locations, and estimation would focus on a genotype's zone mean rather than on the location mean.
It may also be worth making a clear distinction between locations and years because G-L interactions are reproducible but genotype-year interactions are not (Annicchiarico et al., 2000, 2006). Predicting G-L-year means is much less meaningful than predicting G-L means across years. On repeatability of G-E interactions and genotype recommendation for the following growing season, readers can refer to relevant literature (Annicchiarico et al., 2000, 2006; Yan & Rajcan, 2003; Annicchiarico & Piano, 2005; Annicchiarico, 2007; Ma & Stützel, 2014).
In summary, we have found heterogeneity of REV in all of the considered rape cultivar trials. Whether the REV differences across locations were accounted for in the analysis procedure influenced the variance estimate needed for the eBLUP, testing of the variance of G-L interaction, and hence influenced the evaluation of genotype simple effects by use of the eBLUP. In application of the eBLUP for evaluation of genotype simple effects, the heterogeneity of REV can be accounted for based on the mixed model with appropriate variance-covariance structure.