Comparison of logistic regression and growth function models for the analysis of the incidence of virus infection

A logistic regression model was compared to logistic, Gompertz and log-logistic growth functions for analyzing a data set describing the incidence of Alfalfa mosaic virus infection in lucerne fields aged from one to five years and located in three different ecological areas of the Ebro Valley, Northeast Spain. Models were fitted in the form of generalized linear models. None of them explained the high variability of the field data well, although they were useful for analyzing the interdependence among epidemiological factors associated with estimated parameters in the models. The logistic regression model proved more sensitive than classical growth function models in detecting significant differences in parameters such as the rate of incidence increase with age of lucerne field or the initial amount of disease, and in detecting differences associated with explanatory variables such as the ecological area. Results indicate that logistic regression may be a method well suited to statistical analyses in plant epidemiology.

Additional key words: epidemiology, Gompertz, logistic, log-logistic, logistic regression, model.

population of host plants, and may be considered as the epidemic «signature» in the sense that it integrates all host, pathogen and environmental factors occurring during the epidemic (Campbell and Madden, 1990), which determine the final amount of disease. Different mathematical models have been used for the analysis and prediction of DPCs, including those derived from population growth functions (Campbell and Madden, 1990) or theoretical ecology and epidemiology (Segarra et al., 2001; Gilligan, 2002), and those based on computer simulations (Lannou et al., 1994; Bertschinger, 1997).
Comparative analyses of epidemics by model fitting may help to identify which factors are primary determinants of the model parameters describing the epidemics, such as the rate of disease change or the initial and final amount of disease. Among the simplest growth models, the most widely used in plant epidemiology are the logistic, the Gompertz and the log-logistic. These models are usually fitted to data by least-squares regression of their linearized form. Logistic regression represents an alternative fitting method for the logistic transformation, in which linear regression is performed by the maximum-likelihood method. In fact, least-squares regression may be considered a particular case of maximum-likelihood regression when the dependent variable follows a normal distribution (Sokal and Rohlf, 1995). Thus, logistic regression should be a more general and better approach for fitting bivariate, binomially distributed variables, such as count data of disease infection. Whereas logistic regression has been extensively used in case-control analysis in clinical epidemiology (Petrie and Sabin, 2000; Agresti, 2002), it is infrequent in plant epidemiology studies (Mila et al., 2003; Musaka et al., 2003; Weiland et al., 2003; Thebaud et al., 2006; Harikrishnan and Del Río, 2007).
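To make the maximum-likelihood alternative concrete, the following Python sketch fits logit(p) = y0 + r · x to binomial counts by Newton-Raphson iteration. It is a minimal illustration only: the counts, the function name `fit_logistic` and the fixed number of iterations are assumptions, not the data or software used in this study.

```python
import math

def fit_logistic(x, infected, sampled, iterations=25):
    """Maximum-likelihood fit of logit(p) = y0 + r * x to binomial counts.

    Uses Newton-Raphson steps on the binomial log-likelihood;
    returns the estimates (y0, r).
    """
    y0, r = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # score vector (gradient of log-likelihood)
        h00 = h01 = h11 = 0.0    # Fisher information matrix (2 x 2)
        for xi, yi, ni in zip(x, infected, sampled):
            p = 1.0 / (1.0 + math.exp(-(y0 + r * xi)))
            w = ni * p * (1.0 - p)
            g0 += yi - ni * p
            g1 += (yi - ni * p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2 x 2 system (information * step = score) by Cramer's rule.
        det = h00 * h11 - h01 * h01
        y0 += (h11 * g0 - h01 * g1) / det
        r += (h00 * g1 - h01 * g0) / det
    return y0, r

# Hypothetical counts: infected plants out of 100 sampled per field age.
ages = [1, 2, 3, 4, 5]
sampled = [100, 100, 100, 100, 100]
infected = [5, 12, 30, 55, 70]

y0_hat, r_hat = fit_logistic(ages, infected, sampled)
print(f"intercept Y0 = {y0_hat:.3f}, rate r = {r_hat:.3f}")
```

Unlike a least-squares fit of the logit-transformed frequencies, this fit works on the counts themselves, so fields with incidence of 0 or 100% do not need to be excluded or adjusted before transformation.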
In this paper, we discuss the results of fitting logistic, Gompertz and log-logistic models, or a logistic regression model, to data sets describing Alfalfa mosaic virus (AMV) infection in one- to five-year-old lucerne fields in the Ebro Valley, Northeast Spain. AMV (Alfamovirus, Bromoviridae) is distributed worldwide in lucerne crops, where it reaches considerable incidence (Gibbs, 1962; Forster et al., 1985; Hajimorad and Francki, 1988). AMV is transmitted by more than 15 aphid species, and it is also seed-transmitted in lucerne, with seed transmission rates varying from 1 to 50% depending on virus strain, cultivar and age of the infected plant (Frosheiser, 1974; Pathipanawat et al., 1997). This high transmissibility, together with its broad host range of more than 400 host species including many vegetable crops, is probably the main reason for its wide geographical spread and high incidence. Fitting logistic, Gompertz and log-logistic growth models to AMV incidence data by least-squares regression provides similar conclusions regarding the comparison of the rate of incidence increase among epidemics. However, logistic regression seems a more sensitive method for identifying differences in DPC-description parameters.

Origin of data
Data used in this work are incidences of AMV in 26 different fields of lucerne of ecotype 'Aragon', aged from one to five years, sampled in three ecological areas (named A, B and C) in the Ebro Valley (Northeast Spain) during the summers of 2005 and 2006 (Table 1). Areas A and B were located near Sariñena (Huesca), and area C in Tauste (Zaragoza). In each area, one or two fields of each age were sampled in a random systematic manner, taking about 100 samples per field and analyzing them for the presence of AMV infection by ELISA with antisera raised against AMV (Loewe).

Statistical models
To analyze the incidence of AMV in lucerne fields, logistic, Gompertz and log-logistic models were fitted to incidence data by least-squares regression. The logistic model was also fitted by maximum-likelihood regression. Models were fitted as generalized linear models (Agresti, 2002) and took the following mathematical expression:

$$T_M = k + a(Z) + b \cdot E + c(Z) \cdot E$$

The dependent variable in the models, $T_M$, was derived from data ($Y$, expressed as frequency) of AMV incidence (Table 1) after their transformation according to each model: $T_M = \ln[Y/(1 - Y)]$ for the logistic and log-logistic least-squares models and for logistic regression ($M = L$, $LL$, $LR$, respectively), and $T_G = -\ln[-\ln(Y)]$ for the Gompertz least-squares model ($M = G$). The dependent variable was fitted as a function of two explanatory variables, $Z$ and $E$, with $Z$ explaining the effect on AMV incidence of the ecological area ($Z$ = A, B, C, see Table 1), and $E$ explaining the effect on AMV incidence of the age of the lucerne field ($E = \text{Age}$ for all the models except the log-logistic, in which $E = \ln(\text{Age})$, with Age = 1 to 5 years, see Table 1). The term $k + a(Z)$, which is named $Y_0$, represents an intercept that does not depend on the age effect, i.e., it quantifies the initial amount of disease of a hypothetical DPC, with $k$ constant and $a(Z)$ accounting for differences in ecological area. The term $b + c(Z)$, which is named $r_M$ ($M = L$, $G$, $LL$ or $LR$), represents the rate of increase of AMV incidence with age, with $b$ constant and $c(Z)$ accounting for differences in ecological area. The term $c(Z) \cdot E$ represents the interaction between the explanatory variables $Z$ and $E$. Thus, models also take the form $T_M = Y_0 + r_M \cdot E$, which is their more generally known linearized form.
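As a rough illustration of the linearized least-squares fits described above, the following Python sketch applies each transformation to incidence frequencies and fits the line T = Y0 + r · E for a single ecological area. The incidence values and the helper names (`logit`, `gompit`, `fit_line`, `fit_models`) are hypothetical and are not taken from Table 1.

```python
import math

def logit(y):
    """Logistic and log-logistic transformation: Ln[Y / (1 - Y)]."""
    return math.log(y / (1.0 - y))

def gompit(y):
    """Gompertz transformation: -Ln[-Ln(Y)]."""
    return -math.log(-math.log(y))

def fit_line(x, t):
    """Ordinary least-squares fit of t = y0 + r * x; returns (y0, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_t = sum(t) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxt = sum((xi - mean_x) * (ti - mean_t) for xi, ti in zip(x, t))
    r = sxt / sxx
    return mean_t - r * mean_x, r

def fit_models(ages, incidence):
    """Fit the three linearized growth models; returns {model: (Y0, r_M)}."""
    t_logit = [logit(y) for y in incidence]
    t_gompit = [gompit(y) for y in incidence]
    return {
        "L": fit_line(ages, t_logit),                          # E = Age
        "G": fit_line(ages, t_gompit),                         # E = Age
        "LL": fit_line([math.log(a) for a in ages], t_logit),  # E = Ln(Age)
    }

# Hypothetical incidences (as frequencies) for fields aged 1 to 5 years.
ages = [1, 2, 3, 4, 5]
incidence = [0.05, 0.12, 0.30, 0.55, 0.70]

models = fit_models(ages, incidence)
for name, (y0, r) in models.items():
    print(f"{name}: Y0 = {y0:.3f}, r_M = {r:.3f}")
```

The sketch omits the area terms a(Z) and c(Z); with several areas, the same fit would include indicator variables for Z and their products with E.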
Statistical parameters of goodness of fit for least-squares regression models, i.e. the coefficient of determination ($R^2$) and the mean squares of error (MSE), were recalculated to give $R^{2*}$ and MSE* after back-transformation of the predicted $T_M$ and the predicted mean of its distribution, in order to allow goodness-of-fit comparisons among models. Similarly, the initial amount of disease $Y_0^*$ for the hypothetical DPC was calculated by back-transformation from the intercept parameter $Y_0$ in the models. The rate of increase of AMV incidence for each model ($r_M$) was recalculated to give the weighted mean absolute rate of incidence increase using the expression $r_R = r_M / (2m + 2)$, with $m = 2$ for the logistic, log-logistic and logistic regression models, and $m = 1$ for the Gompertz model (Richards, 1959). This new parameter allowed comparisons among models.
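The two recalculations above can be sketched in a few lines of Python. The inverse transformations follow from the logit and Gompertz definitions, and the rate correction uses the expression r_R = r_M / (2m + 2) given in the text. The function names and the numerical estimates are hypothetical.

```python
import math

# Shape parameter m used in Richards' correction, per model.
RICHARDS_M = {"L": 2, "LL": 2, "LR": 2, "G": 1}

def back_transform(t, model):
    """Invert the model transformation to recover a frequency Y from T."""
    if model == "G":
        # Inverse of T = -Ln[-Ln(Y)]
        return math.exp(-math.exp(-t))
    # Inverse of T = Ln[Y / (1 - Y)] for the L, LL and LR models
    return 1.0 / (1.0 + math.exp(-t))

def richards_rate(r_m, model):
    """Weighted mean absolute rate of increase: r_R = r_M / (2m + 2)."""
    m = RICHARDS_M[model]
    return r_m / (2 * m + 2)

# Hypothetical intercept and rate estimates for two models.
estimates = {"L": (-3.0, 0.9), "G": (-1.2, 0.5)}
for model, (y0, r_m) in estimates.items():
    print(f"{model}: Y0* = {back_transform(y0, model):.4f}, "
          f"r_R = {richards_rate(r_m, model):.4f}")
```

With m = 2 the divisor is 6, and with m = 1 it is 4, which places logistic-type and Gompertz rates on a common scale.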

Results
Incidence data of AMV infection of 26 commercial lucerne fields with ages ranging from one to five years are presented in Table 1. Fields were located in three different ecological areas referred to as A, B and C. Models describing hypothetical DPCs for increase of AMV incidence with age were fitted to transformed incidence data ($T_M$) in the form of generalized linear models with two explanatory variables accounting for ecological area ($Z$) and age ($E$) of fields. Transformations of incidence or incidence and age variables varied depending on the model fitted in each case (see Materials and Methods).

Growth function models
Linearized forms of logistic, Gompertz and log-logistic growth models were fitted by least-squares regression. Criteria for comparing models are presented in Table 2. The three models significantly explained the variation of the data in terms of explanatory variables (P < 0.01). Gompertz and log-logistic models yielded quite similar results, while the logistic model explained a smaller proportion of the variation, as indicated by the back-transformed coefficient of determination ($R^{2*}$). Based on both $R^{2*}$ and MSE*, the Gompertz model showed the best fit to data, slightly better than the log-logistic model. Distribution of standardized residuals was also similar for the three models (not shown). Parameter values and their standard errors were calculated in the three models for each level (A, B, C) of the explanatory variable $Z$, which affects the component $a(Z)$ of the intercept $Y_0$, and the component $c(Z)$ of the rate of increase of incidence with age, $r_M$ (Table 3, case I).

(1) Coefficient of determination for transformed dependent variable $T_M$.
(2) Coefficient of determination and mean-squares of error recalculated after back-transformation of the dependent variable T M .
(3) Hypothesis test for the models: d.f., degrees of freedom; F, Fisher statistic; P, probability of the significance test.
Table 3. Estimated parameters and their standard errors of growth function models fitted by least-squares regression to data of Alfalfa mosaic virus (AMV) incidence

An advantage of general linear models with explanatory variables $Z$ and $E$ is that they provide the means to test hypotheses about the effects of these variables and their interaction. Analysis of variance of explanatory variables in the three models indicated that the variable age ($E$) had a significant effect on AMV incidence (P ≤ 0.0004), and that the interaction $Z \times E$ was not significant (0.4515 ≤ P ≤ 0.9012), i.e., the effect of age ($E$) on AMV incidence did not change among the A, B and C ecological areas under study. Thus, the interaction $Z \times E$ was dropped from the models, so that DPCs for the three areas had the same weighted mean absolute rate of increase of incidence with age (Table 3, case II). With the same rate of increase of incidence ($r_R$), the three areas were compared for the initial amount of disease ($Y_0^*$), which did not differ for the logistic model, while the Gompertz and log-logistic models detected differences between areas A and B, on the one hand, and area C, on the other (P ≤ 0.0189).

Logistic regression models
Logistic regression models were fitted stepwise to the AMV incidence data ($Y$) by the maximum-likelihood method. Fitting started with model I, which included no explanatory variables, and proceeded by introducing explanatory variables for ecological area ($Z$) and/or age of lucerne field ($E$) at subsequent steps (models II to IV). Finally, the complete model V included both explanatory variables ($E$ and $Z$) and their interaction ($Z \times E$) (Table 4). The maximum-likelihood ratio statistic ($G$) tests the goodness of fit of each model. The decreasing values of $G$ indicated a better fit as new explanatory variables were added to the models. However, the $G$ value for the complete model V was still significant, which means a lack of fit, i.e., additional explanatory variables should be included in order to better explain the variation within the data. Significance of the explanatory variables in the models could be analyzed by computing the loss of fit resulting from dropping each variable from the model. Thus, the effect of ecological area, represented by the term $a(Z)$, was tested by subtracting the $G$ goodness-of-fit value of model IV from that of model III, i.e. $G = 152.3844$ with two degrees of freedom (d.f.), which is significant ($P < 10^{-33}$). Similarly, the effect of age, represented by the term $b \cdot E$, was tested by the difference between the $G$ values of models II and IV, and the effect of the $Z \times E$ interaction, represented by the term $c(Z) \cdot E$, was tested by the difference between the $G$ values of models IV and V, which is $G = 488.0101$ with one d.f. for age and $G = 19.4428$ with two d.f. for the $Z \times E$ interaction ($P < 10^{-107}$ and $P < 10^{-5}$, respectively). Thus, the age of the lucerne field was the main factor affecting AMV incidence, the ecological area was the second most important factor, and the interaction between both effects was the third. These results indicate that the increase of incidence with age differed depending on the ecological area.
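The significance of each dropped term can be checked by referring the difference in G to a chi-square distribution with the corresponding degrees of freedom. The Python sketch below does this for the three G differences quoted in the text, assuming the usual chi-square reference distribution for likelihood-ratio statistics. The helper `chi2_sf` is a hypothetical name and only covers one and two degrees of freedom, for which the upper-tail probability has a closed form.

```python
import math

def chi2_sf(g, df):
    """Upper-tail probability P(X > g) of a chi-square variable, df = 1 or 2."""
    if df == 1:
        # A chi-square with 1 d.f. is a squared standard normal.
        return math.erfc(math.sqrt(g / 2.0))
    if df == 2:
        # A chi-square with 2 d.f. is exponential with mean 2.
        return math.exp(-g / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# G differences and degrees of freedom as quoted in the text.
effects = {
    "ecological area, a(Z)": (152.3844, 2),
    "age, b*E": (488.0101, 1),
    "interaction, c(Z)*E": (19.4428, 2),
}

for effect, (g, df) in effects.items():
    print(f"{effect}: G = {g}, d.f. = {df}, P = {chi2_sf(g, df):.3g}")
```

For other degrees of freedom, a general chi-square survival function from a statistics library would be needed instead of these closed forms.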
Parameter estimates and their standard errors for model V, which includes the three effects indicated above, are presented in Table 5. The intercept and the rate of increase of incidence with age are $Y_0 = k$ and $r_{LR} = b$, respectively, for $Z = A$. Those for $Z = B$ or $Z = C$ are computed as $Y_0 = k + a(Z)$ and $r_{LR} = b + c(Z)$, respectively. Coefficient $c(Z)$ for areas B and C was significant, which indicates that the effect of age in these areas differed from that in area A. The effect of age did not differ between areas B and C (P = 0.8408, Wald), i.e., the $Z \times E$ interaction was significant because of area A. Thus, the model was reduced to describe DPCs with a common rate of increase of AMV incidence with age ($r_R$) for areas B and C, different from that of area A. Intercepts for areas B and C were significantly different (P < 0.0001) and are presented in Table 5 in terms of the initial amount of disease ($Y_0^*$).
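The bookkeeping described above, with area A as the reference level, can be sketched as follows in Python. The coefficient values and standard error are hypothetical and do not reproduce Table 5; `area_parameters` and `wald_p` are illustrative names, and the Wald test shown is the usual two-sided normal approximation for a single coefficient.

```python
import math

def area_parameters(k, a, b, c):
    """Return {area: (Y0, r_LR)} from reference-coded coefficients.

    Area "A" is the reference level: Y0 = k and r_LR = b.
    For other areas: Y0 = k + a(Z) and r_LR = b + c(Z).
    """
    params = {"A": (k, b)}
    for z in a:
        params[z] = (k + a[z], b + c[z])
    return params

def wald_p(estimate, std_error):
    """Two-sided P value of a Wald z-test for a single coefficient."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical model V coefficients (not the values in Table 5).
k, b = -3.0, 0.9
a = {"B": 0.4, "C": 1.1}
c = {"B": -0.30, "C": -0.28}

params = area_parameters(k, a, b, c)
for z, (y0, r_lr) in params.items():
    print(f"area {z}: Y0 = {y0:.2f}, r_LR = {r_lr:.2f}")

# Hypothetical standard error for c(B), used only to show the test.
print(f"P for c(B) = {wald_p(c['B'], 0.08):.4g}")
```

A non-significant Wald test on the difference between c(B) and c(C) is what would justify a common rate for areas B and C in a reduced model of this kind.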

Discussion
Generalized linear models allow modelling of a random categorical or continuous dependent variable in terms of categorical or continuous explanatory variables. This is usually performed by maximum-likelihood regression, which maximizes the likelihood for the distribution of the dependent variable in the fitting process, so that this distribution is not necessarily restricted to normality, contrary to least-squares regression (Agresti, 2002). This is the case of logistic regression, a method especially suited for bivariate, binomial distributions. This type of statistical distribution is frequent in human, animal or plant epidemiology, in which response variables such as presence or absence of disease need to be explained by categorical variables such as breed of an animal or strain of a pathogen, and/or continuous variables such as age of a patient or time from infection. Logistic regression has been extensively used in medicine, for example in clinical studies (Petrie and Sabin, 2000). However, in spite of its potential for many types of studies, its use is not widespread in plant epidemiology (Mila et al., 2003; Musaka et al., 2003; Weiland et al., 2003; Thebaud et al., 2006; Harikrishnan and Del Río, 2007).

We have compared logistic regression with three functions frequently used in plant epidemiology: the logistic, Gompertz and log-logistic growth functions (Campbell and Madden, 1990; Jeger, 2004). The comparison was done for the analysis of data on AMV incidence in lucerne fields from different ecological areas and with different ages. The three growth models, fitted by least-squares regression, showed a poor fit to the data, as indicated by their back-transformed coefficient of determination ($R^{2*}$), which explained about 50-60% of the data variability depending on the model. The maximum-likelihood fitted logistic regression model V did not show a good fit to the data either. The poor fit of all models is probably due to the high variability contained within the data, in which lucerne fields of the same ecological area and age presented quite different AMV incidences.

The purpose of the temporal analysis of a particular data set is what determines the degree of precision needed for such an analysis (Campbell and Madden, 1990). The goodness of fit obtained for our data is probably not accurate enough for predictive purposes or for a highly precise description of the DPC. However, it is sufficiently good to represent the observed field variability and to analyze the level of interdependence of certain epidemiological factors in AMV epidemics. In fact, models that differed in goodness of fit, such as the logistic and Gompertz models, yielded similar conclusions regarding comparisons of rates of increase of AMV incidence with age of lucerne fields among different ecological areas. Other comparisons, however, would require a better fit of the models to the data for the detection of significant differences. For example, logistic, Gompertz and log-logistic models were fitted assuming the same rate of increase of incidence for all the ecological areas, as no significant differences were found for rates of increase. Thus, the initial amount of disease in the different areas could be compared, and only the best-fitted models (Gompertz and log-logistic) detected significant differences between areas A and B, on the one hand, and area C, on the other.

Logistic regression also provided a useful methodology for this type of comparison. Model V in Table 4, which showed a clear lack of fit, was sensitive enough to detect significant differences in the rate of increase of incidence with age between ecological area A and areas B and C, contrary to the logistic, Gompertz or log-logistic growth models. This may be attributed to the only difference between the logistic regression model V (Table 4) and the logistic growth model (Table 3): the statistical method for fitting the distribution of the dependent variable, maximum-likelihood regression for model V and least-squares regression for the logistic growth model. As was the case for the Gompertz and log-logistic models, logistic regression model V detected significant differences in the initial amount of disease between ecological areas B and C, once it was assumed that the rate of increase of incidence was the same for these areas.

In summary, four different generalized linear regression models were fitted to field data of AMV incidence in lucerne fields located in different ecological areas and with different ages, resulting in roughly comparable poor fits. Logistic regression was useful for the analysis of epidemiological factors associated with estimated parameters in the model, and provided more sensitive analyses than traditional least-squares regression of growth functions. Thus, logistic regression should be considered a good method with application to a wide variety of statistical analyses in the field of plant epidemiology.

(2) $r_R$: weighted mean absolute rate of increase of incidence with age after Richards' correction: $r_R = r_{LR} / (2m + 2)$, with $m = 2$. $Y_0^*$: initial amount of disease after back-transformation of $Y_0 = k + a(Z)$.

(1) $T_M$: transformed dependent variable fitted by least-squares regression. Transformation was different for each model fitted: $T_L = T_{LL} = \ln[Y/(1 - Y)]$; $T_G = -\ln[-\ln(Y)]$, with $M = L$ for logistic, $G$ for Gompertz and $LL$ for log-logistic models. $Z$: explanatory variable for the ecological area of lucerne field. $E$: explanatory variable for the age of lucerne field: $E = \text{Age}$ for logistic and Gompertz models, $E = \ln(\text{Age})$ for log-logistic model, with Age = 1 to 5 years.
(2) I) Interaction $Z \times E$ is present in the models; II) interaction $Z \times E$ is dropped from the models. $k$, $a(Z)$, $b$ and $c(Z)$ are intercept and coefficients for explanatory variables $Z$ and $E$ (see Materials and Methods for details). $r_R$: weighted mean absolute rate of increase of incidence with age after Richards' correction for comparison among models: $r_R = r_M / (2m + 2)$, with $r_M = b + c(Z)$, $m = 2$ for logistic and log-logistic models, $m = 1$ for Gompertz model. $Y_0^*$: initial amount of disease after back-transformation of $Y_0 = k + a(Z)$.
(3) Significance t-test for coefficient estimate: significance at 95% (*) or 99% (**) confidence level and standard error (STE) are indicated.

$Y$: frequency of AMV-infected plants: $\text{Logit}(Y) = \ln[Y/(1 - Y)]$. $k$, $a(Z)$, $b$ and $c(Z)$: intercept and coefficients for explanatory variables $Z$ and $E$ (see Materials and Methods for details).

Table 1. Incidence (1) of Alfalfa mosaic virus (AMV) in commercial lucerne fields aged from one to five years and located in different ecological areas of the Ebro Valley (Northeast Spain)
(1) Data are percentages of AMV infection in 26 different fields of lucerne, one to five years old, located in three ecological areas of the Ebro Valley (see Materials and Methods). In each area, one or two fields of each age were sampled during the summer of 2005 (first row of data) and 2006 (second row). A dash indicates the lack of data, i.e. only one field of the corresponding age was sampled.

In case I of Table 3, models directly provide values of the intercept, $Y_0 = k$, and the rate of increase of incidence, $r_M = b$, for $Z = A$, i.e. with $a(Z)$ and $c(Z)$ equal to zero. These parameters are calculated as $Y_0 = k + a(Z)$ and $r_M = b + c(Z)$ for $Z = B$ or $Z = C$. The parameters describing DPCs, i.e. the initial amount of disease ($Y_0^*$) and the weighted mean absolute rate of increase of incidence with age ($r_R$), were calculated after back-transformation of $Y_0$ or Richards' correction of $r_M$, respectively, for comparison among models (Table 3, case I, see Materials and Methods).

Table 2. Summary statistics for the goodness of fit of growth function models fitted by least-squares regression to data of Alfalfa mosaic virus (AMV) incidence

Table 5. Parameter estimates and their standard errors of a logistic regression model fitted by the maximum-likelihood method to data of Alfalfa mosaic virus (AMV) incidence