Tree species consistent co-occurrence in seasonal tropical forests: an approach through association rules analysis

Aim of study: Assessing the existence of consistent co-occurrence between tree species that characterize seasonal tropical forests, using a novel data mining methodology; and evaluating the taxonomic and functional similarities between associated species. Area of study: forty-four seasonal forest sites with permanent plots (40.2 ha of total sample) located in Southeast Brazil, from which we obtained species occurrences. Materials and methods: we applied association rules analysis (ARA) to the dataset of species occurrence in sites considering the criteria of support equal to or greater than 0.63 and confidence equal to or greater than 0.8 to obtain the first set of associations rules between pairs of species. This set was then submitted to Fisher’s criteria exact p-value less than 0.05, lift equal to or greater than 1.1 and coverage equal to or greater than 0.63. We considered these criteria to be able to select non-random and consistent occurring associations. Main results: We obtained a final result of 238 rules for semideciduous forest and 11 rules for deciduous forests, composed of species characteristic of vegetation types. Co-occurrences are formed mainly by non-confamilial species, which have similar functional characteristics (potential size and wood density). There is a difference in the importance of co-occurrence between forest types, which tends to be less in deciduous forests. Research highlights: The results point out the feasibility of applying ARA to ecological datasets as a tool for detecting ecological patterns of coexistence between species and the ecosystems functioning.


Introduction
Tropical forests are the most biodiverse ecosystems in the world, where hundreds of species coexist in the same space (Wright, 2002;Barlow et al., 2018). Species coexisting in these ecosystems use the same local set of resources and interact in positive or negative ways (e.g., facilitation, competition); the multidimensional nature of these interactions has significant consequences to ecosystem function due to driving the full community use of resources and their final patterns of important ecosystems attributes such as carbon stock and uptake and biodiversity (Wright, 2002;Hart & Marshall, 2013;Barraclough, 2015;Schmid et al., 2020). Understanding the underlying mechanisms of species interactions in tropical forests has been a central issue in ecology, especially considering its fundamental role in maintaining biodiversity (Chesson, 2000;Wrigth, 2002;Hart et al., 2017).
In general, species that coexist have different ecological requirements or different ecological niches (Wright, 2002;Amarasekare et al., 2004;Barraclough, 2015;Kraft et al., 2015). When their niches overlap, species may face short-term consequences in the local community scale, such as competitive exclusion; and longterm consequences in evolutionary or biome scales, such as niche differentiation and habitat partitioning (Wright, 2002;Barraclough, 2015;Kraft et al., 2015;Chen et al., 2020). In addition, the mechanisms of species coexistence are also influenced by resource availability, which may strongly control the importance of interactions in community assembly and define its most important mechanisms (Holmgren & Scheffer, 2010;Cadotte & Tucker, 2017). For example, in harsh environments process such as facilitation may have relatively greater importance than competition (Wright, 2002;Carrión et al., 2017). In addition, under restrictive conditions, species adopt a series of physiological mechanisms to assist in their survival by increasing their resistance to stressful environmental conditions and enhancing their ability to obtain and use resources efficiently (Cadotte & Tucker, 2017;van der Sande et al., 2017). These mechanisms can be observed in broad ecological strategies, such as resprouting, scleromorphic traits or deciduousness, or in the level of adaptation of internal structures and associated processes (e.g., variations in diameter and length of vessels, stomatal conductance) (Zeppel et al., 2015;Pausas et al., 2016;Jimenez-Rodriguez et al., 2018;He et al., 2019).
Species coexistence also affects forest ecosystem function, stability, resilience and important ecosystem services such as carbon uptake and carbon stocks (Barraclough, 2015;van der Sande et al., 2017;van der Plas, 2019). High species diversity usually promotes high productivity and ecosystem stability because the niches of high numbers of coexisting species tend to be complementary rather than overlapping (Tilman, 1999;van der Sande et al., 2017). Coexisting species with different requirements are able to exploit available resources more efficiently and, consequently, the overall ecosystem achieves higher productivity (Chesson, 2000;van der Sande et al., 2017).
Questions related to the maintenance and management of species coexistence in tropical forests have been addressed with different approaches, such as theoretical and statistical models and empirical data (Chesson, 2000;Wright, 2002;Hart et al., 2017;Chen et al., 2020;Schmid et al., 2020). The emergence of new methodologies of machine learning and statistical analysis may potentially contribute to our understanding of important ecological patterns: coexistence between species, their relation to environment and also in identifying indicators species of the ecosystems. An example is provided by the data-mining technique association rule analysis, or ARA, that is an important market baskets analysis (Silverstein et al., 1998;Rossi et al., 2014). ARA is widely used in online product sales through different algorithms for allowing efficient identification of associations between elements in extensive datasets (Agrawal et al., 1993;Ferrarini & Tomaselli, 2010;Zumel et al., 2019). However, despite its potential, its use in scientific contexts is still scarce and, to our knowledge, even more so in ecological studies. The few examples of use of ARA in ecological contexts are Leote et al. (2020), that used to identify indicator species of arthropods; and Ferrarini & Tomaselli (2010) and Rossi et al., (2014) that used ARA to identify relations between vegetation and environmental attributes in studies at landscape level.
Here, we used ARA to assess patterns of species coexistence using as study case the tropical seasonal forest (deciduous and semideciduous), that may be broadly characterized by enduring high temperatures and the alternation between a rainy and a dry season every year, that imply deciduousness of most of the species under strong water stress (DRYFLOR, 2016). Most studies on tropical species coexistence have mainly focused on rainforests, despite of less the significant contribution of seasonal forests to tropical biodiversity and ecosystem service provision (Sunderland et al., 2015;DRYFLOR, 2016). We thus applied ARA on the data of 44 sites of tropical seasonal forests in order to (1) identify consistent associations between species (i.e., patterns of species coexistence) that may characterize these forest types, (2) evaluate functional and taxonomic similarities between consistently associated species. In addition, we discuss the feasibility of applying ARA into ecological research as a tool to identify patterns of species coexistence, structure, function and diversity of communities.

Data set
Here we used a dataset of 44 sites of seasonal tropical forests in central-eastern Brazil ( Fig. 1; Table S1 [suppl.]), with 22 semideciduous forest sites and 22 deciduous forests sites. Semideciduous forests are associated with the Atlantic Forest domain and between 50 and 70% of the forest canopy loses its leaves in the dry season (Neves et al. 2017). These sites are under a Köppen Cwa climate (subtropical with dry winters and rainy summers), with average annual rainfall between 1400 and 1500 mm and average monthly temperature between 19 and 20 °C. Deciduous forests are affiliated with the Caatinga domain (a nucleus of seasonally dry tropical forests -SDTF) in which more than 70% of the species lose their leaves in the rainy season (Pennington et al., 2009), and that is located in the southern Caatinga or in islands of fertile soil in Cerrado domain (Fig. 1). These sites are located under a Koppen Aw/As climate transition (tropical dry winter), with average annual rainfall between 750 and 1000 mm and average monthly temperature between 22 to 24.6 ° C .
Each site was sampled with a varied number of permanent sampling units (5 to 128 per site; total of 1077) with varying dimensions (400 m² in most cases, but a few with 225 and 300 m²) depending on local terrain features and fragment size. Total sampled area was 40.2 ha: 23.68 ha in semideciduous forests and 16.52 ha in deciduous forests. Within each sampling unit, we measured and identified to the species level all trees with a diameter at breast height (DBH; 1.30 m above the ground) ≥ 5 cm.

3
Tree species consistent co-occurrence through association rules analysis Plant identification followed APG IV (APG, 2016) and name standardization followed REFLORA (Flora do Brasil, 2020), using the flora package (Carvalho, 2016) implemented in the environment R v. 3.6.1 (2020). Forest inventory data are stored in the ForestPlots.net system (https://www.forestplots.net; codes presents in Table  S1 [suppl.]) and are available upon request. These data were then used to generate two matrices of species occurrence on the sites, one for each vegetation type (semideciduous and deciduous), which were used in the analyses described below. The final matrices had 664 species occurring in 22 semideciduous forests and 433 species occurring in 22 deciduous forests.

Association rule analysis
We explored patterns of species coexistence in each vegetation type (semideciduous and deciduous forest) using association rule analysis (ARA, Agrawal et al., 1993;Silverstein et al., 1998;Ferrarini & Tomaselli 2010;Rossi et al. 2014;Leote et al., 2020). ARA is a data mining tool that identifies associations between categorical observations (in this case, species names) in large data sets. It proposes relationship rules in the template: "if species X, then species Y". Thus, based on a set of events (often called "transactions") and the presence of elements in these events, it is possible to relate the existing categories and associate their coexistence with a probability of occurrence and importance (Ferrarini & Tomaselli, 2010;Rossi et al., 2014;Zumel et al., 2019;Leote et al., 2020).
In the case of our data, the events (or transactions) are the sites of collection and the set of elements are all tree species present in each one.
Each rule has two parts: the rule left side (RLS) is a unique category level of reference (e.g., one species) that has an occurrence frequency in seasonal forest sites, which is called support; and the rule right side (RRS) that is composed by one or more category levels and that is related to the RLS in a frequency called confidence. Confidence values are always based on the rule support value and inform how frequent is the rule in the dataset (Ferrarini & Tomaselli, 2010;Rossi et al., 2014;Zumel et al., 2019;Leote et al., 2020). Thus, if for a given rule the RLS species occurs in half of the sites (support of 0.5) and the whole rule has confidence of 1.0, the RRS species are associated with the RLS species in all its occurrences, that is, they coexist in half of the sites. The rules obtained present three other measures: lift, coverage and Fisher's exact test p-value. Lift is a measure of the rule's importance and is obtained through the relation between the observed rule confidence and the previously expected confidence, in which lift > 1 indicate the rule is more frequent than expected by initial constraints and lift equal to 1 indicates that rule's species are independent from each other and that the rule is not real. That is, more the value of lift, greater are the chances of having the sp 2 if the site already has the sp1. In addition, in cases where the lift has values less than 1, there is a negative relationship between the species, in which the occurrence of the first can negatively affect the occurrence of the second. Coverage is a measure of the rule's frequency that evaluate the support of the species present in the RLS. Fisher's exact test p-value is the rule's significance obtained through Fisher's exact test of contingency for small samples, and evaluate whether the rule is more significantly frequent than expected by chance (Hahsler, 2006). The full process of rules' obtaining by applying constraints is summarized in Fig. 2.
Based on the occurrence data (presence/absence) of species in sites of each vegetation type (analysis were run separately for semideciduous and deciduous forests), we used the apriori algorithm (Borgelt & Kruse, 2002) from the arules package (Hahsler et al., 2020) in R v. 3.6.1 (2020) to obtain all pairwise association rules that met the following criteria: support equal to or greater than 0.63, which corresponds to the occurrence of the species in RLS in more than 60% of the sites (at least 14 sites of each vegetation type); and confidence of 0.8 for the rule.
These criteria indicate that species on RRS must be related to the species on RLS in at least 80% of its occurrences. We focused on pairwise species associations aiming to reveal direct associations between them and analyze their similarities more closely. From the rules obtained, we selected those with significant Fisher's exact p-value (<0.05), with coverage equal to or greater than 0.63 and lift greater than or equal to 1.1 (more than 10 % frequent than expected). By establishing these criteria, we were able to select strong combinations composed of species of wide occurrence, as well as to select associations of high importance that characterize the vegetation types.
Based on the selected rules, we explored the similarities between the coexisting species in each vegetation type. For that, we evaluated the compatibility between families of the coexisting species in selected rules of each vegetation type, as well as the similarities between them regarding two functional characteristics: potential size (DBH, cm), calculated as the 95th percentile of all DBH measurements of that species in the data set; and wood density (WD, g/cm³), extracted from a global database (Wood Density database; ) that uses the ave-rage species WD value or the genus or family average when species-level data is unavailable. We chose these functional traits because they hint on species ecological behavior: wood density is associated with growth rate, mechanical resistance and hydraulic efficiency ; while potential size is a proxy of resistance to canopy light and ability to reach advanced successional stages in forests (Falster & Westoby, 2005). With the functional data of the species in selected association rules (Table S2 [suppl.]), we thus quantified the difference between the functional traits of the species in each rule (left, RLS and right, RRS), evaluating the pattern of functional similarity between associated species. Graphics were made with the packages arulesViz (Hahsler, 2019) and ggplot (Wickham, 2016) for R program (R Core Team, 2020).

Results
Association rule analysis (ARA) with the established criteria resulted in 238 (out of 928 rules) significant and frequent pairwise species associations in semideciduous forests, and in 11 (out of 62) significant and frequent pairwise species associations in deciduous forests (Tables S3 and S4 [suppl.]). In the semideciduous forests, the pairwise associations are formed by a group of 33 species in 28 genera and 18 families. In the deciduous forests, the pairwise associations are formed by a group of 8 species in 8 genera and 5 families. In semideciduous forests, maximum support was 0.68 and maximum coverage was 0.68; maximum lift was 1.47 for the association between Dendropanax cuneatus (DC.) Decne. & Planch.
(Araliaceae) and Annona dolabripetala Raddi (Annonaceae) (Table S3 [ suppl.] ; Fig 3 -a). In deciduous forests, maximum support was 0.68 and maximum coverage was 0.73; maximum lift was 1.21 for the association between Ptilochaeta bahiensis Turcz. (Malpighiaceae) and Cenostigma pluviosa (DC.) L.P. Queiroz (Fabaceae) (Table S4 Figure 2. Theoretical framework of the process of obtaining association rules between species using association rules analyzes (ARA). The scheme shows that from the set of species occurrence in the study areas, two limitations are initially proposed (support and confidence) for obtaining the initial rules. This set of initial rules then goes through a second selection process through the constraints Fisher's p-value, Lift and Coverage, to finally obtain the final rules.
Tree species consistent co-occurrence through association rules analysis [suppl.] ; Fig 3 -b). In both semideciduous and deciduous forests, some species pairs were mutually related, with occurrence dependence found in both directions; i.e., the association was consistent when swapping the species in the left and right sides.
Most of the pairs are formed by taxonomically unrelated species, at least as far as the family level (Table S2  [ suppl.]). In the semideciduous forests, only 5.8% of all pairwise associations were between confamilial species. In the deciduous forests, Goniorrhachis marginata Taub. and Machaerium acutifolium Vogel. formed the only confamilial species pair (both belong to Fabaceae) (~ 9 % do total; Table S2 [suppl.]). In the rules selected for semideciduous forests, there are associations between species with different functional characteristics (Fig S1 [suppl.]), but most associations occur between species with similar potential size (DBH) and wood density (Fig S1 -a  [suppl.]). In deciduous forests the trend seems to be the same, but the low number of rules selected prevents an accurate assessment (Fig S1 -b [suppl.]). In general, association rules tend to be formed by taxonomically unrelated species that are similar but not functionally equal.

Discussion
Our results showed the existence of consistent pairwise associations between species in semideciduous and deciduous forests, which are characteristic of their com-munities. However, widespread associations of species of broad occurrence are more frequent and are associated with a greater species number in the semideciduous forest, in comparison with the deciduous forests. In addition, we found that these pairs are mainly formed by non-confamilial species (of different botanical families) with similar functional characteristics.
The widely occurring associations between species in these seasonal forests indicate that a set of broad species has their occurrence dependent on each other in communities, and that their associations are characteristic of these vegetation types and important to their structure and functioning. These co-occurrences are formed by species characteristic of these vegetation types, which have a wide occurrence in these forests conditioned by past evolutionary occupation processes (Oliveira-Filho & Fontes, 2000;Pennington et al., 2009;Santos et al., 2012;Moro et al., 2016;Neves et al., 2017). This result points out that there is a great chance of sampling the presented co-occurrences in tree community randomly sampled in these regions.
The greater number of species and coexistence rules observed in the semideciduous forests compared to the deciduous forests is related to the different environmental conditions between the two vegetation types. Because they are subjected to more restrictive ecological conditions such as high temperatures and a long dry season that explain the greater species deciduousness, often species in deciduous forests tend to occur in environments close Figure 3. Representation of the first 20 association rules between pairs of species according to the lift values for semideciduous forests (a) and of the 11 selected association rules between pairs of species for deciduous forests (b). For deciduous forests, since only 11 rules emerged in this vegetation type, the figure represents all of them. Each circle represents an association rule and the arrows point to its participating species. Circle sizes are directly related to rule support values (the larger the circle, the higher the support), while circle colors are related to the rule lift value (the stronger the color, the greater the lift value). The arrow points the direction of co-occurrence between species, considering the reference species of the association rule. The situation in which two species have two arrows between them, each pointed in one direction, points out that co-occurrence occurs in both directions. In other words, species X associates with Y and Y associates with X.
to their survival limit (Pennington et al., 2009;Allen et al., 2017). Thus, small variations in environmental conditions may generate variations in ecological filters and thus select a distinct set of species (Santos et al., 2012;Apgaua et al., 2015;Maia et al., 2020;Souza et al., 2020). In fact, these results reflected the most frequent species in our deciduous forest dataset. It is also important to highlight the different sampling intensity between vegetation types, that probably influenced the result due to a lower number of trees sampled, and consequently a lower chance to identify association. But we consider that the great difference (238 x 11 rules) is strong enough to indicate the trend of more association rules in semideciduous forests.
The lower number of coexistences in deciduous forests can also be explained by the lower participation of interactions in community assembly in restrictive environments according to stress-gradient hypothesis (Holmgren & Scheffer 2010;Kraft et al. 2015;Cadotte & Tucker, 2017). In this perspective, associations thus would have a secondary role in the deciduous forest assembly and function and would be linked mainly to facilitation processes (Holmgren & Scheffer, 2010;Hart & Marshall, 2013;Cadotte & Tucker, 2017;Carrión et al. 2017). It is important to emphasize that adopting broader criteria in the measures of association rule analysis (more permissive values) may allow the achievement of a greater number of association rules, but without considering the assumption of high frequency in the communities.
The result found that most of the significant coexistences are composed of non-confamilial and functionally similar species (especially in semideciduous species where the number of relationships is greater) suggests the existence of niche adaptation processes for the occupation of the same environments. Species of different families with similar ecological requirements may thus have undergone niche differentiation processes after the occupation of habitats, as a way of persistence and avoiding competitive exclusion due to niche overlap (Wright, 2002;Barraclough, 2015;Cadotte & Tucker 2017;Chen et al. 2020). In addition, the coexistence between these species may have been consolidated after processes of modification of their ecological patterns, such as decrease of representativity in abundance and/or biomass (Holmgren & Scheffer, 2010;Hart & Marshal, 2013;Kraft et al., 2015;Hart et al., 2017;Chen et al., 2020). Thus, the species coexistence seasonal tropical forest would be associated with changes in ecological requirements and persistence strategies by species, consequently impacting ecosystem functioning (Wright, 2002;Hart & Marshal, 2013;Kraft et al., 2015).
Here we demonstrate the feasibility of applying the novel association rule analysis (ARA) methodology to ecological studies of tree communities in seasonal tropical forests, but that can be extended to other biological groups. As demonstrated here, the methodology is promising to identify coexistence relationships (both positi-ve and negative) in large ecological data sets, which can also be used to identify relationships with environmental drivers synthesized into categorical variables, although in this case other robuster analysis approaches may be more appropriate (Ferrarini & Tomaselli, 2010;Rosssi et al., 2014;Leote et al., 2020). The use of ARA approach can also be used to identify indicators species in ecosystems, being an alternative to traditional approaches in situations of extensive datasets (Leote et al., 2020). In vegetation studies we call attention to the need of sampling intensity and for the also need of inclusion criterium standardization, since species richness and composition are strongly influenced by them. Our results indicate that association rules analysis is an interesting alternative to the traditional analysis of ecological data and that it should be incorporated into future studies, joint to other novel data mining and data science tools (e.g., machine learning, neural networks, natural language processing and artificial intelligence) broadly used in non-scientific contexts (Hahsler, 2006;Zumel et al., 2019). All of these approaches are already in consolidated use with positive results in organizations, so that they can also assist in the resolution of relevant ecological issues, in addition to allowing a revisiting of consolidated patterns, as they offer new perspectives in relation to the data. Its use must also be associated with the incorporation of other objects of ecological studies, in order to contribute to the understanding of biodiversity patterns and ecosystem functioning.