Species-diagnostic markers in the genus Pinus: evaluation of the chloroplast regions matK and ycf1

Aim of study: The identification of material of forest tree species using genetic markers was carried out. Two promising chloroplast barcode markers, matK and ycf1, were tested for species identification and reconstruction of phylogenetic relationships in pines.
Area of study: The present study included worldwide Pinus species, with a wide representation of European taxa.
Material and methods: All matK sequences longer than 1600 base pairs and ycf1 sequences for the same species were downloaded from GenBank, aligned and subsequently analyzed to estimate alignment statistics, phylogenetic trees and substitution saturation signals.
Main results: We confirm the usefulness of the ycf1 marker for barcoding purposes and phylogenetic studies in pines, especially in studies focusing on within-genus relationships, but caution in the use of the matK marker is recommended.
Research highlights: Incongruent phylogenetic signals between these two chloroplast markers are demonstrated in pines for the first time.
Additional keywords: barcoding, conifers, phylogeny.
Abbreviations used: posterior probabilities (PP), bootstrap (BS).
Authors' contributions: SO and DG designed the study. JCV analysed the data with help from SO. SO wrote the manuscript together with DG and contributions from JCV. All authors approved the final version of the manuscript.
Citation: Olsson, S., Grivet, D., Cid-Vian, J. (2018). Species-diagnostic markers in the genus Pinus: evaluation of the chloroplast regions matK and ycf1. Forest Systems, Volume 27, Issue 3, e016. https://doi.org/10.5424/fs/2018273-13688
Received: 11 Jul 2018. Accepted: 30 Oct 2018.
Copyright © 2018 INIA. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC-by 4.0) License.
Funding: SO received funding from the Spanish Ministry of Economy and Competitiveness (MINECO) under PTA2015-10836-I contract.
Competing interests: The authors have declared that no competing interests exist. Correspondence should be addressed to sanna.olsson@inia.es Forest Systems 27 (3), e016, 11 pages (2018) eISSN: 2171-9845 https://doi.org/10.5424/fs/2018273-13688 Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria O. A., M. P. (INIA)


Introduction
In forest trees, diagnostic markers have diverse applications in biodiversity, conservation, restoration, trade control, and tree improvement. The identification of forest material is generally performed using molecular markers developed for different purposes, and therefore analysed at different hierarchical levels (species, provenances, families or clones). When the objective is the unambiguous identification of single species that are difficult to distinguish morphologically, either in their original state or because the samples are transformed products (e.g. timber, furniture, barrels, processed food), barcoding technology, which uses short universal DNA sequences, can be applied (Lidder & Sonnino, 2011). At the species level, barcoding is central to a major field: internationally traded timber and wood products. Forensic applications are directed towards identifying species that are illegally exported, high-value species that are falsely declared to be low-value timbers and sold as such (Nielsen & Dahl, 2008), or species protected under the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) regulations.
Species delineation is also of interest for establishing the relationships among species in phylogenetic studies. Apart from advancing our understanding of evolution and biodiversity, phylogenetics has many practical applications. For example, knowledge of species phylogenies may help to understand the evolutionary trade-offs of life-history traits in pines (e.g. Grivet et al., 2013) or assist strategies dealing with pine diseases and pests (e.g. Moreira et al., 2016). In conservation biology, phylogenetic information can be used to select and prioritize populations (Volkmann et al., 2014). Phylogenetic and phylogeographic methods can be particularly useful to infer the origin of timber and wood products (Finkeldey et al., 2010). Phylogenetic methods based on barcoding markers have successfully been applied to prevent illegal trade of protected species (Baker et al., 2010; Ghorbani et al., 2017). Furthermore, several applications are implemented at the intraspecific level for traceability of important tropical timber species (Tnah et al., 2009, 2010; Degen et al., 2010), following international agreements (e.g. FLEGT, the EU Forest Law Enforcement, Governance and Trade regulation), or for trade control of forest reproductive material.
Chloroplast genomes provide a good source of species-diagnostic markers. More specifically, they are present in multiple copies (facilitating PCR amplification), uniparentally inherited, and suitable for studies involving different taxonomic levels because their regions evolve at different rates (Soltis & Soltis, 1998; Xu et al., 2015). Species-diagnostic markers are deposited in public repositories of molecular sequence data that gather the information available for all species sequenced for a specific marker (e.g. GenBank). The use of novel diagnostic markers is therefore limited, as it would require sequencing many species for that marker, and consequently the same established genetic markers are often used for both barcoding and phylogenetic purposes. Ideally these markers should be as generalizable across groups as possible without losing species resolution capacity (Kress et al., 2009). The most suitable markers for barcoding in plants were selected among commonly used phylogenetic markers by the CBOL Plant Working Group (Hollingsworth et al., 2009a).
In the present study our aim is to test diagnostic chloroplast markers in Pinus, a genus of huge ecological and economic importance (Price et al., 1998). With over a hundred recognized species, Pinus is the largest genus of conifers and constitutes a major, often dominant component of multiple natural landscapes such as boreal, subalpine, temperate, tropical and arid woodlands (Richardson & Rundel, 1998). The economic importance of pines stems from their use as sources of wood, pulp, resins and charcoal. In addition, pines are currently the focus of biomass research as a promising type of forest plantation for energy production (Álvarez-Álvarez et al., 2018).
The genus Pinus is divided into subgenus Strobus and subgenus Pinus, the latter consisting of section Pinus (subsections Pinus and Pinaster) and section Trifoliae (subsections Contortae, Ponderosae and Australes) (Gernandt et al., 2005). Pine phylogenetic relationships are still partly unresolved, especially among terminal taxa in the subsections Strobus and Australes (Eckert & Hall, 2006; Parks et al., 2009; Gernandt et al., 2018). Furthermore, species complexes have been particularly debated groups, and their exact composition and relationships have been questioned, as is the case, for instance, for North American Pinus contorta-banksiana (Yang et al., 2007), Asian Pinus kesiya (Businský et al., 2014), as well as European Pinus mugo (Christensen, 1987) and Mediterranean pines (Syring et al., 2005; Grivet et al., 2013). This species-delineation limitation poses problems when trying to identify forest materials at the species level based on solid timber products from species that are not well identified by wood traits, as is the case for the closely related Pinus nigra, Pinus mugo and Pinus sylvestris (Schoch et al., 2004).
Two promising species-diagnostic chloroplast markers in pines are matK and ycf1. The matK marker has been one of the most frequently used genes for inferring phylogeny in pines (Wang et al., 1999; Geada López et al., 2002; Gernandt et al., 2003, 2005, 2008; Hernández-León et al., 2013; Dong et al., 2015). The more recently introduced ycf1 was reported by Hernández-León et al. (2013) to be more variable than other chloroplast markers commonly used in phylogenetic studies in pines (rbcL, trnD-Y-E, trnH-psbA and matK). Based on these premises, we tested the suitability of matK and ycf1 for barcoding purposes and for resolving phylogenetic relationships in pines, mostly from Europe.

Material and Methods
The approximately 1,550 base pair (bp) long maturase K (matK) gene was shown to be one of the most promising barcode markers in all land plants (Hollingsworth et al., 2009a). In pines, matK has been frequently used for inferring phylogeny (Wang et al., 1999; Geada López et al., 2002; Gernandt et al., 2003, 2005, 2008; Hernández-León et al., 2013; Dong et al., 2015). These studies showed that matK is not variable enough in pines to fully resolve species-level relationships. Efforts have been made to develop more variable markers to clarify the remaining controversial relationships. The marker ycf1 was proposed as a promising marker for pines by Parks et al. (2009, 2011). Dong et al. (2015) confirmed ycf1 to be the most variable plastid DNA barcode of land plants. However, the evolution of the gene was described as abnormal and probably under selection (Parks et al., 2009). Furthermore, this uncommonly high variability could be an issue at higher taxonomic levels, in studies focusing on above-species relationships. The few earlier studies comparing the use of matK and ycf1 in resolving phylogenetic relationships in the genus Pinus (Hernández-León et al., 2013; Dong et al., 2015) did not study the whole length of the matK marker but used only an approximately 800 bp long region.
In the present study, all the matK sequences longer than 1600 bp were downloaded from GenBank, totalling 55 Pinus species (Table 1). The ycf1 sequences for the same species were also downloaded. The GenBank Accession Number of each sequence is provided in Table 1. Only one sequence per species was used. The sequences were aligned using MAFFT (Katoh & Standley, 2013) to produce two alignments, one for matK and one for ycf1, and adjusted manually with PhyDE® v1.0 (Müller et al., 2005). Statistics on the alignments were obtained with the PhyDE plugin SeqState. Uncorrected pairwise distances were compared with maximum likelihood distances in PAUP v4.0b10 (Swofford, 2002) to detect any saturation signal in the markers, by checking the plots for deviation from linearity.
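The saturation check described above can be illustrated with a minimal sketch. It is not the PAUP procedure used in the study: the Jukes-Cantor (JC69) correction is swapped in here as a simple stand-in for maximum likelihood distances, and the function names and the toy alignment are hypothetical. When corrected distances grow much faster than uncorrected ones, the plotted points bend away from a straight line, which is the signal of saturation.

```python
import math
from itertools import combinations

VALID = set("ACGT")

def p_distance(a, b):
    """Uncorrected proportion of differing sites; gaps and ambiguity codes are skipped."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in VALID and y in VALID:
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return diffs / compared

def jc69_distance(p):
    """Jukes-Cantor corrected distance (assumed stand-in for an ML distance)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond full saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def saturation_table(alignment):
    """Return (taxon1, taxon2, uncorrected, corrected) for every pair of taxa."""
    rows = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        p = p_distance(s1, s2)
        rows.append((n1, n2, p, jc69_distance(p)))
    return rows

# Hypothetical toy alignment, not data from this study.
toy = {
    "sp_A": "ACGTACGTAC",
    "sp_B": "ACGTACGTAT",
    "sp_C": "ACGAACGTTT",
}

for n1, n2, p, d in saturation_table(toy):
    print(f"{n1} vs {n2}: uncorrected={p:.3f} corrected={d:.3f}")
```

Plotting the third column against the fourth for all species pairs gives the kind of scatter that is inspected for deviation from linearity.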
Two phylogenetic analyses were performed on the individual alignments and on a concatenated matrix. First, Bayesian analyses were performed with MrBayes v3.2.6 (Ronquist et al., 2012) implemented at CIPRES Science Gateway (Miller et al., 2010). Best-fit substitution models were inferred with jModeltest v.2.1.10 (Darriba et al., 2012). Following the output from jModeltest, the GTR+Γ model was applied for both matK and ycf1. The a priori probabilities supplied were those specified in the default settings of the program. Four runs with four chains (1 × 10^6 iterations each) were run simultaneously. Chains were sampled every 1,000 iterations and the respective trees written to a tree file. Tracer v1.6 (Rambaut et al., 2014) was used to analyze the output of the model parameters, more specifically to examine the sampling and convergence results. Calculations of the consensus tree and of the posterior probability of clades were performed

Table 1. Pinus sequences from 55 species downloaded from GenBank. The dataset corresponds to all matK sequences longer than 1600 base pairs and to all ycf1 sequences for the same species. Asterisks (*) indicate those sequences where the ycf1 region was extracted from the whole or partial chloroplast genome.
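As an illustration of how a concatenated matrix can be assembled from two single-marker alignments, the sketch below joins the aligned sequences taxon by taxon and records the coordinates of each partition. The tool actually used for concatenation in the study is not stated; the function name, the padding of missing taxa with "?" and the toy sequences are assumptions made for this example.

```python
def concatenate(alignments):
    """Join alignments given as a list of (marker_name, {taxon: aligned_sequence}).

    Returns the concatenated matrix and a list of (marker_name, start, end)
    partitions with 1-based inclusive coordinates.
    """
    taxa = sorted(set().union(*(aln.keys() for _, aln in alignments)))
    matrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for name, aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{name}: sequences are not aligned to equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon missing from one marker is padded with "?".
            matrix[taxon] += aln.get(taxon, "?" * length)
        partitions.append((name, start, start + length - 1))
        start += length
    return matrix, partitions

# Hypothetical toy alignments, not data from this study.
matk_toy = {"P_alpha": "ACGT", "P_beta": "ACGA"}
ycf1_toy = {"P_alpha": "TTGCA", "P_beta": "TTGCC"}

matrix, partitions = concatenate([("matK", matk_toy), ("ycf1", ycf1_toy)])
for taxon, seq in matrix.items():
    print(taxon, seq)
print(partitions)
```

Keeping the partition coordinates makes it possible to apply a separate substitution model to each marker in a combined analysis.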

Alignment statistics
There were 1667 characters in the matK alignment, of which 586 belonged to the barcode region for matK. The ycf1 alignment contained 2863 characters, including a hypervariable region of 208 bp identified by visual inspection. The regions are depicted in Figure 1. Details on the alignment are given in Table 2. Our alignment statistics for these two markers are consistent with earlier reported results (Hernández-León et al., 2013; Dong et al., 2015). No signal of saturation was observed, except for the ycf1 marker including the hotspot region, for which very slight substitutional saturation was observed, as illustrated by a slight deviation of the pairwise distance points from linearity (Figure 2).
The ycf1 alignment was more variable than the matK alignments, with 17.5% of parsimony informative sites (PIS) vs 7.5% and 5.8% for matK, depending on whether the longer full matK region or only the barcode region was included, respectively. The hypervariable region observed by visual inspection of the ycf1 marker had 32.2% of informative sites. Excluding this region slightly lowered the variability of the rest of the ycf1 region (16.4% PIS).
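For readers unfamiliar with these statistics, the following sketch shows how variable sites (VS) and parsimony informative sites (PIS) can be counted from an alignment. It is not the SeqState implementation: the function name and the toy sequences are hypothetical, and gaps and ambiguity codes are simply ignored, which is an assumption that may differ from the settings used in the study. A site is counted as parsimony informative when at least two different nucleotides each occur in at least two sequences.

```python
from collections import Counter

VALID = set("ACGT")

def site_stats(seqs):
    """Count variable and parsimony informative sites in a list of aligned sequences."""
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences are not aligned to equal length")
    variable = informative = 0
    for column in zip(*seqs):
        # Assumption: gaps and ambiguity codes are not counted as character states.
        counts = Counter(c for c in column if c in VALID)
        if len(counts) >= 2:
            variable += 1
            # Informative: at least two states, each present in two or more sequences.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return {
        "bp": length,
        "VS": variable,
        "VS %": 100.0 * variable / length,
        "PIS": informative,
        "PIS %": 100.0 * informative / length,
    }

# Hypothetical toy alignment, not data from this study.
toy = ["ACGTAA", "ACGTTA", "ATGAAA", "ATGATC"]
print(site_stats(toy))
```

In the toy alignment the last column differs in a single sequence only, so it counts as variable but not as parsimony informative.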

Phylogenetic trees
The majority rule consensus tree from the Bayesian inference had better resolution compared to the maximum likelihood tree (Figures 3-5). Therefore, the Bayesian trees are presented with confidence at the nodes indicated by posterior probabilities (PP) and complemented with bootstrap values (BS) of the maximum likelihood analysis when applicable. Following Alfaro et al. (2003) we consider PP > 0.95 or BS > 70 as statistically significant support for a clade.
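The support criterion stated above can be written as a small helper. The sketch below applies it to support values reported elsewhere in this Results section; the function name and the data structure are hypothetical, and treating a missing value as "no support" is an assumption.

```python
PP_THRESHOLD = 0.95  # posterior probability cut-off stated in the text
BS_THRESHOLD = 70    # bootstrap cut-off stated in the text

def is_supported(pp=None, bs=None):
    """True if PP > 0.95 or BS > 70; a missing value (None) never counts as support."""
    pp_ok = pp is not None and pp > PP_THRESHOLD
    bs_ok = bs is not None and bs > BS_THRESHOLD
    return pp_ok or bs_ok

# (PP, BS) pairs as reported in the text for selected clades.
reported = {
    "Australes II, combined data": (0.87, 62),
    "Australes II, ycf1": (0.98, 59),
    "P. attenuata + P. oocarpa, matK": (0.96, 62),
    "P. tabuliformis + P. yunnanensis, combined data": (0.65, 40),
}

for clade, (pp, bs) in reported.items():
    label = "supported" if is_supported(pp, bs) else "not supported"
    print(f"{clade}: PP {pp} / BS {bs} -> {label}")
```

Because the criterion is an "or", a clade such as Australes II in the ycf1 tree counts as supported on the strength of its posterior probability even though its bootstrap value is below 70.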
The phylogenetic tree based on combined marker data is shown in Figure 3. The tree is fairly well resolved and supported. The relationships in subsection Pinaster are resolved and fully supported, but in subsection Pinus many of the placements do not receive statistically significant support. The topology of section Trifoliae is congruent with the phylogeny presented by Gernandt et al. (2018), with the formation of the same groups Contortae, Ponderosae, Attenuatae, Australes I and II. Australes II, however, does not receive significant support (PP 0.87 / BS 62), and Oocarpae is not resolved as a monophyletic group.
The relationships in the tree based on matK are poorly resolved from species level up to subsection level (Figure 4). The subsections Pinaster and Pinus are not resolved as individual clades, nor are the groups Attenuatae, Oocarpae or Australes.
The ycf1 tree (Figure 5) is similar to that based on the combined marker data in both resolution and topology. The same subsections and groups are formed and, as in the combined tree, Oocarpae is not resolved as a monophyletic clade. The Australes II clade is, however, significantly better supported than in the combined tree (PP 0.98 / BS 59). There were no significant differences between the phylogenetic trees based on ycf1 with or without (data not shown) the hotspot region.
A few significant incongruences were detected when comparing the gene trees based on individual markers. The conflicting positions involve P. attenuata, P. oocarpa, P. caribaea and P. tabuliformis. P. attenuata is placed sister to P. oocarpa (PP 0.96 / BS 62) in the analysis based on matK, while P. attenuata more logically forms a clade together with P. muricata and P. radiata (Attenuatae or the California closed-cone pines) based on ycf1 and the combined analysis. P. caribaea is placed in a clade with P. leiophylla and P. patula (PP 0.99 / BS 66) only in the analysis based on matK, while it is sister species to P. elliottii based on ycf1 and the combined analysis. P. tabuliformis is sister species to P. yunnanensis (PP 0.96 / BS 65) based on matK but sister to P. kesiya (PP 0.98 / BS 63) based on ycf1. In the combined analysis P. tabuliformis is sister to P. yunnanensis with low support (PP 0.65 / BS 40).
Furthermore, the placement of some species receives higher support values in one of the single-marker trees. Most noteworthy, the relationships in the subsection Pinus are better resolved based on ycf1 alone than on the combined data set. Based on ycf1, the positions of P. resinosa, P. nigra, P. mugo, P. densiflora and P. sylvestris are fully resolved with maximum support from the Bayesian analysis and mostly high bootstrap support from the maximum likelihood analysis. In the combined analysis, only the clade comprising P. mugo, P. densiflora and P. sylvestris receives statistically significant support values. This is because the main phylogenetic signal grouping those species comes from ycf1, while matK brings a conflicting signal.

Discussion
This study confirms the usefulness of the ycf1 marker as a diagnostic marker in pines. Although it has been suggested that ycf1 does not correctly reflect phylogenetic relationships in plants (Parks et al., 2009), its use for pine phylogenetic analyses resulted in the expected taxonomic grouping in the present study. However, the hypervariable region of this marker could cause problems in homology assessment when it is used on a broader taxonomic scale. The marker matK should be used in pines with caution because, as shown in the present study, its phylogenetic signal does not reflect species relationships correctly in pines. In spite of this result, matK could be useful as a barcode marker with an intermediate level of variation in combination with other markers for species delineation (Bruni et al., 2012; see also Celinski et al., 2017). The present study is the first to report phylogenetic incongruences in pines between the chloroplast markers matK and ycf1. These incongruences were not detected in earlier studies because of the use of a shorter matK region, resulting in a poorly resolved gene tree (e.g. Hernández-León et al., 2013). Previous studies have shown that pine phylogenies based on chloroplast markers may be incongruent with phylogenies based on nuclear markers, as well as with morphological and geographical classifications (e.g. Liston et al., 2003; Syring et al., 2005; Willyard et al., 2009; Gernandt et al., 2018).
One of the disadvantages of using chloroplast markers is chloroplast capture, defined as the movement of a chloroplast genome from one species to another through the process of introgression (Soltis & Soltis, 1998). This phenomenon has negative consequences on both phylogenetic inference and systematic efforts (Tsitrone et al., 2003), and it has been suggested to occur in pines (Gernandt et al., 2005; Liston et al., 2007; Gernandt et al., 2018). Furthermore, different parts of the chloroplast have different phylogenetic topologies (Zeng et al., 2014). To circumvent these limitations, a few initiatives have focused on developing new nuclear markers for pines (Syring et al., 2005; Palme et al., 2009; Grivet et al., 2013; Gernandt et al., 2018), but their wide use is limited by the availability of multispecies sequence data from public databases.
Other factors may impede the reconstruction of pine phylogenies, such as reticulate evolution due to hybridization. Gernandt et al. (2018) suggested that hybridization occurred in the Oocarpae ancestors, explaining the difficulty of placing them taxonomically. The Oocarpae group appears polyphyletic in our analyses. Hybridization could also explain other aberrant phylogenetic groupings observed in this study in the analysis based on matK. While chloroplast markers may not succeed in discriminating species in a group of plants in which reticulate evolution is present, they might prove useful for discerning hybridization processes in interspecific hybrids by the presence or absence of selected chloroplast markers. The usefulness of the matK marker to identify hybrids remains to be investigated.
For all land plants, the establishment of a single DNA region as a universal barcode is not a realistic goal, but accurate species delineation may be achieved by combining several loci used as barcodes (Kress, 2017). However, the rate of successfully identified gymnosperm species using different combinations of the seven main candidate plastid regions for barcoding (rpoC1, rpoB, rbcL, matK, trnH-psbA, atpF-atpH, psbK-psbI) is low (Hollingsworth et al., 2009b; Ran et al., 2010). Species delineation with existing chloroplast markers in closely related conifer species is particularly problematic (Ortiz-Martínez & Gernandt, 2016; Celinski et al., 2017). In spite of the challenges of barcoding species in the genus Pinus, the present study shows that the marker ycf1 is promising for species-level delineation. Consequently, this marker could be used to solve specific problems, such as the differentiation of the closely related Pinus nigra, Pinus mugo and Pinus sylvestris, which are difficult to identify based solely on wood traits (Schoch et al., 2004).
Due to the importance of species-level identification in pines, it will be useful to further develop barcodes for specific sections and to assess how to successfully combine species-level markers with population- or clonal-level markers. There is indeed a huge interest in forestry in identifying forest material at the intraspecific level with genetic markers, more specifically to avoid fraudulent marketing of forest reproductive material (Nanson, 2001; Degen et al., 2010). There already exist some examples of studies in which material of specific origins at the infraspecific level has been identified (Aragonés et al., 1997; Ribeiro et al., 2002; Deguilloux et al., 2004; Tigabu et al., 2005; Fidler et al., 2006; Hernandez-Tecles et al., 2017). Therefore, a remaining challenge is to combine multilevel diagnostic markers that could respond to the many challenges facing forest product traceability.

Figure 1. Depiction of the genetic regions matK and ycf1 included in this study. The grey color in matK stands for a region used as a barcoding marker and in ycf1 for a hypervariable region. Regions are scaled by length in base pairs (bp).

Figure 4. Phylogenetic tree based on the matK marker. The tree represents the majority consensus of trees sampled after stationarity in the Bayesian analysis. Posterior probability values from the Bayesian inference are indicated above and the corresponding bootstrap values of the parsimony analysis are shown below where applicable. The labels indicating taxonomic divisions into subsections following Gernandt et al. (2005) are shown. The taxa in red colour had different positions than in the analysis based on ycf1.

Figure 5. Phylogenetic tree based on the ycf1 marker. The tree represents the majority consensus of trees sampled after stationarity in the Bayesian analysis. Posterior probability values from the Bayesian inference are indicated above and the corresponding bootstrap values of the parsimony analysis are shown below where applicable. The labels indicating taxonomic divisions following Gernandt et al. (2005) are shown. The taxa in red colour had different positions than in the analysis based on matK.

Table 2. Alignment statistics. Number of base pairs (bp), number of variable sites (VS), percentage of variable sites (VS %), number of parsimony informative sites (PIS) and percentage of parsimony informative sites (PIS %) are shown.