Population Structure of Nation-Wide Rice in Thailand

Background Thailand is a country with large diversity in rice varieties due to its rich and diverse ecology. In this paper, 300 rice accessions from all across Thailand were sequenced to identify SNP variants allowing for the population structure to be explored. Results The result of inferred population structure from admixture and clustering analysis illustrated strong evidence of substructure in each geographical region. The results of phylogenetic tree, PCA analysis, and machine learning on population identifying SNPs also supported the inferred population structure. Conclusion The population structure inferred in this study contains five subpopulations that tend to group individuals based on location. So, each subpopulation has unique genetic patterns, agronomic traits, as well as different environmental conditions. This study can serve as a reference point of the nation-wide population structure for supporting breeders and researchers who are interested in Thai rice. Supplementary Information The online version contains supplementary material available at 10.1186/s12284-021-00528-2.


Background
Rice (Oryza sativa) has been the main carbohydrate source in Thailand for more than 4000 years (Weber et al. 2010), and Thailand has been a major rice exporter since 1851 (Siamwalla 1975). Accelerated cultivar selection for specific environments is important for rice breeding programs. The long time period of rice domestication has yielded many rice cultivars with wide variation in physical traits, such as size, flowering time, grain quality, and yield, to name a few.
Thailand has large diversity in ecological systems (Chakhonkaen et al. 2012). In the north, most of the area is covered by mountains and tropical rain forests, while central Thailand consists of plains and fields that are prone to flood. In the north-eastern part, plateaus are the main type of area. In the south are tropical coastal regions and tropical islands. See Fig. 1 for more details. According to Köppen climate classification (Köppen 1884), the south of Thailand is in the Tropical monsoon climate zone (Am), while the rest of the country is in the Tropical savanna climate zone (Aw/As).
Due to the diverse ecology in Thailand, rice varieties need to be adapted to their intended growth region and there is some degree of association between genetic variation and geographical origin of Thai rice (Pusadee et al. 2019). Moreover, there is a higher level of diversity in Thai rice accessions compared to selected rice accessions obtained from International Rice Research Institute (IRRI) germplasm based on InDel markers (Chakhonkaen et al. 2012). Limited data shows that Upland Thai rice forms a cluster of tropical japonica (Pathaichindachote et al. 2019;Chakhonkaen et al. 2012;Kladmook et al. 2012), while lowland rice forms indica clusters.

Open Access
*Correspondence: chainarong.amo@nectec.or.th 5 National Electronics and Computer Technology Center (NECTEC), 112 Phahonyothin Road, Khlong Nueng, Khlong Luang District, 12120 Pathum Thani, Thailand Full list of author information is available at the end of the article Understanding population structure and genetic diversity is an important step before Genome-wide association studies (GWAS) (Reig-Valiente et al. 2016), which paves the way for studies of traits and functional gene investigation. Studies in population structure and genetic diversity of Thai rice have been conducted using different sets of rice varieties and molecular markers. Comparison of genetic diversity among 43 Thai rice and 57 IRRI rice varieties was investigated, using single-stranded conformation polymorphism (SSCP) InDel markers (Chakhonkaen et al. 2012). Additionally, 12 simple sequence repeat (SSR) markers were used to examine ongoing gene flow among three categories of rice variety in Thailand, including 42 wild rice varieties, 12 weedy rice varieties, and 37 cultivated rice varieties (Pusadee et al. 2013). Recently, with a greater number of rice germplasm accessibility, 144 Thai and 23 exotic rice varieties were included to evaluate genetic diversity using SSR markers (Pathaichindachote et al. 2019). Another study assessed the population gene pool of 15 Thai elite rice cultivars using InDel markers (Moonsap et al. 2019). It is worth noting that there are some limitations regarding access to a high number of accessions for each region of Thailand and the application of SNP markers to explore variation among Thai rice germplasms in these previous works.
To fill gaps in the literature, our study mainly focused on the population structure of 300 rice accessions, 277 of which are grown in diverse ecological systems in Thailand and 23 obtained from IRRI germplasm collection. We use SNP markers derived from the Genotyping-by-Sequencing (GBS) method to infer subpopulations. These accessions are a good representation of the nation-wide rice population structure. Fig. 1 The environment of Thailand and the ratios of subpopulations in each area. The environment details are in the aspects of landforms, average temperature, amount of rain, and humidity in 2018 separated by regions (National statistical office of Thailand (NSO and T.N.S.O. 2020)). Each pie chart represents the ratio of each subpopulation members that have their known origin belong to the particular area. Note that there are no accessions for the east since it is not a rice cultivation area

Population Structure
After clustering the 300 accessions, five subpopulations were found in the dataset. These five inferred populations generally group according to geological areas of rice accession cultivation. Table 1 shows the origins of 300 accessions where the clusters of IRRI accessions were labeled according to the work in Zhao et al. (2011). POP1 has a majority of indica accessions from Central Central Thailand. POP3 has a majority of indica accessions from Northeastern Thailand. POP2 represents rice accessions from both Northeastern and Central Thailand, suggesting it is an admixed population of the two. POP4 represents accessions from Southern Thailand. And lastly, POP5 represents japonica accessions from Northern Thailand. There are many accessions of indica from IRRI in POP1, which is consistent with POP1 being indica. The majority of japonica accessions from IRRI are in POP5, which includes the Thai japonica accessions. Additionally, a Chi-Square Test of Independence excludes the possibility that the origins and subpopulations in Table 1 are independent (36 dof, p value < 0.01). Hence, areas of origin and suppopulation in Table 1 are associated with each other.
A principal component analysis showed that PC1 separated the japonica population accessions (POP5) from the rest of the accessions, while PC2 separated the southern population accessions (POP4) from the central and northern accessions of indica (Fig. 2). Lastly, PC3 separated the central indica accessions (POP1) from the northern indica accessions (POP3), with the accessions identified as admixed (POP2) joining the two, showing that the geographical separation is reflected in the genotypes of each accession. A phylogenetic tree was constructed and showed that the japonica population (POP5) was separated from the indica populations (Fig. 2D). Admixed accessions (POP2) were distributed among central (POP1) and northern (POP3) branches, supporting that POP2 is an admixed group of POP1 and POP3, while POP1, POP3, and POP4 were clearly separated from each other. Admixture analysis showed that POP1, POP3, POP4, and POP5 were grouped into different ancestors (different colors). POP2, however, had mixed ratios of ancestor A and B, which were the ancestors of POP1 and POP3. This confirms that POP2 is an admixed population of POP1 and POP3. POP1, POP3, POP4, and POP5 have high bootstrap support around 0.9, while POP2 has average support at 0.69 (Table 2). This is consistent with POP2 representing an admixed population of POP1 and POP3.
The genetic distance of each population was estimated using F ST between admixture ancestry populations, which is a widely-used measure of genetic variation among populations (Holsinger and Weir 2009). The F ST (Table 3) shows that ancestor D, which was the ancestor of the japonica population (POP5), was the most distantly related.
The majority of accessions that formed POP4 were landraces from southern Thailand. These landraces were considered likely to be mostly indica, but there was no empirical evidence to support this. The F ST values suggest that the ancestry of POP4 (C) was closer to ancestors A and B, which are indica, than to ancestor D, which is japonica. In addition, two indica accessions from the central region belong to the same cluster as the landraces. The members of POP4 cluster in PCA plots are the indica accessions rather than the japonica accessions (Fig. 2). This shows that the landraces from southern Thailand are primarily of indica descent. The 300 accessions were compared against 30 accessions of Thai rice selected from the 3,000 rice genomes project (3K RGP dataset) (Li et al. 2014) that have areas of origin in Thailand using PCA (Fig. 3). According to the result, indica accessions from the 3K RGP dataset are in POP1, POP2, POP3, and POP4, while japonica accessions from 3K RGP dataset are in POP5. These 3K accessions are consistent with the population groupings. An indica-japonica admixed variety from the 3K RGP dataset is placed between the area of japonica and indica in the PCA (Fig. 3). Additionally, many accessions from the Southeast Asian Indica (IND3) are grouped with POP4 (see Additional file 1: Table S7 for details regarding types of clusters in 3K RGP dataset).

Agronomic Traits of Subpopulations
There are three agronomic traits that were measured for all accessions: days to flowering, grain length, and plant height.
The broad-sense heritability estimates ( h 2 B ) were 97.8% for plant height, 98.6% for grain length, and 93.58% for flowering time.    The reason that rice in the central and northeast regions have different flowering times, even though the two areas have a similar latitude, is primarily because of distinct environment conditions in these regions to support multiple growing seasons per year. In central plain of Thailand, an irrigation system is well-managed and feasible for off-season rice cultivation, so farmers choose to grow short-duration rice varieties which can be harvested faster. Hence, there are more than one growing seasons per year in the central area. In contrast, the north-east has less rainfall and less access to water sources compared to the central region. Northeastern farmers tend to cultivate rice in one growing season per year and select for drought-tolerate varieties. This suggested different selection pressures for a days to flowering trait observed in this study.
For the grain-length trait, POP1, POP2, and POP3 have similar grain length, while POP4 has a significantly shorter grain length, and POP5 has high variation of grain length. This indicates that japonica (POP5) cannot be distinguished from indica (POP1-POP4) by using the grain-length trait (Fig. 4B, E).
For the plant-height trait, ordering by ascending heights, the order is POP5, POP1, POP2, and POP3/ POP4. POP3 and POP4 have no significant difference in the height trait (Fig. 4C, F).
In the aspect of the association between phenotypes and known origins of accessions, with the Mann-Whitney test at α = 0.001 , the results were as follows. The accessions from the south had significantly longer flowering time and significantly shorter grain lengths than the rest. The accessions from the north had significantly shorter flowering time than the rest. The accessions from the north-east had significantly longer plant height than the north. The accessions from the central area had significantly shorter plant height than the south. Hence, accessions can be separated roughly by these three traits, which implies that there are associations between traits and areas of origin of accessions. The potential cause of the difference in phenotypes might be the difference in landform and selection for crop use.

Unique SNPs of Subpopulations
A QTL analysis was used to identify SNPs with large variation in allele frequency between populations and 50-100 of the SNPs with the greatest allele frequency difference between populations were selected to train a random forest model to identify which population any given accession is from based on genotype. A total of 268 SNPs were selected (Additional file 1: Table S1).
Only POP5 had population specific SNPs that allowed for accurate population identification, this was not surprising as this population is japonica and the other populations are all indica (Table 4). The indica populations had too much allele sharing to allow for each accession to be accurately assigned to their population. The admixed population had the lowest rate of correct population assignment, while the other populations were all in the 80-90% range (Table 4).
While a QTL analysis to identify population specific SNPs might be unconventional, it is well known that population stratification can result in false positives. In this particular case the populations in question are not discrete populations, but rather groupings of accessions that tend to correlate with location and have genetic mixing between groups.
The majority of SNPs most predictive for POP1 occurred on chromosome 1 in an interval between 21.6 and 22.5 Mb and an interval on chromosome 3 between 8.4 and 8.8 Mb. The majority of SNPs most predictive for POP2 occurred on chromosome 3 between 31 and 31.5 Mb with some small intervals on chromosomes 5, 6 and 7. There were 5 intervals of predictive SNPs for POP3 and several small intervals. Chromosome 3 had a interval from 27.59 to 27.65 Mb, chromosome 5 had an interval from 18.71 to 18.78 Mb, chromosome 6 had two intervals from 7.61 to 7.68 Mb and 11.02 to 11.06 Mb, chromosome 10 had an interval from 14.74 to 14.8 Mb. POP4 had the most distinctive allele frequencies with SNP intervals on chromosome 1 at 21.07 to 21.11 Mb, chromosome 2 at 5.32 to 5.35 Mb and 16.41 to 16.45 Mb, chromosome 5 at 23.71 to 23.84 Mb, and chromosome 11 at 2.7 to 2.8 Mb and 23.36 to 23.42. Of the 268 SNPs, there were 110 SNPs located in 75 genes, although the majority of these are predicted genes with no known function (Additional file 1: Table S2). There were 259 genes within the upstream and downstream intervals of the 268 predictive SNPs and most were predicted genes of unknown function (Additional file 1: Table S3).

Discussion
According to the work in Chakhonkaen et al. (2012), upland Thai rice grouped into a japonica cluster, while rice from other regions formed an indica cluster, which is consistent with the population structure found in this work. Additionally, PCA analysis of rice accessions in this work compared against accessions from the 3K rice genome project confirmed that POP5 is japonica, while the rest of the subpopulations are indica.
All of the accessions of rice in this study possess unique traits that make them suited to their growing environment and type of farming. The types of environmental conditions range from the tropical monsoon climate in the south to tropical savanna in central Thailand and mountainous regions in northern Thailand. Grouping the accessions on genetic similarity tended to group accessions according to these environmental differences, which suggests that accessions in similar environments share the genetic variance that makes them suited to those environments. The inferred subpopulation in the north is a japonica cluster (POP5). The other four inferred subpopulations are indica clusters in the central area (POP1), north-east (POP3), south (POP4), and the admixture of POP1 and POP3 (POP2). All inferred subpopulations were different and could be separated fairly well using 268 selected SNPs using Random forest classifier, with the exception of the admixed cluster (POP2). This implies that the inferred subpopulations were reasonably robust.
An interesting finding was that the most predictive SNPs for each subpopulation occurred within a few small intervals, rather than randomly spread throughout the genome, which may suggest a selection pressure, perhaps selecting for a trait that makes the accession better in the area it is grown. However, the subpopulation groupings are broad, each covering a quite diverse range of environments, and the allele frequencies between subpopulations have a large amount of overlap, so many of these regions could be due to chance rather than function.
Although the majority of genes within or nearby the SNP intervals have an unknown function, some interesting genes are functionally annotated, for example, Os03g0262000, Os05g0203800, Os06g0677800, and Os09g0433650. The gene Os03g0262000, is a homolog of AtPIP5K1 that is induced by water stress and abscisic acid in A. thaliana (Mikami et al. 1998). Os05g0203800 (OSMADS58) is identified as a rice C-class MADS box gene which plays a crucial role for flower development (Yamaguchi et al. 2006;Yun et al. 2013;Dreni et al. 2011;Chen et al. 2015;Li et al. 2011). Os06g0677800 (OsARF17) encodes a rice auxin response factor (ARF) involved in plant defense against several different types of plant virus (Zhang et al. 2020), and functions in leaf inclination regulation (Chen et al. 2018) and tiller angle modulation . Os09g0433650 is located on chromosome 9 and associated with rice grain shape ). The roles of these candidate genes identify a potential relationship between predictive SNP markers and differences in agronomic traits found in the inferred subpopulations which could be further investigated.

Conclusion
Thailand is a country with large diversity in rice varieties due to its rich and diverse environment. In this paper, 300 rice accessions (277 rice accessions from all across Thailand and 23 IRRI rice accessions) were sequenced to identify SNP variants allowing for the population-structure to be explored. The inferred population structure from admixture and clustering analysis illustrated strong evidence of substructure for each geographical region. The results of phylogenetic tree, PCA analysis, and machine learning on SNPs selected by QTL analysis also supported the inferred population structure. Moreover, by using only 268 SNPs, a random forest classifier was able to classify individuals for four out of the five subpopulations with reasonably high accuracy, the admixture population was the exception. This shows that these subpopulations are unique enough to be distinguished by a small number of SNPs. A unique ecological system where rice is grown might play a key role in this uniqueness. The 268 SNPs may be used as markers of these subpopulations for future studies. This study can serve as a reference point of the nation-wide population structure for supporting breeders and researchers who are interested in Thai rice. Finally, the dataset of 300 rice accessions is available at PRJNA753279-Thai Rice Genotyping Project.

Plant Material
The panel used in this study is composed of 300 Thai rice accessions representing diversity in phenotype, agro-ecosystem, and geographic origin: northern, northeastern, southern, and central region of Thailand. Detailed information regarding the accessions is reported in Additional file 1: Table S4.

Plant Cultivation
The study was carried out in the wet season of 2018 at Ubon Ratchathani Rice Research Center (URRC) of Ubonratchatani province,Thailand ( 15 • 19 ′ 55.2 ′′ N, 104 • 41 ′ 27.9 ′′ E). Seeds of the 300 rice accessions were germinated in a wet seedling bed on 16th June 2018. The seedlings were transplanted in a puddled field at 30 days after sowing (DAS) in 80 × 380 cm plots (5 rows × 20 plants). Fertilizers were applied as follows: 50 kg/ha N, 50 kg/ha P 2 O 5 , 25 kg/ha K 2 O at 10 days after transplanting; and top-dress with 10 kg/ha N at 30 days after transplanting. The experimental field was managed according to normal agricultural practices regarding crop protection and paddy water management. The mean air temperature ranged from 24.5 to 31.7 • C. The highest and lowest relative humidity recorded during the experiment was 93.6 to 65.7%. No extremely high temperature or extremely low relative humidity was recorded, therefore heat stress was not a cause that affected growing and/ or fertility conditions. Flowering time (days to flowering after sowing, DTF) was recorded when 50% of the plants in each plot had flowered. At maturity, the five plants in the middle position of each plot were selected for assessment of plant height and grain length. The broad-sense heritability ( h 2 B ) was calculated as h 2 B = σ 2 G /(σ 2 G + σ 2 e /r) , where σ 2 G represents genetic variance, σ 2 e represents residual variance and r is the number of replicates.

Genotyping by Sequencing and Variance Calling
The genotypic sequences were generated from Ion S5 TM XL Sequencer (Thermo Fisher Scientific). The data were obtained as BAM files. The ApeKI enzyme was used for genomic DNA digestion to prepare the DNA libraries for each accession. E-Gel TM SizeSelect TM agarose gels (Invitrogen) were used to select DNA fragments for 250-300 bp. Fastq files were created from BAM files using Samtools v1.9 (Li et al. 2009). Then, reads were mapped to the japonica reference genome using Burrow-wheeler aligner (BWA) v0.7.17 (Li 2013) and SAMtools. Variants were called using using GATK v4. 1.4.1 (McKenna et al. 2010).

Population Structure Analysis Numerical Genotype Function
Genotype was converted into a numerical value, such that homozygous reference allele was 1.0, homozygous alternate allele was 0.0, and heterozygous was 0.5 using TASSEL (Bradbury et al. 2007). The SNPs were filtered to have a minimum allele frequency of 0.05 and a minimum call rate of 70% per SNP. The SNP number reduced from 3,366,491 to 117,054 sites after filtering.

Admixture Analysis
Numerical genotypes were used to create .ped, .map and .bed files for ADMIXTURE (Alexander et al. 2009) analysis to estimate ancestry ratios of all individual accessions. The optimal number of ancestors was found to be four by the Elbow method (see Additional file 2 for the result of the Elbow method). The F ST values where also calculated by ADMIXTURE (Alexander et al. 2009).

Clustering Analysis
The ancestry-ratio vectors of each SNP were used for data clustering. The individual assignments of clustering were inferred by applying a k-means clustering approach (Forgy 1965) in the R software package (R Development Core Team 2011). The Elbow method was applied to infer the optimal number of clusters based on Betweencluster and Total Sum-of-Square (BCTSS) Ratio. The BCTSS ratio represents a ratio of difference of distance from individuals to their cluster centroid between current clustering assignment compared to single cluster assignment. The optimal number k * of clustering assignment should reduce BCTSS ratio significantly compared against k * − 1 and k * + 1 cases (see Additional file 2 for the result of the Elbow method).
A 10,000 iteration bootstrap approach (Efron 1992) was deployed to estimate the support of clustering assignment of each population. The clustering assignment that maximized BCTSS ratio with the optimal k along with the support of assignment from bootstrap was used to represent the subgroups of the population.

Principal Components Analysis
PCs were generated from numeric genotype data using TASSEL (Bradbury et al. 2007).

Phylogenetic Tree Construction
A phylogenetic tree was generated by Neighbor-Joining method (Saitou and Nei 1987) using the numerical genotype data in TASSEL (Bradbury et al. 2007).

Domination Graphs Inference
Domination graphs, which represent relationships between pairs of populations for three phenotypes, were inferred using EDOIF package (Amornbunchornvej et al. 2020). For each phenotype, nodes of the domination graph are subpopulations while there is an edge from a population with a significantly higher phenotype value to a population with a lower phenotype value. The Mann Whitney test was deployed to infer edges of a domination graph with α = 0.001.

Population Specific SNPs
We investigated the potential of identifying SNPs that were specific to each population identified by the admixture analysis. These groupings can include a large number of accessions and the accessions have varying levels of relatedness, which means varying levels of SNP sharing occur within and between populations, so a large number of SNPs would be required to discriminate between populations. The variants were filtered to select for bi-allelic SNPs where all accessions were homozygous and a series of Quantitative trait locus (QTL) analyses were performed to identify the most discriminatory SNPs. The phenotype for each QTL analysis was set as a binary trait of 'same population' or 'other populations' using the population groupings identified by the admixture analysis. A separate QTL analysis was performed for each population and the SNPs with the highest LOD score and largest allele frequency difference were taken as being the most predictive for that population. These SNPs were then used to train a random forest model (Breiman 2001) using the R randomForest package (Liaw and Wiener 2002) and the R caret package (Kuhn 2020). Gene information from the GFF was overlaid on the SNP data to identify any population discriminatory SNP that was within a gene. In addition, genes within intervals of closely spaced predictive SNPs were also investigated.

Population Classification
We deployed machine learning data classification to investigate whether the set of population specific SNPs we selected can be used to discriminate between the five populations. We used 10-fold cross validation (Allen 1974), which is a technique in machine learning to measure the performance of prediction from a set of classifiers. We used a random forest model (Breiman 2001) as the main classifier in the analysis training on the 268 selected SNPs to classify the five populations of 300 rice accessions. A true positive (TP) is when the predicted population was the same as the ADMIX-TURE derived population. The false positive (FP) count is the incorrect inclusion of an accession into a subpopulation and the false negative (FN) count is the incorrect exclusion of an accession out of a subpopulation, calculated per subpopulation. The precision is the ratio of the number of TP cases to the sum of TP and FP cases. The recall is the ratio of the number of TP cases to the sum of TP and FN cases. The F1 score is calculated from precision and recall as follows.