Resequencing of 672 Native Rice Accessions to Explore Genetic Diversity and Trait Associations in Vietnam

Background Vietnam possesses a vast diversity of rice landraces due to its geographical situation, latitudinal range, and a variety of ecosystems. This genetic diversity constitutes a highly valuable resource at a time when the highest rice production areas in the low-lying Mekong and Red River Deltas are enduring increasing threats from climate changes, particularly in rainfall and temperature patterns. Results We analysed 672 Vietnamese rice genomes, 616 newly sequenced, that encompass the range of rice varieties grown in the diverse ecosystems found throughout Vietnam. We described four Japonica and five Indica subpopulations within Vietnam likely adapted to the region of origin. We compared the population structure and genetic diversity of these Vietnamese rice genomes to the 3000 genomes of Asian cultivated rice. The named Indica-5 (I5) subpopulation was expanded in Vietnam and contained lowland Indica accessions, which had very low shared ancestry with accessions from any other subpopulation and were previously overlooked as admixtures. We scored phenotypic measurements for nineteen traits and identified 453 unique genotype-phenotype significant associations comprising twenty-one QTLs (quantitative trait loci). The strongest associations were observed for grain size traits, while weaker associations were observed for a range of characteristics, including panicle length, heading date and leaf width. Conclusions We showed how the rice diversity within Vietnam relates to the wider Asian rice diversity by using a number of approaches to provide a clear picture of the novel diversity present within Vietnam, mainly around the Indica-5 subpopulation. Our results highlight differences in genome composition and trait associations among traditional Vietnamese rice accessions, which are likely the product of adaption to multiple environmental conditions and regional preferences in a very diverse country. Our results highlighted traits and their associated genomic regions that are a potential source of novel loci and alleles to breed a new generation of low input sustainable and climate resilient rice. Supplementary Information The online version contains supplementary material available at 10.1186/s12284-021-00481-0.


Background
Rice production in Vietnam is of great value for export and providing daily food for more than 96 million people. However, agricultural production, especially rice cultivation, is inherently vulnerable to climate variability across all regions in Vietnam. Based on the records of monthly precipitation and temperature from 1975 to 2014 (Nguyen et  Vietnam possesses a vast diversity of native and traditional rice varieties due to its geographical situation, latitudinal range and diversity of ecosystems (Fukuoka et al. 2003). This diversity constitutes a largely untapped and highly valuable genetic resource for local and international breeding programs. Vietnamese landraces are disappearing as farmers switch to modern elite varieties. To limit this erosion of genetic resources, several rounds of collection of landraces, particularly from the northern upland areas, have been undertaken since 1987. Thousands of rice accessions have been deposited in the Vietnamese National Genebank at the Plant Resources Center (PRC, Hanoi, Vietnam), together with passport information detailing their traditional name and province of origin. One hundred and eighty-two traditional Vietnamese accessions were selected for a genotype by sequencing (GBS) study in 2014 (Phung et al. 2014). This study yielded 25,971 single nucleotide polymorphisms (SNPs) that were used to describe four Japonica and six Indica subpopulations. These subpopulations were classi ed by region, ecosystem and grain-type using passport information (province and ecosystem) and phenotyping. This dataset had subsequently been used for genome-wide phenotype-genotype association studies (GWAS) relating to root development (Phung et al. 2016), panicle architecture (Ta et al. 2018), drought tolerance (Hoang et al. 2019b), leaf development (Hoang et al. 2019a) and Jasmonate regulation ).
An international effort to re-sequence Asian rice accessions known as the "3000 Rice Genomes Project" (3K RGP) has provided the rice community with a better understanding of Asian rice diversity and evolutionary history, as well as providing valuable knowledge to enable more e cient use of these accessions for rice improvement ; Wing et al. 2018). However, only 56 of these accessions originated from Vietnam, suggesting that the rice diversity within this country may not be fully captured within the 3K RGP. While the original 3K RGP analysis described nine subpopulations , subsequent reanalysis had shown that the 3K RGP could be further subdivided into fteen subpopulations (Zhou et al. 2020).
In this paper, we newly sequenced 616 Vietnamese rice accessions using whole-genome sequencing (WGS), most of them being native landraces. 164 of these rice accessions were in common with a previous study [8] based on a genotyping-by-sequencing (GBS) approach. We supplemented this dataset with all 56 Vietnamese genotypes from the 3K RGP to form a native diversity panel with 672 accessions. We analysed this diversity panel of 672 accessions to explore how breeding and environmental pressures have shaped the rice genome in Vietnamese accessions. We also carried out a comprehensive analysis of the population structure of the 3,635 rice genomes obtained from joining our diversity panel and the complete 3K RGP datasets. We completed a GWAS on the diversity panel with 672 accessions (and separately for the Japonica and Indica subtypes within it) on thirteen phenotypes, which are available for around two-thirds of the samples. Our results highlight genomic differences and trait associations in traditional Vietnamese landraces, which are likely the product of adaption to multiple environmental conditions and regional culinary preferences in a very diverse country.

Sequencing rice diversity from Vietnam
Whole-genome sequencing was carried out on 616 rice accessions. 511 of the accessions were obtained from the PRC (Plant Resource Centre, Hanoi, Vietnam, http://csdl.prc.org.vn), together with their passport data, which shows that they were collected from all eight administrative regions of Vietnam (Table S1). The remaining samples were obtained from AGI's collection (Agricultural Genomics Institute, Hanoi, Vietnam). Three reference accessions (Nipponbare, a temperate Japonica; Azucena, a tropical Japonica; and two accessions of IR64, an Indica) obtained from the PRC, were included in the dataset. A total of 1,174 Giga base-pairs (Gbps) of data was generated for the 616 samples representing an average sequencing depth of 30x for 36 "high coverage" samples and 3x for 580 "low coverage" samples (Table S1). These 616 newly-sequenced accessions were classi ed into 379 Indica and 202 Japonica subtypes, with the remaining 35 (including the Aus and Basmati varieties) being classi ed as admixed, based on the STRUCTURE (Pritchard et al. 2000) output for K=2 using a subset of 163,393 SNPs.
Population structure of rice within Vietnam The population structure of rice within Vietnam was analysed using the diversity panel of 672 samples, comprising 616 newly sequenced accessions and 56 Vietnamese genotypes from the 3K RGP. We assigned the 672 samples to four Japonica subpopulations and ve Indica subpopulations (Table S1) using (i) the population structure information obtained from the STRUCTURE analysis ( Fig. 1), (ii) the previous characterisation of a panel of Vietnamese native rice varieties using GBS (Phung et al. 2014), and (iii) the assessment of the optimal number of subpopulations (Fig. S1) using the method described in Evanno et al. (2005). Subpopulations were named as in Phung et al. (2014), except that we considered the I6 subpopulation to be part of the I3 subpopulation. Although the previous study used a limited number of GBS markers, 129 of the 164 common samples were assigned to the same subpopulations in both studies. Most differences were due to samples being classi ed as admixed in either one of the studies. We classi ed 48 (11%) of the Indica (Im), and eight (4%) of the Japonica samples (Jm) as admixed. The reference varieties Nipponbare (Temperate Japonica), Azucena (Tropical Japonica), and IR64 (Indica) were classi ed as J4, J1 and I1, respectively.
Each Indica subpopulation contained shared ancestry (admixed components) with other Indica subpopulation (Fig. 1a). The admixed components are shown in detail for the 43 samples in the I5 subpopulation ( Fig. 1c) namely 38 samples from our dataset and the following ve samples from the 3K RGP; IRIS 313-11384 (IRGC 127275), B184 (IRGC 135862), IRIS 313-11383 (IRGC 127274), IRIS 313-10751 (IRGC 127577) and IRIS 313-11893 (IRGC 127519). The Japonica subtropical J1 subpopulation shared ancestry (between 0 and 25% of the genome) with the Japonica tropical J3 subpopulation, whereas the two temperate subpopulations, J2 and J4 shared ancestry dominantly with each other. The tropical J3 subpopulation contained four samples with around 20% of the haplotypes in common with the temperate J4 subpopulation. Using the passport information available from the PRC, the proportion of each subpopulation originating from each of the "administrative regions" of Vietnam is shown in Fig. 1d. Only the I1 and I2 Indica subpopulations were collected from the Mekong River Delta regions, I2 being almost exclusively grown there whereas I1 was more widespread than I2. The I4 and J4 subpopulations were mainly collected from the Red River Delta areas. The J1 and J3 subpopulations were closely related; the J1 subpopulation was predominantly from the North of Vietnam whereas the J3 subpopulation was concentrated around the South-Central Coast region. Small variations in the percentage of reads mapping were observed for each of the subpopulations (Fig. S2).
A Principal Component Analysis ( Fig. 2a and 2b) showed the relationship between these nine Vietnamese subpopulations. Concerning the Vietnamese genotypes from the 3K RGP dataset included in the diversity panel, the Indica I1 subpopulation included two XI-1B modern varieties and eight admixed (XIadm) accessions. I2 included fourteen XI-3B1 genotypes, which comprises Southeast Asian accessions, and similarly, I3 and I4 included one and ten XI-3B2 genotypes, respectively. Finally, I5 included ve XI-adm accessions and clustered distinctly away from all the other subpopulations (Fig. 2a). On the other hand, J1 included the two subtropical (GJ-sbtrp) accessions from the Vietnamese 3K RGP genotypes, and J3 included one tropical (GJ-trp1) accession from the Vietnamese 3K RGP genotypes (Fig. 2b). These results correlate well with the latitudinal distinction between these subpopulations. J2 and J4 included two and one temperate (GJ-tmp) accessions, respectively; and split into two clear subpopulations in Vietnam compared with the East Asian temperate subpopulation described by the 3K RGP.
Population structure of the combined 3,635 Asian cultivated rice genomes 612 of the 616 newly sequenced accessions from this study and the 3,023 accessions from the 3K RGP were combined and classi ed into 9 and 15 subpopulations (Table S2), and compared with the subpopulations from the 3K RGP analysis Zhou et al. 2020). For clarity, we used the pre x Jap-and Ind-to label these subpopulations from our analysis.
When the combined dataset of 3,635 samples was classi ed into nine subpopulations (Fig. S3a), we found that 95% of the 3K RGP accessions (2,882 out of 3,023) were assigned into the same subpopulations. The remaining 5% lines were either (i) previously classi ed as admixture and our analysis placed into a subpopulation, or (ii) were previously classi ed in a subpopulation and were now classi ed as admixture. The 612 newly sequenced Vietnamese accessions were placed in three Indica clusters (187 accessions), three Japonica clusters (176 accessions), the Basmati and Sadri aromatic cB group (11 accessions), or the Aus cA subpopulation (one accession). In more detail, the three Indica clusters included three Im accessions in the East Asian cluster (Ind-1A), seventy-six I1 accessions in the cluster of modern varieties of diverse origins (Ind-1B), and 108 accessions (I2, I3 and Im) in the Southeast Asian cluster (Ind-3). Whereas, the three Japonica clusters included 54 accessions (J2, J4 and Jm) in the primarily East Asian temperate cluster (Jap-tmp), 119 accessions (J1, J3 and Jm) in the Southeast Asian subtropical cluster subpopulation (Jap-sbtrp) and three J3 accessions in the Southeast Asian Tropical subpopulation (Jap-trp). Any remaining accession with admixture components over 65% either Indica or Japonica were classi ed as Ind-adm (191 accessions) or Jap-adm (27 accessions), respectively. Finally, the remaining accessions were considered as Admix (19 accessions). Notably, all thirty-seven I5 accessions were placed in Ind-adm, and ten of the sixteen J3 accessions were placed in Jap-adm.
When the combined dataset of 3,635 samples was reclassi ed into 15 subpopulations (K15_new, Fig. S3b), we noticed the following differences in the distribution of subpopulation compared to the 3K RGP analysis for the same number of 15 subpopulations (K15_3KRGP); we did not observe the division of the Aus samples into cA-1 and cA-2, and we subdivided the Indica subtypes and Japonica subtypes into eight and ve subpopulations, respectively. A Principle Coordinate (PCO) analysis of the Indica and Japonica subpopulations is shown in Fig. 3, highlighting our new eight Indica and ve Japonica subpopulations (In addition the Vietnamese and 3K RGP subpopulations are shown in Figs. S5 and S6).
The relation between the subpopulations in our comprehensive analysis (3,635 accessions) and the 3K RGP (3,023 accessions) was as follows: (i) The Ind-1A, Ind-1B.1 and Ind-1B.2 were equivalent to XI-1A, XI-1B1 and XI-1B2, respectively. Forty-three of the Vietnamese I1 accessions were in the Ind-1B.1 subpopulation, and the remaining 102 I1 accessions were classi ed as admixed. (ii) The Ind-2 was equivalent to XI-2A and XI-2B, and as expected, this geographically distant South Asian subpopulation was not present in Vietnam. (iii) The previously observed split of the Indica-3 subpopulation into 3A and 3B was also observed in our analysis, where Ind-3.1 was equivalent to XI-3A and did not contain any Vietnamese accessions. (iv) The remaining Ind-3. approximately two-thirds of the samples in our study. For ve of these traits, additional scores were also included from trials by the Vietnamese Plant Resource Centre. In addition, phenotypic data were available for eleven of the traits in 38 of the 56 samples sourced from the 3K-RGP dataset (Table S3, S4).
Finally, the grain length to grain width ratio (GL/GW) was calculated to give a total of 20 traits (Table S5). Scores were available for between 328 and 503 of the 672 samples (Indica subpanel, 170 -297 samples and Japonica subpanel, 134 -178 samples).
There were signi cant differences in measurements between the Indica and Japonica subtypes for ten of the traits; these are detailed in Table S5 and histograms are shown in Fig. 4 for selected phenotypes. The Indica subtypes had signi cantly (p-value <0.0001) higher values for grain length to width ratio, leaf pubescence, culm number, culm length, and oret pubescence. In contrast, the Japonica subtypes had signi cantly higher values for grain width, leaf width, ag leaf angle, panicle length, and oret colour. The Indica I1 subpopulation (mostly elite varieties) was the most phenotypically distinct when compared to the rest of the Indica samples (mostly native landraces). I1 samples had longer grains (p-value = 2.2e-16), earlier heading date (p-value = 9.9e-12), higher culm strength (p-value = 2.2e-16), shorter leaf length (p-value = 2.7e-14) and shorter culm length (p-value < 2.2e-16). Similar values were obtained when comparing I1 to just the I5 subpopulation (Fig. 4). The I5 subpopulation was not phenotypically distinct (p-value < 0.001) from the other landrace subpopulations I2, I3 and I4, except for a signi cantly lower measurement of leaf pubescence (p-value = 0.0007). The Japonica J2 subpopulation had a signi cantly lower grain length to width ratio than J1 (p-value = 1.8e-13) and J3 (p-value = 5.7e-07). A correlation analysis carried out between the 20 phenotypes ( Fig. S9) showed that the highest correlation (r = 0.6) was between leaf length and culm length (excluding the correlation between grain length to width ratio and grain length and grain width). Histogram and correlation plots are available for the 13 traits used for the GWAS analysis in Fig The Japonica subtypes had a lower nucleotide diversity (p = 0.000912) than the Indica subtypes (p = 0.00167). Looking at the individual subpopulations (Table S6), the elite I1 subpopulation is the most diverse (p = 0.00144), and the I5 subpopulation is the least diverse (p = 0.00103). Regions of the genome with low diversity in all Indica subpopulations, and regions with low diversity in speci c subpopulations, were observed when plotting diversity along each chromosome (Fig. S13). The J3 subpopulation is the most diverse of the four Japonica subpopulations. (p = 0.000697). Large genomic regions with very low diversity were observed in chromosomes 2, 3, 4 and 5 in all Japonica subpopulations (Fig. S14).
The full list of phenotypic measurements is available in Table S3. We found 643 signi cant phenotype-genotype associations. These associations were organised into 21 QTLs (Table 1, Table S7). The GWAS Manhattan and Quantile-Quantile plots are available in Fig. S17 and Fig. S18. The QTLs ranged from 41 kb (16_FP) to 3,148 kb (5_GS). The 21 QTLs contained 1,730 genes and covered a total of 11 Mbp over ten chromosomes, and contained 453 SNPs with a signi cant association to a trait in at least one diversity panel (Fig. 5). The list of genes within each QTL is available in Table S8. Functional enrichment was found within 9 of the QTL (Table S9).
Seventeen QTLs were identi ed in the full diversity panel signi cantly associated with eight traits: grain length, grain width, grain length-to-width ratio, leaf width, panicle length, oret pubescence, heading date and internode diameter. A further 4 QTLs associated with grain length and grain width were observed only in the Japonica subpanel. Three of the QTLs, which were found in the full panel, were also observed in the Indica subpanel.
The set of 3.8M SNPs (see methods), representing one SNP every 99 bases, was annotated based on the potential effect of each SNP in protein function using SnpEff (Table S12). 526,138 (4.79%) of the SNPs were in genes. There were 21,639 (0.197%) SNPs in 11,125 genes classi ed as having a putative "High impact" effect (E.g. Exon changes, frameshifts, gene fusions or rearrangements, protein structural changes, etc.). Following additional minimal allele frequency (MAF) ltering, in the Indica dataset (MAF 5%, 2,027,294 SNPs), there were 11,906 "High impact" SNPs in 7,396 genes and the Japonica dataset (MAF 5%, 1,125,716 SNPs), there were 6,240 "High impact" SNPs in 4,439 genes of which 2,818 were present in both Indica and Japonica.
None of the 453 SNPs with a signi cant association was annotated as resulting in protein changes ("High impact" SNPs). However, "High impact" effects were identi ed in other SNPs within the QTL. Among the total 1,730 genes in the 21 QTLs, we annotated 309 genes with "High impact" SNPs in the Indica subpanel, 248 genes with "High impact" SNPs in the Japonica subpanel, including 137 "High impact" SNPs common between the two sets. 129 of the 309 genes and 94 of the 248 genes had functional annotations in PhytoMine (Goodstein et al. 2012), but no functional overrepresentation was found for these sets of genes. In addition, we looked for overlaps with the QTL in ve published Vietnamese studies (Hoang et

Indica and Japonica rice subpopulations within Vietnam
Whole-genome sequencing of 616 Vietnamese rice accessions, predominantly landraces, plus 56 Vietnamese genotypes previously sequenced by the 3K RGP, provides us with a diversity panel to clarify the structure of rice subpopulations in Vietnam. Here, we describe ve Indica subpopulations and four Japonica subpopulations using phenotypic measurements from this study, passport information available from the Vietnamese National Genebank (PRC), and the agronomic and geographical annotations from Phung et al. (Phung et al. 2014). In general terms, our population structure within Vietnam agreed with the previous study, which used a smaller number of markers and 182 samples and is approximately a third of our diversity panel (Phung et al. 2014).
Subpopulation I1 is the most phenotypically distinct of the Indica subpopulations and shows typical phenotypes of 'elite' varieties, such as short height, strong culm strength, long slender grains and a short growth-duration (less than 120 days from sowing to harvest). I1 accessions are grown throughout Vietnam in irrigated ecosystems but predominantly in the Mekong River Delta in the south of the country. Subpopulation I2 is mainly composed of long growth-duration (over 140 days), tall varieties grown in the rainfed lowland and irrigated ecosystems of the Mekong River Delta with a broad diversity of grain shapes. The remaining three Indica subpopulations are intermediate between I1 and I2 for growth-duration, height and culm strength, have a broad diversity of grain shapes, and are not grown in the Mekong River Delta. Subpopulation I3 has the highest proportion of upland varieties but also includes some lowland varieties from the "South Central Coast" region many of which were classi ed as an independent subpopulation (I6) by Phung et al. (Phung et al. 2014). Subpopulation I4 is mainly grown in the rainfed lowland and irrigated ecosystems of the Red River Delta. Subpopulation I5 is grown in a range of ecosystems but concentrated around the North Central Coast and Red River Delta regions, but excluding the Northwest region suggesting that it is the main lowland subpopulation. The J1 and J3 subpopulations are closely related upland varieties and the J2 and J4 subpopulations are closely related lowland varieties.
Subpopulation J1 is mostly composed of medium growth-duration upland varieties from the mountainous regions in the North of Vietnam, with long large grains typical of upland varieties. Subpopulation J2 is grown throughout Vietnam in a range of ecosystems but has consistently short grains. Subpopulation J3 is mainly grown in the "South Central Coast" region and has long large grains. Subpopulation J4 is primarily grown in the Red River Delta region in lowland and mangrove ecosystems and has short grains.
The drought tolerance of these subpopulations can be inferred from the root traits measured by Phung et al. (2016). The J1 and J3 upland subpopulations have deeper and thicker roots than the thinner shallower roots in the J2 and J4 subpopulations, which are grown in irrigated and mangrove ecosystems (Phung et al. 2016). This suggests that the J1 and J3 subpopulations, which are grown mainly in rainfed upland regions, would be more drought tolerant than the others. Similarly, the I3 subpopulation has the deepest and thickest roots. It would, therefore, be more drought tolerant than the I1 and to a lesser extent the I5 subpopulation, which has the thinnest, shallowest root systems. Vietnamese accessions, as well as half of the Southeast Asian native XI-3B2 genotypes from the 3K RGP. The remaining XI-3B2 were classi ed as Indica admix (Ind-adm). However, when only the Vietnamese samples were considered in the analysis, I5 clustered distinctly away from I3 and I4 subpopulations ( Fig. 2A) and included ve accessions from the 3K RGP, which had very low shared ancestry (admixture components) with other 3K RGP samples. Notably, Vietnamese landrace IRIS 313-11384 (IRGC 127275) had no shared ancestry with any other Vietnamese 3K RGP genotypes. Remarkably, a recent study on genomic signals of admixture and alien introgression in a core collection of 948 accessions representative of the earlier Asian Rice Landraces (Santos et al. on speci c traits but used a smaller number of markers and a third of the samples from the Vietnamese dataset. We took a similar approach of carrying out the analysis on both the full panel and the Indica and Japonica subpanels. Showing the QTL for the various traits altogether in Fig. 7 has highlighted some interesting overlaps. Notably, the overlap of QTL for panicle morphology with our QTL for grain size (2_GL and 6_GS). These previous studies found QTL in the full panel and in the Indica subpanel, but not in the Japonica subpanel. However, we found QTL for grain size that were only present in the Japonica subpanel, and all the QTL found in the Indica subpanel were also found in the full panel. These differences probably re ect our larger dataset. Comparing our results with the GWAS results from the 3K RGP (https://snp-seek.irri.org/) ( Subpopulation I5 constitutes an untapped resource of cultivated rice diversity. The analysis restricted to Vietnamese accessions allowed us to observe differences among the accessions within the country. Although 38 accessions (including two genotypes from the same accession in our study) are deposited in the PRC in Hanoi, and the remaining ve accessions are available from the 3K RGP, there is limited information from the passport and phenotypic data to be able to understand the distinctiveness of this subpopulation fully. Further analysis of this subpopulation should encompass 'Indica speci c genes' which may have been overlooked in our study as we used a Japonica reference.
Phung et al. (Phung et al. 2014) described subpopulation I5 as "medium growth-duration accessions from various ecosystems of the North and South Central Coast regions, with rather small and non-glutinous grains". Our I5 accessions are predominantly from the Red River Delta and contiguous coastal departments, the "North Central Coast" and "Northwest" administrative regions, but remarkably excluding the higher altitude Northwest region in the North, the more upper "Central Highlands", as well as the whole Mekong River Delta in the south. This suggests that I5 accessions are common traditional low yielding lowland varieties with speci c environmental or culinary values.
Comparing the Vietnamese subpopulations to the fteen Asian rice subpopulations identi ed from the 3K RGP highlighted the I5 subpopulation as a potential source of novel variation as it forms a well-separated cluster. Subpopulation I5 originates from lowland areas such as the Red River Delta and adjacent regions. For the range of phenotypes measured in this study, the I5 subpopulation did not differ phenotypically from the other landraces, which have undergone breeding selection within Vietnam. However, compared to the 'elite' I1 subpopulation, I5 accessions have shorter grains, take longer to ower, having lower culm strength, longer culms and leaves.

Conclusions
In this study, we generated a large genome-variation dataset for rice by sequencing 616 accessions from Vietnam and supplementing these with the data obtained for the 3K RGP. Using this resource, we incorporated the Vietnamese rice diversity within the population structure of the Asian cultivated rice. A GWAS analysis yielded the strongest associations for grain characteristics and weaker associations for a range of characteristics such as panicle length, heading date and leaf width. We used these associations together with published QTLs obtained using a subset of our accessions to give us an insight into traits underlying the regions identi ed as being under breeding selection.
Vietnam is currently experiencing increasing variability in the local climate due to global changes and the growing severity of the El Nino-Southern Oscillation phenomenon, creating notable inter-annual variations in precipitation ranging from severe drought to large-scale oods (Yen et al. 2019). The Mekong River Delta region is an essential region for rice production globally, but the adverse effects of salinisation have damaged rice production in recent decades (Tran et al. 2019). In addition, long-term trends in rainfall and temperature patterns have been identi ed in areas with a high proportion of agricultural land. Genomic studies on the locally adapted varieties and subpopulations will provide a potential source of novel alleles which can be exploited in rice breeding programs, such as the new generation of sustainable 'Green Super Rice' which are designed to have lower inputs, enhanced nutritional content and suitability for growing on marginal lands ).

Sequencing of 616 accessions from Vietnam
We sequenced a total of 616 rice accessions, 612 accessions from Vietnam and three reference accessions, Nipponbare, a temperate Japonica; Azucena, a tropical Japonica; and IR64, an Indica (2 samples). 511 accessions are available from the Vietnamese National Genebank (PRC) at http://csdl.prc.org.vn (Table S1). All Vietnamese native rice landraces were grown at Dai Dong Experimental Farm (Dai Dong commune, Thach That district, Hanoi, Vietnam) in 2015. The healthy seeds generated from one mature spikelet of the individual plant in each landrace were harvested and dried separately. After that, the selected seeds (35-40 seeds/landrace) were incubated and sown for two weeks to collect leaf samples (30g/sample) for genomic DNA extraction.
Total genomic DNA extraction of each rice landrace was made from young leaf tissue using the Qiagen DNeasy kit (Qiagen, Germany). DNA concentration and purity of the samples were measured by the UV-VIS NanoDrop ND-2000 spectrophotometer (Thermo Fisher Scienti c) at OD 260/280 nm and OD 260/230 nm wavelengths.
Sequencing was performed by Genomic Services at the Earlham Institute (Norwich, UK). Around 1µg of genomic DNA from each sample was used to construct a sequencing library. For the 36 high coverage samples (pre x: SAM) the Illumina TruSeq DNA protocol was followed, and the samples were sequenced on the HiSeq 2000 for 100 cycles. For the low coverage samples (pre x: LIB), genomic DNA was sheared to 500bp using the Covaris S2 Sonicator (Covaris and Life technologies), and samples were processed using the KAPA high throughout Library Prep Kit (Kapa Biosystems, MA, USA). The ends of the DNA were repaired for the ligation of barcoded adapters. The resulting libraries were quality checked, pooled, and quanti ed by qPCR. The libraries were sequenced on a HiSeq 2500 instrument following the manufacturer's instructions.

Phenotyping
Phenotyping experiments were conducted at the Thach That Experimental Farm of AGI in 2014 and 2015 (Dai Dong commune, Thach That district, Hanoi, Vietnam). The seeds of each rice landrace were incubated in an oven at 45 o C for ve days to break the seed dormancy. All rice seeds were soaked in tap water for two days and incubated at 35-40 o C for four days for germinating. The fully germinated seeds of each rice landrace were directly sown in the paddy eld plot (1.5m 2 in the area). After 15 days of sowing, 24 seedlings of each landrace were carefully transplanted by hand in eld plots (2x4m 2 ). The fertiliser and pesticide applications were performed following the conventional methods of rice cultivation in Vietnam. The phenotypic and agronomic characteristics were carried out following the method of IRRI and Institute (2014).
In addition, phenotypic data were available for eleven of the traits in 38 of the 56 genotypes sourced from the 3K-RGP dataset. These eleven traits were included in our analysis because we did not observe a signi cant difference (p-value > 0.07) between our dataset and the 3K-RGP dataset for the I2 subpopulation (Table S5).
Merging the SNP called in the sequenced materials and the complete 3K RGP dataset Raw sequencing reads were mapped to the Nipponbare reference genome Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), using BWA-MEM with default parameters except for "-M -t 8". Alignments were compressed, sorted and merged using samtools. Picard tools were then used to mark optical and PCR duplicates and add read group information. We used freebayes v1.1.0 for variant calling using default parameters. A total of 21.2 M variants were identi ed of which 16.4 M were SNPs, and 4.8 M were indels. The resulting VCF le was then ltered for biallelic SNPs with a minimum SNP quality of 30, resulting in 16.0 M variants. PLINK v1.9 was used to convert the VCF into a PLINK BED format. These variants were then combined with the 3K-RGP 29 M biallelic SNPs dataset v1.0 by downloading the PLINK BED les from the "SNP-seek" database (https://snp-seek.irri.org) excluding variants on scaffolds and 26,553 SNPs that were agged as triallelic upon merging, resulting in 36.9 M SNPs. The SNPs present in both datasets were then extracted and ltered using an identical approach to Wang et al. , resulting in 5.9 M SNPs. For that, PLINK v1.9 "--hardy" (Purcell et al. 2007) was used to obtain observed and expected heterozygosity for 100,000 SNPs. We removed SNPs in which heterozygosity exceeds Hardy-Weinberg expectation for a partially inbred species, with inbreeding coe cient (F) estimated as the median value of "1−Hobs/Hexp", in which Hobs and Hexp are the observed and expected heterozygosity for SNPs where "Hobs/Hexp <1" and the minor allele frequency is >5% and using the cut-off value of 0.479508 for the entire 3,622 samples dataset. A further ltered set of 3.4 M SNPs was obtained by removing SNPs with >20% missing calls and MAF < 1%. Finally, a core set of 361,279 SNPs was obtained with PLINK by LD pruning SNPs with a window size of 10 SNPs, window step of one SNP and r2 threshold of 0.8, followed by another round of LD pruning with a window size of 50 SNPs, window step of one SNP and r2 threshold of 0.8. Samples with more than 50% missing data in this core set were then removed, resulting in dropping seven newly sequenced samples and one genotype from the 3K-RGP dataset.
Population structure of the combined 3,635 samples The population structure was analysed using the ADMIXTURE software (Alexander and Lange 2011) on the SNP set obtained in the previous section. First, ADMIXTURE was run from K=5 to K=15 in order to compare it with the analysis from IRRI Zhou et al. 2020). For each K, ADMIXTURE was then run 50 times with varying random seeds. Each matrix was then annotated using the subpopulation assignment from the 3K-RGP nine subpopulations.
Then, up to 10 Q-matrices belonging to the largest cluster were aligned using CLUMPP software (Jakobsson and Rosenberg 2007), these were averaged to produce the nal matrix of admixture proportions. Finally, the group membership for each sample was de ned by applying a threshold of ≥ 0.65 to this matrix. Samples with admixture components <0.65 were classi ed as follows. If the sum of components for subpopulations within the major groups (Ind and Jap) was ≥ 0.65, the samples were classi ed as Ind-adm or Jap-adm, respectively, and the remaining samples were deemed admixed (admix).

Recalling the diversity panel with 723 samples
The 616 rice samples were mapped to the Japonica Nipponbare (IRGSP-1.0) reference with BWA-MEM using default parameters, duplicate reads were removed with Picard tools (v1.128) and the bam les were merged using SAMtools v1.5 (Li et al. 2009). Variant calling was completed again on the merged bam le with FreeBayes v1.0.2 (Garrison 2012) separately for each of the 12 chromosomes, but using the option "--min-coverage 10". Over 6.3 M bi-allelic SNPs with a minimum allele count of ³3 and quality value above 30 and missing in <50% of samples were obtained with VCFtools v0.1.13 (Danecek et al. 2011). BAM alignment les to the Nipponbare IRGSP 1.0 reference genome were downloaded from http://snp-seek.irri.org/ (Mansueto et al. 2017;Mansueto et al. 2016) for 107 selected samples. Alignment statistics are included in Table S12. These BAM les were merged and variant calling was similarly completed using FreeBayes v1.0.2 (Garrison 2012) separately for each of the 12 chromosomes using the option --min-coverage 10, and ltered with VCFtools v0.1.13 as before to obtain 6.8 M bi-allelic SNPs with a minimum allele count of ³3 and quality value above 30 and missing in <50% of samples. The two sets of 6.3 M and 6.8 M SNPs were merged using BCFtools v1.3.1 isec to obtain 4.4 M SNPs which were present in both sets and in at least 70% of samples. These 4.4 M SNPs were then ltered to remove positions which fell outside the expected level of heterozygosity for this dataset, as previously indicated. The resulting estimate of F for the 723 samples was 0.882, so a SNP whose heterozygosity is >5x higher than the most likely value for a given frequency and the dataset's inbreeding rate will be deemed as having an excessive number of heterozygotes. The cut-off value was 0.591, which resulted in 3.8 M SNPs passing this lter, a scatter plot indicating the SNPs which were kept and removed is shown in Fig. S15. Missing data was imputed in this latest dataset using Beagle v4.1 with default parameters (Browning and Browning 2016). A comparison using PCA, between the imputed and non-imputed SNP sets showed that imputation did not change the clustering of these 723 samples (Fig. S16). The 3.8M SNPs were subsequently ltered for minimum allele frequency (MAF), linkage disequilibrium (LD pruning or ltering), and distance between polymorphisms (thinning) in different subsets of samples to obtain fourteen sets of SNPs that ranged from 59K to 3.8M SNPs, which were appropriate for the various downstream analysis described below (Table S11).
Population structure and diversity analysis for the panel of 672 Vietnamese samples SNP sets were ltered for MAF 5%, followed by LD ltering using PLINK --indep-pairwise 50 10 0.2, with further thinning if required. We ran STRUCTURE (Pritchard et al. 2000) v2.3.5 using the default admixture model parameters; each run consisted of 10,000 burn-in iterations followed by 50,000 data collection iterations. STRUCTURE was run using K=2 for the 616 samples using SNP set 1 (163,393 SNPs). Samples with admixture components <0.75 were classi ed as admixed, and the remaining samples were classi ed as Indica or Japonica. STRUCTURE was run varying the assumed number of genetic groups (K) from 3 to 10 with three runs per K value for the 672 Vietnamese samples (SNP set 9 -80,000 SNPs); from 1 to 8 with ten runs per K value for the 426 Indica subtypes from Vietnam (SNP set 10 -108,420 SNPs) and the 211 Japonica subtypes from Vietnam (SNP set 11 -59,815 SNPs). The output les were visualised using the R package POPHELPER v.  Table S20.

Declarations
All sequence data used in this manuscript have been deposited as study PRJEB36631 in the European Nucleotide Archive. TDK, KHT, AH, SD, LHH, MC and JDV designed and conceived the research. TDK, KHT, TDD, NTPD, NTK, DTTH, NTD, KTD, CNP, TTT, NTT, HDT, NTT, HTG, TKN, CDT, SVL, LTN, NVG and LHH performed the phenotyping and laboratory experiments. JH and BS performed the data analysis with assistance from TDD, NTPD, DTTH, NTD, KTD, NTT, LTN, TDX, MC and JDV. JH, BS and JDV wrote the paper. All authors read and approved the nal manuscript.

Author contributions
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.
with selected regions in the four Japonica or ve Indica subpopulations, any overlap with publish QTLs for Vietnamese rice populations or for the 3K RGP.     Supplementary Files