Genotyping-by-Sequencing Based Investigation of Population Structure and Genome Wide Association Studies for Seven Agronomically Important Traits in a Set of 346 Oryza rufipogon Accessions

Being one of the most important staple dietary constituents globally, genetic enhancement of cultivated rice for yield, agronomically important traits is of substantial importance. Even though the climatic factors and crop management practices impact complex traits like yield immensely, the contribution of variation by underlying genetic factors surpasses them all. Previous studies have highlighted the importance of utilizing exotic germplasm, landraces in enhancing the diversity of gene pool, leading to better selections and thus superior cultivars. Thus, to fully exploit the potential of progenitor of Asian cultivated rice for productivity related traits, genome wide association study (GWAS) for seven agronomically important traits was conducted on a panel of 346 O. rufipogon accessions using a set of 15,083 high-quality single nucleotide polymorphic markers. The phenotypic data analysis indicated large continuous variation for all the traits under study, with a significant negative correlation observed between grain parameters and agronomic parameters like plant height, culm thickness. The presence of 74.28% admixtures in the panel as revealed by investigating population structure indicated the panel to be very poorly genetically differentiated, with rapid LD decay. The genome-wide association analyses revealed a total of 47 strong MTAs with 19 SNPs located in/close to previously reported QTL/genic regions providing a positive analytic proof for our studies. The allelic differences of significant MTAs were found to be statistically significant at 34 genomic regions. A total of 51 O. rufipogon accessions harboured combination of superior alleles and thus serve as potential candidates for accelerating rice breeding programs. The present study identified 27 novel SNPs to be significantly associated with different traits. Allelic differences between cultivated and wild rice at significant MTAs determined superior alleles to be absent at 12 positions implying substantial scope of improvement by their targeted introgression into cultivars. Introgression of novel significant genomic regions into breeder’s pool would broaden the genetic base of cultivated rice, thus making the crop more resilient. Supplementary Information The online version contains supplementary material available at 10.1186/s12284-022-00582-4.


Background
To feed nearly 10 billion people by 2050, agricultural production must be increased by 60% from 2005 base year (Alexandratos 2012). The global annual yield increase in Open Access *Correspondence: kneelam@pau.edu rice during the first decade of the current century has been < 1.0% (Phillips 2010;Ray et al. 2013), and the fact that agriculture is experiencing greater competition for land, water, and energy makes it sceptical whether the requisite growth rate could be achieved. Considering the erratic climatic changes along with challenges posed by abiotic and biotic stresses, increasing the rice productivity without increasing land under cultivation is a big challenge for rice breeders (Foley et al. 2011;Qian et al. 2016;Zeng et al. 2017). Compounding the problem is the current practice of crossing elite lines, which is expected to reduce genetic variability in the working germplasm, thus, preventing the discovery of novel traits to improve yield. Undoubtedly, plant breeders have witnessed a substantial increase in yield over the years with adoption of new cultivars and better management practices (Sanchez et al. 2013). But, in order to solve the envisioned 9 billion people question (Jacquemin et al. 2013), the rate of rice production must increase on the currently available land.
The Asian cultivated rice, O. sativa, belongs to genus Oryza that includes another cultivated African rice species, O. glabberima (2n = 24, AA) and 22 wild species (2n = 24, 48) representing the AA, BB, CC, BBCC, CCDD, EE, FF, GG, KKLL, and HHJJ genome types (Sanchez et al. 2013). It has been envisioned that utilizing the useful novel variability present in wild relatives of rice could be a promising approach to increase the genetic variability in a breeder's pool. The wild relatives are an important genetic resource for breeding and genomics research as they are a reservoir of useful genes/ QTLs for tolerance to major abiotic and biotic stresses, yield-related traits including weed-competitive ability, new source of cytoplasmic male sterility, and other traits related to rice improvement (Brar and Khush 2018). Of the several approaches advocated to further improve rice productivity, utilization of wild species is of substantial importance (Khush 2005(Khush , 2013. A large amount of untapped genetic variations and higher percentage of fertile hybrids obtained from inter specific crosses of O. sativa with ancestral species, O. rufipogon has made the progenitor an attractive choice for rice breeders. It has been utilized not only for improving qualitative and quantitative traits but also for introgressing new useful variability which recognizes its potential as a valuable reservoir of genetic variation (Tanksley and McCouch 1997;Brar and Khush 2018;Dalmacio et al. 2005). Different kinds of populations such as advanced backcross populations, backcross inbred lines, chromosome segment substitution lines, near-isogenic lines, and recombinant inbred lines have been derived from crosses between O. rufipogon and O. sativa as a prebreeding material (Neelam et al. 2018). Genes for biotic stress like bacterial blight resistance (Zhang et al. 1998;Utami et al. 2008), brown planthopper resistance (Deen et al. 2017), tungro virus tolerance (Kobayashi et al. 1993), and abiotic stress tolerance like acidic conditions, iron toxicity, phosphorus deficiency have been transferred from O. rufipogon into rice cultivars by McCouch et al. (2007) and Brar and Khush (2006). Similarly, there have been a number of studies where introgression lines and back-cross populations derived from O. rufipogon accessions have been used to map yield related QTLs. Moncada et al. (2001) identified three QTLs for grain number, gpl.1, gpl2.1, gpl11.1 in back-cross population utilizing AB-QTL approach. Marri et al. (2005) mapped 3 QTLs each for grain number (gnp2.1, gnp2.2, gnp5.1), spikelet number per panicle (snp2.1, snp5.1, snp5.2), yield (yldp2.1, yldp2.2, yldp9.1) and four QTLs for thousand grain weight (gy2.1, gy2.2, gy2.3, gy9.2) in BC 2 F 1 population derived from O. rufipogon IC22015. Septiningsih et al. (2003) evaluated performance of 400 BC 2 F 2 families derived from O. rufipogon accession IRGC105491 for mapping yield and yield components. The study reported three QTLs each for grain size (gw1.1, gw3.1, gw3.2), spikelet number per panicle (spp2.1, spp3.1, spp9.1), yield (yld1.1, yld1.2, yld2.1) and 1 QTL, gpl1.1, for grain number. Likely, Xiao et al. (1998) identified 6 QTLs (gpl1.1, gpl2.1, gpl4.1, gpl5.1, gpl8.1, gpl8.2) for grain number in BC 2 population derived from IRGC 105,491. Fu et al. (2010) identified a total of 26 QTLs related to grain number, thousand grain weight and yield in BC 2 F 2 and BC 2 F 4 populations derived from Yuanjiang Oryza rufipogon Griff. In addition, Xie et al. (2008) and Jin et al. (2009) mapped grain number QTLs gn9.1 and on gpp8 chromosome 9 and 8 in BC 3 F 4 and F 2:3 populations, based on O. rufipogon accession IRGC105491. Xie et al. (2006) mapped a grain size locus, gw8.1, on chromosome 8 in BC 3 F 3 population. Similarly, Liu et al. (2009) mapped two QTLs, qspp1, qspp11 andLuo et al. (2013)  Majority of the research on O. rufipogon has utilized only a few accessions in different biparental crosses, thus limiting the allelic diversity and genetic resolution. Genome-wide association studies has been extensively employed in order to overcome these limitations as it involves a large association mapping panel, thereby increasing the allelic diversity and mapping resolution. Also, it provides an estimation of the effects of various alleles on the target trait. Since GWAS exploits historic recombination, it helps in dissecting the molecular basis of traits at a finer resolution which increases its chance for immediate utility in breeding programs. With the advent of NGS based SNP markers, a high density of markers is tested for their association with the target traits, thus giving better resolution than biparental linkage mapping carried out with limited number of SSR markers. Given these advantages of GWAS over traditional bi-parental mapping, GWAS has established itself as a promising approach to dissect complex polygenic traits at allelic level in biological sciences. The present study was designed with an aim to exploit a diverse set of 346 O. rufipogon accessions for exploiting variation for seven agronomically important traits that affect yield directly or indirectly.

Variation of Seven Agronomic Traits in Panel of O. rufipogon Accessions
A large amount of variation for all the seven agronomic traits was recorded in O. rufipogon accessions. The frequency distribution curves of all the seven traits PH, CT, PL, PB, GL, GW and HGW revealed continuous variation for all the traits (Fig. 1). Pairwise correlations showed a negative trend of PH, PL and PB with all the grain parameters. The descriptive statistics and heritability measurements of the phenotypic traits are given in Table 1. Heritability ranged from 0.38 to 0.80  with minimum observed for panicle length and maximum for grain weight. A few accessions like IR104777,  IR81989, IR100678, IR81802, IR93119 and IR104873 from Thailand, Myanmar, Taiwan, Indonesia, Cambodia and Thailand, respectively, were found to be better in terms of grain length and grain width. Similarly, a few Thailand accessions seem to be promising for promoting CT like IR104796, IR104775 and IR104792. Some other Thailand accessions, IR104783 and IR104766, had higher values of grain weight. Likewise, a Cambodian accession, IR110406, was recognized to have superior panicle architecture. Thus, many accessions were found to have the potential to be used in breeding systems to introduce beneficial genetic diversity into cultivated germplasm.

Population Structure Analysis
PCA plot didn't reveal any distinct sub-grouping indicating absence of strong structure in the population (Fig. 2). Lack of clustering implies natural selection to have occurred in a continuous manner, leading to continuous diversity. Although bayesian model-based clustering by StrAuto suggested probable division into six-subpopulations but the level of differentiation was determined to be too low to call them genetically differentiated (Fig. 3). Considering the membership criterion of 75%, only eighty-nine accessions were classified into discrete sub-populations, and the remaining 257 were judged as admixed. Such a high proportion of admixtures led to blurring of the boundaries among different sub-populations, making this germplasm set an ideal panel for GWAS. High degree of admixture suggests a high degree of gene movement to have occurred between regions. Only a little correlation was observed between geographic coordinates and sub-populations.
Global F st value of 0.06 denoted very low level of genetic differentiation, indicating only 6% of the total genetic variation to be distributed among subpopulations, and remaining 94% of the variation was present within subpopulations. However, F st values showed a marked increase to 0.28, after removal of admixtures. AMOVA test (Table 2) further confirmed the results as only 10.74% of total marker variation was attributed between sub-populations and the remaining 89.26% of variation was observed within sub-populations. This also serves as evidence of presence of continuous variation and absence of discrete classification into subpopulations. By removing the admixtures, the marker variance between sub-populations increased to 30% instead of 10% and remaining 70% was observed within the sub-populations.
Based on PCA, STRU CTU RE, Fst, AMOVA, the current analysis indicated a very weakly differentiated population, where admixed lines made up most of the population. The real structure of the population was masked by the presence of a large number of admixtures as removal of admixtures from the population enhanced F st , pairwise F st values. Also, before performing GWAS, model-based selection suggested the highest BIC value when no PCs were used in the model as covariates. Therefore, in the current analysis, covariates obtained from studying population structure were not added to the GWAS model. Also, the LD decayed to its half maximum at less than 10 Kb.

Genome Wide Association Study
Genome wide association study conducted on a set of 346 O. rufipogon accessions using tagged set of 15,083 SNP markers, revealed a total of 47 significant marker trait associations (MTAs) at p-value ≤ 1e-4 (Table 3). Deciding an appropriate threshold value for determining the significance of association of a genomic region with the trait under study is an important aspect in interpreting GWAS findings. In the current study, Bonferroni corrected p value, LD-based threshold came out to be 3.31499E-06 and 1.28205E−06, respectively. However, mBF based on  Bayesian methods was calculated to be 0.00173379. A total of 10, 6 and 194 significant MTAs were obtained by using the Bonferroni, Ld based and mBF based corrections. However, for current study, p-value threshold was kept to be 1e-4 in order to keep a manageable number of significant SNPs for further in-depth annotation. The details of loci harbouring the significant SNPs or loci in the LD region of significant SNPs along with their functional annotation is given in Table 4. Of the 47 significant MTAs observed, 19 SNPs located in/close to previously reported QTL/genic regions such as bct2b, bct11c,pl2a,gw1,gw4,gw5,gw11.1, AQDZ008, AQDZ009, AQCU085, AQCU149, QKw2b, AQEO021, providing an analytic proof of the concept of our study (Table 3). The distribution of SNP markers chromosome wise is given in Fig. 4. These MTAs were found on all chromosomes, except chromosome 12 as depicted by Manhattan plots presented in Fig. 5. Most of the markers were associated with more than one trait. Markers showing significant association with the trait at p ≤ 1e-4 were designated to be strongly associated with the trait and     (Xiao et al. 1998;Brondani et al. 2002) S4_31,316,844 HGW(5.89E−06, 5.43%) AQEO021 (Redoña and Mackill 1998) the traits are referred to as primary traits. The association of these strongly associated markers with other traits at lesser stringent p value ≤ 0.05 was examined and such traits were designated to be secondary traits. A total of 10 significant SNP associations were obtained each for CT, PL and HGW; 2 each for PB, GL and 14 for GW. For CT, the significant MTAs altered the value of trait over the mean by a maximum of 7.83% on chromosome 7 while the most significant association, obtained on chromosome 3, altered it by 6.75%. Most of the SNPs associated with CT were also associated with PB and PH at lesser significant p-value ≤ 0.05. Sixty percent of the associated MTAs harboured in loci encode proteins like OsCML16-Calmodulin-related calcium sensor protein, terpene synthase, lectin-like receptor kinase 7, chalcone synthase gene, glycosyl hydrolases family 17 protein, peroxisomal multifunctional enzyme type 2 protein. Few loci coding for hsp20/alpha crystallin family protein, nmrAlike family domain containing protein, SMC-related protein MSS2, SMC-related protein MSS2, ubiquitin carboxyl-terminal hydrolase 21 were present in the LD region of these significantly associated SNPs.
The significant MTAs obtained for PL altered the value of trait over mean by a maximum of 8.34% while the most significant association obtained on chromosome 6, altered it by 4.70%. Most of these SNPs were also associated with PH and PB. Ninety percent of significant MTAs harboured in various loci encoding proteins such as helicase domain containing protein, N-rich protein, estradiol 17 beta-dehydrogenase 12, protein kinase domain containing protein, calmodulin-binding protein, gibberellin receptor GIDL2. While a few of them localized in loci encoding expressed proteins of unknown functions, others were in the vicinity of loci encoding tRNA-specific adenosine deaminase, homeobox protein knotted-1.
SNPs most stringently associated with PB were located on chromosomes 4 and 7, S4_30721851 and S7_24282724, respectively. The former SNP altered the trait by a value of 5.56% over the mean value and harboured in LOC_Os04g51809, encoding an expressed protein with highest FPKM values reported in inflorescence. Two other loci, coding for OsHKT1;1-Na + transporter and formin-like protein 20 were present in the LD region of these SNPs. Two significant MTAs obtained for GL on chromosomes, 6 and 8, S6_24807445 and S8_5775398, altered the trait by 1.89% and 1.84%, respectively. SNPs localized in LOC_Os06g41380, LOC_Os08g09990 respectively, both of which encoded expressed protein.
Few other loci, coding for zinc finger family protein, N-terminal asparagine amidohydrolase were in the vicinity of these SNPs.
For GW, the most significant MTA, S8_24,621,885, on chromosome 8 altered the trait over the mean value by 3.92% and was localized in LOC_Os08g38960 encoding conserved expressed protein. The SNP was also associated with PB and PH at lesser significant levels. Among all the significant MTAs obtained for GW, S4_12374542, on chromosome 4, had the highest effect, which altered the trait by a value of 4.88% over the mean value. SNP S5_28157471, altered GW by 2.44% over the mean, was also associated with HGW and PH at p-value ≤ 0.05. The SNP localized in LOC_Os05g49100, encoding WRKY 49 protein. In total, 10 SNPs significantly associated with GW, localized in loci that encode proteins like auxin efflux carrier component, cytokinin-O-glucosyltransferase, calcium-binding EF hand family protein, Pale Cress Protein (PCP), pentatricopeptide, OsSCP46-Putative Serine Carboxypeptidase homologue, DEAD-box ATP-dependent RNA helicase. Also, some other loci encoding F-box family protein, OsWAK33-OsWAK receptor-like protein OsWAK-RLP, polygalacturonase inhibitor 3 precursor, OsSCP44-Putative Serine Carboxypeptidase homologue, tetraspanin family protein, harpin-induced protein 1 domain containing protein (DS), Arabidopsis-LEA (LEA) hydroxyproline-rich glycoprotein family/other ortho NHL25, leucine-rich repeat family protein, cytochrome P450, dynamin-2B, OsGrx_ S17-glutaredoxin subgroup II, lysine ketoglutarate reductase trans-splicing related 1, hydrolase, alpha/beta fold family domain containing protein, pentatricopeptide repeat-containing protein (ortho-60S ribosomal protein L34), pyruvate kinase, hydrolase, alpha/beta fold family domain containing protein, acyl-desaturase, chloroplast precursor were present in the LD region of strongly associated SNPs.
The most significant SNP for HGW, S4_4499266, on chromosome 4 also had the highest effect. Two loci, LOC_Os04g08390 and LOC_Os04g08410, encoding Leucine Rich Repeat family protein, and ELM2 domain containing protein were present within 10 kb region of  the SNP. The latter locus had its highest FPKM expression value reported in anthers. Also, this SNP was associated with GL at p-value of 0.005, altering it by 2.61% over the mean value. SNP S2_2875772, strongly associated with HGW was also associated with GL and PH, was in LOC_Os02g05830, encoding ribulose bisphosphate carboxylase small chain, chloroplast precursor with highest FPKM values reported in embryo. Another SNP, S2_3873759, present on chromosome 2, was also associated with GL. Three more SNPs, S4_26914103, S4_31316844, S4_35115087, altered the trait by ~ 6%. SNP S4_26914103 were part of LOC_ Os04g45480, encoding heat shock protein with highest reported FPKM values in seed. SNP S5_24316574 and S11_19062952, strongly associated with HGW, were localized in LOC_Os05g41530 and LOC_Os11g32260, encoding ZOS5-11-C2H2 zinc finger protein and lysosomal alpha-mannosidase precursor, respectively.

Allelic Effects
The allelic effects of significant MTAs for each trait were evaluated using Kruskal-Wallis test and their chi-square values along with p-values have been presented in Table 5 and represented via box-plots in Fig. 6a-f. For all the traits, the differences among alleles were statistically significant at 34 genomic regions. The differences among alleles were significant from breeding point of view as well. For PL, significant MTA with highest effect, S8_11774122, the genotypes with allele CC had on an average 6.3 cm longer panicles than accessions with GG allele. Similarly, for PB, significant SNP with highest effect on chromosome 7, S7_24282724, accessions with genotypes CC had on an average 1.6 more branches relative to accessions with GG genotypes and this difference was statistically significant.

Identification of Potential O. rufipogon Accessions
The number of O. rufipogon accessions possessing superior allelic combinations for CT, PB, PL, HGW, GL and GW, at significant genomic regions were found to be 5, 13, 1, 3, 34 and 1, respectively (Tables 6,7,8,9,10,11). However, three accessions, CR100443, IR104777, IR104783 had superior alleles for both GL and HGW.  in breeding programs aimed at improving grain weight. Also, these accessions had higher values of CT, HGW and GL, respectively. Similarly, a Cambodian accession, IR110406, had higher values of both PL and PB and can be utilized for improving better panicle architecture. Therefore, these accessions should be utilized in breeding programs to transfer useful genetic variability into the cultivated germplasm. Understanding the nature of correlation among various traits that affect yield directly or indirectly, will lead to improved selection rate of better germplasm, thus paving path to superior genetic gains in the breeding programs. In the present research, PH was positively correlated to PL, PB, CT and negatively correlated to grain parameters.  rufipogon have identified different number of sub-populations, viz., four (Zhou et al. 2003), three (Huang et al. 2012), five (Prathepha et al. 2012), six (Kim et al. 2016), three . Similarly, the level of differentiation estimated by F st /AMOVA indicated low to high level of differentiation in various studies. A conclusive statement combining past and present study expressing number of sub-populations in O. rufipogon seems unjustified as population structure depends on several factors like set of accessions used for study, their geographical origins, different population size, types and number of markers, and method/technique used for predicting structure. LD decayed very rapidly across O. rufipogon genome, and the decay rate was calculated to be 10 kb. Studies by Huang et al. (2012) have also demonstrated similar rate of LD decay and have attributed it to thousands of reproductive cycles and thus several years of recombination, leading to higher mapping resolution in association studies as compared to the domesticated populations.

Genome Wide Association Study in O. rufipogon
GWAS has seen an upward trend in plant sciences since the commencement of this millennium, but it has been challenged by the problem of false positives and false negatives, both of which are equally portentous. Where false positives arising due to unaccounted genetic structure and kinship, lead to practical non-usability of GWAS results during their validation and utilization in biparental mapping populations, false negatives accounted by overcompensating corrections caused by multiple testing ) and strict statistical level (p-value) threshold, lead to loss of true rare SNPs. Therefore, ideally an association mapping (AM) panel with minimum genetic structure or accounted genetic structure in models is employed. Due to absence of strong differentiation in the current AM panel, no structure co-variates were used in the GWAS model in the current study. Generally, a test is statistically significant if the p-value is smaller than the pre-defined alpha value. Since GWAS is based on hundreds to thousands of multiple comparisons/testings, the average probability of false positives increases dramatically. For choosing an appropriate significance threshold that distinguishes false positives from false negatives, many corrections have been demonstrated. Out of them, Bonferroni correction is too stringent, gives a very conservative p-value threshold, resulting in a huge loss of power and leads to loss of true positives and underrate the goal of genome-wide studies. In our case, LD based determination of significance threshold was also too stringent. Therefore, another threshold was defined based on minimum Bayes Factor using the formula mBF = − e*P*ln(P) as documented by Wakefield (2009), Zhang et al. (2019). However, it led to 194 MTAs. Therefore, in order to narrow down the number of significant SNPs for further detailed annotation, significant threshold for current study was determined to be 0.0001. Overall, 47 MTAs were identified on eleven chromosomes, with no associations observed for any trait on chromosome 12. Previous studies have already reported 42.5% of the total significant MTAs obtained in the current analysis, providing positive analytical support for our findings. Had Bonferroni or LD-based criterion been employed for determining p-value, the previously reported QTL regions obtained significant by employing mBF-based correction, would not have been determined   Table 6 The list of O. rufipogon accessions possessing combination of superior alleles at significant SNP positions associated with GW where SNPs marked with * represent alleles absent in indica cultivars S1_40142074* S2_22216515 S3_7917671 S5_23720696 S5_28157471 S8_24621885* S10_19238621 *   Superior  AA  GG  GG  CC  CC  TT  GG   IR103308  AA  GG  GG  CC  CC  TT  GG   PR114  GG  GG  NN  CC  CC CC CC significant as only 10 and 6 MTAs were found to be significant by employing former corrections. Thus, deciding an appropriate threshold for GWAS is one of the determining factors of success of GWAS besides accurate phenotyping and modelling of covariates in the model. Of the 47 significant MTAs reported, only a single SNP, SNP S1_1931325, on chromosome 1, was found to be strongly associated with both CT and HGW. The SNP altered the respective traits by a value of 3.5% and 3.4% over their mean values and localized within LOC_Os01g04330, that encodes an expressed protein OsCML16, gene regulated by OsERF48 transcription factor (TF), whose overexpression in roots led to increased grain yield under drought stress (Jung et al. 2017), thus explaining its association with grain weight. About 40% of the significant MTAs associated with CT harbored in already reported QTLs/genes; bct2b (Mu et al. 2004), QTLs AQDZ008 and AQDZ009 (Kashiwagi and Ishimaru 2004) and bct11 (Mu et al. 2004) reported on 2, 6, 7 and 11 chromosomes. The role of NmrA-like domain containing proteins, Lectin like receptor kinases in cell differentiation, cell division, shoot/fiber development has been documented by Reiner et al. (2016) and Zuo et al. (2004) explaining the association of the significant markers with CT. Similarly, the function of glycosyl hydrolase family proteins, a type of cell wall degrading enzyme, in the control of longitudinal and transverse growth has been linked to CT and PH influencing lodging resistance (Pan et al. 2019). Zhou et al. (2016) have established the interaction of OsRLCK57 with OsBRI1 (a rice BR receptor) to affect rice panicle branching, explaining association with PB. Most of these MTAs also showed association with PH and PB at p-value ≤ 0.05, evident from positive correlation of CT with PB and PH. For PL, 60% of the significant SNP associations obtained in the present study localized in previously identified QTLs/ genes pl2a, qPL-3-2, qPL-6, AQCU085 and AQCU149 as reported by Zhuang et al. (1997), Mei et al. (2003), Kobayashi et al. (2003) and Yanamoto et al. (2016). For the novel MTAs, the expression of proteins encoded by   Superior  CC  TT  TT  CC  AA  TT   CR100443  CC  TT  TT  TT  AA  TT   IR104777  CC  TT  TT  TT  AA  TT   IR104783  CC  CT  TT  TT  AA  TT   PR114  CC  TT  TT  TT  GG  TT   Table 9 The list of O. rufipogon accessions possessing combination of superior alleles at significant SNP positions associated with PL where SNPs marked with * represent alleles absent in indica cultivars S2_30762305* S3_32359666 S6_6597022 S6_11174827* S8_11774122 S9_4933781* S10_14320467* S10_16385834 S11_7438223 *   Superior  GG  GG  GG  CC  CC  GG  GG  CC  AA   CR100459 GG  GG  GG  CC  CC  GG  GG  NN  AA   PR114  TT  GG  GG  GG  CC  CC  AA CC GG the significant novel genomic regions have their highest reported expression values in panicles/seeds, making these regions strong candidates for new genes/QTLs. The role of Calmodulins being regulators of degradation of tapetal cells and pollen development binding proteins Yu et al. 2016) and GIBBERELLIN INSENSITIVE DWARF (GIDs) in GA perception followed by GA triggered actions (Shimada et al. 2008) like regulation of cell elongation and plant height (Thomas et al. 2005), corroborate the role of this region in regulation of panicle length. A single novel strong association was obtained each for PB and GL, on chromosomes 7 and 8, respectively. Besides this, the other two significant SNPs strongly associated with PB and GL obtained on chromosomes 4 and 6, respectively, have been documented as qPRB-4a and qGL-6 by Teng et al. (2002) and Li et al. (2003). The SNP on chromosome 7, significantly associated with **PB and *PH, localized in locus coding for formin like proteins, reported to play a role in polar pollen cell growth and overexpression leading to broadening of pollen tubes, polar growth changes (Cheung and Wu 2004). Of all the 14 MTAs reported for GW, SNP S1_40,142,074, on chromosome 1, harbored in previously reported QTL qGW-1 (Wan et al. 2005). Other MTAs localized/harbored in vicinity of loci encoding proteins belonging to diverse families. The novel SNPs strongly associated with GW on chromosome 2 localized in loci coding for F-box family proteins and cytokinin-O-glucosyltransferases. The latter plays a key role in maintaining the adequate levels of active cytokinins (Takei et al. 2001;Sakano et al. 2004;Abe et al. 2007) essential for modulating the expression of cell cycle regulators which facilitate cell division in the endosperm cells, thus leading to improvement in grain filling (Panda et al. 2018) and seed development ). Similarly, the role of F box proteins in regulating senescence, seed size and grain number has been reported by (Piao et al. 2009). The only SNP strongly associated with GW on chromosome 3   IR93120  AA  TT   IR104404C  AA  TT   IR100597  AA  TT   IR100678  AA  TT   IR103850  AA  TT   IR104425  AA  TT   IR104641  AA  TT   IR104746  AA  TT   IR104766  AA  TT   IR104777  AA  TT   IR104783  AA  TT   IR104796  AA  TT   IR104821  AA  TT   IR104821A  AA  TT   IR104873  AA  TT   IR105240  AA  TT   IR105726  AA  TT   IR105491  AA  TT   CR100001  AA  TT   IR113661  AA  TT   CR100216  AA  TT   IR80562  AA  TT   CR100368  AA  TT   CR100465B  AA  TT   IR80660  AA  TT   CR100490  AA  TT   PR114 AA CC harbored in locus encoding Calcium-binding EF hand family protein, structural component of Calcium-Dependent Protein Kinases (CPKs), reported to be predominantly abundant in panicle, stamen and seed development (Valmonte et al. 2014). Similarly, OsWAK receptor-like proteins, known to play a role in cell expansion (Lally et al. 2001) are reported to be linked to grain yield, panicle characteristics . LRR family protein and pentatricopeptide domain containing protein found in the ld block of MTAs obtained on chromosome 5, are known to regulate panicle/grain size (Su et al. 2012) and plant embryogenesis (Saha et al. 2007), respectively. Furthermore, cytochrome P450s, such as CYP701A8 and CYP714B in rice Magome et al. 2013), are considered to play an important role in gibberellin metabolic pathways and biosynthesis of brassinosteroids, known to regulate grain size regulation in rice, including GS5 (Li et al. 2011), GW5/qSW5 Weng et al. 2008;Liu et al. 2017). Studies by Hong et al. (2002Hong et al. ( , 2003 have demonstrated defects in BR biosynthesis leading to smaller seeds. Recently, Ponce et al. (2020) identified a putative cytochrome P450 (Cyp/LOC Os05g08850) to be a possible candidate gene for the qGW5. Another MTA on chromosome 10, S10_19,109,511, harbored in LOC_ Os10g35750 encoded pentatricopeptides, known to regulation of shape, size and weight of rice grains (Wang et al. 2020). Novel SNP association obtained for GW on chromosome 8 localized in RbcX protein, chaperone involved in biogenesis of Rubisco (Kolesinski et al. 2013), enzyme which fixes inorganic carbon into organic form leading to production of carbohydrates. Another locus in the vicinity coded for dynamin-2B protein, with established role in cellulose biosynthesis as reported by Hirano et al. (2010), Xiong et al. (2010). Also,  have demonstrated that mutation in rice dynamin-related gene OsDRP1E led to significant alteration in key agronomic traits like plant height, grain weight, panicle length etc. Similarly, another locus found in the LD region, coded for endoglucanase, enzyme responsible for degradation of cellulose, making this LD block to be associated with carbohydrate metabolism and thus grain parameters. The novel MTA for GW obtained on chromosome 10 coded for OsSCP46-Putative serine carboxypeptidase, known to control grain size by regulating grain width and grain filling in GS5, loss of which led to wide and heavy grains owing to dense, slender spikelet epidermal cells as demonstrated by Duan et al. (2017). Another novel MTA, S10_19,238,621, on chromosome 10 localized in DEAD-box ATP-dependent RNA helicase with highest expression in pistil. The other QTLs in the LD block coded for tetraspanins and remorins, known to be involved in floral organ formation (Mani et al. 2015) and grain setting (Gui et al. 2014), making this region to be a strong candidate for grain parameters.
For HGW, 70% of the MTAs obtained in the present study have been previously reported as gw1.2/qTGW1-2/gw1.1 (Moncada et al. 2001;Septiningsih et al. 2003;Hittalmani et al. 2002), QKw2b , AQE053/gw4 (Xiao et al. 1998;Brondani et al. 2002), AQEO021 (Redoña and Mackill 1998), gw5b/gw5 (Xiao et al. 1998;Hua et al. 2002), gw11.1 (Moncada et al. 2001) on chromosomes 1, 2, 4, 5 and 11. Amongst the novel MTAs obtained in present research, S4_35,115,087, on chromosome 4, was located close to loci coding for proteins with F-box domain and soluble inorganic pyrophosphatase enzyme with highest FPKM values reported in seed. Comprehensive analysis of F-box proteins in rice by Jain et al. (2007) suggest their role in floral transition as well as panicle and seed development. Also, loci like OsFBK12 and LARGER PANICLE as reported by Chen et al. (2013) and Li et al. (2011), code for F-box proteins have been reported to regulate seed size, grain number and panicle size, grain weight, grain number, primary branches, respectively, making this region likely to be associated with a novel grain weight region.

Potential O. rufipogon Accessions
The identification and utilization of O. rufipogon accessions possessing superior allele combinations at genomic regions significantly associated with trait of interest is one of the promising strategies to introgress useful genetic variability in cultivated gene pool. Thus, identification of 51 O. rufipogon accessions possessing superior alleles would enhance the speed of rice breeding operations. Comparison between alleles of O. rufipogon and an elite O. sativa indica cultivar, PR114, at 34 significant genomic regions revealed superior alleles of wild relative to be absent at 12 loci, implying that despite excessive utilization of O. rufipogon in breeding programs, there is still untapped genetic diversity in the progenitor whose introgression in cultivated rice would substantially increase genetic gains.

Conclusions
Identification of genetic factors underlying agronomically important traits is critical to meet the world's growing demand for high crop yields. Abundant phenotypic variation in wild O. rufipogon germplasm coupled with minimum population structure, made this germplasm an ideal panel for conducting association mapping studies. GWAS revealed a total of 47 significant MTAs, out of which 19 were part of previously documented gens/ QTLs, providing a positive analytic proof of our study. In-depth genome annotation in the LD region of significant MTAs identified putative candidate genes belonging to F-box proteins, Lectin like receptor kinases, glycosyl hydrolases, Calmodulins, GIDs, formin like proteins, cytokinin-O-glucosyltransferases, OsWAK receptorlike proteins, Cytochrome P450, pentatricopeptides and putative serine carboxypeptidase. The role of majority of the identified putative candidate genes could be established with the trait of interest using previous literature. Validation of the putative candidate genes would contribute to their use in rice breeding programs, broadening the genetic base of cultivated rice, thus making the crop more resilient. Furthermore, genotypes chosen on the basis of improved phenotypic performance along with superior combination of alleles can be directly incorporated into breeding programme to generate pre-breeding material, which will serve as a valuable germplasm resource for rice breeding.

Plant Material and Phenotyping
School of Agricultural Biotechnology, Punjab Agricultural University, Ludhiana is maintaining a large set of wild species accessions belonging to different genomes of rice through clones or seeds. These accessions were originally procured from International Rice Research Institute, Philippines and Central Rice Research Institute, Cuttack. In the present study, a set of 346 accessions of O. rufipogon was investigated. The detailed information of these accessions is provided in the Additional file 1: Table S1.
Phenotypic data was collected in replications from 2014-2016 years for seven different traits, namely, plant height (PH), culm thickness (CT), panicle length (PL), number of primary branches per panicle (PB), grain length (GL), grain width (GW) and hundred grain weight (HGW). Data for HGW was recorded in all the three years, while all the other traits were recorded in two years, 2014 and 2015. Briefly, PH and PL was recorded from two different plants and four panicles per accession. The culm thickness was measured from four and six plants respectively with a Vernier caliper. The number of primary branches were counted manually from four panicles. Grains were dehulled and grain parameters; GL and GW, were recorded for 10 grains/accession with grain analyzer. Grain weight was recorded for hundred grains with an electronic weighing balance.

Statistical Analysis of Phenotypic Data
The phenotypic data was statistically analyzed in R version 3.4. Distribution of averaged phenotypic data was checked by plotting histogram using hist function and by Shapiro-Wilk test. Statistical analysis of phenotypic data was done in R using lme4 package (Bates et al. 2015). For each trait, components of phenotypic variance were estimated from analysis of variance using restricted maximum likelihood methods. The linear mixed effects, lmer function, in lme4 package (Bates et al. 2015) was used to estimate variance components. All the effects were treated as random and broad sense heritability (H 2 ) on a line mean basis was calculated.

DNA Isolation and Genotyping
Large scale DNA was isolated from each accession from 10-day old leaves using Cetyltrimethyl ammonium bromide (CTAB) method (Doyle and Doyle 1987). DNA quality was accessed on 0.8% agarose gel electrophoresis and genomic DNA was quantified using Thermo Scientific NanoDrop ™ 8000 spectrophotometer, followed by its normalization to 100 ng μl −1 . Thereafter, the samples were sent to Genomic Diversity Facility, Cornell University, NY, USA for Genotyping by Sequencing (GBS). Restriction enzyme ApeKI was used to generate GBS library.
GBS data was analyzed with the reference-based 'discovery' pipeline described in TASSEL 3.0 documentation and in Glaubitz et al. (2014). The vcf file generated after the discovery pipeline, was indexed for use with bwa version 0.7.8-r455. After alignment, file was filtered for the minor allele frequency (maf ) > 0.01 and missing data per site < 90% using VCFtools version v0.1.12a. Further filtration was done in unix and R to remove all monomorphic and multi-allelic markers. Also, accessions with missing data points more than 10% were removed to obtain final SNP data file for further analysis.

Population Structure and Linkage Disequilibrium Analysis
Principal Component Analysis (PCA) and StrAuto program was used to investigate population structure of 346 O. rufipogon accessions. Before PCA, the missing data was imputed using A.mat function of rrBLUP package (Endelman 2011). PCA was done on imputed dataset using prcomp function, based on Singular Value Decomposition method. Strauto program (Chhatre and Emerson 2017), based on Structure V2.3.4 software model-based clustering program, was used to infer the population structure. The input file for running STRU CTU RE was prepared using PGDSpider (Lischer and Excoffier 2012). The length of burn-in period and number of Monte Carlo Markov Chain (MCMC) replicates after burn-in were set to 100,000 each. The dataset was analyzed for K values ranging from 1-10 with 10 replications/K value. Admixture model-based approach was used to infer the population structure. The best K was determined by Structure Harvester (Earl and vonHoldt 2012) based on Evanno method (Evanno et al. 2005). The outcome of STRU CTU RE was plotted with Pophelper package (Francis 2017) in R. Other widely studied parameters for assessing genetic diversity like fixation index (F st ) and AMOVA were calculated by stamppFst function of StAMPP package (Pembleton et al. 2013) and Poppr package in R (Kamvar et al. 2014), respectively. The stamppFst function of StAMPP package calculates pairwise F st values along with confidence intervals and p-values between populations according to the method proposed by Wright (1949) and updated by Weir and Cockerham (1984). The number of bootstraps was set to 100. LD decay was calculated using PopLDdecay program in unix and was plotted in R using a customized script.

Genome Wide Association Study
Genome-wide association study (GWAS) was carried out for seven traits, namely, PH, CT, PB, PL, GL, GW, HGW in R using GAPIT 3 (Wang et al. 2020) using a tagged set of 15, 083 SNPs (Additional files 2 and 3: Tables S2 and S3). SNP tagging was done in R using hclust2 function. PCA, STRU CTU RE analysis and BIC values indicated absence of genetic structure in the panel, therefore, no covariates were included in the model to correct for population structure. FarmCPU calculated kinship, based on FaSTLMM algorithm, was considered while estimating associations in order to prevent false positives, arising due to population structure. Determining an optimum threshold that determines the significance of a genomic region with trait of interest is of utmost importance to minimize both Type I and Type II errors. Therefore, various corrections such as Bonferroni-correction, LD-based correction and minimum Bayesian approaches were tried and compared. Bonferroni correction was calculated using the formula alpha/n; where alpha = 0.05 and n = 15,083. LD based approach determines effective number of independent tests as LD bins calculated by Reference genome size (390 MB)/Average LD extent (10 Kb). Considering the experiment wide probability of Type-I error to be 0.05, LD-based correction was calculated as documented by Zhang et al. (2015). Minimum Bayes Factor was calculated using the formula e*P*lnP as documented by Goodman (2001) and Zhang et al. (2019). GWAS results were assessed by studying the Quantile-Quantile plots (QQ plots), Manhattan plots and association tables for each trait. The allelic effects were determined for the strongly associated markers by depicting phenotype data for alleles as box plots and using the Kruskal-Wallis test to see if the alleles differ significantly for the associated traits.