1k-RiCA (1K-Rice Custom Amplicon) a novel genotyping amplicon-based SNP assay for genetics and breeding applications in rice

Background While a multitude of genotyping platforms have been developed for rice, the majority of them have not been optimized for breeding where cost, turnaround time, throughput and ease of use, relative to density and informativeness are critical parameters of their utility. With that in mind we report the development of the 1K-Rice Custom Amplicon, or 1k-RiCA, a robust custom sequencing-based amplicon panel of ~ 1000-SNPs that are uniformly distributed across the rice genome, designed to be highly informative within indica rice breeding pools, and tailored for genomic prediction in elite indica rice breeding programs. Results Empirical validation tests performed on the 1k-RiCA showed average marker call rates of 95% with marker repeatability and concordance rates of 99%. These technical properties were not affected when two common DNA extraction protocols were used. The average distance between SNPs in the 1k-RiCA was 1.5 cM, similar to the theoretical distance which would be expected between 1,000 uniformly distributed markers across the rice genome. The average minor allele frequencies on a panel of indica lines was 0.36 and polymorphic SNPs estimated on pairwise comparisons between indica by indica accessions and indica by japonica accessions were on average 430 and 450 respectively. The specific design parameters of the 1k-RiCA allow for a detailed view of genetic relationships and unambiguous molecular IDs within indica accessions and good cost vs. marker-density balance for genomic prediction applications in elite indica germplasm. Predictive abilities of Genomic Selection models for flowering time, grain yield, and plant height were on average 0.71, 0.36, and 0.65 respectively based on cross-validation analysis. Furthermore the inclusion of important trait markers associated with 11 different genes and QTL adds value to parental selection in crossing schemes and marker-assisted selection in forward breeding applications. Conclusions This study validated the marker quality and robustness of the 1k-RiCA genotypic platform for genotyping populations derived from indica rice subpopulation for genetic and breeding purposes including MAS and genomic selection. The 1k-RiCA has proven to be an alternative cost-effective genotyping system for breeding applications. Electronic supplementary material The online version of this article (10.1186/s12284-019-0311-0) contains supplementary material, which is available to authorized users.


Background
Rice (Oryza sativa) is the staple food for more than 3.5 billion people (Wang et al. 2018). To meet the future global demand for rice an estimated additional 116 million tons of it will be needed by 2035 (Seck et al. 2012). Based on this assumption up to 1.5 to 2.4% yield increase per year will have to be attained despite limiting water supplies, reduced cultivation area, and fluctuating climatic conditions (Seck et al. 2012;Ray et al. 2013). Recent studies indicated that rice yield increases have plateaued in different regions of the world (Ray et al. 2013). Overcoming this yield stagnation requires great efforts and innovation to improve the efficiencies in cultivation, management and, not least of all, varietal development. Rice breeders and geneticists need to capitalize on the latest methods and tools to accelerate variety development and increase the annual rate of genetic gains for multiple traits to maintain a stable food supply that meets the needs of a growing population .
Recent advances in next-generation sequencing (NGS) and single nucleotide polymorphism (SNP) genotyping promise to accelerate crop improvement provided they are properly integrated and deployed into breeding programs (Thomson 2014). SNPs are the markers of choice for most high-throughput genotyping applications. They are abundant, co-dominant and evenly distributed along the genome. High-throughput SNP genotyping platforms have enabled rapid, routine and cost effective genotyping solutions for targeted marker assisted selection (MAS) on large effect QTLs or genes of interest (Ramkumar et al. 2015;Kurokawa et al. 2016). Genome-wide SNP genotyping combined with effective and precise sample -tracking, −collection, and -DNA extraction is a powerful tool that can reshape breeding programs and facilitates increasing gain from selection. Genome-wide SNP genotyping enables integrating targeted MAS and genomic selection (GS) approaches into different breeding strategies and contributes to increasing the efficiency of multiple other breeding activities (Chen et al. 2016), such as seed purity testing, pedigree verification, and varietal identification (Tian et al. 2015).
NGS and array-based technologies are the two dominant SNP detection systems for genome-wide genotyping. NGS methods, commonly termed genotyping-by-sequencing (GBS), range from whole genome re-sequencing (or skim sequencing) (Scheben et al. 2018) to reduced representation sequencing (RRS) (Elshire et al. 2011). GBS-type technologies have been applied to a range of crops, providing large data volume at a low cost per data point and are independent of prior genomic information, genome size, genome organization or ploidy (Elshire et al. 2011). In rice, GBS has been used to characterize bi-parental populations (Spindel et al. 2013;Arbelaez et al. 2015), multi-parent mapping populations (Bandillo et al. 2013), nested association mapping populations (Fragoso et al. 2017), and breeding populations Spindel et al. 2015). Although NGS platforms have the advantage of minimal ascertainment bias, they currently require complex experimental protocols, sophisticated data analysis, and bioinformatics pipelines to process raw sequence data into useful genotypic matrices. This added capacity cost currently limits the applicability of NGS platforms in many public breeding programs (Chen et al. 2014).
Array-based genotyping platforms provide highmedium-and low-density genome scans, robust highquality allele calling, easy handling, and simplified analysis to routinely generate genotypic datasets (Rasheed et al. 2017;Scheben et al. 2018). Their main disadvantages are lack of flexibility due to ascertainment bias, and, despite a reasonable low cost per data point, a comparatively high cost per sample (Rasheed et al. 2017). A number of low-, medium-, and high-resolution SNP arrays have been developed for rice and their utility was demonstrated across a range of applications. In the low and medium density range this includes, but is not limited to, the 384-plex BeadXpress (Chen et al. 2011), the GoldenGate 1536 SNPs ) and two Illumina Infinium-based 6 K arrays, the RiceSNP6K ) and the C6AIR . They have been used for diversity analysis, QTL mapping, marker assisted backcrossing (MABC) and pedigree verification among breeding lines. On the high density end, the 700 K High Density Rice Array (HDRA700K) , two 50 K arrays (RiceSNP50K and Affymatrix 50 K) (Chen et al. 2014;Singh et al. 2015) and a 44 K array (GeneChip Rice 44 K)  were deployed mainly for genome-wide association studies (GWAS) (Famoso et al. 2011;Crowell et al. 2016). These arrays were developed to be highly informative across diverse germplasm including different rice subpopulations and they were optimized to dissect (phylo-) genetic relationships and phenotype to genotype associations.
Less effort has been made in developing informative high-throughput, and cost effective genotyping solution specifically designed for applied breeding programs. Large-scale application in rice breeding with emphasis on population improvement strategies that integrate GS, necessitates routine genotyping of thousands of lines per season at the shortest turnaround time possible to make in-season decisions based on genomic estimated breeding values (GEBV). Globalized breeding programs require breeding populations to be evaluated at multiple locations with different planting dates leaving a very small window to sample, process and analyze genotypic data. The current complexity of GBS technologies and cost of array technologies are limiting in this context. GS approaches are increasingly being adopted to accelerate the rate of genetic improvement of key agriculturally important species, including rice Grenier et al. 2015;Monteverde et al. 2018). GS uses a 'training population' of individuals that have been both genotyped and phenotyped to develop a model that takes genotypic data from a 'candidate population' of untested individuals and produces GEBVs (Jannink et al. 2010). A major challenge in implementing GS in most plant breeding programs is the cost of genotyping per sample. The expected value of the information gained by genotyping must exceed the cost of obtaining the genotypes (Boichard et al. 2012). Studies that have evaluated the effect of the number of markers on GS accuracies in closely related breeding germplasm consistently show diminishing returns from increasing marker number (Rutkoski et al. 2013;Gorjanc et al. 2017;Raoul et al. 2017). Provided high levels of informativeness at each target locus, genotyping at relatively low density could be an effective way to reduce cost with minimal impact on GS accuracy leading to a greater return on investment from genotyping (Abed et al. 2018). Sequencing of multiplex PCR-based amplicons to capture high value SNPs may be an ideal low-density genotyping platform for GS applications, featuring low cost, robustness and scalability. Amplicon sequencing technologies assay a polymorphic panel of hundreds to a few thousand target SNP markers at the population scale with demonstrated applicability in phylogenetics (Dupuis et al. 2018), structure analysis (Andrews et al. 2016), and QTL mapping (Onda et al. 2018). Moreover, targeted amplicon sequencing effectively allows genotyping of small amounts of low quality DNA, even derived from dried herbarium samples (Beck and Semple 2015;Csernak et al. 2017). Therefore high throughput in-field sampling with minimal considerations on remote field stations is suitable. Amplicon panels are available for both the Illumina (Csernak et al. 2017) and the Ion torrent (Glotov et al. 2015) platforms and aim at interrogating allelic diversity at known loci of interest. The vast amount of genomic information available in rice, including a high quality reference genome (Matsumoto et al. 2005;Kawahara et al. 2013), 3010 re-sequenced varieties (Wang et al. 2018), de novo assemblies' within different rice subpopulations (Schatz et al. 2014;Duitama et al. 2015) and high-density genotypes of diversity panels (Huang et al. 2010;McCouch et al. 2016), constitutes an ideal resource to accurately design custom amplicon panels with carefully selected, highly informative SNPs uniformly distributed across the rice genome and tailored for specific breeding applications.
Here, we report the development of a 1000-SNP (1 K) Rice Custom Amplicon assay, or 1k-RiCA, for the costeffective amplification and sequencing of a thousand highly informative SNP sites in a 384-plex protocol that has wide applicability, good repeatability, high accuracy and high efficiency in genotyping breeding lines and populations derived from the indica subspecies of Oryza sativa. Furthermore we demonstrate applicability of 1k-RiCA for indica diversity studies, and genomic prediction for indica based breeding programs.

Plant materials
Rice accessions genotyped with the 1k-RiCA and used for analyses in this study are listed in Additional file 1: S1. A set of 700 samples from a panel of 283 diverse inbred Oryza sativa rice lines replicated at different levels was used to estimate the markers call rate, heterozygosity, repeatability, concordance properties. Among the 283 diverse accessions, 185 have been classified in different subpopulations according to structure analysis performed by Wang et al. (2018) and McCouch et al. (2016), and specific breeding germplasm source (IRRI irrigated indica breeding program) with 150 known indica lines, that grouped the subpopulations ind, ind1A, ind1B, ind2, ind3, and indx, 24 japonica, grouping the subpopulations trj, tej, temp, trop, and japx, 8 aus, 2 aromatic (subpopulation aro) and 1 admix. The remaining 98 unclassified lines came from a panel of 70 pigmented 'black rice' lines, and 28 were advanced lines from different rice breeding programs. Consensus genotypes generated for the 283 accessions were used for principal component (PCA). An additional PCA analysis solely on O. sativa sp. indica lines was done with 177 diverse indica rice accessions, 41 indica 'black rice' accessions and 213 elite recombinant inbred lines (RILs) derived from 11 elite-by-elite indica x indica bi-parental populations from the International Rice Research Institute (IRRI) Favorable Environments Breeding Program (FEBP). A set of 57 F 1 plants from six families derived from indica × indica, and indica × japonica crosses were used to test the accuracy of the 1k-RiCA to call heterozygous genotypes. The value of the 1k-RiCA for Genomic Selection (GS) was tested using 353 indica elite breeding lines derived from 30 elite-by-elite indica × indica bi-parental populations from IRRI's FEBP in a series of cross-validation experiments.

Design of the 1k-RiCA SNP assay
The 1K Rice Custom Amplicon assay or 1k-RiCA was designed on Illumina's TruSeq Custom Amplicon (TSCA) 384 Index Kit technology (https://www.illumina. com) using Illumina's proprietary workflow. Initially 1, 554 genome-wide SNPs and 28 markers associated with highly valuable traits were provided to Illumina for an in-silico testing. Of the original 1,582 SNPs supplied to Illumina, 967 uniformly distributed genome-wide and 28 trait markers were retained in the 1k-RiCA after the iterative TruSeq custom amplicon design and validation process. The 967 genome-wide SNPs were selected from two publically available resources, the Cornell_6K_ Array_Infinium_Rice or C6AIR chip  and the 3,000 rice genomes Mansueto et al. 2017;Wang et al. 2018). The 604 markers from the C6AIR data set were selected based on their high call rates (> 95%) and high minor allele frequencies (MAF ≥ 0.4) determined from genotypic data available on 1,172 IRRI indica rice breeding lines and indica released varieties genotyped with the C6AIR. The remaining 363 markers from the 3,000 rice genomes were selected to fill physical distance gaps not captured by the C6AIR's SNPs and filtered for high call rates (> 95%) and high minor allele frequencies (MAF ≥ 0.4) estimated across 1,174 indica landrace accessions and cultivated indica varieties sequenced within the 3,000 rice genomes dataset Mansueto et al. 2017, andWang et al. 2018).
The 28 trait-related markers linked to 11 different important trait associated genes/QTLs were obtained from functional markers or markers demonstrated to be linked and associated with the respective gene/QTL as reported in the literature. These markers include one marker for gelatinization temperature (GT); starch synthase IIa or alk (Gao 2003, andBao et al. 2006), three for apparent amylose content (AAC) associated with the Waxy gene alleles wx, Wx t , Wx g1 , Wx g2 , and Wx g3 (Dobo et al. 2010;Teng et al. 2017

Genotyping and SNP calling
Genomic DNA (gDNA) was extracted from leaf tissue of single plants using methodologies described in Dilla-Ermita et al. (2017) and based on either using CTAB (Murray and Thompson 1980), or KingFisher SBEadex kits (https://www.thermofisher.com). DNA quality was checked visually on 1% agarose gel, while DNA quantity was assessed using PicoGreen® (https://www.biotek.com), and Qubit 2.0 (https://www.thermofisher.com) fluorometric kits. The concentration of DNA was adjusted to be close to 10 ng/μL for library preparation. 384-plex indexing and pooling was performed as instructed by the manufacturer (https://www.illumina.com). Sequencing was performed using the MiSeq Sequencing-by-Synthesis Technology System as specified by illumina® (https:// www.illumina.com). A custom SNP-calling pipeline described in the Additional file 1: S2, was used to assign variants on the 1k-RiCA amplicons through alignment to the Nipponbare rice genome MSU7 version (Kawahara et al. 2013). Final SNP data were merged with SNP map information and encoded with the physical position and chromosome number of the SNP markers in a Hapmap format (International T, Consortium H 2003).

SNP filtering, repeatability, concordance and imputation
SNPs were removed if minor allele frequency (MAF) ≤ 0.01; heterozygous calls ≥10%; and call rate (CR) ≤ 75% using customs scripts written in R version 3.5.0 (R Core Team 2018) and deposited in Github (https://github. com/jdavelez/1k-RiCA-geno-filters/blob/master/jdave-lez_1k-RiCA.R). For each SNP, heterozygosity was determined as the proportion of heterozygous calls among all successfully called genotypes. SNP call rate was defined as the proportion of successfully called genotypes among all samples used in the study. Repeatability, or the degree of consistent genotype calls between independent samples from the same accession, was calculated among 38 different accessions that had 4 or more independent replicates as R = 100 − e l , where 100 is the maximum value expressed in percentage of consistent genotype calls between independent replicates, minus the mean error rate per locus or e l described by Pompanon et al. (2005) and measured as the ratio between the number of single-locus genotypes with at least one allelic mismatch (m l ) and the number of replicated single-locus genotyped (n l ) compared to a reference genotype (e l = m l /n l ), averaged across all replicated accessions. Concordance rate or the degree of consistent genotype calls from common SNPs assayed in two different genotyping platforms for the same accession was measured as the proportion of exact matched genotypes between common SNPs genotyped using two different genotypic platforms, 1k-RiCA versus C6AIR or 1k-RiCA versus 3,000 genomes, in the same accessions. For further GS analysis, the SNP filtered data was imputed in TASSEL v5.0 (http://www.maizegenetics.net/tassel) (Bradbury et al. 2007) using the LD KNNi imputation methodology with a High LD Sites 30 and Number of nearest neighbors of 30 using a LinkImpute algorithm (Money et al. 2015).

Hierarchical clustering and principal component analysis
A hierarchical clustering analysis using Ward's minimum variance method (Sokal and Michener 1958;Murtagh and Legendre 2014) was done using the R version 3.5.0 function 'hclust' (Murtagh and Legendre 2014; R Core Team 2018) where Ward's clustering criteria is implemented and the dissimilarities are squared before cluster updating. A dendogram graph was built in R using the function 'plot (asphylo())' (R Core Team 2018). A principal component analysis (PCA) (Pearson 1901) was performed and visualized using the R function 'prcomp' (Mardia et al. 1979, and R Core Team 2018). The numbers of optimal clusters (k-means) observed in the PCA analysis was determined using the Silhouette method (Rousseeuw 1987) using the R function 'silhouette' from the R package 'cluster' (Maechler et al. 2013).

Trait markers quality control evaluation
The ability of the 1k-RiCA trait markers to correctly identify the samples with the desired and undesired alleles was determined using the SNP Quality Control methods and variables described by Platten et al. (2019). The variables used were: i) 'Utility' described by Platten et al. (2019) as the "proportion (percentage) of a prospective breeding pool across which a marker could be used to introgress a QTL. This is equivalent to the proportion of the pool which does NOT carry the donor allele of a marker", calculated as: #cultivars withOUT favorable allele/Total # cultivars assesed, ii) 'False Positive Rate' ('FPR'), or "the proportion known negative genotypes incorrectly classified as having the target QTL allele. Assayed as the number of known recipients identified as not having an unfavorable allele of the marker (and thus incorrectly classified as having the target QTL allele)", calculated as: #recipients withOUT unfavorable allele/Total # recipients, and iii) 'False Negative Rate' ('FNR') or "the converse of FPR, the proportion of known target QTL genotypes incorrectly classified as not having the desired QTL allele due to not having a favorable allele of the marker", calculated as: #donor with-OUT favorable allele/Total # donors. Utility, FPR and FNR were measured and analyzed for each individual trait marker and/or trait haplotypes for those traits with more than one molecular marker associated with them.

Genomic selection
Three hundred fifty-three elite indica breeding lines from IRRI's Favorable environment Breeding Program (FEBP) and 6 different agronomical checks were selected for genotyping with 965 polymorphic markers from the1k-RiCA. These lines were derived from 30 different bi-parental families of sizes varying from 1 to 36 individuals, with an average of 12 plants per family. Phenotyping of these lines took place at IRRI's -Los Baños experimental station during the 2017 wet-season (WS) and 2018 dry-season (DS), and PhilRice's Nueva Ecija experimental station during 2017 WS. The lines were phenotyped using an augmented p-rep design (Williams et al. 2011) with a replication of 1.2, for total 430 plots evaluated on each yield trial. Plot sizes were of 6.48 m2 (6 rows × 27 hills) in Los Baños, and 5.4 m2 (5 rows × 27 hills) in Nueva Ecija.
The target traits evaluated were days to flowering ('FLW'), grain yield ('GY'), and plant height ('PH'). FLW was recorded as the number of days after sowing, when 50% of the plants in the plot produced flowers. GY was estimated from a 3.12 to 5 m 2 plot harvested and weighed and corrected for moisture content using the formula: Þ Â ð0:01Þ . From this sample the grain yield per hectare was calculated. PH was the actual measurement in cm from soil to the tip of the tallest panicle (International Rice Research Institute 1996).
Mixed linear models using the function lmer() contained in the R package 'lme4' (Bates et al. 2014) were used to estimate BLUEs (Best Linear Unbiased Estimate) and BLUPs (Best Linear Unbiased Predictor) for all traits. Restricted Maximum Likelihood Estimation or RMEL method was used to estimate the variance components by setting the argument RMEL = TRUE in the lmer() function. The model was fit according to: Y ijk = μ + g i + t j + r(t) jk + e ijk . Where Y ijk is the phenotypic observation on genotype i, in the trial j and in replicate k, μ, the overall mean, g i , the genotype effect, t j , the trial effect, r(t) jk , is the replicate within trial effect, and e ijk the residual. To obtain BLUPs for the genotypes, except for the overall mean, all the effects were considered random. To obtain BLUEs for the genotypes, the overall mean and genotypes were considered as fixed effects. Adjusted means of accessions (BLUEs) were extracted for each trait to be used as phenotypes in the genomic prediction models.
Broad sense heritability of accession means, H 2 , was calculated for each trait using the formula of Hallauer et al.
(2010) as follows: . Where t represents the mean number of trials in which accessions were tested and r, the mean number of plots per accessions across trials. Genotypes were assumed independent and identical distributed for estimating H 2 . Variance components were estimated from a linear mixed model using 'lmer4' and the model Y ijk = μ + g i + t j + r(t) jk + e ijk , where Y ijk is the phenotypic observation on genotype i, in the trial j and in replicate k, μ, the overall mean, g i , the genotype effect, t j , the trial effect, r(t) jk , is the replicate within trial effect, and e ijk the residual. The argument RMEL in the lmer() function was set TRUE to estimate the variance components.
For the cross-validation studies six different genomic selection models were used to estimate genomic estimated breeding values (GEBVs) including ridge regression (Endelman 2011) and five Bayesian models; BayesA (scaled-t), BayesB (gaussian mixture), BayesC (scaled-t mixture) (Meuwissen et al. 2001;Habier et al. 2011), Bayesian Lasso (BL) (Park and Casella 2008), and Reproducing Kernel Hilbert Spaces Regressions fitting the markers and pedigree relationships (RKHS G + A) as random effects (Pérez and de los Campos 2014). The model RKHS G + A implements a reproducing kernel Hilbert space (Wahba 1990) regression fitting two random effects, one representing a regression on pedigree, a $ Nð0; Aσ 2 a Þ, where A is a pedigree-derived relationship matrix, and one representing a linear regression on markers, g $ Nð0; Gσ 2 gu Þ where, G is a marker-derived genomic relationship matrix. The ridge regression model was tested using the R package 'rrBLUP' (Endelman 2011). The Bayesian models were implemented using the R package 'BGLR' (Pérez and de los Campos 2014) and the default prior parameters described in Pérez and de los Campos (2014) with a thinning value of 5, and 12,000 iterations with the first 2000 iterations discarded as burn-in. Trace plots as described by Pérez et al. (2010) were used to visually check conversion for some models selected at random. The samples residual variance data given by BGLR outputs can be plotted using the following script; plot (scan("varE.dat"), type = "o"), where "varE.dat" is a vector of the residual variance. Additionally pedigree-BLUPs (Henderson 1975) using the pedigree relationship matrix were estimated to compare the performance of genomic selection models.
A 5-fold (k = 5) cross validation experiment using 4/5 of the 353 lines as the training set to predict the remaining 1/ 5 of the validation set was used. Each cross validation was repeated 10 times using 10 independent partitioning of the accessions into the training set and validation set. The presence of highly related individuals in the dataset could have the effect of artificially inflating prediction abilities if the closest individuals are randomly assigned to different folds, and one of those folds are used a training. To control for this possibility a stratified cross validation strategy was used when designing the different folds by sampling individuals randomly within families defined using the pedigree information of the lines. The accuracy of each cross validation experiment was computed as the mean value of the 10 Pearson correlations (Pearson 1901) between the observations and the cross-validated GEBVs, also known as the predictive ability (Heslot et al. 2012).

1k-RiCA SNP assay design
The 1k-RiCA was explicitly designed to be informative for Oryza sativa L. ssp. indica rice germplasm (see Materials And Methods). Out of the total 995 SNPs included and amplified in the 1k-RiCA, 604 markers were made up from the C6AIR , 363 markers from the '3000 rice genomes' (Mansueto et al. 2017, andWang et al. 2018), and 28 markers linked to 11 different 'high-valued' trait genes/QTLs ( Fig. 1 and Additional file 1: S3). Of the 995 SNPs, 482 markers localized within MSUv7 gene models (http://rice.plantbiology.msu.edu), and 513 were located within intergenic regions (Additional file 1: S3). The average physical distance between two adjacent markers across the whole genome in the 1k-RiCA set was 372 kb, or~1.524 cM (SD = 1.2 cM), with 1 cM equal to~244 kb (Chen et al. 2002). More than 50% of the markers are spaced from each other at a distance of 293 kb (~1.2 cM) or less (Additional file 2: Figure S1). The median SNP minor allele frequency (MAF) estimated from 1k-RiCA genotypic data on 431 indica accessions was 0.36 with 50% of the markers having MAF between 0.28 and 0.44 (Additional file 2: Figure S2).
The utility and accuracy of 28 markers associated with 11 different traits and designed for MAS strategies were  (Mansueto et al. 2017, andWang et al. 2018), and trait-markers are in green evaluated using different Quality Control (QC) parameters described by Platten et al. (2019). Based on the results of the 'Utility' , 'False Positive Rates' ('FPR'), and 'False Negative Rates' ('FNR') QC parameter 21 SNPs were selected either to be used individually or as a haplotype for MAS for 11 different traits. A detailed description of the analysis results and the allelic interpretation for the selected SNPs is presented in Table 1. Individual SNPs for the traits associated with the loci GS3, xa5, Xa4, Xa21, rstv, and ALK are suitable for MAS. In addition, two or more markers associated with the loci Xa7, xa13, Xa23, sub1, and Wx for apparent amylose content (ACC) can be used as haplotypes for MAS applications (Table 1). More specifically, in the case of AAC, the haplotypes from the SNPs chr06:1.766, chr06:1.768, and chr06:1.769 can be used to differentiate high (

G-A-C / G-A-T), intermediate (G-C-C), and low (T-A-C / T-C-C) ACC (Dobo et al. 2010).
SNP call, heterozygosity, repeatability and concordance rate of the 1k-RiCA SNP call rates and heterozygosity were empirically determined in the 1k-RiCA using 700 independent DNA samples derived from 283 partially replicated rice accessions. The mean call rate across all SNPs was 95% (or 0.95) (Additional file 2: Figure S3), and the mean heterozygosity observed among SNPs was of 1.5% (or 0.015) (Additional file 2: Figure S4). After removing 97 SNPs with less than 75% call rates and 5 more SNPs with heterozygosity values higher than 10% a total of 895 SNPs were kept for subsequent analysis in this set of samples.
The 1k-RiCA marker repeatability was measured on 38 different accessions replicated 4 or more times (Additional file 2: Figure S5). The average SNP repeatability observed on the replicated accessions genotyped with the 1k-RiCA was of 99% (Additional file 2: Figure S6). A thorough look at the 1% average genotyping mismatches showed a bias on miscalled heterozygous based on the inbred nature of these lines, accounting for 0.7% of the observed 1% mismatches. After heterozygous calls were removed and imputed the mean repeatability increased to 99.7% (Additional file 2: Figure S6). SNP concordance rate between the 1k-RiCA and Cornell's C6AIR, and the 1k-RiCA with the '3000 rice genomes' were estimated and averaged across overlapping SNPs in commonly genotyped samples. To compare the 1k-RiCA and the C6AIR platform calls, a set of 600 overlapping SNPs across 34 different accessions were genotyped with both platforms. Concordance values across samples ranged between 96.5 and 100% with an average of 99.3%. Concordance of 271 overlapping SNPs between 1k-RiCA and the '3000 rice genomes' data set was assessed across 10 different accessions and ranged from 97.7 to 100% with a mean of 99.17%.
The SNP technical quality properties of the 1k-RiCA were not affected when two different DNA extraction protocols, a modified CTAB (Murray and Thompson 1980) and a King-Fisher Kit (http://www.thermofisher.com) were tested in this study. The average SNP concordance rate between samples extracted with CTAB and King-Fisher Kit was 99.51% (Additional file 2: Figure S7).

Principal component analysis in O. sativa
A PCA using the 1k-RiCA was performed on 283 accessions, with 150 known indica lines (ind), 24 japonica, 8 aus, 2 aromatic (aro), 1 admixture (admix), and 98 undetermined (und) rice lines. The first principal component (PC1) explained 57% of the total genetic variation and separated the indica, aus, and the japonica varietals (jap, temp, trop and aro) accessions (Fig. 2). The second PC (PC2) explained~20% of the total genetic variation and differentiated the aus from the japonica varietals. In addition, PC1 and PC2 captured a great portion of the variation within the indica accessions (Fig. 2). The optimal number of clusters estimated using Silhouette method identified 3 groups based on the genetic variance explained by the 1k-RiCA (Additional file 2: Figure  S8A). One group contained the Japonica and aromatic lines (jap, temp, trop, and aro), the second group clustered aus, and indica lines (most of indica landraces, and indica "black rice") and the final group contained most of the released and elite indica lines (Additional file 2: Figure S8B).

Principal component analysis within indica lines
To further determine the ability of the 1k-RiCA to assess the diversity within 'Indicas' , a PCA was performed using 431 'Indica' samples consisting of 177 diverse accessions, 41 'black rice' accessions classified as 'Indica' , and 213 lines derived from seven different bi-parental families developed from crosses between elite 'Indica' rice lines (Fig. 3). The 'Indica' accessions were well distributed across the first and second principal components (Fig. 3). The 'black rice' samples were clearly separated from the bi-parental families in the first PC that explained 36% total genetic variance. 'Black rice' accessions grouped exclusively in the upper left corner of the PC1 vs. PC2 scatter plot. This cluster was mainly composed of 'Indica' landraces (Additional file 2: Table S2) while the bi-parental populations clustered in the opposite side of the first PC where most of the modern IRRI breeding lines were located (Fig. 3, and Additional file 2: Table S2). Within the bi-parental families the 1k-RiCA further distinguished the structure between the half-sibs families Family-1, Family-4, Family-5 and Family-7 that grouped closer to each other than the other families Family-3, and Family-6 ( Fig. 3).

Polymorphism rates across pairwise combinations
Genotypic data generated from 275 accessions genotyped with the 1k-RiCA representing the two main rice sub-groups employed in rice breeding programs, 'Indica' (177 diverse lines and 41 'black rice' 'Indica' accessions) and 'Japonica' (57 accessions), were used in pairwise comparisons to assess the number of polymorphic SNPs within each group and across groups. Within the 218 'Indica' accessions the mean number of polymorphic SNPs across all pairwise combinations was 395 with 50% of the values ranging between 354 and 433 polymorphic SNPs (Fig. 4A). The average median gap between two SNP pairs across all possible cross combinations between the 'Indica' accessions was 0.542 Mbp ranging between 0.01 Mbp to 15.55 Mbp. Within the 'Japonica' group the median number of polymorphic SNPs was 97 ( Fig. 4B) with 50% of the values ranging from 79 to 114. The average median gap between the 'Japonica' accessions was 1.61 Mbp ranging from 0.08 Mbp to 25.68 Mbp. When pairwise combinations were estimated for interspecific crosses between 'Indica' and 'Japonica' accessions the median number of polymorphic SNPs observed was 465 (Fig. 4C). The average median gap between two SNP pairs across all possible cross combinations between the

F 1 genotypes
For each bi-parental family, F 1 -plant genotypes were compared to the 'predicted-F 1 ' progeny genotype. Across the 55 F 1 plants the average percentage of similarity with the 'predicted-F 1 ' was 99.22%, ranging from 97 to 99.87% (Additional file 2: Figure S10).

Genomic selection
Phenotypic data on the three traits used for cross validation of GS models, flowering time (FLW), grain yield (GY), and plant height (PH), exhibited a Gaussian distribution (Additional file 2: Figure S11). Broad sense heritability of accessions means, H 2 , was higher for PH (0.8), follow by FLW (0.85), while GY (0.5) had the lowest H 2 . Average selection predictive ability across 21 cross validation studies involving six different genomic selection models and one pedigree BLUP model on three traits were estimated. The predictive ability across all eight models ranged from 0.69 to 0.73 for FLW, from 0.27 to 0.38 for GY, and from 0.63 to 0.66 for PH. The genomic selection model RKHS that included the marker derived genomic and the pedigree derived relationship matrixes (G + A) had the highest predictive ability for GY and PH, while Bayes. A had the highest predictive ability for FLW (Fig. 5). The pedigree BLUP estimation method had the lowest predictive ability for FLW and GY while ridge regression had the lowest predictive ability for PH (Fig. 5).

Discussion
The 995 SNPs on the final 1k-RiCA assay are uniformly distributed across the rice genome. The physical length of the Nipponbare reference genome is 373,245, 519 bp (http://rice.plantbiology.msu.edu) or 1,529.7 cM, with 1 cM equivalent to 244 kb (Chen et al. 2002). If 995 markers were uniformly distributed across all chromosomes they would be on average 1.53 cM apart. This value does not differ from the empirical estimations for the 1k-RiCA with a mean distance of 1.524 cM between adjacent markers. Evident gaps, or regions without a marker were found in the centromeric region of chromosomes 2 and 7, the long arm of chromosomes 6 and 9, as well as the telomeric regions of chromosomes 11 and 12. These few regions where markers are less uniformly distributed could be addressed in the future since this technology allows the introduction of additional target loci when new kits are designed. Uniformly distributed SNP sets have been reported to be useful for breeding applications to develop interspecific populations (Orjuela et al. 2010), conduct QTL analysis, characterize the genetic structure of rice populations , and implement genomic selection (Habier et al. 2009). Equidistant distribution maximizes detection ability of recombination events with the given marker density and minimizes distance to causal trait-contributing polymorphisms for QTL detection.
As opposed to classical genotyping-by-sequencing (GBS) approaches (Elshire et al. 2011) the 1k-RiCA consistently scores the same set of SNPs with SNP call rates average of 95%, resulting in identical genotype matrixes, which facilitates analyses across multiple runs without further bioinformatics. While array based technologies provide similar consistencies at higher densities, they are significantly more expensive to run at a significantly lower throughput making the 1k-RiCA more suitable for high throughput breeding applications that rely on fast turnaround time for decision making and benefit from a lower per-sample cost at the sacrifice of SNP density.
Markers from the 1k-RiCA showed high levels of repeatability across independently genotyped samples (> 99%) and high concordance rates with the respective SNP alleles reported for the same accession within the C6AIR and 3000 rice genomes datasets. The 1% genotypic mismatches observed in the repeatability analysis was to a large portion due to miscalled heterozygous loci in inbred lines and through imputation this source of error was reduced to 0.3% increasing repeatability to 99.7%. Comparison of SNP calls between 1k-RiCA and two different platforms, C6AIR and 3000 rice genomes showed on average values of 99% accuracy. These empirical results demonstrated the high levels of repeatability, accuracy, and robustness of the 1k-RiCA, which are comparable to those reported in TruSeq Amplicon panels for cancer clinical testing (Simen et al. 2015;Misyura et al. 2016), and other genotyping platforms, such as the C6AIR in rice , and the CottonSNP80K in cotton (Cai et al. 2017).
Quality of the 1k-RiCA did not differ when two different DNA extraction protocols were compared. The cheap and 'quick and dirty' manual CTAB extraction method and semi-automated magnetic bead-based DNA extraction and purification yielded similar results, giving the 1k-RiCA an additional advantage over reduced-representation sequencing genotyping approaches which necessitate the use of specific restriction enzymes, that tend to be sensitive to contaminants often carried over by precipitation-based extraction methods such as CTAB. The highly purified DNA preparations using column-or magnetic-based systems required for reduced-representation sequencing genotyping approaches, however, are significantly more expensive to obtain and often not available to resource-limited breeding programs.
The 1k-RiCA adequately captured the diversity present in 'Indica' accessions. It was able to differentiate within and between 'Indica' landraces and breeding populations (Fig. 3). It did not, however, adequately differentiate between and within the aromatic, temperate, and tropicaljaponica subgroups (Fig. 2).
These limitations stem from the original design and purpose of this platform, which was to capture the diversity of 'Indica' lines of tropical rice breeding programs across South Asia and South East Asia, including IRRI's FEPB. By filtering for high MAF within the 'Indica' subgroup, many of the selected SNPs displayed low MAF or even monomorphism within 'Japonica' subgroups and hence did not contribute to power of discrimination. The low average number of polymorphic rates between 'Japonica' and 'Japonica' comparisons (97) makes the 1k-RiCA unsuitable for temperate and tropical japonica breeding programs or diversity analyses. On the other hand the 1k-RiCA showed similar average numbers of polymorphic SNPs in pairwise combinations between 'Indica' and 'Indica' (395) and between 'Indica' and 'Japonica' (465) (Fig. 4), making it suitable for the analysis of intra-specific breeding populations and inter-specific populations between 'Indica' and 'Japonica' accessions.
The genotyping results of the F 1 indica × indica, and indica × japonica plants showed the 1k-RiCA accurately called heterozygous genotypes. Considering that the mean repeatability estimated in the study is about 99%, the 1% of dissimilarities between predicted and F 1 calls can be explained by genotypic errors of the 1k-RiCA. The ability to robustly call heterozygous genotypes makes the 1k-RiCA suitable for genotyping segregating populations, marker assisted backcrossing (MABC) and genomic prediction in segregating populations.
A major challenge in implementing GS in public rice programs is the cost associated with genotyping. The expected value of the information gained by genotyping must exceed the cost of obtaining genotype information (Boichard et al. 2012). The most straightforward approach to reduce persample genotyping cost is by reducing SNP density and increasing multi-plexing of samples per NGS run to a point that does not jeopardize prediction accuracies.
Testing the 1k-RiCA data as genotypic input for genomic prediction in 21 cross-validation experiments using six different GS, and one pedigree model demonstrated its suitability for predicting complex traits such as flowering time (FLW), grain yield (GY), and plant height (PH). The genomic selection prediction abilities for FLW (0.69-0.73), GY (0.27-0.38) and PH (0.63-0.66) were comparable to those reported by Spindel et al. (2015) for the same traits (FLW = 0.63, GY = 0.31, PH = 0.34), using 73,147 SNP markers, and rice materials from the same breeding program. Furthermore, the observation that the RKHS G + A model was more accurate than the pedigree BLUP model indicates that the markers are effective in capturing the variation among relatives due to Mendelian sampling, which is key for being able to select within families effectively based on prediction. Spindel et al. (2015) suggested that using~1 SNP every 0.2 cM (~6 K SNPs) could be ideal for performing selection in inbred rice breeding populations. Grenier et al. (2015) estimated that in rice with a map of 18 Morgans, and effective population size (Ne) of~50, about 3,600 SNPs would be needed under an infinitesimal model with additive effects and under the assumptions of evenly distributed QTLs on the chromosome for genomic prediction purposes. However, in their empirical confirmation study using a cross validation analysis in upland rice, the greatest accuracies were achieved with a matrix size of 1,700 SNPs, suggesting that the assumptions presented in the simulation study did not necessarily apply. Furthermore a GS optimization study in wheat has shown that 1,000 marker were enough to reach the highest predictive ability for GS in a breeding program (Cericola et al. 2017). Similar optimization results were also obtained for a GS study in barley where a minimum marker set of 1,000 was found to be necessary in order to decrease the risk of low prediction accuracies (Nielsen et al. 2016). The predictive abilities obtained in this study suggest that the marker density for the 1k-RiCA may be sufficient and currently cost-effective for the application of GS in elite rice indica germplasm.
Alternatively the integration of imputation in the 1k-RiCA could increase genomic prediction accuracies if high density genotypic information is generated from the parental material of GS tested population. Increases in prediction accuracies using this approach have been observed in simulation studies (Gorjanc et al. 2017) and empirical studies in cattle (Wang et al. 2016), salmon (Tsai et al. 2017), and rapeseed (Werner et al. 2018).
The introduction of 21 trait markers associated with 11 different traits of agronomic importance adds to the utility of the 1k-RiCA. While it would not be cost effective to run the 1k-RiCA solely for the trait information (single SNP assay chemistry such as KASP would be cheaper), it enables the application of marker assisted selection (MAS) and trait profiling in conjunction with fingerprinting. In breeding programs this facilitates the enrichment of favorable alleles, while in diversity-type analyses it allows for the assessment of presence/absence of a range of traits. While some of the markers are diagnostic and can be used individually others are only linked to the causal polymorphism and are best used in combination as haplotypes. The use of haplotypes adds to robustness, since single linked markers might not be predictive in unknown backgrounds, where linkage may be broken.
Apart rom direct use in MAS, the presence of these markers opens the possibility of refining genome wide prediction models. Using trait marker information as fixed effect parameters has the potential to increase selection accuracies as reported by Rutkoski et al. (2014) for adult plant resistance to stem rust in wheat, by Spindel et al. (2015) for rice plant height and by Lopes et al. (2017) in livestock. The 1k-RiCA can be used efficiently by combining these two different molecular breeding approaches for traits associated with bacterial leaf blight, grain physical and chemical quality traits, submergence tolerance and other biotic stresses.

Additional files
Additional file 1: S1. List of accessions used in this study. S2. SNP call pipelined used to identify genotypes in the 1k-RiCA. Additional file 2: Figure S1. Distribution of physical distance gaps between adjacent SNPs in the 1k-RiCA. Figure S2. 1k-RiCA SNP minor allele frequency (MAF) estimated on 431 indica lines. Figure S3. 1k-RiCA SNP call rate distribution. Figure S4. 1k-RiCA SNP heterozygosity distribution. Figure  S5. Hierarchical cluster analysis estimated in 38 highly replicated accessions using the 1k-RiCA set. Figure S6. 1k-RiCA average SNP repeatability. Figure  S7. 1k-RiCA SNP concordance rate distribution between CTAB and King-Fisher DNA extracted samples. Figure S8. 1k-RiCA Oryza sativa PCA optimal clustering and Silhouette width test. Figure S9. 1k-RiCA Oryza sativa PCA classification. Figure S10. 1k-RiCA distribution of F 1 percentage of similarity between true F 1 and predicted F 1 . Figure S11. Phenotypic distribution for flowering time (FLW), grain yield (GY), and plant height (PH). Table S1. Cross-reference identification for 12 'undermined' rice accessions classified using PC coordinates. Table S2