Genotypic Data
Single nucleotide polymorphisms (SNPs) from 163 high-depth re-sequenced O. glaberrima accessions were used in this study (Cubry et al. 2018). SNPs were identified by mapping to the Oryza sativa japonica cv. Nipponbare reference genome, which has a high-quality assembly and annotation (Kawahara et al. 2013). The bioinformatic mapping pipeline, software and SNP filtering steps are described in Cubry et al. (2018).
SNPs with more than 5% missing data (a minor fraction of the total SNP set) were filtered out (Cubry et al. 2018). As missing data can reduce the power of association studies (Browning 2008; Marchini and Howie 2010), we imputed the remaining missing data with a matrix factorization approach, using the “impute” function from the R package LEA (Frichot and François 2015). This approach uses the results of ancestry estimation from a sparse non-negative matrix factorization (sNMF) analysis to infer missing genotypes (Frichot et al. 2014). In sNMF, we set K to infer four clusters and kept the best of 10 runs based on the cross-entropy criterion.
Phenotyping of Flowering Time and Panicle Morphology
Phenotyping of flowering time and panicle morphology was performed near Banfora (Burkina Faso) under irrigated field conditions at the Institut de l’Environnement et de Recherches Agricoles (INERA) station in 2012 and 2014. Plants were sown at two different periods in the same year: the first at the beginning of June (“early sowing”) and the second in mid-July (“late sowing”). A total of 15 plants were grown per 0.5 m² plot. The field trials followed an alpha-lattice design with two replicates (Patterson and Williams 1976) per sowing date per year. Each block included 19 accessions (i.e. 19 plots). In total, 87 O. glaberrima accessions were planted in 2012 and 155 in 2014.
Flowering date (DFT) was scored when 50% of the plants of a given accession bore heading panicles, for both early and late sowings in 2012 and 2014 (Table 1). Fourteen days after heading date, the three main panicles from three central plants per plot per replicate were collected (i.e. nine panicles/accession/replicate) from the early sowing only, in both years. Each panicle was fixed on a white paper board, photographed and phenotyped using the P-TRAP software, which quantifies eight morphological traits (AL-Tam et al. 2013) (see Table 1). All statistical analyses of the dataset were performed using the R (R Core Team 2020) packages ade4 (Dray and Dufour 2007) and corrplot (Wei and Simko 2017) as described in Ta et al. (2018).
RYMV Resistance Phenotyping
Resistance was evaluated by ELISA performed on infected plants grown in the greenhouse under controlled conditions. As high resistance to RYMV has already been well studied in African rice, we excluded highly resistant accessions, i.e. those in which no virus can be detected with ELISA (Pidon et al. 2020), and focused only on quantitative resistance. We therefore assessed resistance on a set of 125 accessions. Two varieties were used as susceptibility controls, IR64 (O. sativa ssp. indica) and Nipponbare (O. sativa ssp. japonica), and one as a high resistance control, Tog5681 (O. glaberrima). Three replicate experiments including all varieties were performed. In each experiment, plants were organized in two complete blocks with four plant replicates per accession.
Plants were mechanically inoculated 3 weeks after sowing with the CI4 isolate of RYMV (Pinel et al. 2000). A single batch of inoculum was prepared for all replicates, and plants were inoculated with a needleless syringe at two points at the base of the last emerged leaf. Four discs of 4 mm diameter were cut from the last emerged leaf of each plant 17 and 20 days after inoculation (dai), and discs from the four plants of the same block were pooled. Samples were ground with a QIAGEN TissueLyser II bead mill and resuspended in 750 μL 1X PBST (phosphate-buffered saline with Tween 20). Virus content was estimated by DAS-ELISA (Pinel-Galzi et al. 2018). Preliminary tests on a subset of samples were performed to identify the dilution that best discriminated between samples. ELISA tests were then performed at dilutions of 1/1000 for the 17 dai sampling date and 1/2500 for the 20 dai sampling date. Optical density values were normalized against a standard range of virus dilutions loaded on each ELISA plate in order to correct for a putative plate effect, and the measures from the two blocks were averaged within each replicate experiment. As virus content was highly correlated between 17 and 20 dai (R² = 0.81), the resistance level was estimated as the mean of the two sampling dates. The resulting variables were named RYMV1, RYMV2 and RYMV3 for the three experiments (Table 1).
Environmental Variables
For accessions with geographical sampling coordinates, we retrieved 19 climate-related variables (referred to here as bio) from the WORLDCLIM database at a 2.5 min resolution (Hijmans et al. 2005). We also retrieved the average monthly maximum temperature (referred to here as Tmax). We first performed a Principal Component Analysis (PCA) on each set of variables to build uncorrelated composite variables. PCAs were performed using the R package LEA (Frichot and François 2015). Association studies were performed using the first two components of each PCA (Table 1).
Treatment of Phenotypic and Environmental Variables
For each variable (Table 1; Additional file 1: Table S1a), we plotted a histogram of the trait distribution as well as a quantile-quantile plot to visually assess normality (Additional file 3: Fig. S1). We additionally performed two tests of normality, the Shapiro-Wilk and the Anderson-Darling tests (Additional file 2: Table S2a). These analyses were made using the base graphics and nortest (Gross and Ligges 2015) packages for R.
As some of the variables did not fit a normal distribution, we applied a Box-Cox transformation to the data to approximate normality (Table 1; Additional file 1: Table S1b). For this transformation, we used the forecast package for R (Hyndman and Khandakar 2008).
The Box-Cox transformation is defined as follows:
$$ B\left(x,\lambda \right)=\frac{x^{\lambda }-1}{\lambda }\ \mathrm{if}\ \lambda \ne 0\ \mathrm{and}\ B\left(x,0\right)=\log (x)\ \mathrm{if}\ \lambda =0 $$
We estimated the λ parameter of the transformation using the BoxCox.lambda() function with the “loglik” method (i.e. a maximum log-likelihood approach). We then applied the transformation with the estimated λ using the BoxCox() function. As some variables (typically the environmental variables) had negative values that would prevent the use of the transformation, we translated the data whenever negative values occurred in a variable, using the formula f(x) = x + 1 − min(x), prior to applying the Box-Cox transformation. The histograms, quantile-quantile plots and normality tests were then repeated for the resulting transformed variables (Additional file 2: Table S2b and Additional file 3: Fig. S1).
Apart from climate-related variables, each trait resulted from the combination of at least two repetitions. If one of the repetitions failed the normality test, we used the transformed dataset for all repetitions. For climate variables, we used the transformation whenever the variable failed the normality test.
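The translation and transformation steps can be illustrated with a minimal Python sketch. It is not the forecast R code used in this study: the function names, the grid bounds and the grid step are assumptions made for illustration, and λ is estimated here by a simple grid search on the profile log-likelihood of a normal model with a constant mean.

```python
import math

def shift_if_negative(values):
    """Translate data with f(x) = x + 1 - min(x) when any value is negative."""
    lowest = min(values)
    if lowest < 0:
        return [x + 1 - lowest for x in values]
    return list(values)

def box_cox(values, lam):
    """Apply B(x, lambda) to strictly positive values."""
    if lam == 0:
        return [math.log(x) for x in values]
    return [(x ** lam - 1) / lam for x in values]

def box_cox_loglik(values, lam):
    """Profile log-likelihood of lambda (up to an additive constant)."""
    n = len(values)
    transformed = box_cox(values, lam)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    jacobian = (lam - 1) * sum(math.log(x) for x in values)
    return -0.5 * n * math.log(variance) + jacobian

def estimate_lambda(values, lower=-1.0, upper=2.0, step=0.01):
    """Grid search (assumed bounds and step) for the lambda with the highest log-likelihood."""
    n_steps = int(round((upper - lower) / step))
    grid = [round(lower + i * step, 10) for i in range(n_steps + 1)]
    return max(grid, key=lambda lam: box_cox_loglik(values, lam))

# Hypothetical values, for illustration only.
raw = [-3.2, 0.5, 1.1, 4.8, 12.0, 2.3]
positive = shift_if_negative(raw)
best_lambda = estimate_lambda(positive)
transformed = box_cox(positive, best_lambda)
```

The sketch assumes that all values are strictly positive after the translation; a variable containing zeros but no negative values would need separate handling.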
Heritability was estimated for the following phenotypic traits: flowering time, panicle morphology and resistance to RYMV. We used a mixed model to estimate the inbred line, block, year and residual variances. Raw (untransformed) data were used for this specific analysis. Heritability was calculated as the ratio of the line variance to the sum of the line variance and the residual variance (https://plant-breeding-genomics.extension.org/estimating-heritability-and-blups-for-traits-using-tomato-phenotypic-data/).
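Written as a formula, this ratio reads as follows, where the symbols are introduced here for illustration only: \sigma^2_{line} is the inbred line variance and \sigma^2_{res} the residual variance estimated by the mixed model.

$$ {H}^2=\frac{\sigma_{line}^2}{\sigma_{line}^2+{\sigma}_{res}^2} $$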
Linkage Disequilibrium
In order to assess the limits of the GWAS analysis, we computed the genome-wide Linkage Disequilibrium (LD) of our sample using the PopLDdecay software (Zhang et al. 2019). We used the imputed VCF as an input and specified the default parameters both for the analysis and the plotting. The genome-wide LD decay was then visually assessed.
Genetic Structure Assessment
In order to efficiently control for the confounding effect of relatedness among individuals, we assessed the population genetic structure of our sample using the sparse non-negative matrix factorization (sNMF) approach implemented in the R package LEA. We assumed a number of ancestral groups (K) between one and 10 and ran five repetitions of the algorithm for each K. In order to evaluate which K best described our data, we computed the cross-entropy criterion for each K and plotted it. For the selected K, we then kept the run with the lowest cross-entropy and used it to plot the ancestry coefficients of each genotype. The estimated K was subsequently used as an input for some association genetics methods.
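The run selection step can be sketched in a few lines of Python. This is an illustration under assumptions, not the LEA code used here: the function name and the cross-entropy values are hypothetical, and the cross-entropy of each run is assumed to be already available as a plain number.

```python
def best_run_per_k(cross_entropy):
    """Map each K to (index of best run, its cross-entropy).

    cross_entropy: dict mapping K to a list of values, one per repetition.
    """
    best = {}
    for k, runs in cross_entropy.items():
        run_index = min(range(len(runs)), key=lambda i: runs[i])
        best[k] = (run_index, runs[run_index])
    return best

# Hypothetical cross-entropy values for three values of K, five runs each.
example = {
    3: [0.512, 0.509, 0.515, 0.511, 0.510],
    4: [0.498, 0.501, 0.497, 0.499, 0.500],
    5: [0.499, 0.496, 0.502, 0.498, 0.497],
}
summary = best_run_per_k(example)
```

The per-K minima in `summary` are the values one would plot against K to judge which K best describes the data; the choice of K itself remains a visual assessment.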
Geographic Mapping of Phenotypic Variables and Link with Genetic Structure
We used the raw data to compute mean values of the quantitative traits under consideration in this study for the accessions having sampling coordinates in their passport data. We then plotted these data using the ggplot2 (Wickham 2016) package for R.
To assess the impact of genetic structure on the phenotypic variables, we computed the Spearman’s rank correlation between the raw phenotypic values and each of the ancestry components retained using the rcorr function of the Hmisc R package (Harrell 2019). We then plotted the resulting matrix as a correlogram using the R package corrplot (Additional file 8: Fig. S5). To assess the significance of the results, we used either a p-value < 0.01 threshold (Additional file 8: Fig. S5a) or an FDR approach with a 5% threshold (Additional file 8: Fig. S5b), calculated using the qvalue R package (Storey et al. 2019).
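As a reminder of what the correlation measures, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values receiving their average rank. The Python sketch below illustrates this definition only; it is not the Hmisc implementation, the function names and example values are hypothetical, and it does not compute p-values.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average rank of the tied block
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical trait values and ancestry coefficients for five accessions.
trait = [92.0, 101.5, 88.0, 110.0, 95.5]
ancestry = [0.10, 0.65, 0.05, 0.90, 0.30]
rho = spearman(trait, ancestry)
```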
Association Studies
For each trial, SNPs with a minor allele frequency (MAF) lower than 5% were filtered out. We first fitted a simple linear model (analysis of variance, ANOVA) to associate phenotype and genotype. This simple method did not take into account any putative confounding factor and served as a baseline to assess whether accounting for relatedness and/or population structure could reduce false positive rates. Two classes of methods accounting for confounding factors were used: 1) mixed models using a kinship matrix and/or population structure (Yu et al. 2006); and 2) latent factor methods (Frichot et al. 2013). For mixed models, we used both the mixed linear model MLM (Zhang et al. 2010) as implemented in the GAPIT R package (Lipka et al. 2012) and EMMA (Kang et al. 2008) as implemented in the R package EMMA. For EMMA, the kinship matrix was estimated using the emma.kinship function. For MLM (Q + K model), the kinship (K matrix) was computed using the VanRaden method and the first three principal components (PCs) of a PCA of genomic data were used as the Q matrix. The PCs were used to correct for population structure only for the MLM method. Finally, we used latent factor methods (Frichot et al. 2013), which jointly estimate associations between genotype and phenotype and confounding factors. We used the R package LFMM2 (Caye et al. 2019) to perform these analyses. We first estimated the confounding factors using a subset of SNPs obtained by applying a 20% MAF filter, and we considered four latent factors (Cubry et al. 2018). We then used the resulting confounding matrix for the analysis of genotype/phenotype association. The results of all analyses were graphically represented using QQ-plots, to assess the correction for confounding factors, and Manhattan plots (R package qqman, Turner 2014). We used a 10⁻⁵ p-value threshold to select candidate SNPs for each method.
An additional false discovery rate (FDR) estimation was performed using the R package qvalue (Storey et al. 2019).
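The minor allele frequency filter applied before the association tests can be illustrated with a short Python sketch. The genotype coding (0, 1 or 2 copies of the alternate allele, None for a missing call), the function names and the example SNPs are assumptions made for illustration and do not reflect the actual pipeline.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP from alternate-allele counts (0/1/2), ignoring None."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0  # no genotype call: treated as uninformative
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def filter_by_maf(snps, threshold=0.05):
    """Keep SNPs whose MAF is at least `threshold`."""
    return {
        name: genotypes
        for name, genotypes in snps.items()
        if minor_allele_frequency(genotypes) >= threshold
    }

# Hypothetical SNPs scored on 20 accessions.
snps = {
    "snp_rare": [0] * 19 + [1],         # MAF = 1/40 = 0.025, removed
    "snp_common": [0] * 10 + [2] * 10,  # MAF = 0.5, kept
}
kept = filter_by_maf(snps)
```

The same function with `threshold=0.20` would correspond to the stricter filter used to select the SNP subset for estimating the latent factors.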
GWAS analysis was performed separately for each year and trial (see Additional file 1: Table S1). P-values obtained for the same traits or the same planting date were combined across experiments using Fisher’s method (Sokal and Rohlf 2012). We defined genomic regions for each trait using a genomic window approach: two consecutive significant SNPs less than 50 kb apart were clumped together in the same region. We then retained as candidate regions only those detected by at least two methods. Retained candidate regions were annotated by intersecting them with the MSU7 genome annotation (Kawahara et al. 2013), considering genes within the defined region extended by 25 kb upstream and 25 kb downstream.
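Both the combination of p-values and the clumping of significant SNPs can be sketched in Python. This is a minimal illustration under assumptions, not the code used in this study: the function names and example values are hypothetical, p-values are assumed to be strictly positive and independent, and positions are assumed to come from a single chromosome.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under H0."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the test statistic
    # Closed-form chi-square survival function for an even number of df (2k).
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def clump_regions(positions, max_gap=50_000):
    """Merge consecutive significant SNPs less than `max_gap` bp apart."""
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][1] < max_gap:
            regions[-1][1] = pos          # extend the current region
        else:
            regions.append([pos, pos])    # open a new region
    return [tuple(region) for region in regions]

# Hypothetical p-values of one SNP in two experiments.
combined = fisher_combined_p([0.05, 0.05])

# Hypothetical positions (bp) of significant SNPs on one chromosome.
regions = clump_regions([100, 30_000, 90_000, 200_000])
```

Each resulting (start, end) pair could then be widened by 25 kb on each side before intersecting it with gene coordinates.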
Finally, for flowering traits, we established a list of known genes of particular interest from published data (Tsuji et al. 2011; Hori et al. 2016). This “expert” list was then used to assess how well our GWAS approach retrieved these potential candidates. We used a G-test to assess the enrichment of these candidates in our list of identified genes.
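A G-test of this kind can be sketched in Python as follows. The layout of the contingency table is an assumption made for illustration (genes inside versus outside candidate regions, crossed with genes on versus off the expert list), as are the function name and the counts; the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def g_test_2x2(a, b, c, d):
    """G-test of independence for the table [[a, b], [c, d]].

    Returns (G statistic, p-value from a chi-square with 1 df).
    """
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if observed[i][j] > 0:  # empty cells contribute nothing
                g += 2.0 * observed[i][j] * math.log(observed[i][j] / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square survival, 1 df
    return g, p_value

# Hypothetical counts: rows = in / not in candidate regions,
# columns = on / not on the expert list.
g_stat, p_val = g_test_2x2(20, 5, 5, 20)
```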