Development of a novel prediction method of cis-elements to hypothesize collaborative functions of cis-element pairs in iron-deficient rice
© Kakei et al.; licensee Springer. 2013
Received: 9 May 2013
Accepted: 13 September 2013
Published: 22 September 2013
Cis-acting elements are essential genomic sequences that control gene expression. In higher eukaryotes, a series of cis-elements function cooperatively. However, further studies are required to examine the co-regulation of multiple cis-elements on a promoter. The aim of this study was to propose a model of cis-element networks that cooperatively regulate gene expression in rice under iron (Fe) deficiency.
We developed a novel clustering-free method, microarray-associated motif analyzer (MAMA), to predict novel cis-acting elements based on weighted sequence similarities and gene expression profiles in microarray analyses. Simulation of gene expression was performed using a support vector machine and based on the presence of predicted motifs and motif pairs. The accuracy of simulated gene expression was used to evaluate the quality of prediction and to optimize the parameters used in this method. Based on sequences of Oryza sativa genes upregulated by Fe deficiency, MAMA returned experimentally identified cis-elements responsible for Fe deficiency in O. sativa. When this method was applied to O. sativa subjected to zinc deficiency and Arabidopsis thaliana subjected to salt stress, several novel candidate cis-acting elements that overlap with known cis-acting elements, such as ZDRE, ABRE, and DRE, were identified. After optimization, MAMA accurately simulated more than 87% of gene expression. Predicted motifs strongly co-localized in the upstream regions of regulated genes and sequences around transcription start sites. Furthermore, in many cases, the separation (in bp) between co-localized motifs was conserved, suggesting that predicted motifs and the separation between them were important in the co-regulation of gene expression.
Our results are suggestive of a typical sequence model for Fe deficiency-responsive promoters and some strong candidate cis-elements that function cooperatively with known cis-elements.
Keywords: Cis-element, Iron deficiency, Transcription
Gene expression is regulated by various factors, including transcription factors (TFs), cis-acting elements, cofactors, and chromatin structure, and by processes such as methylation and acetylation. Many cis-acting elements essential for the regulation of gene expression have been identified, mostly upstream of transcribed sequences. Many reports have described transcription factors regulating gene expression by functionally coordinating with cis-elements (Raff and Kaufman 1991; Wilkins 1991; Gerhart and Kirschner 1997; Carroll et al. 2001) and binding to specific sites (Levine and Tjian 2003).
For more than 10 years, during which time a variety of genomes have been fully sequenced, much effort has been devoted to the development of in silico methods for predicting novel cis-acting sequences or motifs in prokaryotes and eukaryotes. These methods are categorized into two general groups (Hudson and Quail 2003; van Hijum et al. 2009): alignment (probabilistic) methods, such as MEME (Bailey and Elkan 1994), DMBP (Huang et al. 2005), AlignACE (Hughes et al. 2000), and Motif Sampler (Thijs et al. 2001), and enumerative methods (Hudson and Quail 2003; van Hijum et al. 2009). In prokaryotes, noncoding regions are typically short, and cis-elements are highly accumulated (Gama-Castro et al. 2008). Thus, existing methods can often correctly predict cis-elements in prokaryotes. In contrast, in eukaryotic genomes (especially those of higher eukaryotes such as humans and rice) the noncoding regions are much longer, which is believed to be one main reason why prediction in higher eukaryotes is more difficult. Additionally, many cis-elements co-localize in the long upstream sequences and cooperate in the regulation of gene transcription (Carrera and Treisman 2008). Vandenbon et al. (2012) reported that some cis-elements co-localize significantly in the fly genome; of those identified, they experimentally validated the co-regulation of a pair of binding sites within NF-κB and C/EBP. Therefore, predicting a series of cis-elements that function cooperatively has become increasingly important for understanding transcriptional regulation in higher eukaryotes.
Alignment methods are designed to find commonalities in a group of upstream sequences, primarily by aligning similar sequences and creating a probabilistic model, such as a position weight matrix. Alignment methods are often impaired by “false predictions” caused by short sequences that are ubiquitous throughout the genome. For example, A/T-repeats (e.g., AAAAAA) are often predicted. Such A/T-repeat sequences are known to be common in intergenic regions, although they are not known to be involved in transcription. In enumerative methods, all short sequences in a group of upstream sequences are counted, and the counts are compared with those in a background group. These methods usually do not evaluate sequence similarity, although many cis-acting sequences are reportedly quite fuzzy (Collado-Vides et al. 1991). Clustering (i.e., grouping of similarly expressed genes) plays a key role in the prediction of cis-motif elements in both alignment and enumerative methods. However, clustering genes is difficult. For example, clustered genes do not always share the same cis-elements, and selecting the best thresholds for clustering is a difficult issue (Kundaje et al. 2007). Some clustering-free methods are available: REDUCE (Bussemaker et al. 2007) and a method by Kim and Kim (2006) use genome-wide gene expression as input without clustering. However, REDUCE is not applicable to plants, and the method by Kim and Kim (2006) is not designed to predict novel cis-motif elements.
The regulatory mechanisms of iron (Fe) deficiency-inducible genes were explored using molecular biological and plant physiological approaches in rice. We reported that Fe deficiency-responsive element 1 (IDE1: ATCAAGCATGCTTCTTGC) and IDE2 (TTGAACGGCAAGTTTCACGCTGTCACT) were critical cis-elements for several genes upregulated by Fe deficiency (Kobayashi et al. 2003). We also identified the transcription factors that associate with IDE1 and IDE2 (IDEF1, IDEF2; Kobayashi et al. 2007; Ogo et al. 2008). Furthermore, one of the Fe deficiency-inducible transcription factors, OsIRO2, was analyzed, and its binding sequence was investigated (Ogo et al. 2007). The TF-binding sequences (TFBSs) of these TFs are found in only 20–60% of genes regulated under Fe deficiency (Kobayashi et al. 2009), suggesting that novel cis-elements remain to be discovered. IDEF1 functions as a master regulator in rice under Fe deficiency. Therefore, finding other cis-elements that function cooperatively with the IDEF1-binding sequence is especially important.
To identify novel cis-acting motifs in Fe deficiency-induced genes in rice, we applied existing motif prediction methods, that is, MEME (Bailey and Elkan 1994), Motif Sampler (Thijs et al. 2001), and SIFT (Hudson and Quail 2003), to sets of various sizes of genes upregulated by Fe deficiency (results with the top 50 genes are shown in Additional file 1 online). However, the known transcription factor-binding sequences (i.e., those of IDEF1, IDEF2, and OsIRO2) were predicted only after dozens of other sequences had been predicted as “more likely to be cis-elements” (according to their Higher Highest II, lower E-value, and P-values). These methods are designed to identify commonly shared cis-motifs from clustered genes. Under Fe-deficient conditions, OsIRO2 is regulated by IDEF1 (Kobayashi et al. 2009), and OsIRO2 regulates the expression of some other TFs (Ogo et al. 2007). We therefore expected that this regulatory cascade of TFs makes it difficult to form a cluster of genes sharing common cis-elements. Genes regulated by Fe deficiency may not share highly common cis-elements, but each should have at least one of the binding sequences of IDEF1, IDEF2, OsIRO2, or other TFs regulated by OsIRO2. This failure motivated us to develop a novel prediction method able to extract functional cis-acting elements without clustering.
To effectively predict cis-motifs in eukaryotes, we developed a novel in silico method, which we named microarray-associated motif analyzer (MAMA). This method generates an ab initio prediction of cis-elements that is independent of the predictions made by existing methods. Without clustering, we used a MAMA score (Additional file 2) to evaluate the frequency of sequences that specifically exist in upregulated genes, the degree of mismatch and identity, and the degree of gene expression. MAMA was applied to microarray data from rice subjected to Fe deficiency, and the accumulation of motif pairs was also evaluated using this method. We found that the distribution and co-localization of predicted motifs are often conserved in the promoter regions of treatment-regulated genes. MAMA was also applied to other microarray data from rice subjected to Zn-deficiency treatment and from Arabidopsis thaliana subjected to NaCl.
Results and discussion
Development of the MAMA method and its application to O. sativa
Motifs predicted by MAMA using microarray data from iron-deficient and -sufficient rice roots
Motif containing IDEF1 binding sequence
BREU-TATA motif 1
The CTATATAT motif recorded the third highest MAMA score (Table 1, Figure 2D) and was named the TATA-box motif. The TATA-box motif existed most frequently within 50 bp upstream of TSSs of genes that were induced more than twofold by Fe deficiency (Figure 2E, F). The TATA-box motif was also common upstream of genes whose expression was decreased less than 0.5-fold (Figure 2E). Several novel motifs that have not been reported to be related to Fe-deficiency responses were found to have high MAMA scores (Table 1). In particular, the ACGTACGT motif was predicted with the highest MAMA score (Table 1). We named this motif Fe deficiency-associated motif 1 (FAM1). FAM1 was frequently found within 500 bp upstream of TSSs of genes upregulated by Fe deficiency (Additional file 4 I online).
Motifs immediately downstream of TSSs
Motifs predicted from a region 50 bp upstream to 150 bp downstream of TSS
BREU-TATA motif 2
Co-localization of predicted motifs in upregulated genes
MAMA successfully returned known cis-elements from rice roots subjected to zinc deficiency
MAMA successfully returned known cis-elements in A. thaliana
To investigate whether MAMA can predict cis-elements in other plants, we also applied it to A. thaliana microarrays. In microarray data generated from A. thaliana subjected to NaCl stress (Dinneny et al. 2008), the motif containing an abscisic acid (ABA)-responsive element (ABRE; ACGTG[G/T]C), which is a cis-element responsive to ABA, dehydration, low temperature, and high salinity (Narusaka et al. 2003), yielded the highest MAMA score (Figure 4B, Additional file 8 online). The motif containing a dehydration-responsive element (DRE; [A/G]CCGAC), which is involved in dehydration- and high salinity-responsive gene expression (Narusaka et al. 2003), recorded the sixth highest MAMA score (Additional file 8 online). Motifs including ZDRE, ABRE, and DRE consensus sequences were found at particularly high frequencies within 500 bp upstream of the TSSs of over twofold upregulated genes (Figure 4E, F, H, I).
Accuracy of the MAMA method
A transcription-simulation model built on the motifs predicted by MAMA showed the best performance (Figure 5A). Additionally, the best simulation model was improved when motifs predicted from sequences 50 bp upstream and 150 bp downstream of TSSs (near TSS) by MAMA were added to the motifs predicted in sequences 500 bp upstream of TSSs (upstream; Figure 5B). Furthermore, the best simulation model improved further when motif pairs predicted from sequences upstream and near the TSS that were enriched upstream and near the TSS of regulated genes were added (Figure 5B). When randomly selected gene sets were applied to this algorithm 10 times, the AUC-ROC of MAMA was significantly higher than that of the other methods (Figure 5C). The AUC-ROC improved significantly when the model was built on motifs predicted from both the upstream sequence and sequences around TSSs (Figure 5D; + near TSS). When the presence of several motif pairs was added, the AUC-ROC tended to improve (Figure 5D; ++ pairs). When 100 motif pairs were added, the AUC-ROC was significantly impaired (Additional file 9).
After optimization, the number of genes accurately categorized was 87.9% from microarray data on O. sativa subjected to Fe deficiency (13,779.4 genes were accurate on an average of five tests; the number of genes used in test data was 15,676), 97.2% from microarray data on O. sativa subjected to Zn deficiency (14,357.2; 14,769), and 93.3% from microarray data on A. thaliana subjected to NaCl stress (9,691.6; 10,385).
MAMA successfully predicted the functional cis-motifs
Motifs predicted by MAMA from microarray data of O. sativa subjected to Fe deficiency explained more than 87% of the transcription regulation accurately. Of the top 11 motifs extracted, four overlapped with cis-elements that were experimentally identified previously, such as IDEF1BS, OsIRO2BS, and IDEF2BS (Kobayashi et al. 2007; Ogo et al. 2007, 2008). The IDEF1BS motif was found at a high frequency in Fe deficiency-upregulated genes (Figure 2B). Moreover, it frequently occurred between 50 and 400 bp upstream of the TSSs of Fe deficiency-inducible genes but not in the gene population as a whole (Figure 2C). OsIRO2BS and IDEF2BS were also predicted with high MAMA scores (Table 1). These motifs were specifically overrepresented between 50 and 500 bp upstream of the TSSs of regulated genes (Additional file 6 online). These data demonstrate that MAMA successfully predicted functional cis-elements. Furthermore, MAMA successfully predicted ZDRE, ABRE, and DRE using O. sativa and A. thaliana microarray data (Additional file 7 and 8 online; Figure 5). These results suggest that MAMA can predict functional cis-elements involved in various kinds of stress responses not only in rice but also in other plants.
In addition to known cis-elements, MAMA predicted some novel motifs as strong candidate cis-elements that have not been reported before. Using the microarray data of rice under Fe-deficiency stress, FAM1 was returned with the highest MAMA score (Table 1). FAM1 was specifically overrepresented between 50 and 500 bp upstream of the TSSs of regulated genes, as is the case with other known cis-elements (Additional file 4 online; Table 1). Therefore, FAM1 is likely a functional cis-element of rice under Fe-deficiency stress. Generally, deletion of an essential cis-element resulted in an almost complete absence of response, whereas deletion of other parts of promoters merely lowered promoter activity (Guiltinan et al. 1990; Tong et al. 2006; Kobayashi et al. 2007). This is suggestive of the existence of important cis-elements, other than those reported to be essential, within promoters. Novel cis-elements predicted by MAMA may coordinate with known cis-elements to improve transcription.
MAMA predicted cis-motifs involved in the basal transcriptional machinery
The TATA-box motif recorded the third highest MAMA score (Table 1, Figure 2D) and was the most common motif within 50 bp upstream of TSSs. This is consistent with the characteristics of the TATA box (Burley and Roeder 1996). This localization was more common in genes upregulated by Fe deficiency than in the overall gene population (Figure 2F). TATA-box motifs also frequently exist upstream of genes downregulated by Fe deficiency (Figure 2E). A genome-wide analysis in yeast revealed that stress-response genes typically possess a TATA box in their promoters, whereas housekeeping gene promoters often lack this motif (Basehoar et al. 2004). Similar accumulation of the TATA box has been observed in plants (Yamamoto et al. 2011). A TATA box is a core element of the basal transcriptional machinery that regulates genes in conjunction with other cis-elements (Sadhale et al. 2007). Consistent with these reports, our data demonstrated that TATA-box motifs affect the response to Fe deficiency in rice by collaborating with Fe deficiency-specific transcription factors.
Downstream core elements (DCEs) were reported downstream of TSSs in yeast and mammals, and are known to collaborate with the TATA box (Sadhale et al. 2007). Some TATA box-binding protein (TBP)-associated factors (TAFs) bind to DCEs (Sadhale et al. 2007). Our results showed that the DCEp1 motif (Figure 2H) was commonly found immediately downstream of TSSs of Fe deficiency-inducible genes. Also, the DCEp1 motif was highly co-localized with the TATA-box motif of genes upregulated by Fe deficiency (Figure 3G, H). Thus, we suggest that a unit of the basic transcription machinery, including a TATA-box motif and DCEp1 motifs, functions in the transcriptional regulation of rice under Fe-deficiency stress.
Co-localization of cis-motifs predicted by MAMA
Notably, the TATA-box, BREU-TATA motif 1, DCEp1, and IDEF1BS motifs strongly co-localized in regions upstream of Fe deficiency-inducible genes, and the separation (in bp) between them was conserved (Figure 3). IDEF1BS motifs and BREU-TATA motif 1 were frequently co-localized with a separation of 50 bp (Figure 3D), suggesting that the transcription factors binding to IDEF1BS and BREU-TATA motif 1 may interact. Additionally, when the separation (in bp) of motif pairs was plotted with the frequency (i.e., Figure 3B, D, F, H), the frequency often showed several peaks, and the separation (in bp) between these peaks was commonly around 150, 300, and 450 bp (Figure 3D, H). Peaks with a separation of 150 bp have been observed in many other co-localized motifs predicted from rice microarrays under Zn deficiency and from salt-stressed A. thaliana microarrays (Additional file 10 online). Nucleosome core particles contain approximately 150 bp of DNA (Davey et al. 2002). Moyle-Heyrman et al. (2011) reported collaborative competition between transcription factors and the nucleosome. Therefore, these 150- and 300-bp separations of co-localized motifs may indicate either collaborative or competitive binding of transcription factors and histone. Transcription factors may bind to the interspace of DNA coiled by histone.
Motif pairs improved the AUC-ROC in transcription simulation, but the difference from that without motif pairs was not significant. The motif pairs with lower P-values tended to improve the AUC-ROC, and those with higher P-values tended to impair it (Additional file 9). Of the motif pairs with lower P-values, some improved while others impaired the AUC-ROC. Therefore, we suggest the Nmp (number of motif pairs used) that gives the highest AoAR (average of AUC-ROC) as the number of highly probable candidate motif pairs that co-regulate transcription. In addition, we suggest an Nmp that does not impair the AoAR as the number of possible candidate motif pairs that co-regulate transcription.
Among the parameters power ν (controls the sensitivity for sequence similarity), power τ (controls the sensitivity for gene expression ratio), and number of motif pairs Nmp, a change in ν affected the AUC-ROC the most (Methods: Comparison of the effect of parameters), a change in τ had the second largest effect, and a change in Nmp had the smallest. Therefore, in this method, the parameters were adjusted in this order (Methods: Optimization of parameters). We also evaluated the effect of highest_r_score, the upper limit for the r_score (5, 10, 50, 100), and of the threshold (1.5, 2, 3) used to classify upregulated and non-upregulated genes. However, the sizes of their effects varied widely and depended on which microarray data were used. Therefore, these heuristic parameters remained unoptimized (default values: highest_r_score = 10, threshold = 2). The parameter “highest_r_score” may reduce noise caused by excessively high signal ratios, which were frequently observed when the gene signal was low.
A model of transcriptional regulation under Fe deficiency
Performance of MAMA
We compared the motifs generated by MAMA, MEME, MotifSampler and SIFT from the top 50 rice genes upregulated by Fe deficiency (Table 1, 2, Additional file 1). When motif quality was checked using the AUC-ROC of the transcription simulation model, motifs predicted by MAMA contributed significantly more to building the model than those predicted using the other, clustering-dependent methods (Figure 6). Plant researchers can use MAMA to predict cis-motifs from microarray data on a single treatment. For example, MAMA can be applied to microarray data from plants under a particular stress. MAMA optimizes parameters automatically to maximize the accuracy of the simulation of gene expression. Therefore, MAMA does not require most users to determine complicated parameters. We prepared a template file for the A. thaliana microarray ATH1. Users can run MAMA after pasting the signal ratios from the microarray data into the template file. All the calculations of MAMA were performed using a desktop PC (Dell Vostro 470 with Quadro 2000, 8 GB RAM, Windows 7), and the calculation for one data set took from 11 to 54 hours. We developed the main software using GPGPU (CUDA; supported by NVIDIA GeForce (8 or higher), Tesla or Quadro series). Using the CUDA environment, optimization can be completed within 3 days. However, without a CUDA environment, optimizing some parameters in MAMA on a CPU (Core i7 3770) requires several weeks.
We expect MAMA to increase our understanding of the complex regulation of gene expression in higher eukaryotes through the co-localization of motifs and the separation (in bp) between them. COALESCE, a method developed by Huttenhower et al. (2009), generates regulatory modules: co-regulated genes, the conditions under which they are co-regulated, and sequence-level regulatory motifs. Using COALESCE, the genes upregulated under Fe deficiency may be separated into a subcluster regulated by a model including IDEF1BS and another subcluster regulated by a model including OsIRO2BS, which may allow a more specific analysis of the regulation occurring in each subgroup. To perform COALESCE effectively, it is necessary to prepare microarray data similar to those obtained under Fe deficiency. MAMA and all the programs used in this study are available for download at http://park.itc.u-tokyo.ac.jp/pbt/MAMA.
N is the number of genes from the microarray data, N(A) is the number of genes containing motif A, N(!A) is the number of genes that do not contain motif A, N(UP) is the number of genes upregulated more than twofold, CR(A) is the cover ratio of motif A, and CR(A) = N(A)/N.
Preparation of sequences and microarray data
The rice genome sequence (IRGSP1.0) was downloaded from the RAP-DB Web site (http://rapdb.dna.affrc.go.jp/). Genes possessing identical promoters were treated as a single gene (ID). In these cases, the geometric mean of their gene expression ratios was used. Ratios of expression in Fe-deficient and -sufficient plants, obtained using microarrays (Ogo et al. 2008) (cv. Tsukinohikari), were used in subsequent studies. Microarray data from roots of Zn-deficient and -sufficient rice plants were obtained from a published paper (Suzuki et al. 2012) (cv. Nipponbare). The genome sequence of A. thaliana and gene annotation data (TAIR10) were retrieved from TAIR (http://www.arabidopsis.org). Microarray data generated using A. thaliana subjected to NaCl stress were obtained from a previous report (Dinneny et al. 2008). Random sequences were generated using a random sequence generator with probabilities of A:C:G:T as 0.25:0.25:0.25:0.25 (http://tandem.bu.edu/rsg.html).
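The merging of genes that share an identical promoter can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `merge_identical_promoters` and the input layout (gene ID mapped to a promoter string and an expression ratio) are assumptions made for the example.

```python
import math
from collections import defaultdict

def merge_identical_promoters(genes):
    """Collapse genes with identical promoters into one entry.

    genes: dict of gene_id -> (promoter_sequence, expression_ratio).
    Returns a dict of promoter_sequence -> geometric mean of the ratios.
    (Hypothetical helper; the data layout is an assumption.)
    """
    by_promoter = defaultdict(list)
    for _gene_id, (promoter, ratio) in genes.items():
        by_promoter[promoter].append(ratio)

    merged = {}
    for promoter, ratios in by_promoter.items():
        # Geometric mean computed as exp(mean(log(ratio))).
        log_mean = sum(math.log(r) for r in ratios) / len(ratios)
        merged[promoter] = math.exp(log_mean)
    return merged

# Toy data: geneA and geneB share a promoter, so their ratios are merged.
example = {
    "geneA": ("ACGTACGT", 4.0),
    "geneB": ("ACGTACGT", 1.0),
    "geneC": ("TTGACCAA", 3.0),
}
merged = merge_identical_promoters(example)
```

The geometric mean is the natural average for expression ratios because it treats a twofold increase and a twofold decrease symmetrically.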
Calculation of MAMA scores
The h_score was designed to calculate the similarity of a promoter to a candidate sequence. In the present study, both DNA strands were used to calculate h_scores. Every part of a promoter with the same length as a candidate sequence was compared with the candidate sequence, and the highest h_score in a promoter was selected as the h_score(n) of a gene(n). Uninterrupted identity to the candidate sequence and short separations between identical sequences yielded higher h_scores. To control the effect of the separation between them, penalty â was set. To control the sensitivity for sequence similarity on the MAMA score, the result is raised to the power ν. For each gene, the r_score(n) represents the microarray gene expression ratio. To control the influence of expression ratio on the MAMA score, a threshold highest_r_score was set. In cases in which the gene expression ratio exceeded this threshold, the r_score was set to the threshold. The threshold highest_r_score was set to 10.0 (default). When calculating correlations between sequence and upregulation, MAMA offers the option of removing downregulated genes from the analysis or setting the r_score to 1.0 or 1/expression ratio. r_scores for downregulated genes were set to 1.0 (default). To control the sensitivity for gene expression ratios on the MAMA score, the r_scores were raised to the power τ.
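The two ingredients described above can be illustrated with a short sketch. The r_score part follows the defaults stated in the text (cap at highest_r_score, downregulated genes set to 1.0, result raised to the power τ). The exact h_score formula, including the separation penalty, is not given here, so `best_window_identity` is only a stand-in that counts identical bases in the best window on either strand; it is an assumption for illustration, not the MAMA h_score.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def best_window_identity(promoter, candidate):
    """Stand-in for h_score(n): the highest number of identical bases
    between the candidate and any same-length window of the promoter,
    scanning both strands. The separation penalty and the power nu
    described in the text are NOT modelled here."""
    k = len(candidate)
    best = 0
    for strand in (promoter, reverse_complement(promoter)):
        for i in range(len(strand) - k + 1):
            window = strand[i:i + k]
            matches = sum(a == b for a, b in zip(window, candidate))
            best = max(best, matches)
    return best

def r_score(expression_ratio, tau=1.0, highest_r_score=10.0):
    """r_score with the defaults given in the text: ratios below 1
    (downregulated genes) become 1.0, ratios above highest_r_score are
    capped, and the result is raised to the power tau."""
    capped = min(max(expression_ratio, 1.0), highest_r_score)
    return capped ** tau

# A perfect match on the forward strand and one on the reverse strand.
forward_hit = best_window_identity("TTTTACGTACGTTTTT", "ACGTACGT")
reverse_hit = best_window_identity("AAAACGTGCCAAA", "GGCACG")
```

Capping the ratio before raising it to τ keeps a few extreme ratios from dominating the score, which matches the stated purpose of highest_r_score.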
Grouping of similar sequences
High-scoring candidate sequences were identified after MAMA score calculation. For the 5% highest-scoring candidate sequences, similar and lower-scoring candidate sequences were grouped into the same motif group as the higher-scoring one. In the present study, two mismatched bases were permitted (i.e., ≥6 bp identity to the higher-scoring candidate motif).
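A minimal sketch of this grouping step is shown below. It assumes equal-length candidate sequences, a greedy pass from the highest score downward, and that each candidate joins only the first (highest-scoring) group it matches; the last two points are assumptions, since the text does not specify how overlapping groups are resolved. The names `group_candidates` and `hamming` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def group_candidates(scores, top_fraction=0.05, max_mismatch=2):
    """Group similar candidates around the highest-scoring ones.

    scores: dict of candidate sequence -> MAMA score.
    Returns a dict of seed sequence -> list of group members
    (the seed first, then similar lower-scoring candidates).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    assigned = set()
    groups = {}
    for seed in ranked[:n_top]:
        if seed in assigned:
            continue  # already absorbed by a higher-scoring seed
        members = [seed]
        assigned.add(seed)
        for other in ranked:
            if other not in assigned and hamming(seed, other) <= max_mismatch:
                members.append(other)
                assigned.add(other)
        groups[seed] = members
    return groups

# Toy scores; top_fraction is raised so that one seed is selected.
toy_scores = {"ACGTACGT": 10, "ACGTACGA": 8, "ACGTAGGA": 7, "TTTTTTTT": 1}
toy_groups = group_candidates(toy_scores, top_fraction=0.25)
```

With 8-bp candidates, `max_mismatch=2` corresponds to the ≥6 bp identity stated above.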
Evaluation of predicted motifs using a transcription-prediction algorithm
To evaluate the correlation of the presence of predicted motifs with upregulation of genes, we used a classification algorithm based on a support vector machine (SVM; Joachims 1999). All SVM runs were performed using LIBSVM 3.1 (Fan et al. 2005). The problem “how predicted motifs may be used to simulate upregulation of transcription” was formalized as a machine-learning classification problem (Zou et al. 2011). We were interested in assigning genes to two classes, namely, inducible (1) and non-inducible (-1), based on a feature vector describing the presence (1) and absence (0) of motifs and motif pairs in a gene. For training of the models, genes upregulated more than twofold by treatment were used as positive examples. Genes that were not upregulated more than twofold were used as negative examples. For each SVM run, genes were randomly separated into training and test sets. Because the number of positive examples was much smaller than that of negative examples, random undersampling of negative examples was applied to improve performance on the highly imbalanced data (Tang et al. 2009). Ru (the proportion of negative samples retained) was set to 1/16. For each training set, the optimal parameters for C (trade-off between training error and margin) and γ (gamma in the kernel function) were examined by grid search. The performance of the classifier was measured by the AUC-ROC during the optimization, and the optimal parameters that resulted in the highest AUC-ROC were applied to test sets.
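The data preparation and the evaluation metric around the SVM can be sketched in plain Python. The SVM training and grid search themselves (done with LIBSVM in the study) are deliberately left out. The helper names `feature_vector`, `undersample` and `auc_roc`, and the fixed random seed, are assumptions for illustration.

```python
import random

def feature_vector(gene_motifs, motif_list):
    """Presence (1) / absence (0) of each motif or motif pair in a gene."""
    return [1 if m in gene_motifs else 0 for m in motif_list]

def undersample(positives, negatives, ru=1 / 16, seed=0):
    """Keep all positive examples and a random fraction `ru` of the
    negative examples (random undersampling of the majority class)."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(negatives) * ru))
    return positives + rng.sample(negatives, n_keep)

def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive (1)
    receives a higher score than a randomly chosen negative (-1);
    ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

vec = feature_vector({"IDEF1BS", "TATA-box"}, ["IDEF1BS", "FAM1", "TATA-box"])
train = undersample(list(range(5)), list(range(100, 132)))
auc = auc_roc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

In this toy case three of the four positive-negative pairs are ranked correctly, so the AUC-ROC is 0.75.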
Evaluation of motif co-localization
A region from 500 bp upstream to 150 bp downstream of TSSs was used to evaluate the co-localization of motif pairs (e.g., motif A and B). In the above equation, Oi represents the observed number of N(A|B|UP), N(!A|B|UP), N(A|!B|UP), and N(!A|!B|UP), while Ei represents the expected number, N(UP)N(A|B|!UP)/N(!UP), N(UP)N(!A|B|!UP)/N(!UP), N(UP)N(A|!B|!UP)/N(!UP), and N(UP)N(!A|!B|!UP)/N(!UP). Two enrichments of motif A were simultaneously evaluated as EN1 and EN2. EN1 was defined as N(A|B|UP)/N(B|UP) divided by N(A|!B|UP)/N(!B|UP). EN2 was defined as N(A|B|UP)/N(B|UP) divided by N(A|B|!UP)/N(B|!UP). Enriched motif pairs were defined as motif pairs for which both EN1 and EN2 were greater than 1. When the number of motif pairs used in MAMA was set to Nmp, the Nmp motif pairs with the lowest P-values were used for the simulation of gene expression. If motif A and motif B contained identical sequences, co-localization was not evaluated.
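The equation referred to in this paragraph is not reproduced in this text. Given the observed counts Oi and expected counts Ei defined here, the sketch below assumes a Pearson chi-squared statistic, the sum over i of (Oi − Ei)²/Ei; that choice is an assumption, as are the function name and the dictionary keys. Conversion of the statistic to a P-value is omitted, and zero counts are not handled.

```python
def colocalization_stats(up, not_up):
    """Compute an assumed Pearson chi-squared statistic plus EN1 and EN2.

    up / not_up: gene counts for upregulated and non-upregulated genes,
    each a dict with keys
      'AB' = N(A|B),  'aB' = N(!A|B),  'Ab' = N(A|!B),  'ab' = N(!A|!B).
    """
    keys = ("AB", "aB", "Ab", "ab")
    n_up = sum(up[k] for k in keys)          # N(UP)
    n_not_up = sum(not_up[k] for k in keys)  # N(!UP)

    chi2 = 0.0
    for k in keys:
        expected = n_up * not_up[k] / n_not_up  # Ei as defined in the text
        chi2 += (up[k] - expected) ** 2 / expected

    a_given_b_up = up["AB"] / (up["AB"] + up["aB"])        # N(A|B|UP)/N(B|UP)
    a_given_notb_up = up["Ab"] / (up["Ab"] + up["ab"])     # N(A|!B|UP)/N(!B|UP)
    a_given_b_notup = not_up["AB"] / (not_up["AB"] + not_up["aB"])  # N(A|B|!UP)/N(B|!UP)

    en1 = a_given_b_up / a_given_notb_up
    en2 = a_given_b_up / a_given_b_notup
    return chi2, en1, en2

# Toy counts: 100 upregulated genes and 1,000 non-upregulated genes.
up_counts = {"AB": 30, "aB": 10, "Ab": 20, "ab": 40}
not_up_counts = {"AB": 100, "aB": 200, "Ab": 200, "ab": 500}
chi2, en1, en2 = colocalization_stats(up_counts, not_up_counts)
enriched = en1 > 1 and en2 > 1
```

In the toy data the expected counts are 10, 20, 20 and 50, so the pair is enriched by both criteria: motif A is more frequent alongside B than without B in upregulated genes (EN1), and more frequent alongside B in upregulated than in non-upregulated genes (EN2).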
Optimization of MAMA parameters
Parameters power ν (controls the sensitivity for sequence similarity), power τ (controls the sensitivity for the gene expression ratio), and the number of motif pairs Nmp applied for the SVM were optimized one by one in this order. These parameters started from 1, 1, and 0, respectively, and were increased by 1 after each set of simulations. During the optimization of power ν, power τ = 1, 2, 3, 4, and 5 were tested five times each, and the average AUC-ROC (= AoAR(ν)) was calculated from these 25 simulations. After simulation with increased power ν, if AoAR(ν) < AoAR(ν – 1), then the optimized power ν was set to ν – 1; otherwise, power ν was increased further. As power τ increased, the AUC-ROC value reached a plateau. During the optimization of power τ, power τ, τ + 1, τ + 2, τ + 3, and τ + 4 were tested five times each, and the slope of the AUC-ROC (SoAR(τ)) was calculated using power τ and the AUC-ROC from these 25 simulations. If the maximum value of SoAR(τ) (MaxSoAR(τ)) was not yet defined, or if SoAR(τ) was bigger than MaxSoAR(τ), then MaxSoAR(τ) was set to SoAR(τ). After the increase in power τ, if SoAR(τ) < (MaxSoAR(τ)/2), then the optimized power τ was set to τ; otherwise, power τ was increased further. If τ was more than five, the integral 5/τ was added to τ. During the optimization of Nmp, 1, 2, 3, 5, 10, 20, 30, 50, 100, and 200 were tested five times each. The average AUC-ROC, AoAR(Nmp), was calculated for each Nmp value (10 tests each), and the Nmp with the highest AoAR(Nmp) was set as the optimized Nmp.
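The stopping rule for ν and the selection of the number of motif pairs can be sketched as below. The callback `average_auc`, which stands for running the simulations and returning the averaged AUC-ROC, is hypothetical, and the `max_nu` safeguard is an assumption added so the loop always terminates. The slope-based rule for τ is not reproduced.

```python
def optimize_nu(average_auc, start=1, max_nu=20):
    """Increase nu by 1 until the average AUC-ROC drops, then return the
    previous value: if AoAR(nu + 1) < AoAR(nu), nu is the optimum.
    `average_auc` is a hypothetical callback returning AoAR for a nu."""
    nu = start
    best = average_auc(nu)
    while nu < max_nu:
        candidate = average_auc(nu + 1)
        if candidate < best:
            return nu
        nu += 1
        best = candidate
    return nu

# Candidate numbers of motif pairs listed in the text.
NMP_VALUES = (1, 2, 3, 5, 10, 20, 30, 50, 100, 200)

def optimize_nmp(average_auc, values=NMP_VALUES):
    """Return the number of motif pairs with the highest average AUC-ROC."""
    return max(values, key=average_auc)

# Toy AoAR tables standing in for real simulation results.
toy_aoar_nu = {1: 0.80, 2: 0.85, 3: 0.86, 4: 0.84, 5: 0.90}
best_nu = optimize_nu(toy_aoar_nu.get)
best_nmp = optimize_nmp(lambda n: -abs(n - 10))
```

Note that the rule for ν stops at the first drop, so in the toy table it returns 3 even though a later value scores higher; this greedy behavior follows the procedure as described.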
Comparison of the effect of parameters to AUC-ROC
Initially, we tested parameters ν (1, 2, 3, 4, 5), τ (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), and Nmp (1, 2, 3, 5, 10, 20, 30, 50, 100, 200) using microarray data from rice subjected to Fe deficiency and Zn deficiency, and A. thaliana subjected to NaCl stress. For each ν, the average AUC-ROC calculated using all the tested power τ and Nmp was compared (AoAR(ν, ∑τ, ∑Nmp)). The difference between the highest AoAR(ν, ∑τ, ∑Nmp) and lowest AoAR(ν, ∑τ, ∑Nmp) was evaluated as the effect of ν. Similarly, the difference between the highest AoAR(∑ν, τ, ∑Nmp) and lowest AoAR(∑ν, τ, ∑Nmp) was evaluated as the effect of τ. The difference between the highest AoAR(∑ν, ∑τ, Nmp) and lowest AoAR(∑ν, ∑τ, Nmp) was evaluated as the effect of Nmp.
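The comparison described above amounts to averaging the AUC-ROC over the other two parameters and taking the range of those averages. A minimal sketch, assuming the results are stored in a dictionary keyed by (ν, τ, Nmp); the function name and data layout are hypothetical.

```python
from collections import defaultdict

def parameter_effects(results):
    """Effect of each parameter on the AUC-ROC.

    results: dict of (nu, tau, nmp) -> AUC-ROC.
    For each parameter, AUC-ROC values are averaged over all settings of
    the other two parameters; the effect is the highest average minus
    the lowest average.
    """
    names = ("nu", "tau", "nmp")
    effects = {}
    for axis, name in enumerate(names):
        buckets = defaultdict(list)
        for params, auc in results.items():
            buckets[params[axis]].append(auc)
        means = [sum(v) / len(v) for v in buckets.values()]
        effects[name] = max(means) - min(means)
    return effects

# Toy grid: two values of nu and tau, a single value of Nmp.
toy_results = {
    (1, 1, 1): 0.6,
    (1, 2, 1): 0.7,
    (2, 1, 1): 0.8,
    (2, 2, 1): 0.9,
}
effects = parameter_effects(toy_results)
```

In the toy grid ν has the largest effect, which would place it first in the optimization order.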
Prediction of cis-elements with existing methods
MEME (Bailey and Elkan 1994), Motif Sampler (Thijs et al. 2001), and SIFT (Hudson and Quail 2003) were used for comparison of the predicted cis-elements. The 500-bp upstream sequences from the TSSs of the 50 most upregulated genes in microarray data on rice subjected to Fe deficiency were used as input. Background data were generated from 500-bp upstream sequences from the TSSs of rice genes whose gene expression ratio was between 0.8 and 1.2. Most parameters remained at their default values. If a word size was required, it was set to 8. The number of outputs was set to 1,250, and the 1,250 motifs from each method were used to simulate gene expression by SVM.
- TSS:
Transcription start site
- MAMA:
Microarray-associated motif analyzer
- TFBS:
Transcription factor binding site
- IDE1 and 2:
Iron-deficiency responsive element 1 and 2
- IDEF1 and 2:
IDE1-binding factor and IDE2-binding factor
- IDEF1BS, IDEF2BS and OsIRO2BS:
Motif containing binding sequence of IDEF1, IDEF2 and OsIRO2
- FAM1:
Fe deficiency-associated motif 1
- DCEp1:
Putative downstream core element 1
- MEME:
Multiple Em for Motif Elicitation
- ROC curve:
A receiver operating characteristic curve
- AUC-ROC:
The area under the curve of ROC curve
- SVM:
Support vector machine.
This work was supported in part by a Grant-in-Aid from the Japanese Society for the Promotion of Science (JSPS).
- Assunção AGL, et al.: Arabidopsis thaliana transcription factors bZIP19 and bZIP23 regulate the adaptation to zinc deficiency. Proc Natl Acad Sci U S A 2010, 107: 10296–10301. 10.1073/pnas.1004788107
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2: 28–36.
- Basehoar AD, et al.: Identification and Distinct Regulation of Yeast TATA Box-Containing Genes. Cell 2004, 116: 699–709. 10.1016/S0092-8674(04)00205-3
- Burley SK, Roeder RG: Biochemistry and structural biology of transcription factor IID (TFIID). Annu Rev Biochem 1996, 65: 769–799. 10.1146/annurev.bi.65.070196.004005
- Bussemaker HJ, et al.: Predictive modeling of genome-wide mRNA expression: from modules to molecules. Annu Rev Biophys Biomol Struct 2007, 36: 329–347. 10.1146/annurev.biophys.36.040306.132725
- Carrera I, Treisman JE: Message in a nucleus: signaling to the transcriptional machinery. Curr Opin Genet Dev 2008, 18: 397–403. 10.1016/j.gde.2008.07.007
- Carroll SB, et al.: From DNA to diversity. Hoboken: Wiley-Blackwell; 2001.
- Collado-Vides J, et al.: Control site location and transcriptional regulation in Escherichia coli. Microbiol Rev 1991, 55: 371–394.
- Davey CA, et al.: Solvent Mediated Interactions in the Structure of the Nucleosome Core Particle at 1.9 Å Resolution. J Mol Biol 2002, 319: 1097–1113. 10.1016/S0022-2836(02)00386-8
- Deng W, Roberts SGE: Core promoter elements recognized by transcription factor IIB. Biochem Soc Trans 2006, 34: 1051–1053. 10.1042/BST0341051
- Dinneny JR, et al.: Cell Identity Mediates the Response of Arabidopsis Roots to Abiotic Stress. Science 2008, 320: 942–945. 10.1126/science.1153795
- Fan R-E, et al.: Working Set Selection Using Second Order Information for Training Support Vector Machines. J Mach Learn Res 2005, 6: 1889–1918.
- Gama-Castro S, et al.: RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 2008, 36: D120-D124. 10.1093/nar/gkn491
- Gerhart J, Kirschner M: Cells, Embryos and Evolution. 1st ed. New York: Wiley; 1997.
- Guiltinan MJ, et al.: A plant leucine zipper protein that recognizes an abscisic acid response element. Science 1990, 250: 267–271. 10.1126/science.2145628
- Hijum SAFT, et al.: Mechanisms and Evolution of Control Logic in Prokaryotic Transcriptional Regulation. Microbiol Mol Biol Rev 2009, 73: 481–509. 10.1128/MMBR.00037-08
- Huang E, et al.: An algorithm for ab initio DNA motif detection. Info Process and Living Systems 2005, 2: 611–614.
- Hudson ME, Quail PH: Identification of Promoter Motifs Involved in the Network of Phytochrome A-Regulated Gene Expression by Combined Analysis of Genomic Sequence and Microarray Data. Plant Physiol 2003, 133: 1605–1616. 10.1104/pp.103.030437
- Hughes JD, et al.: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000, 296: 1205–1214. 10.1006/jmbi.2000.3519
- Huttenhower C, et al.: Detailing regulatory networks through large scale data integration. Bioinformatics 2009, 25: 3267–3274. 10.1093/bioinformatics/btp588
- Joachims T: Advances in kernel methods. Edited by: Schölkopf B, Burges CJC, Smola AJ. Cambridge: MIT Press; 1999:169–184.
- Kim S-Y, Kim Y: Genome-wide prediction of transcriptional regulatory elements of human promoters using gene expression and promoter analysis data. BMC Bioinforma 2006, 7: 330. 10.1186/1471-2105-7-330
- Kobayashi T, et al.: Identification of novel cis-acting elements, IDE1 and IDE2, of the barley IDS2 gene promoter conferring iron-deficiency-inducible, root-specific expression in heterogeneous tobacco plants. Plant J 2003, 36: 780–793. 10.1046/j.1365-313X.2003.01920.x
- Kobayashi T, et al.: The transcription factor IDEF1 regulates the response to and tolerance of iron deficiency in plants. Proc Natl Acad Sci U S A 2007, 104: 19150–19155. 10.1073/pnas.0707010104
- Kobayashi T, et al.: The rice transcription factor IDEF1 is essential for the early response to iron deficiency, and induces vegetative expression of late embryogenesis abundant genes. Plant J 2009, 60: 948–961. 10.1111/j.1365-313X.2009.04015.x
- Kundaje A, et al.: Learning Regulatory Programs That Accurately Predict Differential Expression with MEDUSA. Ann N Y Acad Sci 2007, 1115: 178–202. 10.1196/annals.1407.020
- Lagrange T, et al.: New core promoter element in RNA polymerase II-dependent transcription: sequence-specific DNA binding by transcription factor IIB. Genes Dev 1998, 12: 34–44. 10.1101/gad.12.1.34
- Levine M, Tjian R: Transcription regulation and animal diversity. Nature 2003, 424: 147–151. 10.1038/nature01763
- Moyle-Heyrman G, et al.: Structural Constraints in Collaborative Competition of Transcription Factors against the Nucleosome. J Mol Biol 2011, 412: 634–646. 10.1016/j.jmb.2011.07.032
- Narusaka Y, et al.: Interaction between two cis-acting elements, ABRE and DRE, in ABA-dependent expression of Arabidopsis rd29A gene in response to dehydration and high-salinity stresses. Plant J 2003, 34: 137–148. 10.1046/j.1365-313X.2003.01708.x
- Ogo Y, et al.: The rice bHLH protein OsIRO2 is an essential regulator of the genes involved in Fe uptake under Fe-deficient conditions. Plant J 2007, 51: 366–377. 10.1111/j.1365-313X.2007.03149.x
- Ogo Y, et al.: A novel NAC transcription factor, IDEF2, that recognizes the iron deficiency-responsive element 2 regulates the genes involved in iron homeostasis in plants. J Biol Chem 2008, 283: 13407–13417. 10.1074/jbc.M708732200
- Raff RA, Kaufman TC: Embryos, Genes, and Evolution: Developmental-Genetic Basis of Evolutionary Change. Bloomington: Indiana University Press; 1991.
- Sadhale P, et al.: Basal transcription machinery: role in regulation of stress response in eukaryotes. J Biosci 2007, 32: 569–578. 10.1007/s12038-007-0056-6
- Suzuki M, et al.: Accumulation of starch in Zn-deficient rice. Rice 2012, 5: 1–8. 10.1186/1939-8433-5-1
- Tang Y, et al.: SVMs modeling for highly imbalanced classification. Syst Man Cybern Part B: Cybern IEEE Trans 2009, 39: 281–288.
- Thijs G, et al.: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17: 1113–1122. 10.1093/bioinformatics/17.12.1113
- Tong Q, et al.: Participation of the PI-3K/Akt-NF-κB signaling pathways in hypoxia-induced mitogenic factor-stimulated Flk-1 expression in endothelial cells. Respir Res 2006, 7: 101. 10.1186/1465-9921-7-101
- Tsai FTF, Sigler PB: Structural basis of preinitiation complex assembly on human Pol II promoters. EMBO J 2000, 19: 25–36. 10.1093/emboj/19.1.25
- Vandenbon A, et al.: A novel unbiased measure for motif co-occurrence predicts combinatorial regulation of transcription. BMC Genomics 2012, 13: S11.
- Wilkins RG: Kinetics and Mechanism of Reactions of Transition Metal Complexes. 2nd ed. Weinheim: Wiley-VCH; 1991.
- Yamamoto YY, et al.: Characteristics of Core Promoter Types with respect to Gene Structure and Expression in Arabidopsis thaliana. DNA Res 2011, 18: 333–342. 10.1093/dnares/dsr020
- Zou C, et al.: Cis-regulatory code of stress-responsive transcription in Arabidopsis thaliana. Proc Natl Acad Sci U S A 2011, 108: 14992–14997. 10.1073/pnas.1103202108
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.