MicroRNAs (miRNAs) are short (~22 nucleotides) endogenous non-coding RNAs [1] found in single-celled eukaryotes, viruses, plants, and animals [2]. miRNAs help control homeostasis [2] and play significant roles in various biological processes through mRNA degradation and translational repression mediated by complementary base pairing [3]. Several miRNA databases provide detailed information about miRNA sequences, annotations, functions, and predicted targets; among them, miRBase is the primary online database for mature miRNA sequences and annotations [4-6]. This article explains the structure of miRBase and the algorithm behind its target predictions.
miRBase is an online database available at www.mirbase.org [4-6]. The data can be downloaded from an FTP site (ftp://mirbase.org/pub/mirbase/CURRENT/) in different formats, including FASTA and MySQL relational database dumps [7]. It provides a user-friendly interface to miRNA sequence data, predicted gene targets, and annotations [6]. miRBase release 10.1 contains 5071 miRNA loci from 58 species, which express 5922 mature miRNA sequences [7]. miRBase has three main functions:
1. miRBase Registry:
It is a confidential service that assigns unique names to novel miRNA genes before their publication in a peer-reviewed journal [7]. This service has been used by over 70 publications. miRNAs are assigned sequential numerical identifiers, prefixed with a three- or four-letter abbreviation that designates the species, for example, hsa-miR-101 (in Homo sapiens) [8].
2. miRBase Sequences:
It acts as the primary online repository for miRNA sequences and annotation, providing miRNA information, annotation, references, and links to other resources for all published and validated miRNAs [5,7]. This database consists of over 5000 sequences from 58 species.
3. miRBase Targets:
It is the database of predicted miRNA target genes, covering all published animal miRNAs [5,7]. Version 5 of this database predicts targets for over 500,000 mRNAs for all miRNAs in 24 different species. Individual miRNA-target binding sites, sites conserved across multiple species, and multiple binding sites within a UTR are each assigned a P-value [9], which helps the user judge confidence in the predicted results.
miRBase has a nomenclature scheme for miRNAs; its primary features are as follows [7]:
- A miRNA name is composed of a three- or four-letter species abbreviation as the prefix and a number as the suffix. For example, hsa-mir-212.
- Identical mature miRNA sequences expressed from more than one hairpin precursor locus are given additional numeric suffixes. For example, dme-mir-6-1 and dme-mir-6-2.
- Related mature miRNA sequences expressed from related hairpin loci are given lettered suffixes. For example, mmu-mir-181a and mmu-mir-181b.
- Plant miRNA genes are named in the form ath-MIR166a, where the letter suffixes denote distinct loci expressing related mature miRNAs; numeric suffixes are not used.
- Viral miRNA names include the locus from which they are derived. For example, ebv-mirBART1 from the Epstein-Barr virus BART locus.
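The naming conventions above can be illustrated with a small parser. This is only a sketch: the regular expression and the function name `parse_mirna_name` are assumptions made for illustration, not part of miRBase, and the pattern covers only the animal and plant forms shown above (plus the -5p/-3p arm suffix that miRBase uses for mature sequences). Viral names such as ebv-mirBART1 are deliberately not handled.

```python
import re

# Hypothetical pattern for the name forms listed above (not an official miRBase grammar).
NAME_RE = re.compile(
    r"^(?P<species>[a-z]{3,4})-"   # three- or four-letter species prefix, e.g. hsa
    r"(?P<kind>mir|miR|MIR)-?"     # precursor (mir), mature (miR), or plant gene (MIR)
    r"(?P<number>\d+)"             # sequential numerical identifier
    r"(?P<letter>[a-z]?)"          # lettered suffix for related sequences, e.g. 181a
    r"(?:-(?P<locus>\d+))?"        # numeric suffix for distinct loci, e.g. 6-1
    r"(?:-(?P<arm>[35]p))?$"       # hairpin arm of a mature sequence, e.g. -5p
)

def parse_mirna_name(name):
    """Split a miRNA name into its parts; return None if the pattern does not apply."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    # Keep only the components that are actually present in the name.
    return {key: value for key, value in match.groupdict().items() if value}

for example in ["hsa-mir-212", "dme-mir-6-2", "mmu-mir-181a", "ath-MIR166a"]:
    print(example, parse_mirna_name(example))
```

Names that fall outside the pattern return `None` rather than a partial match, so unsupported forms are easy to spot.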
miRBase Data
The latest release of miRBase (release 20) contains 24,521 hairpin sequences from 206 species and 30,424 mature sequences [10]. In many cases, mature miRNAs are expressed from both the 5′ and 3′ arms of the hairpin precursor, suggesting either that both may be functional or that there is insufficient data to determine the predominant product [7]. Such miRNAs are named, for example, hsa-miR-140-5p and hsa-miR-140-3p. The ‘Evidence’ field records the origin of each sequence in the database.
The miRBase:Targets database predicts targets in the UTRs of 37 different animal genomes from Ensembl [5,7]. For each miRNA, miRBase lists the overlapping mRNAs, stating the type of overlap (intron, UTR, or exon) and the sense (forward or reverse) [7]. miRNAs are often clustered within a genome, so miRBase also provides a list of clustered miRNAs, which can be retrieved for any organism. In addition, miRBase displays the distribution of genomic features around miRNAs, showing CpG islands, poly-A sites, ESTs, cDNAs, transcription start sites (TSSs), and DITAGs. TSSs are predicted using the Eponine-TSS software.
How does it work?
miRBase uses the miRanda algorithm [7,11] to scan all available miRNA sequences for a given genome against the 3′ UTRs of that genome obtained from Ensembl [12]. The algorithm is based on dynamic programming and searches for maximal local complementarity alignments. Each G:C or A:U pair scores +5, each G:U wobble pair scores +2, and every mismatched pair scores −3; the gap-opening and gap-elongation parameters are set to −8.0 and −2.0, respectively [11,12]. The optimal alignment score at positions i, j is computed by filling an alignment scoring matrix. The gap-elongation parameter is used only if extending to i, j a stretch of gaps ending at positions i–1, j or i, j–1 (but not stretches of gaps ending at i–k, j or i, j–k for k > 1) gives a higher score than adding a nucleotide-nucleotide match at positions i, j. Complementarity scores for the first eleven residues at the miRNA 5′ end are multiplied by a scaling factor of 2.0 to reflect the experimentally observed 5′-3′ asymmetry; for example, G:C and A:U base pairs contribute +10 to the match score at these positions. The scaling factor is adjustable. A candidate target site must satisfy two thresholds: S > 90 and ΔG < −17 kcal/mol, where S is the sum of the single-residue-pair match scores over the alignment trace and ΔG is the free energy of duplex formation from a completely dissociated state [11,12].
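To make the scoring scheme concrete, here is a minimal sketch that scores an ungapped miRNA:site duplex position by position. It is not the miRanda implementation: it omits the dynamic-programming alignment and gap penalties entirely, and the function names are hypothetical. ΔG is treated as an input, since computing it requires an RNA folding tool that is outside the scope of this sketch.

```python
# Pair scores described above: +5 for G:C and A:U, +2 for G:U wobble, -3 otherwise.
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}

def pair_score(a, b):
    """Score a single miRNA:target residue pair."""
    if (a, b) in WATSON_CRICK:
        return 5
    if (a, b) in WOBBLE:
        return 2
    return -3

def ungapped_duplex_score(mirna, site, scale=2.0, scaled_len=11):
    """Sum pair scores for an ungapped duplex (simplification: no gaps are considered).

    mirna is written 5'->3'; site is the target written 3'->5', so that
    site[i] is the residue facing mirna[i]. The first `scaled_len`
    positions from the miRNA 5' end are multiplied by `scale`.
    """
    if len(mirna) != len(site):
        raise ValueError("mirna and site must have the same length")
    total = 0.0
    for i, (a, b) in enumerate(zip(mirna, site)):
        score = pair_score(a, b)
        if i < scaled_len:
            score *= scale
        total += score
    return total

def passes_thresholds(score, delta_g):
    """Apply the thresholds from the text: S > 90 and dG < -17 kcal/mol."""
    return score > 90 and delta_g < -17.0

# Toy duplex (made-up sequences): G:C, G:U, A:U, U:A, all within the scaled region.
print(ungapped_duplex_score("GGAU", "CUUA"))
```

Under this simplification, a perfectly complementary 22-nt duplex scores 11 × 10 + 11 × 5 = 165, well above the S > 90 threshold.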
The algorithm finds the optimal local matches above these thresholds between a particular miRNA and a set of 3′ UTRs in each genome, and then checks whether the miRNA sequence and the target site position are conserved in orthologous genes, i.e., among human, mouse, and rat, or between fugu and zebrafish [12]. The alignment between target sites is transitive (UTR to miRNA to UTR) through a homologous miRNA. The positions of a pair of target sites must fall within ±10 residues of each other in the aligned UTRs. When this criterion is met, conserved target sites with 90% or more sequence identity (human versus mouse or rat) or 70% or more (fugu versus zebrafish) are selected as candidate miRNA target sites and stored in a MySQL database [11]. John et al. (2005) predicted 10,572 target sites conserved in either mouse or rat in 4,463 human transcripts, of which 2,307 transcripts from 2,273 genes contained more than one target site. Similarly, using zebrafish as the reference species, they predicted 7,057 conserved target sites (conserved in fugu) in 4,820 zebrafish transcripts [12].
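The conservation filter can be sketched as two simple checks. The text does not specify exactly how sequence identity is computed, so this example assumes two equal-length, already aligned site sequences and positions given in aligned-UTR coordinates; the function names are hypothetical.

```python
def percent_identity(site_a, site_b):
    """Percentage of identical residues between two equal-length aligned sites."""
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(site_a, site_b) if x == y)
    return 100.0 * matches / len(site_a)

def is_conserved(pos_a, pos_b, site_a, site_b, min_identity=90.0, max_offset=10):
    """Check the two criteria described above.

    pos_a and pos_b are site positions in the aligned UTRs (assumed coordinates).
    Use min_identity=90.0 for human versus mouse or rat, 70.0 for fugu versus zebrafish.
    """
    within_window = abs(pos_a - pos_b) <= max_offset
    return within_window and percent_identity(site_a, site_b) >= min_identity

# Made-up 20-nt sites differing at two positions (90% identity), 4 residues apart.
human_site = "ACGUACGUACGUACGUACGU"
mouse_site = "ACGUACGUACGUACGUACCC"
print(is_conserved(120, 124, human_site, mouse_site))
```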
The conserved target sites for each miRNA are sorted by alignment score, with free energy as the secondary sort criterion. When a single site on an mRNA (or sites within 25 nt of each other) is targeted by multiple miRNAs, the miRNA with the highest score and lowest energy is reported for that site [11].
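The ranking and per-site selection can be sketched as follows. The `Hit` record and the greedy selection strategy are assumptions made for illustration; the text only states the sort order and the 25-nt window, not how overlapping sites are resolved in practice.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    mirna: str       # miRNA name
    position: int    # site position on the mRNA
    score: float     # alignment score S
    energy: float    # duplex free energy in kcal/mol (more negative = more stable)

def rank_hits(hits):
    """Sort by alignment score (descending), then by free energy (ascending)."""
    return sorted(hits, key=lambda h: (-h.score, h.energy))

def best_per_site(hits, window=25):
    """Greedy sketch: keep the top-ranked hit and drop others within `window` nt of it."""
    kept = []
    for hit in rank_hits(hits):
        if all(abs(hit.position - k.position) > window for k in kept):
            kept.append(hit)
    return kept

# Made-up hits on one mRNA; three of them compete for the same site.
hits = [
    Hit("mir-A", 100, 120.0, -20.0),
    Hit("mir-B", 110, 130.0, -18.0),
    Hit("mir-C", 100, 130.0, -22.0),
    Hit("mir-D", 200, 95.0, -25.0),
]
for hit in best_per_site(hits):
    print(hit.mirna, hit.position)
```

Here mir-C and mir-B tie on score, so the lower free energy of mir-C decides the first site; mir-D is far enough away to be reported separately.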
Recently, miRBase has introduced high-confidence miRNAs based on the pattern of deep-sequencing data [10]. To be classified as high confidence, a locus must fulfill the following criteria [10]:
- At least 10 reads must map to each of the two possible mature miRNAs derived from the hairpin precursor.
- At least half of the reads mapping to each arm of the hairpin precursor must have the same 5′ end.
- At least 60% of the bases in the mature sequences must be paired in the predicted hairpin structure.
- The predicted hairpin structure must have a folding free energy of < −0.2 kcal/mol/nt.
- The most abundant reads from each arm of the hairpin precursor must pair in the mature miRNA duplex with a 0-4 nucleotide overhang at their 3′ ends.
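The criteria above amount to a set of threshold checks, which can be sketched as a single function. The `LocusEvidence` record and its field names are hypothetical, and the values are assumed to have been computed beforehand from read mappings and a predicted hairpin structure. The energy threshold used here is −0.2 kcal/mol/nt, as given in the miRBase release 20 paper [10].

```python
from dataclasses import dataclass

@dataclass
class LocusEvidence:
    reads_5p: int               # reads mapping to the 5' arm mature sequence
    reads_3p: int               # reads mapping to the 3' arm mature sequence
    frac_same_start_5p: float   # fraction of 5' arm reads sharing the same 5' end
    frac_same_start_3p: float   # fraction of 3' arm reads sharing the same 5' end
    frac_paired: float          # fraction of mature-sequence bases paired in the hairpin
    energy_per_nt: float        # folding free energy in kcal/mol/nt
    overhang_5p_arm: int        # 3' overhang (nt) of the most abundant 5' arm read
    overhang_3p_arm: int        # 3' overhang (nt) of the most abundant 3' arm read

def is_high_confidence(ev, min_reads=10, min_same_start=0.5, min_paired=0.6,
                       max_energy_per_nt=-0.2, overhang_range=(0, 4)):
    """Return True only if every criterion listed above is satisfied."""
    low, high = overhang_range
    return (
        ev.reads_5p >= min_reads
        and ev.reads_3p >= min_reads
        and ev.frac_same_start_5p >= min_same_start
        and ev.frac_same_start_3p >= min_same_start
        and ev.frac_paired >= min_paired
        and ev.energy_per_nt < max_energy_per_nt
        and low <= ev.overhang_5p_arm <= high
        and low <= ev.overhang_3p_arm <= high
    )

# Made-up evidence for a single locus.
locus = LocusEvidence(150, 42, 0.8, 0.65, 0.75, -0.35, 2, 2)
print(is_high_confidence(locus))
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive a classification is to any single criterion.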
This was all about miRBase. If you would like to read more about miRNA prediction using deep sequencing data, click here. We will discuss other miRNA databases in detail in upcoming articles.
References
- Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2), 281-297.
- Liu, B., Li, J., & Cairns, M. J. (2012). Identifying miRNAs, targets and functions. Briefings in bioinformatics, 15(1), 1-19.
- He, L., & Hannon, G. J. (2004). MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5(7), 522.
- Griffiths-Jones, S. (2006). miRBase: the microRNA sequence database. In MicroRNA Protocols (pp. 129-138). Humana Press.
- Griffiths-Jones, S., Grocock, R. J., Van Dongen, S., Bateman, A., & Enright, A. J. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic acids research, 34(suppl_1), D140-D144.
- Griffiths‐Jones, S. (2010). miRBase: microRNA sequences and annotation. Current protocols in bioinformatics, 29(1), 12-9.
- Griffiths-Jones, S., Saini, H. K., van Dongen, S., & Enright, A. J. (2007). miRBase: tools for microRNA genomics. Nucleic acids research, 36(suppl_1), D154-D158.
- Griffiths‐Jones, S. (2004). The microRNA registry. Nucleic acids research, 32(suppl_1), D109-D111.
- Rehmsmeier, M., Steffen, P., Höchsmann, M., & Giegerich, R. (2004). Fast and effective prediction of microRNA/target duplexes. RNA, 10(10), 1507-1517.
- Kozomara, A., & Griffiths-Jones, S. (2013). miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic acids research, 42(D1), D68-D73.
- Enright, A. J., John, B., Gaul, U., Tuschl, T., Sander, C., & Marks, D. S. (2003). MicroRNA targets in Drosophila. Genome biology, 5(1), R1.
- John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., et al. (2005). Correction: Human microRNA targets. doi:10.1371/journal.pbio.0030264