How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

A typical chimeric sequence obtained from Pintail version 1.0

Detecting chimeric (or recombinant) sequences from a sequence dataset is an important part of sequence analysis especially for reconstruction of deep phylogenies as well as for sequence similarity analyses. This article focuses on methods of chimera detection in high quality 16S rRNA sequences from Sanger sequencing with good read length (>750bp). With such large size they become potential candidates for chimera formation. With culture-independent approaches for analyses of microbial diversity picking up fast with high throughput sequencing methods, the amount of chimeric sequences being published in the databases are also increasing exponentially. This is the era of Metagenomics or simply put community DNA analyses where DNA from thousands of species gets pooled up and is then analysed. This further increases chances of chimera formation. Chimeras are usually formed during polymerase chain reaction (PCRs) but in some rare cases they are for real. Therefore, it becomes relevant to adopt methods which can clean the sequence datasets of Chimeras.

Recently, a number of chimera detecting software for 16S rRNA gene sequences have been launched namely Pintail, Mallard and Bellerophon. First two software applications are available at http://www.bioinformatics-toolkit.org and the last one is available at http://greengenes.lbl.gov/cgi-bin/nph-index.cgi. Pintail and Mallard can detect chimeras and anomalies in the 16S rRNA genes based on extent of pair-wise percentage similarity between the query and related sequences. In chimera analysis by Pintail 1.0, the query sequences which could be putative recombinants are compared on a one (query)-on-one (subject) basis with a list of closely related sequences identified by BLAST searches. As Pintail is a one-on-one query-subject comparison, it is highly stringent. This is not the case with Mallard. In Mallard, one of the sequences from within a dataset of query sequences is randomly chosen as subject, while rest remain as query. A many (query)-on-one (subject) comparison follows, which is easy and completes in less time as compared to Pintail. This is to be noted that Mallard is of limited use if the query sequences are too diverse or really novel in the first place.

Another software for detecting chimeras in 16S rRNA genes i.e. Bellerophon ver 3.0 from Greengenes is more dedicated to 16S rRNA sequences. Here, the sequences are required to be submitted as NAST (Nearest Alignment Space Termination Tool) formatted file. The NAST alignment server at Greengenes has more than one million 16S rRNA sequence records. Upon submission of the NAST formatted file, the server launches a localized BLAST search for each query sequence with the 16S rRNA gene sequence library on its server. It

checks for potential chimeras in the respective query-subject alignment, one-on-one. The outcome of the entire process is a couple of EXCEL sheets emailed to the user with the query sequences, their best matches, and BLAST score values. The BLAST score threshold value can be set by the user, below which the software automatically removes the sequences not to be considered for chimera detection. Finally, it tells whether a potential break-point was found or not (in essentially Yes or No format). It is user-friendly and particularly good for large datasets with high amount of sequence diversity. The only demerit of the software is that if there is a relatively novel sequence in the query batch, it receives a low score being highly unrelated with the existing records and thus stands at a risk of getting omitted. Hence, one has to be really careful while using these programs as there could be loss of sequence diversity especially if the data is coming from an extreme site (with more newer/novel sequences) or if the data is coming from some NGS project with nice long reads and good coverage as in the case of Pac Bio Machine. It is worth mentioning here that while Pintail and Mallard can be applied for any given DNA sequence data, Bellerophon is a dedicated program for 16S rRNA.