Genomics

The basic concepts of genome assembly

Dr. Muniba Faiza

A genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It contains all the heritable information, along with some regions that are never expressed. Although almost 98% of the human genome has been sequenced by the Human Genome Project, only 1 to 2% of it has been understood. Much of the human genome remains to be explored, whether in terms of genes or proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex once it is understood well.

The basic strategy for recovering the sequence of a genome is explained in the following steps:

  1. First, the whole genome of an organism is sequenced, which yields many thousands of fragments (reads) whose positions in the genome are unknown and which may start and end anywhere.
  2. Since we do not know the sequence or which fragment belongs next to which, the concept of ‘contigs’ is employed. A contig is a contiguous stretch of sequence built by joining reads whose ends overlap: the fragments are brought together purely by matching the overlapping regions of their sequences. In other words, many consecutive fragments are joined to form a contig, and many such contigs are formed during the joining process.
  3. The question that then arises is how we know that a fragment, which may come from a repeat, has been put in its right place, since a genome may contain many repeated regions. To overcome this, paired ends are used. Paired ends are reads from the two ends of the same DNA fragment, which are linked together. If one end aligns in, say, contig 1, then the other end, being part of the same fragment, is expected to align nearby, either in the same contig or in a neighboring one. Various software tools allow us to define different lengths for the paired ends.
  4. After that, the contigs are ordered and joined into scaffolds, sometimes called metacontigs or supercontigs, which are then further processed to assemble the genome.
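
To make step 2 concrete, here is a minimal sketch of joining reads into a contig by their overlaps. It is a toy greedy approach, not how Velvet or SPAdes work internally; the function names, the `min_len` threshold, and the example reads are assumptions chosen for illustration, and real assemblers must also handle sequencing errors, reverse complements, and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases; whatever
    sequences remain are the contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        # Join the pair, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical reads taken from the made-up sequence AGCTTAGGCATCCGA.
reads = ["CATCCGA", "AGCTTAGG", "TAGGCATC"]
print(greedy_assemble(reads))  # ['AGCTTAGGCATCCGA']
```

The greedy choice is what makes repeats dangerous: two reads from different copies of a repeat overlap just as well as true neighbors, which is why the paired-end information of step 3 is needed.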

All of this is done by assembly algorithms; Velvet is among the most widely used, and a more recent one is SPAdes.
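
Both Velvet and SPAdes are built around de Bruijn graphs, in which reads are cut into k-mers and each k-mer becomes an edge between two (k-1)-mers. The following is only a rough sketch of that idea, not the implementation of either tool; the function names, the value of `k`, and the example reads are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


def walk_unambiguous_path(graph: dict[str, set[str]], start: str) -> str:
    """Extend a sequence from `start` while each node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping around a cycle
            break
        path += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return path


# Hypothetical reads from the made-up sequence AGCTTAGGCATCCGA.
reads = ["AGCTTAGG", "TAGGCATC", "CATCCGA"]
graph = de_bruijn_graph(reads, k=4)
print(walk_unambiguous_path(graph, "AGC"))  # AGCTTAGGCATCCGA
```

In this toy case every node has a single successor, so one walk recovers the whole sequence. In a real genome, repeats create nodes with several successors, and the walk stops there; those stretches between branch points are what end up as separate contigs.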

In my experience, the more efficient algorithms are those that provide a large amount of information in one go. Just imagine that we have a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?

Thank you for reading. Don’t forget to share this article if you like it.

Dr. Muniba is a bioinformatician based in New Delhi, India. She completed her PhD in Bioinformatics at South China University of Technology, Guangzhou, China. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug designing. When she is not reading, she can be found enjoying time with her family.

3 Comments

  1. Fozail

    October 17, 2015 at 5:31 am

    I am delighted to read your article about the basics of genome assembly, but I would like to add a few facts. You outlined the steps involved in genome assembly and the methods being used.
    Apart from what you have mentioned in your article, a number of other algorithms/methods are used for genome assembly. These are:

    1. SSAKE
    2. SHARCGS
    3. VCAKE
    4. Newbler
    5. Celera Assembler
    6. Euler
    7. Velvet
    8. ABySS
    9. AllPaths
    10. SOAPdenovo

    These have been reported to perform better than earlier genome assemblers; several of them (Euler, Velvet, ABySS, AllPaths, and SOAPdenovo) are based on k-mers and de Bruijn graphs.

    You can read a full text article whose link is provided here, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646.

    Basically, using the above algorithms, the ends of reads from the same thread are overlapped in a hierarchical manner and then grouped into contigs, and contigs into scaffolds, sometimes called metacontigs or supercontigs. They are then further processed into different formats.

  2. Dr. Muniba Faiza

    October 17, 2015 at 4:34 pm

    Yes, you are right, sir. I left out the scaffold step because I wanted to give only an overview of genome assembly, and that is also why I didn’t say much about the algorithms except the latest one, i.e., SPAdes. I will add the scaffold step to the article.
    Thank you.

  3. Fozail

    October 17, 2015 at 5:08 pm

    You are most welcome..

Genomics

CoolBox- An open-source toolkit for genomic data visualization

A new toolkit called CoolBox has been developed for the visual analysis of genomic data [1]. It makes it easy to visualize patterns in large-scale genomic datasets.

Genomics

VISPR- A new tool to visualize CRISPR screening experiments

As CRISPR/Cas9 is a well-known genome editing technology, it is important to explore and analyze CRISPR screening experiments. In this article, we discuss a new tool developed for better visualization of CRISPR screening experiments.

Genomics

How to install Cortex on Ubuntu?

Dr. Muniba Faiza

Cortex is a user-friendly framework for genome analysis [1]. It needs little memory and performs quite efficiently. Its installation involves several steps. In this article, we will install Cortex on Ubuntu.

Genomics

How to Compress and Decompress FASTQ, SAM/BAM & VCF Files using genozip?

Dr. Muniba Faiza

genozip is a tool for lossless compression of large files including VCF, FASTQ, and SAM/BAM files [1]. In this article, we explain the usage of the genozip tool for the compression and decompression of these files.

Genomics

Installing BCFtools on Ubuntu

BCFtools is a set of utilities used to manipulate variant calls stored in the Variant Call Format (VCF) and its binary counterpart (BCF). It works with both compressed and uncompressed files. In this article, we will install BCFtools on Ubuntu.

Genomics

Installing CRISPRCasFinder on Ubuntu

Dr. Muniba Faiza

CRISPR/Cas9 is a rapidly growing genome editing technology. CRISPRs and CRISPR-associated (Cas) genes occur in prokaryotic genomes, and several tools are available to identify them. Among them, CRISPRCasFinder is used to search for CRISPRs and Cas genes in sequence data [1]. In this article, we will install CRISPRCasFinder on Ubuntu.

Genomics

Genozip- a new compression tool for VCF files

Variant Call Format (VCF) is a text file format used to store genomic variant data, often for thousands of samples. Because these files hold a large number of variants, they remain quite large even after compression. Recently, a new compression tool known as genozip has been introduced [1].

Genomics

Methods to detect the effects of alternative splicing and transcription on proteins

Dr. Muniba Faiza

Alternative splicing and transcription are among the most familiar biological processes. Alternative splicing is the process by which various forms of mRNA are generated from the same gene. A gene consists of exons and introns, and the exons can be joined together in different ways [1].

Genomics

Conventionally unconventional: Anecdote of small RNAs discoveries

The past decade has witnessed an incredible increase in the number of known small RNAs. As the name indicates, small RNAs are RNA transcripts of short length (approximately 21-24 nucleotides) [1-8]. These small RNA transcripts regulate various biological processes, ranging from the response to biotic/abiotic stress to the determination of tissue specificity [1-8]. Non-coding RNAs are classified based on their biogenesis and mode of function.

Genomics

GenVisR : A tool for genomic visualization

Dr. Muniba Faiza

The ever-increasing progress of sequencing techniques has generated a massive amount of genomic data [1]. This has led to an exponential growth of genomic datasets, which provide a huge amount of information to scientists. To identify patterns and investigate biological information, it is necessary to visualize genomes, but developing such tools is quite difficult.

Genomics

What is PRSice?

Dr. Muniba Faiza

Etiology is the study of the origin or causation of an event or phenomenon. Genetic etiology is the study of the genes responsible for particular traits in an organism. Identifying the genetic etiology has become standard practice when studying the genotypes and/or phenotypes of individuals. For this, a Polygenic Risk Score (PRS) is calculated.

Bioinformatics Programming

HTSeq : A Python framework to analyze high throughput sequencing data

Dr. Muniba Faiza

High-throughput sequencing is widely used because it saves a lot of time and provides good results, but it produces a huge amount of data that is difficult to manage and to operate on. To ease these tasks, Simon Anders and colleagues introduced a Python framework known as “HTSeq”.

Bioinformatics News

Mycobacteriophages and their potential as a source of biomolecules active against mycobacteria

So, today is the great festival of Christmas, the birthday of the Son of God. On this auspicious day, we want to present to you the power of nature, and how nature itself provides solutions to the problems that arise within it. We are all aware of the epidemic threat posed by Mycobacterium tuberculosis and other related species. In this article, we show how nature provides a solution against it.

As we know, a bacteriophage (bacterio = bacteria, phage = eater) is a virus that infects bacteria. A mycobacteriophage is a member of the group of bacteriophages that infect mycobacterial species as their hosts, e.g., Mycobacterium smegmatis and Mycobacterium tuberculosis, the causative agent of tuberculosis.

The rising incidence of tuberculosis, the emergence of multidrug resistance in Mycobacterium tuberculosis, and slow progress in finding new drugs make mycobacteriophages potential candidates for use as diagnostic and therapeutic tools against TB.

All the characterized mycobacteriophages are double-stranded DNA (dsDNA) tailed phages belonging to the order Caudovirales. Most are of the family Siphoviridae, characterized by long, flexible, non-contractile tails, whereas phages of the family Myoviridae have contractile tails. There is a notable absence of mycobacteriophages from the family Podoviridae (which have short, stubby tails), raising the question of whether long tails are needed to traverse the relatively thick mycobacterial cell envelope. dsDNA tailed phages are either temperate, forming stable lysogens and turbid plaques, or lytic, forming clear plaques in which the host cells are killed. Mycobacteriophages can also be studied by the morphology of their plaques, which vary in size and shape. Plaque morphology also depends on the burst size, which is the number of phage particles released on lysis of an infected bacterium.

Genometrics of 70 sequenced Mycobacteriophages

Since the mycobacterial cell wall consists of a mycolic acid-rich mycobacterial outer membrane attached to an arabinogalactan layer that is in turn linked to the peptidoglycan, it poses a significant challenge to phages. This challenge is met by a set of proteins: Lysin B proteins, which cleave the linkage of mycolic acids to the arabinogalactan layer; holins, which regulate lysis timing; and endolysins (Lysin As), which hydrolyze peptidoglycan.

Phages act on their hosts through a holin-endolysin system that is essential for programmed lysis. Endolysin is found to be associated with a protein component of the phage tail involved in facilitating penetration of the murein during injection of the genome into the host. Holins are small membrane proteins that form holes in the membrane through which the endolysin can pass. Holins control the length of the infective cycle of lytic phages so as to achieve lysis at an optimal time.

Endolysins are a potential source of antibacterials, and thus a possible replacement for antibiotics (which have a more wide-ranging effect), because of their specificity (targeting only a few strains of bacteria), the low probability of Mycobacterium developing resistance to them, and their novel mode of action.

Bioinformatics can assist this field of research by finding other such proteins, or by helping to design alternatives with similar pharmacophore (physical and chemical) properties. We can counter various disease threats by using the natural options provided to us and so remain healthy on this planet. The only point to remember is:

NATURE CAN SATISFY OUR NEEDS, BUT IT CANNOT SUSTAIN OUR GREED….. AS A HEALTHY BODY CONSISTS OF A HEALTHY MIND, THE SAME WAY.. A CONSERVED PLANET CONSERVES ITS SPECIES TOO…..

(A major part of this article consists of text copied from:

Hatfull, Graham F. “Mycobacteriophages: genes and genomes.” Annual Review of Microbiology 64 (2010): 331-356.)

For any other information, related references, or queries, please let us know at [email protected]

Genomics

Roary: Analysis of Prokaryote Pan Genome on a large-scale

Dr. Muniba Faiza

The microbial pan genome is the union of genes shared by genomes of interest. The term was first used by Medini et al. in 2005.

Genomics

GenomeD3 plot : Easy visualization of genomes

Dr. Muniba Faiza

As important as it is to sequence genomes, it is equally important to visualize them. Some tools exist to visualize genomes, but they are static and standalone…
