The basic concepts of genome assembly

A genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It carries all the heritable information and also contains regions that are never expressed. The Human Genome Project sequenced most of the human genome, yet only about 1 to 2 % of it codes for proteins, and much of the rest is still poorly understood. There is still a lot to discover in the human genome, in terms of both genes and proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex once it is understood well.

The basic strategy for recovering the sequence of a new genome is explained in the following steps:

  1. First of all, the whole genome of an organism is broken into fragments and sequenced, which yields many thousands of short reads whose positions in the genome are unknown; each one can start and end anywhere.
  2. Now, since we don’t know what the full sequence is or which fragment belongs next to which, the concept of ‘contigs’ is employed. A contig is a contiguous stretch of sequence built by joining reads whose ends overlap: when the end of one fragment matches the beginning of another, the two are merged. In this way many consecutive fragments are joined into a single contig, and many such contigs are formed during the joining process.
  3. Now the question arises: how do we know that a fragment which may be a repeat has been put in its right place, given that a genome may contain many repeated regions? To overcome this, paired ends are used. Paired ends are reads taken from the two ends of the same DNA fragment, so they are linked together and lie an approximately known distance apart. If one end of the fragment aligns in, let’s say, contig1, then the other end must lie about that distance away, either in the same contig or in a neighbouring one. Various software tools let us specify the expected distance between the paired ends.
  4. After that, the contigs are ordered and joined into scaffolds, sometimes called metacontigs or supercontigs, which are then processed further to obtain the genome sequence.
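
The steps above can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions, not how Velvet or SPAdes work internally: the reads are error-free, all come from the same strand, and none is contained inside another. The function names (`overlap`, `greedy_contigs`, `count_links`) and the example reads are made up for this sketch. Reads are merged greedily by their longest suffix/prefix overlap to form contigs, and read pairs whose two ends land in different contigs are counted as evidence that those contigs belong in the same scaffold.

```python
from collections import Counter
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_contigs(reads, min_len=3):
    """Repeatedly merge the two sequences that share the longest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(seqs, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing overlaps any more: what is left are the contigs
        seqs.remove(best_a)
        seqs.remove(best_b)
        seqs.append(best_a + best_b[best_len:])
    return seqs


def count_links(contigs, read_pairs):
    """Count read pairs whose two ends fall in two different contigs."""
    links = Counter()
    for left, right in read_pairs:
        i = next((k for k, c in enumerate(contigs) if left in c), None)
        j = next((k for k, c in enumerate(contigs) if right in c), None)
        if i is not None and j is not None and i != j:
            links[(i, j)] += 1
    return links


# Hypothetical toy reads: two groups that overlap within, but not between.
reads = ["ATGGCGT", "GCGTACC", "TTAGCAG", "GCAGGAT"]
contigs = greedy_contigs(reads)

# Hypothetical read pairs, both ends written on the same strand for simplicity.
pairs = [("GGCG", "AGCA"), ("GTAC", "CAGG"), ("ATGG", "TACC")]
links = count_links(contigs, pairs)

print(contigs)
print(dict(links))
```

Real assemblers also have to handle sequencing errors, reverse-complement reads and repeats, which is exactly where the greedy approach breaks down and where the paired-end distance information becomes important.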

All of this is done by assembly algorithms; Velvet is among the most widely used, and SPAdes is a more recent one.
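
Velvet and SPAdes are both built around de Bruijn graphs of k-mers rather than direct read-to-read overlaps. The following is a minimal sketch of that idea, not the actual implementation of either tool: the function names and the toy reads are assumptions made for illustration, and the sketch ignores sequencing errors, reverse complements and coverage. Each read is cut into k-mers, every k-mer becomes an edge between its first and last (k-1) bases, and a contig is read off by walking the graph for as long as the path is unambiguous.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend `start` while the current node has exactly one successor."""
    sequence, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:
            break  # a cycle, which is what a repeat looks like in the graph
        seen.add(node)
        sequence += node[-1]
    return sequence


# Hypothetical toy reads covering one short stretch of sequence.
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

The walk stops as soon as a node has zero or several successors; a node with several successors is a branch point, and resolving such branches is where repeats and paired-end information come into play.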

In my experience, the more efficient algorithms are those that give us a large amount of information in one go. Just imagine that we had a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?

Thank you for reading. Don’t forget to share this article if you like it.

Muniba is a bioinformatician based at the South China University of Technology. She has cutting-edge knowledge of bioinformatics tools, algorithms, and drug design. When she is not reading, she can be found enjoying time with her family.

3 Comments

  1. I am delighted to read your article about the basics of genome assembly, but I would like to add a few facts. You have outlined the steps involved in genome assembly and the methods that are used.
    Apart from what you have mentioned in your article, a number of other algorithms/methods are used for genome assembly. These are:

    1. SSAKE
    2. SHARCGS
    3. VCAKE
    4. Newbler
    5. Celera Assembler
    6. Euler
    7. Velvet
    8. ABySS
    9. AllPaths
    10. SOAPdenovo

    These have proved much better than earlier genome assemblers; several of them (Euler, Velvet, ABySS, AllPaths and SOAPdenovo) are based on k-mers and de Bruijn graphs.

    You can read the full-text article at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646.

    Basically, using the above algorithms, the overlapping ends of reads are matched in a hierarchical manner, the reads are grouped into contigs, and the contigs into scaffolds, sometimes called metacontigs or supercontigs. They are then processed further into different formats.

  2. Yes, you are right, sir. I forgot to mention the scaffold step because I wanted to give only an overview of genome assembly, and that’s why I didn’t say much about the algorithms except the latest one, i.e., SPAdes. But I will add the scaffold step to the article.
    Thank you
