A genome is the complete set of DNA in an organism, including all of its genes. It holds all of the heritable information, along with regions that are never expressed. The Human Genome Project sequenced the large majority of the human genome, yet only about 1 to 2 % of it is well understood. Much remains to be discovered, in terms of both genes and proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy behind genome assembly, which sounds difficult but is not really complex once understood.
The basic strategy for reconstructing a genome sequence is explained in the following steps:
- First, the whole genome of an organism is sequenced. This yields thousands of short fragments (reads), or many more, whose positions are unknown and which can start and end anywhere in the genome.
- Since we know neither the full sequence nor which fragment belongs next to which, the concept of ‘contigs’ is used. A contig is a contiguous stretch of sequence built by joining reads whose ends overlap: when the end of one fragment matches the start of another, the two are merged. In this way many consecutive fragments are joined into one contig, and many such contigs are formed during the joining process.
- The question that arises is how we know that a fragment has been put in its right place, since a genome may contain many repeated regions and the fragment may come from any copy of a repeat. To resolve this, paired ends are used. Paired ends are the two ends of the same DNA fragment, read separately but known to belong together and to lie roughly a known distance apart. If one end aligns in, say, contig 1, the other end must align at about that distance from it, usually in the same contig or in a neighbouring one. Various software tools let us specify the expected distance between the paired ends.
- After that, the contigs are ordered and joined to form scaffolds, sometimes called metacontigs or supercontigs, which are then processed further to reconstruct the genome sequence.
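The joining steps above can be illustrated with a small Python sketch. This is a toy, greedy overlap approach written only for illustration, and it is not how Velvet or SPAdes work internally. The function names (`overlap`, `greedy_assemble`, `mates_consistent`) and the parameters `min_len`, `insert_size` and `tolerance` are hypothetical, and the sketch assumes error-free reads that all come from the same strand.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(seqs: list[str]) -> list[str]:
    """Remove duplicates and sequences fully contained in another one."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Whatever cannot be merged any further is returned as the contigs.
    """
    seqs = drop_contained(reads)
    while True:
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(seqs, 2):
            n = overlap(a, b, min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            return seqs  # no overlaps left: these are the contigs
        seqs.remove(best_a)
        seqs.remove(best_b)
        seqs.append(best_a + best_b[best_n:])  # join, keeping the overlap once
        seqs = drop_contained(seqs)


def mates_consistent(contig: str, left: str, right: str,
                     insert_size: int, tolerance: int = 2) -> bool:
    """Check that two paired ends lie about `insert_size` bases apart.

    Simplification: both mates are given on the same strand and only the
    first occurrence of each is considered, so repeats are not handled.
    """
    i = contig.find(left)
    j = contig.find(right, i + 1) if i != -1 else -1
    if i == -1 or j == -1:
        return False
    span = j + len(right) - i  # start of left mate to end of right mate
    return abs(span - insert_size) <= tolerance


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACCTGA']
print(mates_consistent(contigs[0], "ATGG", "CTGA", insert_size=13))  # True
```

A greedy merge like this can join the wrong fragments when the genome contains repeats, which is exactly the problem the paired-end check is meant to catch: a placement whose two ends are much closer or further apart than expected is suspicious.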
All of this is done by assembly programs. Velvet is one of the most widely used, and SPAdes is a more recent one.
In my experience, the more efficient algorithms are those that give us a large amount of information in one go. Just imagine that we have a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?
Thank you for reading. Don’t forget to share this article if you like it.