Gareth Gordon Syngai1*, Pranjan Barman2, Rupjyoti Bharali2 & Sudip Dey3
1Department of Biochemistry, Lady Keane College, Shillong – 793001
2Department of Biotechnology, Gauhati University, Guwahati – 781014
3Sophisticated Analytical Instrument Facility, North Eastern Hill University, Shillong – 793022
email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
*Corresponding author: Gareth Gordon Syngai; email:email@example.com
BLAST, a sequence similarity search program, is an excellent starting point for teaching bioinformatics and has the potential to enhance a student’s grasp of biomedical, biochemical, and biogeochemical concepts. This article discusses the underlying concepts of the BLAST algorithm and the scores and statistics of its alignments, with illustrations using NCBI BLAST. The article also emphasizes the need for students to become familiar with the basic concepts and programs of bioinformatics, which has become a necessity in the biological sciences because of recent advances in high-throughput techniques for data generation and analysis.
Keywords: BLAST, algorithm, introductory tool, bioinformatics teaching, bioinformatics applications
The Basic Local Alignment Search Tool (BLAST) is one of the most commonly used tools for comparing sequence information and retrieving sequences from databases and is thus an excellent starting point for teaching bioinformatics (Kerfeld & Scott, 2011). BLAST has been utilized in nearly every branch of biology, far beyond the scope of molecular genetics, molecular biology and protein biochemistry, and this tool has made great contributions to many scientific fields since its development (Altschul et al., 1997; Altschul, 1991). Currently, the work of most biologists, bioinformaticians, evolutionists and medical scientists cannot progress without the use of BLAST (Dong-Wook et al., 2012).
The major reasons for the ever-growing popularity of BLAST are the flexibility of the search algorithm, reliable statistical reports, continual software development and the speed attained by the heuristic search methods (Neumann et al., 2013).
In addition, BLAST can be used to introduce students to concepts of molecular evolution (e.g., gene duplication and divergence; orthologs versus paralogs), which are often quite abstract. This is possible because of the abundance of sequence data in public databases, which makes it feasible to use searches tailored to a particular course or, better yet, to let students choose their own examples.
Another benefit of teaching students how the BLAST algorithm works is that it provides an opportunity to illustrate how mathematics functions as a language of biology. More significantly, understanding the steps in the calculation of an E-value provides an opportunity to show how the algorithm rests on fundamental principles of biochemistry and evolution (Kerfeld & Scott, 2011).
This paper presents a concise, conceptual approach with simplified interpretations of the BLAST algorithm to help students understand the basics of the BLAST program, which in turn has the potential to enhance their grasp of biomedical, biochemical, and biogeochemical concepts and thus widen the scope for multidisciplinary integration.
BLAST: The Tool
BLAST is a sequence similarity search program that can be used via a web interface or as a stand-alone tool to compare a user’s query to a database of sequences (Altschul et al., 1997; Altschul et al., 1990). There are several types of BLAST to compare all combinations of nucleotide or protein queries with nucleotide or protein databases (McGinnis & Madden, 2004). BLAST performs comparisons between pairs of sequences, searching for regions of local similarity (Pertsemlidis & Fondon III, 2001).
The rationale for local similarity searching is that functional sites (e.g., catalytic sites of enzymes) are localized to relatively short regions, which are conserved irrespective of deletions or mutations in intervening parts of the sequence. Thus, a search for local similarity may produce more biologically meaningful and sensitive results than a search attempting to optimize alignment over the entire sequence lengths (Attwood et al., 2007).
Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify “homologous” proteins or genes by detecting excess similarity between the newly determined sequence (the query sequence) and any similar sequence in the database; such excess similarity in turn reflects common ancestry.
Homology implies that sequences may be related by divergence from a common ancestor or share common functional aspects. Homologous genes found in different species that evolved from the same gene in a common ancestor are called orthologs, whereas homologous genes in the same organism (arising by duplication of a single gene in the evolutionary past) are called paralogs. Homologous genes (both orthologs and paralogs) often have the same or related functions (Pierce, 2002).
Sequence homology searches are a key computational tool of molecular biology and they are important as their products, the high scoring alignments, are used in a range of areas, from estimating evolutionary histories, to predicting functions of genes and proteins, to identifying possible drug targets (Pearson, 2013; Bayat, 2002; Bailey & Gribskov, 1998).
The BLAST algorithm was described by Altschul et al. in 1990. It became popular largely because implementations of it have been very efficient and it has been optimized to work with parallel UNIX architectures from an early stage (Attwood et al., 2007). The BLAST algorithm is a heuristic program, which means that it relies on some smart shortcuts to perform the search faster (Madden, 2002). However, in this trade-off for increased speed, the accuracy of the algorithm is slightly decreased (Zhimin & Zhongwen, 2013).
The algorithm itself is straightforward, the important concept being that of the segment pair. Given two sequences, a segment pair is defined as a pair of sub-sequences of the same length that form an ungapped alignment. BLAST calculates all segment pairs between the query and the database sequences, above a scoring threshold. The algorithm searches for fixed-length hits, which are then extended until certain threshold parameters are achieved. The resulting high-scoring pairs (HSPs) form the basis of the ungapped alignments that characterize BLAST output.
Subsequently, a modification of the algorithm was introduced for generating gapped alignments (Altschul et al., 1997). The new algorithm seeks only one, rather than all, of the ungapped alignments that make up a significant match, and hence speeds up the initial database search. Dynamic programming is used to extend a central pair of aligned residues in both directions to yield the final gapped alignment. Having dropped the requirement to find all ungapped alignments independently, the new algorithm is three times faster than its predecessor (Attwood et al., 2007).
The BLAST algorithm has three major steps, which are described below:
Step 1: BLAST filters low-complexity regions (e.g., CA repeats) and removes them from the query sequence (Pertsemlidis & Fondon III, 2001). Low-complexity regions and interspersed repeats typically match many sequences; such matches are normally not of biological interest, may lead to spurious results, and confound the statistics used by BLAST.
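The idea of building a gapped local alignment by dynamic programming can be sketched with the classic Smith-Waterman recurrence. This is a minimal illustration and not BLAST's own gapped extension, which restricts the search around a seed; the function name and the match, mismatch and gap values are assumptions chosen only for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost).

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGTTGCA", "ACGTGCA"))
```

Because a cell can never drop below zero, the best score always describes a local region rather than the full length of either sequence.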
BLAST offers two query masking modes to avoid such matches. One is known as “hard-masking” and replaces the masked portion of the query by X’s or N’s for all phases of the search. On the other hand, “soft-masking” makes the masked portion of the query unavailable for finding the initial word hits, but the masked portion is available for the gap-free and gapped extension once an initial word hit has been found (Camacho et al., 2009). Filtering is only applied to the query sequence (or its translation products), not to the database sequences. Default filtering is by the Nucleotide Dust Masker program (Morgulis et al., 2006) and SEG program (Wootton & Federhen, 1996). The BLAST formatter now can represent these regions by lower-case letters, making them distinct from the (upper-case) non-filtered regions. In addition, the user may select from three colors (black, gray, red) to vary the emphasis on these regions. This new display option is now the default, showing the masked regions in gray lower-case (Ye et al., 2006).
Next, the query sequence, which is a long string of either nucleotides or amino acids, is broken into small pieces called “words”. By default, DNA sequences are broken into words of 11 consecutive letters (the word length) and protein sequences into words of 3 letters. However, users can change the word length as desired (Pertsemlidis & Fondon III, 2001). For example, the nucleotide sequence ATCGTCGAT with a word length of 7 produces three words: ATCGTCG (first word), TCGTCGA (second word) and CGTCGAT (third word).
There can be at most L-w+1 such words, where L is the query sequence length, and w is the word length; in case of amino acid sequences w = 3 and for nucleotide sequences w = 11.
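The word-generation step can be sketched in a few lines of Python. The function name `query_words` is hypothetical; the example reuses the sequence ATCGTCGAT and the word length 7 from the text.

```python
def query_words(seq, w):
    """Return all overlapping words of length w, in order of position."""
    # A sequence of length L yields L - w + 1 words (none if L < w).
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


if __name__ == "__main__":
    words = query_words("ATCGTCGAT", 7)
    print(words)       # ['ATCGTCG', 'TCGTCGA', 'CGTCGAT']
    print(len(words))  # 9 - 7 + 1 = 3
```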
BLAST then uses a scoring matrix BLOSUM (block substitution matrix) or PAM (percent accepted mutation) to determine all high-scoring matching words from the database for each word in the query sequence (BLOSUM 62 is used as a default setting for amino acids). No gaps are allowed. The list of matches is reduced by taking only those that will score above a given threshold, called the neighbourhood word-score threshold (T). After doing this, approximately 50 of these matches are usually kept for each of the words generated from the original query.
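A rough sketch of how a neighbourhood word list could be built for one protein word is given below. To keep the example self-contained it uses a toy scoring scheme (identical residues score 5, all others score -1) in place of BLOSUM62, and it keeps words scoring at least T; the scoring values, function names and the example word are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical substitution score; a real search would look up BLOSUM62."""
    return 5 if a == b else -1


def neighbourhood(word, threshold, score=toy_score):
    """List every word of the same length scoring at least `threshold` against `word`."""
    neighbours = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            neighbours.append("".join(candidate))
    return neighbours


if __name__ == "__main__":
    hits = neighbourhood("PQG", threshold=9)
    print(len(hits), hits[:5])
```

With this toy scheme and T = 9, the list contains the word itself plus every word that differs from it at exactly one position. A real substitution matrix would instead favour conservative replacements over arbitrary ones.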
Step 2: BLAST searches through the target sequence database for exact matches to the word list generated. If a match is found, it is used to seed (hit) a possible alignment between the query and the database sequences.
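The seeding step can be illustrated with a simple word index over one database sequence. This sketch looks up only the exact words of the query (as for nucleotide searches); the function names are hypothetical, and a real implementation would use far more compact data structures.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every word of length w in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    index = index_words(subject, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            seeds.append((q, s))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGT", "TTACGTAC", 3))  # [(0, 2), (1, 3)]
```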
Step 3: These initial neighbourhood word hits act as seeds for initiating searches to find longer high-scoring pairs (HSPs) containing them.
The word hits are then extended in both left and right directions along each sequence for as far as the cumulative alignment score can be increased. The critical parameter controlling extension is called X. Low values of X cause alignments to terminate after only a few mismatches have been found, while high values of X allow alignments to continue through dissimilar regions. Extension of a word hit in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached (Deusdado & Carvalho, 2008; Korf, 2003).
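The extension of a word hit under the drop-off parameter X can be sketched as follows. The example implements only two of the stopping rules (a drop of X below the best score so far, and reaching the end of a sequence); the match and mismatch scores, the default X and the function names are assumptions for illustration.

```python
def score_pair(a, b, match=1, mismatch=-3):
    """Illustrative nucleotide scores; these values are assumptions."""
    return match if a == b else mismatch


def extend(query, subject, q, s, step, x_drop):
    """Walk from (q, s) in direction step (+1 or -1).

    Returns (best score gain, number of residues kept).
    """
    gain = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += score_pair(query[q], subject[s])
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain >= x_drop:
            break  # score fell X below the best value seen so far
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q, s, w, x_drop=5):
    """Extend a word hit of length w at (q, s); return (score, query_start, query_end)."""
    seed = sum(score_pair(query[q + k], subject[s + k]) for k in range(w))
    right, r_len = extend(query, subject, q + w, s + w, +1, x_drop)
    left, l_len = extend(query, subject, q - 1, s - 1, -1, x_drop)
    return seed + left + right, q - l_len, q + w + r_len


if __name__ == "__main__":
    query, subject = "TTACGTACGG", "CCACGTACAA"
    score, start, end = extend_seed(query, subject, 3, 3, 3)
    print(score, query[start:end])  # 6 ACGTAC
```

Raising `x_drop` lets the extension continue through longer mismatched stretches, which mirrors the behaviour of high X values described in the text.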
Gapped BLAST (Altschul et al., 1997) uses a lower threshold for generating the list of high-scoring matching words; the algorithm uses short matched regions with no insertions or deletions between them and within a certain distance of each other as the starting points for longer ungapped alignments. These joined regions are then extended using the same method as in the original BLAST.
Next, BLAST identifies and lists the maximal scoring segment pairs (MSPs) from the entire database (Pertsemlidis & Fondon III, 2001). A maximal scoring segment pair (MSP) is defined to be the highest scoring pair of identical length segments chosen from 2 sequences. An MSP is reported if its score exceeds a cutoff value S (Altschul et al., 1990), which is calculated by using the parameters of W (word length), T (the neighbourhood word score threshold), X (the maximum permissible drop off of the cumulative segment score), and a substitution matrix like the BLOSUM 62 for most of the BLAST programs (Deusdado & Carvalho, 2008; Rivera et al., 1998).
Finally, the scores and statistics calculated for the alignments are displayed as results in the output window.
BLAST Scores and Statistics
BLAST provides three related pieces of information, in the form of raw scores, bit scores, and E-values, that allow interpretation of its results.
The raw score for a local sequence alignment is the sum of the scores of the maximal-scoring segment pairs (MSPs) that make up the alignment. Because of differences between scoring matrices, raw scores are not always directly comparable. Bit scores, on the other hand, are raw scores that have been converted from the log base of the scoring matrix that creates the alignment to log base 2. This rescaling allows bit scores to be compared between alignments even if different scoring matrices have been used (Madden, 2002; Gibas & Jambeck, 2001).
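The rescaling from a raw score S to a bit score S' is usually written as S' = (λS − ln K) / ln 2, where λ and K are statistical parameters that depend on the scoring system. A minimal sketch follows; the function name is hypothetical and the values of λ and K are illustrative placeholders, not values taken from this article.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # lam and k below are placeholder values for illustration only.
    print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Since λ and K absorb the scale of the scoring matrix, two alignments produced with different matrices can be compared through their bit scores.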
Thus, BLAST uses statistical theory to produce a bit score and expect value (E-value) for each alignment pair (query to hit).
The bit score gives an indication of how good the alignment is: the higher the score, the better the alignment. In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences. A key element in this calculation is the “substitution matrix”, which assigns a score for aligning any possible pair of residues. The BLOSUM62 matrix is the default for most BLAST programs, the exceptions being blastn and MegaBLAST (programs that perform nucleotide-nucleotide comparisons and hence do not use protein-specific matrices).
The E-value, on the other hand, gives an indication of the statistical significance of a given pair-wise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. An alignment with an E-value of 0.05 means that about 0.05 hits with an equal or better score are expected to occur by chance in a search of that database, which corresponds to roughly a 5 in 100 (1 in 20) chance of seeing such a match by chance alone. Thus, an E-value greater than 1 indicates that the alignment has probably occurred by chance, and that the query sequence has been aligned to a sequence in the database to which it is not related. E-values less than 0.1 or 0.05 are typically taken to represent biological significance (Madden, 2002; Pertsemlidis & Fondon III, 2001). The default E-value threshold is 10, that is, up to 10 hits with scores equal to or greater than the alignment score are expected to occur by chance.
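The dependence of the E-value on the bit score and on the size of the search space can be sketched with the standard relation E = m · n · 2^(−S'), where m is the query length and n the total database length. The sketch ignores the edge-effect corrections that real implementations apply; the function names and the example numbers are assumptions.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score: m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hit counts."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    e = expect_value(bit_score=40.0, query_len=250, db_len=1_000_000_000)
    print(e, p_value(e))
```

For small values the P-value and the E-value are nearly equal (an E-value of 0.05 gives a P-value of about 0.049), which is why an E-value of 0.05 is often read as a 1 in 20 chance. Doubling the database size doubles the E-value of the same alignment.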
The BLAST Family of Programs
Since 1990, many variants of BLAST have been developed, each with its own specialized features. Early on, the original BLAST was split into two adaptations: NCBI BLAST and Washington University BLAST (WU BLAST). Both the BLASTs have program variations. Examples of the programs include BLASTN which can be used to compare a nucleotide sequence with a nucleotide database; BLASTP which can be used to compare a protein sequence with a database of protein sequences; and BLASTX which can take a nucleotide sequence, translate it, and query it versus a protein database in one step (Gish & States, 1993). TBLASTN can compare a protein query sequence to all six possible reading frames of a translated nucleotide database and is often used to identify proteins in new, un-described genomes. Finally, TBLASTX compares all six reading frames of a translated nucleotide query sequence to all six reading frames of a translated nucleotide database.
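The six reading frames used by the translated searches (BLASTX, TBLASTN and TBLASTX) can be illustrated with a short sketch based on the standard genetic code. The function names and the frame labels are hypothetical, and ambiguous or incomplete codons are simply handled by writing X or by dropping trailing bases.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCC").items():
        print(name, protein)
```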
In addition, NCBI has some of its own specialized variants of BLAST. For example, MEGABLAST is a program that can rapidly complete searches for sequences with only minor variations and it can more efficiently manage queries with longer sequences (Altschul et al., 1994). PSI-BLAST and PHI-BLAST are powerful tools that allow more complex and evolutionarily divergent proteins to be aligned (Altschul et al., 1997). These and other programs, as well as genomic BLAST databases, are all available on the NCBI BLAST website (Lobo, 2008).
BLAST+, CS-BLAST and DELTA-BLAST are other user-friendly members of the BLAST family that take advantage of increasing computer processing power and new algorithms (Neumann et al., 2013).
Performing the BLAST Run
A BLAST search can be performed on the NCBI website at http://www.ncbi.nlm.nih.gov/Blast. Various BLAST options are available on this home page (Figure 1).
Further, for performing a BLAST search, the query sequence should be in FASTA format, as shown in Figure 2.
To perform the run, we first open the NCBI home page and click on BLAST. The type of BLAST can then be selected from this window; in this case, we have selected the nucleotide blast option under Basic BLAST. Subsequently, the query sequence in FASTA format is pasted into the Enter Query Sequence section of the window. Next, the database for the search is selected from the drop-down menu (in our case, the nr database), and under program selection we have chosen to optimize for blastn (Figure 3). We can then click the BLAST button and wait a few seconds.
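For readers unfamiliar with the format: a FASTA record consists of a header line beginning with ">" followed by one or more lines of sequence. A minimal parser can be sketched as below; the function name and the example record are hypothetical and are not the sequence shown in Figure 2.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">example_query hypothetical sequence\nATCGTCGATG\nGCTAGCTA\n"
    print(parse_fasta(example))
```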
After a few seconds, the BLAST output window will appear showing the results of the BLAST search (Figs 4, 5 & 6).
Applications of BLAST in Biological Sciences
The BLAST tool is used in a wide range of biological applications, including: identification of homologous gene candidates across diverse genomes (Lu et al., 2006); species comparison by identifying similar genes in different organisms (Holton, 2004); comparative gene prediction, which involves conducting a search between two genome sequences to provide both sensitive and specific gene predictions (Parra et al., 2003); functional annotation of genomes for the identification of functional properties and biological roles of the genes in the genomes (Moriya et al., 2007); contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies (van Hijum et al., 2005); and pseudogene identification for understanding the evolutionary history of genes and genomes (Zhang et al., 2006).
This tool is also helpful in building datasets for phylogenetic analysis (Dereeper et al., 2010), and constructing phylogenetic dendrograms/trees from protein sequences (Kelly & Maini, 2013). Further, it is also used for designing target-specific primers for polymerase chain reaction (Ye et al., 2012).
The parallel development of large-scale sequencing projects and bioinformatics tools like BLAST has enabled scientists to study the genetic blueprint of life across many species and has helped bridge the gap between biology and computer science in the maturing field of bioinformatics (Lobo, 2008). As biological sequence data are generated at an ever-increasing rate, the role of bioinformatics in biological research will also continue to grow (Newell et al., 2013).
Hence, bioinformatics tools that allow scientists to explore genome sequence data have become a cornerstone of current biological research and as such should be included in any modern biology curriculum (Klein & Gulsvig, 2012; Ditty et al., 2010; Ranganathan, 2005). No science curriculum can remain current without a bioinformatics component. Undergraduate students increasingly need training in methods related to finding and retrieving information stored in vast databases (Maloney et al., 2010).
It is in this context that BLAST finds its ideal place as a tool for introducing students to bioinformatics applications. It is also one of the most widely used bioinformatics approaches, is accessible to any researcher over the internet, and is routinely used to assign sequences to functional and taxonomic categories; its applications range from the analysis of raw sequence data and genome comparisons to sequence-based data mining (Neumann et al., 2013).
It is therefore important that a modern biology course be taught in a way that familiarizes students with the basic concepts and programs of bioinformatics, which have become a necessity in the biological sciences. Biologists will continue to use so-called wet labs to give students a chance to experience experimental techniques involving organisms, tissues, and cellular components first hand. But in silico dry labs involving bioinformatics techniques and virtual lab exercises can be very effective, especially in genetics, cell biology, and molecular biology (Maloney et al., 2010).
Acknowledgements
The authors gratefully acknowledge the vital inputs given by Kitriphar Tongper, Gopi Ragupathi, and Alagu Lakshmanan during the writing of this manuscript.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25: 3389–3402.
Altschul, S.F., Boguski, M.S., Gish, W., Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nature Genetics 6: 119–129.
Altschul, S.F. 1991. Journal of Molecular Biology 219: 555–565.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
Attwood, T.K., Parry-Smith, D.J., Phukan, S. (Eds.) 2007. Pairwise Alignment Techniques. In: Introduction to Bioinformatics, Dorling Kindersley (India) Pvt. Ltd., New Delhi pp. 114-138.
Bailey, T.L. & Gribskov, M. 1998. Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14(1): 48–54.
Bayat, A. 2002. Science, medicine, and the future: Bioinformatics. In Clinical review. British Medical Journal 324: 1018–22.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. 2009. BLAST+: architecture and applications. BioMed Central Bioinformatics 10: 421.
Dereeper, A., Audic, S., Claverie, J-M., Blanc, G. 2010. BLAST-EXPLORER helps you building datasets for phylogenetic analysis. BioMed Central Evolutionary Biology 10(8): 1–6.
Deusdado, S.A.D., & Carvalho, P.M.M. 2008. SimSearch: A New Variant of Dynamic Programming Based on Distance Series for Optimal and Near-Optimal Similarity Discovery in Biological Sequences. In: J.M. Corchado et al. (Eds.) IWPACBB2008, Advances in Soft Computing 49: 206-216.
Ditty, J.L., Kvaal, C.A., Goodner, B., Freyermuth, S.K., Bailey, C., et al. 2010. Incorporating genomics and bioinformatics across the life sciences curriculum. PLoS Biology 8(8): e1000448.
Dong-Wook Kim, Ryong Nam Kim, Dae-Soo Kim, Sang-Haeng Choi, Sung-Hwa Chae, Hong-Seog Park. 2012. easySEARCH: A user-friendly bioinformatics program that enables BLAST searching with massive number of query sequences. Bioinformation 8(16): 792–794.
Gibas, C., & Jambeck, P. 2001. Sequence Analysis, Pairwise Alignment, and Database Searching. In: Developing Bioinformatics Computer Skills; O’Reilly Media Inc., Seventh Indian Reprint (2008) pp. 159-190.
Gish, W., & States, D.J. 1993. Identification of protein coding regions by database similarity search. Nature Genetics 3(3): 266-272.
Holton, W. C. 2004. The Path to Species Comparison. In: Environmental Health Perspectives 112(12): A 672.
Kelly, S., Maini, P.K. 2013. DendroBLAST: Approximate Phylogenetic Trees in the absence of Multiple Sequence Alignments. PLOS ONE 8(3): e58537 pp. 1-11.
Kerfeld, C.A., Scott, K.M. 2011. Using BLAST to teach “E-value-tionary” Concepts. PLoS Biology 9(2): e1001014.
Klein, J.R., & Gulsvig, T. 2012. Using bioinformatics to develop and test hypotheses: E. coli-specific virulence determinants. Journal of Microbiology & Biology Education 13(2): 161-169.
Korf, I. 2003. Serial BLAST searching. Bioinformatics 19(12): 1492-1496.
Lobo, I. 2008. Basic Local Alignment Search Tool (BLAST). Nature Education 1(1): 215.
Lu, G., Jiang, L., Helikar, R.M.K., Rowley, T.W., Zhang, L., Chen, X., Moriyama, E.N. 2006. GenomeBlast: a web tool for small genome comparison. BioMed Central Bioinformatics 7(Suppl 4): S18, 1–9.
Madden, T. 2002. The BLAST sequence analysis tool. In: NCBI Handbook (Eds. Mc Entyre, J., Ostell, J.), National Library of Medicine, Bethesda, MD.
Maloney, M., Parker, J., LeBlanc, M., Woodard, C.T., Glackin, M., Hanrahan, M. 2010. Bioinformatics and the Undergraduate Curriculum Essay. CBE-Life Sciences Education 9: 172-174.
McGinnis, S., Madden, T.L. 2004. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research 32: W20–W25.
Morgulis, A., Gertz, E.M., Schaffer, A.A., Agarwala, R. 2006. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. Journal of Computational Biology 13: 1028–1040.
Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A.C., Kanehisa, M. 2007. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Research, 35: W182–W185.
Neumann, R.S., Kumar, S., Shalchian-Tabrizi, K. 2013. BLAST output visualization in the new sequencing era. Briefings in Bioinformatics.
Newell, P.D., Fricker, A.D., Roco, C.A., Chandrangsu, P., Merkel, S.M. 2013. A Small-Group Activity Introducing the Use and Interpretation of BLAST. Journal of Microbiology & Biology Education 14(2): 238-243.
Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W., & Guigo, R. 2003. Comparative Gene Prediction in Human and Mouse. Genome Research 13:108–117.
Pearson, W.R. 2013. An Introduction to Sequence Similarity (“Homology”) Searching. Current Protocols in Bioinformatics, John Wiley & Sons, Inc. 42: 3.1.1–3.1.8.
Pertsemlidis, A., Fondon III, J.W. 2001. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biology Reviews 2(10): 2002.1-2002.10.
Pierce, B. A. 2002. Genomics. In: Genetics: a conceptual approach; W.H. Freeman and Co. pp. 548-582.
Ranganathan, S. 2005. Bioinformatics education-Perspectives and challenges. PLoS Computational Biology 1(6): e52.
Rivera, M.C., Jain, R., Moore, J.E., Lake, J.A. 1998. Genomic evidence for two functionally distinct gene classes. Proceedings of the National Academy of Sciences of the United States of America 95: 6239-6244.
van Hijum, S.A.F.T., Zomer, A.L., Kuipers, O.P., Kok, J. 2005. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Research 33: W560–W566.
Wootton, J.C., & Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods in Enzymology 266: 554–571.
Ye, J., Coulouris, G., Zaretskaya, I., Cutcutache, I., Rozen, S., Madden, T. L. 2012. Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BioMed Central Bioinformatics 13(134): 1-11.
Ye, J., McGinnis, S., Madden, T.L. 2006. BLAST: improvements for better sequence analysis. Nucleic Acids Research 34: W6–W9.
Zhang, Z., Carriero, N., Zheng, D., Karro, J., Harrison, P.M., Gerstein, M. 2006. PseudoPipe: an automated pseudogene identification pipeline. Bioinformatics 22(12): 1437–1439.
Zhimin, Z., & Zhongwen, C. 2013. Dynamic Programming for Protein Sequence Alignment. International Journal of Bioscience and Biotechnology 5(2): 141–150.