Biochemistry
Biostatistics and Bioinformatics
Basic Local Alignment Search Tool - BLAST
Description of Module
Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 09 Basic Local Alignment Search Tool - BLAST
Dr. Vijaya Khader Dr. MC Varadaraj
1. Objectives: In this module the students will
1. Understand BLAST for searching similar sequences in the databases.
2. Understand a heuristic approach implemented in basic BLAST which reduces the number of target sequences to be evaluated during similarity search using dynamic programming.
3. Learn about various BLAST variants available to be used for specialized applications with associated input parameters and interpretation of output parameters as well as a practical example for protein BLAST search.
4. Use BLINK for the exploration of similar protein sequences which have already been pre-computed for every protein sequence.
5. Use Genome BLAST for searching assembled genomes to find paralogous sequences.
6. Use specialized BLAST such as Primer-BLAST.
2. Concept Map
3. BLAST
The Basic Local Alignment Search Tool (BLAST), available at http://blast.ncbi.nlm.nih.gov/Blast.cgi, is a similarity search tool which finds regions of local similarity between biological sequences using local sequence alignment. It compares a query sequence to target/subject sequences in a selected database and calculates the statistical significance of each sequence alignment. BLAST may find one or more sub-sequences in a query which are similar to sub-sequences/domains in a target/subject sequence in the database.
BLAST reports the statistical significance and the actual alignment, which can be used to infer homology. Since homologous sequences have similar structure and consequently similar function, a newly determined sequence can be used for similarity search to infer structural, functional and evolutionary relationships between sequences as well as to help identify members of gene/protein families. A functionally interesting protein, say a spot in 2D-PAGE, can be sequenced using mass spectrometry and used as a query sequence to search for sequences with known structural and functional information. If the structure and function of some similar protein are known, then the same knowledge may be applied in understanding the unknown structure and function of the query protein.
[Concept map: BLAST, with branches Basic BLAST and BLINK, Genome BLAST, Specialized BLAST]
3.1. Basic BLAST
BLAST implements a heuristic approach, which results in substantial computational savings during similarity search. In this approach, the database to be searched is first filtered very quickly using the query sequence to exclude dissimilar sequences. This reduces the number of target sequences that must be evaluated by dynamic programming (DP) alignment to a very small number. The implementation, however, differs from that of FASTA and proceeds as follows:
(1) The complete query sequence (say qlnfsagw) is split into words of fixed size (the same as k-tup in FASTA), say 2 in this case, to create a list. This is based on the observation that a good alignment usually includes short identical or very similar fragments, similar to small diagonals in a dot plot. Possible mutations are then introduced into each word, and each mutated word is scored against the original word using a substitution matrix (e.g. PAM120); mutated words scoring above a pre-computed threshold are kept. The result is called the extended list for the original words.
(2) The database sequences are scanned for words matching the extended list, using the given scoring matrix. Sequences containing matching words are kept, and the offset (diagonal in the dot plot) of each matching word is recorded.
(3) The alignment of matched words is extended in both directions by aligning identical/mismatched residues in the two sequences. Extension requires two matching words on the same diagonal, and continues in each direction as long as the alignment score stays above a preset threshold. The resulting contiguous aligned segment pair contains no gaps and is called a high-scoring segment pair (HSP).
(4) The highest-scoring HSPs are chosen for further extension in both directions using dynamic programming, which introduces gaps, as long as the alignment score stays above a certain threshold.
(5) Finally, the alignment scores and E-values are calculated to evaluate the statistical significance of each alignment.
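Steps (1) to (3) can be illustrated with a minimal Python sketch. It is not the BLAST implementation: the scoring function toy_score (+5 for identity, -2 for mismatch) stands in for a real substitution matrix such as PAM120, the threshold and drop-off values are arbitrary, the two-hit requirement of step (3) is omitted, and all function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Toy substitution score (assumption, not PAM120): +5 identity, -2 mismatch."""
    return 5 if a == b else -2


def split_words(query, k=2):
    """Step 1: overlapping words of fixed size k with their query offsets."""
    return [(i, query[i:i + k]) for i in range(len(query) - k + 1)]


def neighborhood(word, threshold=3):
    """Step 1: mutated words scoring at least `threshold` against `word`."""
    return {
        "".join(cand)
        for cand in product(AMINO_ACIDS, repeat=len(word))
        if sum(toy_score(a, b) for a, b in zip(word, cand)) >= threshold
    }


def find_hits(query, subject, k=2, threshold=3):
    """Step 2: (query_offset, subject_offset) pairs where an extended-list word occurs."""
    index = {}
    for q_off, word in split_words(query, k):
        for neighbor in neighborhood(word, threshold):
            index.setdefault(neighbor, []).append(q_off)
    return [
        (q_off, s_off)
        for s_off in range(len(subject) - k + 1)
        for q_off in index.get(subject[s_off:s_off + k], [])
    ]


def _extend(query, subject, q_pos, s_pos, step, drop):
    """Walk along one diagonal in direction `step`; return (best gain, residues kept)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        gain += toy_score(query[q_pos], subject[s_pos])
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > drop:
            break  # score fell too far below the best seen so far
        q_pos += step
        s_pos += step
    return best, best_len


def extend_ungapped(query, subject, q_off, s_off, k=2, drop=6):
    """Step 3: gap-free extension of one hit; returns (score, query_start, query_end)."""
    seed = sum(toy_score(query[q_off + i], subject[s_off + i]) for i in range(k))
    right, r_len = _extend(query, subject, q_off + k, s_off + k, +1, drop)
    left, l_len = _extend(query, subject, q_off - 1, s_off - 1, -1, drop)
    return seed + left + right, q_off - l_len, q_off + k + r_len


if __name__ == "__main__":
    query, subject = "QLNFSAGW", "AAQLNFSAGWAA"
    for q_off, s_off in find_hits(query, subject):
        print(q_off, s_off, extend_ungapped(query, subject, q_off, s_off))
```

The returned query_end is exclusive, so the segment is query[query_start:query_end]. The gapped extension of step (4) and the statistics of step (5) are left out of this sketch.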
There are several BLAST variants available for similarity search to retrieve a specific target.
3.1.1. Selection of a Blast variant
Formulating a similarity search experiment with BLAST begins with deciding what type of sequences to compare (DNA, protein, or DNA translated to protein), keeping the goal of the search in mind. If the sequence under consideration is a protein or codes for a protein, then the search should probably take place at the protein level. This determines which BLAST variant to use and whether a translation is needed (available with BlastX, tBlastN and tBlastX only); otherwise use BlastN with a nucleotide query. Next, decide on any pre-treatment of the query and database sequences, such as filtering subsequences. Then choose the database to be searched; its size can be limited with an Entrez (pronounced ‘AahnTray’) query. Depending on the homology targeted, select a substitution matrix (PAM or BLOSUM) for scoring the alignment, along with the gap penalties. Choose the word size to achieve the required search sensitivity. Then decide on the output parameters, such as the statistical threshold for including significant matches and the number of sequences to be reported. Once the output is produced, interpret the results to draw an inference of homology. In addition, BLAST provides a tool, BLINK, for the exploration of similar protein sequences which have already been pre-computed for every protein sequence.
The table given next shows five BLAST variants for specific applications, so a variant can be chosen to suit the application at hand.
Specific application / Objective | Blast variant | Query | Database | Comparison at
Find homologous proteins | BlastP | Protein | Protein | Protein level
Analyze new DNA to find genes and seek homologous proteins | BlastX | DNA | Protein | Protein level
Search for genes in unannotated DNA | tBlastN | Protein | DNA | Protein level
Discover gene structure: search translated nucleotide database using a translated nucleotide query | tBlastX | DNA | DNA | Protein level
Seek identical DNA sequences and splicing patterns | BlastN | DNA | DNA | DNA level
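The choice summarized in the table can also be written as a small lookup. The sketch below is only an illustration of the decision logic; the dictionary BLAST_VARIANTS and the function choose_variant are hypothetical names and not part of any BLAST software.

```python
# (query type, database type, comparison level) -> BLAST variant, as in the table.
BLAST_VARIANTS = {
    ("protein", "protein", "protein"): "BlastP",
    ("dna", "protein", "protein"): "BlastX",
    ("protein", "dna", "protein"): "tBlastN",
    ("dna", "dna", "protein"): "tBlastX",
    ("dna", "dna", "dna"): "BlastN",
}


def choose_variant(query, database, level="protein"):
    """Return the BLAST variant for a query/database/comparison-level combination."""
    key = (query.lower(), database.lower(), level.lower())
    if key not in BLAST_VARIANTS:
        raise ValueError(f"no BLAST variant for combination {key}")
    return BLAST_VARIANTS[key]


if __name__ == "__main__":
    print(choose_variant("DNA", "Protein"))     # new DNA against a protein database
    print(choose_variant("DNA", "DNA", "DNA"))  # nucleotide-level comparison
```

Only the DNA query against a DNA database is ambiguous, which is why the comparison level is part of the key.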
The BLAST page at http://blast.ncbi.nlm.nih.gov/Blast.cgi can be used to choose a specific variant.
The table helps to decide whether to use a protein or a DNA query sequence and to choose a specific BLAST variant. Each BLAST variant requires one query sequence:
3.1.2. Input Parameters to BLAST variants
Translated BLAST: BlastX and tBlastX additionally require choosing a genetic code for translation of the query nucleotide sequence:
Choosing search Set:
Choosing a database to search: For homology searches, the most commonly searched database on the NCBI website is the nr database. The nr protein database combines data from non-redundant GenBank CDS translations, PDB, SwissProt, PIR and PRF. Redundant, i.e. superfluous identical, sequences are removed, yielding a collection of nearly all known proteins without repeating any given sequence. As databases are growing so rapidly, it is essential to use a current database. E-value (explained next) statistics are affected by database size. Therefore, the results from two searches can be compared only if both searches were performed with the same database and the same database size. If one is interested in searching for proteins of known structure, it is best to search just the pdb database.
Standard Protein BLAST, BlastP and Translated BLAST: BlastX
Standard Nucleotide BLAST BlastN and Translated BLAST: tBlastN/ tBlastX
There is an option to exclude a particular organism from the search. Excluding the source organism of the query allows a search for orthologous sequences. If the Exclude checkbox is not checked, the search is restricted to the named organism, which is equivalent to searching that organism for paralogous sequences. Similarly, the Entrez Query option restricts the reported results to those matching given Entrez keywords, as shown for PTS IIA next.
Standard Nucleotide BLAST: the BlastN interface offers three options for program selection:
o Megablast is intended for comparing a query to closely related sequences. It works best if the target percent identity is 95% or more and is very fast, with word-size options from 16 to 256.
o Discontiguous megablast uses an initial seed that ignores some bases (allowing mismatches) and is intended for cross-species comparisons, with word-size options of 11 and 12.
o BlastN is slower, but allows a word size down to seven bases, with 11 and 15 also available.
Similarly, Standard Protein BLAST: the BlastP interface offers four options for program selection:
o BlastP simply compares a protein query to a protein database, with an option for a word size of 2, 3 or 6.
o PSI-BLAST allows the user to build a PSSM (position-specific scoring matrix) using the results of the first BlastP run, with an option for a word size of 2 or 3.
o PHI-BLAST performs the search but limits alignments to those that match a pattern in the query, with an option for a word size of 2 or 3.
o DELTA-BLAST constructs a PSSM using the results of a Conserved Domain Database search and searches a sequence database, with an option for a word size of 2 or 3.
One can run BLAST at this stage, or open Algorithm parameters:
Algorithm parameters has three sections: General Parameters, Scoring Parameters, and Filtering & Masking.
General Parameters: BlastP - The most important is the word size, as shown next. The smaller the word size, the more sensitive (but slower) the search will be.
General Parameters: BlastN - The most important is the word size.
Scoring Parameters: BlastN - Only match/mismatch scores and gap costs can be set here.
Scoring Parameters: BlastP, tBlastN and BlastX - Matrix, gap costs and compositional adjustment can be set here.
Scoring Parameters: tBlastX - Translated BLAST: Only the scoring matrix can be chosen.
Filtering for low complexity regions: The statistics for database searches assume that unrelated sequences will look essentially random with respect to each other. However, certain patterns in sequences violate this rule. The most common exceptions are long runs of a small number of different residues (such as a poly-alanine tract). Such regions could spuriously obtain extremely high match scores. For this reason, the search program may automatically remove such sections in proteins (replacing them with X) using the SEG program, if the checkbox to filter low complexity regions is checked. There are two other filters. With Mask for lookup table only, no hits are found based upon low-complexity sequence, but the BLAST extensions are performed without masking and so can extend through low-complexity sequence. With the Mask lower case letters option selected, you can paste a FASTA sequence in upper case characters and denote the areas you would like filtered with lower case. In this case, masking is done at both the lookup table stage and the BLAST extension stage. This allows users to customize what is filtered from the sequence during the comparison to the BLAST databases.
Standard Nucleotide BLAST: BlastN additionally offers an option to mask species-specific repeats.
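The two masking ideas described here can be sketched in a few lines of Python. This is an illustration only and not the SEG algorithm: the entropy-based filter, its window length of 12 and its threshold of 2.2 bits are assumptions chosen for the example, and the function names are hypothetical.

```python
import math
from collections import Counter


def mask_lower_case(seq, mask_char="X"):
    """Replace residues typed in lower case with the mask character."""
    return "".join(mask_char if c.islower() else c for c in seq)


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Mask every window whose composition entropy is below `min_entropy`."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        # Entropy is always computed on the original, unmasked sequence.
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


if __name__ == "__main__":
    print(mask_lower_case("MKTaaaaaLLV"))
    print(mask_low_complexity("MKVLDERT" + "A" * 12 + "GHWYPNQS"))
```

Because windows that only partly overlap the poly-alanine tract can also fall below the threshold, this simple filter may mask a few flanking residues as well; SEG refines such boundaries, which this sketch does not attempt.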
3.1.3. Interpretation of similarity results: Inferring homology from BLAST search results relies on both statistical and biochemical interpretation.
Statistical significance of alignment: The raw score (S) for an alignment is calculated by summing the column score for each aligned position (using a specified substitution matrix) and the score for each gap position (using affine gap costs). The BLAST server reports two measures to assess the quality of a search: the E-value and the bit score.
The E-value, i.e. expectation value, is defined as the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E-value, the more significant the score. E is given by the product of the P-value, the number of residues in the database searched (m) and the number of residues in the query sequence (n):
$$E = m \, n \, P$$
where the P-value is given by
$$P = K e^{-\lambda S}$$
The parameters K and lambda ($\lambda$) can be thought of simply as natural scales for the alignment (search) space size and the scoring system respectively, and S is the raw score of the alignment. The raw score S is normalized to a bit score because raw scores have little meaning without detailed knowledge of the scoring system used: unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years.
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
By normalizing a raw score using this formula, one obtains a "bit score", represented as S', which has a standard set of units. The E-value corresponding to a given bit score is simply $E = m \, n \, 2^{-S'}$, which is the same as $E = m \, n \, P$. Bit scores subsume the statistical essence of the scoring system employed, so that to calculate significance one needs to know in addition only the size of the search space. The higher the bit score, the more significant the alignment.
However, one should actually look at the quality of the alignment before accepting the score.
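The relations between raw score, P-value, bit score and E-value described in this section can be checked numerically. In the sketch below, the function names are hypothetical, and the values of lambda, K, m and n in the demonstration are illustrative assumptions rather than values taken from any particular search.

```python
import math


def p_value(raw_score, lam, k):
    """P = K * exp(-lambda * S)."""
    return k * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score, m, n, lam, k):
    """E = m * n * P, computed from the raw score."""
    return m * n * p_value(raw_score, lam, k)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), computed from the bit score."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041      # illustrative scoring-system parameters (assumed)
    m, n = 50_000_000, 150     # illustrative database and query lengths (assumed)
    raw = 120
    bits = bit_score(raw, lam, k)
    print(bits, e_value_from_raw(raw, m, n, lam, k), e_value_from_bits(bits, m, n))
```

Both E-value functions give the same result for the same alignment, which shows why a bit score together with the search space size is enough to judge significance.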
For detailed help on BLAST visit http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html or ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf
3.1.4. An example for BlastP
The FASTA sequence for Human Glycophorin A protein is shown next
>Glycophorin A Human
MYGKIIFVLL LSEIVSISAS STTGVAMHTS TSSSVTKSYI SSQTNDTHKR DTYAATPRAH EVSEISVRTV YPPEEETGER VQLAHHFSEP EITLIIFGVM AGVIGTILLI SYGIRRLIKK SPSDVKPLPS PDTDVPLSSV EIENPETSDQ
As biochemists, let us use BlastP to find sequences structurally and functionally homologous to this Human Glycophorin A protein. In the Enter Query Sequence section, paste the amino acid sequence of the mature peptide and enter GLPA_Human as the job title.
In the Choose Search Set section, select the Non-redundant protein sequences (nr) database and limit the organism to mammals (taxid:40674). Non-redundant means that a given protein sequence will not be present twice. The nr database consists of non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding environmental samples from WGS projects.
In the Algorithm parameters section, select word size 6 and the BLOSUM62 matrix with a moderate gap opening penalty of 11 and a low gap extension penalty of 1; this combination is suited to searching for moderately diverged sequences. Note that a larger word size speeds up the search, whereas a smaller word size makes it more sensitive. Then click BLAST with the "Show results in a new window" check box checked.
The results page will open in a new browser tab.
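The same search can in principle be submitted from a script instead of the web form. The sketch below assumes that Biopython is installed and that its NCBIWWW.qblast function is used with the keyword arguments entrez_query, word_size, matrix_name and gapcosts; the helper names build_blastp_params and run_blastp are hypothetical. The parameter values mirror the web example, and the Entrez term txid40674[ORGN] is taken to correspond to the mammals limit.

```python
GLPA_HUMAN = """
MYGKIIFVLL LSEIVSISAS STTGVAMHTS TSSSVTKSYI SSQTNDTHKR
DTYAATPRAH EVSEISVRTV YPPEEETGER VQLAHHFSEP EITLIIFGVM
AGVIGTILLI SYGIRRLIKK SPSDVKPLPS PDTDVPLSSV EIENPETSDQ
"""


def build_blastp_params(sequence):
    """Collect the settings of the web example; blanks in the sequence are removed."""
    return {
        "program": "blastp",
        "database": "nr",
        "sequence": "".join(sequence.split()),
        "entrez_query": "txid40674[ORGN]",  # limit to mammals (assumed Entrez syntax)
        "word_size": 6,
        "matrix_name": "BLOSUM62",
        "gapcosts": "11 1",                 # gap opening 11, gap extension 1
    }


def run_blastp(params):
    """Submit the search to NCBI; needs Biopython and network access."""
    from Bio.Blast import NCBIWWW  # imported lazily so the sketch loads without Biopython
    return NCBIWWW.qblast(**params)


if __name__ == "__main__":
    params = build_blastp_params(GLPA_HUMAN)
    print(len(params["sequence"]), "residues prepared for", params["program"])
    # handle = run_blastp(params)  # uncomment to actually query the NCBI server
```

The network call is left commented out because a remote search can take minutes and should not be repeated unnecessarily.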
The results page has three important sections: Graphic Summary, Descriptions and Alignments. In addition, the hyperlinks for other reports including search summary, taxonomic reports, distance tree of results and multiple alignment are also presented. Graphic Summary, Descriptions and Alignments will be presented as expanded lists.
The Graphic summary is shown next:
The Graphic Summary shows that the protein has a conserved domain belonging to the Glycophorin_A superfamily.
Next, the horizontal lines with colour codes show the distribution of Blast hits according to alignment score.
The description section presents an ordered list of the matches according to scores and E-values.
The first two entries are shown above and then some entries from middle of this list are shown next.
In the list above, you can reach the actual alignment by clicking on any entry, as shown next for the glycophorin A [Mus musculus] hyperlink.
This will take you to the actual alignment:
The alignment above shows that a local region in the query sequence (amino acids 64 to 122) of Human Glycophorin A aligned with a local region in the target/subject sequence (amino acids 100 to 158) of mouse Glycophorin A. To reach this entry in NCBI, click on the ref|NP_034499.3 hyperlink. The complete entry will be displayed.
In this way we can scan the relevant entries to extract functional information about this protein. Similarly, if three-dimensional structures of similar proteins are available, we can use those structures to understand the molecular mechanism of its function. For example, one alignment with a known three-dimensional structure is shown next.
In addition, to see the search summary, click on the Search Summary hyperlink in the Other reports tab.
It will display the search parameters, the database used and the computed statistics.
To have a look at the multiple alignment, click on its hyperlink in the Other reports tab. In addition to the multiple alignment, the results window presents legends for links to other resources, including UniGene, GEO, Gene, Structure and Map Viewer.
The next picture shows the links for entries belonging to Gene, MapViewer and Structure. These help us to draw an inference about the target information.
3.1.5. BLINK
BLAST Link (BLINK) available at http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?mode=query provides a link option on protein records that displays the results of a pre-computed BLAST search of that protein against all other protein sequences at NCBI.
The result for P07515, i.e. Hpr, a phosphocarrier protein, is shown next:
This shows that there are 872 similar sequences in bacteria and none in archaea, metazoa, fungi, plants, viruses or others.
3.1.6. SmartBLAST
SmartBLAST is available at http://blast.ncbi.nlm.nih.gov/smartblast/?LINK_LOC=BlastHomeLink.
SmartBLAST processes a protein query to present a concise summary of the three best matches from the non-redundant protein sequence database along with the two best protein matches from well-studied reference species. If possible, the two matches from the reference species dataset will be from different organisms. SmartBLAST produces these results using a combination of an optimized BLASTP search, a new implementation of BLAST meant to find closely related matches, and a multiple alignment. Additionally, SmartBLAST presents Conserved Domain Database matches to your query.
The hypothetical protein EF0031 in the Enterococcus faecalis V583 genome has the following sequence:
>EF0031 length=863
MKKNISQLTQWLKEHSLLLKLIFLGSVLVFVANQVTHIAQGMSWADIFST MEQQSTGRLIGMVLAGLLGVIPMLLYDYVVVKLLEKEGKPPMKRMDWLTS AWVTNTINNLAGFGGVVGATLRINFYGKDVPRGKVVATVSKVALFLISGL SILSFVAFVDLFFIRTQNVFREYWVWLLLGSLIAPALWFFTYLKRRTLFK TFFPKAVLLLFGASLGQWLGGMFAFLMIGRLMQVPVSMVSVYPMFVIATL IGMLTMVPGGMGTFDVLMILGLSQLGIDRSQAIVWLLYYRLFYYVTPFMT GVILFLQQAGMKVNQFFDNLPRLFSQKVAHFILVAALYFAGIMMVLLSTV TNLSNVSRLFQVLLPFSFNFLDQTLNLFVGFLLLGLARGISMKVKKAYWP TIILLGFCIVNTVARTTSWQLIAVYAVILLAVILARKEFYREKFVYSWGA LTVDSILFGCLFIGYAVAGYYAARPAGGNQVINHFLLFPSDDVWFNGLIG LSISLIGLFFLYQYLAETTVTLGEGFEKARLTRFLEKFGGNEGSQFLYLK DYGHFYYQEEGEDQVLFGFQMKFNKCFVLADPIGQREKWTAATLAFMDQA DLLGYQLVFYRISEEYVMNLHDCGFEFMKVGEEGLIQFDEPSTVNQTAWT ETVTEKIAAEAADFQFEFYPETISDALYQELERVSADWSRNQKERYFIGG RLDPEYLKCSSVGLVRQKQTVIGFITGKEMEKGKSISYDLLRIRSDAPAF TREYLFTHFIETYQQQGYQLIDIGMAPLANVGESKYSFLKERFVNIFYKY SYQIYAFQDTRKRKEQYVTSWQPRYFAYPKRTSVLFAFVQLSLLITKGRH QSVSLVEEAMTEI
With this sequence as input to SmartBLAST, the following result was produced. To see the annotation of the graphical output, move the mouse over a bar.
3.2. BLAST Assembled Genomes
At http://blast.ncbi.nlm.nih.gov/Blast.cgi, start entering Enterococcus faecalis. As you type, the dropdown list will suggest available genomes, from which you can select. If the genome is not suggested, then you can click a broad category as shown next. For the present example, however, select the suggested option with taxid: 1351 and click the button.
This tool searches for similar nucleotide sequences in a chosen assembled genome using the Basic Local Alignment Search Tool. Consequently, it is useful for finding similar sequences within the same genome.
If the query sequence is from the same assembled genome, then this is equivalent to searching for paralogous sequences in the chosen assembled genome.
3.3. Specialized BLAST
The last section of the BLAST homepage, http://blast.ncbi.nlm.nih.gov/Blast.cgi, is Specialized BLAST for specific applications.
4. Summary
In this module, the students:
Understood BLAST for searching similar sequences in the databases.
Understood a heuristic approach implemented in basic BLAST which reduces the number of target sequences to be evaluated during similarity search using dynamic programming.
Learnt about various BLAST variants available to be used for specialized applications with associated input parameters and interpretation of output parameters as well as a practical example for protein BLAST search.
Used BLINK for the exploration of similar protein sequences which have already been pre-computed for every protein sequence.
Learnt to use Genome BLAST for searching assembled genomes to find paralogous sequences.
Learnt to use specialized BLAST such as Primer-BLAST.