• No results found

Sequence Similarity Search

N/A
N/A
Protected

Academic year: 2022

Share "Sequence Similarity Search "

Copied!
22
0
0

Loading.... (view fulltext now)

Full text

(1)

1

Biochemistry

Biostatistics and Bioinformatics

Sequence Similarity Search

(2)

2

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

Description of Module Subject Name Biochemistry

Paper Name 13 Biostatistics and Bioinformatics Module Name/Title 08 Sequence Similarity Search

Dr. Vijaya Khader Dr. MC Varadaraj

(3)

3

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

Objectives: In this module, students will –

1. understand the meaning of similarity search

2. learn comparing scoring systems based on evolutionary divergence (PAM family) and conserved sequence (BLOSUM family) to evaluate extent of similarity between two sequences.

3. decide whether a protein or a DNA sequence to be used for similarity search. In addition, students will understanding masking of unwanted subsequence(s) so as to be discarded during similarity search.

4. decide the method/algorithm to be used for similarity search 5. decide, which program to be used for the selected method.

6. choose which databases to search

7. decide the number of search results to be returned by similarity search program.

8. Interpret the search results to draw an inference of homology 9. Practice similarity search, by taking a real example

1. Concept Map

Conserved sequence Matrices (BLOSUM)?

Evolutionary Matrices (PAM)?

Practical Similarity Search Example Choose Database

Results: How many / Online or Email Interpretation of result output

Similarity search

Scoring Systems Options

Exhaustive Search 1. SSSERCH

2. GLSEARCH 3. LLSEARCH

Search Algorithm

Heuristic Search 1. FASTA

2. FASTX/FASTY 3. BLAST Query (Protein or DNA?)

Mask subsequence?

(4)

4

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

2. Sequence Similarity Search

Sequence similarity search is a tool used to find all similar sequences in databases with functional and evolutionary relationship with a sequence in hand. In addition, it is used to identify other members of this sequence family. The sequence in hand is called a query sequence and each individual sequence in the database is called target or subject sequence. In the “module 08 sequence alignment – practical example”, we have seen that dynamic programming (DP) method of sequence alignment is guaranteed to find optimal alignment between two sequences. In addition, DP calculates optimal identity and similarity score for alignment length involving both sequences. The main application of sequence alignment is in searching similar sequences in the database. Once the similarity is significant, it can be used to draw an inference of homology between two sequences. Significance of similarity is drawn from overlap alignment length Vs.

number of identities. In addition, the alignment can be statistically assessed, using P-value, to draw an inference of homology, and once the homology is inferred, then, all the experimental knowledge of target sequence can be used to interpret the structure, function and evolution of query sequence. Therefore, if we have a protein for which no experimental structural or functional knowledge is available, then the same sequence can be used as a query sequence to find the homologous sequences in the database. Some of these homologous sequences will have experimentally determined knowledge. Th is experimentally determined knowledge can therefore be applied on the query sequence to understand its structure and function. In addition, similarity search is used in finding all similar sequences which can be used for MSA to find motifs and construct phylogenetic tree.

The objective of similarity search to find similar sequences and reject dissimilar sequences. The two main requirements to find similar but reject dissimilar sequences are called sensitivity and selectivity, respectively. Sensitivity refers to the ability of a similarity search program both to find all the similar sequences (true positives) without missing any truly similar sequence (false negative). Selectivity, sometimes called specificity, refers to ability of a similarity search program to exclude all dissimilar sequences.

Back to Concept Map

2.1. Scoring similarity between sequences

Different scoring systems/ schemes have been developed for alignment of a residue with a residue. There are different schemes for scoring alignment of residues in proteins and nucleotides. In addition, we also have different schemes for scoring alignment involving gaps. Further, we also have different schemes for scoring alignments of sequences involving different evolutionary divergences.

The similarity scores for pairs of sequence residues, either amino acids or nucleotides, used to assess sequence similarity, are calculated from previous biological knowledge. The PAM and BLOSUM families of

(5)

5

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

scoring matrices, used to assess sequence similarity, are based on previous biologically knowledge. PAM scoring matrices have been developed to reflect degrees of evolutionary divergence. On the other hand, BLOSUM scoring matrices have been developed to reflect degrees of residue conservation. There are several PAM matrices and each of the PAM matrix has a different degree of evolutionary divergence.

Similarly, there are several BLOSUM matrices and each of the BLOSUM matrix has different a degree of residue conservation. Therefore, a specific matrix is chosen for searching proteins having similarity at a particular divergence distance or at a particular residue conservation. Consequently, no single matrix effectively covers the entire range of divergence distances or residue conservations. Therefore, a specific matrix needs to be selected, to search similar sequences reflecting a particular level of evolutionary divergence or a particular level of residue conservation. Closely related sequences, however, are relatively easy to find even with non-optimal matrices, so the tendency has been to use matrices tailored for fairly distant similarities. For many years, the most widely used matrix was PAM-250, because it was the only one originally published by Dayhoff.

All the similarity search programs offer several different matrices reflecting varying degrees of divergence.

Therefore, one would need to search several times, under different divergence distances, to retrieve all similar sequences. FASTA suit of search programs offers following scoring matrices which covers 20 to 90%

sequence divergence:

Back to Concept Map

2.2. Search Algorithm

The similarity between query sequence and each of the target sequence is evaluated on the basis of the pairwise alignment score between query and target sequence. We have seen DP calculates a score for alignment between two sequences. Therefore all the resulting alignments can be compared using pairwise sequence alignment score. The best alignment will have the maximum score. The next best alignment will have the next best score and so on. However, global DP alignment finds overall similarity score between

(6)

6

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

two sequences. Therefore, if the query sequence has a small region which is similar to a small region in the target sequence, then the overall global alignment score will be low and hide this local similarity. For example, homeobox genes, which regulate embryonic development, are present in a large variety of species. Although homeobox genes are very different in different species, one region called homeodomain, is highly conserved. The question is to find this conserved area and ignore the areas that show little similarity. Therefore, global DP alignment developed by Needleman and Wunsch was modified, slightly, by Smith and Waterman, so as to find sequences having local similarities. This is implemented in SSEARCH, to find similar sequences having local similarity.

However, DP alignment is very time consuming for the computer due to large number of computational steps involved. The sequence similarity search tool requires that each of target sequences in the database is aligned with the query in hand. The number of sequence alignments to be done in the similarity search equals the number of target sequences in the target database. With the increase in the number of sequences in the databases, it became almost impractical to use slow DP method for similarity search.

Consequently, in addition to sensitivity and selectivity, third requirement, i.e. speed, gained significance for developing similarity search programs. SSEARCH is a program using DP for alignment during similarity search and is exhaustive in evaluating the sequence alignment with each target in the database and therefore slow. To increase the speed of DP based similarity search programs, two approaches had been applied. The first was to use DP on parallel processing platform as implemented in ParALign and ScanPS and this requires additional hardware.

The second approach was to develop heuristic programs to reduce the number of target sequences to be aligned using DP. In this approach, the database is first filtered using the query sequence, in a very fast manner, to exclude dissimilar sequences. This reduces the number of target sequences to be evaluated using DP alignment to a very small number. However, in this fast filtering process, some true homolog may not be selected and thereby reducing sensitivity of the search tool. The two heuristic programs, FASTA and BLAST are very fast programs to search databases for similar sequences, but occasionally missing some true homologue. Both exhaustive search and heuristic search based programs are discussed below.

Back to Concept Map

2.2.1. Exhaustive Search

SSEARCH uses local alignment and is implemented at University of Virginia. In addition, two variants, on the basis of global alignment (GGSEARCH) and global-local alignment (GLSEARCH), are also implemented.

These can be reached at http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=sw.

These programs has a common web interface, which has four sections.

Section A allows selecting a program for search, based on local, global and global/local DP alignments.

(7)

7

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

In Section B, user has to input query sequence information; either paste sequence in FASTA format or enter accession/ GI number of the sequence. Alternatively, already saved FASTA sequence file can be uploaded.

In addition, recent suite of the FASTA programs (SSEARCH, GGSEARCH, FASTA, FASTX, etc) has the ability to annotate the final alignment using external information. Annotation information file format can be reached at: http://fasta.bioch.virginia.edu/fasta_www2/annot_file_fmt.html. In this case a file containing information for protein is entered in four columns and each column separated by pressing ‘Tab’ key. A file for current query sequence was created in NotePad (shown next), saved as ‘Hpr Annotation.FA’, and uploaded alongwith sequence, as pasted above.

This information will be used by the program to label residues according to information provided for each position. First line contains the comment information. Next lines contain information for residues at specific positions. In second line, position 1 to 88 is named as PTS Domain. This is followed by a

(8)

8

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

biochemically important site (symbol ^ or =) at position 15 and a phosphorylation site (symbol *) at position 46. This information will be displayed in resulting alignments (see be low).

Section C, allows choosing a database to search.

In addition, for standard searches against the PIR1 and SwissProt databases, the program can run a script to query Uniprot, InterPro, or Pfam to learn the location of active sites and domain boundaries.

(9)

9

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

The statistics for database searches assume that unrelated sequences will look essentially random with respect to each other. However, certain patterns in sequences violate this rule. The most common exceptions are long runs of a small number of different residues (such as poly Alanine tract). Such regions of sequences could spuriously obtain extremely high match scores. For this reason, the search program may automatically remove such sections in proteins (replacing them with X) using the SEG p rogram, if default filtering is selected. Section C also enable to choose masking the query and database sequences for low complexity regions using ‘SEG’.

One can start search at this stage, using default input parameters for scoring matrix, open & extend gap penalty and word size (k-tup) for search and default statistical estimates to be reported in the results page. However, section C has other search options for user to change default values for these parameters.

On clicking ‘Search Database’ button, the results appear in ensuing window. The results page has four sections. The first section has general information about search

(10)

10

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

The next section has a list of best hits arranged with the best on the top.

The s-w score, listed in second column is a number, gives the overall similarity between two sequences.

The score for matching two residues is assigned using a substitution matrix. For example, the alignment of tetrapeptide amino acid sequence ‘MRSF’ with ‘MKCI’ is shown next with score assigned using BLOSUM62 matrix

M R S F | : . M K C I 5 2 -1 0 = 6

Alignment gives a total s-w raw score of 6. The bits score in column 3 is the normalized raw s-w score so that the two alignments between same sequences with different databases can be compared. Next is

(11)

11

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

E(number of library sequences) value. The number written in bracket is the number of sequences in the database used in the present search. The E(value) is expected number of sequences in the database that would achieve this raw s-w score by chance. This is calculated by multiplying P-value (probability indicating that alignment occurred by chance) with the product of number of residues in the query sequence and database.

E(value) = P * m * n, where P is P-value, m is the number of residues in the database sequences and the n is the number of residues in query sequence. P-value, therefore is the ratio of E(value) and the product of number of residues in the query sequence and database. The last section has general informat ion about the number of residues in the database.

In this case, therefore m * n = 172681839 * 88 = 15,196,001,832 = 1.6 * 10-10.

The statistical significance of an alignment score is frequently assessed by calculating P-value to determine the statistically significance for probability of the alignment with this score. The P-value of 1 means that there is one chance to find this alignment by chance in this database with 461146 sequences and any alignment with P-value of one is, therefore, not significant. The first match in the list is to the self, i.e.

query sequence itself present in the database searched. The E(461146) value, for second sequence is 6.3 * 10-27. P-value, therefore is nearly equal to 3.375 * 10-17 and shows that there is only one chance of random alignment in 1017 sequences. This P-value is much higher than the number of sequences in the database i.e. 461146. This shows that two sequences aligned in a database of 461146 is highly unlikely by chance and, therefore, are having clear homology. However, the actual interpretation is completed with a look at the alignment. Click on align link, you will reach at actual alignment in the next section:

(12)

12

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

This section displays the features of query as uploaded in annotation file and the features of the target sequence in the database. The actual alignment shows that there not much gaps but significant identities with functional residues at position 15 and 46 matched. Therefore, three dimensional structure of target can be used as basic structure of the query. In addition, the other homologous members in the reported list may be used to complete the evolutionary family with conserved domain PTS_HPr_prot-like in the InterPro database.

For detailed interpretation of scores, please visit http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul- 1.html and http://homepages.ulb.ac.be/~dgonze/TEACHING/stat_scores.pdf.

Back to Concept Map

2.2.2. Heuristic Search

Back in 1983, it was surprising to find similarities between a cancer-causing gene and a gene involved in normal growth and development. Today, it would be even more surprising not to find any similarity between a newly sequenced gene and the huge nucleotide sequence databases. However, nucleotide sequence database search is not easy now, as it was then. There are 193, 921,042,946 bases in GenBank as of June 15, 2015 in NCBI-GenBank Flat File Release 208.0. The present day statistics may be obtained at

(13)

13

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt. When we are trying to find a closest match to a gene of length 300 bases in such a database with 193921042946 bases, the DP algorithms takes too much time. One approach to make search tool fast is to use a fast parallel hardware implementation of DP alignment algorithms; another is to use fast heuristics that usually work well. To reduce the computational requirements, many heuristics for fast database search has been developed. Heuristics are like tour guides, who after taking information from the tourist regarding places in which tourist is interested, visit places such as shopping malls or religious places. This filters the places to be visited. Heuristics also use the filtering idea because heuristic, after taking information from the sequence, filters the sequences in the database, which need to be used in DP alignment. Filtering is based on the observation that a good alignment usually includes short/ small identical or short/ small similar fragments. Therefore, small words from long sequence are created and used to see if these small words are present in target sequence. If the small words are not present, then computationally intensive DP is not undertaken for pairwise alignment with target. This saves time to evaluate unnecessary DP alignment. The first most widely used algorithm was the one implemented in FASTA (pronounced as FAST aye) and currently most widely used is BLAST.

Both algorithms differ in details but both filter the database sequences for subsequent DP alignment.

1. Initial Filtering: It starts with Word search (FASTA uses identity words and BLAST uses expanded list) where both query and target sequences are broken into overlapping words of specific length (parameter K-tup or k-tuple in FASTA and parameter word size in BLAST). The lists of words from the query and target sequences are compared, and the diagonal with most matching words (i.e. DotPlot diagonals) is taken as the region most likely to contain the best alignment between two sequences.

2. Initial Alignment: The diagonals from the initial filtering are used to determine if two diagonals can be joined to obtain a region of sufficient similarity to merit further examination, and these diagonals are joined to create an initial alignment. Whereas, BLAST expands good extended-word matches along the diagonal in both the directions.

3. Restricted Smith-Waterman (DP) alignment: If joined diagonals contain sufficient sequence similarity, then a restricted Smith-Waterman alignment is performed, on the region with the highest score, bounded by the window size.

FASTA reports only one alignment per query sequence (because it joins all significant diagonals into one) which have sufficient similarity, whereas, BLAST may report more than one subsequence alignments separately.

Back to Concept Map

FASTA is a suite of programs for searching nucleotide and protein databases with a query sequence.

FASTA implements heuristic involving following steps, which results in substantial computational savings.

(14)

14

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

(1) (A) Splitting each target sequence in the database into overlapping words of specified size (k-tup, usually 2 or 3 for proteins and for 11 DNA). This list of overlapping words in sequence is arranged alphabetically in a list and stored in the computer file. The position of each overlapping word in the list is also stored alongwith the word. Let us say we have a protein sequence

‘TAPSTMVPMSFYLION’, in the database. The list of overlapping words of k-tup equal to ‘3’ will be TAP

APS PST STM TMV

MVP VPM PMS MSF SFY FYL YLI LIO ION

These words are then arranged in alphabetical order and stored in a computer file as shown below.

APS FYL ION LIO MSF MVP PMS PST SFY STM TAP TMV VPM YLI

The position of each word in the sequence is stored alongwith the word.

APS 2 ION 14 FYL 11 LIO 13

(15)

15

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

MSF 9 MVP 6 PMS 8 PST 3 SFY 10 STM 4 TAP 1 TMV 5 VPM 7 YLI 12

In this way each sequence entry in the database is converted into a dictionary of words, arranged in alphabetical order and stored in the computer. Searching this list in a dictionary way of search, each word position in target sequence can be quickly located. If no word is located in target sequence then this sequence is not considered for DP alignment.

(B) When a user submits a FASTA search using a query sequence, say, ‘TAPSTAVPMSLION’, then the program Splits query sequence into overlapping words of specified k-tup size (3 in the present example). This list of overlapping words in sequence is compared with each database entry as stored at serial number 1(A) above. The first word ‘TAP’ is looked into dictionary as created at serial number 1 above. Following dictionary search, the word could be directly detected and its position in the target is noted. The difference in position of the matching words in two sequences is calculated and is called offset. All the matching words with same offset are said to belong to same diagonal in visual dotPlot.

For example, if a word is present at position 1 in query and is also present at position 1 in target sequence, then the offset is zero. This way each overlapping word in query is compared and we get the following table.

Query Word Position Target/Subject position Offset/Diagonal

TAP 1 1 0

APS 2 2 0

PST 3 3 0

STA 4 TAV 5 AVP 6

VPM 7 7 0

PMS 8 8 0

MSL 9 SLI 10

LIO 11 13 2

ION 12 ` 14 2

The words with same offset shown above can be visualized as diagonals in DotPlot as shown next

(16)

16

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

However, for actual sequences, all the diagonals representing ungapped local identities may be shown as below

The contiguous matching words with same offset are residues matched between two sequences. If there are mismatches between diagonals of same offset, then these diagonals are joined by allowing mismatching residues. The threshold number of mismatched residues allows the closely placed diagonals by allowing mismatching of residues to yield longer contiguous ungapped similarities. One similarity is shown below with a dotted diagonal joining nearby ungapped identities with same offset:

(17)

17

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

(2) This is followed by rescoring all ungapped similarities using a particular scoring matrix such as BLOSUM62. Top ten ungapped similarities are retained and the score of the best diagonal is reported as INIT1 score

(3) The nearby ungapped similarities with different offset are joined by introducing gaps , till the alignment score is above certain threshold. The best gapped alignment is retained and its score is reported as INITN Score.

(18)

18

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

(4) This is followed by Smith-Waterman alignment. The optimal DP alignment and its score is reported as opt score.

(5) Finally, E-Value and bit scores are calculated for evaluating statistical significance of each alignment.

There are FASTA variants (FASTX/Y, TfastX/Y) available for similarity search to achieve a specific target.

FASTA suite is available online from three servers.

1. http://www.ebi.ac.uk/Tools/sss/fasta/ at EMBL outstation EBI

(19)

19

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

2. http://www.genome.jp/tools/fasta/ at GenomeNet, Japan

3. http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml at University of Virginia, USA, and is shown next:

Select Protein-protein FASTA and in the ensuing page all variants of FASTA are listed and can be selected.

Rest of the interface is same as for SSEARCH.

Paste the sequence

(20)

20

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

and set all parameters as explained above for SSEARCH and click search database button. The results will appear in the ensuing page and interpreted in the same way as for SSEARCH. The first section has general information of parameters, second has list of hits, third has actual alignment and finally, statistics of the search. The first section is shown next:

The second section contains a list of proteins having similar regions to query protein, as shown below:

(21)

21

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

It has 8 columns. The first column list the name of protein with number of amino acids (written within parenthesis) in this target protein. Then, column named opt score reports raw score and the column labelled E(num), shows the E-value, which represents the number of sequences in the searched database, which would achieve the score for this search by chance. The E-values below 0.01 would be by chance very rarely and are therefore likely to indicate homology. In these cases, the experimental knowledge, attached to target sequence annotation, may be applied to query sequence. However, as Biochemists, we need not to have unnecessary confidence or casually ignore the annotations attached. We need to evaluate alignment in a holistic way, by looking at the actual alignment. If the statistics (opt score and E- value) of alignment are good and alignment overlaps the whole length in both sequences, then there is a good chance that both will share the same or related function. Therefore, the alignments with alignment length (listed in second last column) covering nearly full length in two sequences shall be investigated by having a look at the actual alignment. The length of query sequence in the present case is 88. In the present case, the first protein with best score has 276 amino acids with only 87 amino acids aligned. Next protein with best score is 85 amino acids with 81 amino acids aligned. Therefore, this is the first best alignment with overlap alignment of 81 amino acids between protein of 88 and 85 amino acids.

Therefore, to a look at the actual alignment, click this protein in the last column, hyperlink named align.

You will reach the third section of the alignment as shown below:

(22)

22

Biochemistry

Biostatistics and Bioinformatics Sequence Similarity Search

This annotation report a Site at H15 as active site, indicating prospective phosphohistidine intermediate.

The annotation report the region 1-81 in query and region 1-81 in target with score statistics. The residue at position 15 is ‘H’ in both proteins and is aligned.

Similarly have a look at the other best scoring alignments covering nearly whole length.

3. Summary

In this lecture we learned:

 the meaning of similarity search

 comparing different scoring systems based on evolutionary divergence (PAM family) and conserved sequence (BLOSUM family) to evaluate extent of similarity between two sequences.

 to decide whether a protein or a DNA sequence to be used for similarity search. In addition, learned masking of unwanted subsequence(s) to be discarded during similarity search.

 decide the method/algorithm to be used for similarity search

 to decide, which program to be used for the selected method.

 to choose which databases to search

 to decide the number of search results to be returned by similarity search program.

 to interpret the search results to draw an inference of homology

 a practical example of similarity search

References

Related documents

motivations, but must balance the multiple conflicting policies and regulations for both fossil fuels and renewables 87 ... In order to assess progress on just transition, we put

The pairwise genetic distance analysis, sequence identity and phylogenetic tree have shown that the specimen has similarity only with 16S rDNA sequence of

SpeCific primer are generally / designed to target a DNA sequence flanking the regions to be ampl ified..

The full length predicted protein sequence showed 99 and 98% amino acid sequence similarity to MADS-box and flower homeotic proteins from T.. spelta and

The scan line algorithm which is based on the platform of calculating the coordinate of the line in the image and then finding the non background pixels in those lines and

The genetic variation of the deduced amino acid sequence of the partial protease gene from the clone BTM106 was determined by its multiple sequence

(D) Multiple sequence alignment with selected sequence neighbors, highlighting conserved catalytic site residues (in triangles) (E) Predicted ligand binding pockets in red surface,

This is to certify that the thesis entitled "A Computational Pathway for Bracketing Native-Like Tertiary Structures from Sequence and Secondary Structural Information of