
Sequence Alignment – Creation Process


Academic year: 2022


Biochemistry

Biostatistics and Bioinformatics

Sequence Alignment – Creation Process


Description of Module

Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 07 Sequence Alignment – Creation Process

Dr. Vijaya Khader Dr. MC Varadaraj


1. Objectives

This module aims to help students:

1. To learn various empirical schemes for scoring pairwise sequence alignments
2. To learn the dynamic programming method for creating an optimal sequence alignment

2. Concept Map

Sequence alignment process
  - Scoring sequence alignments
      - Scoring aligned residues using: PAM matrices, BLOSUM matrices
      - Scoring aligned gaps using: linear gap penalty, affine gap penalty
  - Dynamic programming for creating alignment

3. Sequence alignment process

Sequence (DNA, RNA or protein) alignment involves placing evolutionarily related residues of two sequences in corresponding columns, with the sequences written as horizontal rows. The dynamic programming method is guaranteed to find the optimal (highest-scoring) alignment of evolutionarily related sequences when it is combined with empirically derived scoring matrices and an evolution-based affine gap penalty scheme. Therefore, we will first discuss evolution-based scoring schemes and then the powerful dynamic programming method of sequence alignment.

3.1. Scoring sequence alignments


To compare different alignments, we need to assign a score to each alignment. Each column, in which a residue is aligned with another residue or with a gap, is assigned a score, and the total score of the alignment is the sum of the scores of the individual columns. There are different schemes for scoring aligned residues in proteins and in nucleotides. For aligning a residue with a residue, scoring schemes based on evolutionary distance (PAM matrices) and on conservation of residues (BLOSUM matrices) have been developed. Within each family there are different matrices for sequences at different evolutionary divergences (PAM120, PAM200, PAM250, etc.) or with different degrees of conservation (BLOSUM40, BLOSUM62, BLOSUM80, etc.). Further, there are different schemes for scoring alignments involving gaps.


3.1.1. Scoring aligned residues:

The schemes for scoring aligned residues are derived from previous biological knowledge about the substitution or conservation of residues in sequences that are known to be related. A residue is said to be conserved during evolution if the same residue is present at the corresponding position in all sequences derived from a common ancestor.

To develop an empirical scoring matrix, sequences belonging to a family of proteins known to be related by descent from a common ancestor are carefully selected and aligned, so that the residues are placed in columns that reflect their similarity. The residues in each column are then counted and examined for conservation and substitution. The residue present in the majority of sequences in a particular column is said to be conserved.

Let us say that we have 100 carefully aligned related sequences and that in the majority of them, say 60 sequences, tryptophan is conserved in a particular column. In five sequences tryptophan is replaced with phenylalanine, and in another five it is replaced with tyrosine. In the remaining 30 sequences, the other 17 amino acids replace tryptophan, but with different frequencies. Based on these observed conservation and substitution rates, a different score is assigned to the conservation and to each substitution. The other 19 amino acids are likewise assigned scores for their observed conservation and substitutions. This produces a 20 * 20 matrix with a score for each possible combination of amino acids.

Such substitution matrices are constructed from a large and diverse sample of verified alignments of related amino acid sequences, in which the numbers of conservations and substitutions are counted. If the sample is large enough to be statistically significant, the resulting matrix should reflect the true probabilities of conservation and substitution over a period of evolution. These substitution matrices provide a quantifiable measure of the tendency of a residue either to be conserved or to be replaced by another residue. With 20 amino acids, the chance of finding a particular residue at random is 1 in 20, i.e. 5%. The computed conservation/substitution score is based on the ratio of the observed substitution rate to this random rate, expressed on a logarithmic scale so that the score can be positive, zero or negative. A high positive score shows that the position is evolutionarily conserved. A low positive score shows that the position tolerates mutations between highly similar residues (conservative mutations). A score of zero shows that the replacement occurs about as often as expected by chance, as for weakly similar residues (semi-conservative mutations). A negative score shows that the substitution is penalized, i.e. the two residues are discouraged from replacing each other during evolution.

There are two widely used, empirically derived scoring systems: one developed by Margaret Dayhoff and the other developed by Henikoff and Henikoff.


3.1.1.1. PAM family of substitution matrices: Margaret Dayhoff, following a model of evolution, developed several scoring matrices known as the PAM family of substitution matrices. She aligned very closely related (85% or more identities), full-length protein sequences and constructed a phylogenetic tree from these aligned sequences. From this tree, closely placed sequences differing by 1% mutation (point mutations), corresponding to one PAM of evolution, were selected, and the substitution rate for each possible mutation was tabulated from the substitutions observed in these alignments. This substitution table for closely related sequences was labeled PAM1. PAM stands for point accepted mutation, and the number 1 denotes approximately 1% mutation, i.e. 99% conservation. After 100 PAMs of evolution, not every residue will have changed: some may have mutated several times, perhaps returning to their original state, and others not at all. Thus it is possible to recognize as homologous proteins that are separated by much more than 100 PAMs.

To explain this substitution table development process, an alignment of 20 closely related sequences (85% identities) was created and used for phylogenetic tree construction. From the tree, alignments with one mutated residue in 100 residues were selected. Each sequence was selected because it was placed closely in the phylogenetic tree and had 1% mutation (99% identities), which corresponds to a distance of 1 PAM. In the diagram, only the first 20 residues of each of the 20 sequences are shown. The single mutation (boxed) in each sequence was present in this region only:

Then, the observed substitution rate for each aligned amino acid was tabulated and compared with the random substitution rate of 5% to decide whether a residue at a particular position was conserved, mutated (conservatively or semi-conservatively) or disallowed. For example, in the first column amino acid ‘A’ is present in the majority of sequences, at 16 positions. Therefore, the frequency of ‘A’ is 16/20, i.e. 80%. This is much more than the random rate (5%) and therefore represents conservation of ‘A’ at this position. In this column ‘A’ is replaced by ‘V’ at 4 positions, so the substitution rate of ‘A’ with ‘V’ is 4/20, i.e. 20%. This is higher than the random rate (5%) and therefore represents a conservative substitution.

In the next column amino acid ‘C’ is present in the majority of sequences, at 19 positions. Therefore, the frequency of ‘C’ is 19/20, i.e. 95%. This is much more than the random rate (5%) and therefore represents conservation of ‘C’ at this position. In this column ‘C’ is replaced by ‘A’ at 1 position, so the substitution rate of ‘C’ with ‘A’ is 1/20, i.e. 5%. This is equal to the random rate (5%) and therefore represents a semi-conservative substitution.

In the next column amino acid ‘D’ is present in the majority of sequences, at 18 positions. Therefore, the frequency of ‘D’ is 18/20, i.e. 90%. This is much more than the random rate (5%) and therefore represents conservation of ‘D’ at this position. In this column ‘D’ is replaced by ‘E’ at 2 positions, so the substitution rate of ‘D’ with ‘E’ is 2/20, i.e. 10%. This is slightly more than the random rate (5%) and therefore represents a conservative substitution.

From these three columns we find that the conservation rate of ‘A’ is 80%, of ‘C’ is 95% and of ‘D’ is 90%. In this way, a 20 * 20 substitution matrix, i.e. the PAM1 matrix, is constructed for a distance of 1 PAM.
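The column-by-column counting above can be sketched in a few lines of Python. This is only an illustration of the reasoning in the text, not Dayhoff's actual procedure: the three columns are rebuilt from the counts quoted above, and the names `column_rates`, `odds` and `RANDOM_RATE` are hypothetical.

```python
from collections import Counter

RANDOM_RATE = 1 / 20  # chance level used in the text: 1 in 20, i.e. 5%


def column_rates(column: str) -> dict[str, float]:
    """Return the fraction of sequences carrying each residue in one alignment column."""
    counts = Counter(column)
    return {residue: n / len(column) for residue, n in counts.items()}


def odds(rate: float, random_rate: float = RANDOM_RATE) -> float:
    """Ratio of the observed rate to the chance rate (above 1 means more often than chance)."""
    return rate / random_rate


# Hypothetical columns rebuilt from the counts quoted in the text (20 sequences each).
columns = {
    "column 1": "A" * 16 + "V" * 4,
    "column 2": "C" * 19 + "A" * 1,
    "column 3": "D" * 18 + "E" * 2,
}

for name, column in columns.items():
    for residue, rate in column_rates(column).items():
        print(f"{name}: {residue} in {rate:.0%} of sequences, "
              f"{odds(rate):.0f} times the random rate")
```

A ratio well above 1 corresponds to conservation, a ratio near 1 to a semi-conservative substitution, and a ratio below 1 to a disfavoured substitution.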

The other PAM matrices, representing less conservation, i.e. more divergence, were extrapolated from PAM1. Because the PAM family of matrices follows a model of evolution, these matrices are more appropriate for constructing phylogenetic trees, i.e. for deciphering evolutionary history. PAM250, the most diverged matrix, is shown next. The PAM250 matrix shows that amino acids are not all conserved equally during evolution.

Some amino acids are conserved more than others in protein sequences. For example, tryptophan is the most conserved amino acid. Therefore, the highest score is assigned to the alignment of a tryptophan in one sequence with a tryptophan in the other sequence. The next most conserved residue is cysteine. The identity alignment scores are therefore not the same for each amino acid. Secondly, when an amino acid mutates, the chance of mutation is not equal for all possible substitutions. For example, when tryptophan mutates, it is more likely to mutate to another aromatic amino acid, such as tyrosine or phenylalanine, than to cysteine. Therefore, a substitution matrix contains a different score for each possible substitution.

The following table relates the PAM number, as an approximate measure of evolutionary distance, to the percentage of sequence identity and the percentage of observed mutations.

PAM Number    % Sequence Identity    % Observed Mutations
1             99                     1
30            75                     25
40            69                     31
80            50                     50
110           40                     60
120           38                     62
200           25                     75
250           20                     80
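The extrapolation from PAM1 to higher PAM numbers can be pictured as repeated multiplication of the PAM1 mutation probability matrix by itself. The sketch below uses a deliberately simplified, hypothetical PAM1 in which every residue stays unchanged with probability 0.99 and mutates to each of the other 19 residues with equal probability. Real PAM matrices are not uniform and are finally converted to scores, so the printed percentages will not reproduce the table above; the point is only that identity decays slowly and that many positions remain unchanged even at large PAM distances.

```python
ALPHABET_SIZE = 20  # number of amino acids


def toy_pam1(change: float = 0.01) -> list[list[float]]:
    """Toy PAM1-like probability matrix: 99% unchanged, 1% spread evenly over the 19 others."""
    off_diagonal = change / (ALPHABET_SIZE - 1)
    return [[1 - change if i == j else off_diagonal for j in range(ALPHABET_SIZE)]
            for i in range(ALPHABET_SIZE)]


def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain matrix product of two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam(n: int) -> list[list[float]]:
    """Return the n-th power of the toy PAM1 matrix (n >= 1)."""
    pam1 = toy_pam1()
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result


for n in (1, 30, 80, 120, 250):
    unchanged = pam(n)[0][0]  # probability that a residue is the same after n PAMs
    print(f"PAM{n}: about {unchanged:.0%} of positions carry the original residue")
```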


Therefore, the PAM250 matrix can be used to align proteins with about 20% identities, i.e. highly diverged proteins, and the PAM30 matrix can be used to align proteins with about 75% identities, i.e. closely related proteins. This shows that we have different PAM scoring matrices for aligning protein sequences that have diverged to different extents.

Because it follows a model of evolution, the PAM family of scoring matrices is more appropriate for the construction of evolutionary trees.


3.1.1.2. BLOSUM Family of Substitution Matrices: Henikoff and Henikoff used conserved residues in distantly related proteins to develop the scoring systems known as the BLOSUM (BLOcks SUbstitution Matrix) family. They created multiple sequence alignments of distantly related proteins and identified conserved regions without gaps, which form the BLOCKS database. This database serves as the source of data for all BLOSUM matrices. They examined multiple alignments of conserved regions directly, rather than extrapolating from closely related sequences. To create a specific BLOSUM matrix, all sequences in each block that share at least n percent identity are clustered (grouped) together. For example, where n = 62% the resulting matrix is called BLOSUM62. Similarly, where n = 80% the resulting matrix is called BLOSUM80.

A comparison of BLOSUM62 and BLOSUM80 shows that we have different scoring matrices for aligning protein sequences with different degrees of conservation. It also shows that amino acids are not all conserved equally: some are conserved more than others. For example, tryptophan is the most conserved amino acid, followed by cysteine.

All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. For example, BLOSUM62 is a substitution matrix developed from aligned sequences whose conserved blocks were clustered at 62% identity. After alignment without gaps, the substitution rates of each amino acid to each of the other amino acids are tabulated to develop BLOSUM62. BLOSUM matrices are more appropriate for finding conserved regions in proteins.


3.1.2. Scoring aligned gaps

Gaps are inserted in a sequence alignment wherever there are insertions or deletions. An insertion can be thought of as the incorporation of some additional residues in one sequence that are not present in the second sequence. To align the inserted residues of the first sequence, we therefore need to insert gaps in the second sequence. Conversely, we can assume that these residues were originally present in both sequences and were later deleted from the second sequence, while they are still present in the first. In this case, too, we need to introduce gaps in the second sequence to align the residues retained in the first sequence. There are two models for penalizing gaps: the linear gap penalty model and the affine gap penalty model.

3.1.2.1. Linear gap penalty

With a linear gap penalty, each gap position is penalized equally. This treats the insertion or deletion of each residue as an independent event. For example, if we fix the gap penalty at -3 and there is a deletion of 5 residues, then the gap penalty will be -15. If only one residue is deleted, the gap penalty will be just -3.

3.1.2.2. Affine gap penalty

With the affine gap penalty model, we impose separate penalties for opening a gap and for extending an already opened gap: a higher penalty for opening the gap and a milder penalty for each inserted or deleted residue that extends it. This is based on the hypothesis that insertions or deletions in related sequences usually involve more than one residue. The model does not treat the insertion or deletion of individual residues as independent events, but presumes that blocks of residues are inserted or deleted together.

Therefore, if there is a deletion, a gap is opened with a large penalty, such as -10. This is called the gap opening penalty. In addition, a milder penalty, such as -1, is added for each residue deleted. This is called the gap extension penalty. If the deletion is of one residue, the total affine gap penalty will be -11 (-10 for opening plus -1 for one residue). If 5 residues are deleted, the total affine gap penalty will be -15 (-10 for opening plus 5 times -1).

When aligning closely related sequences (a high number of identities) with a scoring matrix such as BLOSUM80, we can assign a high gap opening penalty and a high gap extension penalty. When aligning distantly related sequences (a low number of identities) with a scoring matrix such as BLOSUM45, we can assign mild gap opening and gap extension penalties. When aligning intermediately related sequences with a scoring matrix such as BLOSUM62, we can assign intermediate gap opening and gap extension penalties.
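The two gap models can be compared with a short Python sketch. The function names are hypothetical; the default penalties (-3 per position for the linear model, -10 for opening and -1 per residue for the affine model) are the example values used in the text.

```python
def linear_gap_penalty(length: int, per_position: int = -3) -> int:
    """Linear model: every gap position costs the same."""
    return per_position * length


def affine_gap_penalty(length: int, opening: int = -10, extension: int = -1) -> int:
    """Affine model as used in the text: one opening penalty plus an
    extension penalty for each residue in the gap."""
    if length == 0:
        return 0  # no gap, no penalty
    return opening + extension * length


print(linear_gap_penalty(5))   # -15
print(linear_gap_penalty(1))   # -3
print(affine_gap_penalty(1))   # -11
print(affine_gap_penalty(5))   # -15
```

With these example values a single deleted residue is much more expensive under the affine model (-11 against -3), while a block of five costs the same under both, which reflects the assumption that residues tend to be inserted or deleted in blocks.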


3.2. Dynamic Programming method for creating alignment

If we consider the optimal, or highest-scoring, alignment shown in A, we can break the alignment into two parts as shown in B. The overall alignment score is the score for the left-hand alignment of four bases in B plus the score for aligning the two bases on the right in B. If we assume that the five-base alignment in A is optimal, we must conclude that the four-base alignment in B is also optimal. If it were not (for example, if we gave a higher score for aligning A with T than for aligning A with C), then the alignment shown in C would have a higher score than the one shown in A, and C rather than A would be the optimal alignment.

In plain English, the best alignment that ends at a given pair of bases (or residues) is the best alignment of the sequences up to that point, plus the score for aligning the two additional bases or residues.

Further, removing another pair of bases from B gives us the situation shown in D.

The next step would require inserting a gap in sequence 2. This brief consideration shows that, at any step, there are only three possibilities: aligning the next residue from sequence 1 with the next residue from sequence 2; aligning the next residue from sequence 1 with a gap in sequence 2; and aligning the next residue from sequence 2 with a gap in sequence 1. Whenever each step offers a small number of choices and the result of the current step depends on the results obtained in previous steps, dynamic programming is the method of choice. Further, with a given empirical scoring matrix and gap penalty scheme, the dynamic programming method is guaranteed to find the optimal alignment.

    ACTGC      ACTG   C      ACTGC
    |. ||      |. | + |      | .||
    AA-GC      AA-G   C      A-AGC
      A           B            C

    ACT   G   C
    |.  + | + |
    AA-   G   C
          D

The first use of dynamic programming for the global alignment of two sequences was made by Needleman and Wunsch. Global alignment is meaningful for comparisons between members of the same protein family that have conserved sequences of nearly the same length. In many biological applications, however, the score of an alignment between small conserved regions contained in long sequences may be larger than the score of an alignment between the entire sequences. For those cases, Smith and Waterman modified the dynamic programming method to align locally similar regions, i.e. to produce a local alignment.
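The three possibilities described above are often written as a recurrence. The notation below is an assumption made for illustration and does not appear in the original module: F(i, j) is the best score for aligning the first i residues of sequence x with the first j residues of sequence y, s(a, b) is the substitution score from the chosen matrix, and d is the (negative) linear gap score.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
F(i-1,\,j) + d & \text{align } x_i \text{ with a gap} \\
F(i,\,j-1) + d & \text{align } y_j \text{ with a gap}
\end{cases}
\end{aligned}
```

The score of the optimal global alignment is then the value F(m, n) for sequences of lengths m and n.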

Let us use the dynamic programming method, manually, to align the following amino acid sequences:

SDQENCELKR
SEEERCEVKR

The two sequences are 10 residues long each. An end-to-end alignment between these sequences, with the 6 identities shown in capital letters, is presented below:

SdqE--n-CEl-KR
S--Eee-rCE-vKR

This is 60% identity (6 of 10 residues); therefore, we will use the BLOSUM62 matrix for the dynamic programming alignment of these sequences. For scoring gap positions, let us use a linear gap penalty of -1.
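The 60% figure can be checked by counting identical columns in the end-to-end alignment. The helper name `count_identities` is hypothetical, and the two rows are the alignment quoted above with gaps written as hyphens.

```python
def count_identities(row1: str, row2: str) -> int:
    """Count columns in which both rows carry the same residue (case-insensitive)."""
    return sum(1 for a, b in zip(row1, row2)
               if a != "-" and a.upper() == b.upper())


top = "SdqE--n-CEl-KR"
bottom = "S--Eee-rCE-vKR"

identities = count_identities(top, bottom)
sequence_length = len(top.replace("-", ""))  # 10 residues

print(identities, f"{identities / sequence_length:.0%}")  # 6 60%
```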

There are three phases in alignment using the dynamic programming method:

1) To create a score matrix covering every possible pairing of residues (a 10 x 10 table).

2) To trace back the path of the alignment from the lower right cell of the matrix, moving at each step to the best-scoring preceding cell.

3) To generate the sequence alignment from this trace back path.

The sequences are 10 residues long each. The first phase starts with constructing a grid (table/matrix) with 2 more columns and 2 more rows than the sequence lengths, i.e. 12 columns for the first sequence and 12 rows for the second sequence.


The first sequence is written horizontally in the top row of the grid, from the third cell to the last cell of the row. The second sequence is written vertically in the leftmost column of the grid, from the third row to the last row. An arbitrary character such as ‘’ is placed before both sequences to signify that every sequence begins with a gap. The cell in the first row and first column is left blank. A zero is placed in the cell in the second row and second column to signify the alignment of a gap with a gap.

Next, we need to assign scores to all cells in the second row for aligning the initial gap in the vertical sequence with each residue in the horizontal sequence. We begin with the third column of the second row. The score in this cell is for matching the horizontal sequence residue ‘S’, a serine, with the vertical sequence residue ‘’, a gap. Since aligning a gap with a residue has a score of -1 in this alignment, we add -1 to the score in the left cell. The score in the left cell is 0, so the sum is -1. Therefore -1 is written in the second row, third column, as shown below:

Next, we need to assign a score to the fourth column of the second row. This score is for matching the horizontal sequence residue ‘D’, an aspartate, with the vertical sequence initial gap ‘’. Since aligning a gap with a residue has a score of -1, we add -1 to the score in the left cell. The score in the left cell is -1, so the sum is -2. Therefore -2 is written in the fourth column of the second row, as shown below:

Now let us assign a score to the second column of the third row. This score is for matching the vertical sequence residue ‘S’, a serine, with the horizontal sequence initial gap ‘’. Since aligning a gap with a residue has a score of -1, we add -1 to the score in the upper cell. The score in the upper cell is 0, so the sum is -1. Therefore -1 is written in the second column of the third row, as shown below:


Next, we can assign a score to the cell in the second column of the fourth row. This score is for matching the vertical sequence residue ‘E’, a glutamate, with the horizontal sequence initial gap ‘’. Since aligning a gap with a residue has a score of -1, we add -1 to the score in the upper cell. The score in the upper cell is -1, so the sum is -2. Therefore -2 is written in the fourth row, second column, as shown below:

Continue filling in the gap scores in the second row and second column in this way; the result is the following matrix.
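The initialization of the gap row and gap column can be sketched as follows. The function name `init_grid` is hypothetical. The code stores only the score cells, so its grid is 11 x 11: row 0 and column 0 correspond to the second row and second column of the 12 x 12 grid in the text, whose first row and first column hold the sequence labels.

```python
GAP = -1  # linear gap penalty used in this example

horizontal = "SDQENCELKR"
vertical = "SEEERCEVKR"


def init_grid(horizontal: str, vertical: str, gap: int = GAP) -> list[list]:
    """Create the score grid and fill the gap row and gap column.

    Cells that still have to be computed are left as None.
    """
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    grid = [[None] * cols for _ in range(rows)]
    grid[0][0] = 0  # gap aligned with gap
    for j in range(1, cols):
        grid[0][j] = grid[0][j - 1] + gap  # left cell plus gap penalty
    for i in range(1, rows):
        grid[i][0] = grid[i - 1][0] + gap  # upper cell plus gap penalty
    return grid


grid = init_grid(horizontal, vertical)
print(grid[0])                   # [0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
print([row[0] for row in grid])  # [0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
```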

Now, we need to fill the grid with match/mismatch scores from the substitution matrix BLOSUM62, following the dynamic programming methodology. In dynamic programming, we consider four cells at a time: the cell to which we have to assign an alignment score and its three adjacent cells, one directly to the left, one diagonally up and to the left, and one directly above, as shown below:


These cells contain one number each: -1, 0 and -1 for the left, diagonally upper left and upper cells, respectively. We follow a linear gap penalty scheme of -1 in the present case, and this gap penalty is added to the number in the left cell and to the number in the upper cell. The left and upper cells both contain -1, so adding -1 gives -2 for each. Now we find the score for aligning ‘S’ in the vertical sequence with ‘S’ in the horizontal sequence using the BLOSUM62 matrix. The score is 4, and we add it to the number in the diagonally upper left cell. This cell contains zero, so the total is 4. We now have three candidate numbers for the current cell: -2, 4 and -2. We select the maximum of these three numbers and place it in the cell under consideration. Since 4 is the maximum, we put a 4 in the empty cell, as shown below:

In the next step, to assign a score to the third row, fourth column, we again consider four cells: the empty cell, the cell directly to its left, the cell diagonally up and to the left, and the cell directly above, as shown below:

These cells contain one number each: 4, -1 and -2 for the left, diagonally upper left and upper cells, respectively. The gap penalty of -1 is added to the number in the left cell and to the number in the upper cell. The left cell contains 4, so adding -1 gives 3. The upper cell contains -2, so adding -1 gives -3. Now we find the score for aligning ‘S’ in the vertical sequence with ‘D’ in the horizontal sequence using the BLOSUM62 matrix. The score is 0, and we add it to the number in the diagonally upper left cell. This cell contains -1, so the total is -1. We have now arrived at three numbers: 3, -1 and -3 from the left, diagonally upper left and upper cells, respectively. We select the maximum of these three and place it in the cell under consideration. Since 3 is the maximum, we put a 3 in the empty cell, as shown below:


In the next step, to assign a score to the third row, fifth column, we again consider four cells: the empty cell, the cell directly to its left, the cell diagonally up and to the left, and the cell directly above, as shown below:

These cells contain one number each: 3, -2 and -3 for the left, diagonally upper left and upper cells, respectively. The gap penalty of -1 is added to the number in the left cell and to the number in the upper cell. The left cell contains 3, so adding -1 gives 2. The upper cell contains -3, so adding -1 gives -4. Now we find the score for aligning ‘S’ in the vertical sequence with ‘Q’ in the horizontal sequence using the BLOSUM62 matrix. The score is 0, and we add it to the number in the diagonally upper left cell. This cell contains -2, so the total is -2. We now have three candidate numbers for the current cell: 2, -2 and -4 from the left, diagonally upper left and upper cells, respectively. We select the maximum of these three and place it in the cell under consideration. Since 2 is the maximum, we put a 2 in the empty cell, as shown below:
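The rule applied to each of these three cells can be written as a small function. The name `cell_score` and the dictionary `BLOSUM62_SUBSET` are hypothetical; the dictionary holds only the three BLOSUM62 scores quoted in the text (S/S = 4, S/D = 0, S/Q = 0), and the neighbouring values are the ones read from the grid above.

```python
GAP = -1  # linear gap penalty used in this example

# Only the BLOSUM62 entries quoted in the text; not the full matrix.
BLOSUM62_SUBSET = {("S", "S"): 4, ("S", "D"): 0, ("S", "Q"): 0}


def cell_score(left: int, diagonal: int, upper: int,
               substitution: int, gap: int = GAP) -> int:
    """Best of: gap from the left, match/mismatch from the diagonal, gap from above."""
    return max(left + gap, diagonal + substitution, upper + gap)


# Third row of the text's grid: vertical residue S against horizontal S, D, Q.
first = cell_score(left=-1, diagonal=0, upper=-1,
                   substitution=BLOSUM62_SUBSET[("S", "S")])
second = cell_score(left=first, diagonal=-1, upper=-2,
                    substitution=BLOSUM62_SUBSET[("S", "D")])
third = cell_score(left=second, diagonal=-2, upper=-3,
                   substitution=BLOSUM62_SUBSET[("S", "Q")])

print(first, second, third)  # 4 3 2
```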

In this manner the complete grid is filled and is shown next:


Next, we follow the second phase of dynamic programming, which is called tracing back. We read the lower rightmost cell, which has a score of 36. In addition, we consider the three cells adjacent to it: one to the left, one diagonally up and to the left, and one above. These cells have the numbers 30, 31 and 30, as shown below:

Now we have to trace back from the lower rightmost cell with the score 36. In this example, tracing back always moves to the adjacent cell that has the maximum number. Strictly speaking, the trace back moves to the adjacent cell from which the current score was derived (its value plus the corresponding match or gap score equals the current score); in the steps described here this is also the adjacent cell with the maximum number. In tracing back, we draw an arrow from the lower right corner of the present cell to the lower right corner of the cell that has the highest number. If all three adjacent cells contain the same number, the arrow is drawn from the lower right corner of the current cell to the lower right corner of the diagonally upper left cell. In the present case, the maximum among the three adjacent cells is 31, in the diagonally upper left cell. Therefore, an arrow is drawn from the lower right corner of the current cell with number 36 to the lower right corner of the diagonally upper left cell with number 31, as shown below:

Next, we again have to trace back to one of the three adjacent cells from the current cell with number 31, highlighted and shown below.

Since the diagonally upper left cell has the highest number, 26, we draw an arrow from the lower right corner of the current cell to the lower right corner of the diagonally upper left cell, as shown below:


In this way, we continue to trace back. If two cells have equal numbers, we trace back to both cells. These represent alternative alignment options and result in two optimal alignments. In the present case, however, we did not find any such alternative alignments, and the trace back finally ended up as shown below:

Then, in the last phase of dynamic programming, we have to generate the actual sequence alignment from the trace back path shown above. We start from the lower rightmost arrow. Whenever we have a diagonal arrow, the residues from both sequences are written in the same column. For example, the lower rightmost arrow is diagonal; therefore, the residues from both sequences are written in the same column, as shown below:

R
R

Then we follow the previous arrow. If the arrow is diagonal, we again write the residues of both sequences in the same column.

KR
KR

In this way we continue until we encounter a horizontal or a vertical arrow. In the present case we continue aligning up to the residues R in the vertical sequence and N in the horizontal sequence.

NCELKR
RCEVKR

Then we encounter a vertical arrow. Since the arrow is vertical, the residue from the vertical sequence is written and a gap is introduced in the horizontal sequence, as shown below:

-NCELKR
ERCEVKR

Next we encounter a horizontal arrow; therefore, a residue from the horizontal sequence is aligned with a gap in the vertical sequence, as shown below:

E-NCELKR
-ERCEVKR

Next, we have three diagonal arrows; therefore, three residues from each sequence are aligned, as shown below:

SDQE-NCELKR
SEE-ERCEVKR

This way of alignment using dynamic programming was described by Needleman and Wunsch. The result is an alignment of two sequences (10 residues each) using BLOSUM62 in which 9 residues of each sequence are aligned with a residue: 5 identical residues (S, C, E, K, R), 3 conservative substitutions with positive scores (D/E, Q/E, L/V) and 1 semi-conservative substitution with a score of zero (N/R). One residue in each sequence is aligned with a gap, giving an overall alignment length of 11.
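The three phases (filling the grid, tracing back and writing out the alignment) can be put together in a compact sketch. To stay short, the code below does not embed BLOSUM62: it assumes a toy scoring scheme (match +1, mismatch -1, linear gap -1), so it illustrates the procedure rather than reproducing the worked example above. The function name and its parameters are hypothetical. During trace back it moves to the neighbouring cell from which the current score was actually derived, preferring the diagonal when there is a tie.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment with a toy scoring scheme; returns (score, row1, row2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1

    def pair(i: int, j: int) -> int:
        """Score for aligning seq2[i-1] (vertical) with seq1[j-1] (horizontal)."""
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    # Phase 1: fill the score grid.
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # diagonal
                              score[i][j - 1] + gap,             # from the left
                              score[i - 1][j] + gap)             # from above

    # Phases 2 and 3: trace back from the lower right cell and build the rows.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("-")  # horizontal arrow
            j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1])  # vertical arrow
            i -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch("ACTGC", "AAGC")
print(best)
print(row1)
print(row2)
```

Replacing the `pair` helper with a lookup in a substitution matrix such as BLOSUM62 turns this sketch into the protein version of the method.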

Let us take another example of Needleman-Wunsch dynamic programming alignment for the following pair of protein sequences, using BLOSUM62 and a linear gap penalty of zero:

THISISAPRRTEINSEQVENCE
ITISANNTHERSEQVENCE

The resulting matrix and the consequent sequence alignment are shown below:

    -THISISAPRRT-EINSEQVENCE
     | ||  | ::| | :||||||||
    IT-IS--A-NNTHE-RSEQVENCE

This manual dynamic programming alignment using the Needleman-Wunsch algorithm with BLOSUM62 and a zero gap penalty produced 14 identical positions, 3 semi-conservative substitutions and 7 gaps, with an overall alignment length of 24.

This way of alignment using dynamic programming described by Needleman and Wunsch results in global alignment of two sequences. Smith and Waterman modified this method of alignment slightly. They did not allow the score to fall below zero. This modification allowed the regions with local similarity to be detected and aligned. Therefore, Smith and Waterman method of dynamic programming produced local alignment between two sequences.
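The Smith and Waterman modification amounts to adding zero as a fourth candidate when each cell is filled. The sketch below computes only the best local score. It assumes the same toy scoring scheme as before (match +1, mismatch -1, linear gap -1) rather than a substitution matrix, and the function name is hypothetical. In the full method the trace back would start from the highest-scoring cell and stop at the first cell containing zero.

```python
def smith_waterman_score(seq1: str, seq2: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score; cell scores are never allowed to fall below zero."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(0,                            # start a new local alignment
                              score[i - 1][j - 1] + pair,   # diagonal
                              score[i][j - 1] + gap,        # from the left
                              score[i - 1][j] + gap)        # from above
            best = max(best, score[i][j])
    return best


# The shared region ACG is found even though the flanking residues differ.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))  # 3
```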

4. Summary

In this lecture we learnt about:

• various empirical scoring schemes for pairwise sequence alignment

• dynamic programming method for creating optimal sequence alignment
