Biochemistry
Biostatistics and Bioinformatics
Multiple Sequence Analysis and Phylogenetic Tree Construction
Description of Module
Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 19 Multiple Sequence Alignment and Phylogenetic Tree Construction
Dr. Vijaya Khader Dr. MC Varadaraj
1. Objectives: For multiple sequence alignment and phylogenetic tree construction, the objectives in the present module are to
1.1. Compile a set of closely and distantly related enzyme sequences for construction of a multiple sequence alignment (MSA)
1.2. Display MSA graphically to extract maximum amount of information for detecting active site residues
1.3. Analyse MSA to detect conserved motifs and residues present in the active site of the enzyme
1.4. Construction of the phylogenetic tree to reveal evolutionary history
2. Concept Map
[Concept map: Description; Compilation of a Sequence Set; Construction of MSA; Graphic Display of MSA; Analysis of MSA; Phylogenetic Tree Construction; Summary]
3. Description
Dear students, during the evolution of enzymes, functionally significant residues are conserved more than the rest of the residues in the sequence. A conserved residue may be present singly or as part of a contiguous short stretch, known as a sequence motif. The residues inside the active site of an enzyme are conserved for functional roles such as substrate binding and chemical catalysis. The catalytic residues are the actual reactive amino acids and are also involved in stabilization of the transition state. In addition, glycine and proline residues outside the active site are conserved for accurate folding of the protein, which positions the active site residues for binding and catalysis of the substrate.
In modules 6 to 10, we saw that pairwise sequence alignment is used as a similarity search tool to find similar sequences in databases and to identify members of a sequence family. Pairwise sequence alignment allows structural, functional and evolutionary relationships to be drawn between two sequences. However, to extract functionally significant residues from sequence information, a conclusive inference about the consensus residues among several sequences needs to be drawn. During evolution, mutations occur to enable species to survive in a changing environment, but the functionally significant residues are conserved along with their positions in three dimensions, so that the active site forms correctly for catalysing the same chemical reaction. Therefore, alignment of several sequences of enzymes catalysing the same reaction is useful for detecting conserved residues and identifying their functional roles. The alignment of several sequences is known as a multiple sequence alignment, i.e. an MSA, which allows detection of conserved residues that are otherwise hidden in a pairwise alignment.
The detection of conserved residues gives an insight into substrate binding, chemical catalysis and folding patterns in proteins. Detection of conserved residues inside and outside the active site may lay the foundation for an initial set of peptides for developing a QSAR model for structure based protein design. Further, an MSA is a prerequisite for constructing a phylogenetic tree reflecting evolutionary divergence over time.
A phylogenetic tree, therefore, may be used to select closely related multiple template structures for interactive homology modeling of protein sequences. Consequently, an MSA has several important applications, and in this module we will focus on detecting binding site residues involved in determining specificity, detecting reactive catalytic residues involved in determining the turnover number of enzymes, and detecting significant glycines and prolines required for accurate protein folding. Above all, an MSA is a prerequisite for constructing a phylogenetic tree.
Therefore, the objectives in this module on detecting the active site of enzymes begin with compiling a set of enzyme sequences from various species, known to be related to each other either closely or distantly, for construction of a multiple sequence alignment. The constructed MSA will be displayed graphically to extract the maximum amount of information hidden in the MSA for detection of functionally significant residues. Finally, we will use the MSA for construction of the phylogenetic tree.
3.1. Compilation of a Set of Related Sequences
In module 17 we used the bifunctional enzyme Hpr Kinase/Phosphorylase (Hpr K/P) for homology modeling, and in module 18 we used Hpr K/P for protein-ligand docking. As a kinase, Hpr K/P accepts Hpr and pyrophosphate as substrates to produce serine phosphorylated Hpr and inorganic phosphate. As a phosphorylase, Hpr K/P accepts serine phosphorylated Hpr and inorganic phosphate as substrates to produce un-phosphorylated Hpr and pyrophosphate.
In the present module we will carry forward the example of Hpr K/P to detect its functional residues. We will use multiple sequence alignment for this purpose because the 3D structure of Hpr K/P bound to its substrate, pyrophosphate, is not available. We will compile a set of closely and distantly related Hpr K/P sequences for construction of the multiple sequence alignment. We could use BLAST searches for compiling a complete set of closely and distantly related Hpr K/P sequences. However, in this example we will use a keyword based search of the reviewed Swiss-Prot database at UniProtKB, although BLAST searches for closely, intermediately and distantly diverged sequences using the PAM30, PAM120 and PAM250 matrices will lead to the same sequence set. Visit UniProtKB at http://www.uniprot.org/ and search for Hpr Kinase/Phosphorylase. This search returns 215 reviewed Swiss-Prot and 4512 unreviewed TrEMBL sequences. Swiss-Prot is the manually annotated component of UniProtKB. It contains manually reviewed and annotated proteins with information extracted from literature and curator-evaluated computational analysis. TrEMBL, on the other hand, contains computationally translated proteins from the nucleic acid sequence database at EMBL, which are awaiting review for inclusion in Swiss-Prot. The 215 reviewed entries form a sufficiently large dataset. Therefore, click the Reviewed 215 hyperlink to display the reviewed entries.
By default, the UniProtKB results page displays 25 entries per page. We have 215 entries. Therefore, to display all 215 entries on a single page in your web browser, select 250 from the dropdown list.
At this stage, we need to analyze the lengths of the sequences to be included for constructing the multiple sequence alignment. Therefore, sort the 215 sequences in increasing order of their length by clicking on the down arrow button in the “Length” column.
The sorted list is presented in the web browser.
In this sorted list, it is clear that the first sequence, with 216 residues, is very short compared to the next sequence with 301 residues. Therefore, this sequence will not be included in the MSA. After the sequence of length 301, the sequence lengths increase in steps of one or two residues. This represents genuine variability in sequence length during evolution. Scrolling further down the sorted list shows that the sequence lengths continue to differ by one or two residues.
However, the last sequence again has an unusual length of 615. Therefore, this sequence will not be included in the MSA either.
Now, go to the top of the list and sort the sequences in alphabetical order by clicking on the button in the organism column. This will help in locating a sequence in the multiple sequence alignment by its name. Now select all sequences by clicking the check box in the first column heading. Then find the sequence with length 216 and unselect it. Similarly, find the sequence with the unusual length of 615 and unselect it.
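The manual exclusion of unusually short and long sequences described above can also be sketched in a few lines of Python. This is only an illustration: the module picks the outliers by eye, whereas the sketch assumes a cut-off of 25% around the median length, and all entry names and the intermediate lengths (310, 312, 322) are hypothetical; only the lengths 216, 301 and 615 come from the sorted list.

```python
from statistics import median


def filter_by_length(lengths, tolerance=0.25):
    """Keep entries whose length lies within `tolerance` (a fraction) of the median.

    The 25% default is an assumed threshold, not a rule used by UniProtKB.
    """
    med = median(lengths.values())
    return {name: n for name, n in lengths.items() if abs(n - med) <= tolerance * med}


# Hypothetical entry names; 216, 301 and 615 are the lengths quoted in the text.
lengths = {
    "short_entry": 216,
    "seq_a": 301,
    "seq_b": 310,
    "seq_c": 312,
    "seq_d": 322,
    "long_entry": 615,
}

kept = filter_by_length(lengths)
print(sorted(kept))
```

With these toy values, the entries of length 216 and 615 are dropped and the rest are kept, which mirrors the manual unselecting done on the results page.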
This completes our set of sequences to be used in the MSA. The compiled set of 213 sequences will be displayed with a yellow background, and the Align button in the tool bar at the top of the page will be enabled.
3.2. Construction of MSA
For constructing an alignment from this page, click the Align button. This will begin the alignment using the application tool Clustal Omega, which is a new, general purpose MSA tool and can align up to 4000 sequences or a maximum file size of 4 MB. The results of the MSA will appear in the browser window.
To highlight aligned residues on the basis of conservation and similarities, click the first checkbox under amino acid properties.
This will highlight the conserved columns for identities with dark grey, strongly similar columns with grey and weakly similar columns with light grey.
In addition, below the last sequence in the alignment, the consensus pattern is displayed using one of three consensus symbols. A star (“*”) symbol in a column indicates identity, i.e. the same residue is present in all aligned sequences. A colon (“:”) symbol indicates conservation of residues with strongly similar properties. On the other hand, a dot or period (“.”) symbol indicates conservation of residues with weakly similar properties. I will come back to similarities while analyzing the MSA.
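The way a consensus symbol is assigned to one alignment column can be sketched as follows. This is a minimal sketch, not the Clustal Omega source code: the nine strong and eleven weak groups are listed as they are usually given in the Clustal documentation and should be checked against the FAQ, and the treatment of gapped columns (no symbol) is an assumption.

```python
# Similarity groups as usually listed in the Clustal documentation (verify against the FAQ).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_symbol(column):
    """Return '*', ':', '.' or ' ' for one column of aligned residues."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # assumption: a column containing a gap gets no symbol
    if len(residues) == 1:
        return "*"  # identity: the same residue in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall inside one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall inside one weak group
    return " "


for col in ["GGGG", "MLML", "MILVF", "GNGN", "GWGW"]:
    print(col, repr(consensus_symbol(col)))
```

A column of only methionine and leucine is marked with a colon, while a column mixing M, I, L, V and F fits only the weak group FVLIM and is marked with a dot.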
3.3. Graphic Display of MSA
A conserved motif with four residues (GKSE), another with two residues (DD), and single identities in the MSA are clear enough to infer a frequency of 100% at those positions. Clustal Omega categorizes certain residues into strong and weak similarity groups, marked with colon and period symbols respectively.
STA is the first of nine distinct groups allowing replacements with strongly similar properties.
Similarly, FYW is the last group with strongly similar properties. There are also eleven groups of amino acids showing weak similarities. Visit the FAQ at http://www.ebi.ac.uk/Tools/msa/clustalo/help/faq.html for the complete list. The conserved identities in the MSA are clear enough to infer their participation in a significant function of the enzyme. Similarities, either strong or weak, convey allowed mutations. A single replacement is sufficient to classify a column as a non-identity. Similarities do not quantify the extent of replacement, i.e. the consensus representation does not convey the frequency of replacement. But to draw a meaningful conclusion about involvement in a significant function, the extent of replacement also needs to be conveyed. The extent of conservation and replacement in an MSA can be shown graphically in a 100% stacked bar diagram, sometimes called a 100% stacked column diagram, and commonly known as a sequence logo.
A sequence logo or 100% stacked column diagram compares the percentage that each amino acid contributes to the total at each position. Use a 100% stacked column diagram to emphasize the proportion of each amino acid at a given position in a protein sequence.
The proportion of each amino acid is displayed using vertical rectangles with heights corresponding to that proportion. It can be drawn using any spreadsheet application such as MS Excel. In the set of hypothetical sequences taken in the present example, the first sequence has the twenty amino acids in alphabetical order of their one-letter symbols along the sequence length. The second sequence replaces the second last residue with a tyrosine, the third sequence replaces the third last residue with a tyrosine, and the mutation continues in the same way until the first residue of the first sequence is replaced with a tyrosine. These sequences are aligned hypothetically in the way shown here. The amino acid composition at each position is entered in a spreadsheet with columns labeled 1 to 20 for amino acid positions and rows labeled with the amino acid symbols in order. After entering the amino acid composition at each position, select the 100% stacked column chart from the tool bar in the Insert menu of MS Excel.
This 100% stacked column bar diagram clearly shows 95% alanine and 5% tyrosine in first position with corresponding heights of rectangles. Similarly, it shows 90% cysteine and 10% tyrosine in second position.
The complete vertical rectangle in the last column reveals the conservation of tyrosine at this position.
Therefore, a sequence logo or 100% stacked column bar graphical display reveals the extent of conservation and replacements in an MSA.
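The percentages entered in the spreadsheet can be reproduced with a short Python sketch. It assumes that the tyrosine replacements are cumulative, i.e. each new sequence keeps the tyrosines of the previous one, which is the reading consistent with 95% alanine in the first position and 90% cysteine in the second. The function names are illustrative only.

```python
from collections import Counter

# Twenty amino acids in alphabetical order of their one-letter symbols.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hypothetical_alignment():
    """Build the 20 hypothetical sequences with cumulative tyrosine replacement."""
    n = len(AMINO_ACIDS)
    # Sequence k keeps the first n-k-1 residues and has tyrosine in the rest.
    return [AMINO_ACIDS[: n - k - 1] + "Y" * (k + 1) for k in range(n)]


def column_percentages(seqs):
    """Percentage of each amino acid at every aligned position."""
    total = len(seqs)
    table = []
    for column in zip(*seqs):
        counts = Counter(column)
        table.append({aa: 100.0 * c / total for aa, c in counts.items()})
    return table


table = column_percentages(hypothetical_alignment())
print("position 1:", table[0])
print("position 2:", table[1])
print("position 20:", table[-1])
```

Each dictionary in `table` corresponds to one column of the spreadsheet, so the values can be pasted directly as the data for a 100% stacked column chart.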
For constructing the sequence logo, let us download the multiple sequence alignment of Hpr K/P by clicking the Download button in the tool bar at the top of the UniProtKB Clustal Omega page. This will present a dialog box to select the format and mode of the downloaded file. Saving the MSA in FASTA format will help in further offline analysis, such as highlighting identities at different levels of conservation (99%, 95% or another level) using the MSA editing application BioEdit.
The alignment results will appear in the browser window in FASTA format, sometimes called MultiFASTA format.
Select all, copy to the clipboard, paste in NotePad, and save under a chosen file name with FASTA as the extension and with “All files” selected as the file type.
Visit WebLogo at http://weblogo.threeplusone.com/create.cgi and upload the saved alignment to create a sequence logo. A sequence logo describes the information content of the alignment. Here the sequence logo for positions 1 to 180 is shown in three rows with 60 positions in each row. Underneath each row, the consensus sequence from Clustal Omega is pasted manually. The X-axis shows the position in the alignment.
The Y-axis describes the amount of information in bits. At each position there is a stack of amino acid symbols occupying the position. The overall height of a stack represents the conservation of amino acids at that position. Higher stacks represent higher conservation at that position, and small stacks represent positions with a higher level of mutation.
Large symbols within a stack represent frequently observed amino acids.
In addition, there is a wave of information from the beginning of the sequence to the end, representing the change in conservation level along the sequence. That means the stacks do not all reach the same height, as they do in a 100% stacked column diagram. The stack height at a given position is determined by the amount of conservation present at that position. This shows that the information at each position is not the same.
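To see why the stacks differ in height, the information content of a column can be sketched as the maximum possible uncertainty for 20 amino acids, log2(20), minus the Shannon entropy of the residues observed in that column. The sketch below is a simplification under stated assumptions: it ignores gaps and leaves out the small-sample correction that WebLogo applies, and the function names are illustrative only.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of one gap-free column, in bits (no small-sample correction)."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each symbol in the stack: its frequency times the column information."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {aa: info * c / total for aa, c in Counter(column).items()}


conserved = "Y" * 20          # fully conserved column
mixed = "A" * 10 + "Y" * 10   # two residues, each in half of the sequences

print(round(column_information(conserved), 3))
print(round(column_information(mixed), 3))
print(letter_heights(mixed))
```

A fully conserved column reaches the maximum of about 4.32 bits, while a column split equally between two residues loses exactly one bit, so its stack is shorter.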
The higher stacks fall within positions having conserved and strongly similar residues. Next, the sequence logo for positions 181 to 240 is presented.
We find a stretch of higher stacks in the first row, which corresponds to conserved motifs and a large stretch of similar residues. This is actually the active site of the enzyme. Therefore, higher stacks in a given stretch indicate the presence of a binding or active site. Another sequence logo application, available at the Center for Biological Sequence Analysis at the Technical University of Denmark (DTU), allows construction of many types of sequence logos. Visit http://www.cbs.dtu.dk/biotools/Seq2Logo for construction of a Shannon type logo.
Try constructing a Shannon type logo using the previous set of hypothetical sequences to understand the interpretation of a Shannon type sequence logo.
The Shannon type logo is somewhat similar to the one constructed at WebLogo at threeplusone.com. The X-axis again shows the position in the alignment. The Y-axis also describes the amount of information in bits.
At each position there is a stack of amino acid symbols occupying the same position. But instead of displaying only the residues actually observed in the alignment, the Shannon type logo includes other residues predicted with Information Theory. Here these may be considered as mutations which may be expected during further evolution. Biochemists may test these mutations in the laboratory: the suggested mutations in the conserved motif containing the active site of Hpr K/P may be introduced through site directed mutagenesis, followed by expression and purification of the mutated Hpr K/P for activity measurements.
For example, mutation of conserved dipeptide aspartate-aspartate to suggested glutamate-glutamate may be tried for retaining activity by the enzyme.
3.4. Analysis of MSA
Now, let us analyse the MSA using the Clustal Omega consensus notation. A colon in the last column of the alignment in this example shows that the column is conserved with strongly similar properties. This indicates replacement of a residue with other residues having similar properties, for example, mutation between methionine and leucine residues in the present case. Clustal Omega defines nine groups for strong similarities and eleven groups for weak similarities. For example, strong similarity is indicated for the group MILV when any of these residues replaces the others in a column.
However, if phenylalanine also occurs at a position where M, I, L and V occur, the column is classified as a weak similarity instead of a strong one. From now onwards, sequences from HPRK_PEDPA from Pediococcus pentosaceus to HPRK_STAXY from Staphylococcus xylosus are displayed.
The aligned 60 residues reveal a column with strongly similar amino acids, namely phenylalanine, tyrosine and tryptophan, the aromatic group of residues. Such strongly similar mutations indicate participation in packing of residues in the hydrophobic core of proteins. In addition, three columns show dots in the consensus. This indicates the presence of weakly similar residues, as in the second column, which shows glycine and asparagine. In addition, two more columns show dots in the consensus. These weakly similar positions are allowed to mutate to less similar residues for optimizing enzymatic activity in different species surviving under different environments. Therefore, these weak similarities may be significant for developing an initial set of proteins to be used in a QSAR model for structure based protein design, for engineering enzyme activity under a given environment. The next 60 aligned residues reveal the presence of a conserved histidine in all strains analyzed.
This is a strong indication of the involvement of histidine in the reaction mechanism, or in forming a salt bridge that buries a negatively charged residue (aspartate or glutamate) inside the hydrophobic core of the protein, or in forming an ion pair on the surface for correct three dimensional folding of the protein. Experimental knowledge confirms the participation of this histidine in the reaction mechanism. The next 60 aligned residues reveal the presence of a short conserved sequence of length four, i.e. a conserved motif, comprising glycine, lysine, serine and glutamate. This conserved motif is a strong indicator of the active site of Hpr K/P. Experimental knowledge confirms the participation of this conserved motif in binding the substrate for catalysis.
Further, conservation of two consecutive aspartates indicates participation of these residues in forming the active site.
Experimental knowledge confirms the participation of these aspartate residues in the reaction mechanism.
Conservation of three single residues, two glycines and one arginine, is also revealed. Conservation of a charged residue such as arginine is a strong indicator of its involvement in the reaction mechanism, or in burying a negatively charged residue (aspartate or glutamate) inside the hydrophobic core of the protein by forming a salt bridge, or in exposing an ion pair on the surface for correct 3D folding. Experimental knowledge confirms the participation of this arginine in the reaction mechanism, for stabilization of the transition state oxy-anion. The conservation of a glycine residue is a strong indicator of glycine being part of a turn in the three dimensional structure of the protein. The next 60 aligned residues reveal conservation of two glycine residues, indicating participation in a turn in the three dimensional structure of the protein. We will look at single conservations and similarities of un-charged residues in the next module.
The next 60 aligned residues reveal conservation of a glutamate, indicating involvement in the reaction mechanism, or in forming a salt bridge that buries a positively charged residue inside the hydrophobic core of the protein, or in forming an ion pair on the surface for correct three dimensional folding.
Experimental knowledge confirms the participation of this glutamate in burying a histidine.
Similarities in hydrophobicity, polarity and isoelectric point can be represented through a periodic spectrum of amino acids.
Alanine is minimally hydrophobic, with hydrophobicity increasing both up and to the left, up to phenylalanine. The hydrophobicity of polar side chains increases from glutamine both up and to the right, up to tyrosine. This spectrum provides residues for packing the hydrophobic core of proteins. The isoelectric point increases periodically from the acidic residues to arginine in increments of approximately two units. This provides residues for acid base catalysis from pH 3 to 13. This periodic spectrum provides a basis for mutating amino acids with similar physicochemical properties. For example, Clustal Omega treats MILV as a distinct group whose members can replace one another with strongly similar properties. Similarly, FYW is another group with strongly similar properties.
The consensus notation of the Hpr K/P MSA reveals two conserved motifs, 7 identities, 22 strong similarities and 11 weak similarities.
In total, there are 33 similarities, which are due to mutations by residues having similar physico-chemical properties such as hydrophobicity, polarity and isoelectric point. We have seen that Hpr K/P in vitro, as a kinase, accepts Hpr and pyrophosphate as substrates to produce serine phosphorylated Hpr and inorganic phosphate.
As a phosphorylase, Hpr K/P accepts Ser-(P)-HPr and inorganic phosphate as substrates to produce dephosphorylated Hpr and pyrophosphate. From experimental observations, it is known that residues G151-E159 form the P-loop, which binds the ATP or pyrophosphate substrate.
The residues H136 and D174-D175 are catalytically reactive, and residue R202 is involved in the stabilization of the oxy-anion transition state.
The conserved motif in the P-loop, the reactive di-aspartate, the reactive H136 and the transition state stabilizer R202 are all identified correctly in the consensus sequence by Clustal Omega. In three dimensions, S46 of Hpr and R202 of Hpr K/P are present on opposite sides of the phosphate group. H136 and D174-D175 are present near S46 of Hpr. Therefore, identification of conserved residues can be used for detecting functionally important residues in enzymes.
3.5. Phylogenetic Tree Construction
The 3D structure of Hpr K/P from Enterococcus faecalis is not available. In module 17 on Protein Structure Modelling, we used the automated mode of homology modeling for 3D structure prediction based on a single template structure. However, in module 18 on protein ligand docking, we found 3D structures of Hpr K/P from Mycoplasma, Staphylococcus and Lactobacillus. Using more than one template for homology modeling results in better structure prediction. But to undertake homology modeling using more than one template, we need to use only closely related 3D structures. The MSA is used to reconstruct the evolutionary history and so establish the extent of relatedness. The evolutionary history is presented graphically using a diagram known as a phylogenetic tree. A phylogenetic tree reflects the divergence of genes and proteins from a common ancestor and may therefore be used to select closely related multiple templates for interactive homology modeling of protein structures.
There are hundreds of programs available for phylogenetic analysis. For a complete list of tree construction applications, visit the evolution site at the genetics division of the University of Washington at http://evolution.genetics.washington.edu/phylip/software.html. These programs follow different strategies for construction of a phylogenetic tree. Visit Wikipedia at https://en.wikipedia.org/wiki/Phylogenetic_tree for the theory behind the various phylogenetic tree construction methods.
We will use PHYML application, a freely available web based tree construction program, which is based on Maximum Likelihood. PhyML application is also part of the robust tree construction application for the non-specialist available at http://www.phylogeny.fr/ which constructs the MSA using commonly used MSA application, MUSCLE.
Visit PhyML available at http://www.atgc-montpellier.fr/phyml/
At PHYML we find that PHYML application requires pre-constructed MSA in PHYLIP format. We constructed MSA using Clustal Omega at UniProtKB and downloaded MSA in FASTA format for construction of WebLogo. Clustal Omega MSA alignment at UniProtKB allows downloading MSA in FASTA format only.
Therefore we need to convert format of saved alignment from FASTA to PHYLIP. Alternatively, MSA may be reconstructed interactively using Clustal Omega at EBI.
Visit format conversion site at https://hcv.lanl.gov/content/sequence/FORMAT_CONVERSION/form.html.
Choose the saved file in FASTA format. Set the input format to FASTA. Select the PHYLIP standard - sequential and choose output line width as wide as possible and click submit button. In the ensuing window, the MSA in PHYLIP format appears.
Click download hyperlink to display MSA in PHYLIP format in NotePad. The PHYLIP format has two sections, the header and the sequences.
The header section has a single line which records two numbers and a letter. The first number records the number of sequences in the file, 213 in the present example. The second number records the number of aligned columns in the multiple sequence alignment. This includes the inserted gaps, indicated with the hyphen character, and is 366 in the present example. The third item is a letter conveying that all the sequences are entered in a sequential manner, i.e. one sequence after the other, each on a separate line after the header line. Each sequence line starts with an identification annotation typed in the first ten columns. Therefore, if the identification annotation is shorter than 10 characters, six in the present example, pad the remaining columns with blanks, i.e. 4 spaces in the present case. To this identification annotation, the complete aligned sequence including gaps in the MSA is appended. Its length is 366 in the present case. Then save this file in PHYLIP format with .PHYLIP as the extension using NotePad, as already explained several times.
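The layout just described can be illustrated with a minimal Python sketch that converts an aligned FASTA text into sequential PHYLIP. It is not the code of the online converter: the function names are illustrative, the identifier is assumed to be the first word of each FASTA header, and the header line here carries only the two numbers, without the optional letter mentioned above. The two toy sequences are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip_sequential(records):
    """Format aligned records as sequential PHYLIP with 10-column identifiers."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Truncate or pad the identifier to exactly ten columns, then append the sequence.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical two-sequence alignment with six-character identifiers.
fasta_text = ">SEQ001\nMK-L\nAA\n>SEQ002\nMKVL\n-A\n"
print(to_phylip_sequential(parse_fasta(fasta_text)))
```

For the real alignment, the header line would read 213 and 366, and each six-character identifier would be followed by four spaces before the aligned sequence begins.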
Now, visit PHYML at http://www.atgc-montpellier.fr/phyml/ and upload saved MSA in PHYLIP format.
Set the type of sequence as amino acids and type of Phylip format as sequential. Leave other settings to default. Enter a name for analysis, your e-mail, confirm e-mail and execute.
The ensuing page will confirm the upload and allow you to check the status and results. The results link will also reach your e-mail ID in the indicated time. Click the hyperlink to check the job status.
If the job is still running it will be indicated. When the job is completed, click hyperlink visualize your inferred phylogenetic tree.
The tree is presented in the browser window. The tree can be saved in any of image formats for printing. A typical bifurcating tree is shown here.
The horizontal lines in the tree are called branches, and each represents a sequence belonging to this phylogenetic group or family. Two adjoining branches are joined at a common ancestor called a node, represented with a circle in this diagram. The node on the extreme left (at the bottom in some tree diagrams) is called the root node, which represents the ultimate common ancestor of all members in the tree. The identifiers on the extreme right in this tree are the existing sequences used to construct the tree. All the nodes between these existing sequences and the root node are inferred ancestor sequences, which do not exist today and are extinct. The branch lengths are shown, and they represent the amount of evolutionary divergence. A phylogenetic tree drawn this way is called a phylogram. But as biochemists, we are interested only in relationships, either close or distant. These are represented through a dendrogram.
Therefore, click dendrogram to display the tree as a dendrogram. One can search for a node in the tree.
Simply type the identification annotation in the search node text box and press the enter key. Enter o07674 for Hpr KP from Enterococcus faecalis and press the enter key. The tree can be represented in a linear or radial manner. Change the tree to radial by clicking the radial radio button, so that all leaf nodes, i.e. the sequences used for tree construction, are displayed in minimum space. The searched node will be highlighted in blue color. Click on the node to open the context menu and select the “Path to Root” command.
This will highlight the path to earliest divergence node in the tree in red color.
Similarly, search for any other node whose path you wish to map. Enter Q9S1H5 for Hpr KP from Staphylococcus xylosus. Then click on the node to open the context menu and select the Path to Root command. This will highlight the path to the earliest divergence node in the tree in red color. In this way the user can estimate the divergence of the same enzyme in two species during evolution.
Similarly, search Q9RE09 for Hpr KP from Lactobacillus casei, click on the node to open the context menu and select the Path to Root command. This will connect the path to the earliest divergence node in the tree. Then search Q8EWA5 for Hpr KP from Mycoplasma penetrans, click on the node to open the context menu and select the Path to Root command. This will again connect the path to the earliest divergence node in the tree. In this way the user can estimate the divergence of the same enzyme in different species during evolution. The complete path from the root to these four nodes is now highlighted in red color.
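What the Path to Root command does can be sketched on a small toy tree. The topology below is entirely hypothetical and is not the tree inferred by PhyML; the leaf names `target`, `close1`, `close2` and `distant` and the node names `n1` to `n3` are placeholders. Each node is stored with its parent, the path is followed until the root is reached, and the deepest node shared by two paths is their most recent common ancestor.

```python
def path_to_root(parent, node):
    """List the nodes from `node` up to the root, following parent links."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def last_common_ancestor(parent, a, b):
    """First node on the path of `b` that also lies on the path of `a`."""
    ancestors_a = set(path_to_root(parent, a))
    for node in path_to_root(parent, b):
        if node in ancestors_a:
            return node
    return None


def depth(parent, node):
    """Number of branches between `node` and the root."""
    return len(path_to_root(parent, node)) - 1


# Hypothetical topology: child -> parent. The root has no entry.
parent = {
    "n1": "root",
    "distant": "n1", "n2": "n1",
    "close2": "n2", "n3": "n2",
    "close1": "n3", "target": "n3",
}

print(path_to_root(parent, "target"))
for other in ["close1", "close2", "distant"]:
    lca = last_common_ancestor(parent, "target", other)
    print(other, "shares node", lca, "at depth", depth(parent, lca))
```

A common ancestor that lies deeper in the tree, further from the root, indicates a more recent divergence, so leaves sharing a deep node with the target are the closer relatives and the better template candidates.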
This highlighting reveals that the divergence of HprKP in Mycoplasma from the root occurred much earlier, at the second node, whereas the divergence of the Staphylococcus, Lactobacillus casei and Enterococcus HprKP occurred at the more recent seventh divergence node.
This shows that the Staphylococcus, Lactobacillus casei and Enterococcus HprKP had a common ancestor at the seventh inferred node. Therefore, the divergent template available from Mycoplasma may be excluded from interactive homology modeling, and templates from the closely related Staphylococcus and Lactobacillus may be used for homology modeling of Hpr K/P from Enterococcus. Visit http://spdbv.vital-it.ch/modeling_tut.html for interactive homology modeling using SwissPDBViewer and the SWISS-MODEL server and follow the step-by-step procedure.
4. Summary
Dear students, during the evolution of proteins, functionally significant amino acids are conserved more than the rest of the residues in the sequence. Therefore, we learned to compile, using a keyword search of Swiss-Prot, a set of enzyme sequences from various species known to be related to each other, either closely or distantly. Then we learned to construct a multiple sequence alignment from this set of sequences. The constructed MSA was displayed graphically as a WebLogo to extract the maximum amount of information hidden in the MSA. The MSA was then analyzed in the light of this graphical display to detect functionally significant residues of Hpr K/P. The conserved motif in the P-loop for binding ATP or pyrophosphate, the reactive di-aspartate, the reactive H136 and the transition state stabilizer R202 could all be identified correctly in the consensus sequence by Clustal Omega. Therefore, an MSA is very useful for detecting functional residues when the 3D structure of the enzyme-substrate complex is not available. In addition, we learned to construct a phylogenetic tree to detect closely related sequences for selecting multiple templates to be used during interactive homology modeling. I thank you all for visiting ePGPathshala.