Biochemistry
Biostatistics and Bioinformatics
Protein Sequence Analysis
Description of Module
Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 05 Protein Sequence Analysis
1. Objectives: In this module, the students will:
1. Understand protein sequence analysis for Biochemistry and Molecular biology experiments
2. Learn downloading annotated protein sequences from UniProtKB
3. Compute various biochemical parameters for a given protein sequence
4. Learn prediction of post-translational modifications
5. Learn prediction of signal peptide and transmembrane helices in a given protein sequence
6. Learn downloading raw protein sequences using a genome browser
7. Analyse a given protein sequence to find repeats using RADAR and visualize the presence of direct repeats using DotPlot
8. Use InterProScan to find the family, domains, repeats and sites in a given protein sequence
9. Use PeptideCutter to predict the sites at which enzymes and/or chemicals cleave peptide bonds in an input protein sequence
2. Concept Map
[Concept map: Protein Sequence Analysis at the centre, linked to Downloading UniProtKB Sequence, Protein Parameter Computation, PTM Prediction, Signal and TM Peptide Prediction, Repeat Analysis and Visualization, Using InterProScan, and Using PeptideCutter]
3. Protein Sequence Analysis
Protein sequence analysis for Biochemistry and Molecular biology experiments begins with obtaining a sequence in the laboratory or from a sequence database. This is followed by computing various biochemical parameters, predicting the signal peptide and transmembrane helices, and predicting post-translational modifications. To visualize the presence of repeats, DotPlot analysis is conducted. To gain additional information from known databases, the PredictProtein tool is used for detecting various features and the InterPro tool for functional analysis of proteins by classifying them into families. Finally, for protein identification using mass spectrometry, the PeptideCutter tool is used to predict the sites at which enzymes and/or chemicals cleave peptide bonds in the protein sequence to be identified.
3.1. Downloading an annotated protein sequence
Visit http://www.expasy.org/, search UniProtKB for “Glycophorin A Human” and click the search button.
In the results page, follow the link to the list of UniProtKB entries. In the ensuing page, choose the GLPA_HUMAN entry at serial number 2 and click its hyperlink to reach the Glycophorin A page.
The most important element of this page is the Display sidebar, from which one can jump to any of the listed features. The features include function, names & taxonomy, subcellular location, post-translational modifications & processing, interactions with other proteins, 3-D structures, conserved families and domains, sequence & external links to other sequence databases, and publications & literature information. The information for Glycophorin A from human is presented for subcellular location and post-translational modifications/processing.
In the Format tab of the main page, select FASTA (canonical).
Download the FASTA sequence and save it as the file GLPA.FA using NotePad. The mature peptide spans amino acids 20 to 150 and has three domains: an N-terminal extracellular domain carrying 16 oligosaccharide units with nearly 100 sugars, rich in sialic acid, which make the RBC surface anionic and thus hydrophilic; a middle transmembrane helix; and a C-terminal cytoplasmic domain. The sequence of the complete protein is shown next; the mature protein corresponds to residues 20 to 150, beginning at SSTTGVAM.
>sp|P02724|GLPA_HUMAN Glycophorin-A OS=Homo sapiens GN=GYPA PE=1 SV=2
MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH
EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK
SPSDVKPLPSPDTDVPLSSVEIENPETSDQ
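The saved FASTA record can also be handled with a short script. The following Python sketch is only an illustration: the helper name parse_fasta and the variable names are assumptions made for this example and are not part of UniProtKB or ExPASy. It reads the record shown above from a string and slices out the mature protein (residues 20 to 150).

```python
# Minimal sketch: parse one FASTA record and extract the mature protein.
# parse_fasta is a hypothetical helper written for this example.

GLPA_FASTA = """>sp|P02724|GLPA_HUMAN Glycophorin-A OS=Homo sapiens GN=GYPA PE=1 SV=2
MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH
EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK
SPSDVKPLPSPDTDVPLSSVEIENPETSDQ
"""


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]      # drop the leading '>'
        elif line:
            chunks.append(line)    # sequence lines are joined without breaks
    return header, "".join(chunks)


header, sequence = parse_fasta(GLPA_FASTA)

# Residues 20 to 150 (1-based, inclusive) are the mature protein;
# Python slices are 0-based and exclude the end index.
mature = sequence[19:150]

print(header)
print("full length:", len(sequence))
print("mature length:", len(mature))
```

The same slice can be reused wherever the mature sequence is needed as input, for example for ProtParam.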
3.2. Protein Parameter Computation
The ProtParam tool computes various physical and chemical parameters for a given protein.
The computed parameters include molecular weight, theoretical pI, amino acid composition and atomic composition, which are self-explanatory. In addition, the extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY) are calculated. Visit http://web.expasy.org/protparam/, paste the 131-amino-acid mature protein sequence (residues 20 to 150 of GLPA_HUMAN) and click the button to compute the parameters.
The results page is shown next.
The extinction coefficient indicates how much light a protein absorbs (expressed as absorbance, A) at a certain wavelength and is useful during protein purification. The Lambert-Beer law is defined as $A = \log(I_0/I) = \varepsilon c l$, where $I_0$ is the intensity of the incident light, $I$ is the intensity of the transmitted light, $c$ is the concentration of the absorbing protein, $l$ is the path length through the solution (the thickness of the cuvette) and $\varepsilon$ is the molar extinction coefficient, or molar absorbance coefficient, at a particular wavelength for a particular absorbing protein. Therefore, the molar extinction coefficient is defined as $\varepsilon = A / (c l)$. With the concentration in mol/L and the commonly used cuvette of 1 cm path length, the unit of the molar absorbance coefficient is $\mathrm{M^{-1}\,cm^{-1}}$ ($\mathrm{dm^{3}\,mol^{-1}\,cm^{-1}}$).
It has been shown that $\varepsilon_{280}$ of a protein, with its amino acids as chromophores, is determined by the amino acid sequence (Gill, S. C. and von Hippel, P. H., 1989, Calculation of protein extinction coefficients from amino acid sequence data. Analytical Biochemistry, 182, 319–326. Erratum: Analytical Biochemistry, 1990, 189, 283). Each disulphide bond contributes $\varepsilon_{280} = 125$, each Trp (W) contributes $\varepsilon_{280} = 5500$ and each Tyr (Y) contributes $\varepsilon_{280} = 1490$. For the following protein sequence,
KYYGNGVTCGKHSCSVDWGKATTCIINNGAMAWATGGHQGNHKC
we find four Cys residues (two disulphide bonds if all are paired), two tryptophan residues and two tyrosine residues. Therefore,
$\varepsilon_{280} = 2 \times 125 + 2 \times 5500 + 2 \times 1490 = 14230\ \mathrm{M^{-1}\,cm^{-1}}$ ($\mathrm{dm^{3}\,mol^{-1}\,cm^{-1}}$)
can be calculated for this sequence. ProtParam reported two extinction coefficients for this sequence, one assuming that all pairs of Cys residues form cystines and one assuming that all Cys residues are reduced:
From experimental knowledge, it is established that all pairs of Cys residues form disulphide bonds (cystines); therefore, $\varepsilon_{280} = 14230\ \mathrm{M^{-1}\,cm^{-1}}$ can be used during purification of this protein.
For mature Glycophorin A, we find no Cys and no tryptophan, but four tyrosine residues. Therefore, $\varepsilon_{280} = 4 \times 1490 = 5960\ \mathrm{M^{-1}\,cm^{-1}}$ ($\mathrm{dm^{3}\,mol^{-1}\,cm^{-1}}$) can be calculated for this sequence. ProtParam reported the same value.
Sometimes molar absorbance coefficients are large; therefore, the absorbance of a 1% or 0.1% solution is used for expressing the absorbance coefficient. For Glycophorin A, ProtParam reports this as Abs 0.1% (=1 g/l).
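The extinction coefficient arithmetic described above can be reproduced with a few lines of Python. This is a minimal sketch, not ProtParam's own code: the function name extinction_coefficient_280 and its all_cys_paired parameter are assumptions made for this example, and the per-residue values (5500, 1490, 125) are the ones quoted in the text.

```python
# Sketch of the epsilon(280) calculation from Trp, Tyr and cystine counts.
# Function and parameter names are illustrative assumptions.

EPS_TRP = 5500      # contribution of each Trp (W), M^-1 cm^-1
EPS_TYR = 1490      # contribution of each Tyr (Y), M^-1 cm^-1
EPS_CYSTINE = 125   # contribution of each disulphide bond, M^-1 cm^-1


def extinction_coefficient_280(sequence, all_cys_paired=True):
    """Molar extinction coefficient at 280 nm in M^-1 cm^-1."""
    n_trp = sequence.count("W")
    n_tyr = sequence.count("Y")
    # Two Cys residues are needed per disulphide bond; with reduced Cys
    # there is no cystine contribution.
    n_cystine = sequence.count("C") // 2 if all_cys_paired else 0
    return n_trp * EPS_TRP + n_tyr * EPS_TYR + n_cystine * EPS_CYSTINE


example = "KYYGNGVTCGKHSCSVDWGKATTCIINNGAMAWATGGHQGNHKC"

glpa_full = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)
glpa_mature = glpa_full[19:]   # residues 20 to 150

print(extinction_coefficient_280(example))                        # all Cys paired
print(extinction_coefficient_280(example, all_cys_paired=False))  # all Cys reduced
print(extinction_coefficient_280(glpa_mature))
```

The two calls on the example peptide correspond to the two assumptions about Cys residues (all paired as cystines, or all reduced).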
Half-life is the estimated time taken for the amount of a protein in a given cell to fall to one half after its synthesis. ProtParam estimates it for three systems, i.e. mammalian reticulocytes, yeast and E. coli. For Glycophorin A, ProtParam tool reported the following estimated half-life.
The instability index provides an estimate of the stability of a protein in a test tube. For Glycophorin A, ProtParam tool reported the following estimated instability index.
An instability index smaller than 40 indicates a stable protein and a value above 40 indicates the protein to be unstable.
The aliphatic index of a protein is an indicator of the thermostability of globular proteins and is calculated from the relative volume occupied by the aliphatic side chains of alanine, valine, isoleucine and leucine. For Glycophorin A, ProtParam tool reported
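GRAVY (Grand Average of Hydropathy) is the average hydropathicity of a protein sequence, i.e. the sum of the hydropathy values of all its residues divided by the number of residues, using the scale defined by Kyte J. and Doolittle R.F. (1982) J. Mol. Biol. 157:105-132, shown next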
For Glycophorin A, ProtParam tool reported
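GRAVY and the aliphatic index are both simple functions of amino acid composition, so they can be sketched in Python. The block below is an illustration under stated assumptions, not ProtParam's implementation: the function names are hypothetical, the hydropathy values are the Kyte and Doolittle (1982) scale, and the aliphatic index uses the weighting X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)) on mole percentages, which is assumed here to be the formula applied by ProtParam.

```python
# Sketch: GRAVY and aliphatic index from a protein sequence.
# Function names are illustrative; scale values are Kyte-Doolittle (1982).

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Sum of hydropathy values divided by the number of residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def aliphatic_index(sequence):
    """Weighted mole percentages of Ala, Val, Ile and Leu (assumed weights)."""
    n = len(sequence)

    def mole_percent(aa):
        return 100.0 * sequence.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Toy peptide used only to check the arithmetic by hand.
toy = "AVIL"
print(round(gravy(toy), 3))
print(round(aliphatic_index(toy), 1))
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value an overall hydrophilic one.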
3.3. Post-translational modification (PTM) analysis of proteins
Computational prediction of post-translational modifications, including phosphorylation, acetylation, methylation etc., is very useful for biochemical experimental design. Several online servers are available for the prediction of post-translational modifications. A partial list can be reached at the ExPASy server at http://www.expasy.org/proteomics/post-translational_modification.
Some others are listed next.
1. PhosphoSitePlus at www.phosphosite.org
2. PHOSIDA at http://www.phosida.com/
3. Phospho.ELM at http://phospho.elm.eu.org/index.html
4. GPS at http://gps.biocuckoo.org/
3.4. Signal peptide and transmembrane helices prediction
Phobius is a combined signal peptide and transmembrane topology prediction tool and is available online at http://phobius.sbc.su.se/.
Phobius predicted a signal peptide from amino acids 1 to 19, followed by an extracellular (non-cytoplasmic) domain (amino acids 20-91), a transmembrane domain (amino acids 92-114) and finally a cytoplasmic domain inside the red blood cell (amino acids 115-150). The same prediction is presented graphically next.
TopPred, available at http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::toppred, is also used for the prediction of membrane proteins, based on hydrophobicity values averaged over a window of amino acids of a given size. This prediction shows that Glycophorin A is an integral membrane protein.
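The idea of averaging hydrophobicity over a window of residues can be sketched as follows. This is not TopPred's algorithm: the Kyte and Doolittle scale and the window size of 19 are assumptions chosen for illustration, and the function name window_hydropathy is hypothetical. The sketch reports where the most hydrophobic window of Glycophorin A starts.

```python
# Sketch: sliding-window hydropathy profile (Kyte-Doolittle scale assumed;
# TopPred itself may use a different scale and window settings).

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

GLPA = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)


def window_hydropathy(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


WINDOW = 19  # assumed window size, roughly the length of a membrane helix
profile = window_hydropathy(GLPA, WINDOW)

# 1-based start position of the most hydrophobic window.
best_start = max(range(len(profile)), key=profile.__getitem__) + 1
best_end = best_start + WINDOW - 1

print("most hydrophobic window:", best_start, "to", best_end)
print("average hydropathy:", round(profile[best_start - 1], 2))
```

The most hydrophobic window can be compared with the transmembrane domain (amino acids 92-114) predicted by Phobius.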
TMHMM available at http://www.cbs.dtu.dk/services/TMHMM-2.0/ is used to predict transmembrane helices in proteins. For the sequence of Glycophorin A, it predicted two transmembrane helices. The first is the signal peptide and the second is the membrane anchor helix.
3.5. Repeat analysis and visualization in protein sequences
To understand repeat analysis, the user needs a protein sequence that contains repeats. The EF3314 protein of Enterococcus faecalis is known to contain repeats. To download the protein sequence, visit the genome browser at http://microbes.ucsc.edu/ and enter Enterococcus faecalis to select the genome. In the Enterococcus faecalis genome browser, enter EF3314 in the text box. The gene EF3314 will appear, and the display shows that it is encoded on the ‘-’ strand, i.e. the complementary strand. Now click on the EF3314 gene. At the bottom of the ensuing page, click on the hyperlink that leads to the protein sequence. You will reach the page displaying the EF3314 protein in FASTA format. Copy the FASTA format sequence, paste it in NotePad and save it as ‘3314Protein.FA’, selecting all files as the type of file.
RADAR available at http://www.ebi.ac.uk/Tools/pfa/radar/ finds and aligns repeats in protein sequences.
DotPlot is very useful for appreciating sequence features visually. The most common sequence feature is the presence of direct repeats in a sequence. Repeated sequences (or repetitive elements, or repeats) are patterns of residues in nucleic acid (DNA or RNA) or protein sequences that occur in multiple copies throughout the sequence.
The presence of repetitive sequences within a single sequence cannot be appreciated while reading the sequence, but a DotPlot gives a visual presentation of any repeats it contains. For example, the protein sequence ‘THISREPEATISREPEATED’ can be analysed using a DotPlot. To construct a DotPlot, one needs to develop a matrix of columns and rows. The number of rows and columns depends on the number of residues in the sequence. In the present case there are 20 residues; therefore, a table with 21 rows and 21 columns of boxes is drawn. The residue symbols are written along the first row and the first column. The residues are then compared for each of the remaining boxes. In the boxes where the row and column residues are the same, their symbol is written. Any other symbol, such as a star ‘*’ or a dot ‘.’, may also be written. The visual presentation then reveals one main diagonal showing the identity. Since the same sequence has been used as the horizontal and the vertical sequence, the identity diagonal is complete. Lines parallel to the main diagonal in an intrasequence comparison reveal the presence of direct repeat sequences. In the present case, the sequence contains the residues ‘ISREPEAT’, which are directly repeated once; therefore, there is one parallel diagonal on each side of the main identity diagonal. Sometimes palindromic sequences such as ‘RADAR’ may be present. These are visible as perpendiculars cutting the main diagonal or a parallel diagonal. The palindromic sequence ‘EPE’ is present in this case. In DNA, such sequences often represent restriction endonuclease sites.
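  T H I S R E P E A T I S R E P E A T E D
T T - - - - - - - - T - - - - - - - T - -
H - H - - - - - - - - - - - - - - - - - -
I - - I - - - - - - - I - - - - - - - - -
S - - - S - - - - - - - S - - - - - - - -
R - - - - R - - - - - - - R - - - - - - -
E - - - - - E - E - - - - - E - E - - E -
P - - - - - - P - - - - - - - P - - - - -
E - - - - - E - E - - - - - E - E - - E -
A - - - - - - - - A - - - - - - - A - - -
T T - - - - - - - - T - - - - - - - T - -
I - - I - - - - - - - I - - - - - - - - -
S - - - S - - - - - - - S - - - - - - - -
R - - - - R - - - - - - - R - - - - - - -
E - - - - - E - E - - - - - E - E - - E -
P - - - - - - P - - - - - - - P - - - - -
E - - - - - E - E - - - - - E - E - - E -
A - - - - - - - - A - - - - - - - A - - -
T T - - - - - - - - T - - - - - - - T - -
E - - - - - E - E - - - - - E - E - - E -
D - - - - - - - - - - - - - - - - - - - D
(A dash marks a box in which the row and column residues differ.)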
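The construction of the DotPlot matrix described above can be sketched in Python. The function names dot_matrix, print_dot_plot and longest_run_on_diagonal are assumptions made for this example; this is a teaching sketch and not how Dotter or any other DotPlot program works internally. It compares the example sequence with itself and measures the longest unbroken run of matches on a chosen diagonal.

```python
# Sketch: build a dot matrix for a sequence against itself and look for
# diagonals parallel to the main diagonal (direct repeats).
# All function names are illustrative.

def dot_matrix(seq_a, seq_b):
    """matrix[i][j] is True where seq_a[i] and seq_b[j] are the same residue."""
    return [[a == b for b in seq_b] for a in seq_a]


def print_dot_plot(seq_a, seq_b):
    """Print the matrix with residue symbols at matches and '-' elsewhere."""
    matrix = dot_matrix(seq_a, seq_b)
    print("  " + " ".join(seq_b))
    for residue, row in zip(seq_a, matrix):
        cells = [residue if match else "-" for match in row]
        print(residue + " " + " ".join(cells))


def longest_run_on_diagonal(matrix, offset):
    """Longest run of consecutive matches on the diagonal where j - i == offset."""
    best = run = 0
    for i, row in enumerate(matrix):
        j = i + offset
        if 0 <= j < len(row) and row[j]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


sequence = "THISREPEATISREPEATED"
matrix = dot_matrix(sequence, sequence)

print_dot_plot(sequence, sequence)

# Offset 0 is the main identity diagonal; offset 8 is the distance between
# the two copies of ISREPEAT.
print("main diagonal run:", longest_run_on_diagonal(matrix, 0))
print("parallel diagonal run (offset 8):", longest_run_on_diagonal(matrix, 8))
```

Scanning all offsets and keeping those with long runs is one simple way to locate direct repeats without inspecting the plot by eye.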
If the perpendicular diagonals do not cut the main or a parallel diagonal, they represent an inverted repeat. One can also use two different sequences, one as the vertical and the other as the horizontal sequence, to visualise the identity between two sequences.
To construct a DotPlot, the user needs a protein sequence and DotPlot software. Use the protein sequence of the cell wall surface anchor protein saved in the 3314Protein.FA file. Download the DotPlot program named ‘Dotter’. Dotter gives graphical output in which the repeats in the sequence are easy to visualise. To use Dotter, run it at the DOS prompt by typing “dotter 3314Protein.FA 3314Protein.FA” and pressing Enter. The direct repeats are visible as parallel lines, as shown next.
3.6. Searching families, domains and sites in protein sequence
InterProScan Sequence Search, available at http://www.ebi.ac.uk/interpro/search/sequence-search, compares the input sequence with protein sequence signature databases to find domains and sites in the input protein sequence. In addition, it finds the protein family to which the sequence belongs.
The EF0710 protein of Enterococcus faecalis is known to contain domains and sites, and to belong to a protein family. Download the EF0710 protein sequence from the Enterococcus faecalis genome browser at http://microbes.ucsc.edu/, paste the sequence in the input text box of InterProScan, as shown next, and click the button to start the search.
The ensuing results window shows that this protein belongs to a protein family and has domains as well as sites.
3.7. PeptideCutter
PeptideCutter, available at http://web.expasy.org/peptide_cutter/, predicts the sites in an input protein sequence at which enzymes and/or chemicals cleave peptide bonds. The tool allows the user to select the enzymes and chemicals to be used, and offers display options for the predicted cleavage sites and the enzymes or chemicals responsible for them.
Simply paste the Glycophorin A sequence in the input textbox, as shown next, and click the button to perform the cleavage. The results for the selected enzyme(s), trypsin in the present case, will be displayed, as shown next.
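To see what a cleavage prediction involves, the following Python sketch applies a deliberately simplified trypsin rule: cleave after Lys (K) or Arg (R) unless the next residue is Pro (P). This is an assumption made for illustration and is not PeptideCutter's rule set, which may include further exceptions; the function names trypsin_sites and digest are hypothetical.

```python
# Sketch: in-silico trypsin digestion with a simplified rule
# (cleave C-terminal to K or R, but not when followed by P).
# This is not PeptideCutter's own model.

GLPA = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)


def trypsin_sites(sequence):
    """1-based positions of residues after which the peptide bond is cut."""
    sites = []
    # The last residue has no following peptide bond, so it is skipped.
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(sequence):
    """Peptides produced by cutting at every predicted site."""
    peptides = []
    start = 0
    for site in trypsin_sites(sequence):
        peptides.append(sequence[start:site])
        start = site
    peptides.append(sequence[start:])
    return peptides


sites = trypsin_sites(GLPA)
peptides = digest(GLPA)

print("cleavage after positions:", sites)
for peptide in peptides:
    print(len(peptide), peptide)
```

The peptide lengths give a first impression of which fragments would be of a size suitable for mass spectrometry.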
4. Summary
In this module, students:
Understood protein sequence analysis for Biochemistry and Molecular biology experiments
Learnt downloading annotated protein sequences from UniProtKB
Computed various Biochemical parameters for a given protein sequence
Learnt prediction of post-translational modifications
Learnt prediction of signal peptide and transmembrane helices in a given protein sequence
Learnt downloading raw protein sequences using genome browser.
Conducted analysis of a given protein sequence to find repeats using RADAR and visualize the presence of direct repeats using DotPlot
Used InterProScan to find family, domains, repeats and sites in a given protein sequence.
Used PeptideCutter to search peptide bonds cutting enzymes and/ or chemicals for cleaving sites in an input protein sequence