Biochemistry
Biostatistics and Bioinformatics
Nucleic Acid Sequence Analysis
Description of Module
Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 04 Nucleic Acid Sequence Analysis
Dr. Vijaya Khader Dr. MC Varadaraj
1. Objectives
In this module students will learn about:
1. Nucleic acid sequence analysis for computational operon construction in prokaryotic raw DNA sequence with putative ORFs
2. Downloading raw genomic DNA sequence using a genome browser
3. ORF prediction in prokaryotic DNA
4. Promoter prediction
5. Prediction of rho independent transcription termination site
6. Primer designing for amplification of complete operon using PCR
7. in silico PCR
8. Translation of nucleic acid to protein sequence
2. Concept Map
[Concept map: Nucleic Acid Sequence Analysis, linked to Download DNA sequence; ORF Prediction in prokaryotic DNA; Promoter Prediction; Transcription termination site prediction; Primer Design for PCR; in silico PCR; Translation of nucleic acid to protein]
3. Nucleic Acid Sequence Analysis
A gene is defined as any stretch of DNA sequence which, in combination with RNA polymerase and other transcription factor(s), if any, yields one diffusible RNA product. Here, the stretch of DNA sequence includes the actual transcribed DNA sequence and the additional DNA sequence(s) controlling its transcription. These controlling DNA sequence(s) may lie upstream of the transcribed DNA sequence, within it or even downstream of it. Therefore, a gene consists of the actual transcribed DNA sequence and the controlling DNA sequence(s) on the same DNA molecule that regulate its transcription. The nucleotide sequences of complete chromosomes and genomes are known for many organisms. Genes are arranged linearly on chromosomes and can be browsed using a genome browser, such as the one available at https://genome.ucsc.edu/
Any stretch of DNA sequence on a bacterial genome is defined as an Open Reading Frame (ORF) if it has a start codon (ATG/GTG in DNA, i.e. AUG/GUG in mRNA) followed by codons for other amino acids and finally ends with a stop codon (TAA/TAG/TGA in DNA, i.e. UAA/UAG/UGA in mRNA). An ORF is called a protein coding sequence (CDS) if it is transcribed and translated in vivo. Computational CDS discovery in newly sequenced DNA consists of predicting peptide/protein coding ORFs in the new DNA, followed by a search for sequence similarity with already known peptides/proteins in other genomes. If the peptide/protein is found in some other genome, then it is most likely to be an actually expressed peptide/protein, and it is annotated on the genome by the name of this protein in the other genome. However, if the predicted peptide/protein is not found in any other genome, then this computational prediction is annotated on the genome as a hypothetical protein.
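The ORF definition above can be illustrated with a minimal Python sketch. It scans only the forward strand, treats the first in-frame ATG or GTG after a stop codon as the start, and uses a hypothetical `min_codons` cut-off to suppress very short ORFs. It is not the algorithm used by any of the gene prediction programs named in this module.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not a gene finder).
START_CODONS = {"ATG", "GTG"}          # DNA equivalents of AUG/GUG
STOP_CODONS = {"TAA", "TAG", "TGA"}    # DNA equivalents of UAA/UAG/UGA


def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) tuples, 1-based inclusive, for forward-strand ORFs.

    min_codons is a hypothetical cut-off (stop codon not counted) used only to
    discard very short, probably spurious ORFs.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the first start codon seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs


# Toy example: one ORF (ATG AAA AAA AAA TAA) starting at the third base.
print(find_orfs("CCATGAAAAAAAAATAAGG", min_codons=2))
```

A real analysis would also scan the reverse complement and score candidate ORFs statistically, which is what dedicated gene prediction programs do.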
In prokaryotes, the actual transcribed DNA and the additional DNA sequence(s) controlling its expression are collectively called an operon. In prokaryotes, a single transcribed mRNA may carry more than one CDS, and the operon transcribing such an mRNA is called a polycistronic operon. Otherwise it is called a monocistronic operon.
Construction of an operon in prokaryotic DNA requires downloading and saving the DNA sequence encompassing the complete operon as a FASTA file. This downloaded sequence is then analysed by searching for ORFs, predicting the transcription start site, elucidating the promoter sequence, locating the ribosome binding site (Shine-Dalgarno sequence) and predicting the rho-independent transcription stop site, from which the operon is inferred.
The DNA sequence encoding cellular functional components, such as polypeptide or tRNA or rRNA, on genomes and chromosomes can be identified by computer programs. These identified polypeptide or tRNA or rRNA encoding DNA sequences are starting points for construction of complete genes to understand their Biochemistry and Molecular Biology. Therefore, we will use Bioinformatics tools for computational construction of genes in newly sequenced DNA/genomes. Then we will use primer design software for planning PCR experiments to test the correctness of the computationally constructed gene.
Back to Concept Map
3.1. Download DNA sequence using genome browser
First, download the raw DNA sequence to be analysed. In the present example, we will use the 88-amino-acid phosphocarrier protein HPr and the 575-amino-acid protein Enzyme I from Enterococcus faecalis. The downloaded sequence will include the open reading frames (ORFs), the 5’ and 3’ untranslated regions (UTRs) and the promoter sequence, so that the full operon can be constructed. Therefore, visit http://microbes.ucsc.edu/, enter ‘Enterococcus faecalis V583’ and press Enter or click the button, as shown next:
On the UCSC Enterococcus faecalis V583 Genome Browser, enter ‘Phosphocarrier protein Hpr’ and click the button, as shown next.
In the ensuing window, find the gene EF0709 and click on it. You will reach the details page for this gene, which shows the chromosome position of this ORF. Note that this position includes only the nucleotides encoding the protein. Return to the previous page using the back button of the browser. Here, note that the direction of the gene is forward, as indicated by the right-pointing arrows. This shows that the coding strand is the ‘+’ strand and not its reverse complement.
Now, click on the left side, previous item button, marked with 2 left arrows:
Now click on button and at the following page again click on button and you will reach page showing ORF EF0711
Click on and you will reach a page, where position of this EF0711 is shown
Note that this gene starts at position 664651. Therefore, the EF0710 gene must end at or before nucleotide position 664650 on this genome.
Click on the two-left-arrow buttons repeatedly to reach the upstream gene EF0708.
Click on the black box of EF0708 and you will reach the page where the position of EF0708 is shown.
Note that this gene ends at position 662442. Therefore, the nucleotides from 662443 (just after the end of EF0708) to 664650 (just before the start of EF0711) encompass the complete operon for the ORFs EF0709 and EF0710.
Now click the back button of your browser and the gene EF0709 will appear again. Right-click on it to open the context menu and, from the context menu, select the option shown next:
You will reach page showing:
Here, change the chromosome position to chr:662,443-664,650 as shown, add 50 extra bases downstream so that primers can be designed to amplify the complete operon, and click the button.
From the following browser window, copy the FASTA format sequence and save it using Notepad as ‘709_710_DNA.fa’.
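As a quick sanity check on the download, the window arithmetic and the saved file can be verified with a short Python sketch. The function names are illustrative only. The expected length of 2258 bases follows from the coordinates given above (662443 to 664650, plus 50 bases downstream).

```python
def padded_window(start, end, extra_downstream=0):
    """Return (start, end, length) for a 1-based inclusive window on the '+' strand."""
    new_end = end + extra_downstream
    return start, new_end, new_end - start + 1


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


start, end, length = padded_window(662443, 664650, extra_downstream=50)
print(f"chr:{start}-{end} ({length} bp)")   # chr:662443-664700 (2258 bp)

# Usage with the saved file (assumes it is in the working directory):
# with open("709_710_DNA.fa") as handle:
#     for name, seq in read_fasta(handle.read()).items():
#         print(name, len(seq))
```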
Back to Concept Map
3.2. Gene Prediction in prokaryotic DNA
Now check whether this DNA sequence contains any open reading frames (ORFs) along with regulatory elements encoded in the upstream promoter.
Visit GeneMarkS at http://exon.gatech.edu/genemark/genemarks.cgi. Upload the ‘709_710_DNA.fa’ FASTA file and click the button. Then click the link on the following page. The result will appear in the browser window shown next:
This prediction shows two ORFs with no spacer DNA between them. In fact, the two ORFs overlap by one nucleotide: the nucleotide at position 437, which is the last ‘A’ of the stop codon ‘TAA’ of the EF0709 ORF, is also the first nucleotide ‘A’ of the start codon ‘ATG’ of the second ORF. The prediction of two ORFs appears to be correct, because the predicted ORF of 267 nucleotides encodes the 88-amino-acid HPr protein and the predicted ORF of 1728 nucleotides encodes the 575-amino-acid Enzyme I protein (in each case the final codon is the stop codon). However, this needs to be cross-checked by promoter prediction for both ORFs and by prediction of a ribosome binding site for each ORF before the complete operon is constructed.
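The length arithmetic in this paragraph can be checked with a few lines of Python. The start position 171 for the first ORF is the start codon position quoted in the promoter prediction section of this module. The function name is illustrative.

```python
def orf_summary(start, end):
    """Return (length_nt, n_amino_acids) for a 1-based inclusive ORF that includes its stop codon."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return length, length // 3 - 1   # the stop codon encodes no amino acid


hpr = orf_summary(171, 437)            # ORF 1 (EF0709, HPr)
ei_end = 437 + 1728 - 1                # ORF 2 starts at 437 and is 1728 nt long
enzyme_i = orf_summary(437, ei_end)    # ORF 2 (EF0710, Enzyme I)
print(hpr)        # (267, 88)
print(enzyme_i)   # (1728, 575)
print(ei_end)     # 2164, the last base of the second stop codon
```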
Now visit http://linux1.softberry.com/all.htm
BPROM is a bacterial sigma70 promoter recognition program with about 80% accuracy and specificity. It is best used on regions immediately upstream of an ORF start for improved gene and operon prediction in bacteria. We do not have information about the sigma factor for the sequences in question. Therefore, click on FGENESB, which finds operons and genes in bacteria. On the following page, paste the sequence, select Enterococcus faecalis and click the Process button:
You will receive results like this.
Back to Concept Map
3.3. Promoter Prediction
First, check whether this DNA sequence contains a promoter upstream of the predicted gene, i.e. upstream of the start codon ‘ATG’ at position 171 of the DNA sequence. Visit http://www.fruitfly.org/seq_tools/promoter.html, enter the information shown next and click the submit button. The results will appear in the browser window:
A putative purine-rich ribosome-binding site (GAAGGAGA) is located 9 nucleotides upstream of the start codon (ATG) of ORF 1 (EF0709). Similarly, a putative purine-rich ribosome-binding site (GAAGGA) is located 11 nucleotides upstream of the start codon (ATG) of ORF 2 (EF0710).
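The spacing between a ribosome-binding site and the following start codon can be checked with a small Python sketch. The sequence used here is the excerpt of the region upstream of the HPr start codon shown in this module. The function name is illustrative, and the search simply takes the first ATG after the motif, which is an assumption that happens to hold for this excerpt.

```python
def rbs_spacing(dna, rbs, start_codon="ATG"):
    """Return the number of bases between the end of rbs and the next start codon, or None."""
    dna = dna.upper()
    rbs_pos = dna.find(rbs)
    if rbs_pos == -1:
        return None
    after_rbs = rbs_pos + len(rbs)
    start_pos = dna.find(start_codon, after_rbs)
    return None if start_pos == -1 else start_pos - after_rbs


# Excerpt of the region upstream of the HPr start codon (ends with ATG).
upstream = ("GATTTTAGCGTATCAAGAAAGGAAAACCCTAACATAAAAATTTTTTATTTAC"
            "GAAGGAGACCGATTTATATG")
print(rbs_spacing(upstream, "GAAGGAGA"))   # 9
```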
The elements upstream of the HPr ORF are, from left to right, the -35 region, the -10 region, the transcription start site (TSS), the ribosome-binding site (RBS) and the start codon of HPr:
...GATTTTAGCGTATCAAGAAAGGAAAACCCTAACATAAAAATTTTTTATTTACGAAGGAGACCGATTTATATG...
This shows that the promoter for the operon starts at position 102 of the sequence.
Back to Concept Map
3.4. rho independent transcription termination site prediction
One can predict and retrieve rho-independent transcription termination sites in bacterial sequences. To predict a rho-independent transcription termination site, go to http://rna.igmors.u-psud.fr/toolbox/arnold/index.php. Upload the FASTA file, set the options for the forward strand only and click the Run button. The results will appear:
Now, to retrieve the rho-independent transcription termination site, go to the WebGeSTer database at http://pallab.serc.iisc.ernet.in/gester/dbsearch.php and browse terminators by clicking on ‘E’.
Now click on the link in the line for Enterococcus faecalis in the window. Then, on the following page, click the hyperlink on the right side to reach the following page:
In the Start and End text boxes, enter the chromosome positions that were used for downloading the complete operon: enter ‘662443’ in the Start text box and ‘664650’ in the End text box, and click the Submit button.
Now click on the stem-loop button as indicated by the arrow in the figure and you will reach the stem-loop rho-independent transcription termination site for this operon.
In addition, rho-independent transcription termination site (RITs) can also be predicted using other software online. Search Google with “Finding transcription terminator sites in bacterial genomes”.
This completes the construction of the operon for the EF0709-EF0710 genes, which encode the histidine phosphocarrier protein (HPr) and enzyme I (EI) respectively. The nucleotides after the stop codon of the second predicted ORF show a stem-loop sequence (GCACATCGAC and GTCGATGTGC, separated by a four-base loop) followed by a T-rich pyrimidine stretch (TTTTTTATT), which together form the rho-independent transcription termination site:
Coding strand: ...TAATTAAGTGATTATGAAAGCACATCGACAATTGTCGATGTGCTTTTTTATT (beginning with the stop codon TAA)
Nascent mRNA: ....GCAcAUcgAc....gUcgAUgUGCUUUUUUAUU
This shows that the rho-independent transcription termination site ends at base position 2213.
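The stem-loop can be verified by checking that one arm of the stem is the reverse complement of the other. The sketch below does this for the sequence shown above and also recomputes the end position. It assumes that the excerpt begins with the stop codon TAA whose last base is at position 2164, as stated for the second ORF elsewhere in this module. Function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def is_perfect_hairpin(left_arm, right_arm):
    """True if the two arms can pair base-for-base to form a stem."""
    return reverse_complement(left_arm) == right_arm.upper()


tail = "TAATTAAGTGATTATGAAAGCACATCGACAATTGTCGATGTGCTTTTTTATT"
left_arm, loop, right_arm = "GCACATCGAC", "AATT", "GTCGATGTGC"

assert left_arm + loop + right_arm in tail       # the hairpin lies within the excerpt
print(is_perfect_hairpin(left_arm, right_arm))   # True

stop_codon_last_base = 2164                      # assumption: TAA occupies 2162-2164
terminator_end = (stop_codon_last_base - 2) + len(tail) - 1
print(terminator_end)                            # 2213
```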
Back to Concept Map
3.5. Primer Design for PCR
The correctness of the computationally constructed operon/gene can be tested by genetic engineering techniques. If the operon is correct, it can be used in protein engineering to improve the functional performance of the two proteins.
To test it, the complete operon/gene is amplified, cloned and transferred to a recipient strain lacking this operon/gene. If the operon/gene is functional in the recipient strain, then it has been constructed correctly and can be used for planning protein engineering experiments to improve the functional performance of the organism harbouring it.
The complete operon can be amplified by PCR (polymerase chain reaction). PCR is an in vitro method for repeated replication (and therefore amplification) of specific DNA fragments. The specific DNA fragment may be very small compared with the long DNA molecule used as template. This is similar to replicating only one target gene, say gene 1, out of, say, 100 genes present on the DNA molecule. In PCR, therefore, many copies of the target gene are produced. Each copy is called an amplicon of the gene.
In PCR, two oligonucleotide primers, each complementary to one end of the region to be amplified and lying on opposite strands, direct the synthesis of complementary strands towards each other to produce an exact copy of the DNA flanked by the primers. Each cycle consists of three steps carried out at defined temperatures and doubles the number of copies, so the initial copy number of target DNA increases geometrically (approximately m × 2^n). Here 'm' is the initial copy number of target DNA and 'n' is the number of PCR cycles. The steps are: step 1, denaturation at 94-97°C; step 2, annealing at 50-72°C; and step 3, extension of the annealed primers at about 72°C by a thermostable DNA polymerase.
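The geometric increase can be illustrated with a short Python sketch. The `efficiency` parameter is a hypothetical addition, not part of the formula above: a value of 1.0 reproduces the ideal doubling m × 2^n, while lower values model cycles in which not every template is copied.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Approximate copy number after PCR: m * (1 + efficiency) ** n."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, pcr_copies(100, n))
```

With perfect doubling, 10 cycles multiply the starting copy number by 1024, which is why 25-40 cycles are enough to make a single target visible on a gel.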
A typical PCR reaction mixture contains template DNA (containing the specific DNA to be amplified), a primer pair, deoxyribonucleotide triphosphates (dNTPs) and a thermostable DNA polymerase (Taq DNA polymerase), all present in PCR buffer. Template DNA provides the annealing sites where a specific primer anneals. Taq DNA polymerase then adds the building blocks (dNTPs) to the 3' end of the annealed primer as directed by the sequence in the template DNA. Template DNA provides the sequence that is to be amplified. The primer pair determines the specificity of the target DNA fragment to be amplified. Therefore, a specific pair of primers will replicate only the specific target fragment.
During the reaction, DNA is amplified by repeatedly changing the temperature so as to provide optimum temperature conditions at each step. The DNA solution containing the PCR mixture is brought to 95°C to denature double-stranded DNA to single-stranded DNA. The temperature is then lowered to allow annealing of the primers to the single-stranded DNA. The annealing temperature is determined by the melting temperature (Tm) of the primers. The melting temperature is calculated by adding 2°C for each A or T base and 4°C for each C or G base. The annealing temperature is kept 5-10°C below the melting temperature and is optimized to eliminate non-specific amplification, so that only the target DNA is amplified. In the third step of the reaction, Taq DNA polymerase extends the annealed primer at about 72°C. The three steps of denaturation, annealing and extension are repeated for about 25-40 cycles to produce a sufficient amount of amplified fragment to be visualized under UV light after ethidium bromide (EtBr) staining of an agarose gel.
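The rule of thumb above (2°C per A or T, 4°C per G or C) can be sketched in Python as follows. The function names are illustrative. This rule is generally regarded as a rough estimate best suited to short oligonucleotides, so the values obtained for the primer pair reported later in this module are indicative only, and primer design programs use more detailed models.

```python
def melting_temperature(primer):
    """Estimate Tm in degrees C as 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if at + gc != len(primer):
        raise ValueError("primer contains characters other than A, C, G, T")
    return 2 * at + 4 * gc


def annealing_range(primer):
    """Return the (low, high) annealing temperatures, 5-10 degrees C below Tm."""
    tm = melting_temperature(primer)
    return tm - 10, tm - 5


left = "TCTTAGAAACTGAAAGGGTCTTTTT"    # left primer reported in this module
right = "AAAAATACGTGAATCGCAACA"       # right primer reported in this module
print(melting_temperature(left), annealing_range(left))     # 66 (56, 61)
print(melting_temperature(right), annealing_range(right))   # 56 (46, 51)
```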
All components, except primers and template DNA are common to all PCR Assays. Oligonucleotide primers are specific to each PCR assay used.
PCR buffer: The composition of the PCR buffer, particularly the concentration of Mg++, has profound effects on the specificity and yield of amplification. An apparently low Mg++ concentration may result from the presence of EDTA or other chelators in primer or template DNA stocks. Excess Mg++ may result in accumulation of nonspecific products. Therefore, titration of Mg++ is highly desirable. Nowadays, PCR buffer without MgCl2 is supplied along with a separate stock solution of MgCl2 for this purpose. Inclusion of Triton-X-100 and/or gelatin has a stabilizing effect on the enzymes used in PCR and results in better yield. Some recent protocols recommend the use of 10% DMSO to reduce secondary structures in the target DNA.
Deoxynucleotide triphosphates: dNTPs bind Mg++ quantitatively. Therefore, the dNTP concentration in a reaction mixture determines the free Mg++ available for enzyme activity. dNTPs are used at 200 microMolar final concentration, and approximately 50% of the dNTPs are left unused after the PCR amplification cycles. The pH of the dNTP stock solution should be neutral. A number of biotechnology companies supply ready-to-use 100 mM stock solutions.
Enzyme: A thermostable DNA polymerase is the enzyme required for amplification and is available from a number of manufacturers. Taq DNA polymerase, the commonly used enzyme in PCR, has been isolated from Thermus aquaticus and has 5' to 3' polymerase activity.
Template DNA : Template DNA used in PCR varies from pure genomic DNA to crude preparation of cells.
Primers: The optimal concentration of primers in the reaction mixture ranges between 100 and 500 nanoMolar. Higher primer concentrations should be avoided, as they may promote mispriming, resulting in nonspecific amplification. Annealing temperatures should be 5-10°C below Tm. However, annealing at higher temperatures for slightly extended times, especially in the first few cycles, reduces mispriming and helps to increase the specificity of the primer pair used in the assay.
The success of PCR depends basically on the sequence of the primers designed from the DNA sequence of the target fragment. Mistakes here will result in either no amplification or non-specific amplification. There are no set rules, but oligonucleotide primers are generally 18-30 bases long, have similar GC content (>50%), have similar Tm, have minimal secondary structure (i.e. self-complementarity), particularly in the 3' region (to reduce primer-dimer formation), have low complementarity to each other, and are specific to the target DNA with no cross-hybridization to non-target DNA sequences.
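Two of these guidelines, primer length and GC content, are easy to compute. The sketch below reports them for a primer pair. The function names and the report layout are illustrative, and the thresholds are only the 18-30 base range quoted above. Checks for secondary structure and cross-hybridization are left to primer design programs.

```python
def gc_percent(primer):
    """Return the GC content of a primer as a percentage, rounded to one decimal."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return round(100.0 * gc / len(primer), 1)


def primer_report(primer, min_len=18, max_len=30):
    """Summarise length and GC content for one primer."""
    return {
        "length": len(primer),
        "length_ok": min_len <= len(primer) <= max_len,
        "gc_percent": gc_percent(primer),
    }


left = "TCTTAGAAACTGAAAGGGTCTTTTT"    # left primer reported in this module
right = "AAAAATACGTGAATCGCAACA"       # right primer reported in this module
for name, primer in (("left", left), ("right", right)):
    print(name, primer_report(primer))
```

Comparing the two reports side by side shows whether the primers of a pair are similar in length and GC content, which is the property the guidelines ask for.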
https://www.embl.de/pepcore/pepcore_services/cloning/pcr_strategy/primer_design/ provides general rules for designing primer.
Primers are designed with the help of a computer program. A number of programs are available from commercial software suppliers or freely on-line. Visit the Primer3Plus site to design primers to amplify the complete operon containing the promoter, the structural genes and the rho-independent transcription termination site. The promoter starts at position 102 of the sequence and the rho-independent transcription termination site ends at base position 2213. Therefore, the bases from position 102 to position 2213 must be the target to amplify.
Visit Primer3Plus at http://www.bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi and paste 2258 bp sequence in the Main Tab text box.
Target: You can specify Targets so that an acceptable primer pair will flank the target to be amplified using PCR. A Target is the sequence in your template DNA in which you are specifically interested. It might be a repeat site (for example a CA repeat) or a single nucleotide polymorphism (SNP) site. This helps in repeat and SNP analysis, using PCR-based amplification and subsequent nucleotide sequencing to detect the repeats and SNPs in the template DNA from the sample in hand.
In the present case we are interested in amplification of the complete operon. Therefore, select the residues from 101 to 2213 and click the target button. This will target the bases from 101 to 2213 for inclusion in the amplicon. In this way the complete operon can be targeted for amplification for use in cloning.
Now click on the ‘General Settings’ tab and change the concentrations of divalent cations and dNTPs to match the PCR buffer concentrations, as shown:
Now click on the ‘Advanced Settings’ tab, check the ‘Use Product Size Input and ignore Product Size Range’ check box and set the product size Min = 2212, as shown:
Click the button and the results page will appear.
...
The following pair of primers is reported.
Left primer: TCTTAGAAACTGAAAGGGTCTTTTT
Right primer: AAAAATACGTGAATCGCAACA
In addition, 4 more pairs of primers are reported.
Back to Concept Map
Click on the Browse Button and select the file containing EF0709-EF0710 operon stored on your computer.
Now click on the ‘Upload File’ button. The sequence will appear in the text box. We know that the length of the sequence is 2258 bases and that the operon runs from base position 102 (encompassing the -35 site of the promoter) to base position 2164, the stop codon ‘TAA’ of Enzyme I, the second ORF in the operon.
Therefore, we are interested in this region. Select these nucleotides in the text box and click the Target button. A pair of square brackets will appear around the selected sequence.
Now the primers will not fall within this selected sequence but will flank the ends of this selected sequence so that this selected sequence is amplified in PCR, which can then be sequenced to confirm the sequence of DNA in this fragment in the present template DNA.
Excluded Regions: You can exclude some regions from overlapping with primers, so that primers will not fall in these regions. Regions for which the sequence quality is low, or which have low complexity or contain repetitive elements, may be excluded in this way so that primers will not be designed to anneal to them.
In this case, the sequence ‘AAAAAAAAAAAAAAAAAAAAAAAAAA’ has low complexity and would not be used for designing primers, so primers would not fall on this sequence. In the present case, however, click the Clear button so that the complete sequence is the target and no region is excluded.
Since both the ‘Pick Left Primer’ and ‘Pick Right Primer’ check boxes are already checked, simply press the ‘Pick Primers’ button so that a primer pair is designed. If you want to change the general parameters or concentration settings, first change these settings and then press the ‘Pick Primers’ button.
Alternatively visit http://www.genscript.com/cgi-bin/tools/primer_genscript.cgi for designing primers.
Genes with related functions from different organisms show high sequence similarity, implying homology. Therefore, if a template DNA sequence from the target organism is not available, degenerate primers can be designed from homologous sequences. Genefisher 2 is an interactive web-based program for designing degenerate primers. The procedure leads to isolation of genes in a target organism using multiple alignments of similar genes from other organisms. The term "gene fishing" refers to the technique where PCR is used to isolate a postulated but unknown target sequence from a pool of DNA. Visit the Genefisher 2 site at http://bibiserv.techfak.uni-bielefeld.de/genefisher2/.
Back to Concept Map
3.6. in silico PCR
We have designed the following pair of primers for amplification of the operon containing the EF0709-EF0710 genes:
Forward/Left primer: TCTTAGAAACTGAAAGGGTCTTTTT
Reverse/Right primer: AAAAATACGTGAATCGCAACA
Visit http://insilico.ehu.es/PCR/ to check if these designed primers will be able to amplify only the target DNA fragment and not any other stretch of DNA in the same genome of E. faecalis V583.
Select Enterococcus and click button.
Paste the primer sequences in the text boxes and select and click button. The following page will show result.
This result shows that only one DNA fragment, of 2171 bp, will be amplified. Therefore, the designed primer pair is correct. Click on the button to reach the next page.
This page provides tools for translating the amplified DNA into protein and a restriction digest tool for cloning using recombinant DNA technology. Click on the link, and the following page will appear.
Select and click button
You will reach the following page
This window shows six reading frames in the target DNA for translation. Move the mouse over a reading frame and the results window will display the translated protein, as shown next.
Now move the mouse over another reading frame and the results window will display the translated protein, as shown next.
Alternatively, click on the hyperlink to reach the restriction digest page. Select the options and click the button to retrieve the list of restriction enzymes for use in cloning this DNA fragment.
In addition, visit http://microbes.ucsc.edu/cgi-bin/hgTracks?db=enteFaec_V583 to check whether the designed primers will amplify only the target DNA fragment and not any other stretch of DNA in the same genome. From the Tools menu, select in silico PCR to reach the following page.
The ensuing result page will reveal if the designed primers will give specific and/ or non-specific amplification with PCR.
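The idea behind in silico PCR can be sketched in a few lines of Python: locate the forward primer on the '+' strand, locate the reverse complement of the reverse primer downstream of it, and report the length of each fragment between them. This is a simplified illustration that accepts exact matches only and searches one orientation of the template. It is not the algorithm used by the web tools mentioned above, and the template here is a toy sequence rather than the E. faecalis genome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward, reverse):
    """Return amplicon lengths for exact primer matches on the '+' strand of template."""
    template, forward = template.upper(), forward.upper()
    reverse_site = reverse_complement(reverse)   # reverse primer as it reads on the '+' strand
    products = []
    f = template.find(forward)
    while f != -1:
        r = template.find(reverse_site, f + len(forward))
        while r != -1:
            products.append(r + len(reverse_site) - f)
            r = template.find(reverse_site, r + 1)
        f = template.find(forward, f + 1)
    return products


forward = "TCTTAGAAACTGAAAGGGTCTTTTT"
reverse = "AAAAATACGTGAATCGCAACA"
# Toy template: the two primer sites separated by 40 filler bases.
toy_template = "GG" + forward + "ACGT" * 10 + reverse_complement(reverse) + "CC"
print(in_silico_pcr(toy_template, forward, reverse))   # [86]
```

A single entry in the returned list corresponds to the situation described above, in which only one fragment is amplified. More than one entry would indicate non-specific amplification.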
Back to Concept Map
3.7. The genetic codes for nucleic acid sequence translation
EMBOSS Transeq tool at http://www.ebi.ac.uk/Tools/st/emboss_transeq/ translates nucleic acid sequences to their corresponding peptide sequences. It can translate to the three forward and three reverse frames, and output multiple frame translations at once.
The option for choosing a frame and a translation table is available. NCBI explains the translation tables at http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes for assignment of the correct genetic code for each organism. Simply paste the DNA sequence starting with ATG (start codon) and ending with a stop codon, choose frame 1 and the bacterial translation table, and click the submit button.
The results will appear in the following window.
Translate is another tool, available on the ExPASy server at http://web.expasy.org/translate/, for translation of a nucleotide (DNA/RNA) sequence to a protein sequence using a specific mitochondrial or nuclear translation table.
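What these translation tools do for a single reading frame can be sketched in Python. The codon table below is built from the standard codon assignments, which the bacterial translation table shares. Treating an initial GTG or TTG as methionine is a simplification of the bacterial table's alternative start codons and is an assumption of this sketch, as are the function and parameter names.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order for each position; '*' marks a stop codon.
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna, to_stop=True):
    """Translate frame 1 of a DNA (or RNA) sequence into a one-letter protein string."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")   # X for ambiguous codons
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    # Assumption: an initiator GTG or TTG codon is read as methionine in bacteria.
    if protein and dna[:3] in {"GTG", "TTG"}:
        protein[0] = "M"
    return "".join(protein)


print(translate("ATGAAATAA"))                  # MK
print(translate("ATGAAATAA", to_stop=False))   # MK*
```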
Back to Concept Map
4. Summary
In this module we learnt about:
Nucleic acid sequence analysis for computational operon construction in prokaryotic raw DNA sequence with putative ORFs
Downloading raw genomic DNA sequence using a genome browser
ORF prediction in prokaryotic DNA
Promoter prediction
Prediction of rho independent transcription termination site
Primer designing for amplification of complete operon using PCR
in silico PCR
Translation of nucleic acid to protein sequence