Course : PG Pathshala-Biophysics
Paper 14 : Bioinformatics Module 06 : Gene Analysis
Content Writer : Dr. Subhradip Karmakar, AIIMS New Delhi
Gene expression analysis is most simply described as the study of the way genes are transcribed to synthesize functional gene products, either functional RNA species or protein products. The study of gene regulation provides insights into normal cellular processes, such as differentiation, and abnormal or pathological processes. Transcriptome analysis experiments can characterize all transcriptional activity or profile thousands of genes at once to create a global picture of cell function and possible phenotype. Gene expression analysis studies can provide a quantitative snapshot of actively expressed genes and transcripts under various conditions. Further, a detailed transcript analysis can provide a great deal of information regarding multiple transcript variants, alternate promoter usage, exon switching and non-coding RNA.
When and why should we study gene expression?
Researchers may perform gene expression analysis at any one of several different levels at which gene expression is regulated: transcriptional, post-transcriptional, translational, and post-translational protein modification. Transcription, the process of creating a complementary RNA copy of a DNA sequence, can be regulated in a variety of ways.
Transcriptional regulation processes are the most commonly studied and manipulated in typical gene expression analysis experiments. The binding of regulatory proteins to DNA binding sites is the most direct method by which transcription is naturally modulated.
Alternatively, regulatory processes can also interact with the transcriptional machinery of a cell. More recently, the influence of epigenetic regulation, such as the effect of variable DNA methylation on gene expression, has been uncovered as a powerful tool for gene expression profiling. Varying degrees of methylation are known to affect chromatin folding and strongly affect accessibility of genes to active transcription.
Following transcription, eukaryotic RNA is typically spliced to remove noncoding intron sequences, capped at its 5' end, and given a 3' poly(A) tail. At this post-transcriptional level, RNA stability has a significant effect on functional gene expression, that is, the production of functional protein.
Small interfering RNA (siRNA) consists of double-stranded nucleic acid molecules that are participants in the RNA interference pathway, in which the expression of specific genes is modulated (typically by decreasing activity). Precisely how this modulation is accomplished is not yet fully understood. A growing field of gene expression analysis is in the area of microRNAs (miRNAs), short RNA molecules that also act as eukaryotic post-transcriptional regulators and gene silencing agents.
What is gene expression profiling?
Researchers studying gene expression employ a wide variety of molecular biology techniques and experimental methods. Gene expression analysis studies can be broadly divided into four areas: RNA expression, promoter analysis, protein expression, and post-translational modification.
RNA Expression
Northern blotting: steady-state levels of mRNA are directly quantitated by electrophoresis and transfer to a membrane followed by incubation with specific probes. The RNA-probe complexes can be detected using a variety of different chemistries or radionuclide labeling. This relatively laborious technique was the first tool used to measure RNA levels.
DNA microarrays: an array of oligonucleotide probes bound to a chip surface enables gene expression profiling of many genes in response to a condition. Labeled cDNA from a sample is hybridized to complementary probe sequences on the chip, and strongly associated complexes are identified optically. Gene expression profiling is often a first step in a gene expression analysis workflow, investigating changes in the expression profile of a whole system or examining the effects of mutations in biological systems.
Real-Time PCR: steady-state levels of mRNA are quantitated by reverse transcription of the RNA to cDNA followed by quantitative PCR (qPCR) on the cDNA. The amount of each specific target is determined by measuring the increase in fluorescence signal from DNA-binding dyes or probes during successive rounds of enzyme-mediated amplification. This precise, versatile tool is used to investigate mutations (including insertions, deletions, and single-nucleotide polymorphisms (SNPs)), identify DNA modifications (such as methylation), confirm results from northern blotting or microarrays, and conduct gene expression profiling. Expression levels can be measured relative to other genes (relative quantification) or against a standard (absolute quantification). Real-time PCR is the gold standard in nucleic acid quantification because of its accuracy and sensitivity. Real-time PCR can be used to quantitate mRNA or miRNA expression following conversion to cDNA or to quantitate genomic DNA directly to investigate transcriptional activity.
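As an illustration of relative quantification, the sketch below uses the widely used 2^(-ddCt) calculation. This particular method is an assumption; the module does not name one. The Ct values are hypothetical, and the calculation assumes close to 100% amplification efficiency for both the target and the reference gene.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene (treated vs control) by 2^(-ddCt).

    Ct = threshold cycle; a lower Ct means more starting template.
    Assumes both genes amplify with roughly 100% efficiency.
    """
    # Normalize the target gene to the reference gene within each sample
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    # Compare the treated sample with the control sample
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: the target crosses threshold 3 cycles earlier
# in the treated sample, while the reference gene is unchanged.
fold_change = relative_expression(22.0, 18.0, 25.0, 18.0)
print(fold_change)  # 8.0
```

A difference of three cycles corresponds to a 2^3 = 8-fold change, which is why small Ct shifts translate into large expression differences.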
In this module we focus on DNA microarrays as a method to study gene expression:
This module will discuss in detail the methods used to study gene expression, with special focus on microarray-based gene expression profiling and analysis.
1. DNA Microarray
2. Microarray data pre-processing
3. Microarray data processing
5. Perfect match vs mismatch probes
6. Other Normalization
7. MA plots and data expression
8. Advantages and Disadvantages
1. DNA Microarray
DNA Microarray Data Analysis
The analysis of microarray data to produce lists of differentially expressed genes has several steps, which can differ based on the type of data being assayed. However, all data follows the same general pipeline, which involves reading raw data, quality assessing the data, removing bad spots/arrays from further analysis, preprocessing the data and calculating differential expression by statistical analysis.
This list of differentially expressed genes can subsequently be annotated with useful information that explains the various genes’ function, for example, gene ontology. I will now explain in more detail how this data analysis pipeline is followed for the types of data supported by this system.
DNA microarray technology is a powerful means for exploring genomes of organisms. It is an important tool for monitoring and analysing gene expression profiles of thousands of genes simultaneously. The small size and high density of microarrays, together with their compatibility with fluorescent labelling, make microarray technology a widely used technique in the area of molecular genetics. Microarray technology provides an economic, robust, fully automated approach toward examining and measuring temporal and spatial global gene expression changes.
2. Microarray Data Preprocessing
Before any kind of microarray data can be analysed for differential expression several steps must be taken. Raw data must be quality assessed to ensure its integrity. Unprocessed raw data will always be subject to some form of technical variation and thus must be preprocessed to remove as many unwanted sources of variation as is possible, to ensure that results are of the highest attainable level of accuracy. Ideally, the data being assayed should be preprocessed using several different methods, the results of which should be compared to identify which method is of the highest level of suitability. The most appropriate method should then be used to preprocess the raw data before differential expression analysis.
3. Microarray data processing
Because of the design of oligonucleotide chips such as the Affymetrix GeneChip, the steps that need to be taken before differential expression analysis are slightly more elaborate than for cDNA arrays, which we will outline later in the chapter.
3.1 Background Correction
The first step is generally to background correct the intensity reading for each spot. Background fluorescence can arise from many sources, such as non-specific binding of labelled sample to the array surface, processing effects such as deposits left after the wash stage, or optical noise from the scanner. There is always some level of background noise: even if nothing but sterile water is labelled and hybridised to the array, some fluorescence will still be picked up by the scanner. Different algorithms use different methods of background correction. The popular Robust Multi-Array Analysis (RMA) algorithm, for example, models the observed intensity as a convolution of signal and noise distributions.
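The convolution idea can be sketched as follows. The sketch assumes the model usually described for RMA, in which the observed intensity is the sum of an exponentially distributed signal (rate alpha) and normally distributed noise (mean mu, standard deviation sigma), and it returns the expected signal given the observed value. The parameter values are hypothetical and fixed by hand here, whereas RMA estimates them from the data on each array; the function names are illustrative only.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def background_correct(observed, mu, sigma, alpha):
    """Expected signal given the observed intensity.

    Model (assumed): observed = signal + noise, with
    signal ~ Exponential(rate=alpha) and noise ~ Normal(mu, sigma^2).
    Given the observation, the signal follows a normal distribution
    centred at `a` and truncated to positive values; its mean is returned.
    """
    a = observed - mu - sigma * sigma * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)


# Hypothetical raw PM intensities and hand-picked model parameters
intensities = [80.0, 120.0, 500.0, 4000.0]
corrected = [background_correct(o, mu=100.0, sigma=20.0, alpha=0.01)
             for o in intensities]
print(corrected)
```

Unlike simple subtraction of a background estimate, this correction never produces negative values, even for spots whose raw intensity lies below the mean background.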
The next stage is normalization. The purpose of this step is to adjust data for technical variation, as opposed to biological differences between the samples. There will always be slight discrepancies between the hybridisation processes for each array, and these variations tend to lead to scaling differences between the overall fluorescence intensity levels of various arrays. For example, the quantity of RNA in a sample, the amount of time a sample spends hybridising or the volume of a sample can all introduce significant variance. Even subtle physical differences between arrays or between the scanners used to read arrays can have an effect. Put simply, normalization ensures that when comparing expression levels of different arrays we are, as much as is possible, comparing like with like.
Studies have shown that the normalization method used has a significant effect on final differential expression levels, so it is vital to choose an appropriate method.
5. PM Correction
Perfect match (PM) probes on the GeneChip measure both the relative abundance of the corresponding gene and the amount of non-specific binding, which arises when mRNA binds to a probe that is not targeting it. Mismatch (MM) probes are designed to give a measure of the non-specific binding of their corresponding PM probe. It then seems obvious that the MM values should be subtracted from their corresponding PM values as a first step in the analysis process.
In reality, however, this does not work, because generally about 30% of MMs are actually larger than their corresponding PMs. This is because MM probes do not only measure background signal: high volumes of the mRNA targeted intentionally by the PM probes tend to bind to the MMs as well. Many of the most popular preprocessing methods solve this problem by simply ignoring the MM probes altogether, and PM values are corrected for non-specific binding using other methods.
GeneChip arrays use 11 different PM spots to target 11 separate 25-base-long sections of a target gene's mRNA. The final step in preprocessing GeneChip data is to summarize the data from these 11 separate probes into an expression value for the gene in question. There are a number of different ways that this can be achieved, but the end result is always a single expression value for each gene on each chip.
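A minimal sketch of the summarization step is shown below. It simply takes the median of the log2 PM intensities of one probe set, which is an illustrative choice made here and not the method of any particular algorithm (the published methods use more elaborate robust estimators). The 11 intensity values are hypothetical.

```python
import math
import statistics


def summarize_probe_set(pm_intensities):
    """Collapse the PM probes of one gene into a single log2 expression value.

    Uses the median of the log2 intensities so that one or two outlying
    probes do not dominate the result (illustrative choice only).
    """
    log_values = [math.log2(v) for v in pm_intensities]
    return statistics.median(log_values)


# Hypothetical background-corrected intensities for the 11 PM probes of a gene
pm = [256.0, 512.0, 300.0, 280.0, 1024.0, 270.0,
      260.0, 290.0, 310.0, 64.0, 275.0]
expression = summarize_probe_set(pm)
print(round(expression, 3))
```

The two extreme probes (64 and 1024) have no influence on the result here, which is the property that robust summaries are designed to provide.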
5.2 Preprocessing Methods Implemented for Affymetrix GeneChip Array
Having introduced the general pipeline followed to preprocess Affymetrix microarray data, we will outline some of the preprocessing methods implemented by this system and describe their operation as well as justifying their inclusion.
There are a number of popular composite preprocessing algorithms. These algorithms implement the four preprocessing steps outlined above and output background corrected and normalized expression measures for each gene on each array. The preprocessing methods implemented by this system are as follows.
5.3 Microarray Suite 5.0 (MAS5)
MAS5 is an algorithm developed by Affymetrix and is described in their white paper “Statistical Algorithms Description Document” (2002). This algorithm background corrects both PM and MM probes; MMs are then converted into ideal mismatches, whose values are always smaller than their corresponding PM values. Remember that approximately 30% of the time MM values are greater than their PMs. If MM < PM, the MM value is left unchanged; otherwise it is replaced by an adjusted value that lies below the PM. A robust mean over the log2 transformed differences between PMs and the already calculated ideal mismatches is computed. Expression values are normalized by setting the trimmed mean of the original signals of each chip to a prespecified value. Hence, MAS5 data is normalized after summarisation, not before, as in many other algorithms.
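The scaling normalization described above can be sketched as follows. The target value of 500 and the trimming fraction are illustrative assumptions for this example, and the function names are hypothetical; the sketch covers only the final scaling step, not the ideal mismatch or robust mean calculations.

```python
def trimmed_mean(values, trim_fraction):
    """Mean after discarding the lowest and highest fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_chip(signals, target=500.0, trim_fraction=0.02):
    """Rescale one chip so that its trimmed mean equals `target`."""
    factor = target / trimmed_mean(signals, trim_fraction)
    return [s * factor for s in signals]


# Hypothetical summarized signals for one chip, with one very low
# and one very high value that the trimming step ignores
signals = [10.0, 200.0, 220.0, 240.0, 260.0, 280.0,
           300.0, 320.0, 340.0, 9000.0]
scaled = scale_chip(signals, target=500.0, trim_fraction=0.1)
print(trimmed_mean(scaled, 0.1))
```

Because every chip is brought to the same trimmed mean, overall brightness differences between chips are removed, while the ratios between genes within a chip are left unchanged.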
Probe Logarithmic Intensity Error Estimation (PLIER)
PLIER is the current recommended algorithm from Affymetrix. Affymetrix claim that the algorithm improves on MAS5 by introducing higher reproducibility of signal (lower coefficient of variation) without loss of accuracy, higher sensitivity to changes in abundance for targets near background, and dynamic weighting of the most informative probes in a dataset to determine signal. In this system the PLIER algorithm is modified to include quantile normalization, as PLIER does not normalize data by default.
5.4 Robust Multi-Array Analysis (RMA)
RMA is an academic alternative to Affymetrix’s algorithms for converting probe level data to gene expression measures. This method is distinct from Affymetrix’s methods in that it completely ignores the MM probe readings. The inventors of the algorithm claim that the MM probes introduce more noise; while they acknowledge that these probes do provide useful information, they had not, at the time of publication of the method, found a productive way to use it.
The method works by adjusting for background noise on a raw intensity scale, in a way that does not lead to negative background corrected values. The log2 transformed value of each background corrected PM probe is obtained and these values are normalized using quantile normalization, which was developed by Bolstad et al. (2003). Robust multi-array analysis is then carried out on the quantile-normalized data.
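Quantile normalization can be sketched in a few lines. The idea is to force every array to share the same distribution of values: each value is replaced by the mean, across arrays, of the values holding the same rank. The data below are hypothetical, and ties are broken arbitrarily in this sketch, which a production implementation would handle more carefully.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per array)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays)
                 for i in range(n)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays measuring the same three probes
arrays = [[5.0, 2.0, 3.0],
          [4.0, 1.0, 9.0]]
normalized = quantile_normalize(arrays)
print(normalized)  # [[7.0, 1.5, 3.5], [3.5, 1.5, 7.0]]
```

After normalization both arrays contain exactly the same set of values (1.5, 3.5 and 7.0); only the assignment of those values to probes differs, and that assignment preserves each array's original ranking.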
5.5 GeneChip RMA (GCRMA)
GCRMA is largely based on RMA and in fact only differs in the background correction step, where it uses probe sequence information to help estimate the background. This leads to improved accuracy in fold changes but at the expense of marginally lower precision.
Other Methods Implemented
The system can also carry out a pre-processing method in which the user manually builds the algorithm by specifying explicitly which of a selection of available functions should be applied at each of the various stages. The options available to the user are as follows.
The above options can be combined as the user desires to tailor pre-processing to their needs. This route is not recommended for new users.
5.6 Pre-processing of cDNA Data
The general steps followed when pre-processing cDNA data are quite similar to the above. The main differences are that there is no need for PM correction, as there are no MM probes on cDNA arrays, and that there is no summarization stage, as each gene is represented by a single probe.
6. Other Normalization
6.1 Normalizing Within Arrays
There are a number of reasons that this step is performed for cDNA arrays. As noted by Smyth (2003) imbalances between the red (Cy5) and green (Cy3) dyes of cDNA arrays may arise from differences between the labelling efficiencies or scanning properties of the two dyes, complicated perhaps by, for example, the use of different scanners or different settings.
If the imbalance is more complicated than a simple scaling of one channel relative to the other, as it usually is, then the dye bias is a function of intensity and normalization will need to be intensity dependent. The dye-bias will also generally vary with spatial position on the slide. Positions on a slide may differ because of differences between the print-tips on the array printer, variation over the course of the print-run, non-uniformity in the hybridisation, or from artifacts on the surface of the array which affect one colour more than the other.
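Intensity-dependent normalization is commonly carried out by fitting a smooth loess curve to the log ratios as a function of average intensity and subtracting it. As a simplified stand-in for loess (an assumption made for this sketch, not a method described in the module), the example below sorts the spots by average log intensity A, splits them into bins, and subtracts the median log ratio M within each bin. The function name and data are hypothetical, and spatial (print-tip) effects are not addressed.

```python
import statistics


def binned_median_normalize(m_values, a_values, n_bins=4):
    """Remove intensity-dependent dye bias from log ratios (M values).

    Spots are sorted by average log intensity (A), split into bins of
    roughly equal size, and the median M of each bin is subtracted from
    the spots in that bin.
    """
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    normalized = list(m_values)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        bias = statistics.median(m_values[i] for i in members)
        for i in members:
            normalized[i] = m_values[i] - bias
    return normalized


# Hypothetical slide: the log ratio drifts from positive at low
# intensity to negative at high intensity (intensity-dependent bias)
a_values = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m_values = [0.9, 1.1, 0.6, 0.4, 0.1, -0.1, -0.4, -0.6]
normalized = binned_median_normalize(m_values, a_values, n_bins=4)
print([round(x, 2) for x in normalized])
```

A single global scaling factor could not remove the drift in this example, because the bias changes sign across the intensity range; subtracting a separate estimate in each intensity bin does.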
6.2 Normalizing Between Arrays
As outlined above for oligonucleotide microarrays, cDNA arrays often suffer substantial scale differences because of technical variation, which could be down to any number of factors. Performing normalization between arrays will compensate for such effects and thus yield more reliable results.
6.3 Pre-processing Methods Implemented for cDNA Arrays
There are a large number of methods available for pre-processing of dual dye data.
7. MA-plots (Ref http://bioinfo.cipf.es/babelomicstutorial/maplot)
MA-plots are used to study dependences between the log ratio of two variables and the mean values of the same two variables. The log ratios of the two measurements are called M values (from “minus” in the log scale) and are represented in the vertical axis. The mean values of the two measurements are called A values (from “average” in the log scale) and are represented in the horizontal axis.
In microarray data contexts an MA-plot is used to compare two channels of intensity measurements. These two channels can be the red and green channels of one single chip of a two-color platform or the intensity measurements of two different arrays when using a single-channel platform. In a single-channel context at least two arrays are needed to draw a meaningful MA-plot.
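The M and A values defined above can be computed directly from the two intensity channels. The sketch below uses base-2 logarithms, as is usual for microarray data, and hypothetical red and green intensities; plotting M against A then gives the MA-plot.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired intensity measurements.

    M = log2(red) - log2(green)          (log ratio, vertical axis)
    A = (log2(red) + log2(green)) / 2    (mean log intensity, horizontal axis)
    """
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


# Hypothetical intensities for three spots
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
m, a = ma_values(red, green)
print(m)  # [2.0, 0.0, -2.0]
print(a)  # [9.0, 8.0, 7.0]
```

A spot with M = 0 has equal intensity in both channels; M = 2 corresponds to a four-fold higher red signal, and M = -2 to a four-fold higher green signal.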
The National Center for Biotechnology Information (NCBI) launched the Gene Expression Omnibus (GEO) database in 2000 to support the public use and dissemination of gene expression data generated by high-throughput methodologies. Most researchers submit to GEO in accordance with grant or journal requirements stipulating that microarray data be made available through a public repository, in compliance with long-established standards of scientific reporting that allow others to judge or reproduce the results.
8. Applications, Advantages and Disadvantages of Microarray:
A large number of microarray-related studies have been designed either to characterize diseased cells in comparison to healthy cells or to highlight the genes involved in a particular biological pathway or disease such as cancer. Differences between responders and non-responders have also been investigated. Microarrays have traditionally been a method for validation rather than for new discovery, since prior knowledge is essential. However, in recent years, the number of studies utilizing microarrays in some capacity has increased greatly. More and more studies are relying on microarrays to provide insight into observed physiology, essentially using microarrays to further characterize biological systems. In most of these cases, microarray analysis has generated interesting results but has also raised additional questions requiring further investigation, limiting its successful implementation.
Microarrays are a powerful genomics tool, designed to illuminate differences in the expression of genes within cells. Despite being a relatively new technology, the scientific community has quickly adopted its use in a variety of fields including drug development, evolutionary biology, and disease characterization. The strength of the technology rests on several factors, including ease of use, availability of platforms and lower cost relative to other exploratory methods such as Northern blotting or Ribonuclease Protection Assay (RPA), implementation of statistical methods for detailed analysis, and most importantly a global view of gene expression encompassing an entire genome.
Advantages offered by the microarray include speed, specificity and reproducibility. Speed in generating the array is a prime advantage, because spotting the DNA onto the chip requires only that the DNA sequence of interest be known. No time is therefore spent preparing and handling cDNA resources such as bacterial clones, PCR products, or cDNAs, which reduces the likelihood of contamination and mix-up. However, before manufacturing the array, prior knowledge of the genome sequence is required to design the oligonucleotide sets, and when this is not available, alternative methods of printing isolated genetic material may be preferred. Other advantages are high specificity and reproducibility. Both of these attributes are due to the way the oligonucleotide sequences to be printed on the chip are designed and to the use of multiple short sequences representing the unique sequence of each gene. For example, when designing oligonucleotide sequences for a gene, each sequence is designed to be perfectly complementary to a target gene sequence; at the same time an additional partner sequence is designed that is identical except for a single base mismatch in its centre. This sequence mismatch strategy, along with the use of multiple sequences for each gene, increases specificity and helps to identify and minimise the effects of non-specific hybridisation and background signal.
Microarrays only present a frozen snapshot of the transcriptome, which is dynamic and continually evolving and responding to cellular needs and signals. As such, microarrays only illuminate a part of what is going on inside a cell or a population of cells. In addition, the relationship between RNA and protein levels is rather poor: there need not be a tight correlation between the expression of a gene and the amount of translated protein.
Therefore, differentially expressed genes may not translate into varying protein levels with functional implications. Furthermore, the complexity of microarray analysis makes it exceedingly difficult to ascertain meaningful data with real biological significance without clearly defined goals or targets. An intricate aspect of genomic analysis is the interplay between genes or groups of genes (i.e. mechanisms), and that information is not easily deciphered using microarrays. Finally, the functionality of a gene cannot be determined solely using microarrays. Indeed, other methods and experimental tools are needed to decipher the proteome, understand the varying interactions between genes and/or proteins, and develop a more complete picture of cellular behaviour. Microarray experiments are subject to saturation and have a low dynamic range. They can be confounded by polymorphisms and are subject to variation due to differential dye binding even when there are no changes in gene expression.
9. Summary
In this module you have learned how gene expression can be studied at the transcriptional, post-transcriptional, translational and post-translational levels, and how northern blotting, DNA microarrays and real-time PCR are used to measure RNA expression. You have seen the general pipeline for microarray data analysis: reading raw data, quality assessment, pre-processing (background correction, normalization, PM correction and summarization) and statistical analysis of differential expression. You have also learned how composite algorithms such as MAS5, PLIER, RMA and GCRMA pre-process Affymetrix GeneChip data, how cDNA arrays are normalized within and between arrays, and how MA-plots are used to compare two channels of intensity measurements. Finally, the module reviewed the applications, advantages and limitations of microarrays.