Get the data
Update your copy of the class repo. The file you'll work with for this assignment is called counts.txt.gz and resides in the directory class/data.
Open a terminal (shell) within the class/data directory and start R.
Install edgeR using biocLite, just as you did last week when you installed limma.
Start R, load the biocLite.R script, and run biocLite to install edgeR.
Load the data set into R
Use read.delim to read the data into a data frame object in R.
Inspect the data
Each row corresponds to an Arabidopsis genes. The first column is the AGI code, the second column is the gene symbol, and the third column is text describing the gene.
The next columns represent control (C1, C2) and treatment (T1, T2, T3, T4) sample libraries.
The numbers represent numbers of reads that overlap each gene.
Use the colSums command to add up the number of mapped reads per library.
QUESTION Which library has the largest number of reads? Which library has the least number of reads? What is the ratio of largest to smallest?
Use rowSums and table to find out how many genes have small numbers of reads:
QUESTION How many genes have zero counts? How many genes have only 1 count? What about 2 counts? What trend do you observe?
Remove genes with zero counts
If a gene has no counts, it's pointless to test it for differential gene expression.
So remove these using rowSums and subsetting.
QUESTION How many genes are left?
Create a DGEList
edgeR implements a class of object called a DGEList, which we'll use to build our differential gene expression analysis.
Explore the DGEList
The DGEList object contains three instance "variables" representing the data you passed to the function DGEList.
Take a moment to understand what these are.
Print the first few rows of cds$counts, which is the expression matrix with gene ids as row names.
List the samples associated with the expression matrix:
Note that cds$samples has three columns, including lib.size (the number of counts per library) and norm.factors, which are 1 for each library.
The All.Zeros list reports genes with zero counts:
In the next steps, you'll use edgeR functions to add new information to cds.
Add normalization factors for libraries
Use the calcNormFactors function to fill in the norm.factors column of cds$samples
The calcNormFactors calculates normalization factors that will allow comparison of across libraries even though they were sequenced to different depths.
The function will replace the ones in cds$samples$norm.factors with new values. It will also return the cds object.
Most edgeR functions behave this way.
QUESTION What is the relationship between library size and normalization factor?
Ultimately, when you report normalized counts per million per gene, you'll use the cpm function, which will use the normalization factors to calculate the effective library size, divide that number into the counts per gene, and multiple by one million.
To see the effective library size per library, do this:
As you saw from looking at the table of counts, a large number of genes have zero counts, a smaller number have two, a smaller number of 3, and so on.
To model the distribution of counts data, edgeR uses a negative binomial distribution.
After estimated a dispersion parameter, it applies a negative binomial test.
To start, estimate a "common" dispersion parameter for all genes:
To perform statistical testing on the means of two groups (treatment and control), we need to estimate variance within and across groups. This is where the dispersion paramter comes in.
The variance of the negative binomial model is a function of the mean. That is, variance(mean) = mean + mean**2 * D, where D is the dispersion.
For example, suppose a gene has a normalized, mean count of 10. Under the negative binomial, the variance is 10 + 100 * cds$dispersion.
Next, we estimate a gene-by-gene dispersion rate, which will (we hope) improve the estimate of variance and give subsequent statistical tests greater power.
To assess how this has worked, use plotMeanVar to view the relationship between means (x axis) and variance (y axis).
The raw variances of the counts are the gray dots, the variances estimated using the tagwise dispersions are the blue dots, the variances estimated using the common dispersions are the solid blue line, and the variance = mean (poisson variance) is the solid black line.
QUESTION What is the relationship between mean and variance in the plot? As mean increases, what happens to variance? Which dispersion estimates do a better job of predicting variance? Are they about the same, or not?
Find differentially expressed genes
The next step is to identify differentially expressed genes using the exactTest function in edgeR.
You can use either the common or gene-wise dispersion estimates. For comparison, try both:
The objects de.cmn and de.tgw are both instances of class DGEExact, which contains three named fields: table, which gives the log-fold-change, log counts per million, and p value from testing each gene.
Apply a multiple hypothesis testing correction to p values from each DGEExact object.
QUESTION How many genes are called as differentially expressed at alpha level 0.05 by each or both methods?
Save results to a file
Open the file in Excel. What do you observe about the differentially expressed genes?
QUESTION Make an educated guess. What was the treatment?