# Get the data

Update your copy of the class repo. The file you'll work with for this assignment is called counts.txt.gz and resides in the directory class/data.

# Getting Started

Open a terminal (shell) within the class/data directory and start R.

Install edgeR using biocLite, just as you did last week when you installed limma.

Start R, load the biocLite.R script, and run biocLite to install edgeR.

# Load the data set into R

Use read.delim to read the data into a data frame object in R.

# Inspect the data

Each row corresponds to an Arabidopsis genes. The first column is the AGI code, the second column is the gene symbol, and the third column is text describing the gene.

The next columns represent control (C1, C2) and treatment (T1, T2, T3, T4) sample libraries.

The numbers represent numbers of reads that overlap each gene.

Use the colSums command to add up the number of mapped reads per library.

**QUESTION** Which library has the largest number of reads? Which library has the least number of reads? What is the ratio of largest to smallest?

Use rowSums and table to find out how many genes have small numbers of reads:

**QUESTION** How many genes have zero counts? How many genes have only 1 count? What about 2 counts? What trend do you observe?

# Remove genes with zero counts

If a gene has no counts, it's pointless to test it for differential gene expression.

So remove these using rowSums and subsetting.

**QUESTION** How many genes are left?

# Create a DGEList

edgeR implements a class of object called a DGEList, which we'll use to build our differential gene expression analysis.

# Explore the DGEList

The DGEList object contains three instance "variables" representing the data you passed to the function DGEList.

Take a moment to understand what these are.

Print the first few rows of cds$counts, which is the expression matrix with gene ids as row names.

List the samples associated with the expression matrix:

Note that cds$samples has three columns, including lib.size (the number of counts per library) and norm.factors, which are 1 for each library.

The All.Zeros list reports genes with zero counts:

In the next steps, you'll use edgeR functions to add new information to cds.

# Add normalization factors for libraries

Use the calcNormFactors function to fill in the norm.factors column of cds$samples

The calcNormFactors calculates normalization factors that will allow comparison of across libraries even though they were sequenced to different depths.

The function will replace the ones in cds$samples$norm.factors with new values. It will also return the cds object.

Most edgeR functions behave this way.

**QUESTION** What is the relationship between library size and normalization factor?

Ultimately, when you report normalized counts per million per gene, you'll use the cpm function, which will use the normalization factors to calculate the effective library size, divide that number into the counts per gene, and multiple by one million.

To see the effective library size per library, do this:

# Estimate dispersions

As you saw from looking at the table of counts, a large number of genes have zero counts, a smaller number have two, a smaller number of 3, and so on.

To model the distribution of counts data, edgeR uses a negative binomial distribution.

After estimated a dispersion parameter, it applies a negative binomial test.

To start, estimate a "common" dispersion parameter for all genes:

To perform statistical testing on the means of two groups (treatment and control), we need to estimate variance within and across groups. This is where the dispersion paramter comes in.

The variance of the negative binomial model is a function of the mean. That is, variance(mean) = mean + mean**2 * D, where D is the dispersion.

For example, suppose a gene has a normalized, mean count of 10. Under the negative binomial, the variance is 10 + 100 * cds$dispersion.

Next, we estimate a gene-by-gene dispersion rate, which will (we hope) improve the estimate of variance and give subsequent statistical tests greater power.

To assess how this has worked, use plotMeanVar to view the relationship between means (x axis) and variance (y axis).

The raw variances of the counts are the gray dots, the variances estimated using the tagwise dispersions are the blue dots, the variances estimated using the common dispersions are the solid blue line, and the variance = mean (poisson variance) is the solid black line.

**QUESTION** What is the relationship between mean and variance in the plot? As mean increases, what happens to variance? Which dispersion estimates do a better job of predicting variance? Are they about the same, or not?

# Find differentially expressed genes

The next step is to identify differentially expressed genes using the exactTest function in edgeR.

You can use either the common or gene-wise dispersion estimates. For comparison, try both:

# Explore results

The objects de.cmn and de.tgw are both instances of class DGEExact, which contains three named fields: table, which gives the log-fold-change, log counts per million, and p value from testing each gene.

Apply a multiple hypothesis testing correction to p values from each DGEExact object.

**QUESTION** How many genes are called as differentially expressed at alpha level 0.05 by each or both methods?

# Save results to a file

Open the file in Excel. What do you observe about the differentially expressed genes?

**QUESTION** Make an educated guess. What was the treatment?