- Setting up
- Download Short Read Archive run file (.sra)
- Convert the .sra file to fastq.
- Assess data quality using FastQC
- Align sequences onto the reference genome.
- Obtain reference sequence from IGBQuickLoad.org
- Create bowtie indexes
- Generate alignments using tophat
- Analyze and visualize your output
- Open and view your alignment files in IGB
- Summarize alignment results
The ability to work with high-throughput sequencing data sets is an in-demand skills in bioinformatics. Much of what we've done thus far has been designed to prepare you for this week's introduction to HTS data sets.
In this assignment, you'll download an RNA-Seq data set from the Short Read Archive, align the data onto a reference genome sequence, visualize the data in Integrated Genome Browser, and summarize the read alignments.
The following instructions assumes you will do the assignment on an iPlant VM ("base" image). You could do it on another machine if you prefer, but if you run into problems, Dr. Loraine may not be able to help you.
However, unless you've requested additional disk space, you might run out space, depending on what other files you've stored there.
Before you begin, check how much hard drive space is available on your VM.
You can check your space allotment using df.
To find out how much data is stored in a given directory, use du (disk usage).
If you expect to need more disk space than what is available, you can request an EBS (elastic block storage) volume. Dr. Loraine recommends asking for 50 Gb to start.
- For information about EBS, see the Atmosphere User's Guide
To do the following exercises, you'll also need to download and install the following tools in your PATH.
How to install or unpack
Many UNIX tools are distributed as compressed files sometimes called tarballs. To unpack a tarball, use gunzip and tar.
Typically, a tool distribution will consist of a directory with compiled binaries and/or source code. Most will include a README file that describes how to compile the software.
List of tools you'll need and where to get them
- samtools - required by bowtie and tophat
- It's available via the yum installer
- If necessary, you can download and compile source code from http://samtools.sourceforge.net/
- tophat - spliced alignment tool, aligns reads onto the reference genome
- bowtie - required by tophat
- It's also available via the yum installer.
- Also see: http://bowtie-bio.sourceforge.net/index.shtml
- Needed to convert 2bit genome file to fasta format
- Available from http://hgdownload.cse.ucsc.edu/admin/exe/
- fastq-dump - for converting Short Read Archive (.sra) files to fastq
Download Short Read Archive run file (.sra)
The data set you'll work with today comes from rice. SRA organizes its data into directories that are accessible via http or ftp protocols.
In this assignment, you'll work with data from an experiment investigating gene expression in rice root tips. The SRA record describing the experiment lists ten samples.
- Read about the experiment here: http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP007395
Read files (in .sra format) are available from the NCBI ftp site.
- Go to ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/
- Select the folder named Study and click through to find the folder for SRA study SRP007395
Download the .sra file for sample SRX081716, run SRR306316.
Convert the .sra file to fastq.
Use fastq-dump to convert the .sra files to FASTQ format.
This command will create a new file SRR306316.fastq.
QUESTION How many sequences are in the file? Are they all the same length? If yes, how long are they?
Assess data quality using FastQC
FastQC is a popular tool for visualizing and assessing the quality of sequencing data sets.
Watch this video to learn how to use FastQC: http://www.youtube.com/watch?v=bz93ReOv87Y
Use FastQC to assess SRR306316.fastq. Save the results of your analysis onto your Web site.
Align sequences onto the reference genome.
Obtain reference sequence from IGBQuickLoad.org
Previously, you used svn to check out the IGBQuickLoad rice (O_sativa_japonica_Oct_2011) repository. The checked out directory contains a compressed copy of the rice genome sequence: O_sativa_japonica_Oct_2011.2bit.
Convert reference sequence from 2bit to fasta format
Twobit (.2bit) is a format developed by the UCSC Genome Bioinformatics programmer Jim Kent, who also developed blat. However, the alignment tools you'll use don't work with 2bit. You'll have to convert the file to fasta before you can proceed. For this, use twoBitToFa.
Create bowtie indexes
Before you can run the alignment program tophat, use bowtie-build to create several bowtie index files for the reference sequence. The bowtie software uses this to speed up the alignment process. For more information, see the bowtie user's guide.
Bowtie-build may take several minutes to complete. If you are running it on your VM, first launch screen so that if you lose your connection, your bowtie-build command can finish.
Generate alignments using tophat
Now you have all the pieces in place to launch the alignment program tophat. Because it will likely take more than a few minutes to run, use screen to create a new shell that won't die if you lose your connection, or use screen -r to recover a detached screen you're not already using for something else.
Tophat options explained
- -I 5000 specifies a maximum intron size of 5000
- -i 50 specifies a minimum intron size of 50
- -o SRR306316 tells tophat to save its output into directory SRR306316
The final two arguments specify the bowtie index and the fastq file to align.
Analyze and visualize your output
TopHat produces several output files containing alignments as well as junctions predicted from the alignments. We will only work with two of these:
- accepted_hits.bam - contains alignments in binary sam format
- junctions.bed - contains junction features merged from spliced alignments
Rename accepted_hits.bam files
Every run of tophat produces files with the same names. Change into the tophat output directory and rename the files so that it is clear where they came from.
- Rename accepted_hits.bam to SRR306316.bam.
- Rename junctions.bed to SRR306316.bed.
- Remove the first line (the "track" line) of SRR306316.bed by using grep -v and the UNIX redirection operator.
Create BAM index files
To work with the alignments files, you'll use samtools, which allows region-based querying of BAM format files. To create an index, run samtools index.
This command will create a second file SRR306316.bam.bai which contains an index for SRR306316.bam which can be read by samtools.
Open and view your alignment files in IGB
Put the files in a Web-accessible directory on your VM. Create a Web page that contains a link to your BAM file. Do the same for your bed file.
To view the files and compare them to the rice gene models, launch Integrated Genome Browser on your local computer.
- Go to bioviz.org
- Click Download
If you are using a computer that does not have a lot of free memory, download the small memory (1 Gb) option.
When IGB opens, go to the rice genome (October 2011 version) and wait for the rice gene models to load from the IGBQuickload server.
Open your BAM file by click-dragging the link from your new Web page into the IGB interface. Or use the Open URL option under the File menu.
Do the same for the bed file.
Find several genes that have multiple transcript variants and then load the reads for that gene.
Can you use the RNA-Seq data to determine which splice variant is the most abundant?
Why or why not?
Use the IGB Export image command to create a snapshot (png format) of an example gene and add your image to the HTML page using the HTML img tag.
Summarize alignment results
Add a table (use the table tag) to your Web page that summarizes the results. It should list:
- The number of reads in the fastq file
For this, you can use simple UNIX commands you already know.
- The number of alignments tophat made.
Use the samtools view command to count the number of alignments. See samtools documentation for how to do this. Note: Next week's lecture will cover samtool commands.
- The number of reads that aligned to exactly one location in the genome
- Number of genes you examined in IGB and summary of what you found.
- Summary of read alignment results