Skip to end of metadata
Go to start of metadata


The ability to work with high-throughput sequencing data sets is an in-demand skills in bioinformatics. Much of what we've done thus far has been designed to prepare you for this week's introduction to HTS data sets.

In this assignment, you'll download an RNA-Seq data set from the Short Read Archive, align the data onto a reference genome sequence,  visualize the data in Integrated Genome Browser, and summarize the read alignments.

Setting up

The following instructions assumes you will do the assignment on an iPlant VM ("base" image). You could do it on another machine if you prefer, but if you run into problems, Dr. Loraine may not be able to help you.

However, unless you've requested additional disk space, you might run out space, depending on what other files you've stored there.

Before you begin, check how much hard drive space is available on your VM.

You can check your space allotment using df.

To find out how much data is stored in a given directory, use du (disk usage).

If you expect to need more disk space than what is available, you can request an EBS (elastic block storage) volume. Dr. Loraine recommends asking for 50 Gb to start.

Required tools

To do the following exercises, you'll also need to download and install the following tools in your PATH.

How to install or unpack



It may be possible to install some tools using yum install

Many UNIX tools are distributed as compressed files sometimes called tarballs. To unpack a tarball, use gunzip and tar.

Typically, a tool distribution will consist of a directory with compiled binaries and/or source code. Most will include a README file that describes how to compile the software.

List of tools you'll need and where to get them

Use wget to download


In your browser, right-click a download link and select Copy Link Location. Click in your VM terminal, type wget and then paste the URL.


Download Short Read Archive run file (.sra)

The data set you'll work with today comes from rice. SRA organizes its data into directories that are accessible via http or ftp protocols.

In this assignment, you'll work with data from an experiment investigating gene expression in rice root tips. The SRA record describing the experiment lists ten samples.

Read files (in .sra format) are available from the NCBI ftp site.

Download the .sra file for sample SRX081716, run SRR306316.


Use wget to download from URLs onto your iPlant VM.

Convert the .sra file to fastq.

Use fastq-dump to convert the .sra files to FASTQ format.

This command will create a new file SRR306316.fastq.

QUESTION How many sequences are in the file? Are they all the same length? If yes, how long are they?

Save space


Compress the file using gzip to save disk space on your VM. TopHat can read compressed files. Also, delete the .sra file when you're finished with it.

Assess data quality using FastQC

FastQC is a popular tool for visualizing and assessing the quality of sequencing data sets.

Watch this video to learn how to use FastQC:

Use FastQC to assess SRR306316.fastq. Save the results of your analysis onto your Web site.

Align sequences onto the reference genome.

Obtain reference sequence from

Previously, you used svn to check out the IGBQuickLoad rice (O_sativa_japonica_Oct_2011) repository. The checked out directory contains a compressed copy of the rice genome sequence: O_sativa_japonica_Oct_2011.2bit.

Convert reference sequence from 2bit to fasta format

Twobit (.2bit) is a format developed by the UCSC Genome Bioinformatics programmer Jim Kent, who also developed blat. However, the alignment tools you'll use don't work with 2bit. You'll have to convert the file to fasta before you can proceed. For this, use twoBitToFa.

Create bowtie indexes

Before you can run the alignment program tophat, use bowtie-build to create several bowtie index files for the reference sequence. The bowtie software uses this to speed up the alignment process. For more information, see the bowtie user's guide.

Bowtie-build may take several minutes to complete. If you are running it on your VM, first launch screen so that if you lose your connection, your bowtie-build command can finish.

Generate alignments using tophat

Now you have all the pieces in place to launch the alignment program tophat. Because it will likely take more than a few minutes to run, use screen to create a new shell that won't die if you lose your connection, or use screen -r to recover a detached screen you're not already using for something else.

Tophat options explained

  • -I 5000 specifies a maximum intron size of 5000
  • -i 50 specifies a minimum intron size of 50
  • -o SRR306316 tells tophat to save its output into directory SRR306316

The final two arguments specify the bowtie index and the fastq file to align.

Analyze and visualize your output

TopHat produces several output files containing alignments as well as junctions predicted from the alignments. We will only work with two of these:

  • accepted_hits.bam - contains alignments in binary sam format
  • junctions.bed - contains junction features merged from spliced alignments

Rename accepted_hits.bam files

Every run of tophat produces files with the same names. Change into the tophat output directory and rename the files so that it is clear where they came from.

  • Rename accepted_hits.bam to SRR306316.bam.
  • Rename junctions.bed to SRR306316.bed.
  • Remove the first line (the "track" line) of SRR306316.bed by using grep -v and the UNIX redirection operator.

Create BAM index files

To work with the alignments files, you'll use samtools, which allows region-based querying of BAM format files. To create an index, run samtools index.

This command will create a second file SRR306316.bam.bai which contains an index for SRR306316.bam which can be read by samtools.


Always store a BAM file's index in the same directory as the BAM file

Open and view your alignment files in IGB

Put the files in a Web-accessible directory on your VM. Create a Web page that contains a link to your BAM file. Do the same for your bed file.

To view the files and compare them to the rice gene models, launch Integrated Genome Browser on your local computer.

  • Go to
  • Click Download

If you are using a computer that does not have a lot of free memory, download the small memory (1 Gb) option.


If you have problems with the download, perhaps because you are running on a shared computer that won't allow you to install software, you can also run IGB by downloading the compiled source from SourceForge. See the TroubleShooting section of the IGB User's Guide for details.

When IGB opens, go to the rice genome (October 2011 version) and wait for the rice gene models to load from the IGBQuickload server.

Open your BAM file by click-dragging the link from your new Web page into the IGB interface. Or use the Open URL option under the File menu.

Do the same for the bed file.

Find several genes that have multiple transcript variants and then load the reads for that gene.

Can you use the RNA-Seq data to determine which splice variant is the most abundant? 

Why or why not?

Use the IGB Export image command to create a snapshot (png format) of an example gene and add your image to the HTML page using the HTML img tag.


You can also download your files onto your local computer and use File > Open to open them in IGB. However, to share them with Dr. Loraine and your class-mates, you'll need to store them on a server accessible via internet.

Summarize alignment results

Add a table (use the table tag) to your Web page that summarizes the results. It should list:

  • The number of reads in the fastq file

For this, you can use simple UNIX commands you already know.

  • The number of alignments tophat made.

Use the samtools view command to count the number of alignments. See samtools documentation for how to do this. Note: Next week's lecture will cover samtool commands.

  • The number of reads that aligned to exactly one location in the genome

Recognizing uniquely mapping reads


The SAM format NH tag indicates the number of times a given read aligns onto the genome.


  • Number of genes you examined in IGB and summary of what you found.
  • Summary of read alignment results


  • No labels