Skip to end of metadata
Go to start of metadata

Tutorial: Making BAM Files for IGB

Visualizing results from an RNA-Seq experiment in IGB is easy to do, but before you can view the results of your experiment, you have to first convert them to one of the many formats IGB understands. You also have to align the sequences onto your reference genome.

To help you get started, we provide the following tutorial walking you through the process of creating alignments files using BowTie and TopHat, converting their output to BAM/BAI format, and then viewing the data in IGB.

What you'll need to get started:

Resource URL Description
BowTie http://bowtie-bio.sourceforge.net/index.shtml
Short read aligner
TopHat http://tophat.cbcb.umd.edu/
Spliced aligner, which uses output from BowTie
Samtools http://samtools.sourceforge.net Command line tool for creating and manipulating BAM files
Reference Genome http://www.bioviz.org/quickload/A_thaliana_Jun_2009/ nextgen_samples/A_thaliana_Jun_2009.fa.gz
Arabidopsis TAIR9 genome release
Genome
Descriptor
http://www.bioviz.org/quickload/A_thaliana_Jun_2009/
mod_chromInfo.txt
List of reference genome chromosomes and their sizes
Some Sequence
Data
http://www.bioviz.org/quickload/A_thaliana_Jun_2009/ nextgen_samples/s_1_sequence.txt
Illumina sequence data from seedlings; 75 bp read length

Note: The examples presented in this tutorial were executed on a computer running a Unix operating system. However, if you are running these programs under a different type of computer, they will work essentially the same way.

Step One: Align your sequences! Example: BowTie (version 0.12.5) and TopHat (version 1.0.14)

To start, suppose you have a giant FASTQ format file from an Illumina sequencer. If you don't have one handy and want to try it out with a data file from our lab, you are welcome to use a sample file we created and published in our QuickLoad directory of next-generation sequencing samples. The sample file contains the first few hundred thousand lines from a much larger file and contains sequence reads from an Illumina library prepared from Arabidopsis seedlings.

To generate spliced alignments, we will demonstrate how to run BowTie and TopHat from the Salzberg lab, although there are other options as well.

Note to Developers: If you write spliced alignment tools, please let us know.  We will include links to your software here.

Index file. To use BowTie, you'll first need to make a genome index file using the reference genome you want to align the sequences against.  For this, you'll need a single giant "fasta" format file containing your chromosome or contig sequences.  You can get these from GenBank, if the genome is published, or from whomever is building your assembly if your genome project is not yet completed.  We provide this file for Arabidopsis researchers on our QuickLoad site as a convenience to help you get started - see http://www.igbquickload.org/quickload/A_thaliana_Jun_2009/nextgen_samples/A_thaliana_Jun_2009.fa.gz.

Note: If you're working with IGB (or any other genome visualization tool!), we recommend that you create a reference genome sequence file that uses the same chromosome (or contig) names that appear in IGB. 

For example, suppose you are aligning Populus trichocarpa (popular) sequences onto the P_trichocarpa_Jun_2004 version of the poplar genome.

In IGB, poplar chromosome names are LG_I, LG_II, LG_III, etc.  You need to make sure that your genome fasta file contains these same names, as the very first sequence of letters following the ">" (greater-than) symbol heading up each individual sequence record.

For example:

>LG_1 some stuff can go here

atggtacttactnntgag...

>LG_II some other stuff can go here

atggtacttactnntgag...

and so on...

The key is to make sure the first sequence of non-white-space letters after the ">" symbol is the name of the chromosome you want to appear in IGB or other programs. If you would like us to make you a fasta file for use with your genome and don't mind it being publicly available, feel free to contact us for help.

Once you have your genome fasta file ready, you can create a BowTie index like so:

bowtie-build A_thaliana_Jun_2009.fa A_thaliana_Jun_2009

(See BowTie documentation for details.)

The bowtie-build indexer will then create several files with prefix A_thaliana_Jun_2009 in the current working directory. When you run bowtie to make alignments, you'll give it the location of these files, which bowtie will use to align your reads onto the reference genome.

Run TopHat.

TopHat will run BowTie and then use the output to generate spliced alignments for reads that BowTie could not align.

To run TopHat effectively, you should give it some notion of the intron sizes you expect. For example, the largest annotated intron in Arabidopsis comes from gene model AT5G32690 and is 57,631 bases. However, the vast majority of introns (99.5% of them) are smaller than 1200 bases. TopHat uses a default maximum intron size that is much, much larger.

In general, when using a spliced alignment tool, it helps to limit the search space. If you can do that, you can reduce the possibility of getting ridiculous alignments that could not possibly have come from a single spliced transcript. For example, consider a simple short read that looks like this:

atctcagttcagttatt

Imagine your spliced alignment tool finds a genomic location that matches the first ten bases:

atctcagttc

Next, your alignment tool attempts to find a segment of genomic sequence downstream where the rest of the read (agttatt) might align. If you restrict the amount of downstream sequence it can search, based on your expert knowledge of what introns in your species should look like, then you can avoid getting ridiculous alignments where the putative intron is so huge that you would not believe it was real. The larger introns you allow, the more likely it is your alignment tool will find something downstream that matches the rest of the read.

Warning: Many alignment tools were developed using human data, which means they have rarely been thoroughly tested with data from other species. So, if you are working with a plant, fruit fly, nematode, or any other genome with smaller introns (which is most of them), you should be very careful about how you use alignments generated by spliced alignment tools developed by groups who mostly care about human data. Be sure to check them against a set of well-curated gene models and look at them in IGB. If there is something fishy with your alignments, you will notice it almost instantly, because you will see what appear to be massive introns spanning many genes.

When we run TopHat with Arabidopsis data, we usually run it like so:

tophat [fastqfile] --solexa1.3-quals -p 8 -F 0 -o [output_directory] -I 2000

which tells TopHat:

  • that the sequence file (fastqfile) is using the latest Illumina quality scoring scheme (--solexa1.3-quals)
  • to run 8 threads or processes simultaneously (-p 8)
  • to disregard minimum isoform fraction (-F 0) (see TopHat manual for details)
  • to save the output in output_directroy (-o)
  • and don't allow introns larger than 2,000 bases.

Based on looking at lots and lots of genes and RNASeq data, we find that this cutoff works pretty well, for our purposes.

When TopHat runs, it creates several files, including:

  • accepted_hits.sam - the read alignments, including the spliced alignments, in text format (a HUGE file, typically!)
  • coverage.wig - a wiggle format file showing coverage of reads across the genome
  • junctions.bed - all the junctions it predicted, with numbers of reads supporting each junction in the score field (very useful!)
  • TAIR9_GFF3_genes-fixed.juncs - the junctions it found in the GFF file (the name will depend on the GFF file you gave it, if any)

You can open the coverage.wig and junctions.bed files directly in IGB, but to view the alignments (accepted_hits.sam) you need to first convert it to a sorted, indexed BAM file using samtools. To do this, read on:

Step Two: Sort and index your SAM file - to make a BAM file and its index, a BAI file

For this, you'll use the samtools utilities. (See above for download information.)

You'll also need your genome descriptor file - the mod_chromInfo.txt file listed above.

First, convert accepted_hits.sam to a BAM file, like so:

samtools import mod_chromInfo.txt accepted_hits.sam accepted_hits_not_sorted.bam

Second, sort it:

samtools sort accepted_hits_not_sorted.bam accepted_hits

which will create a new BAM file called accepted_hits.bam.

You can now delete accepted_hits_not_sorted.bam, which is not particularly useful.

Third, make an index:

samtools index accepted_hits.bam

Samtools will now index the file, creating a new index file named accepted_hits.bam.bai.

Remember: The index and the BAM file always need to stay together! The "bai" file is useless without the "bam" file, and IGB can't load data from the BAM file without the index.

Tip: Once you've created your BAM and BAI files, delete your enormous SAM file. You don't need it anymore, and if you need to recreate it, you can just re-run TopHat using the same parameter settings.

Tip: If you are working with a lot of junction.bed files, edit the track line that appears at the top of the file. TopHat always writes the same thing at the top of the file, no matter what the inputs, and IGB uses the track line to figure out what track to load the data into. To make sure different junction.bed data files appear in different tracks, edit the track line.

Step Three: View the data in IGB

To view the files in IGB, launch IGB and choose the appropriate species and genome. 

Then, either click-drag the files from your desktop into IGB or use File>Open.

You should then see the file names appearing in your Load Mode table in the Data Access Panel. Next, tell IGB how you want to load the data.

For the "BAM" files, your options will include "Don't Load" (the default) and "Region in View". For the other files (coverage.wig and junctions.bed) the options will also include "Whole Genome" and "Whole Chromosome."

If you choose "Whole Genome," then IGB will read the file right away. (This is because if you've picked "Whole Genome," IGB assumes you want to import all the data for the entire genome at once.)

If you choose the other options, then IGB will wait for you to click the Refresh Data button before loading any data.

Note: If you are working with BAM files, it's a good idea to zoom in on a region before clicking Refresh Data. This will avoid overwhelming IGB, since BAM files are typically very large and contain millions of alignments!

To learn how to zoom and scroll through the data, load and visualize short read alignments, etc., take a look at the IGB videos posted in the User's Guide and on the Bioviz Web site:

http://www.bioviz.org/igb

Step Four: Share the Files

If you'd like to make your new data available for other people to view, the easiest option is to place on a publicly accessible server, such is in a public_html directory in your user account on a system running Unix and a Web server like Apache.

For example, if your user name is mstevens, then you would log onto the Unix machine, create a directory called public_html in your home directory, and then place the files in that directory. After making the directory and the files it contains world-readable (chmod -R a+r public_html), you can then open the files in IGB by choose File>Open URL.. and typing in the URL http://example.com/~mstevens/accepted_hits.bam or something similar. If you allow visitors to view a list of all the files the directory, then they can also click and drag the files directly into the IGB interface.

If your server is using the Apache Web server, you can set up some simple .htaccess files in the directory that will allow visitors to view a list of all the files in the directory, with some description:

Options +Indexes
IndexOptions FancyIndexing IgnoreCase FoldersFirst DescriptionWidth=*
AddDescription "Index for binary alignments file" *.bam.bai
AddDescription "Binary alignments file" *.bam

You can also use this file to add password protection. Consult the Apache project .htaccess tutorial for more information.

Labels:
None
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.