- Read and/or Watch
- Download and set up software on your VM
- Obtain genome (target) and ESTs (queries)
- Align ESTs onto your genome
- Compress and index with bgzip and tabix
- Check the files in Integrated Genome Browser
- Write a shell script run_blat.sh that automates the above process
- Turn in your work
In this assignment, you'll develop a bioinformatics pipeline for annotating a plant genome with ESTs. You'll use blat, a cDNA-to-genome alignment tool, to align a set of ESTs onto a plant genome and process the files for visualization in a genome browser. Then you'll write a UNIX shell script that automates the process.
Read and/or Watch
To start, review Annotating a genome with ESTs - an example from Arabidopsis, which describes how Dr. Loraine used blat to generate an EST alignments data set for Arabidopsis.
You'll perform a similar (but much simpler!) set of steps to annotate the rice genome with three files of ESTs you'll download from PlantGDB, a database developed by Volker Brendel's group and also by Bioinformatics and Genomics faculty member Dr. Shannon Schlueter.
Blat produces a tab-delimited output format called PSL, in which each line of data represents an alignment of a query (EST) against a target (a reference genome.) Google "PSL format" to find out more and also visit the UCSC Genome Bioinformatics FAQ page:
- UCSC Genome Bioinformatics FAQ: PSL format http://genome.ucsc.edu/FAQ/FAQformat.html#format2
Other useful background
- Tabix: fast retrieval of sequence features from generic TAB-delimited files - http://bioinformatics.oxfordjournals.org/content/27/5/718.full
Download and set up software on your VM
For this assignment, you'll need to download and set up a number of programs on your iPlant VM. These include:
UCSC tools pre-compiled binary ("executable") files available from http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
Download the programs onto your VM and place them a directory that is in your PATH. (By now, you should know what PATH means. For a refresher, follow the link.)
Note that some of the programs above can be obtained as "executables," source code that has been compiled into machine-readable (and thus "executable") code for different computer architectures. Your VM is (probably) CentOS or some other Linux variant with 64 bit architecture. To save time, you may wish to find an executable that matches your VM. Or, you could compile the programs yourself.
As an example, here is how Dr. Loraine installed "blat" on her VM:
Tabix tools available from http://samtools.sourceforge.net/. Go to the Downloads page. The programs are not already compiled but you can build them by running make.
Heng Li and colleagues who wrote tabix and bgzip distribute a Makefile with the source code which allows you to compile a version of the program for your computer. To compile the program, cd into the distribution directory and type make to build both bgzip and tabix.
Here is how Dr. Loraine compiled and set up tabix and bgzip on her VM.
Obtain genome (target) and ESTs (queries)
Obtain genome (if you don't already have it)
Previously, you used subversion (svn) to get a copy of the October 2011 rice genome and placed it in your publicly-accessible Web directory so that you could view it your Web browser:
You'll work with this rice genome version for the current assignment. Your job will be to collect rice ESTs from PlantGDB, align them onto the rice genome, and convert the output to the tabix format. You'll check the outputs using a visualization tool called Integrated Genome Browser. Lastly, you'll write a shell script that automates the process.
You'll work with genome file O_sativa_japonica_Oct_2011.2bit.
If you don't already have a copy, use svn to check out the rice genome directory above. (See previous week's assignment Install subversion on your VM.)
Visit the PlantGDB ftp site and identify rice EST fasta files (Oryza sativa indica and Oryza sativa japonica) for your selected genome.
For each new release of the GenBank database, PlantGDB downloads all sequence data from the dbEST and general GenBank collections and then separates the data into individual fasta format files according to the species of origin.
Note that there are three files for rice, one from each cultivar and a third file without the cultivar suffix. This latter file is comprised of records in which the orginal submitter did not specify the cultivar, only the species.
- Use wget to download the fasta files onto your Atmosphere VM
Download the fasta files onto your VM. For this, you can use ftp or wget.
Align ESTs onto your genome
Align each set of ESTs on the rice genome separately using three invocations of blat and with maximum intron size set to 5000 and minimum identity of 95%.
For example, here is how Dr. Loraine aligned the rice indica ESTs onto the rice genome:
Compress and index with bgzip and tabix
The output files blat creates will likely be very large. In this step, you'll sort the files, compress them using bgzip, and then create a tabix index file you can use to retrieve alignments based on where they map in the genome.
Sort files using sort
Use the UNIX sort utility to sort the alignments by chromosome name and then by start position on the chromosome.
The PSL format reports the chromosome in field 14 and the start position of the alignment (relative to genomic) in field 16.
To sort by chromosome and start position, run sort. (See also the examples from the tabix manual.)
Check the files in Integrated Genome Browser
Put both files in your Web directories on your VM or download both onto your desktop. Note that for IGB to open the alignments file, you need to store the index (.tbi) file in the same directory as the data file.
Launch Integrated Genome Browser and use File > Open URL or File > Open to open the alignments file. Zoom in on a region and click Load Data to see the alignments.
Use File > Export Image... to make a picture of ESTs (choose PNG format) aligned on a gene, save the image in a Web-accessible directory on your VM, and email the URL to the class Yahoo group.
Write a shell script run_blat.sh that automates the above process
Write a bash shell script that automates all of the above steps. For full credit, the script should:
- Accept command line arguments: (requires parsing command line arguments)
- argument1 the name of the 2bit file (genome sequence)
- arguments2, 3, etc give names of fasta query files to align
- Checks that the genome and query files exist and are readable before starting (requires if statement, file testing operators)
- Runs each of the above commands in for each query file (requires a looping construct)
- Names each output PSL files after the input file (requires string operations)
- For example, if a query file is named Oryza_sativa_Japonica_Group.mRNA.EST.fasta, then the PSL output file should be named Oryza_sativa_Japonica_Group.mRNA.EST.psl
- Runs blat only if the output file does not exist; notifies the user if the file exists and then continues to the next file
You can assume that the user has already installed blat, tabix, and the other programs in his or her PATH.
Turn in your work
- To turn in the processed EST alignment files, put them in a Web-accessible directory on your VM and email the address to Dr. Loraine. For full credit, include each sorted, indexed PSL output file and the corresponding tabix (.tbi) index file.
- To turn in your shell program, put it a Web-accessible directory on your VM and email the URL to Dr. Loraine, who will then download it on her VM or on another server for testing.
- To turn in your IGB image, place it in a different Web-accessible directory on your VM and post the URL to the class Yahoo group.