- Obtain Arabidopsis thaliana EST from PlantGDB
- Obtain genome sequence 2bit file from IGBQuickLoad.org (for blat)
- Running blat and blat options explained
- Processing blat output
- Check files in IGB
- Converting to PSL for distribution on the public QuickLoad site
There are around 1.5 million Arabidopsis thaliana ESTs available from GenBank, and around one third of these are from a single 454 sequencing project completed several years ago. Although the total number of ESTs is not large in comparison to other genomes, these data are nonetheless useful because they can provide confirmation of gene models as well as a rough estimate of gene expression patterns and alternative splicing.
For more information about ESTs and splicing patterns in Arabidopsis, see:
- English, et al (2010) Prevalence of alternative splicing choices in Arabidopsis thaliana
- Database of alternative splicing, by EST library - ArabiTag Web site
EST projects have been done for many other plant and animal genomes, and depending on the species, aligning the EST sequences onto the reference genome can aid with annotation. As a result, most genome annotation pipelines include an EST alignment step that attempts to identify genes by aligning the ESTs onto the genomic sequence. Because ESTs arise from mRNAs which are often spliced, the alignment tools used need to be able to handle large gaps in the EST part of the alignment. The best of these will also often take into account expected sizes of these gaps, which correspond to introns, as well as expected splice site consensus sequences flanking each intron/gap. Many tools will also attempt to trim polyA sequences from the ends of ESTs, since these are added as part of mRNA processing and are not part of the genomic sequence.
In this document, I'll use Arabidopsis thaliana ESTs to demonstrate how you can use an EST-to-genome alignments to:
- align ESTs onto a reference genome (using blat)
- subdivide the alignments into multi- versus single-mapping sequences
- sort and index the alignments for distribution on a QuickLoad site (I'll use the public IGBQuickLoad site for demonstration)
- add the EST annotations to an annots.xml file for your genome
The blat alignment tool - where to get it
Although there are many excellent EST-to-genome tools available, I'll use the blat alignment tool because I've used it many times in the past and am happy with its performance and overall accuracy. To get a copy of blat, just visit the UCSC Web site: http://hgdownload.cse.ucsc.edu/admin/exe/. This directory contains compiled versions of blat as well as many other useful tools for processing genomic data sets.
Obtain Arabidopsis thaliana EST from PlantGDB
PlantGDB provides a service in which the entire contents of GenBank's nucleotide databases are downloaded and then subdivided into different files, where each file contains the sequences for a given plant species.
As of this writing, the latest GenBank version is 187.
To obtain the Arabidopsis EST data, I navigated through the PlantGDB ftp site to this directory: ftp://ftp.plantgdb.org/download/FASTA_187/EST/, which contains many hundreds of files.
I scrolled down the listing of the page until I found the Arabidopsis thaliana file. I then right-clicked on the file and chose option "copy link location". To download the file, I open a terminal window (on my Mac computer) and enter:
Check the downloaded file by counting records in the file, checking uniqueness
It's always a good idea when working with data files downloaded from the internet to check their quality, their completeness, along with other assumptions our downstream data pipeline may need to make about the data.
For example, I'm expecting that the file will contain around 1.5 million ESTs. To check this, I can use the simple UNIX utility grep, like so:
This works by counting the number of lines that contain the greater-than character (>), the character that signals the start of a new sequence record.
However, is this the correct number? The PlantGDB Web site reported that the version of GenBank they support is the most current, up-to-date version (187) and so this means I ought to be able to query GenBank for the number of Arabidopsis ESTs and get the same result - 1529700.
To check this, I visit the NCBI Web site and search the Taxonomy database for Arabidopsis thaliana. I know from many years experience that the taxonomy database Web page for Arabidopsis (organism id 3702) will display a table reporting the number of different kinds of records available in GenBank for a given species. When I visit the page (see: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702) I observe that the number of ESTs reported is 1,529,700 ESTs.
Since the file contains about the expected number of records, I'm satisfied it is complete, at least so far as the NCBI database infrastructure is concerned.
I'm also expected that no EST record will appear more than once in the file. If this assumption is violated, a few ESTs will appear to map onto multiple locations when in fact they may not. This problem would likely only affect a very small number of genes, which makes it minor problem overall but a major problem if one happens to be interested in those one or two affected genes.
To check this, I use a combination of the unix commands grep, sort, uniq, and wc:
What's happening here is that I'm extracting the fasta header lines from the file, piping the output to sort, which will put all lines that are the same next to each other in the output, and then piping the output of sort to uniq, which will remove duplicate lines provided they appear next to each other in the output, and then pipe the output of that to wc (using the -l option) which counts the number of lines in the output. As expected, the number of unique header lines is 1,529,700, exactly what I would expect if every record were unique in the file.
There are certainly many other checks I could think of to run over the file, but testing for the correct number and testing for uniqueness is usually enough, in my experience.
Satisfied that the data are good enough, I proceed to the next step.
Obtain genome sequence 2bit file from IGBQuickLoad.org (for blat)
I also get a copy of the 2bit file for the Arabidospis thaliana genome, available from the publicly accessible IGB QuickLoad site, and save it in the same directory.
Checking the genome sequence - is it what I expect?
I expect that the genome sequence file should contain five records, one for each Arabidopsis nuclear chromosome, plus two additional records for the chloroplast and mitochondrial genomes.
To check this, I use the twoBitInfo program, which will report the name of the each sequence in the file along with their sizes:
To check that this is correct, I follow the link labeled Genome Sequence from the taxonomy browser page. After a few more clicks, I retrieve the GenBank records for the individual sequences. From there, I confirm the sizes reported by twoBitInfo match the chromosome sizes represented in GenBank. So far so good.
Running blat and blat options explained
Now that I've convinced myself that my data are correct and conform to expecations, I proceed to the next step: aligning the sequences onto the genome using blat:
- -t specifies that the target database is DNA. Because Arabidopsis introns are generally smaller than 2000 bases, I used the -maxIntron option to restrict gap (inferred intron) size in the EST-to-genome alignments. I also directed blat to trim polyA tails from the ESTs using the -trimeHardA option. I specified "pslx" as the output format, which is the same format as PSL but includes EST sequences in the output. I directed blat to not include its usual header using the -noHead flag.
Checking blat output
After the blat run finishes, I check the output. To check the number of alignments produced, I do this:
It looks like at least some of the ESTs aligned multiple times to the genome, even under the relatively stringent percent identity constraints I used. But this is not much of a surprise, since Arabidopsis, as with many plant genomes, contains recent duplications that make mapping ESTs arising from genes within those regions difficult to do.
Processing blat output
Next, I'd need to perform some processing steps to get the data set ready for deployment on the IGB QuickLoad site.
I'd like to do the following steps:
- modify the query names in the blat output so that IGB linkouts will function properly
- separate multi- from single-mapping alignments so that it will be obvious to users when a given EST maps to a duplicated region
- remove lower quality alignments where query coverage is less than 90% (more on this later)
- remove lower quality alignments where percent identity is less than 95%
- process the output files using tabix to support region-based data retrieval form IGB (the Region in View data loading feature)
I'll do the first four steps first, using a script I wrote for this purpose. The script (filterBlat.py) is attached to this document.
Fixing query names - why it's necessary
The fasta headers for the ESTs all look something like this: gi|32885540|gb|CB260767.1|CB260767. When blat reports the query name for this EST, it will use the entire string (gi|32885540|gb|CB260767.1|CB260767).
I would prefer that the query name field just include the GenBank accession, with version prefix, e.g., CB260767.1 in this case.
The reason is that I want IGB users to be able to right-click alignments in the viewer and see options for retrieving the corresponding Web page in GenBank. The only way this will work is if the query name field includes just the accession (id) of the EST and nothing else. That is, IGB knows that data in any track where the track name starts with "NCBI EST" can link out "link out" to GenBank.The link URLs will look like http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=$$, where $$ is the EST id. If the "$$" is replaced with longer text gi|32885540|gb|CB260767.1|CB260767, then the link out won't work.
I run my filtering code, which creates new files with filtered, processed blat output.
The -c option tells the script to only write out alignments with 90% query coverage, meaning: 90% of the query sequence had to align onto the genome. The -i option tells the script to only write out alignments where the percent identity was 95% or greater. The --fix_gb_ids option replaces the query name field with the query's accession and version.
Next, I review the results of the filtering using ls, wc, and xargs:
The file A_thaliana_Jun_2009.EST.passed.pslx contains all the alignments that passed the quality filter. The file A_thaliana_Jun_2009.EST.rejected.c90.0i95.0.pslx contains alignments that didn't pass the quality filter. The file A_thaliana_Jun_2009.EST.sm.pslx contains alignments from A_thaliana_Jun_2009.EST.passed.pslx where the EST mapped exactly once onto the genome, and the file A_thaliana_Jun_2009.EST.mm.pslx contains alignments from A_thaliana_Jun_2009.EST.passed.pslx where the EST in question mapped multiple times.
The final command simply runs the wc -l command on each pslx blat output file. I see from the output of this that the multi-mapping file contains 208,044 high-quality alignments for ESTs that mapped multiple times onto the genome. I'm curious to find out how many ESTs were multi-mappers; to find out, I do this:
The query name appears in field (column) 10 of the blat output; this command isolates that column using cut, pipes it through sort, uniq, and then wc -l, giving the total number of unique EST query names in the output file. Thus, 74,472 ESTs produced 208,044 high-quality alignments in which each EST aligned across at least 90% of its length with percent identity 95% or better. A total of 1,172,185 ESTs produced 1,172,185 alignments under the same quality control criteria.
Converting alignment files to tabix'd random access format to support partial data loading into IGB
I can now open the PSLX files in IGB, but IGB's partial data loading feature won't work unless I convert the PSLX files to a format that supports region based querying. That is, the data need to be in a format that allow IGB to retrieve parts of the file
Sorting using UNIX sort utility
Tabix requires that its input files by sorted by sequence name and start position. In a PSL file, the 14th field corresponds to the target sequence name and the 16th fields corresponds to the start position of the alignment relative to the genomic sequence. To sort, the data, I sorted each file (sm and mm files) like so:
And then I check that the line numbers match up, as before:
Next, I replace the unsorted files with their sorted versions:
and check line counts again:
Compressing each file using bgzip
Then I used bgzip to compress the files:
which creates A_thaliana_Jun_2009.EST.pslx.sm.gz and A_thaliana_Jun_2009.EST.pslx.mm.gz
Making tabix .tbi index file for each file using tabix
Tabix is a program that reads a bgzip (block-compressed) file and creates an index indicating the location of various blocks within the larger file, thus making it possible to retrieve data based on their location in the genomic sequence. IGB uses this feature to allow partial loading of larger data files. For example, if you only want to see the blat alignments on a specific gene, you can navigate to your gene of interest and load just the alignments that overlap the currently displayed region.
To create index files for my compressed blat output files, I do this:
These commands create two new files: A_thaliana_Jun_2009.EST.mm.pslx.gz.tbi and A_thaliana_Jun_2009.EST.sm.pslx.gz.tbi, which are the tabix index files for the two larger data files.
Check files in IGB
Next, I check that IGB can open the files. I transfer them onto my personal computer and open them in IGB in the usual way, using File > Open command in IGB.
Converting to PSL for distribution on the public QuickLoad site
Although the files looked great in IGB, for distribution, I prefer to use PSL (not PSLX) format because the PSL format is a lot more lightweight - it consumes less memory in IGB. This will allow users to view all the alignments on a chromosome. For other genomes with fewer ESTs, I probably won't do this, but for Arabidopsis, I think it makes better sense to distribute the alignments without the EST sequence.
So I uncompressed the files and used cut to extra just the first 21 lines, minus the two final PSLX columns containing the query sequence:
and then compress and index the new PSL format file as before, repeating the same steps for the "mm" file.
Check in files and indexes to IGB QuickLoad data repository
I next used svn to check in the files to the subversion repository https://svn.transvar.org/repos/genomes/trunk/pub/quickload/A_thaliana_Jun_2009:
and entered a log message summarizing how the files were created and where they came from.
Add new files to annots.xml
Next, I added some additional lines to the annots.xml file for the A_thaliana_Jun_2009 genome:
The name attribute gives the name of the physical file on the file system. I've checked into a subdirectory called EST, mainly so that I can use that Web directory as a convenient location for linking and adding documentation. Note that the URL I've provided is the same. The title attribute is what will appear next to the checkbox in IGB when users open the IGBQuickLoad.org data source folder in the Data Access panel. The description attribute will become the tooltip users see when they hover the mouse over the data set.
I then check the new annots.xml file into the repository in the usual way:
Update the public server
Now that everything is checked in, I log onto the IGBQuickload.org site, change into the Web directory where the quickload repository has been checked out, and run
to deploy all the new files.
Using blat, I aligned the majority (81%) of the 1,529,700 Arabidopsis ESTs at 95% identity and 90% query coverage.
The table below summarizes the results:
The image below shows how the EST alignments look in IGB: