Introduction

There are around 1.5 million Arabidopsis thaliana ESTs available from GenBank, and around one third of these are from a single 454 sequencing project completed several years ago. Although the total number of ESTs is not large in comparison to other genomes, these data are nonetheless useful because they can provide confirmation of gene models as well as a rough estimate of gene expression patterns and alternative splicing.

For more information about ESTs and splicing patterns in Arabidopsis, see:

English, et al (2010) Prevalence of alternative splicing choices in Arabidopsis thaliana
Database of alternative splicing, by EST library - ArabiTag Web site

EST projects have been done for many other plant and animal genomes, and depending on the species, aligning the EST sequences onto the reference genome can aid with annotation. As a result, most genome annotation pipelines include an EST alignment step that attempts to identify genes by aligning the ESTs onto the genomic sequence. Because ESTs arise from mRNAs which are often spliced, the alignment tools used need to be able to handle large gaps in the EST part of the alignment. The best of these will also often take into account expected sizes of these gaps, which correspond to introns, as well as expected splice site consensus sequences flanking each intron/gap. Many tools will also attempt to trim polyA sequences from the ends of ESTs, since these are added as part of mRNA processing and are not part of the genomic sequence.

In this document, I'll use Arabidopsis thaliana ESTs to demonstrate how you can use an EST-to-genome alignments to:

align ESTs onto a reference genome (using blat)
subdivide the alignments into multi- versus single-mapping sequences
sort and index the alignments for distribution on a QuickLoad site (I'll use the public IGBQuickLoad site for demonstration)
add the EST annotations to an annots.xml file for your genome

The blat alignment tool - where to get it

Although there are many excellent EST-to-genome tools available, I'll use the blat alignment tool because I've used it many times in the past and am happy with its performance and overall accuracy. To get a copy of blat, just visit the UCSC Web site: http://hgdownload.cse.ucsc.edu/admin/exe/. This directory contains compiled versions of blat as well as many other useful tools for processing genomic data sets.

Methods

Obtain Arabidopsis thaliana EST from PlantGDB

PlantGDB provides a service in which the entire contents of GenBank's nucleotide databases are downloaded and then subdivided into different files, where each file contains the sequences for a given plant species.

As of this writing, the latest GenBank version is 187.

To obtain the Arabidopsis EST data, I navigated through the PlantGDB ftp site to this directory: ftp://ftp.plantgdb.org/download/FASTA_187/EST/, which contains many hundreds of files.

PlantGBD stores sequences in Web folders named for the GenBank version and uses basically the same file name formats for each release. This makes it relatively easy to write scripts that download the data when new releases become available.

I scrolled down the listing of the page until I found the Arabidopsis thaliana file. I then right-clicked on the file and chose option "copy link location". To download the file, I open a terminal window (on my Mac computer) and enter:

$ wget ftp://ftp.plantgdb.org/download/FASTA_187/EST/Arabidopsis_thaliana.mRNA.EST.fasta

Check the downloaded file by counting records in the file, checking uniqueness

It's always a good idea when working with data files downloaded from the internet to check their quality, their completeness, along with other assumptions our downstream data pipeline may need to make about the data.

For example, I'm expecting that the file will contain around 1.5 million ESTs. To check this, I can use the simple UNIX utility grep, like so:

$ grep -c ">" Arabidopsis_thaliana.mRNA.EST.fasta
1529700

This works by counting the number of lines that contain the greater-than character (>), the character that signals the start of a new sequence record.

However, is this the correct number? The PlantGDB Web site reported that the version of GenBank they support is the most current, up-to-date version (187) and so this means I ought to be able to query GenBank for the number of Arabidopsis ESTs and get the same result - 1529700.

To check this, I visit the NCBI Web site and search the Taxonomy database for Arabidopsis thaliana. I know from many years experience that the taxonomy database Web page for Arabidopsis (organism id 3702) will display a table reporting the number of different kinds of records available in GenBank for a given species. When I visit the page (see: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702) I observe that the number of ESTs reported is 1,529,700 ESTs.

Since the file contains about the expected number of records, I'm satisfied it is complete, at least so far as the NCBI database infrastructure is concerned.

I'm also expected that no EST record will appear more than once in the file. If this assumption is violated, a few ESTs will appear to map onto multiple locations when in fact they may not. This problem would likely only affect a very small number of genes, which makes it minor problem overall but a major problem if one happens to be interested in those one or two affected genes.

To check this, I use a combination of the unix commands grep, sort, uniq, and wc:

$ grep ">" Arabidopsis_thaliana.mRNA.EST.fasta | sort | uniq | wc -l
1529700

What's happening here is that I'm extracting the fasta header lines from the file, piping the output to sort, which will put all lines that are the same next to each other in the output, and then piping the output of sort to uniq, which will remove duplicate lines provided they appear next to each other in the output, and then pipe the output of that to wc (using the -l option) which counts the number of lines in the output. As expected, the number of unique header lines is 1,529,700, exactly what I would expect if every record were unique in the file.

There are certainly many other checks I could think of to run over the file, but testing for the correct number and testing for uniqueness is usually enough, in my experience.

Satisfied that the data are good enough, I proceed to the next step.

Obtain genome sequence 2bit file from IGBQuickLoad.org (for blat)

I also get a copy of the 2bit file for the Arabidospis thaliana genome, available from the publicly accessible IGB QuickLoad site, and save it in the same directory.

$ wget http://www.igbquickload.org/quickload/A_thaliana_Jun_2009/A_thaliana_Jun_2009.2bit

You can convert the 2bit file back to fasta using twoBitToFa, also from UCSC Genome Bioinformatics.

All genome version folders supported in IGB QuickLoad include a 2bit genome sequence file, which you can use to run blat or retrieve the genomic sequence in fasta format.

Checking the genome sequence - is it what I expect?

I expect that the genome sequence file should contain five records, one for each Arabidopsis nuclear chromosome, plus two additional records for the chloroplast and mitochondrial genomes.

To check this, I use the twoBitInfo program, which will report the name of the each sequence in the file along with their sizes:

$ twoBitInfo A_thaliana_Jun_2009.2bit output.txt
$ cat output.txt
chr1    30427671
chr2    19698289
chr3    23459830
chr4    18585056
chr5    26975502
chrC    154478
chrM    366924

To check that this is correct, I follow the link labeled Genome Sequence from the taxonomy browser page. After a few more clicks, I retrieve the GenBank records for the individual sequences. From there, I confirm the sizes reported by twoBitInfo match the chromosome sizes represented in GenBank. So far so good.

Running blat and blat options explained

Now that I've convinced myself that my data are correct and conform to expecations, I proceed to the next step: aligning the sequences onto the genome using blat:

$ blat A_thaliana_Jun_2009.2bit Arabidopsis_thaliana.mRNA.EST.fasta 
    -t=dna -maxIntron=2000 -trimHardA -minIdentity=95 -out=pslx 
    -noHead A_thaliana_Jun_2009.EST.pslx

Blat options

-t specifies that the target database is DNA. Because Arabidopsis introns are generally smaller than 2000 bases, I used the -maxIntron option to restrict gap (inferred intron) size in the EST-to-genome alignments. I also directed blat to trim polyA tails from the ESTs using the -trimeHardA option. I specified "pslx" as the output format, which is the same format as PSL but includes EST sequences in the output. I directed blat to not include its usual header using the -noHead flag.

I run the command on a server available to me through UNC Charlotte, but I could have also run it on a virtual machine from iPlant Atmosphere, Galaxy, or any number of other free resources.

Checking blat output

After the blat run finishes, I check the output. To check the number of alignments produced, I do this:

$ wc -l A_thaliana_Jun_2009.EST.pslx
1680435 A_thaliana_Jun_2009.EST.pslx

It looks like at least some of the ESTs aligned multiple times to the genome, even under the relatively stringent percent identity constraints I used. But this is not much of a surprise, since Arabidopsis, as with many plant genomes, contains recent duplications that make mapping ESTs arising from genes within those regions difficult to do.

Processing blat output

Next, I'd need to perform some processing steps to get the data set ready for deployment on the IGB QuickLoad site.

I'd like to do the following steps:

modify the query names in the blat output so that IGB linkouts will function properly
separate multi- from single-mapping alignments so that it will be obvious to users when a given EST maps to a duplicated region
remove lower quality alignments where query coverage is less than 90% (more on this later)
remove lower quality alignments where percent identity is less than 95%
process the output files using tabix to support region-based data retrieval form IGB (the Region in View data loading feature)

I'll do the first four steps first, using a script I wrote for this purpose. The script (filterBlat.py) is attached to this document.

Note that I ran blat with minIdentity parameter set to 95%, so the percent identity filtering should not be necessary. I should probably test my assumptions about how blat works by adding some code to my script that reports alignments that fail to pass the percent identity filter. But for now, I'll skip this step.

Fixing query names - why it's necessary

The fasta headers for the ESTs all look something like this: gi|32885540|gb|CB260767.1|CB260767. When blat reports the query name for this EST, it will use the entire string (gi|32885540|gb|CB260767.1|CB260767).

I would prefer that the query name field just include the GenBank accession, with version prefix, e.g., CB260767.1 in this case.

The reason is that I want IGB users to be able to right-click alignments in the viewer and see options for retrieving the corresponding Web page in GenBank. The only way this will work is if the query name field includes just the accession (id) of the EST and nothing else. That is, IGB knows that data in any track where the track name starts with "NCBI EST" can link out "link out" to GenBank.The link URLs will look like http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=$$, where $$ is the EST id. If the "$$" is replaced with longer text gi|32885540|gb|CB260767.1|CB260767, then the link out won't work.

Linkout patterns are specified in IGB in the IGB resources file igb_default_prefs.xml, which is part of the IGB source code distribution. It is also packaged with every IGB release.

Running filterBlat.py

I run my filtering code, which creates new files with filtered, processed blat output.

The -c option tells the script to only write out alignments with 90% query coverage, meaning: 90% of the query sequence had to align onto the genome. The -i option tells the script to only write out alignments where the percent identity was 95% or greater. The --fix_gb_ids option replaces the query name field with the query's accession and version.

Next, I review the results of the filtering using ls, wc, and xargs:

$ filterBlat.py -c .9 -i .95 --fix_gb_ids A_thaliana_Jun_2009.EST.pslx
$ ls
A_thaliana_Jun_2009.2bit
A_thaliana_Jun_2009.EST.mm.pslx
A_thaliana_Jun_2009.EST.passed.pslx
A_thaliana_Jun_2009.EST.pslx
A_thaliana_Jun_2009.EST.rejected.c90.0i95.0.pslx
A_thaliana_Jun_2009.EST.sm.pslx
Arabidopsis_thaliana.mRNA.EST.fasta
filterBlat.py
$ ls *.pslx | xargs wc -l
    208044 A_thaliana_Jun_2009.EST.mm.pslx
   1380229 A_thaliana_Jun_2009.EST.passed.pslx
   1680435 A_thaliana_Jun_2009.EST.pslx
    300206 A_thaliana_Jun_2009.EST.rejected.c90.0i95.0.pslx
   1172185 A_thaliana_Jun_2009.EST.sm.pslx
   4741099 total

The file A_thaliana_Jun_2009.EST.passed.pslx contains all the alignments that passed the quality filter. The file A_thaliana_Jun_2009.EST.rejected.c90.0i95.0.pslx contains alignments that didn't pass the quality filter. The file A_thaliana_Jun_2009.EST.sm.pslx contains alignments from A_thaliana_Jun_2009.EST.passed.pslx where the EST mapped exactly once onto the genome, and the file A_thaliana_Jun_2009.EST.mm.pslx contains alignments from A_thaliana_Jun_2009.EST.passed.pslx where the EST in question mapped multiple times.

The final command simply runs the wc -l command on each pslx blat output file. I see from the output of this that the multi-mapping file contains 208,044 high-quality alignments for ESTs that mapped multiple times onto the genome. I'm curious to find out how many ESTs were multi-mappers; to find out, I do this:

$ cut -f 10 A_thaliana_Jun_2009.EST.mm.pslx | sort | uniq | wc -l
74472
$ cut -f 10 A_thaliana_Jun_2009.EST.sm.pslx | sort | uniq | wc -l
1172185

The query name appears in field (column) 10 of the blat output; this command isolates that column using cut, pipes it through sort, uniq, and then wc -l, giving the total number of unique EST query names in the output file. Thus, 74,472 ESTs produced 208,044 high-quality alignments in which each EST aligned across at least 90% of its length with percent identity 95% or better. A total of 1,172,185 ESTs produced 1,172,185 alignments under the same quality control criteria.

This assumes of course that my filterBlat.py code is correct. I have not demonstrated that it works correctly, at least not in this document. And so you should regard my claim about the multi-mapping ESTs with some skepticism until I make the effort to prove my results are reliable.

In general, it is good practice to be skeptical of all claims from bioinformatics analysis unless the person making the claims has shown (or tried to show) that their code is correct. Sadly, very few people bother do this, leading to publication of many wrong analyses.

Converting alignment files to tabix'd random access format to support partial data loading into IGB

I can now open the PSLX files in IGB, but IGB's partial data loading feature won't work unless I convert the PSLX files to a format that supports region based querying. That is, the data need to be in a format that allow IGB to retrieve parts of the file

Sorting using UNIX sort utility

Tabix requires that its input files by sorted by sequence name and start position. In a PSL file, the 14th field corresponds to the target sequence name and the 16th fields corresponds to the start position of the alignment relative to the genomic sequence. To sort, the data, I sorted each file (sm and mm files) like so:

$ sort -k14,14 -k16,16n A_thaliana_Jun_2009.EST.sm.pslx > sorted.sm.pslx
$ sort -k14,14 -k16,16n A_thaliana_Jun_2009.EST.mm.pslx > sorted.mm.pslx

And then I check that the line numbers match up, as before:

$ ls *.pslx | xargs wc -l
    208044 A_thaliana_Jun_2009.EST.mm.pslx
   1380229 A_thaliana_Jun_2009.EST.passed.pslx
   1680435 A_thaliana_Jun_2009.EST.pslx
    300206 A_thaliana_Jun_2009.EST.rejected.c90.0i95.0.pslx
   1172185 A_thaliana_Jun_2009.EST.sm.pslx
    208044 sorted.mm.pslx
   1172185 sorted.sm.pslx
   6121328 total

Next, I replace the unsorted files with their sorted versions:

$ mv sorted.mm.pslx A_thaliana_Jun_2009.EST.mm.pslx
$ mv sorted.sm.pslx A_thaliana_Jun_2009.EST.sm.pslx

and check line counts again:

$ ls *.pslx | xargs wc -l
    208044 A_thaliana_Jun_2009.EST.mm.pslx
   1380229 A_thaliana_Jun_2009.EST.passed.pslx
   1680435 A_thaliana_Jun_2009.EST.pslx
    300206 A_thaliana_Jun_2009.EST.rejected.c90.0i95.0.pslx
   1172185 A_thaliana_Jun_2009.EST.sm.pslx
   4741099 total

Compressing each file using bgzip

Then I used bgzip to compress the files:

$ bgzip A_thaliana_Jun_2009.EST.sm.pslx
$ bgzip A_thaliana_Jun_2009.EST.mm.pslx

which creates A_thaliana_Jun_2009.EST.pslx.sm.gz and A_thaliana_Jun_2009.EST.pslx.mm.gz

Making tabix .tbi index file for each file using tabix

Tabix is a program that reads a bgzip (block-compressed) file and creates an index indicating the location of various blocks within the larger file, thus making it possible to retrieve data based on their location in the genomic sequence. IGB uses this feature to allow partial loading of larger data files. For example, if you only want to see the blat alignments on a specific gene, you can navigate to your gene of interest and load just the alignments that overlap the currently displayed region.

To create index files for my compressed blat output files, I do this:

$ tabix -s 14 -b 16 -0 A_thaliana_Jun_2009.EST.sm.pslx.gz
$ tabix -s 14 -b 16 -0 A_thaliana_Jun_2009.EST.mm.pslx.gz

These commands create two new files: A_thaliana_Jun_2009.EST.mm.pslx.gz.tbi and A_thaliana_Jun_2009.EST.sm.pslx.gz.tbi, which are the tabix index files for the two larger data files.

Check files in IGB

Next, I check that IGB can open the files. I transfer them onto my personal computer and open them in IGB in the usual way, using File > Open command in IGB.

IGB requires that the bgzip'd data file and its .tbi (index) file reside in the same folder. Otherwise, IGB won't be able to access the data.

Converting to PSL for distribution on the public QuickLoad site

Although the files looked great in IGB, for distribution, I prefer to use PSL (not PSLX) format because the PSL format is a lot more lightweight - it consumes less memory in IGB. This will allow users to view all the alignments on a chromosome. For other genomes with fewer ESTs, I probably won't do this, but for Arabidopsis, I think it makes better sense to distribute the alignments without the EST sequence.

So I uncompressed the files and used cut to extra just the first 21 lines, minus the two final PSLX columns containing the query sequence:

$ bgzip -c -d A_thaliana_Jun_2009.EST.sm.pslx | cut -f 1-21 > A_thaliana_Jun_2009.EST.sm.psl

and then compress and index the new PSL format file as before, repeating the same steps for the "mm" file.

Check in files and indexes to IGB QuickLoad data repository

I next used svn to check in the files to the subversion repository https://svn.transvar.org/repos/genomes/trunk/pub/quickload/A_thaliana_Jun_2009:

$ svn add A_thaliana_Jun_2009.EST.*
$ svn ci

and entered a log message summarizing how the files were created and where they came from.

Add new files to annots.xml

Next, I added some additional lines to the annots.xml file for the A_thaliana_Jun_2009 genome:

annots.xml

<file
    name="EST/A_thaliana_Jun_2009.EST.sm.psl.gz"
    title="NCBI EST (single-mapping)"
    description="Single-mapping ESTs aligned with 90% query coverage and 95% identity"
    url="http://www.igbquickload.org/quickload/A_thaliana_Jun_2009/EST"
    foreground="B71725"
    name_size="14"
    direction_type="arrow"
    label_field="id"
    background="DEE0E0"
  />
  <file
    name="EST/A_thaliana_Jun_2009.EST.mm.psl.gz"
    title="NCBI EST (multi-mapping)"
    description="Multi-mapping ESTs aligned with 90% query coverage and 95% identity"
    url="http://www.igbquickload.org/quickload/A_thaliana_Jun_2009/EST"
    foreground="B71725"
    name_size="14"
    direction_type="arrow"
    label_field="id"
    background="DEE0E0"
  />

The name attribute gives the name of the physical file on the file system. I've checked into a subdirectory called EST, mainly so that I can use that Web directory as a convenient location for linking and adding documentation. Note that the URL I've provided is the same. The title attribute is what will appear next to the checkbox in IGB when users open the IGBQuickLoad.org data source folder in the Data Access panel. The description attribute will become the tooltip users see when they hover the mouse over the data set.

I then check the new annots.xml file into the repository in the usual way:

$ svn ci annots.xml -m "Configuring EST data sets in annots.xml"
Sending        annots.xml
Transmitting file data .
Committed revision 512.

Update the public server

Now that everything is checked in, I log onto the IGBQuickload.org site, change into the Web directory where the quickload repository has been checked out, and run

$ svn up

to deploy all the new files.

Conclusion

Using blat, I aligned the majority (81%) of the 1,529,700 Arabidopsis ESTs at 95% identity and 90% query coverage.

The table below summarizes the results:

Category	Number ESTs	Number alignments
single-mapping	1,172,185	1,172,185
multi-mapping	74,472	208,044
TOTAL	1,246,657	1,380,229

The image below shows how the EST alignments look in IGB:

Page tree

Annotating a genome with ESTs - an example from Arabidopsis