The ability to work with high-throughput sequencing (HTS) data sets is an in-demand skill in bioinformatics. Much of what we've done thus far has been designed to prepare you for this week's introduction to HTS data sets.
You can check your space allotment using df.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       8.7G  4.5G  3.9G  54% /
/dev/sda2       713M   17M  660M   3% /mnt
To find out how much data is stored in a given directory, use du (disk usage).
SRR306316]$ du -h
76K     ./logs
479M    .
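When you only need the numbers, for example to check that a download will fit, both commands can be reduced to a single value. The following is a minimal sketch, assuming a POSIX-style df and du; the scratch directory and the variable names are placeholders, not part of the exercise.

```shell
# Hypothetical scratch directory standing in for a data directory such as SRR306316
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/logs"
printf 'example log line\n' > "$demo_dir/logs/run.log"

# Free space, in 1K blocks, on the filesystem that holds the directory.
# -P forces the portable layout with one line per filesystem.
avail_kb=$(df -Pk "$demo_dir" | awk 'NR == 2 { print $4 }')
echo "Available: ${avail_kb} KB"

# Total size of the directory in 1K blocks (-s prints only the summary line)
used_kb=$(du -sk "$demo_dir" | cut -f1)
echo "Used by $demo_dir: ${used_kb} KB"
```

With real data, replace the scratch directory with the directory you want to measure, for example `du -sk SRR306316`.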
To do the following exercises, you'll also need to download and install the following tools in your PATH.
How to install or unpack
It may be possible to install some tools using yum install.
Many UNIX tools are distributed as compressed files sometimes called tarballs. To unpack a tarball, use gunzip and tar.
$ gunzip tool.tar.gz
$ tar -xf tool.tar
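Most tar implementations can also decompress and unpack in one step with the -z option. The sketch below builds a tiny stand-in tarball so that it can run anywhere; the name "tool" and the directory layout are placeholders, not a real download.

```shell
# Build a tiny stand-in tarball; "tool" and its contents are placeholders
tar_demo=$(mktemp -d)
mkdir -p "$tar_demo/src/tool"
printf 'placeholder\n' > "$tar_demo/src/tool/README"
tar -czf "$tar_demo/tool.tar.gz" -C "$tar_demo/src" tool

# List the contents without unpacking (-t lists, -z gunzips, -f names the file)
tar -tzf "$tar_demo/tool.tar.gz"

# Decompress and unpack in one step, into a chosen directory (-C)
mkdir -p "$tar_demo/unpacked"
tar -xzf "$tar_demo/tool.tar.gz" -C "$tar_demo/unpacked"
```

Listing the contents first is a cheap way to see whether the archive unpacks into its own directory or scatters files into the current one.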
Use fastq-dump to convert the .sra files to FASTQ format.
$ fastq-dump SRR306316.sra
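A quick sanity check after conversion is to count the reads. The sketch below assumes four-line FASTQ records (header, sequence, "+", qualities) and runs on two made-up reads rather than on real SRR306316 data; the file and variable names are illustrative.

```shell
# Tiny stand-in FASTQ with two made-up reads (not real SRR306316 data)
fastq_demo=$(mktemp)
cat > "$fastq_demo" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIIIII
EOF

# Each record spans four lines, so reads = lines / 4
line_count=$(wc -l < "$fastq_demo")
read_count=$((line_count / 4))
echo "reads: $read_count"
```

For a gzip-compressed file, the same count can be taken from `gunzip -c file.fastq.gz | wc -l`.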
Twobit (.2bit) is a sequence format developed by the UCSC Genome Bioinformatics programmer Jim Kent, who also developed blat. However, the alignment tools you'll use don't work with 2bit files, so you'll have to convert the file to FASTA before you can proceed. For this, use twoBitToFa.
$ twoBitToFa O_sativa_japonica_Oct_2011.2bit O_sativa_japonica_Oct_2011.fa
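After the conversion it is worth confirming that the FASTA file contains the sequences you expect. The following sketch lists each sequence name with its length; it runs on a made-up two-sequence file, and the names chrA and chrB are placeholders rather than real rice chromosome names.

```shell
# Tiny stand-in FASTA; chrA and chrB are made-up sequence names
fasta_demo=$(mktemp)
cat > "$fasta_demo" <<'EOF'
>chrA
ACGTACGTAC
GTAC
>chrB
TTGGCC
EOF

# Header lines start with ">", so counting them counts sequences
seq_count=$(grep -c '^>' "$fasta_demo")
echo "sequences: $seq_count"

# Print each sequence name followed by its total length in bases
seq_lengths=$(awk '
  /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name, len }' "$fasta_demo")
echo "$seq_lengths"
```

To run the same check on the converted genome, point both commands at O_sativa_japonica_Oct_2011.fa instead of the stand-in file.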
Bowtie-build may take several minutes to complete. If you are running it on your VM, first launch screen so that if you lose your connection, your bowtie-build command can finish.
$ screen
$ bowtie-build O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011
Now you have all the pieces in place to launch the alignment program tophat. Because it will likely take more than a few minutes to run, use screen to create a new shell that won't die if you lose your connection, or use screen -r to recover a detached screen you're not already using for something else.
$ screen
$ tophat -I 5000 -i 50 -o SRR306316 O_sativa_japonica_Oct_2011/O_sativa_japonica_Oct_2011 SRR306316.fastq.gz
To work with the alignment files, you'll use samtools, which allows region-based querying of BAM format files. Region queries require an index; to create one, run samtools index.
$ samtools index SRR306316.bam
The SAM format NH tag indicates the number of times a given read aligns onto the genome.
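One way to use the NH tag is to count alignment records whose read maps exactly once (NH:i:1). The sketch below runs on a few made-up, tab-separated SAM lines; the read names, positions, and reference names are invented for illustration. It compares the whole tag rather than searching for a substring, so that NH:i:10 is not counted as NH:i:1.

```shell
# Made-up SAM records: 11 mandatory fields followed by an NH tag
sam_demo=$(mktemp)
printf 'r1\t0\tchrA\t100\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:1\n'   >  "$sam_demo"
printf 'r2\t0\tchrA\t200\t3\t8M\t*\t0\t0\tTTGGCCAA\tIIIIIIII\tNH:i:2\n'    >> "$sam_demo"
printf 'r2\t256\tchrB\t300\t3\t8M\t*\t0\t0\tTTGGCCAA\tIIIIIIII\tNH:i:2\n'  >> "$sam_demo"
printf 'r3\t0\tchrB\t400\t0\t8M\t*\t0\t0\tACGTTTGG\tIIIIIIII\tNH:i:10\n'   >> "$sam_demo"

# Optional tags start at field 12; skip header lines, which begin with "@"
unique_count=$(awk -F'\t' '
  !/^@/ { for (i = 12; i <= NF; i++) if ($i == "NH:i:1") { n++; break } }
  END   { print n + 0 }' "$sam_demo")
echo "uniquely aligned records: $unique_count"
```

With real data, the same awk program can read the text output of samtools view, for example `samtools view SRR306316.bam | awk -F'\t' '...'`.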
For full credit, publish your Web page link on the class Yahoo group.
The Web page should include:
- Number of genes you examined in IGB and summary of what you found.
- Summary of read alignment results
Once everyone has posted his or her link, look at the other students' pages. Be ready to discuss in class the degree to which everyone obtained the same results.