Introduction

The ability to work with high-throughput sequencing (HTS) data sets is an in-demand skill in bioinformatics. Much of what we've done thus far has been designed to prepare you for this week's introduction to HTS data sets.

...

You can check your space allotment using df.

Code Block

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             8.7G  4.5G  3.9G  54% /
/dev/sda2             713M   17M  660M   3% /mnt

To find out how much data is stored in a given directory, use du (disk usage).

Code Block

SRR306316]$ du -h
76K	./logs
479M	.
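If you only want a single total for a directory, or want to see which subdirectory is largest, du can be combined with a few standard options. The following is a minimal sketch, assuming a du that supports -s, -h, and -k (GNU and BSD versions both do). The directory and file names are hypothetical and are created only so the example runs on its own.

```shell
# Create a throwaway directory so the example is self-contained (hypothetical names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/logs"
printf 'example\n' > "$demo_dir/logs/run.log"

# -s prints one summary line for the whole directory; -h uses human-readable units.
du -sh "$demo_dir"

# List every directory with its size in kilobytes, smallest first.
du -k "$demo_dir" | sort -n
```

In your own work you would replace "$demo_dir" with the directory you want to inspect, for example the directory holding your sequence files.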

...

To do the following exercises, you'll also need to download and install the following tools and make sure they are in your PATH.

How to install or unpack

 

Note

It may be possible to install some tools using yum install.

Many UNIX tools are distributed as compressed archive files, sometimes called tarballs. To unpack a tarball, use gunzip to decompress it and tar to extract its contents.

Code Block

$ gunzip tool.tar.gz
$ tar -xf tool.tar
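The two commands above can usually be combined, because tar can decompress gzip files itself when given the -z option. The sketch below builds a small tarball first so that it runs on its own. The name "tool" is the same placeholder used above, not a real package, and the temporary directories are illustrative.

```shell
# Build a small tarball so the example is self-contained.
# "tool" is a placeholder name, not a real package.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/src/tool"
printf 'placeholder\n' > "$work_dir/src/tool/README"
tar -czf "$work_dir/tool.tar.gz" -C "$work_dir/src" tool

# List the contents without unpacking (-t) to see what will be created.
tar -tzf "$work_dir/tool.tar.gz"

# Decompress (-z) and extract (-x) in one step; -C selects the target directory.
mkdir -p "$work_dir/unpacked"
tar -xzf "$work_dir/tool.tar.gz" -C "$work_dir/unpacked"
```

Listing the contents before extracting is a useful habit, since some tarballs unpack into the current directory rather than into a directory of their own.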

...

Use fastq-dump to convert the .sra files to FASTQ format.

Code Block

$ fastq-dump SRR306316.sra
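After the conversion, it can be useful to check how many reads the FASTQ file contains. The sketch below assumes the common layout in which every read occupies exactly four lines (name, sequence, separator, qualities). The function name count_fastq_reads and the tiny demo file are made up for illustration and are not real SRR306316 data.

```shell
# Count reads in a FASTQ file, assuming every record is exactly four lines.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up FASTQ file for illustration; not real sequencing data.
demo_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo_fastq"

# Print the number of reads in the demo file.
count_fastq_reads "$demo_fastq"
```

On the real data, you would pass the FASTQ file that fastq-dump wrote in place of "$demo_fastq".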

...

Twobit (.2bit) is a compact sequence format developed by the UCSC Genome Bioinformatics programmer Jim Kent, who also developed blat. However, the alignment tools you'll use don't read 2bit files, so you'll have to convert the file to FASTA before you can proceed. For this, use twoBitToFa.

Code Block

$ twoBitToFa O_sativa_japonica_Oct_2011.2bit O_sativa_japonica_Oct_2011.fa
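A quick sanity check after the conversion is to list the sequence names in the FASTA file, since these are the names that will appear in the alignment results. The sketch below is a minimal example using grep and sed. The function name list_fasta_names and the two-sequence demo file are hypothetical and stand in for the converted genome file.

```shell
# Print the sequence names (header lines without the leading ">") in a FASTA file.
list_fasta_names() {
    grep '^>' "$1" | sed 's/^>//'
}

# Tiny made-up FASTA file for illustration; not the real genome.
demo_fa=$(mktemp)
printf '>chr1\nACGTACGT\n>chr2\nGGCC\n' > "$demo_fa"

# Show the sequence names, one per line.
list_fasta_names "$demo_fa"

# Count how many sequences the file contains.
grep -c '^>' "$demo_fa"
```

On the real data, you would pass O_sativa_japonica_Oct_2011.fa in place of "$demo_fa".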

...

Bowtie-build may take several minutes to complete. If you are running it on your VM, first launch screen so that if you lose your connection, your bowtie-build command can finish.

Code Block

$ screen
$ bowtie-build O_sativa_japonica_Oct_2011.fa O_sativa_japonica_Oct_2011

...

Now you have all the pieces in place to launch the alignment program tophat. Because it will likely take more than a few minutes to run, use screen to create a new shell that won't die if you lose your connection, or use screen -r to recover a detached screen you're not already using for something else.

Code Block

$ screen
$ tophat -I 5000 -i 50 -o SRR306316 \
    O_sativa_japonica_Oct_2011/O_sativa_japonica_Oct_2011 \
    SRR306316.fastq.gz

...

To work with the alignment files, you'll use samtools, which allows region-based querying of BAM format files. Region-based queries require an index. To create one, run samtools index.

Code Block

$ samtools index SRR306316.bam

...

Info
Recognizing uniquely mapping reads

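The SAM format NH tag indicates the number of reported alignments for a given read, that is, the number of places where the read aligns onto the genome. A uniquely mapping read has NH:i:1.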
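One way to count alignments whose NH tag equals 1 is to scan the optional fields of each SAM record. The sketch below is a minimal example, assuming tab-delimited SAM text on standard input with optional tags starting at column 12. The function name count_unique_alignments and the three demo records are made up for illustration.

```shell
# Count alignment records tagged NH:i:1 in SAM text read from standard input.
count_unique_alignments() {
    awk -F '\t' '
        /^@/ { next }                      # skip SAM header lines
        {
            for (i = 12; i <= NF; i++)     # optional tags start at column 12
                if ($i == "NH:i:1") { n++; break }
        }
        END { print n + 0 }                # print 0 when nothing matched
    '
}

# Made-up SAM text: one header line, one unique read, one read aligned twice.
demo_sam=$(mktemp)
printf '@HD\tVN:1.0\n' > "$demo_sam"
printf 'r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "$demo_sam"
printf 'r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\n' >> "$demo_sam"
printf 'r2\t256\tchr2\t300\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\n' >> "$demo_sam"

count_unique_alignments < "$demo_sam"
```

With a real BAM file, the same function could be fed from samtools, for example samtools view SRR306316.bam | count_unique_alignments, assuming samtools is installed and the aligner wrote NH tags.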

Grading

For full credit, publish your Web page link on the class Yahoo group.

The Web page should include:

Report

  • Number of genes you examined in IGB and summary of what you found.
  • Summary of read alignment results.

Once everyone has posted his or her link, look at the other students' pages. Be ready to discuss in class the degree to which everyone obtained the same results.