By now, you've learned a great deal about how to work productively in a UNIX environment. You've also learned to perform one of the most important tasks in bioinformatics programming: creating data sets that end-user biologists can visualize.
This week, you'll take it a step further by learning to process and visualize data sets from RNA-Seq, a form of EST sequencing that provides not only sequence information for expressed genes but also quantitative information about overall expression levels.
This week, you'll learn:
- how to get RNA-Seq data from the NCBI Sequence Read Archive (SRA, formerly the Short Read Archive)
- how to convert the NCBI-specific sequence format (.sra) to FASTQ
- how to interpret quality scores and other information in FASTQ files
- how to align sequences in FASTQ files onto a reference genome
- how to visualize alignments
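
As a preview of the FASTQ objective above, here is a minimal Python sketch of how a quality string can be decoded. It assumes the Sanger-style encoding, in which each quality character's ASCII code minus 33 is the Phred score Q, and Q relates to the estimated error probability by p = 10^(-Q/10). Older Solexa/Illumina files use other offsets or scales, so check which variant your data uses. The record and the function names are made up for illustration.

```python
# Minimal sketch: decode the quality line of one four-line FASTQ record.
# Assumption: Sanger-style encoding (Phred score = ASCII code - 33).

def parse_fastq_record(lines):
    """Split a four-line FASTQ record into (name, sequence, quality string)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert each quality character to its integer Phred score."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Made-up record, for illustration only.
record = ["@read1\n", "GATTACA\n", "+\n", "II5+!I#\n"]
name, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)

# Print each base with its Phred score and estimated error probability.
for base, q in zip(seq, scores):
    print(base, q, f"{error_probability(q):.4f}")
```

A score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000, which is why low-scoring bases are often trimmed or filtered before alignment.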
Resources for this week:
- Jonathan Weissman (UCSF) lecture on DNA sequencing (24 min) - http://www.youtube.com/watch?v=8n2LvJ-m0n0&feature=related
- Wikipedia entry on the FASTQ format - http://en.wikipedia.org/wiki/FASTQ_format
- Article: "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants" - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/
- SAM specification - http://samtools.sourceforge.net/SAM1.pdf
- NCBI Short Read Archive (SRA)
- TopHat Manual - http://tophat.cbcb.umd.edu/manual.html