Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3
Table of Contents

Introduction

This assignment introduces Unix tools useful for working with large data files commonly encountered in bioinformatics. Once you've completed this lab assignment, you'll be able to do the following:

  • Use UNIX utilities (commands) to check the contents of large files
  • Understand how a popular bioinformatics file format called 'bed' represents gene model annotations

Why we use Unix in bioinformatics

The Unix operating system (OS) is popular in bioinformatics because of its powerful command-line tools that make scripting and performing automated analyses relatively easy.

One of the great benefits of Unix is that everything you learn today about how to interact with and use the system will probably still be useful in several years time. Unix has been around since the 1970s and has been steadily improving since then. Time you invest today in learning Unix tools (most of which are free) will likely pay off your entire career.

To interact with a Unix system, you invoke a program called a shell, an interpreter where you type in commands for the system to execute. The commands you type are programs that have predictable behaviors and which you can chain together into programs of your own called shell scripts.

How to get access to a machine running Unix

To do the exercises, you'll need access to a computer running a Unix operating system. The Bioinformatics Department at UNCC has a computer lab equipped with Apple desktop computers, which are running a version of the Unix operating system based (originally) on FreeBSD. However, if you don't have access to the computer lab, you can also get access to a Unix computer by launching an iPlant Atmosphere virtual machine instance. The other assignment for this week titled iPlant virtual machine - log in and explore walks you through the process of creating and logging in to an iPlant Atmosphere VM. If you don't have access to a Unix system, do the first part of the second assignment first and use your VM to complete the rest of this assignment.

Exercises

Getting started - launch a shell and type commands

To do the exercises, open a terminal window, which is an interface to a single shell.

  • On Apple, to open a terminal window, double-click the "terminal" icon under Applications > Utilities. You can also launch a terminal window by clicking the terminal icon shown on the Dock, the strip of application icons along the bottom of the screen.
  • On computers running other versions of Unix, e.g., CentOs or RedHat, you can launch a terminal window by right-clicking on the desktop and choose "Open Terminal" or something similarly-worded.
  • If you are using an iPlant Atmosphere instance, log into to your instance using ssh and your iPlant user id.

Once you've a shell window, you can type in commands at the prompt, typically a sequences of characters ending with a % or $ character, depending on how the system is set up.

To find out what a Unix command does, use the "man" program, which displays a text file (the "man" page) for a given command. For example, at the prompt ($), type

Code Block
$ man ls

The most useful Unix commands (in bioinformatics)

Much of bioinformatics analysis and programming involves working with data files obtained from diverse sources. As a general rule of thumb, whenever you start working with a new file, even one that you created, you should use combinations of the following commands to get an  overview of the file's contents.

cut - extract columns from a file

This command extracts columns of data from plain text files.

For example, let's say you have a comma-separated file named genbank.csv that looks like this:

Code Block
AEX20383.1,354 aa
P04637.4,393 aa

Let's say you want to extract the second column of data from the file. To get column two, do this:

Code Block
$ cut -f 2 -d , genbank.csv

The -f option indicates the field you want and the -d option tells cut that the , (comma) character is the field delimiter.

As with other Unix commands, you can use pipes to process the output of cut commands. For example

Code Block
cut -f 1 -d , genbank.csv | cut -f 1 -d .

Cuts column 2 from genbank.txt and cuts it further, using the '.' character as the delimiter. This would have the result of splitting the values in column 1 into two new columns, using the '.' character to define the split.

Read the wikipedia entry on cut which describes various ways you can use cut to examine and manipulate files.

sort - sort values in a file or piped input

This commands lets you sort the contents of a file. When you sort a file, lines with identical content end up next to each other in the output, which can then be fed to uniq (see below) to count the number of unique lines in the file.

Read the wikipedia entry on sort which describes sort order and how you can limit the sort to specific fields in a file. (You will see this again when we cover the tabix utility later in the course.)

uniq - extract unique lines from a file or piped input and wc to count them

Use uniq in combination with sort to count unique values in a file. For example, let's say you have a very large file called genbankBIG.csv that you suspect contains duplicate rows. To determine if there are duplicate rows, you could use wc -l (next section) to find out how many lines are in the file and then use sort and uniq to find out how many unique lines are in the file:

Code Block
$ sort genbankBIG.csv | uniq | wc -l
$ wc -l genbankBIG.csv

Read the wikipedia entry on uniq.

wc - count lines, bytes, words in a file

This command reports statistics on lines of piped input or on lines in a file. Use wc in combination with sort and uniq to assess the contents of a file. For example,

Code Block
$ cut -f 2 genbank.txt | sort | uniq | wc -l

counts the number of unique values in field 2 of file genbank.txt.

Read the wikipedia entry on wc.

Practice with gene models file from IGBQuickLoad.org

Enter change directory command (cd) with no arguments and then use pwd to print the current working directory.

Code Block
$ cd
$ pwd

QUESTION 1: When you execute the change directory command without specifying the name of a directory as an argument, the command assumes that you want to do what? Explain.

Use wget to download the protein-coding gene models for Arabidopsis thaliana. You'll get it from the IGB Quickload site, a Web site that Integrated Genome Browser uses as a back end data server:

For example, you can download the above file in the current working directory using a command like this:

Code Block
$ wget http://www.igbquickload.org/quickload/A_thaliana_Jun_2009/TAIR10_mRNA.bed.gz
Note

Most UNIX systems already have wget installed. On Apple computers, you may need to install it yourself or use curl instead.

Create a new directory in your home directory using mkdir.

Code Block
$ mkdir data

QUESTION 2: What happens if you try to create a new directory but a directory by that name already exists? Can you have more than one directory or file with the same name in the same directory? Does any operating system allow this? Explain.

Move the file into your data directory.

Code Block
$ mv TAIR10_mRNA.bed.gz data/.

Note that if you used the browser to download the file, it may have been saved in a different location than your home directory. It may also have uncompressed it for you.
Note that different browsers will likely put downloaded files in different places on the file system. Dr. Loraine does not recommend using a browser to download files.

QUESTION 3: What does the '.' symbol tell Unix to do in the above command? Hint: It helps save typing. How?

Move in to the data directory and use ls with the -l and -h options to get details about the file:

Code Block
$ cd data
$ ls -lh

Note In the above command, you could also have typed:

Code Block
$ ls -l -h

You can typically pile on several different options onto a Unix command, provided they don't contradict each other.

Now uncompress the file using gunzip, if it is not already uncompressed. (Some systems try to help you out by uncompressing files you download via a Web browser.) Now, take a look at the top and bottom of the file, using the head and tail commands.

Info

When you uncompress a file using gunzip, gunzip, by default, replaces the old '.gz' version of the file with a new uncompressed version that lacks the '.gz' extension. As with all Unix commands, you can modify this default behavior using options.

Code Block
$ gunzip TAIR10_mRNA.bed.gz
$ ls -lh
$ head TAIR10_mRNA.bed
$ tail TAIR10_mRNA.bed

Now, compress it again:

Code Block
$ gzip TAIR10_mRNA.bed
$ ls -lh

Note the output from the ls command when you give it the -lh option. The -l (long) option tells ls to "list" all the files in the directory *and* their properties, such as how big they are, who owns them, etc. The 'h' option tells ls to make the output human-friendly and to report file sizes using familiar units of kilobytes instead of bytes.

Code Block
$ gunzip TAIR10_mRNA.bed.gz

QUESTION 4: How big was the compressed file? How big was the uncompressed file?

The file you've downloaded is in "bed-detail" format, a tabular format invented by the University of California at Santa Cruz Genome Bioinformatics group. ("Bed" stands for "browser extensible format.")

Each line is represents the structure of single gene model, which a hypothesis about the structure of a mature, spliced mRNA transcript arising from a gene. Note that a single gene can, in theory, produce multiple different mRNA transcripts due to alternative promoters, alternative splicing, and alternative polyadenylation, which are types of chemical modifications a newly-synthesized transcript can undergo as it progresses through the maturation process. This means that a single gene can be associated with multiple gene models, where each gene model is essentially just a theory about the structure of the genes transcript products. Also, gene models that overlap along the genome sequence axis represent hypothetical alternative mRNA transcripts arising from the same gene or locus. Note that the term "locus" is a synonym for "gene". We will use these terms interchangeably in this class.

Take note of the following fields in the 'bed' file you downloaded:

Column 1: The name of the annotated sequence where the gene model is located.
Column 2: The start position for the gene model (genomic sequence coordinates).
Column 3: The end position for the gene model (genomic sequence coordinates).
Column 4: The name of the gene model
Column 6: The strand (+ or -) it comes from
Column 10: The number of exons the gene model contains.
Column 11: Comma-separated list of exon start positions, relative to the value in column 2
Column 12: Comma-separated list of exon sizes
Column 13: Locus name
Column 14: Descriptive text about the gene model or gene

To understand what these fields mean, read Gene Models and the Bed format: What they represent.

Answer the following questions using Unix commands you learned in the tutorials and in the exercises above. What commands did you use? Hint: be sure you understand how to use sort, wc, uniq, and the | (pipe) operator. (On most keyboard, the pipe key is above the Enter or Return key.)

QUESTION 5: How many distinct gene model names does the file contain?

QUESTION 6: Are any gene model names present multiple times in the file? Provide a sequence of commands that prove your answer is correct.

QUESTION 7: How many annotated sequences (chromosome) are represented in the file? What are their names?

QUESTION 8: Which annotated sequence (chromosome) has the smallest number of gene models? How many does it contain?

A single gene may be associated with multiple gene models. The TAIR10_mRNA.bed.gz file contains many thousands of gene models, all of which have names like AT4G22890.5 or AT2G46660.1. These names are called AGI codes. (AGI stands for Arabidopsis Genome Initiative.) The first part of the gene model name is the locus identifier (e.g., AT4G22890) and the suffix ".[numeral]" indicates the gene model number. For example, the gene model name AT4G22890.5 means: gene model number 5 belonging to locus AT4G22890. Locus is a synonym for gene.

QUESTION 9: How many unique locus identifiers does the file contain? (Hint: use cut twice.)

QUESTION 10: What sequence of commands could you use to determine if any gene models that have identical genomic coordinates and different names? (Hint: see Gene Models and the Bed format: What they represent.)