Introduction

In this assignment, you'll get more practice writing scripts in Python.

Each script will focus on manipulation and analysis of gene model data stored in BED format files.

Each script is meant to be a bit more difficult than the last, so start with countLines.py and work from there.


General requirements

Each program should be designed so that users can run it from the UNIX command line, like other UNIX utilities such as ls and uniq. This means that output should be sent to stdout. Don't print anything to stderr except messages telling the user that there is a problem. In other words, if the script runs normally without an error, don't write to stderr.

As you know, different systems will have different versions of Python installed. You can assume that including the following "she-bang" line at the top of the script and making the script executable will allow a user to run your code.

Style

To ensure that your code can be easily read and understood by other programmers:

  • Maintain separation of concerns, which means: divide your code into logical chunks that address different tasks. For example, separate command-line parsing code from the code that does the actual work of the script. (The "actual work" is sometimes called the "business logic.")
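As a rough illustration of separation of concerns and of the stdout/stderr rule, here is a minimal sketch of how a one-argument script might be laid out. The function names (parse_args, count_lines, main), the usage message, and the return codes are illustrative choices, not requirements of the assignment.

```python
import sys


def parse_args(argv):
    """Command-line parsing: return the file name, or None if usage is wrong."""
    if len(argv) != 2:
        return None
    return argv[1]


def count_lines(lines):
    """Business logic: count the items in any iterable of lines."""
    return sum(1 for _ in lines)


def main(argv):
    """Glue code: connect parsing, file handling, and output."""
    file_name = parse_args(argv)
    if file_name is None:
        # Problems are reported on stderr, never on stdout.
        sys.stderr.write("usage: countLines.py FILE\n")
        return 1
    try:
        with open(file_name) as handle:
            result = count_lines(handle)
    except IOError as err:
        sys.stderr.write("error: cannot read %s: %s\n" % (file_name, err))
        return 1
    # Normal results go to stdout only.
    sys.stdout.write("%d\n" % result)
    return 0
```

In a complete script, the last lines would typically pass sys.argv to main and hand its return value to sys.exit. Because the business logic takes any iterable of lines, it can be tested without touching the command line or the file system.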

The six simple scripts

countLines.py

The script accepts one argument: the name of a file. It should open the file, count the number of lines it contains, and print the number of lines to stdout.

For example, the user should be able to run your program from the UNIX prompt like so:

Check your output using the UNIX wc command, which does something very similar.

checkBed.py

This script accepts one argument: the name of a file. It should read the file and check that each line of the file contains exactly twelve tab-separated fields. If every line passes the test, it should print the string "OK" to stdout. If even one line fails the test, it should print "FAIL" to stdout.

For example, the user should be able to run your program from the UNIX prompt like so:

The program should not use the file extension to check the file. It should try to read whatever file is named by the user, regardless of how the name ends.
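The twelve-field test described above might be sketched as follows. This is one possible approach, not the required one; the function names are made up for illustration, and treating an empty file as "OK" is an assumption the assignment does not settle.

```python
def has_twelve_fields(line):
    """Return True if the line holds exactly twelve tab-separated fields."""
    # Strip only the line terminator so that empty trailing fields still count.
    return len(line.rstrip("\r\n").split("\t")) == 12


def check_bed(lines):
    """Return "OK" if every line passes the field-count test, else "FAIL"."""
    for line in lines:
        if not has_twelve_fields(line):
            # One bad line is enough, so stop reading early.
            return "FAIL"
    # Assumption: a file with no lines at all is reported as "OK".
    return "OK"
```

Note that the check never looks at the file name, only at the content of each line.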

getMultiExon.py

This script accepts one argument: the name of a file. It should read the file and examine each line. If a line contains a gene model with more than one exon, it should write that line of data to stdout. However, if a line contains a gene model with only one exon, it should not write that line to stdout.
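In BED format, field 10 (blockCount) records the number of blocks in a feature, and for gene models each block is an exon. A minimal sketch of the exon test, with illustrative function names and assuming every line has at least ten tab-separated fields, could look like this:

```python
def exon_count(line):
    """Return the number of exons, read from BED field 10 (blockCount)."""
    fields = line.rstrip("\r\n").split("\t")
    # Field 10 in 1-based BED numbering is index 9 in a Python list.
    return int(fields[9])


def multi_exon_lines(lines):
    """Yield, unchanged, only the lines whose gene model has more than one exon."""
    for line in lines:
        if exon_count(line) > 1:
            yield line
```

Because the lines are yielded unchanged, they can be written straight to stdout and counted with wc.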

Use your script and the UNIX wc utility to answer this question:

QUESTION1: How many protein-coding gene models from the Arabidopsis TAIR10 annotations contain multiple exons?

A BED-DETAIL format file containing the Arabidopsis protein-coding gene models is available from the IGB QuickLoad site:

http://www.igbquickload.org/quickload/A_thaliana_Jun_2009/TAIR10_mRNA.bed.gz

The file uses a variation on BED called bed-detail. It has the same fields as an ordinary BED file, but contains two extra fields: an identifier (field 13) and a free-text description field (field 14).

So to answer the question, you may need to convert the BED-DETAIL file to ordinary, 12-field BED. Or, you could design your script to work with both formats. It's your choice.
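If you choose to convert, one way to turn a bed-detail line into an ordinary 12-field BED line is to keep the first twelve fields and drop the rest. The helper name below is hypothetical and the snippet is only a sketch of that idea.

```python
def to_bed12(line):
    """Keep the first twelve tab-separated fields and drop any extras."""
    fields = line.rstrip("\r\n").split("\t")
    # Slicing is safe even when the line has fewer than twelve fields.
    return "\t".join(fields[:12]) + "\n"
```

A line that already has twelve fields passes through with the same fields, so the same helper would work on both formats.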

Example)

getSingleExon.py

This script does the opposite of getMultiExon.py.

getSingleExon.py accepts one argument: the name of a file. It should read the file and examine each line. If a line contains a gene model with only one exon, it should write that line of data to stdout. However, if a line contains a gene model with more than one exon, it should not write that line to stdout.

Use the output from your program to answer the following question:

QUESTION2: How many gene models contain only one exon?

Example)

getSingleExon.py prints to stdout, which allows me to send the output to a file or pipe it into another program.

findDuplicates.py

findDuplicates.py accepts one argument: the name of a BED format file. It should read the file and print any lines that appear multiple times in the file.

For example, let's say you have a BED file (example.bed) that repeats the same line of data in two different locations. Note that the line appears once at the start of the file and once again at the end of the file.

Your program should write that repeated line once to stdout:

Use a dictionary to keep track of how often your script encounters the same line of data.
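The dictionary idea can be sketched like this. The function name is illustrative, and reporting duplicates in order of first appearance, with line terminators stripped before comparing, are assumptions made for this example.

```python
def find_duplicates(lines):
    """Return each line that occurs more than once, in order of first appearance."""
    counts = {}  # maps a line's text to the number of times it has been seen
    order = []   # remembers the order in which distinct lines first appear
    for line in lines:
        # Strip the terminator so a final line without "\n" still matches.
        text = line.rstrip("\r\n")
        if text not in counts:
            counts[text] = 0
            order.append(text)
        counts[text] += 1
    return [text for text in order if counts[text] > 1]
```

Each repeated line is returned once, no matter how many times it occurs in the input.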

findNonZero.py

Write a script that finds Arabidopsis gene ids that do not end with zero.

Note that most gene ids look like: AT1G01010. That is, most end with zero.

findNonZero.py should accept one argument: the name of a BED format file containing Arabidopsis gene model annotations. It should read the file and print any lines where the last digit of the gene id is NOT zero.

See also: http://www.arabidopsis.org/portals/nomenclature/guidelines.jsp#guide
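A possible sketch of the gene id test follows. It rests on two assumptions you should check against your own data: that BED field 4 (the name field) holds a gene model name such as AT1G01010.1, and that the gene id is the part of that name before the first period. The function names are illustrative.

```python
def gene_id(line):
    """Return the gene id taken from BED field 4 (the name field)."""
    name = line.rstrip("\r\n").split("\t")[3]
    # Assumption: names look like AT1G01010.1, so the gene id precedes the period.
    return name.split(".")[0]


def ends_with_nonzero(line):
    """Return True if the last character of the gene id is not zero."""
    return not gene_id(line).endswith("0")
```

Lines for which ends_with_nonzero returns True are the ones the script would print.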

Use your script to answer the question:

QUESTION3: How many gene models have gene ids that end with a non-zero character?
