
Introduction

Finding instances of transcription factor binding sites, restriction enzyme sites, and other short DNA or RNA motifs is a common task in bioinformatics.

In this assignment, you'll use the regular expression module (re) to implement a program that finds all instances of a DNA binding motif in regions defined in a file.

To run the program, users will provide three inputs as command-line arguments and options:

  • a regular expression (e.g., [AT]ATG[AT])
  • the name of a fasta file containing genomic sequences to search (e.g., A_thaliana_Jun_2009.fa.gz in bioprogs/data)
  • the name of a file defining regions on the same genome reported in the fasta file (e.g., promoter regions upstream of annotated genes)

For each region defined in the regions file, the program will extract the corresponding sequence from the fasta file and then count how often the regular expression appears in the extracted sequence. Then, for each region, you'll output the original line read from the regions file plus the number of matches you found.
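As a rough illustration of the counting step, here is a minimal sketch using the re module. The function name count_matches and the example sequence are made up for illustration and are not part of the stub program. Note that finditer reports non-overlapping matches only; the assignment does not say whether overlapping matches should be counted, so treat that as an open design choice.

```python
import re

def count_matches(pattern, sequence):
    """Count non-overlapping matches of a regular expression in a sequence.

    Hypothetical helper, not taken from the stub program.
    """
    regex = re.compile(pattern)
    # finditer scans left to right and resumes after the end of each match,
    # so overlapping occurrences are not counted
    return sum(1 for _ in regex.finditer(sequence))

# made-up sequence containing AATGA and TATGT
example = "AATGATTTTATGTCC"
print(count_matches("[AT]ATG[AT]", example))  # prints 2
```

For example, in AATGAATGA the two candidate sites share the middle A, so this sketch counts one match rather than two.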

The regions file format will include four fields:

  • sequence name
  • start position (zero-based/interbase coordinates)
  • end position (zero-based/interbase coordinates)
  • a region name
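Zero-based/interbase coordinates line up with Python slicing, so a region from start to end corresponds to sequence[start:end] and has length end - start. The sketch below assumes the four fields are separated by tabs, which the description above does not state; the function names and sample data are hypothetical.

```python
def parse_region(line):
    """Split one regions-file line into its four fields.

    Assumes tab-delimited fields (an assumption, not stated in the assignment).
    """
    fields = line.rstrip("\n").split("\t")
    seq_name, start, end, region_name = fields[:4]
    return seq_name, int(start), int(end), region_name

def extract_region(sequence, start, end):
    """Return the plus-strand subsequence for interbase coordinates."""
    # interbase coordinates map directly onto a Python slice
    return sequence[start:end]

# made-up chromosome sequence and region line
chrom = "ACGTACGTAC"
name, start, end, label = parse_region("chr1\t2\t6\tpromoter_A\n")
print(extract_region(chrom, start, end))  # prints GTAC
```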

For this assignment, search the plus strand sequence only.

Dr. Loraine has provided a stub program to help you get started (bioprogs/class/py/findMotifs.py).

You'll need to be able to read and work with sequences from a fasta file, and so she has provided two functions that use the SeqIO parser from BioPython.

Setting up

You will probably find it easiest to complete the assignment on your VM, since you will need access to functions from BioPython. However, the lab computers may already have BioPython installed; if so, you can use them as well.

If you are planning to use your VM to program, you'll need to install the BioPython libraries.

How you do this is up to you. For your reference, here are the steps Dr. Loraine followed to install Python 2.6 and BioPython on her VM.

Use svn to copy the stub program from the class subversion repo.

Check out a copy of the class subversion repo.

Use the "svn copy" command to copy today's assignment from the class directory into your part of the repo. (Substitute your user name in the command below.)

Plan your program.

Examine the stub program file and make note of how it should be run.

Run the program with the -h option.

If you have any questions, post them to the Yahoo group.

Create example input and output files.

You can use these for testing and they will also help you think about what the program will do.

Think about: What tasks does the program need to accomplish?

Write down a list of tasks that your program will need to do.

Ask yourself:

  • Which of these steps will be repeated?
  • Which will only happen once during the execution of the program?
  • What data will the program need to hold in memory?

Put the tasks in order

Look at the list of tasks. What tasks will happen first? Which tasks depend on other tasks?

Convert tasks to functions

For each task, design a function (or functions) that will carry out the task.

For each function, decide:

  • the name of the function (choose something descriptive and easy for you to remember)
  • the arguments it accepts (their type, their names, where they came from, and what they represent)
  • the values it returns (their type and what they represent)

Also decide: How will you link the tasks together into a task pipeline to carry out the work of the program?

Write your program

Add functions to your program

Add your functions to the program file.

Write your main method, which should invoke functions one-by-one to carry out the work of the program.
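One possible shape for the pipeline is sketched below. It is only an illustration: the function name annotate_regions is hypothetical, the sequences are held in a plain dictionary rather than read with the BioPython functions from the stub, and both the tab-delimited input and the tab-separated count appended to each output line are assumptions.

```python
import re

def annotate_regions(region_lines, sequences, pattern):
    """Yield each region line with its match count appended.

    region_lines: iterable of tab-delimited lines (assumed format)
    sequences: dict mapping sequence name to sequence string (stand-in
               for whatever the FASTA-reading functions return)
    pattern: regular expression string supplied by the user
    """
    regex = re.compile(pattern)  # compile once, reuse for every region
    for line in region_lines:
        line = line.rstrip("\n")
        seq_name, start, end = line.split("\t")[:3]
        subseq = sequences[seq_name][int(start):int(end)]
        count = len(regex.findall(subseq))  # non-overlapping matches
        yield "%s\t%d" % (line, count)

# made-up data standing in for the FASTA and regions files
sequences = {"chr1": "AATGATTTTATGTCC"}
regions = ["chr1\t0\t15\tr1\n", "chr1\t5\t15\tr2\n"]
for out_line in annotate_regions(regions, sequences, "[AT]ATG[AT]"):
    print(out_line)
```

Compiling the pattern and loading the sequences happen once, while extraction and counting repeat for every region, which mirrors the once-versus-repeated question raised in the planning step.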

Test each function interactively

Use the python interpreter to develop and test each function independently from the rest.

Start with the first function invoked in your main method. Implement it and test it.

Then work on the next function that is called in your script, passing in the output of the previous function(s) if necessary.
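When a function reads from a file, you can test it in the interpreter without creating a real file by handing it an in-memory file object. The sketch below shows the idea with a hypothetical read_regions function and made-up region lines; as elsewhere, the tab-delimited format is an assumption.

```python
import io

def read_regions(handle):
    """Return a list of (seq_name, start, end, region_name, line) tuples.

    Hypothetical function; keeps the original line so it can be echoed
    in the output later.
    """
    regions = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        seq_name, start, end, region_name = line.split("\t")[:4]
        regions.append((seq_name, int(start), int(end), region_name, line))
    return regions

# io.StringIO behaves like an open text file, so no file on disk is needed
fake_file = io.StringIO(u"chr1\t0\t15\tr1\nchr1\t5\t15\tr2\n")
regions = read_regions(fake_file)
assert len(regions) == 2
assert regions[0][:4] == ("chr1", 0, 15, "r1")
```

The list returned here can then be passed straight into the next function you are testing.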

Turn it in

Check in your final version to your part of the class repository.


Check in multiple draft versions of your file as you develop it. If you run into problems, this will allow Dr. Loraine to review your work and suggest improvements.
