Introduction

The first step when logging into any new system is to find out what OS it is running. The second step is to install whatever software you need to do your work.

In this class, you'll learn to use a programming tool called version control to develop code and keep track of analyses. Using version control systems, you can document every step of an analysis, including important data sets, descriptions of how they were made, images intended for publication, and more.

Note that this is very different from how application developers use version control. They use version control systems to combine the work of many developers; their goal is to build software.

In bioinformatics, we use version control to develop software as well, but we also use it to document the steps of an analysis and ensure reproducibility within groups and over time.

Ensuring reproducibility in research is a major theme of this class, and it is by no means a solved problem in any discipline. In this class, you will learn tools and techniques for ensuring reproducibility.

In this exercise, you'll

  • Learn more about your VM
  • Install a version control system called subversion
  • Use it to check out a project
  • Make the files accessible via your VM's Web server

Exercises

Log into your VM (or start a new one if no VM is running)

By now, you should know what to do.

What UNIX OS is your VM running?

Do a Google search: how do you find out what version of UNIX a system is running? Then find out what version of UNIX your VM is running.

Install subversion

Most modern UNIX systems come with easy-to-use installer programs you can use to update the operating system and install new software.

To find out how this works for your particular OS, search Google using the query string "How do I install software on XXX" where XXX is your VM's version of UNIX.

Once you find out how to do it, use the install script for your system to install subversion.
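As a rough sketch of what the answer tends to look like, the snippet below prints (but does not run) an install command for two common installers. The helper name suggest_install_command is made up for illustration, and the assumption that your VM uses either yum or apt-get may not hold; check what your own OS uses.

```shell
# Print (without running) an install command for the package named in $1.
# suggest_install_command is a hypothetical helper, not part of any system.
suggest_install_command() {
    if command -v yum >/dev/null 2>&1; then
        # Red Hat family systems such as CentOS
        echo "sudo yum install $1"
    elif command -v apt-get >/dev/null 2>&1; then
        # Debian family systems such as Ubuntu
        echo "sudo apt-get install $1"
    else
        echo "No yum or apt-get found; search for your OS's installer"
    fi
}

suggest_install_command subversion
```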

To check that it worked, open a terminal and type:
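A minimal sketch of this check, assuming the client was installed under its usual name svn (the fallback message is only there for illustration):

```shell
# Ask the svn client to report its version (two hyphens before "version").
# If svn cannot be found, print a short message instead.
svn --version || echo "svn is not installed or not in your path"
```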

Note the double hyphen. If the command prints version information for svn to the screen, then you've successfully installed subversion and the svn command is now in your path.

What is subversion?

Subversion is a rather large and elaborate software ecosystem that includes both server and client components. You only need to install the subversion client, a command-line program called "svn". It has many commands for interacting with what's called a subversion "repository."

Subversion is one of several popular version control systems. Others you will hear of include:

  • CVS (old, few people still use it)
  • git (newer, people like it because you can clone repositories (called "repos") with great ease)
  • mercurial (also newer, people like it because it makes merging and branching easy)

You'll start by installing and learning to use subversion because soon you'll be working with version-controlled data from the Loraine Lab subversion repository.

Example: How Dr. Loraine installed a program on a VM used by the Loraine lab

First, she logged into the VM. Next, she ran a command to find out what version of UNIX her VM was running. The answer was: CentOS release 5.6 (Final).
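A minimal sketch of such a check, assuming a Red Hat family system such as CentOS, where the release string is stored in the file /etc/redhat-release. This is an assumption about how the answer could be obtained, not necessarily the command she ran.

```shell
# On CentOS and other Red Hat family systems this file holds the release string.
if [ -r /etc/redhat-release ]; then
    cat /etc/redhat-release
else
    # Fallback for other systems: print the kernel name and release.
    uname -sr
fi
```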

She then googled "how to install software on CentOS" and quickly learned that one uses a program called yum for this.

To then install an editor she likes called emacs, she did this:
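A likely form of that command is sketched below. It assumes the package is simply named emacs and that the account may use sudo; neither detail is confirmed by the example above.

```shell
# Assumed reconstruction, not necessarily the exact command used.
# yum is the CentOS package installer; installing needs root privileges.
if command -v yum >/dev/null 2>&1; then
    sudo yum install -y emacs   # -y answers "yes" to the confirmation prompt
else
    echo "yum is not available on this system"
fi
```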

Check out a repository using subversion

If all went well, you should now be able to run a subversion client program called svn, which lets you "check out" copies of subversion repositories onto your VM.

Ask Yourself: How do I find out if a program is in my path? Hint: Which command lets you find out if a program is in your path, as well as where it is installed on the system? Ask Google if you don't already know the answer!
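One possible answer, sketched with the standard which utility (the fallback message is only for illustration):

```shell
# which prints the full path of a program if it is found in your path.
which svn || echo "svn is not in your path"
```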

For future assignments, you will work with data from the Integrated Genome Browser QuickLoad repository, which contains genomic data and annotations for several plant and animal genomes. You'll use these data to perform bioinformatics analyses and get experience using version control and other useful tools for reproducible research.

However, for today, all you need to do is get a copy of part of the repository.

Before you begin, look over the Red Bean Press book Version Control with Subversion, available for free on-line here: http://svnbook.red-bean.com/. You don't need to read the entire book today, but give it a quick review so that the material will be somewhat familiar to you later when you learn how to use subversion in earnest. (That's in Week Seven.) Don't worry if you don't understand much of this yet - just focus on getting the overview concepts:

  • A subversion server program runs on a host machine and provides access to files contained in its repositories
  • Repositories are collections of version-controlled files and a history of everything that has ever been done to them
  • You use the subversion client program svn to interact with the repository. Using svn, you can
    • "check out" copies of the repository (you'll do this today)
    • commit changes to files and increase the version number of the repository (you'll do this later)
    • easily make multiple copies of a repository on different computers (you'll do this today and later)

Today, change into your Apache DocumentRoot directory (see previous assignment) and check out a copy of part of the repository.

Today you won't get a full copy of the data, which is around 8 GB in total.

The iPlant base image VM you are running does not have sufficient disk space to accommodate the entire repository, and so you'll only get a part of it today.

The checkout command may take some time. If you lose your connection to the VM while the command is executing, it will not finish. In other words, if the login shell terminates while the svn command is running, both the shell and the svn command will terminate.

Most UNIX systems come with a pre-installed program called "screen" that will launch a new shell that continues running even when its parent shell process terminates.
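A minimal sketch of how screen can keep a job alive, assuming screen is installed. The session name checkout and the sleep command are placeholders standing in for a long-running svn checkout.

```shell
if command -v screen >/dev/null 2>&1; then
    # Start a detached session named "checkout" that runs a command.
    screen -dmS checkout sleep 30
    # List sessions; the new one should be shown as detached.
    screen -ls || true
else
    echo "screen is not installed on this system"
fi
```

To work interactively instead, run screen -S checkout, start the long command inside the new shell, detach with Ctrl-a d, and reattach later with screen -r checkout.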

To learn about screen, read this:

To check out genome annotations and sequence for the latest revision of the rice (O. sativa) genome, do this:
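The repository address is not reproduced here, so REPO_URL in the sketch below is a placeholder you must set yourself. The directory name is borrowed from the example URL in the turn-in instructions; whether it is the exact path inside the repository is an assumption.

```shell
# Placeholder: set REPO_URL to the repository address given in class.
REPO_URL="${REPO_URL:-}"

if [ -z "$REPO_URL" ]; then
    echo "REPO_URL is not set; nothing checked out"
elif ! command -v svn >/dev/null 2>&1; then
    echo "svn is not installed or not in your path"
else
    # Check out only the rice directory, not the whole repository.
    # Run this from inside your Apache DocumentRoot directory.
    svn checkout "$REPO_URL/O_sativa_japonica_Oct_2011"
fi
```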

Question: Should the files you just checked out now be visible on your VM's Web site? Why, or why not?

Your iPlant VM login prompt probably looks something like this:

user@vm142-90

Your iPlant VM's Web site can then be reached via the URL http://vm142-90.iplantcollaborative.org

Allowing your Web site to list directory contents with .htaccess

Some systems by default configure Apache (via the httpd.conf file) to not allow users to view directory listings. That is, when users visit a URL like http://example.com/folder and the folder doesn't contain a file called index.html, they will see an "access denied" message.

This is meant to provide some extra security for the Web site because often people leave files no-one should see in folders of Web sites, thinking (wrongly) that no-one can see them because nothing links to them.

Your VM may be configured to prevent directory listings. To change this behavior on a directory-by-directory basis, you can create an .htaccess (pronounced "dot-htaccess") file and save it in the directory whose properties you want to change.

Do a Google search with query string ".htaccess allow directory listing" (or something similar) to find out more.
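As a starting point, here is a minimal sketch that writes a one-line .htaccess file. It uses a scratch directory (the variable name target_dir is made up for illustration) so that nothing real is overwritten; on your VM you would write the file into the checked-out directory instead. The directive only takes effect if httpd.conf allows .htaccess files to override Options for that directory.

```shell
# Scratch directory for illustration; replace with your real directory.
target_dir="$(mktemp -d)"

# "Options +Indexes" asks Apache to generate a listing for this directory
# when no index.html is present.
printf 'Options +Indexes\n' > "$target_dir/.htaccess"

# Show what was written.
cat "$target_dir/.htaccess"
```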

How to turn it in

For full credit, post the URL of your checked-out copy of the rice repository on the class Yahoo group. It should look something like

http://vm123-45.iplantcollaborative.org/O_sativa_japonica_Oct_2011

Check that the directory contents can be listed. If they can't, add a .htaccess file.

That's all for today!

