Introduction

Last week, you learned how to write and combine functions interactively using the Python interpreter.

This week, you'll build on this skill while also developing a new one: the ability to write your own tools (in Python) for the UNIX environment.

Also this week, you'll learn how to use Subversion, a version control system used both to manage code and to ensure reproducibility in bioinformatics research. From now on, when you turn in your work, you'll use svn (the Subversion client) to add files to your personal space within a shared repository (called a "repo").

Version control in bioinformatics helps ensure reproducibility

Software development projects use version control systems like Subversion, CVS, or Git to manage source code files and integrate the efforts of many developers. In bioinformatics and other scientific disciplines, we use version control systems not only for source code management but also as a way to track analyses, data files, manuscript drafts, and other resources.

Version control systems are becoming the main tool bioinformatics developers use to share their work and ensure that their analyses are reproducible. However, this is a fairly new development, and many labs have not yet discovered the power of version control in ensuring the longevity and reproducibility of their research.

How it works in practice

The classic scientific workflow in bioinformatics is that a collaborator (or your own lab) produces a data set from a high-throughput experiment, such as a microarray experiment or maybe a genome-wide association study.

As the bioinformatics expert, you assist with the analysis and, along the way, write code in various languages (Python, R, or something else) to process the data, create tables and figures illustrating the results, and so on. Before you start, you create a new "repo" (repository) to contain the data files you receive from the lab, the source code you write, and documentation of the analysis. As you proceed, you add files to the repo one by one, providing detailed log messages describing each file and how you made it. Later, when you deliver results files (figures or tables) to others, you include those in the repo as well. When you discover bugs or decide to change some aspect of your analysis, you check in the new versions of the files. At any time, you can retrieve older versions of any file, compare old files to new ones, and so on. So if you discover an error or think of a better way to run an analysis, you can easily make changes without losing track of your previous work.
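
As a rough illustration, one pass through this workflow might look like the Python sketch below. It is a sketch under assumptions, not the class's required procedure: the repository URL, directory name, file name, and log message are made up, and workflow_commands and run are helper names invented for this example. The svn subcommands themselves (checkout, add, commit, log, diff, cat) are standard. By default the sketch only prints the commands instead of running them.

```python
import os
import shlex
import subprocess


def workflow_commands(repo_url, workdir, filename, message):
    """Build the svn commands for one pass through the workflow.

    All four arguments are hypothetical values supplied by the caller.
    Each command is returned as a list of arguments.
    """
    path = os.path.join(workdir, filename)
    return [
        # Get a working copy of the repo in a local directory.
        ["svn", "checkout", repo_url, workdir],
        # Put a new file under version control.
        ["svn", "add", path],
        # Check the file in, with a log message describing it.
        ["svn", "commit", "-m", message, path],
        # Review the history of the file.
        ["svn", "log", path],
        # Compare the previous committed version with the newest one.
        ["svn", "diff", "-r", "PREV:HEAD", path],
        # Print the file as it was in the previous committed version.
        ["svn", "cat", "-r", "PREV", path],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical repo URL and file names, for illustration only.
    commands = workflow_commands(
        "https://svn.example.org/class",
        "class",
        "normalize_arrays.py",
        "Add script that normalizes the microarray data",
    )
    run(commands)  # dry run: prints the commands without touching any repo
```

Calling run(commands, dry_run=False) would execute the commands for real, which requires svn to be installed and a repository to exist at the given URL.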

Many labs don't use it (because they don't know how)

A lot of labs - even bioinformatics labs! - don't use version control. The reason is usually just that they don't know how it works, or even that it exists. Version control was originally developed as a tool for software development, and many groups are not aware of how it can assist with bioinformatics.

As a result, many analyses published by bioinformatics researchers are difficult or impossible to reproduce. That is, the analysis cannot easily be re-run because it is hard to determine which versions of key data files or scripts were used.

This is a particularly challenging problem when a group submits a research article for publication. During the review process, which can take months, the group typically continues to modify and refine the methods and software tools it developed to run the analyses. If the paper comes back from review with requests from editors or reviewers to change the analysis, the lab may find that it cannot re-create some or even all of the work in the original manuscript because the code and data files used are no longer available.

Even worse, not being able to track versions of data and code can lead to errors, and in some cases it has ruined careers, because it is often hard to tell the difference between scientific fraud and an honest mistake arising from difficulties with managing data.

Which means: an opportunity for you

In some ways, this is good news for you: if you master a version control system and use it to manage your code and analyses, you will have a valuable skill that helps researchers improve the impact and importance of their work. It will also improve your ability to get a job or gain admission to a high-quality graduate program.

So, for the rest of this class, you'll do your work using Subversion to access and manage your files in a shared class "repo."

Read and/or Watch

Assignments
