- Getting started
- R on-line help system
- R is an overgrown calculator
- R is an overgrown calculator - but you can save your results using variables.
- Vectors, vectorized arithmetic, and optional arguments.
- Modes and lists.
- Lists and data frames
- A linear regression refresher
- Using a model to predict values
The goal of this assignment is for you to get familiar with the R statistical programming environment and R language syntax.
What is R?
R is a powerful environment for statistical computing. You will see R again in other classes and you will no doubt use R again many times if you continue in bioinformatics.
R is a free, open-source implementation of the S statistical programming language; a commercial implementation, S-PLUS, was formerly sold by Insightful. One nice feature of R is that it is free and community-supported. You can install it on as many machines as you like, without worrying about licensing issues.
Books on R
There are many outstanding R textbooks and reference books. Books I've used and liked include:
- Introductory Statistics with R by Peter Dalgaard. Good for beginners.
- The Art of R Programming by Norman Matloff. Good for beginners who also have some computer science or programming experience, esp. with C.
- A "rough and partial draft" of the text is available from the author's Web site.
- See: http://heather.cs.ucdavis.edu/~matloff/145/PLN/RMaterials/145R.pdf
- Bioconductor Case Studies by Hahne, Huber, Gentleman, and Falcon. Follow the link to the Web supplement, which includes code chunks.
Visit the R project Web site - http://www.r-project.org and look around.
There are two major sites for R - the r-project.org site (just one) and numerous identical CRAN mirror sites. CRAN stands for Comprehensive R Archive Network and offers links to instructions for downloading and installing R for a variety of platforms.
Then take a look at the Bioconductor Web site: http://www.bioconductor.org. Bioconductor is a large collection of packages designed mostly for the analysis of microarray and sequence data.
Launch the R interpreter
Start the R interpreter, a program that evaluates R commands interactively. How you do this will vary depending on the platform you are using. In Windows, you just double-click the "R" icon or launch R using the "Start" menu. On Unix systems with R installed, open a terminal window and type "R" at the command prompt.
On Apple computers, you can install R Aqua, a GUI-based version of R similar to the version available on Windows. You can also install a version of R that runs in an X server or in the Terminal. Either one should work fine for this assignment.
Once you've started R, you'll see a ">" symbol at the left of the interpreter window (Windows) or terminal (Unix). This is the R command prompt. To use R, you type a command into the R interpreter and hit return, and R then executes it and shows you the result right away. This is very similar to how Python operates.
The R language contains many "built-in" functions (commands) for manipulating and analyzing data as well as for navigating your computer's file system. Fortunately, all of these functions are documented as part of the on-line help functionality packaged with each R distribution. (More on this later.)
If you are doing a lot of complicated things, it is usually a good idea to keep a text editor (like emacs) open in another window so that you can save your commands. To get R to execute the commands you've saved in a file, you can use the "source" command like so:
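A minimal sketch of the idea. Here the file is first written to a temporary directory so that the example runs anywhere, and the two commands inside it are made up for illustration:

```r
# Create a small file of R commands (its contents are invented for this example).
afile <- file.path(tempdir(), "afile.R")
writeLines(c("x <- 6 * 7", "print(x)"), afile)

# Run every command in the file, as if you had typed them at the prompt.
source(afile)
```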
where "afile.R" is a file containing the R commands you would like to execute.
One neat feature of R's "source" command is that if your computer is connected to the Internet, you can even run commands contained in files at remote locations. To see how this works, type this into the R interpreter window - don't type the ">" symbol, which is the R prompt.
To quit R, type q() at the prompt.
R will then ask you if you want to save your data environment. If you answer "yes," the next time you launch the R interpreter, all the variables you defined previously will be available to you, and you can start up again exactly where you left off.
R on-line help system
The R interpreter contains built-in commands for manipulating and analyzing data that you import from files or create "randomly" using R's pseudo-random number generators (very useful for simulation). Fortunately, these commands are documented as part of the on-line help functionality packaged with each R distribution. Less fortunately, the documentation is not always easy for a beginner to understand. However, the more you use the help pages, the easier it will become for you to master the language. So even if the "help" pages seem confusing at first, you should make a point of taking time to read these pages carefully for each new command you use.
Start the on-line help function. To do this, type help.start() at the prompt.
Note that invoking a command in R requires you to type the name of the command followed by a pair of parentheses. Any arguments go inside the parentheses; in this case, you have invoked "help.start()" with no arguments.
The "help.start" command launches a Web browser which shows a page containing the table of contents for the on-line help system. To view the help page for the rnorm command, which lets you generate random numbers sampled from the normal distribution, you would type something like:
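Either of the following should bring up the page; the question-mark form is just a shorthand for help:

```r
help(rnorm)   # open the help page for rnorm
?rnorm        # the same thing, using the shorthand
```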
You can also see these commands in action by typing something like:
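One way to do this, assuming the help page for the command has an "Examples" section (most do), is the example command:

```r
example(rnorm)   # runs the code from the "Examples" section of the rnorm help page
rnorm(5)         # or call it yourself: five random draws from the standard normal
```

Because the numbers are random, you will get different values each time you run rnorm.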
QUESTION 1: How do you use the help command to search for functions that you don't already know the name of? Hint: help(help) and go to the end of the page to the "See also" section.
R is an overgrown calculator
As with Python, you can use R as a calculator. When you type a command and hit return, the R interpreter parses the command and then executes (evaluates) it, printing the result to the screen. If the command syntax is incorrect, R prints an error message instead of computing any output.
Try it out.
The number of codons in the genetic code:
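A codon has three positions and each position can hold one of four bases, so something like this does the job:

```r
4^3    # 4 bases, 3 positions: prints [1] 64
```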
The log, base 2, of 256:
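Two equivalent ways to compute it, using either the dedicated log2 function or the base argument of log:

```r
log2(256)            # 8
log(256, base = 2)   # 8
```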
To learn about other mathematical functions:
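The help pages below are one reasonable place to start; which topics to look at is my suggestion, not the only choice:

```r
help("Arithmetic")   # +, -, *, /, ^, %% and %/%
help("log")          # logarithms and exponentials
help("Trig")         # sin, cos, tan and friends
```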
R is an overgrown calculator - but you can save your results using variables.
As with Python, a variable in R is a symbol that represents a value. To save output to a variable name, use the assignment operator: either "<-" (a less-than symbol followed by a hyphen) or an equals sign (=).
Try out these simple examples:
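For instance (the variable names here are invented for illustration):

```r
n_bases <- 4                        # assign using the arrow
codon_length = 3                    # assign using the equals sign
n_codons <- n_bases^codon_length    # the right-hand side can be any expression
```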
Note that when you use the assignment operator, nothing is printed to the screen. The reason is that assignment changes the current R environment and returns its value invisibly.
If you want to see the value of what you have created, you simply type the name of the variable at the prompt like so:
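A small sketch, using a made-up variable name:

```r
n_codons <- 4^3   # nothing is printed here
n_codons          # typing the name prints [1] 64
```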
To see a list of the variables (objects) you have created thus far, use the ls command.
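For example, after creating two variables (names chosen only for illustration), both should show up in the listing:

```r
a <- 1
b <- "two"
ls()    # the result includes "a" and "b", plus anything else you have defined
```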
QUESTION 2: What does the "dir" command do? How is it different from "ls" ?
You should choose variable names that are easy for you to remember and which are not already assigned to built-in variables or functions.
To get rid of a variable, use the rm command. Try this:
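Something along these lines, with an invented variable name:

```r
codons <- 64
rm(codons)          # remove the variable
exists("codons")    # FALSE: R no longer knows about it
```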
Vectors, vectorized arithmetic, and optional arguments.
To create a vector (an ordered collection of elements), use the "c" function, which combines its arguments.
Create vectors: z = c(4,8,10) and v = c(-1,2,5). Note what happens with:
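The particular expressions below are my own choice of illustration; the point is that arithmetic is applied element by element:

```r
z = c(4, 8, 10)
v = c(-1, 2, 5)
z + v    #  3 10 15
z * v    # -4 16 50
z / v    # -4  4  2
z * 2    #  8 16 20 (the single value 2 is reused for every element)
```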
You can change the values contained in a vector using subscripting notation ([ ]) and the assignment operator. Try this:
Note that R contains an object called NA which stands for "data not available". A data container like a vector or a list can contain multiple NA values. However, performing some statistical functions on vectors or lists containing NA values can yield results you may not expect, as in the example above.
Many commands in R (as with Unix) can take optional arguments. These optional arguments (also called flags) modify the behavior of the command. For example, mean computes the arithmetic average of its first argument, and the na.rm=T flag tells it to ignore missing values (NA) in the calculation.
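A sketch of the kind of thing to try, assuming the vector v defined earlier and using NA as the replacement value:

```r
v <- c(-1, 2, 5)
v[2] <- NA    # replace the second element with a missing value
v             # -1 NA  5
mean(v)       # NA, because one of the values is missing
```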
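A short illustration, with a small vector made up for the purpose:

```r
v <- c(-1, NA, 5)
mean(v)                 # NA
mean(v, na.rm = TRUE)   # 2, the average of -1 and 5
```

Writing TRUE in full is a little safer than the abbreviation T, since T is an ordinary variable that can be reassigned.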
QUESTION 3: What option tells the 'mean' command to compute a trimmed average? When are trimmed averages useful? (To answer this question, do a Web search with keywords "trimmed mean".)
To define a vector containing character values, you would enter each character value using double or single quotes. Try this:
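For instance (the names and values are invented for illustration):

```r
bases <- c("A", "C", "G", "T")                  # double quotes
axis_labels <- c('height (in)', 'weight (lb)')  # single quotes work too
bases
length(bases)    # 4
```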
Character vectors are very useful when creating x and y-axis labels, or when you are working with sequence data.
Type in the following commands and note the result:
QUESTION 4: How does R deal with an attempt to create a vector with mixed types? To find out, try this:
You can use R's "seq" command to create new vectors of numeric values:
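A couple of illustrative calls, using the by and length.out arguments of seq:

```r
seq(1, 10)                   # 1 2 3 4 5 6 7 8 9 10
seq(1, 10, by = 2)           # 1 3 5 7 9
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
```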
R also provides a shortcut syntax for creating a sequence:
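The shortcut is the colon operator:

```r
1:10    # 1 2 3 4 5 6 7 8 9 10
```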
which is equivalent to "seq(1,10)"
Subscripting in R works as you might expect - square brackets operators allow you to extract values:
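A few examples with a small made-up vector. One thing to keep in mind if you are coming from Python is that R counts from 1, not 0:

```r
z <- c(4, 8, 10)
z[1]          # 4 (the first element)
z[2:3]        # 8 10
z[c(1, 3)]    # 4 10
```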
However, you can also insert logical expressions in the square brackets to retrieve subsets of data from a vector or list. This is one of the most powerful (and useful) features of R.
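A minimal sketch with an illustrative vector. The expression inside the brackets produces a vector of TRUE and FALSE values, and only the TRUE positions are kept:

```r
z <- c(4, 8, 10)
z > 5       # FALSE  TRUE  TRUE
z[z > 5]    # 8 10
```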
And to negate:
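Assuming the same kind of logical test as before, the "!" operator flips TRUE and FALSE:

```r
z <- c(4, 8, 10)
z[!(z > 5)]    # 4
```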
QUESTION 5: What command would you use to select the components of v that are not missing (NA) values? Use the "v" below to test your answer:
Modes and lists.
Objects and variables can have different modes, which are similar to types in Python. To determine the mode of a given object, use the "mode" command.
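You can call it on a literal value or on a variable. The objects below are arbitrary examples; try them and see what comes back:

```r
x <- 3.14
mode(x)
mode("ACGT")
mode(TRUE)
```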
QUESTION 6: What do the following commands return?
Vectors in R are like arrays in other languages, where only objects of the same mode can co-exist within the same vector. Thus, if a vector contains strings, it cannot also contain numbers. However, R also has a list data type that lets you store objects of any mode, including other lists. For example:
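A sketch with invented contents, using the name l for the list:

```r
l <- list()                   # start with an empty list
l[[1]] <- c(4, 8, 10)         # a numeric vector
l[[2]] <- "ACGT"              # a character string
l[[3]] <- list(TRUE, 3.14)    # another list

l[[2]]    # double brackets: the object itself, "ACGT"
l[2]      # single brackets: a list of length 1 that contains "ACGT"
```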
In this example, we first create a new list. Next, we add some data to it, using the double-square-brackets subscripting operator. When we access data stored in the list l, using the double brackets retrieves just the object stored at the requested location. When we access data using the single brackets, we get an object that is itself a list, a sub-list of the original list l.
QUESTION 7: How does this behavior compare with Python? Does Python bother with having two different list-like data structures? Why do you suppose that R has both vectors and lists?
QUESTION 8: Use the help function to determine the command you would use to count the number of items in a vector. Does the same command work on lists? Does the Python len function work similarly? What is it called when you can use the same command on objects of many different types?
A matrix in R is just a two-dimensional array of values that all have the same type. A convenient way to create a matrix is to use the matrix function:
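A minimal sketch with made-up values, showing the nrow and byrow arguments:

```r
m <- matrix(1:6, nrow = 2)                   # filled column by column (the default)
m
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row
m_by_row
dim(m)            # 2 3 (rows, columns)
m[1, 2]           # 3
m_by_row[1, 2]    # 2
```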
Create a vector using the following command:
Read the R help section describing the matrix command.
QUESTION 9: What commands would you use to convert the vector v that you defined above to the following matrices?
Lists and data frames
Data frames are similar to matrices in that vector components of data frames must all be the same length. However, unlike matrices, data frames are actually lists (of class "data.frame") and can therefore contain objects of different modes.
The base R package comes with a number of built-in data sets which you can use to try out some of the features of data frames. To view a list of these built-in data sets, use the "data" command:
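Called with no arguments, it lists everything available:

```r
data()    # lists the data sets in the packages that are currently loaded
```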
Load the "women" data frame in your R session and view the data:
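Something like the following should do it:

```r
data(women)    # load the data set into your session
women          # print the whole data frame
dim(women)     # 15 2 (15 rows, 2 columns)
```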
Note that the columns represent variables (height and weight) while the rows represent individual women – samples. You can access a vector containing the women's heights using the following notations:
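The forms below all return the same vector of heights; the last line is an extra illustration of selecting rows:

```r
women$height         # by name, with the dollar sign
women[["height"]]    # by name, with double brackets
women[, 1]           # all rows, first column
women[, "height"]    # all rows, column called "height"
women[1:3, ]         # rows 1 to 3, all columns
```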
Note how the [n,m] syntax retrieves values from row(s) n and column(s) m. Note also that the values separated by the comma in [n,m] can be sequences, individual values, or no values at all, in which case all rows or all columns are returned.
A linear regression refresher
You can fit a linear model (predicting weight based on height) to these data using the "lm" command like so:
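A sketch of the fit, assuming the name model for the result:

```r
attach(women)                   # makes height and weight visible by name
model <- lm(weight ~ height)    # read the formula as "weight modeled by height"
summary(model)                  # coefficients, residuals, R-squared and more
```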
Note that the attach command allows us to refer to columns in the women data frame directly. (This will be important for the predict function in the next section.)
Use the 'cor' command to compute Pearson's correlation coefficient:
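For example, referring to the columns through the data frame so that the command works whether or not women is attached:

```r
r <- cor(women$height, women$weight)    # Pearson is the default method
r
```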
QUESTION 10: Which value reported by the summary command is equivalent to the square of the correlation coefficient?
Using a model to predict values
Once you have created a linear model, you can use it to predict values for the dependent variable (weight in this case) for a given value of the independent variable (height).
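A minimal sketch using the predict command. The heights below are hypothetical values picked for illustration, and the column in the new data frame has to be called height so that it matches the model formula:

```r
model <- lm(weight ~ height, data = women)

# Hypothetical heights (in inches) for which we want predicted weights.
new_heights <- data.frame(height = c(60.5, 66.5))

p <- predict(model, newdata = new_heights)
p

# The same predictions, with lower and upper prediction limits added.
predict(model, newdata = new_heights, interval = "prediction")
```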
Read this tutorial on predicting values from a linear model:
- Prediction Interval for Linear Regression http://www.r-tutor.com/elementary-statistics/simple-linear-regression/prediction-interval-linear-regression
QUESTION 11: Use your model to fill in the following table.