R programming language

From Christoph's Personal Wiki
Revision as of 06:52, 1 September 2006 by Christoph

The R programming language (or just "R"), sometimes described as "GNU S", is a mathematical language and environment used for statistical analysis and display. It was originally created by Ross Ihaka and Robert Gentleman (hence the name R) at the University of Auckland, New Zealand, and is now steadily developed further by a large community around the world.

It is based upon S, which was developed by John Chambers of Bell Laboratories and described in the paper "Evolution of the S Language" [1]. R is considered by its developers to be an implementation of S, with semantics derived from Scheme. The commercial implementation of S is S-PLUS [2].

R's source code is freely available under the GNU GPL. There are several GUIs for R, including RKWard, SciViews-R [3], and Rcmdr [4]. Many editors have specialised modes for R, including Emacs (Emacs Speaks Statistics), jEdit [5], Kate (text editor) [6], and Tinn [7], and there is an R plug-in for the Eclipse IDE framework.

R is highly extensible through packages, which are user-submitted libraries for specific functions or specific areas of study. A core set of packages is included with the installation of R, and many more are available from the Comprehensive R Archive Network (CRAN). The bioinformatics community has seeded a successful effort to use R for the analysis of data from molecular biology laboratories: the Bioconductor project, started in the fall of 2001, provides R packages for the analysis of genomic data, e.g. object-oriented data handling and analysis tools for Affymetrix and cDNA microarrays.
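As a rough illustration of working with packages, the sketch below attaches a package that ships with R and lists the base packages of the current installation. The name "somepackage" is a hypothetical placeholder, and the install line is left commented out because it needs network access to CRAN.

```r
# install.packages("somepackage")   # hypothetical package name; downloads from CRAN
library(stats)                       # attach a package that ships with R

# Names of the base-priority packages in this installation
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)
```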

Installation

Installing R on SuSE 10.1 using the default settings for the rpm or source distribution can fail. Below are the methods I have used to resolve these problems.

First make sure you have the following installed (check http://www.rpmfind.net for the packages):

compat-g77
compat-gcc
gcc-g77

It also sometimes helps to create a soft link to gfortran like so (changing the directory to suit your needs):

ln -s /usr/bin/g77 /usr/bin/gfortran

Then, and this is important, add the following to your config.site (found in your R source directory):

FPICFLAGS=-g

Now you are ready to install R on SuSE:

./configure
make
make check
make pdf     # optional
make info    # optional
make install # as superuser ('root')

That's it. You are now ready to use R.

Comparison with other programs

Although R is mostly used by statisticians, and other people in need of statistics, it can also be used as a general matrix calculation toolbox, much like GNU Octave or its proprietary counterpart, MATLAB.
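To give a feel for the matrix-toolbox use, here is a minimal sketch that solves a small linear system with base R. The matrix A and vector b are made-up illustration values, not taken from this page.

```r
# Solve A %*% x = b for x (values chosen only for illustration)
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # filled column by column
b <- c(3, 5)

x <- solve(A, b)     # solution of the linear system
print(x)
print(A %*% x)       # matrix product; should reproduce b
print(t(A))          # transpose of A
```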

It should not be confused with the R package [8], a collection of programs for multidimensional and spatial analysis available on Macintosh and VAX/VMS systems.

Basics

How to get help:

  • help.start() #Opens browser
  • help() #For more on using help
  • help(..) #For help on ..
  • help.search("..") #To search for ..
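A small sketch of the help tools, using mean as an arbitrary example topic. The interactive calls are shown as comments because they open a help page; apropos() is an additional base function, not mentioned above, that lists objects whose names match a pattern.

```r
# Interactive help calls (each opens a help page, so they are comments here):
#   help(mean)              # same as ?mean
#   help.search("median")   # same as ??median

# apropos() lists objects on the search path whose names match a pattern
hits <- apropos("^mean")
print(hits)
```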

How to leave again:

  • q() #Image can be saved to .RData

Basic R commands

Most arithmetic operators work as you would expect in R:

> 4 + 2 #Prints ‘6’
> 3 * 4 #Prints ‘12’

Operators have precedence as known from basic algebra:

> 1 + 2 * 4 #Prints ‘9’, while
> (1 + 2) * 4 #Prints ‘12’
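Beyond + and *, a few other operators are worth trying. The following is a minimal sketch with arbitrary numbers; the names inside the vector are only labels for the printout.

```r
results <- c(
  power     = 2^3,      # exponentiation
  remainder = 7 %% 3,   # modulo
  int_div   = 7 %/% 3,  # integer division
  neg_sq    = -2^2      # ^ binds tighter than unary minus, so this is -(2^2)
)
print(results)
```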

Functions

A function call in R looks like this:

  • function_name(arguments)
  • Examples:
> cos(pi/3) #Prints ‘0.5’
> exp(1) #Prints ‘2.718282’

A function call is identified in R by the parentheses:

  • That’s why it’s: help(), and not: help (typing a function's name without parentheses prints its definition instead of calling it)
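Arguments can also be passed by name, which the examples above do not show. A short sketch using two base functions, round() and log(), with arbitrary input values:

```r
r1 <- round(3.14159, digits = 2)       # second argument given by name
r2 <- log(8, base = 2)                 # logarithm of 8 to base 2
r3 <- round(digits = 2, x = 3.14159)   # named arguments may come in any order

print(c(r1, r2, r3))
print(is.function(round))              # the name alone refers to the function object
```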

Variables (objects) in R

To assign a value to a variable (object):

> x <- 4 #Assigns 4 to x
> x = 4 #Also assigns 4 to x (‘=’ is the newer alternative to ‘<-’)
> x #Prints ‘4’
> y <- x + 2 #Assigns 6 to y

Functions for managing variables:

  • ls() or objects() lists all existing objects
  • str(x) tells the structure (type) of object ‘x’
  • rm(x) removes (deletes) the object ‘x’
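The management functions can be tried out as follows. This is only a sketch: the object names demo_value and demo_sum are made up, and exists() is an extra base function used here to check whether an object is still defined.

```r
demo_value <- 4
demo_sum <- demo_value + 2

print(ls())      # names of the objects defined so far
str(demo_sum)    # prints: num 6

before <- exists("demo_value", inherits = FALSE)   # TRUE: the object is defined
rm(demo_value)                                      # delete it
after <- exists("demo_value", inherits = FALSE)    # FALSE: it is gone
```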

Vectors

A vector in R is an ordered sequence of elements of the same mode (for example, all numeric or all character).

> x <- 1:10 #Creates a vector
> y <- c("a","b","c") #So does this

Handy functions for vectors:

  • c() – Concatenates arguments into a vector
  • min() – Returns the smallest value in vector
  • max() – Returns the largest value in vector
  • mean() – Returns the mean of the vector

Elements in a vector can be accessed individually:

> x[1] #Prints first element
> x[1:10] #Prints first 10 elements
> x[c(1,3)] #Prints elements 1 and 3
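Index vectors can also be negative or logical. A minimal sketch with made-up values:

```r
x <- c(10, 20, 30, 40, 50)

all_but_first <- x[-1]       # a negative index drops that element
large <- x[x > 25]           # a logical vector keeps the elements where it is TRUE

print(all_but_first)
print(large)
print(length(x))             # number of elements
```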

Most functions expect one vector as argument, rather than individual numbers:

> mean(1,2,3) #Replies ‘1’ (only the first argument is averaged)
> mean(c(1,2,3)) #Replies ‘2’

The Recycling Rule

The recycling rule is a key concept for vector algebra in R.

When a vector is too short for a given operation, the elements are recycled and used again.

An example of a vector that is too short:

> x <- c(1,2,3,4)
> y <- c(1,2) #y is too short
> x + y #Returns ‘2,4,4,6’
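Two further cases of recycling, sketched with arbitrary values: a single number is recycled across the whole vector, and a length that does not divide evenly still works but makes R issue a warning (suppressed here to keep the output clean).

```r
x <- c(1, 2, 3, 4)

scaled <- x * 2                                  # 2 is recycled four times
uneven <- suppressWarnings(x + c(10, 20, 30))    # 10 is reused for the 4th element

print(scaled)
print(uneven)
```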

Data

All simple numerical objects in R behave like vectors of numbers. In fact, even a single value such as x <- 1 is a vector with one element.

The functions dim(x) and str(x) return information on the dimensions and structure of x.
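A quick way to check these statements is sketched below. The objects x and m are made up for the illustration; length() and is.vector() are additional base functions.

```r
x <- 1
print(length(x))       # a single number is a vector of length 1
print(is.vector(x))
print(dim(x))          # plain vectors have no dim attribute

m <- matrix(1:6, nrow = 2)
print(dim(m))          # rows, then columns
str(m)                 # compact summary of type, dimensions and first values
```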

Important Objects

  • vector – “A series of numbers”
  • matrix – “Tables of numbers”
  • data.frame – “More ‘powerful’ matrix (list of vectors)”
  • list – “Collections of other objects”
  • class – “Objects (often lists) tagged with a class attribute that functions such as print() act on”
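The object types listed above can be created as in the following sketch. All names and values are invented for the example.

```r
v  <- c(1.5, 2.5, 3.5)                                  # vector
m  <- matrix(1:6, nrow = 2)                             # matrix
df <- data.frame(id = 1:3, label = c("a", "b", "c"))    # data frame
l  <- list(numbers = v, table = m)                      # list holding other objects

print(class(df))       # the class attribute of the data frame
print(is.list(df))     # a data frame is itself a list of column vectors
str(l)
```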

Data Matrices

Matrices are created with the matrix() function.

> m <- matrix(1:12,nrow=3)

This produces something like this:

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

The recycling rule still applies:

> m <- matrix(c(2,5),nrow=3,ncol=3)

Gives the following matrix:

     [,1] [,2] [,3]
[1,]    2    5    2
[2,]    5    2    5
[3,]    2    5    2
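matrix() fills column by column unless told otherwise. The sketch below contrasts this with the byrow argument and shows the transpose function t(); the variable name m_byrow is made up.

```r
m       <- matrix(1:12, nrow = 3)                  # filled column by column
m_byrow <- matrix(1:12, nrow = 3, byrow = TRUE)    # filled row by row

print(m)
print(m_byrow)
print(dim(t(m)))    # transposing swaps rows and columns
```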

Indexing Matrices

For vectors we could specify one index vector like this:

> x <- c(2,0,1,5)
> x[c(1,3)] #Returns ‘2’ and ‘1’

For matrices we specify two index vectors, one for the rows and one for the columns:

> m <- matrix(1:3,nrow=3,ncol=3)
> m[c(1,3),c(1,3)] #Returns a 2*2 matrix
> m[1,] #First row as vector
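A few more indexing forms, sketched on a matrix with distinct entries so the results are easy to follow. The drop argument is not mentioned above; it keeps the result a matrix instead of simplifying it to a vector.

```r
m <- matrix(1:12, nrow = 3)

single  <- m[2, 3]                 # one element: row 2, column 3
column2 <- m[, 2]                  # whole second column as a vector
row1    <- m[1, , drop = FALSE]    # first row, kept as a 1 x 4 matrix
big     <- m[m > 10]               # logical indexing works on matrices too

print(single)
print(column2)
print(dim(row1))
print(big)
```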

Beyond two dimensions

You can actually assign to dim():

> x <- 1:12
> dim(x) #Returns ‘NULL’
> dim(x) <- c(3,4) #3*4 Matrix
> dim(x) #Returns ‘3 4’
> dim(x) <- c(2,3,2) #x is now in 3d
> dim(x) #Returns ‘2 3 2’

But functions like mean() still work:

> mean(x) #Returns ‘6.5’
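Elements of a three-dimensional object are addressed with three indices. The sketch below builds the same 2*3*2 object with array(), a base function not used above, and picks out a few parts; apply() is used to sum each of the two slices.

```r
a <- array(1:12, dim = c(2, 3, 2))   # same data and shape as x after dim(x) <- c(2,3,2)

elem   <- a[1, 1, 2]       # first element of the second slice
last   <- a[2, 3, 2]       # very last element
slice1 <- a[, , 1]         # first slice: a 2 x 3 matrix
sums   <- apply(a, 3, sum) # sum over each slice along the third dimension

print(elem)
print(last)
print(slice1)
print(sums)
```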

Graphics and visualisation

Visualization is one of R’s strong points.

R has many functions for drawing graphs, including:

  • hist(x) – Draws a histogram of values in x
  • plot(x,y) – Draws a basic xy plot of x against y

Adding stuff to plots

  • points(x,y) – Add point (x,y) to existing graph.
  • lines(x,y) – Connect points with line.
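The plotting functions can be combined as in this sketch. It draws into a temporary PDF file through the pdf() device, which is an assumption made so that the example also runs without a display; the curves are arbitrary.

```r
out <- tempfile(fileext = ".pdf")    # temporary output file (illustration only)
pdf(out)                             # open a PDF device instead of a window

x <- seq(0, 2 * pi, length.out = 50)
plot(x, sin(x))                      # basic xy plot
lines(x, cos(x))                     # add a connected line to the same plot
points(pi, 0, col = "red")           # add a single point

invisible(dev.off())                 # close the device and write the file
```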

Graphical devices

A graphical device is what ‘displays’ the graph. It can be a window, a file, or a printer.

Functions for working with devices:

  • X11() – Opens a new plotting window; its arguments let you change the size of the window.
  • par(mfrow=c(x,y)) – Splits a plotting device into x rows and y columns.
  • dev.print(postscript, file="???.ps") – Copies the current plot to a PostScript file, i.e. saves the plot to a file.
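As a sketch of how a device and par(mfrow=...) work together, the following puts two plots side by side. It uses a pdf() file device and a temporary file name rather than X11(), which is an assumption made so the example does not need a display.

```r
out <- tempfile(fileext = ".pdf")      # temporary output file (illustration only)
pdf(out, width = 8, height = 4)        # file device; size given in inches

old <- par(mfrow = c(1, 2))            # 1 row, 2 columns; keep the old settings
hist(rnorm(100))                       # left panel
plot(1:10, (1:10)^2)                   # right panel
par(old)                               # restore the previous layout

invisible(dev.off())                   # close the device and write the file
```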

DNA Microarray Analysis - Example

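## Objects

x <- rnorm(30)            # 30 draws from the standard normal distribution

y <- x[x > 0]             # keep only the positive values

z <- x
z[z < 0] <- 0             # replace the negative values with 0

m <- matrix(x, nrow = 5)  # 5 x 6 numeric matrix
str(m)

d.f <- as.data.frame(m)   # the same data as a data frame
str(d.f)

m[2,2] <- "a"             # a matrix has one mode: every element becomes character
d.f[2,2] <- "a"           # a data frame converts only the affected column
str(m)
str(d.f)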


## Functions

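cube <- function(x) {
  z <- x*x*x
  return(z)
}

# Factorial of a non-negative integer x
fact <- function(x) {
  z <- 1
  for (i in seq_len(x)) {   # seq_len(0) is empty, so fact(0) is 1; 2:x would count down for x < 2
    z <- z * i
  }
  return(z)
}

func <- function(x, y) {
  z <- cube(x) - fact(y)    # e.g. func(3, 4) is 27 - 24 = 3
  return(z)
}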


## Graphics

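a <- rnorm(100)
b <- rnorm(100)

hist(a)                     # histogram of a

X11()                       # open a new plotting window (needs an X11 display)
plot(a, b)                  # scatter plot of b against a
# colour the points by quadrant
points(a[a<0 & b>0], b[a<0 & b>0], col="green")
points(a[a>0 & b>0], b[a>0 & b>0], col="red")
points(a[a>0 & b<0], b[a>0 & b<0], col="blue")
points(a[a<0 & b<0], b[a<0 & b<0], col="yellow")
lines(c(-10,10), c(0,0))    # horizontal line through the origin
lines(c(0,0), c(-10,10))    # vertical line through the origin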
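The object part of the script above ends by writing a character into a numeric matrix and into a data frame. The sketch below repeats that step on a small made-up example to show the difference: the matrix is converted as a whole, while the data frame changes only the affected column.

```r
m   <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2)   # small numeric matrix
d.f <- as.data.frame(m)                          # columns are named V1 and V2

m[2, 2]   <- "a"     # forces the whole matrix to character
d.f[2, 2] <- "a"     # converts only column V2

print(mode(m))
print(sapply(d.f, class))
```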

See also

External links