# DNA Microarray Analysis (academic course)/Dimension reduction

**PRINCIPAL COMPONENT ANALYSIS (PCA) AND CLUSTERING OF DNA MICROARRAY DATA**

This exercise will illustrate PCA and clustering analysis of data produced by DNA microarrays.

The exercise uses a leukemia dataset and several analysis methods in the R statistics package.

## Contents

## Dataset

The leukemia dataset contains gene expression levels for patients with high and low minimal residual disease (MRD). MRD is a measure of the number of leukemic cells that are present in the bone marrow after a period of treatment. MRD is therefore a measure for how well the patients respond to the given treatment. Patients with low MRD have a good response to treatment (and the cure rates for these patients are high), whereas patients with high MRD have a poor response to treatment (and a lower chance of long term survival).

The data set consists of 15 patients each analyzed using Affymetrix Focus Array GeneChips (8793 measured gene expressions for each patient). 9 of these 15 patients have a high MRD and 6 has low MRD.

In this study, an "experiment" refers to a patient, so "experiment" can be thought of as a patient.

## Method

R: Public domain data analysis package available for download for a number of different computer platforms at http://www.r-project.org/. There are several online manuals available: Kickstarting R and An Introduction to R (PDF file).

## Exercise

Read in Data First load the data into R:

Start R:

R

Check that your R directory is empty ( ls() will return characters(0) if directory is empty):

ls()

Load the data:

load("leukemiaMRD-data")

Check that you have loaded the data correctly:

ls()

You should now have the following objects loaded into R:

[1] "classes" "expnames" "Matrix"

Look at your data by typing "`classes`" and "`expnames`".

"classes" is a vector with the assigned classes (H: high MRD, L: low MRD).

The "expnames" are the names of the experiments followed by the class. Eg. the first patient (P) in the vector has number 20, has high MRD (H) and is therefore named: "P20.H".

Look at the first 15 lines of the data object "Matrix":

Matrix[1:15,]

Here, you see the first 15 rows of the "Matrix" object, where each row is a different gene (the Affymetrix ID’s are listed), and each column is an experiment (a patient). The values are gene expression levels for each combination of experiment and gene.

### Principal Component Analysis (PCA) on all Genes

Visualize the data before performing any analysis.

Load packages and functions: (Don’t worry about the Warning message: package 'ctest' has been merged into 'stats')

library(MASS) # function for writing to file is used library(class) # functions for class prediction source("R.functions")

Transpose[1] data matrix:

tposed <- t(Matrix)

Run PCA:

pca <- fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)

Plot the first two principal components:

plot(pca$x[,1:2],type="n") text(pca$x[,1:2], expnames)

This plot shows the two patient groups using the labels of low (L) and high (H) MRD.

- Question 1: How well are the individual classes separated?

### Feature Selection: T-test

(See Section 3.5 and page 100 in the Microarray book )

Remaining in the R session started above; enter the following commands to perform a t-test of all genes.

First, assign your data to each class according to the vector "classes":

factors <- factor(classes) index <- 1:length(factors) index1 <- index[factors==levels(factors)[1]] # Category H index2 <- index[factors==levels(factors)[2]] # Category L

Calculated p-values for each gene:

pValues <- get.pval.ttest(Matrix,index1,index2)

Order the data according to lowest p-value:

orders <- order(pValues) Matrix.pvalues <- cbind(Matrix[orders,],pValues[orders])

Look at the 25 genes with the lowest p-value (the most significant genes):

Matrix.pvalues[1:25,]

- Question 2: With a Bonferroni cutoff of 1 false positive (a maximum p-value of 1/8793 = 1.14e-4), how many genes are significant?

### PCA of Top Ranking Genes

Perform a new principal component analysis by using only the 25 most significant genes.

Transpose data matrix of 25 top ranking genes:

tposed <- t(Matrix.pvalues[1:25,1:length(classes)])

Run PCA:

pca <- fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)

Plot the first two principal components:

plot(pca$x[,1:2],type="n") text(pca$x[,1:2], expnames)

- Question 3: How well are the individual classes separated, now?

### Hierarchical Cluster Analysis

Make a copy of the original Matrix to work with and sort the rows according to p-values:

cluster.matrix <- Matrix[orders,] colnames(cluster.matrix) <- expnames

Calculate Euclidian distances:

d <- dist(t(cluster.matrix))

Perform clustering:

hc <- hclust(d)

Make a cluster plot:

plot(hc)

- Question 4: What do you see on this plot? How are the samples clustered?

Continue working with only the 50 most significant genes:

m <- cluster.matrix[1:50,]

Calculate Euclidian distances:

d <- dist(t(m))

Perform clustering:

hc <- hclust(d)

Make a cluster plot:

plot(hc)

- Question 5: What do you see on this plot? How are the samples clustered?

Perform another clustering using the top 50 most significant genes (here, the distances are based on Pearson's correlation coefficient):

heatmap(m, distfun = cordist, col = topo.colors(32))

- Question 6: What is the difference between this plot and the previous plot? Describe the plot.

Close down R Quit the R session with Control-D followed by the answer "n" to the question whether you want to save the workspace image.

[1] Transpose: is simply swapping the rows and columns, in this example genes are swapped to columns and patients to rows. This is done so the PCA is done on genes and not patients.