Contents

Part I: Clustering Methods

Part II: Principal Components Analysis


Tutorial

Clustering 1 - Clustering Methods

K-Means Clustering

  • The function kmeans() performs K-means clustering in R.
  • We begin with a simple simulated example in which there truly are two clusters in the data:
  • the first 25 observations have a mean shift relative to the next 25 observations.
set.seed(2)
x = matrix(rnorm(50*2), ncol = 2)
x[1:25, 1] = x[1:25, 1]+3
x[1:25, 2] = x[1:25, 2]-4

We now perform K-means clustering with K = 2:

km.out = kmeans(x, 2, nstart = 20)
km.out
## K-means clustering with 2 clusters of sizes 25, 25
## 
## Cluster means:
##         [,1]       [,2]
## 1 -0.1956978 -0.1848774
## 2  3.3339737 -4.0761910
## 
## Clustering vector:
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 65.40068 63.20595
##  (between_SS / total_SS =  72.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The cluster assignments of the 50 observations are contained in km.out$cluster:

km.out$cluster
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment.

plot(x, col=(km.out$cluster+1), main="K means clustering results with K=2", 
     xlab="", ylab="", pch=20, cex=2)

Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables, we could instead perform PCA and plot the first two principal component score vectors, as in the short sketch below. In this example we knew that there really were two clusters, because we generated the data; for real data, however, we generally do not know the true number of clusters.
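A minimal sketch of that idea (an addition to this lab; x5, km5, and pc5 are illustrative names, not from the original): pad the data to five dimensions, cluster in the full space, and plot the first two principal component score vectors colored by cluster assignment.

x5 <- cbind(x, matrix(rnorm(50*3), ncol = 3))  # hypothetical 5-dimensional data
km5 <- kmeans(x5, 2, nstart = 20)              # cluster in all five dimensions
pc5 <- prcomp(x5)                              # PCA gives a 2-D view for plotting
plot(pc5$x[, 1:2], col = (km5$cluster + 1), pch = 20, cex = 2,
     xlab = "PC1", ylab = "PC2")

Returning to the simulated example: we could instead have performed K-means clustering with K = 3.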

set.seed(4)
km.out = kmeans(x,3,nstart=20)
km.out
## K-means clustering with 3 clusters of sizes 10, 23, 17
## 
## Cluster means:
##         [,1]        [,2]
## 1  2.3001545 -2.69622023
## 2 -0.3820397 -0.08740753
## 3  3.7789567 -4.56200798
## 
## Clustering vector:
##  [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 19.56137 52.67700 25.74089
##  (between_SS / total_SS =  79.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
plot(x, col=(km.out$cluster+1), main="K means clustering results with K=3",
     xlab="", ylab="", pch=20, cex=2)

We can also check the optimal number of clusters with the average silhouette width; the fviz_nbclust() function computes the average silhouette for a range of values of k:

library(factoextra)
## Loading required package: ggplot2
fviz_nbclust(x, kmeans, method = "silhouette")
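Under the hood, this criterion is just the average silhouette width for each candidate k; here is a sketch of the same computation with the cluster package (an addition, assuming that package is installed):

library(cluster)
# average silhouette width for k = 2..6, computed directly
avg.sil <- sapply(2:6, function(k) {
    km <- kmeans(x, k, nstart = 20)
    mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})
avg.sil  # the k with the largest value is the silhouette-optimal choice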

K-Means Clustering - biological example

yeast <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-yeast.csv")
#drop the row-index column, then keep the eight numeric features (columns 2-9 of the remaining data)
yeast <- yeast[,-1]
myData <- yeast[,2:9] 
#choose the number of clusters and run K-means on the feature matrix
k <- 4
kmeans_result <- kmeans(myData, k)
#tabulate the assignments: the number of proteins in each cluster
table(kmeans_result$cluster)
## 
##   1   2   3   4 
## 123 228 498 450
head(kmeans_result$cluster)
## [1] 3 3 3 3 3 3
#visualize the clusters on two of the features (mit, gvh) and mark the cluster centers
par(mfrow = c(1,2))
head(myData[c("mit","gvh")])
plot(myData[c("mit","gvh")],col = kmeans_result$cluster, main = "(A) Plot with clusters")
plot(myData[c("mit","gvh")],col = yeast$class, main = "(B) Plot with actual classes")
points(kmeans_result$centers[,c("mit","gvh")],col = 1:4,pch = 8,cex = 2)
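Because the yeast data include a class label, we can cross-tabulate the K-means assignments against the actual classes to see how well the clusters recover them (an added check, using the same yeast$class column as the plot above):

# rows = K-means clusters, columns = annotated classes;
# a concentrated table indicates good agreement
table(kmeans_result$cluster, yeast$class)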

Hierarchical Clustering

  • The hclust() function implements hierarchical clustering in R.
  • In the following example we use the data from Section 10.5.1 to plot the hierarchical clustering dendrogram
  • using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.

We begin by clustering observations using complete linkage. The dist() function is used to compute the 50 x 50 inter-observation Euclidean distance matrix.

hc.complete = hclust(dist(x), method="complete")

We could just as easily perform hierarchical clustering with average or single linkage instead:

hc.average = hclust(dist(x), method="average")
hc.single = hclust(dist(x), method="single")

We can now plot the dendrograms obtained using the usual plot() function. The numbers at the bottom of the plot identify each observation.

par(mfrow=c(1,3))
plot(hc.complete, main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)

par(mfrow=c(1,1))

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function:

cutree(hc.complete, 2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
cutree(hc.average, 2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
cutree(hc.single, 2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.

cutree(hc.single, 4)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3
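Because we generated these data with the first 25 observations mean-shifted, we can verify the linkage results against the true grouping (an added check, not in the original lab):

true.groups <- rep(1:2, each = 25)      # the known group labels
table(cutree(hc.complete, 2), true.groups)
table(cutree(hc.single, 2), true.groups)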

To scale the variables before performing hierarchical clustering of the observations, we use the scale() function:

xsc = scale(x)
plot(hclust(dist(xsc), method="complete"), main="Hierarchical Clustering with Scaled Features")

Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix. However, this only makes sense for data with at least three features since the absolute correlation between any two observations with measurements on two features is always 1. Hence, we will cluster a three-dimensional data set.

x = matrix(rnorm(30*3), ncol=3)
dd = as.dist(1-cor(t(x)))
plot(hclust(dd, method = "complete"), main = "Complete Linkage with Correlation-Based Distance",
     xlab="",sub="")

Hierarchical Clustering - biological example

#Find the distance matrix between the data points
yeast100 <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-yeast100.csv")
yeast100 <- yeast100[,-1]
myData_100 <- yeast100[,2:9]
myDist <- dist(myData_100)

#Implement the hierarchical clustering with average linkage
hc <- hclust(myDist, method = "average")
groups <- cutree(hc, k = 4)

#Generate the dendrogram, label the leaves with their group, and box the four clusters
plot(hc, hang = -1, labels = groups)
rect.hclust(hc, k = 4, border = 2)
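To gauge the quality of the four-group cut, we can compute its average silhouette width (an added check, assuming the cluster package is installed):

library(cluster)
sil <- silhouette(groups, myDist)   # silhouette of the dendrogram cut
summary(sil)$avg.width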

Clustering 2 - Principal Components Analysis

PCA

  • Unsupervised techniques are often used in the analysis of genomic data.
  • In particular, PCA and hierarchical clustering are popular tools.
  • We illustrate these techniques on the NCI60 cancer cell line microarray data,
  • which consists of 6,830 gene expression measurements on 64 cancer cell lines.
library(ISLR)
nci.labs = NCI60$labs
nci.data = NCI60$data

Each cell line is labeled with a cancer type. We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques. But after performing PCA, we will check to see the extent to which these cancer types agree with the results of these unsupervised techniques. The data has 64 rows and 6,830 columns.

dim(nci.data)
## [1]   64 6830

We begin by examining the cancer types for the cell lines.

nci.labs[1:4]
## [1] "CNS"   "CNS"   "CNS"   "RENAL"
table(nci.labs)
## nci.labs
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1

We perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes.

pr.out = prcomp(nci.data, scale=TRUE)

We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other.

We first create a simple function that assigns a distinct color to each element of a vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.

Cols = function(vec){
    cols = rainbow(length(unique(vec)))
    return(cols[as.numeric(as.factor(vec))])
}

Note that the rainbow() function takes as its argument a positive integer, and returns a vector containing that number of distinct colors. We now can plot the principal component score vectors.

par(mfrow=c(1,2))
plot(pr.out$x[,1:2], col = Cols(nci.labs), pch=19,
     xlab="Z1",ylab="Z2")
plot(pr.out$x[,c(1,3)], col = Cols(nci.labs), pch=19,
     xlab="Z1",ylab="Z3")

Cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have similar gene expression levels. We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the summary() method for a prcomp object:

summary(pr.out)
## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5
## Standard deviation     27.8535 21.48136 19.82046 17.03256 15.97181
## Proportion of Variance  0.1136  0.06756  0.05752  0.04248  0.03735
## Cumulative Proportion   0.1136  0.18115  0.23867  0.28115  0.31850
##                             PC6      PC7      PC8      PC9     PC10
## Standard deviation     15.72108 14.47145 13.54427 13.14400 12.73860
## Proportion of Variance  0.03619  0.03066  0.02686  0.02529  0.02376
## Cumulative Proportion   0.35468  0.38534  0.41220  0.43750  0.46126
##                            PC11     PC12     PC13     PC14     PC15
## Standard deviation     12.68672 12.15769 11.83019 11.62554 11.43779
## Proportion of Variance  0.02357  0.02164  0.02049  0.01979  0.01915
## Cumulative Proportion   0.48482  0.50646  0.52695  0.54674  0.56590
##                            PC16     PC17     PC18     PC19    PC20
## Standard deviation     11.00051 10.65666 10.48880 10.43518 10.3219
## Proportion of Variance  0.01772  0.01663  0.01611  0.01594  0.0156
## Cumulative Proportion   0.58361  0.60024  0.61635  0.63229  0.6479
##                            PC21    PC22    PC23    PC24    PC25    PC26
## Standard deviation     10.14608 10.0544 9.90265 9.64766 9.50764 9.33253
## Proportion of Variance  0.01507  0.0148 0.01436 0.01363 0.01324 0.01275
## Cumulative Proportion   0.66296  0.6778 0.69212 0.70575 0.71899 0.73174
##                           PC27   PC28    PC29    PC30    PC31    PC32
## Standard deviation     9.27320 9.0900 8.98117 8.75003 8.59962 8.44738
## Proportion of Variance 0.01259 0.0121 0.01181 0.01121 0.01083 0.01045
## Cumulative Proportion  0.74433 0.7564 0.76824 0.77945 0.79027 0.80072
##                           PC33    PC34    PC35    PC36    PC37    PC38
## Standard deviation     8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion  0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
##                           PC39    PC40    PC41   PC42    PC43   PC44
## Standard deviation     7.72156 7.58603 7.45619 7.3444 7.10449 7.0131
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072
## Cumulative Proportion  0.86676 0.87518 0.88332 0.8912 0.89861 0.9058
##                           PC45   PC46    PC47    PC48    PC49    PC50
## Standard deviation     6.95839 6.8663 6.80744 6.64763 6.61607 6.40793
## Proportion of Variance 0.00709 0.0069 0.00678 0.00647 0.00641 0.00601
## Cumulative Proportion  0.91290 0.9198 0.92659 0.93306 0.93947 0.94548
##                           PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     6.21984 6.20326 6.06706 5.91805 5.91233 5.73539
## Proportion of Variance 0.00566 0.00563 0.00539 0.00513 0.00512 0.00482
## Cumulative Proportion  0.95114 0.95678 0.96216 0.96729 0.97241 0.97723
##                           PC57   PC58    PC59    PC60    PC61    PC62
## Standard deviation     5.47261 5.2921 5.02117 4.68398 4.17567 4.08212
## Proportion of Variance 0.00438 0.0041 0.00369 0.00321 0.00255 0.00244
## Cumulative Proportion  0.98161 0.9857 0.98940 0.99262 0.99517 0.99761
##                           PC63      PC64
## Standard deviation     4.04124 2.148e-14
## Proportion of Variance 0.00239 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00

Using the plot() function, we can also plot the variance explained by the first few principal components.

plot(pr.out)

Note that the height of each bar in the bar plot is given by squaring the corresponding element of pr.out$sdev. However, it is more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little work.

pve = 100*pr.out$sdev^2/sum(pr.out$sdev^2)
par(mfrow=c(1,2))
plot(pve, type = "o", ylab = "PVE", xlab="Principal Component", col="blue")
plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col ="brown3")
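A common use of the cumulative PVE curve is to choose the smallest number of components that reaches some variance threshold; a one-line sketch with an arbitrary 80% cutoff:

# smallest number of PCs explaining at least 80% of the variance
# (the 80% threshold is illustrative, not a rule)
min(which(cumsum(pve) >= 80))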

Summary example

  • Hierarchical clustering of cancer patients
  • We extracted the top 50 differentially expressed genes (DEGs) among cancer patients and, using these genes, built a distance matrix based on Pearson correlation between patients. The table below is the distance matrix for five cancer patients whose cancer types are distinct and unknown.
cancer <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-cancer.csv")
rownames(cancer) <- cancer[, 1]
cancer <- cancer[, -1]

cor.mat <- abs(cor(cancer, method = "pearson"))   # absolute Pearson correlation between patients
dist.cancer <- as.dist(1 - cor.mat, diag = T, upper = T)

## a) Using the complete linkage method, build the hierarchical clustering tree
hc <- hclust(dist.cancer, method = "complete")
plot(hc)

## b) Determine how many clusters are appropriate for this patient group, using the silhouette index
dist.cancer <- 1 - cor.mat   # keep the full matrix form for indexing below

## 2 clusters case ##
#for ("D", "E") cluster
incluster <- colMeans(dist.cancer[4:5, 4:5])
outcluster <- colMeans(dist.cancer[1:3, 4:5])
sj1 <- (outcluster[1] - incluster[1])/max(incluster[1], outcluster[1])
sj1 <- sj1 + (outcluster[2] - incluster[2])/max(incluster[2], outcluster[2])
sj1/2  #0.885
##         D 
## 0.8857695
#for ("A", "B", "C") cluster
incluster <- colMeans(dist.cancer[1:3, 1:3])
outcluster <- colMeans(dist.cancer[4:5, 1:3])
sj2 <- (outcluster[1] - incluster[1])/max(incluster[1], outcluster[1])
sj2 <- sj2 + (outcluster[2] - incluster[2])/max(incluster[2], outcluster[2])
sj2 <- sj2 + (outcluster[3] - incluster[3])/max(incluster[3], outcluster[3])
sj2/3  #0.756
##         A 
## 0.7561417
# global silhouette index as defined in this exercise
# (note: GSu is often taken as the mean of the per-cluster averages, i.e.
#  divided by the number of clusters rather than by the 5 patients)
GSu <- (sj1/2 + sj2/3)/5
GSu #0.328
##         D 
## 0.3283822
# you can calculate GSu (the global silhouette index) for the 3-cluster and 4-cluster cases using the code above
## 3 clusters case##
## 4 clusters case##
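As a cross-check of the hand computation (an addition; it assumes the cluster package is installed), cluster::silhouette computes the widths directly from a grouping vector and a dissimilarity, which makes it easy to compare the 2-, 3-, and 4-cluster cuts. Note that silhouette() excludes each point's self-distance, so its values will differ slightly from the hand computation above.

library(cluster)
for (k in 2:4) {
    groups <- cutree(hc, k = k)
    sil <- silhouette(groups, as.dist(dist.cancer))
    cat(k, "clusters: average silhouette width =",
        round(mean(sil[, "sil_width"]), 3), "\n")
}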