Contents

Clustering Methods

  • K-means clustering
  • Hierarchical clustering
  • Clustering assessment methods

Principal Components Analysis

  • Principle, interpretation, and application of PCA

Tutorial

K-Means Clustering

  • Reference: Bioinformatics using R (acorn)
load("Z:/Lecture/1-BIS335BioStat/14-Clustering/4-tutorial/14-Clustering-yeast.rda")
# Select columns 2 to 9 of the yeast data as the clustering input
myData <- yeast[, 2:9]
# Choose the number of clusters and run k-means on the data
k <- 4
kmeans_result <- kmeans(myData, k)
# Tabulate the cluster assignments to check the number of proteins in each cluster
table(kmeans_result$cluster)
## 
##   1   2   3   4 
## 450 228 123 498
head(kmeans_result$cluster)
##  6 10 13 16 17 21 
##  4  4  4  4  4  4
# Visualize the clusters side by side with the actual classes and mark the cluster centers
par(mfrow = c(1,2))
head(myData[c("mit","gvh")])
##     mit  gvh
## 6  0.17 0.40
## 10 0.15 0.39
## 13 0.35 0.42
## 16 0.11 0.44
## 17 0.11 0.39
## 21 0.16 0.40
plot(myData[c("mit","gvh")], col = kmeans_result$cluster, main = "(A) Plot with clusters")
plot(myData[c("mit","gvh")], col = yeast$class, main = "(B) Plot with actual classes")
# points() adds the k-means centers to the most recently drawn panel
points(kmeans_result$centers[, c("mit","gvh")], col = 1:4, pch = 8, cex = 2)
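  • To complement the visual comparison in panel (B), the k-means clusters can be cross-tabulated against the known classes (a small sketch using the objects defined above; this table is not part of the original output):
# Cross-tabulate k-means clusters (rows) against the actual yeast classes (columns)
table(kmeans_result$cluster, yeast$class)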

Hierarchical clustering

# Compute the distance matrix between the data points
load("Z:/Lecture/1-BIS335BioStat/14-Clustering/4-tutorial/14-Clustering-yeast100.rda")

myData_100 <- yeast100[,2:9]
myDist <- dist(myData_100)

# Perform hierarchical clustering with average linkage
hc <- hclust(myDist, method = "average")
# Cut the dendrogram into k (= 4) clusters
groups <- cutree(hc, k = k)

# Plot the dendrogram, labelling the leaves by cluster, and mark the four clusters
plot(hc, hang = -1, labels = groups)
rect.hclust(hc, k = 4, border = 2)
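  • To see how the 100 observations are distributed over the four clusters, the group vector returned by cutree() can be tabulated (a quick sketch; output not shown here):
# Number of observations in each of the k = 4 hierarchical clusters
table(groups)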

PCA

  • Let’s consider a case where there is a relationship between the xs and the ys.
  • In this case, the plot shows a clear relationship between x and y; this is confirmed by the non-zero off-diagonal elements of the covariance matrix (a quick check follows the plotting code below).
x <- rnorm(n = 300, mean = 7, sd = 1)          # random x values
y <- rnorm(n = 300, mean = 7, sd = 2) + 2 * x  # y = random component + weighted portion of x

plot(y ~ x)
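  • As a quick check of the covariance claim above, the covariance matrix of the simulated x and y can be inspected directly (the exact values vary from run to run because the data are random):
# Covariance matrix of x and y; the non-zero off-diagonal entries confirm the relationship
round(cov(cbind(x, y)), digits = 2)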

  • If you look at the code for this example, you can see that y is a mix of a random component and a weighted portion of x, so these results shouldn’t be all that surprising. If you were given the data in the figure above and asked to best represent each point using only one number instead of two, a few options might come to mind. You could just forget about all of the x values and report the ys. Or you could just forget about the y values and report the xs. While either of these options might seem equally good, replotting the data with equal scales might cause you to change your mind:
plot(y ~ x, xlim = c(0,30), ylim = c(0,30), cex = 0.5)

  • Here we can see that there is more variance in y (a quick numeric check follows this list), so the best that we might do given one value per point is to report the ys and tell our audience to assume that all the xs were 7. In this case, the estimated location of each point wouldn’t be too far from its actual location.
  • However, since there is a clear correlation between x and y, we can do better by reporting the position along the blue line, i.e. the direction of the first principal component; a sketch of this line appears after the PCA code below.
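  • A quick numeric check of the variance claim above (values will vary because x and y are random draws):
# y has a much larger spread than x, so keeping the ys preserves more information
var(x)
var(y)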
samples <- cbind(x,y)
head(samples)
##             x        y
## [1,] 6.836541 18.00537
## [2,] 8.754436 22.85599
## [3,] 6.410522 21.60640
## [4,] 7.760650 24.88630
## [5,] 8.516666 22.25930
## [6,] 6.920867 20.18476
results <- prcomp(samples)        # PCA (prcomp centers the data by default)
pc1 <- results$rotation[, "PC1"]  # loadings (direction) of the first principal component
samples.xfromed <- results$x      # the data expressed in the new PC coordinates
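  • The “blue line” referred to above is the direction of the first principal component drawn through the centroid of the data; a minimal sketch of how it could be overlaid on the equal-scale plot (the slope/intercept construction here is illustrative, not part of the original code):
# Redraw the scatter plot and overlay the PC1 direction through the data centroid
plot(y ~ x, xlim = c(0, 30), ylim = c(0, 30), cex = 0.5)
slope <- unname(pc1["y"] / pc1["x"])    # slope of PC1 in the original x-y plane
intercept <- mean(y) - slope * mean(x)  # force the line through the centroid
abline(intercept, slope, col = "blue", lwd = 2)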
  • In PCA, we choose the first principal component (PC1) as the direction in the original data that contains the most variance. The second principal component is the direction, orthogonal (i.e., perpendicular) to PC1, that contains the most remaining variance. In a two-dimensional case the second PC is trivially determined, since there is only one direction left that is orthogonal to PC1, but in higher dimensions the remaining variance comes into play. PC3 is chosen as the direction that is orthogonal to both PC1 and PC2 and contains the most remaining variance, and so on. Regardless of the number of dimensions, this all just boils down to a rotation of the original data to align with newly chosen axes. One property of the transformed data is that all of the covariances are zero. You can see this from the covariance matrix of the transformed data for our example:
round(cov(samples.xfromed),digits = 3)
##       PC1   PC2
## PC1 9.396 0.000
## PC2 0.000 0.456
  • Here we see that, for our toy example, about 95% of the variation in the data is captured by the first PC:
results$sdev
## [1] 3.0652828 0.6752466
explained_variance <- round((results$sdev^2) / sum(results$sdev^2), digits = 2)
explained_variance
## [1] 0.95 0.05
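  • The statement above that PCA boils down to a rotation can be verified directly: centering the original data and multiplying by the rotation (loadings) matrix reproduces the transformed scores (a small check using the objects already defined):
# Centered data times the rotation matrix equals the PC scores returned by prcomp
rotated <- scale(samples, center = results$center, scale = FALSE) %*% results$rotation
max(abs(rotated - samples.xfromed))  # should be numerically zero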

Clustering for biological data (in the lecture material)

  • Hierarchical clustering of cancer patients
  • We extracted the top 50 differentially expressed genes (DEGs) among cancer patients and, using these genes, built a distance matrix based on the Pearson correlation between patients. The table below is a distance matrix of five cancer patients whose cancer types are different and unknown.
cancer <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-cancer.csv")
rownames(cancer) <- cancer[, 1]  # use the first column as row names
cancer <- cancer[, -1]           # drop the identifier column

cor <- abs(cor(cancer, method = "pearson"))                 # absolute Pearson correlation between patients
dist.cancer <- as.dist(1 - cor, diag = TRUE, upper = TRUE)  # convert correlation to a distance object

## a) Using the complete linkage method, build the hierarchical clustering tree
hc <- hclust(dist.cancer, method = "complete")
plot(hc)

## b) Determine how many clusters are appropriate for this patient group, using the silhouette index
dist.cancer <- 1 - cor  # use the full square matrix so rows and columns can be indexed directly

## 2 clusters case ##
#for ("D", "E") cluster
incluster <- colMeans(dist.cancer[4:5, 4:5])
outcluster <- colMeans(dist.cancer[1:3, 4:5])
sj1 <- (outcluster[1] - incluster[1])/max(incluster[1], outcluster[1])
sj1 <- sj1 + (outcluster[2] - incluster[2])/max(incluster[2], outcluster[2])
sj1/2  #0.885
##         D 
## 0.8857695
#for ("A", "B", "C") cluster
incluster <- colMeans(dist.cancer[1:3, 1:3])
outcluster <- colMeans(dist.cancer[4:5, 1:3])
sj2 <- (outcluster[1] - incluster[1])/max(incluster[1], outcluster[1])
sj2 <- sj2 + (outcluster[2] - incluster[2])/max(incluster[2], outcluster[2])
sj2 <- sj2 + (outcluster[3] - incluster[3])/max(incluster[3], outcluster[3])
sj2/3  #0.756
##         A 
## 0.7561417
GSu <- (sj1/2 + sj2/3)/5
GSu #0.328
##         D 
## 0.3283822
# You can calculate GSu (the overall silhouette index) for the 3-cluster and 4-cluster cases using the code above
## 3 clusters case##
## 4 clusters case##
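  • As a cross-check (not part of the original exercise), the cluster package’s silhouette() function computes the standard average silhouette width for any number of clusters; its definition differs slightly from the hand calculation above, so the numbers are not expected to match exactly.
# Hedged sketch: average silhouette width for k = 2, 3, 4 using the cluster package
library(cluster)
d <- as.dist(dist.cancer)   # dist.cancer is the square 1 - |cor| matrix from above
for (k in 2:4) {
  grp <- cutree(hc, k = k)  # cut the complete-linkage tree into k groups
  sil <- silhouette(grp, d) # per-patient silhouette widths
  cat("k =", k, ": mean silhouette width =", round(mean(sil[, "sil_width"]), 3), "\n")
}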