#simulate 50 two-dimensional observations in which the first 25 are shifted in mean
set.seed(2)
x = matrix(rnorm(50*2), ncol = 2)
x[1:25, 1] = x[1:25, 1]+3
x[1:25, 2] = x[1:25, 2]-4
We now perform K-means clustering with K=2.
km.out = kmeans(x, 2, nstart = 20)
km.out
## K-means clustering with 2 clusters of sizes 25, 25
##
## Cluster means:
## [,1] [,2]
## 1 -0.1956978 -0.1848774
## 2 3.3339737 -4.0761910
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 65.40068 63.20595
## (between_SS / total_SS = 72.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
The cluster assignments of the 50 observations are contained in km.out$cluster:
km.out$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
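Since we simulated these data, the true group memberships are known (observations 1 to 25 were shifted); as a quick check, we can cross-tabulate the assignments against them:
#cross-tabulate the fitted clusters against the known simulated groups
table(km.out$cluster, rep(c("shifted", "unshifted"), each = 25))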
The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment.
plot(x, col=(km.out$cluster+1), main="K means clustering results with K=2",
xlab="", ylab="", pch=20, cex=2)
Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables, we could instead perform PCA and plot the first two principal component score vectors (a short sketch of this appears after the K=3 example below). In this example, we knew that there really were two clusters because we generated the data. However, for real data we generally do not know the true number of clusters. We could instead have performed K-means clustering on this example with K=3.
set.seed(4)
km.out = kmeans(x,3,nstart=20)
km.out
## K-means clustering with 3 clusters of sizes 10, 23, 17
##
## Cluster means:
## [,1] [,2]
## 1 2.3001545 -2.69622023
## 2 -0.3820397 -0.08740753
## 3 3.7789567 -4.56200798
##
## Clustering vector:
## [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 19.56137 52.67700 25.74089
## (between_SS / total_SS = 79.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
plot(x, col=(km.out$cluster+1), main="K means clustering results with K=3",
xlab="", ylab="", pch=20, cex=2)
We can check the optimal number of clusters using the silhouette index: the fviz_nbclust() function computes the average silhouette width over a range of values of k.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_nbclust(x, kmeans, method = "silhouette")
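The average silhouette widths behind this plot can also be computed directly with the cluster package; a minimal sketch, assuming the simulated matrix x from above (avg.sil is our own name, and values may vary slightly across runs because of the random starts):
library(cluster)  #for silhouette()
avg.sil = sapply(2:6, function(k) {
  km = kmeans(x, centers = k, nstart = 20)
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])  #average silhouette width for k clusters
})
names(avg.sil) = 2:6
avg.sil  #the largest value suggests the number of clusters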
yeast <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-yeast.csv")
#drop the ID column, then keep the eight numeric features in columns 2 to 9 of the yeast data
yeast <- yeast[,-1]
myData <- yeast[,2:9]
#choose the number of clusters and run k-means with the data and k as input
k <- 4
kmeans_result <- kmeans(myData, k)
#tabulate the cluster assignments to count the number of proteins in each cluster
table(kmeans_result$cluster)
##
## 1 2 3 4
## 123 228 498 450
head(kmeans_result$cluster)
## [1] 3 3 3 3 3 3
#plot the clusters next to the actual classes and mark the cluster centers
par(mfrow = c(1,2))
head(myData[c("mit","gvh")])
plot(myData[c("mit","gvh")],col = kmeans_result$cluster, main = "(A) Plot with clusters")
plot(myData[c("mit","gvh")],col = yeast$class, main = "(B) Plot with actual classes")
points(kmeans_result$centers[,c("mit","gvh")],col = 1:4,pch = 8,cex = 2)
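To quantify how closely panel (A) matches panel (B), we can cross-tabulate the k-means assignments against the annotated classes:
#confusion-style table of k-means clusters versus the annotated yeast classes
table(cluster = kmeans_result$cluster, class = yeast$class)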
We begin by clustering observations using complete linkage. The dist() function is used to compute the 50 x 50 inter-observation Euclidean distance matrix.
hc.complete = hclust(dist(x), method="complete")
We could just as easily perform hierarchical clustering with average or single linkage instead:
hc.average = hclust(dist(x), method="average")
hc.single = hclust(dist(x), method="single")
We can now plot the dendrograms obtained using the usual plot() function. The numbers at the bottom of the plot identify each observation.
par(mfrow=c(1,3))
plot(hc.complete, main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)
par(mfrow=c(1,1))
To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function:
cutree(hc.complete, 2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
cutree(hc.average, 2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2
## [36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
cutree(hc.single, 2)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.
cutree(hc.single, 4)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3
To scale the variables before performing hierarchical clustering of the observations, we use the scale() function:
xsc = scale(x)
plot(hclust(dist(xsc), method="complete"), main="Hierarchical Clustering with Scaled Features")
Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix. However, this only makes sense for data with at least three features since the absolute correlation between any two observations with measurements on two features is always 1. Hence, we will cluster a three-dimensional data set.
x = matrix(rnorm(30*3), ncol=3)
dd = as.dist(1-cor(t(x)))
plot(hclust(dd, method = "complete"), main = "Complete Linkage with Correlation-Based Distance",
xlab="",sub="")
#Find the distance matrix between the data points
yeast100 <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-yeast100.csv")
yeast100 <- yeast100[,-1]
myData_100 <- yeast100[,2:9]
myDist <- dist(myData_100)
#Implement the hierarchical clustering
hc <- hclust(myDist, method = "average")
groups <- cutree(hc, k = k)
#Generate the dendrogram, label the leaves by cluster, and highlight the k = 4 clusters
plot(hc, hang = -1, labels = groups)
rect.hclust(hc, k = 4, border = 2)
library(ISLR)
nci.labs = NCI60$labs
nci.data = NCI60$data
Each cell line is labeled with a cancer type. We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques. But after performing PCA, we will check to see the extent to which these cancer types agree with the results of these unsupervised techniques. The data has 64 rows and 6,830 columns.
dim(nci.data)
## [1] 64 6830
We begin by examining the cancer types for the cell lines.
nci.labs[1:4]
## [1] "CNS" "CNS" "CNS" "RENAL"
table(nci.labs)
## nci.labs
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
We perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes.
pr.out = prcomp(nci.data, scale=TRUE)
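To see what the remark about scaling amounts to, one could refit the PCA without scaling and compare the variance explained; a minimal sketch (the object name pr.out.unsc is ours, not part of the original lab):
#PCA without scaling the genes, for comparison with the scaled fit above
pr.out.unsc = prcomp(nci.data, scale = FALSE)
summary(pr.out.unsc)$importance[2, 1:5]  #proportion of variance for the first five PCs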
We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other.
We first create a simple function that assigns a distinct color to each element of a vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.
Cols = function(vec){
cols = rainbow(length(unique(vec)))
return(cols[as.numeric(as.factor(vec))])
}
Note that the rainbow() function takes as its argument a positive integer, and returns a vector containing that number of distinct colors. We can now plot the principal component score vectors.
par(mfrow=c(1,2))
plot(pr.out$x[,1:2], col = Cols(nci.labs), pch=19,
xlab="Z1",ylab="Z2")
plot(pr.out$x[,c(1,3)], col = Cols(nci.labs), pch=19,
xlab="Z1",ylab="Z3")
Cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have fairly similar gene expression levels. We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the summary() method for a prcomp object:
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 27.8535 21.48136 19.82046 17.03256 15.97181
## Proportion of Variance 0.1136 0.06756 0.05752 0.04248 0.03735
## Cumulative Proportion 0.1136 0.18115 0.23867 0.28115 0.31850
## PC6 PC7 PC8 PC9 PC10
## Standard deviation 15.72108 14.47145 13.54427 13.14400 12.73860
## Proportion of Variance 0.03619 0.03066 0.02686 0.02529 0.02376
## Cumulative Proportion 0.35468 0.38534 0.41220 0.43750 0.46126
## PC11 PC12 PC13 PC14 PC15
## Standard deviation 12.68672 12.15769 11.83019 11.62554 11.43779
## Proportion of Variance 0.02357 0.02164 0.02049 0.01979 0.01915
## Cumulative Proportion 0.48482 0.50646 0.52695 0.54674 0.56590
## PC16 PC17 PC18 PC19 PC20
## Standard deviation 11.00051 10.65666 10.48880 10.43518 10.3219
## Proportion of Variance 0.01772 0.01663 0.01611 0.01594 0.0156
## Cumulative Proportion 0.58361 0.60024 0.61635 0.63229 0.6479
## PC21 PC22 PC23 PC24 PC25 PC26
## Standard deviation 10.14608 10.0544 9.90265 9.64766 9.50764 9.33253
## Proportion of Variance 0.01507 0.0148 0.01436 0.01363 0.01324 0.01275
## Cumulative Proportion 0.66296 0.6778 0.69212 0.70575 0.71899 0.73174
## PC27 PC28 PC29 PC30 PC31 PC32
## Standard deviation 9.27320 9.0900 8.98117 8.75003 8.59962 8.44738
## Proportion of Variance 0.01259 0.0121 0.01181 0.01121 0.01083 0.01045
## Cumulative Proportion 0.74433 0.7564 0.76824 0.77945 0.79027 0.80072
## PC33 PC34 PC35 PC36 PC37 PC38
## Standard deviation 8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion 0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
## PC39 PC40 PC41 PC42 PC43 PC44
## Standard deviation 7.72156 7.58603 7.45619 7.3444 7.10449 7.0131
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072
## Cumulative Proportion 0.86676 0.87518 0.88332 0.8912 0.89861 0.9058
## PC45 PC46 PC47 PC48 PC49 PC50
## Standard deviation 6.95839 6.8663 6.80744 6.64763 6.61607 6.40793
## Proportion of Variance 0.00709 0.0069 0.00678 0.00647 0.00641 0.00601
## Cumulative Proportion 0.91290 0.9198 0.92659 0.93306 0.93947 0.94548
## PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 6.21984 6.20326 6.06706 5.91805 5.91233 5.73539
## Proportion of Variance 0.00566 0.00563 0.00539 0.00513 0.00512 0.00482
## Cumulative Proportion 0.95114 0.95678 0.96216 0.96729 0.97241 0.97723
## PC57 PC58 PC59 PC60 PC61 PC62
## Standard deviation 5.47261 5.2921 5.02117 4.68398 4.17567 4.08212
## Proportion of Variance 0.00438 0.0041 0.00369 0.00321 0.00255 0.00244
## Cumulative Proportion 0.98161 0.9857 0.98940 0.99262 0.99517 0.99761
## PC63 PC64
## Standard deviation 4.04124 2.148e-14
## Proportion of Variance 0.00239 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00
Using the plot() function, we can also plot the variance explained by the first few principal components.
plot(pr.out)
Note that the height of each bar in the bar plot is given by squaring the corresponding element of pr.out$sdev. However, it is more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little work.
pve = 100*pr.out$sdev^2/sum(pr.out$sdev^2)
par(mfrow=c(1,2))
plot(pve, type = "o", ylab = "PVE", xlab="Principal Component", col="blue")
plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col ="brown3")
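The cumulative PVE can also be used to read off how many components are needed to reach a given threshold; for example (the 80% cutoff is arbitrary and chosen only for illustration):
#number of principal components needed to explain at least 80% of the variance
which(cumsum(pve) >= 80)[1]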
cancer <- read.csv("http://bisyn.kaist.ac.kr/bis335/14-Clustering-cancer.csv")
rownames(cancer) <- cancer[, 1]
cancer <- cancer[, -1]
cor <- abs(cor(cancer, method = "pearson"))
dist.cancer <- as.dist(1 - cor, diag = T, upper = T)
## a) Using the complete linkage method, build a hierarchical clustering tree
hc <- hclust(dist.cancer, method = "complete")
plot(hc)
## b) Determine how many clusters are appropriate for this patient group, using the silhouette index
dist.cancer <- 1 - cor
## 2 clusters case ##
#for ("D", "E") cluster
incluster <- colMeans(dist.cancer[4:5, 4:5])
outcluster <- colMeans(dist.cancer[1:3, 4:5])
sj1 <- (outcluster[1] - incluster[1])/max(incluster[1], outcluster[1])
sj1 <- sj1 + (outcluster[2] - incluster[2])/max(incluster[2], outcluster[2])
sj1/2 #0.885
## D
## 0.8857695
#for ("A", "B", "C") cluster
incluster <- colMeans(dist.cancer[1:3, 1:3])
outcluster <- colMeans(dist.cancer[4:5, 1:3])
sj2 <- (outcluster[1] - incluster[1])/max(incluster[1], outcluster[1])
sj2 <- sj2 + (outcluster[2] - incluster[2])/max(incluster[2], outcluster[2])
sj2 <- sj2 + (outcluster[3] - incluster[3])/max(incluster[3], outcluster[3])
sj2/3 #0.756
## A
## 0.7561417
GSu <- (sj1/2 + sj2/3)/5
GSu #0.328
## D
## 0.3283822
# the GSu (silhouette index) for the 3-cluster and 4-cluster cases can be calculated using the code above
## 3 clusters case##
## 4 clusters case##
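As a cross-check for the 3- and 4-cluster cases, the silhouette widths can also be computed with the cluster package. This is only a sketch: silhouette() follows the standard convention (it excludes each point's zero self-distance and averages over samples rather than clusters), so its values will not match the hand-computed GSu above exactly.
library(cluster)  #for silhouette()
for (k in 2:4) {
  cl <- cutree(hc, k = k)  #cut the complete-linkage tree into k clusters
  sil <- silhouette(cl, dmatrix = dist.cancer)  #silhouette widths from the 1 - |cor| matrix
  cat(k, "clusters: average silhouette width =", round(mean(sil[, "sil_width"]), 3), "\n")
}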