Introduction
In this lab we will try to classify italian wines from several wine
makers. The dataset has 178 wines for which 13 quantitative variables
have been recorded. You will perform a K–means classification for those
wines. However you’ll have to wait a bit and re-work on the Fisher’s
iris data set to see if you understand and are able to reproduce the
results we discussed during the lecture.
Re-analyzing the Fisher’s iris dataset
The data set is already embbeded into the R software. To retrieve
them, just invoke
data(iris)
head(iris)
Recall that K–means classification is unsupervised so you should not
use the last variable—which is even not quantitative so, unless you do
not understand at all the K–means, you would not have used it, wouldn’t
you?
- Start with a (short) descriptive analysis
- Read the documentation of the kmeans function. Please pay
particular attention to the arguments x, centers,
iter.max et nstart.
- Perform a K–means classification
- Interpret results
- Vary the number of classes and get the “optimal” number of classes
(if you’re lost elbow rule)
- Create some fictious flowers and predict its species. (Eventually
create a function dedicated to that purpose)
- Since we do have access to the Species for each flower (5th
column), compute the confusion matrix. The table function might
help.
- Optional. I wrote an incomplete piece of code which is a
(re)implementation of the LLoyd alogorithm. Fill in gaps.
mykmeans <- function(data, n.class, iter.max = 50){
## data : matrix of the data
## n.class : number of classes
## iter.max : maximal number of iterations
n.obs <- nrow(data)
data <- as.matrix(data)## convert a data frame to matrix (if needed)
## Centroid initialization (taken randomly without replacement into individuals)
centroids <- data[sample(n.obs, n.class),]
for (i in 1:iter.max){
## Compute the distance for each individuals to each centroid
dist <- matrix(NA, n.obs, n.class)
for (class in 1:n.class)
dist[,class] <- colSums((t(data) - centroids[class,])^2)## nasty piece of code that I can explain
## Assign labels to each individuals
cluster <- 1#### To be filled
## Centroids update
centroids <- 1## To be filled
if (a_stopping_rule_to_define)
break
}
## Compute within and between inertia
within <- 1## To be filled
between <- 1## To be filled
if (iter == iter.max)
warning("Maximal number of iterations has been reached :-(")
return(list(cluster = cluster, within = within, between = between, iter = iter, centroids = centroids))
}
Application: Italian wines
The data can be retrieved from here.
Start with a descriptive analysis of the data and, next, perform a
K–means classification. Comment results. Perform predictions on the
following new wines
new.wines <- rbind(wine1 = c(12.37, 1.17, 1.92, 19.6, 78, 2.11, 2.00, 0.27, 1.04, 4.68, 1.12, 3.48, 510),
wine2 = c(13.45, 3.70, 2.60, 23.0, 111, 1.70, 0.92, 0.43, 1.46, 10.68, 0.85, 1.56, 695))
Good luck !
LS0tCnRpdGxlOiAiVmlucyBpdGFsaWVucyBldCBLLW1lYW5zIgphdXRob3I6IE1hdGhpZXUgUmliYXRldApvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgojIyBJbnRyb2R1Y3Rpb24KCkluIHRoaXMgbGFiIHdlIHdpbGwgdHJ5IHRvIGNsYXNzaWZ5IGl0YWxpYW4gd2luZXMgZnJvbSBzZXZlcmFsIHdpbmUgbWFrZXJzLiBUaGUgZGF0YXNldCBoYXMgMTc4IHdpbmVzIGZvciB3aGljaCAxMyBxdWFudGl0YXRpdmUgdmFyaWFibGVzIGhhdmUgYmVlbiByZWNvcmRlZC4gWW91IHdpbGwgcGVyZm9ybSBhIEstLW1lYW5zIGNsYXNzaWZpY2F0aW9uIGZvciB0aG9zZSB3aW5lcy4gSG93ZXZlciB5b3UnbGwgaGF2ZSB0byB3YWl0IGEgYml0IGFuZCByZS13b3JrIG9uIHRoZSBGaXNoZXIncyBpcmlzIGRhdGEgc2V0IHRvIHNlZSBpZiB5b3UgdW5kZXJzdGFuZCBhbmQgYXJlIGFibGUgdG8gcmVwcm9kdWNlIHRoZSByZXN1bHRzIHdlIGRpc2N1c3NlZCBkdXJpbmcgdGhlIGxlY3R1cmUuCgojIyBSZS1hbmFseXppbmcgdGhlIEZpc2hlcidzIGlyaXMgZGF0YXNldAoKVGhlIGRhdGEgc2V0IGlzIGFscmVhZHkgZW1iYmVkZWQgaW50byB0aGUgUiBzb2Z0d2FyZS4gVG8gcmV0cmlldmUgdGhlbSwganVzdCBpbnZva2UKYGBge3J9CmRhdGEoaXJpcykKaGVhZChpcmlzKQpgYGAKClJlY2FsbCB0aGF0IEstLW1lYW5zIGNsYXNzaWZpY2F0aW9uIGlzIHVuc3VwZXJ2aXNlZCBzbyB5b3Ugc2hvdWxkIG5vdCB1c2UgdGhlIGxhc3QgdmFyaWFibGUtLS13aGljaCBpcyBldmVuIG5vdCBxdWFudGl0YXRpdmUgc28sIHVubGVzcyB5b3UgZG8gbm90IHVuZGVyc3RhbmQgYXQgYWxsIHRoZSBLLS1tZWFucywgeW91IHdvdWxkIG5vdCBoYXZlIHVzZWQgaXQsIHdvdWxkbid0IHlvdT8KCjEuIFN0YXJ0IHdpdGggYSAoc2hvcnQpIGRlc2NyaXB0aXZlIGFuYWx5c2lzCjIuIFJlYWQgdGhlIGRvY3VtZW50YXRpb24gb2YgdGhlICprbWVhbnMqIGZ1bmN0aW9uLiBQbGVhc2UgcGF5IHBhcnRpY3VsYXIgYXR0ZW50aW9uIHRvIHRoZSBhcmd1bWVudHMgKngqLCAqY2VudGVycyosICppdGVyLm1heCogZXQgKm5zdGFydCouCjMuIFBlcmZvcm0gYSBLLS1tZWFucyBjbGFzc2lmaWNhdGlvbgo0LiBJbnRlcnByZXQgcmVzdWx0cwo1LiBWYXJ5IHRoZSBudW1iZXIgb2YgY2xhc3NlcyBhbmQgZ2V0IHRoZSAib3B0aW1hbCIgbnVtYmVyIG9mIGNsYXNzZXMgKGlmIHlvdSdyZSBsb3N0IGVsYm93IHJ1bGUpCjYuIENyZWF0ZSBzb21lIGZpY3Rpb3VzIGZsb3dlcnMgYW5kIHByZWRpY3QgaXRzIHNwZWNpZXMuIChFdmVudHVhbGx5IGNyZWF0ZSBhIGZ1bmN0aW9uIGRlZGljYXRlZCB0byB0aGF0IHB1cnBvc2UpCjYuIFNpbmNlIHdlIGRvIGhhdmUgYWNjZXNzIHRvIHRoZSAqU3BlY2llcyogZm9yIGVhY2ggZmxvd2VyICg1dGggY29sdW1uKSwgY29tcHV0ZSB0aGUgY29uZnVzaW9uIG1hdHJpeC4gVGhlICp0YWJsZSogZnVuY3Rpb24gbWlnaHQgaGVscC4KNy4gT3B0aW9uYWwuIEkgd3JvdGUgYW4gaW5jb21wbGV0ZSBwaWVjZSBvZiBjb2RlIHdoaWNoIGlzIGEgKHJlKWltcGxlbWVudGF0aW9uIG9mIHRoZSBMTG95ZCBhbG9nb3JpdGhtLiBGaWxsIGluIGdhcHMuCmBgYHtyfQpteWttZWFucyA8LSBmdW5jdGlvbihkYXRhLCBuLmNsYXNzLCBpdGVyLm1heCA9IDUwKXsKICAjIyBkYXRhIDogbWF0cml4IG9mIHRoZSBkYXRhCiAgIyMgbi5jbGFzcyA6IG51bWJlciBvZiBjbGFzc2VzCiAgIyMgaXRlci5tYXggOiBtYXhpbWFsIG51bWJlciBvZiBpdGVyYXRpb25zCiAgCiAgbi5vYnMgPC0gbnJvdyhkYXRhKQogIGRhdGEgPC0gYXMubWF0cml4KGRhdGEpIyMgY29udmVydCBhIGRhdGEgZnJhbWUgdG8gbWF0cml4IChpZiBuZWVkZWQpCiAgCiAgIyMgQ2VudHJvaWQgaW5pdGlhbGl6YXRpb24gKHRha2VuIHJhbmRvbWx5IHdpdGhvdXQgcmVwbGFjZW1lbnQgaW50byBpbmRpdmlkdWFscykKICBjZW50cm9pZHMgPC0gZGF0YVtzYW1wbGUobi5vYnMsIG4uY2xhc3MpLF0KICAKICBmb3IgKGkgaW4gMTppdGVyLm1heCl7CiAgICAjIyBDb21wdXRlIHRoZSBkaXN0YW5jZSBmb3IgZWFjaCBpbmRpdmlkdWFscyB0byBlYWNoIGNlbnRyb2lkCiAgICBkaXN0IDwtIG1hdHJpeChOQSwgbi5vYnMsIG4uY2xhc3MpCiAgICBmb3IgKGNsYXNzIGluIDE6bi5jbGFzcykKICAgICAgCiAgICAgIGRpc3RbLGNsYXNzXSA8LSBjb2xTdW1zKCh0KGRhdGEpIC0gY2VudHJvaWRzW2NsYXNzLF0pXjIpIyMgbmFzdHkgcGllY2Ugb2YgY29kZSB0aGF0IEkgY2FuIGV4cGxhaW4KICAgICAgCiAgICAjIyBBc3NpZ24gbGFiZWxzIHRvIGVhY2ggaW5kaXZpZHVhbHMKICAgIGNsdXN0ZXIgPC0gMSMjIyMgVG8gYmUgZmlsbGVkCiAgICAgIAogICAgIyMgQ2VudHJvaWRzIHVwZGF0ZQogICAgY2VudHJvaWRzIDwtIDEjIyBUbyBiZSBmaWxsZWQKICAgIAogICAgaWYgKGFfc3RvcHBpbmdfcnVsZV90b19kZWZpbmUpCiAgICAgIGJyZWFrCiAgfQogIAogICMjIENvbXB1dGUgd2l0aGluIGFuZCBiZXR3ZWVuIGluZXJ0aWEKICB3aXRoaW4gPC0gMSMjIFRvIGJlIGZpbGxlZAogIGJldHdlZW4gPC0gMSMjIFRvIGJlIGZpbGxlZAogIAogIGlmIChpdGVyID09IGl0ZXIubWF4KQogICAgd2FybmluZygiTWF4aW1hbCBudW1iZXIgb2YgaXRlcmF0aW9ucyBoYXMgYmVlbiByZWFjaGVkIDotKCIpCiAgCiAgcmV0dXJuKGxpc3QoY2x1c3RlciA9IGNsdXN0ZXIsIHdpdGhpbiA9IHdpdGhpbiwgYmV0d2VlbiA9IGJldHdlZW4sIGl0ZXIgPSBpdGVyLCBjZW50cm9pZHMgPSBjZW50cm9pZHMpKQp9CmBgYAoKIyMgQXBwbGljYXRpb246IEl0YWxpYW4gd2luZXMKClRoZSBkYXRhIGNhbiBiZSByZXRyaWV2ZWQgZnJvbSBbaGVyZV0oaHR0cHM6Ly9tcmliYXRldC5wZXJzby5tYXRoLmNucnMuZnIvQ2VudHJhbGVOYW50ZXMvQmlvU1RJQy9JdGFsaWFuV2luZS5SRGF0YSkuIFN0YXJ0IHdpdGggYSBkZXNjcmlwdGl2ZSBhbmFseXNpcyBvZiB0aGUgZGF0YSBhbmQsIG5leHQsIHBlcmZvcm0gYSBLLS1tZWFucyBjbGFzc2lmaWNhdGlvbi4gQ29tbWVudCByZXN1bHRzLiBQZXJmb3JtIHByZWRpY3Rpb25zIG9uIHRoZSBmb2xsb3dpbmcgbmV3IHdpbmVzCmBgYHtyfQpuZXcud2luZXMgPC0gcmJpbmQod2luZTEgPSBjKDEyLjM3LCAxLjE3LCAxLjkyLCAxOS42LCA3OCwgMi4xMSwgMi4wMCwgMC4yNywgMS4wNCwgNC42OCwgMS4xMiwgMy40OCwgNTEwKSwKICAgICAgICAgICAgICAgICAgd2luZTIgPSBjKDEzLjQ1LCAzLjcwLCAyLjYwLCAyMy4wLCAxMTEsIDEuNzAsIDAuOTIsIDAuNDMsIDEuNDYsIDEwLjY4LCAwLjg1LCAxLjU2LCA2OTUpKQpgYGAKCkdvb2QgbHVjayAhCgoKCgoKCgo=