Introduction

In this lab we will try to classify italian wines from several wine makers. The dataset has 178 wines for which 13 quantitative variables have been recorded. You will perform a K–means classification for those wines. However you’ll have to wait a bit and re-work on the Fisher’s iris data set to see if you understand and are able to reproduce the results we discussed during the lecture.

Re-analyzing the Fisher’s iris dataset

The data set is already embbeded into the R software. To retrieve them, just invoke

data(iris)
head(iris)

Recall that K–means classification is unsupervised so you should not use the last variable—which is even not quantitative so, unless you do not understand at all the K–means, you would not have used it, wouldn’t you?

  1. Start with a (short) descriptive analysis
  2. Read the documentation of the kmeans function. Please pay particular attention to the arguments x, centers, iter.max et nstart.
  3. Perform a K–means classification
  4. Interpret results
  5. Vary the number of classes and get the “optimal” number of classes (if you’re lost elbow rule)
  6. Create some fictious flowers and predict its species. (Eventually create a function dedicated to that purpose)
  7. Since we do have access to the Species for each flower (5th column), compute the confusion matrix. The table function might help.
  8. Optional. I wrote an incomplete piece of code which is a (re)implementation of the LLoyd alogorithm. Fill in gaps.
mykmeans <- function(data, n.class, iter.max = 50){
  ## data : matrix of the data
  ## n.class : number of classes
  ## iter.max : maximal number of iterations
  
  n.obs <- nrow(data)
  data <- as.matrix(data)## convert a data frame to matrix (if needed)
  
  ## Centroid initialization (taken randomly without replacement into individuals)
  centroids <- data[sample(n.obs, n.class),]
  
  for (i in 1:iter.max){
    ## Compute the distance for each individuals to each centroid
    dist <- matrix(NA, n.obs, n.class)
    for (class in 1:n.class)
      
      dist[,class] <- colSums((t(data) - centroids[class,])^2)## nasty piece of code that I can explain
      
    ## Assign labels to each individuals
    cluster <- 1#### To be filled
      
    ## Centroids update
    centroids <- 1## To be filled
    
    if (a_stopping_rule_to_define)
      break
  }
  
  ## Compute within and between inertia
  within <- 1## To be filled
  between <- 1## To be filled
  
  if (iter == iter.max)
    warning("Maximal number of iterations has been reached :-(")
  
  return(list(cluster = cluster, within = within, between = between, iter = iter, centroids = centroids))
}

Application: Italian wines

The data can be retrieved from here. Start with a descriptive analysis of the data and, next, perform a K–means classification. Comment results. Perform predictions on the following new wines

new.wines <- rbind(wine1 = c(12.37, 1.17, 1.92, 19.6, 78, 2.11, 2.00, 0.27, 1.04, 4.68, 1.12, 3.48, 510),
                  wine2 = c(13.45, 3.70, 2.60, 23.0, 111, 1.70, 0.92, 0.43, 1.46, 10.68, 0.85, 1.56, 695))

Good luck !

LS0tCnRpdGxlOiAiVmlucyBpdGFsaWVucyBldCBLLW1lYW5zIgphdXRob3I6IE1hdGhpZXUgUmliYXRldApvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgojIyBJbnRyb2R1Y3Rpb24KCkluIHRoaXMgbGFiIHdlIHdpbGwgdHJ5IHRvIGNsYXNzaWZ5IGl0YWxpYW4gd2luZXMgZnJvbSBzZXZlcmFsIHdpbmUgbWFrZXJzLiBUaGUgZGF0YXNldCBoYXMgMTc4IHdpbmVzIGZvciB3aGljaCAxMyBxdWFudGl0YXRpdmUgdmFyaWFibGVzIGhhdmUgYmVlbiByZWNvcmRlZC4gWW91IHdpbGwgcGVyZm9ybSBhIEstLW1lYW5zIGNsYXNzaWZpY2F0aW9uIGZvciB0aG9zZSB3aW5lcy4gSG93ZXZlciB5b3UnbGwgaGF2ZSB0byB3YWl0IGEgYml0IGFuZCByZS13b3JrIG9uIHRoZSBGaXNoZXIncyBpcmlzIGRhdGEgc2V0IHRvIHNlZSBpZiB5b3UgdW5kZXJzdGFuZCBhbmQgYXJlIGFibGUgdG8gcmVwcm9kdWNlIHRoZSByZXN1bHRzIHdlIGRpc2N1c3NlZCBkdXJpbmcgdGhlIGxlY3R1cmUuCgojIyBSZS1hbmFseXppbmcgdGhlIEZpc2hlcidzIGlyaXMgZGF0YXNldAoKVGhlIGRhdGEgc2V0IGlzIGFscmVhZHkgZW1iYmVkZWQgaW50byB0aGUgUiBzb2Z0d2FyZS4gVG8gcmV0cmlldmUgdGhlbSwganVzdCBpbnZva2UKYGBge3J9CmRhdGEoaXJpcykKaGVhZChpcmlzKQpgYGAKClJlY2FsbCB0aGF0IEstLW1lYW5zIGNsYXNzaWZpY2F0aW9uIGlzIHVuc3VwZXJ2aXNlZCBzbyB5b3Ugc2hvdWxkIG5vdCB1c2UgdGhlIGxhc3QgdmFyaWFibGUtLS13aGljaCBpcyBldmVuIG5vdCBxdWFudGl0YXRpdmUgc28sIHVubGVzcyB5b3UgZG8gbm90IHVuZGVyc3RhbmQgYXQgYWxsIHRoZSBLLS1tZWFucywgeW91IHdvdWxkIG5vdCBoYXZlIHVzZWQgaXQsIHdvdWxkbid0IHlvdT8KCjEuIFN0YXJ0IHdpdGggYSAoc2hvcnQpIGRlc2NyaXB0aXZlIGFuYWx5c2lzCjIuIFJlYWQgdGhlIGRvY3VtZW50YXRpb24gb2YgdGhlICprbWVhbnMqIGZ1bmN0aW9uLiBQbGVhc2UgcGF5IHBhcnRpY3VsYXIgYXR0ZW50aW9uIHRvIHRoZSBhcmd1bWVudHMgKngqLCAqY2VudGVycyosICppdGVyLm1heCogZXQgKm5zdGFydCouCjMuIFBlcmZvcm0gYSBLLS1tZWFucyBjbGFzc2lmaWNhdGlvbgo0LiBJbnRlcnByZXQgcmVzdWx0cwo1LiBWYXJ5IHRoZSBudW1iZXIgb2YgY2xhc3NlcyBhbmQgZ2V0IHRoZSAib3B0aW1hbCIgbnVtYmVyIG9mIGNsYXNzZXMgKGlmIHlvdSdyZSBsb3N0IGVsYm93IHJ1bGUpCjYuIENyZWF0ZSBzb21lIGZpY3Rpb3VzIGZsb3dlcnMgYW5kIHByZWRpY3QgaXRzIHNwZWNpZXMuIChFdmVudHVhbGx5IGNyZWF0ZSBhIGZ1bmN0aW9uIGRlZGljYXRlZCB0byB0aGF0IHB1cnBvc2UpCjYuIFNpbmNlIHdlIGRvIGhhdmUgYWNjZXNzIHRvIHRoZSAqU3BlY2llcyogZm9yIGVhY2ggZmxvd2VyICg1dGggY29sdW1uKSwgY29tcHV0ZSB0aGUgY29uZnVzaW9uIG1hdHJpeC4gVGhlICp0YWJsZSogZnVuY3Rpb24gbWlnaHQgaGVscC4KNy4gT3B0aW9uYWwuIEkgd3JvdGUgYW4gaW5jb21wbGV0ZSBwaWVjZSBvZiBjb2RlIHdoaWNoIGlzIGEgKHJlKWltcGxlbWVudGF0aW9uIG9mIHRoZSBMTG95ZCBhbG9nb3JpdGhtLiBGaWxsIGluIGdhcHMuCmBgYHtyfQpteWttZWFucyA8LSBmdW5jdGlvbihkYXRhLCBuLmNsYXNzLCBpdGVyLm1heCA9IDUwKXsKICAjIyBkYXRhIDogbWF0cml4IG9mIHRoZSBkYXRhCiAgIyMgbi5jbGFzcyA6IG51bWJlciBvZiBjbGFzc2VzCiAgIyMgaXRlci5tYXggOiBtYXhpbWFsIG51bWJlciBvZiBpdGVyYXRpb25zCiAgCiAgbi5vYnMgPC0gbnJvdyhkYXRhKQogIGRhdGEgPC0gYXMubWF0cml4KGRhdGEpIyMgY29udmVydCBhIGRhdGEgZnJhbWUgdG8gbWF0cml4IChpZiBuZWVkZWQpCiAgCiAgIyMgQ2VudHJvaWQgaW5pdGlhbGl6YXRpb24gKHRha2VuIHJhbmRvbWx5IHdpdGhvdXQgcmVwbGFjZW1lbnQgaW50byBpbmRpdmlkdWFscykKICBjZW50cm9pZHMgPC0gZGF0YVtzYW1wbGUobi5vYnMsIG4uY2xhc3MpLF0KICAKICBmb3IgKGkgaW4gMTppdGVyLm1heCl7CiAgICAjIyBDb21wdXRlIHRoZSBkaXN0YW5jZSBmb3IgZWFjaCBpbmRpdmlkdWFscyB0byBlYWNoIGNlbnRyb2lkCiAgICBkaXN0IDwtIG1hdHJpeChOQSwgbi5vYnMsIG4uY2xhc3MpCiAgICBmb3IgKGNsYXNzIGluIDE6bi5jbGFzcykKICAgICAgCiAgICAgIGRpc3RbLGNsYXNzXSA8LSBjb2xTdW1zKCh0KGRhdGEpIC0gY2VudHJvaWRzW2NsYXNzLF0pXjIpIyMgbmFzdHkgcGllY2Ugb2YgY29kZSB0aGF0IEkgY2FuIGV4cGxhaW4KICAgICAgCiAgICAjIyBBc3NpZ24gbGFiZWxzIHRvIGVhY2ggaW5kaXZpZHVhbHMKICAgIGNsdXN0ZXIgPC0gMSMjIyMgVG8gYmUgZmlsbGVkCiAgICAgIAogICAgIyMgQ2VudHJvaWRzIHVwZGF0ZQogICAgY2VudHJvaWRzIDwtIDEjIyBUbyBiZSBmaWxsZWQKICAgIAogICAgaWYgKGFfc3RvcHBpbmdfcnVsZV90b19kZWZpbmUpCiAgICAgIGJyZWFrCiAgfQogIAogICMjIENvbXB1dGUgd2l0aGluIGFuZCBiZXR3ZWVuIGluZXJ0aWEKICB3aXRoaW4gPC0gMSMjIFRvIGJlIGZpbGxlZAogIGJldHdlZW4gPC0gMSMjIFRvIGJlIGZpbGxlZAogIAogIGlmIChpdGVyID09IGl0ZXIubWF4KQogICAgd2FybmluZygiTWF4aW1hbCBudW1iZXIgb2YgaXRlcmF0aW9ucyBoYXMgYmVlbiByZWFjaGVkIDotKCIpCiAgCiAgcmV0dXJuKGxpc3QoY2x1c3RlciA9IGNsdXN0ZXIsIHdpdGhpbiA9IHdpdGhpbiwgYmV0d2VlbiA9IGJldHdlZW4sIGl0ZXIgPSBpdGVyLCBjZW50cm9pZHMgPSBjZW50cm9pZHMpKQp9CmBgYAoKIyMgQXBwbGljYXRpb246IEl0YWxpYW4gd2luZXMKClRoZSBkYXRhIGNhbiBiZSByZXRyaWV2ZWQgZnJvbSBbaGVyZV0oaHR0cHM6Ly9tcmliYXRldC5wZXJzby5tYXRoLmNucnMuZnIvQ2VudHJhbGVOYW50ZXMvQmlvU1RJQy9JdGFsaWFuV2luZS5SRGF0YSkuIFN0YXJ0IHdpdGggYSBkZXNjcmlwdGl2ZSBhbmFseXNpcyBvZiB0aGUgZGF0YSBhbmQsIG5leHQsIHBlcmZvcm0gYSBLLS1tZWFucyBjbGFzc2lmaWNhdGlvbi4gQ29tbWVudCByZXN1bHRzLiBQZXJmb3JtIHByZWRpY3Rpb25zIG9uIHRoZSBmb2xsb3dpbmcgbmV3IHdpbmVzCmBgYHtyfQpuZXcud2luZXMgPC0gcmJpbmQod2luZTEgPSBjKDEyLjM3LCAxLjE3LCAxLjkyLCAxOS42LCA3OCwgMi4xMSwgMi4wMCwgMC4yNywgMS4wNCwgNC42OCwgMS4xMiwgMy40OCwgNTEwKSwKICAgICAgICAgICAgICAgICAgd2luZTIgPSBjKDEzLjQ1LCAzLjcwLCAyLjYwLCAyMy4wLCAxMTEsIDEuNzAsIDAuOTIsIDAuNDMsIDEuNDYsIDEwLjY4LCAwLjg1LCAxLjU2LCA2OTUpKQpgYGAKCkdvb2QgbHVjayAhCgoKCgoKCgo=